Alerting · signals worth waking someone for
Alerts are asks for action, not status updates.
Most teams confuse alerts with logs. Logs tell you what happened. Alerts tell you who needs to do something right now. If an alert goes off and nobody knows what to do, it's just noise. If it goes off and the answer is always "click the button to dismiss it", that's worse than noise: you're teaching people to ignore alerts altogether.
Four alert tiers
Tier 1 · INFO
Goes to a dashboard or log file. Nobody gets paged. Useful for spotting trends over time. Things like "workflow finished", "agent started", "tool called".
Tier 2 · WARN
Pings the on-call channel during work hours. Things like "agent retried 3 times", "response slower than usual", "token budget 80% used". Action: look into it when you have time.
Tier 3 · HIGH
Pages the on-call engineer within minutes, even at 3am. Things like "failure rate just spiked", "more than 5% of requests are getting blocked", "fraud detector is climbing fast". Action: look now.
Tier 4 · CRITICAL
Pages on-call and their manager, and automatically pauses the affected workflow. Things like "personal data leaked", "cost spiraling out of control", "an agent might be compromised". Action: stop the bleeding first, figure out why later.
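One way to keep these tiers from drifting is to write the routing policy down as data rather than prose, so the dispatch code and the documentation can't disagree. A minimal sketch; the destination names ("dashboard", "#agent-alerts", "oncall") are illustrative placeholders, not any particular paging or chat API:

# Tier policy as a routing table. Destinations are placeholders; the
# dispatch code later in this section shows the same policy as logic.
TIER_POLICY = {
    "INFO":     {"page": [],                    "notify": ["dashboard"],     "pause_workflow": False},
    "WARN":     {"page": [],                    "notify": ["#agent-alerts"], "pause_workflow": False},
    "HIGH":     {"page": ["oncall"],            "notify": ["#agent-alerts"], "pause_workflow": False},
    "CRITICAL": {"page": ["oncall", "manager"], "notify": ["#incidents"],    "pause_workflow": True},
}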
What kinds of things to alert on
- Behavior alerts. "Agent ran more iterations than allowed", "agent tried to call a tool it isn't allowed to", "agent's output didn't match the expected format". Easy to write, easy to test.
- Statistical alerts. "Failure rate is way above the usual baseline", "response time doubled", "cost per task tripled". You need a baseline first; collect numbers from your first 1,000 production runs to set one.
- Meaning-based alerts. "Output looks similar to known-bad responses", "agent is repeating itself", "agent has wandered off the goal". Needs embedding-based comparisons; a rough sketch follows this list.
- Combination alerts. "Two specific things happened together", e.g. "the fraud detector says risky AND it's a high-value transaction". Looking at conditions one at a time misses these; combining them catches them.
- Missing event alerts. "An event that should happen, didn't". For example, a workflow that normally emits a "done" message hasn't in 10 minutes. These are easy to overlook and often the most useful kind.
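The routing code below covers the behavioral, statistical, combination, and missing-event kinds; meaning-based alerts are the one it skips, so here is a rough sketch. It assumes you already have an embed() text-to-vector function from whatever embedding model you run, plus a small set of precomputed vectors for known-bad outputs; the 0.85 threshold is a starting point, not a tuned value.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_alert(output_text: str, bad_embeddings: list[np.ndarray],
                   embed, threshold: float = 0.85) -> bool:
    """Return True if the output is suspiciously close to a known-bad response.

    `embed` is assumed to be a text -> vector function from your embedding
    model; `bad_embeddings` are precomputed vectors for responses you never
    want to see again.
    """
    vec = embed(output_text)
    return any(cosine(vec, bad) >= threshold for bad in bad_embeddings)

Tune the threshold against labeled examples before letting this page anyone above WARN.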
A typical day's alert stream
[Figure: a realistic alert stream from production · the mix of routine, suspicious, and serious alerts you'd see across a 30-minute window in a busy multi-agent system.]
Alert routing logic in code
from dataclasses import dataclass
from enum import Enum
import time

class Tier(Enum):
    INFO = 1
    WARN = 2
    HIGH = 3
    CRITICAL = 4

@dataclass
class Alert:
    tier: Tier
    rule: str
    workflow_id: str
    agent: str
    details: dict
    auto_action: str | None  # 'halt' | 'pause' | 'rollback' | None

class AlertEngine:
    def __init__(self, baselines: dict, dedup_window_s: int = 300):
        self.baselines = baselines    # per-agent latency stats: mean and std
        self.dedup = {}               # (rule, workflow_id) -> last fired timestamp
        self.dedup_window = dedup_window_s

    def evaluate(self, event: dict) -> Alert | None:
        # Compositional rule: high-value AND high fraud score
        if (event.get("value", 0) > 5000
                and event.get("fraud_score", 0) > 0.6):
            return self._fire(Alert(
                tier=Tier.CRITICAL,
                rule="high_value_high_fraud",
                workflow_id=event["wf"],
                agent=event["agent"],
                details=event,
                auto_action="halt",
            ))

        # Behavioral: tool not in allow-list
        if event.get("event") == "tool_denied":
            return self._fire(Alert(Tier.HIGH, "tool_allow_violation",
                                    event["wf"], event["agent"], event, None))

        # Statistical: latency above baseline + 3 sigma
        latency = event.get("latency_ms")
        baseline = self.baselines.get(event.get("agent"), {})
        if latency and baseline:
            threshold = baseline["mean"] + 3 * baseline["std"]
            if latency > threshold:
                return self._fire(Alert(Tier.WARN, "latency_anomaly",
                                        event["wf"], event["agent"], event, None))

        # Absence: expected heartbeat missing
        if event.get("event") == "heartbeat_missing":
            age = event.get("age_s", 0)
            tier = Tier.HIGH if age > 600 else Tier.WARN
            return self._fire(Alert(tier, "heartbeat_absent",
                                    event["wf"], event["agent"], event, None))

        return None

    def _fire(self, alert: Alert) -> Alert | None:
        # Dedup: suppress same rule for same workflow within window
        key = (alert.rule, alert.workflow_id)
        now = time.time()
        if key in self.dedup and now - self.dedup[key] < self.dedup_window:
            return None
        self.dedup[key] = now
        self._dispatch(alert)
        return alert

    def _dispatch(self, alert: Alert):
        # page_oncall / page_manager / slack / halt_workflow / log are
        # your paging, chat, and workflow hooks, defined elsewhere.
        if alert.tier == Tier.CRITICAL:
            page_oncall(alert); page_manager(alert); slack("#incidents", alert)
            if alert.auto_action == "halt":
                halt_workflow(alert.workflow_id)
        elif alert.tier == Tier.HIGH:
            page_oncall(alert); slack("#agent-alerts", alert)
        elif alert.tier == Tier.WARN:
            slack("#agent-alerts", alert)
        else:
            log(alert)
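A short usage sketch for the engine above. The notification hooks (page_oncall, slack, and friends) aren't defined in the snippet, so they're stubbed here with prints; the baselines are per-agent mean and standard deviation of latency, computed from a handful of made-up historical runs.

from statistics import mean, stdev

# Stub the hooks _dispatch() expects so the sketch runs standalone;
# in production these would call your paging and chat integrations.
def page_oncall(alert): print("PAGE on-call:", alert.rule)
def page_manager(alert): print("PAGE manager:", alert.rule)
def slack(channel, alert): print(f"SLACK {channel}: {alert.rule}")
def halt_workflow(workflow_id): print("HALT workflow:", workflow_id)
def log(alert): print("LOG:", alert.rule)

# Per-agent latency history (illustrative numbers); in practice, pull this
# from your first thousand or so production runs.
history = {"triage_agent": [220, 240, 310, 205, 280]}
baselines = {agent: {"mean": mean(xs), "std": stdev(xs)}
             for agent, xs in history.items()}

engine = AlertEngine(baselines)

alert = engine.evaluate({
    "event": "task_completed",
    "wf": "wf-4821",
    "agent": "triage_agent",
    "latency_ms": 1900,     # well above baseline + 3 sigma
    "value": 120,
    "fraud_score": 0.1,
})
if alert:
    print(alert.tier, alert.rule)   # Tier.WARN latency_anomaly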
Common alerting mistakes to avoid
- No deduplication. One broken workflow can fire 200 identical alerts and bury the next real issue under noise.
- Alerting on fixed numbers when the baseline shifts. "Alert when latency over 500ms" stops working when 500ms is the new normal. Alert on changes from baseline instead.
- No automatic response on critical alerts. By the time a human acknowledges the page, the damage is done. Critical alerts should also pause the affected workflow automatically.
- Alerts that don't tell you what to do. "Agent X had an error" with no context is useless. Every alert should link to a runbook or include a clear next step.
- Alerting every time a guardrail blocks something. Guardrails blocking is normal; that's their job. Alert when the rate of blocks suddenly changes, not on each block; a sketch of that check follows this list.
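For that last point, one concrete shape is a rolling-window rate check: count blocks over the last few minutes and compare against the usual rate. A minimal sketch using only the standard library; the five-minute window and the 3x multiplier are illustrative starting points, not tuned values.

from collections import deque
import time

class BlockRateMonitor:
    """Flag when guardrail blocks per minute jump well above the usual rate."""

    def __init__(self, baseline_blocks_per_min: float,
                 window_s: int = 300, multiplier: float = 3.0):
        self.baseline = baseline_blocks_per_min
        self.window_s = window_s
        self.multiplier = multiplier
        self.block_times = deque()

    def record_block(self, now: float | None = None) -> bool:
        """Record one guardrail block; return True if the rate is anomalous."""
        if now is None:
            now = time.time()
        self.block_times.append(now)
        # Drop blocks that have aged out of the window.
        while self.block_times and now - self.block_times[0] > self.window_s:
            self.block_times.popleft()
        blocks_per_min = len(self.block_times) / (self.window_s / 60)
        return blocks_per_min > self.baseline * self.multiplier

When record_block() returns True, fire a WARN or HIGH alert through the engine above instead of paging on each individual block.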
A useful alert tells you three things: what's wrong, what it affects, and what to do next. Drop any of these three and what's left is noise.