Alerting · signals worth waking someone for
Alerts are asks for action, not status updates.
Most teams confuse alerts with logs. Logs tell you what happened. Alerts tell you who needs to do something right now. If an alert goes off and nobody knows what to do, it's just noise. If it goes off and the answer is always "click the button to dismiss it", that's worse than noise: you're teaching people to ignore alerts altogether.
Four alert tiers
Tier 1 · INFO
Goes to a dashboard or log file. Nobody gets paged. Useful for spotting trends over time. Things like "workflow finished", "agent started", "tool called".
Tier 2 · WARN
Pings the on-call channel during work hours. Things like "agent retried 3 times", "response slower than usual", "token budget 80% used". Action: look into it when you have time.
Tier 3 · HIGH
Pages the on-call engineer within minutes, even at 3am. Things like "failure rate just spiked", "more than 5% of requests are getting blocked", "fraud detector is climbing fast". Action: look now.
Tier 4 · CRITICAL
Pages on-call and their manager, and automatically pauses the affected workflow. Things like "personal data leaked", "cost spiraling out of control", "an agent might be compromised". Action: stop the bleeding first, figure out why later.
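One way to keep these tiers from drifting is to write the routing policy down as data rather than prose, so the dispatch code and the documentation can't disagree. A minimal sketch; the destination names ("dashboard", "#agent-alerts", "oncall") are illustrative placeholders, not any particular paging or chat API:

# Tier policy as a routing table. Destinations are placeholders; the
# dispatch code later in this section shows the same policy as logic.
TIER_POLICY = {
    "INFO":     {"page": [],                    "notify": ["dashboard"],     "pause_workflow": False},
    "WARN":     {"page": [],                    "notify": ["#agent-alerts"], "pause_workflow": False},
    "HIGH":     {"page": ["oncall"],            "notify": ["#agent-alerts"], "pause_workflow": False},
    "CRITICAL": {"page": ["oncall", "manager"], "notify": ["#incidents"],    "pause_workflow": True},
}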
What kinds of things to alert on
- Behavior alerts. "Agent ran more iterations than allowed", "agent tried to call a tool it isn't allowed to", "agent's output didn't match the expected format". Easy to write, easy to test.
- Statistical alerts. "Failure rate is way above the usual baseline", "response time doubled", "cost per task tripled". You need a baseline first; collect numbers from your first 1,000 production runs to set one.
- Meaning-based alerts. "Output looks similar to known-bad responses", "agent is repeating itself", "agent has wandered off the goal". Needs embedding-based comparisons; a rough sketch follows this list.
- Combination alerts. "Two specific things happened together", e.g. "the fraud detector says risky AND it's a high-value transaction". Looking at conditions one at a time misses these; combining them catches them.
- Missing event alerts. "An event that should happen, didn't". For example, a workflow that normally emits a "done" message hasn't in 10 minutes. These are easy to overlook and often the most useful kind.
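The routing code below covers the behavioral, statistical, combination, and missing-event kinds; meaning-based alerts are the one it skips, so here is a rough sketch. It assumes you already have an embed() text-to-vector function from whatever embedding model you run, plus a small set of precomputed vectors for known-bad outputs; the 0.85 threshold is a starting point, not a tuned value.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_alert(output_text: str, bad_embeddings: list[np.ndarray],
                   embed, threshold: float = 0.85) -> bool:
    """Return True if the output is suspiciously close to a known-bad response.

    `embed` is assumed to be a text -> vector function from your embedding
    model; `bad_embeddings` are precomputed vectors for responses you never
    want to see again.
    """
    vec = embed(output_text)
    return any(cosine(vec, bad) >= threshold for bad in bad_embeddings)

Tune the threshold against labeled examples before letting this page anyone above WARN.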
A typical day's alert stream
[Figure: a realistic alert stream from production · the mix of routine, suspicious, and serious alerts you'd see across a 30-minute window in a busy multi-agent system.]
Alert routing logic in code
from dataclasses import dataclass
from enum import Enum
import time

class Tier(Enum):
    INFO = 1
    WARN = 2
    HIGH = 3
    CRITICAL = 4

@dataclass
class Alert:
    tier: Tier
    rule: str
    workflow_id: str
    agent: str
    details: dict
    auto_action: str | None  # 'halt' | 'pause' | 'rollback' | None

class AlertEngine:
    def __init__(self, baselines: dict, dedup_window_s: int = 300):
        self.baselines = baselines    # per-agent latency stats: mean and std
        self.dedup = {}               # (rule, workflow_id) -> last fired timestamp
        self.dedup_window = dedup_window_s

    def evaluate(self, event: dict) -> Alert | None:
        # Compositional rule: high-value AND high fraud score
        if (event.get("value", 0) > 5000
                and event.get("fraud_score", 0) > 0.6):
            return self._fire(Alert(
                tier=Tier.CRITICAL,
                rule="high_value_high_fraud",
                workflow_id=event["wf"],
                agent=event["agent"],
                details=event,
                auto_action="halt",
            ))

        # Behavioral: tool not in allow-list
        if event.get("event") == "tool_denied":
            return self._fire(Alert(Tier.HIGH, "tool_allow_violation",
                                    event["wf"], event["agent"], event, None))

        # Statistical: latency above baseline + 3 sigma
        latency = event.get("latency_ms")
        baseline = self.baselines.get(event.get("agent"), {})
        if latency and baseline:
            threshold = baseline["mean"] + 3 * baseline["std"]
            if latency > threshold:
                return self._fire(Alert(Tier.WARN, "latency_anomaly",
                                        event["wf"], event["agent"], event, None))

        # Absence: expected heartbeat missing
        if event.get("event") == "heartbeat_missing":
            age = event.get("age_s", 0)
            tier = Tier.HIGH if age > 600 else Tier.WARN
            return self._fire(Alert(tier, "heartbeat_absent",
                                    event["wf"], event["agent"], event, None))

        return None

    def _fire(self, alert: Alert) -> Alert | None:
        # Dedup: suppress same rule for same workflow within window
        key = (alert.rule, alert.workflow_id)
        now = time.time()
        if key in self.dedup and now - self.dedup[key] < self.dedup_window:
            return None
        self.dedup[key] = now
        self._dispatch(alert)
        return alert

    def _dispatch(self, alert: Alert):
        # page_oncall / page_manager / slack / halt_workflow / log are
        # your paging, chat, and workflow hooks, defined elsewhere.
        if alert.tier == Tier.CRITICAL:
            page_oncall(alert); page_manager(alert); slack("#incidents", alert)
            if alert.auto_action == "halt":
                halt_workflow(alert.workflow_id)
        elif alert.tier == Tier.HIGH:
            page_oncall(alert); slack("#agent-alerts", alert)
        elif alert.tier == Tier.WARN:
            slack("#agent-alerts", alert)
        else:
            log(alert)
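A short usage sketch for the engine above. The notification hooks (page_oncall, slack, and friends) aren't defined in the snippet, so they're stubbed here with prints; the baselines are per-agent mean and standard deviation of latency, computed from a handful of made-up historical runs.

from statistics import mean, stdev

# Stub the hooks _dispatch() expects so the sketch runs standalone;
# in production these would call your paging and chat integrations.
def page_oncall(alert): print("PAGE on-call:", alert.rule)
def page_manager(alert): print("PAGE manager:", alert.rule)
def slack(channel, alert): print(f"SLACK {channel}: {alert.rule}")
def halt_workflow(workflow_id): print("HALT workflow:", workflow_id)
def log(alert): print("LOG:", alert.rule)

# Per-agent latency history (illustrative numbers); in practice, pull this
# from your first thousand or so production runs.
history = {"triage_agent": [220, 240, 310, 205, 280]}
baselines = {agent: {"mean": mean(xs), "std": stdev(xs)}
             for agent, xs in history.items()}

engine = AlertEngine(baselines)

alert = engine.evaluate({
    "event": "task_completed",
    "wf": "wf-4821",
    "agent": "triage_agent",
    "latency_ms": 1900,     # well above baseline + 3 sigma
    "value": 120,
    "fraud_score": 0.1,
})
if alert:
    print(alert.tier, alert.rule)   # Tier.WARN latency_anomaly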
Common alerting mistakes to avoid
- No deduplication. One broken workflow can fire 200 identical alerts and bury the next real issue under noise.
- Alerting on fixed numbers when the baseline shifts. "Alert when latency over 500ms" stops working when 500ms is the new normal. Alert on changes from baseline instead.
- No automatic response on critical alerts. By the time a human acknowledges the page, the damage is done. Critical alerts should also pause the affected workflow automatically.
- Alerts that don't tell you what to do. "Agent X had an error" with no context is useless. Every alert should link to a runbook or include a clear next step.
- Alerting every time a guardrail blocks something. Guardrails blocking is normal; that's their job. Alert when the rate of blocks suddenly changes, not on each block; a sketch of that check follows this list.
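For that last point, one concrete shape is a rolling-window rate check: count blocks over the last few minutes and compare against the usual rate. A minimal sketch using only the standard library; the five-minute window and the 3x multiplier are illustrative starting points, not tuned values.

from collections import deque
import time

class BlockRateMonitor:
    """Flag when guardrail blocks per minute jump well above the usual rate."""

    def __init__(self, baseline_blocks_per_min: float,
                 window_s: int = 300, multiplier: float = 3.0):
        self.baseline = baseline_blocks_per_min
        self.window_s = window_s
        self.multiplier = multiplier
        self.block_times = deque()

    def record_block(self, now: float | None = None) -> bool:
        """Record one guardrail block; return True if the rate is anomalous."""
        if now is None:
            now = time.time()
        self.block_times.append(now)
        # Drop blocks that have aged out of the window.
        while self.block_times and now - self.block_times[0] > self.window_s:
            self.block_times.popleft()
        blocks_per_min = len(self.block_times) / (self.window_s / 60)
        return blocks_per_min > self.baseline * self.multiplier

When record_block() returns True, fire a WARN or HIGH alert through the engine above instead of paging on each individual block.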
A useful alert tells you three things: what's wrong, what it affects, and what to do next. Drop any of these three and what's left is noise.