16 Alerting · signals worth waking someone for

Alerts are asks for action, not status updates.

Most teams confuse alerts with logs. Logs tell you what happened. Alerts tell you who needs to do something right now. If an alert goes off and nobody knows what to do, it's just noise. If it goes off and the answer is always "click the button to dismiss it", that's worse than noise: you're teaching people to ignore alerts altogether.

Four alert tiers

Tier 1 · INFO
Goes to a dashboard or log file. Nobody gets paged. Useful for spotting trends over time. Things like "workflow finished", "agent started", "tool called".
Tier 2 · WARN
Pings the on-call channel during work hours. Things like "agent retried 3 times", "response slower than usual", "token budget 80% used". Action: look into it when you have time.
Tier 3 · HIGH
Pages the on-call engineer within minutes, even at 3am. Things like "failure rate just spiked", "more than 5% of requests are getting blocked", "fraud score is climbing fast". Action: look now.
Tier 4 · CRITICAL
Pages on-call and their manager, and automatically pauses the affected workflow. Things like "personal data leaked", "cost spiraling out of control", "an agent might be compromised". Action: stop the bleeding first, figure out why later.
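
The four tiers above can be written down as a small policy table, which keeps the routing rules in one place. This is a minimal sketch, not a definitive implementation: `Policy`, `POLICIES`, and `notify_now` are hypothetical names, and the work-hours window (09:00 to 17:00, Monday to Friday) is an assumption, since the text only says "during work hours".

```python
from dataclasses import dataclass
from datetime import datetime, time as dtime
from enum import Enum


class Tier(Enum):
    INFO = 1
    WARN = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Policy:
    page_oncall: bool        # wake the on-call engineer
    page_manager: bool       # also escalate to their manager
    auto_pause: bool         # pause the affected workflow automatically
    work_hours_only: bool    # hold notifications outside work hours


# One row per tier, mirroring the tier descriptions.
POLICIES = {
    Tier.INFO:     Policy(False, False, False, False),
    Tier.WARN:     Policy(False, False, False, True),
    Tier.HIGH:     Policy(True,  False, False, False),
    Tier.CRITICAL: Policy(True,  True,  True,  False),
}

# Assumption: work hours are 09:00-17:00, Monday to Friday.
WORK_START, WORK_END = dtime(9, 0), dtime(17, 0)


def notify_now(tier: Tier, now: datetime) -> bool:
    """Return True if a human should be notified immediately."""
    if tier is Tier.INFO:
        return False  # dashboard or log only, never interrupts anyone
    if not POLICIES[tier].work_hours_only:
        return True   # HIGH and CRITICAL notify at any hour
    is_weekday = now.weekday() < 5
    return is_weekday and WORK_START <= now.time() < WORK_END
```

With this table, a WARN raised at 3am on a Saturday waits, while a HIGH raised at the same moment does not.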

What kinds of things to alert on

A typical day's alert stream

A realistic alert stream from production
What you'd see across a 30-minute window in a busy multi-agent system. Mix of routine, suspicious, and serious.

Alert routing logic in code

from dataclasses import dataclass
from enum import Enum
import time

class Tier(Enum):
    INFO = 1
    WARN = 2
    HIGH = 3
    CRITICAL = 4

@dataclass
class Alert:
    tier: Tier
    rule: str
    workflow_id: str
    agent: str
    details: dict
    auto_action: str | None     # 'halt' | 'pause' | 'rollback' | None

class AlertEngine:
    def __init__(self, baselines: dict, dedup_window_s: int = 300):
        self.baselines = baselines           # per-agent latency stats: {"mean": ..., "std": ...}
        self.dedup = {}                       # suppress duplicates
        self.dedup_window = dedup_window_s

    def evaluate(self, event: dict) -> Alert | None:
        # Compositional rule: high-value AND high fraud score
        if (event.get("value", 0) > 5000
            and event.get("fraud_score", 0) > 0.6):
            return self._fire(Alert(
                tier=Tier.CRITICAL,
                rule="high_value_high_fraud",
                workflow_id=event["wf"],
                agent=event["agent"],
                details=event,
                auto_action="halt",
            ))

        # Behavioral: tool not in allow-list
        if event.get("event") == "tool_denied":
            return self._fire(Alert(Tier.HIGH, "tool_allow_violation",
                event["wf"], event["agent"], event, None))

        # Statistical: latency above baseline + 3σ
        latency = event.get("latency_ms")
        baseline = self.baselines.get(event.get("agent"), {})
        if latency and baseline:
            threshold = baseline["mean"] + 3 * baseline["std"]
            if latency > threshold:
                return self._fire(Alert(Tier.WARN, "latency_anomaly",
                    event["wf"], event["agent"], event, None))

        # Absence: expected heartbeat missing
        if event.get("event") == "heartbeat_missing":
            age = event.get("age_s", 0)
            tier = Tier.HIGH if age > 600 else Tier.WARN
            return self._fire(Alert(tier, "heartbeat_absent",
                event["wf"], event["agent"], event, None))

        return None

    def _fire(self, alert: Alert) -> Alert | None:
        # Dedup: suppress same rule for same workflow within window
        key = (alert.rule, alert.workflow_id)
        now = time.monotonic()               # monotonic: unaffected by wall-clock jumps
        if key in self.dedup and now - self.dedup[key] < self.dedup_window:
            return None
        self.dedup[key] = now
        self._dispatch(alert)
        return alert

    def _dispatch(self, alert: Alert):
        # page_oncall, page_manager, slack, halt_workflow and log are
        # integration hooks assumed to be defined elsewhere.
        if alert.tier == Tier.CRITICAL:
            page_oncall(alert)
            page_manager(alert)
            slack("#incidents", alert)
            if alert.auto_action == "halt":
                halt_workflow(alert.workflow_id)
        elif alert.tier == Tier.HIGH:
            page_oncall(alert)
            slack("#agent-alerts", alert)
        elif alert.tier == Tier.WARN:
            slack("#agent-alerts", alert)
        else:
            log(alert)
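
The statistical rule in the engine reads a per-agent `mean` and `std` from `baselines` but does not show where they come from. One way to build that table from historical latency samples might look like the sketch below. `build_baselines`, `is_latency_anomaly`, and the `min_samples` cutoff of 30 are illustrative assumptions, not part of the original engine.

```python
import statistics


def build_baselines(samples: dict[str, list[float]],
                    min_samples: int = 30) -> dict[str, dict[str, float]]:
    """Compute per-agent latency baselines (ms) from historical samples."""
    baselines = {}
    for agent, latencies in samples.items():
        if len(latencies) < min_samples:
            # Too little history: skip the agent rather than alert on noise.
            continue
        baselines[agent] = {
            "mean": statistics.fmean(latencies),
            "std": statistics.pstdev(latencies),
        }
    return baselines


def is_latency_anomaly(latency_ms: float, baseline: dict[str, float],
                       k: float = 3.0) -> bool:
    """Same test as the engine's statistical rule: latency > mean + k * std."""
    return latency_ms > baseline["mean"] + k * baseline["std"]
```

Agents with too few samples are left out of the table, so the engine's `if latency and baseline` check skips them instead of firing on an unreliable threshold.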

Common alerting mistakes to avoid

A useful alert tells you three things: what's wrong, what it affects, and what to do next. Drop any of these three and what's left is noise.
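
The three-part test can be enforced where the alert text is produced, so that an alert missing any part never reaches a pager. A minimal sketch, assuming a hypothetical `AlertMessage` type whose field names (`what`, `impact`, `next_step`) are chosen here for illustration:

```python
from dataclasses import dataclass


@dataclass
class AlertMessage:
    what: str        # what's wrong
    impact: str      # what it affects
    next_step: str   # what to do next (for example a runbook step)

    def missing_fields(self) -> list[str]:
        """Names of the required parts that are empty or whitespace."""
        return [name for name in ("what", "impact", "next_step")
                if not getattr(self, name).strip()]

    def render(self) -> str:
        """Format the alert, refusing to emit one that is not actionable."""
        missing = self.missing_fields()
        if missing:
            raise ValueError(
                f"alert is not actionable, missing: {', '.join(missing)}")
        return (f"WHAT: {self.what}\n"
                f"IMPACT: {self.impact}\n"
                f"NEXT: {self.next_step}")
```

Raising at render time is one design choice among several; a softer variant could downgrade the alert to INFO instead of rejecting it.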