15 Risk modeling · what can go wrong, scored

Risk is just how likely × how bad × how far it spreads.

You can't lock down every possible failure with maximum rigor; if you tried, you'd never ship anything. Risk modeling is how you decide which failures need strict safety checks, which only need monitoring, and which are fine to accept. There are three pieces to think about: what can go wrong, how to score it, and how the pattern you chose changes the picture.

Failures specific to agent systems

| What can go wrong | How often | How bad | What helps |
| --- | --- | --- | --- |
| Agent confidently states something false | High | Varies | Have it look things up; cross-check; validate the output format |
| Hidden instructions inside a document | Medium | High | Strict output formats, separate roles, sandboxing |
| Agent calls a destructive tool | Low | Critical | Restrict tools per agent, require human approval, allow dry-run mode |
| Infinite loops or runaway token spend | Medium | High | Cap iterations, set a token budget, detect stalled progress (sketched below) |
| Agent forgets the original goal mid-task | Medium | Medium | Restate the goal periodically, check the final answer against the original ask |
| Personal data or secrets leak in output | Medium | Critical | Output filters, redaction, isolate sensitive data in its own role |
| One agent's bad output poisons another | High | Medium | Curated handoffs, typed message formats |
| An agent earlier in the chain has been compromised | Low | Critical | Layered defenses, have a different agent verify critical outputs |
| Agent gaming its own metrics | Medium | Varies | Use multiple metrics, have a separate auditor agent |
| One failure causing many more | Low | High | Circuit breakers, isolation between agents, graceful fallback |
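
One of the mitigations above, capping iterations and token spend, fits in a few lines. This is a minimal sketch, assuming the agent loop exposes a step() callable that reports whether it's done and how many tokens it just used; the caps are placeholder numbers, not recommendations.

class BudgetExceeded(Exception):
    """Raised when the agent loop hits its iteration or token cap."""

def run_with_budget(step, max_iterations=20, token_budget=50_000):
    # `step` is assumed to return (done: bool, tokens_used: int)
    tokens_spent = 0
    for i in range(max_iterations):
        done, tokens_used = step()
        tokens_spent += tokens_used
        if tokens_spent > token_budget:
            raise BudgetExceeded(f"token budget exhausted after {i + 1} steps")
        if done:
            return tokens_spent
    raise BudgetExceeded(f"no completion after {max_iterations} iterations")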

The 5×5 risk grid

A simple way to put each risk into one of five buckets: lay how likely it is along the rows and how bad it would be along the columns. The cell where they meet tells you what to do about it.

|          | Trivial | Minor   | Moderate | Major   | Severe  |
| -------- | ------- | ------- | -------- | ------- | ------- |
| Frequent | Log     | Monitor | Alert    | Block   | Block   |
| Likely   | Log     | Monitor | Alert    | Block   | Block   |
| Possible | Accept  | Log     | Monitor  | Alert   | Alert   |
| Unlikely | Accept  | Log     | Log      | Monitor | Monitor |
| Rare     | Accept  | Accept  | Accept   | Log     | Log     |

The grid is built from one simple score: how likely × how bad, with both rated 1 to 5. Score ranges become actions: 1 to 3 means accept the risk, 4 to 6 means just log it, 7 to 10 means watch a dashboard, 11 to 15 means alert someone, 16 to 25 means block it before it can happen. The bigger the score, the stronger the response; the grid only ever escalates as the score climbs, never the reverse.
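
The rule behind the grid also fits in a couple of lines of code. A minimal sketch; the function name and example calls are just for illustration, and the cutoffs are the same as above.

def grid_action(likelihood: int, severity: int) -> str:
    # Both inputs rated 1-5; their product (1-25) picks the response tier.
    score = likelihood * severity
    if score >= 16: return "BLOCK"
    if score >= 11: return "ALERT"
    if score >= 7:  return "MONITOR"
    if score >= 4:  return "LOG"
    return "ACCEPT"

print(grid_action(4, 4))  # Likely x Major = 16 -> BLOCK
print(grid_action(2, 5))  # Unlikely x Severe = 10 -> MONITOR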

Use the grid when you need a quick answer in a meeting. Use the more detailed formula below when you need to think about how far a failure spreads and whether you'd notice it, which the grid doesn't capture.

A more detailed scoring formula

The grid above assumes the failure only affects one user and that you'd reliably catch it. The formula below is the same idea but adds two extra factors that matter for agent systems: how many things one bad event affects, and how likely you are to notice it. When all four factors are filled in, the formula gives you a precise number; the grid is the quick check. They should usually agree within one tier.

from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    TRIVIAL = 1
    MINOR = 2
    MODERATE = 3
    MAJOR = 4
    SEVERE = 5

@dataclass
class RiskScore:
    likelihood: float        # 0..1 expected per-run probability
    impact: Severity         # consequence if it happens
    blast_radius: int        # number of users/records affected
    detectability: float     # 0..1 how likely we'll catch it

    @property
    def composite(self) -> float:
        # Higher = more dangerous
        # Detectability inversely scales risk: undetectable = scary
        return (self.likelihood
                * self.impact.value
                * (1 + self.blast_radius / 1000)
                * (2 - self.detectability))

    def control_tier(self) -> str:
        c = self.composite
        if c >= 8:    return "BLOCK"      # pre-execution guardrail
        if c >= 4:    return "ALERT"      # real-time alerting
        if c >= 1.5:  return "MONITOR"    # dashboard tracking
        if c >= 0.5:  return "LOG"        # audit only
        return "ACCEPT"

# ─── Worked example ───
risks = {
    "hallucinated_price": RiskScore(0.30, Severity.MAJOR, 10, 0.6),
    "prompt_injection": RiskScore(0.20, Severity.SEVERE, 2000, 0.4),
    "infinite_loop": RiskScore(0.20, Severity.MODERATE, 1, 0.90),
    "pii_leak": RiskScore(0.10, Severity.SEVERE, 10000, 0.5),
}

for name, r in risks.items():
    print(f"{name}: composite={r.composite:.2f} control={r.control_tier()}")

# Output:
# hallucinated_price: composite=1.70  control=MONITOR
# prompt_injection:   composite=4.80  control=ALERT
# infinite_loop:      composite=0.66  control=LOG
# pii_leak:           composite=8.25  control=BLOCK

The exact numbers in this formula aren't sacred; tune them for your situation. What matters is having a clear, repeatable way to decide which risks deserve which kinds of safeguards. Otherwise every meeting becomes an argument from gut feeling and whoever talks loudest wins.

Risks to people who didn't sign up for this

Most safety conversations are about two parties: the team running the agent, and the user it's serving. Don't harm the user. Don't expose the operator. That's the whole frame. But agents have a third party that shows up almost every time and almost never gets named: everyone else affected by what the agent does, who never agreed to its existence and didn't get a chance to opt out.

Some examples make this concrete: a flight-booking agent's searches feed the price-discovery signals that every other traveler sees; an API-calling agent eats into a rate limit it shares with the upstream provider's other customers; an agent that quietly makes every decision for its user also takes away that user's chances to build judgment of their own.

None of this is malicious. It's just what happens when something acts on behalf of one party in a world that contains other parties. The risk frameworks that ignore this aren't wrong; they're incomplete.

You can't measure every downstream effect. What you can do is identify the top two or three categories of third-party effect for your specific agent and put rough numbers on them. Even crude numbers are better than the current zero. A flight-booking agent should know what fraction of its searches contribute to price discovery. An API-calling agent should know what its share of the upstream rate limit looks like. A decision-making agent for a user should think about which decisions are also development opportunities for that user, and not casually take all of them away.

The point isn't to refuse to act. It's to make the costs visible. A risk register that lists "harm to user" and "harm to operator" but has nothing under "harm to people downstream" is missing the row that shows up most often in practice.
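
One lightweight way to make that row exist is to tag every register entry with who bears the harm. A minimal sketch, not a standard; the class names, field names, and numbers are all placeholders.

from dataclasses import dataclass
from enum import Enum

class AffectedParty(Enum):
    USER = "user"                  # the person the agent serves
    OPERATOR = "operator"          # the team running the agent
    THIRD_PARTY = "third_party"    # everyone else downstream

@dataclass
class RegisterEntry:
    name: str
    affected: AffectedParty
    rough_estimate: str            # even a crude number beats a blank

register = [
    RegisterEntry("hallucinated_price", AffectedParty.USER, "~10 bad quotes/day"),
    RegisterEntry("upstream_rate_limit_share", AffectedParty.THIRD_PARTY,
                  "~5% of the provider's quota at peak"),
]

if not any(e.affected is AffectedParty.THIRD_PARTY for e in register):
    print("register is missing the third-party row")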

How risk changes by pattern

The same risk can be much bigger or much smaller depending on which pattern you picked in the previous chapter:

| Risk                                | Orchestrator | Swarm | Blackboard | Pipeline |
| ----------------------------------- | ------------ | ----- | ---------- | -------- |
| Infinite loops                      | Low          | High  | Medium     | Very low |
| Forgetting the goal mid-task        | Low          | High  | Medium     | Low      |
| Bad output spreading between agents | Low          | High  | High       | Medium   |
| One failure causing many more       | Medium       | Low   | Low        | High     |
| Cost spiraling                      | Medium       | High  | Medium     | Low      |

How to read this: pipelines are cheap and predictable, but if one stage breaks, everything downstream goes with it. Swarms are flexible but easily blow the budget. Pick the pattern whose risks you can pay for.
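
If it helps to have the table in code form, a plain lookup does the job. A minimal sketch; the dictionary keys are shorthand for the table's row and column labels, copied from above.

PATTERN_RISK = {
    "infinite_loops":    {"orchestrator": "low",    "swarm": "high", "blackboard": "medium", "pipeline": "very low"},
    "goal_drift":        {"orchestrator": "low",    "swarm": "high", "blackboard": "medium", "pipeline": "low"},
    "output_spreading":  {"orchestrator": "low",    "swarm": "high", "blackboard": "high",   "pipeline": "medium"},
    "cascading_failure": {"orchestrator": "medium", "swarm": "low",  "blackboard": "low",    "pipeline": "high"},
    "cost_spiral":       {"orchestrator": "medium", "swarm": "high", "blackboard": "medium", "pipeline": "low"},
}

LEVELS = ["very low", "low", "medium", "high"]

def riskiest_pattern(risk: str) -> str:
    # Which pattern makes this particular risk worst?
    return max(PATTERN_RISK[risk], key=lambda p: LEVELS.index(PATTERN_RISK[risk][p]))

print(riskiest_pattern("cost_spiral"))   # -> swarm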