Risk is just how likely × how bad × how far it spreads.
You can't apply maximum safeguards to every possible failure; if you tried, you'd never ship anything. Risk modeling is how you decide which failures need strict safety checks, which just need monitoring, and which are fine to accept. There are three pieces to think about:
- How likely is it? How often does this failure happen per 1,000 runs? You'll learn the real number from your own logs once you're in production. Before that, look at similar systems.
- How bad is it when it happens? Money lost, customers hurt, regulatory trouble, damage to your reputation.
- How far does it spread? One user? One team? One company? All your customers? A small problem that hits everyone can be worse than a big problem that hits one person.
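To make that last point concrete with purely illustrative numbers: a minor glitch (severity 2 on a 1-to-5 scale) that hits 10,000 users outweighs a severe failure (severity 5) that hits one user, since 2 × 10,000 = 20,000 versus 5 × 1 = 5, holding likelihood equal.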
Failures specific to agent systems
| What can go wrong | How often | How bad | What helps |
|---|---|---|---|
| Agent confidently states something false | High | Varies | Have it look things up; cross-check; validate the output format |
| Hidden instructions inside a document | Medium | High | Strict output formats, separate roles, sandboxing |
| Agent calls a destructive tool | Low | Critical | Restrict tools per agent, require human approval, allow dry-run mode |
| Infinite loops or runaway token spend | Medium | High | Cap iterations, set a token budget, detect when no progress is happening |
| Agent forgets the original goal mid-task | Medium | Medium | Re-state the goal periodically, check final answer against the original ask |
| Personal data or secrets leak in output | Medium | Critical | Output filters, redaction, isolate sensitive data in separate roles |
| One agent's bad output poisons another | High | Medium | Curated handoffs, typed message formats |
| An agent earlier in the chain has been compromised | Low | Critical | Layered defenses, have a different agent verify critical outputs |
| Agent gaming its own metrics | Medium | Varies | Use multiple metrics, have a separate auditor agent |
| One failure causing many more | Low | High | Circuit breakers, isolation between agents, graceful fallback |
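The loop and budget mitigations in the table are easy to wire in around whatever step function your framework exposes. Here's a minimal sketch; `run_step`, its return shape, and the limits are all placeholder assumptions, not any specific framework's API.

```python
def run_with_budget(run_step, max_iterations=20, token_budget=50_000):
    """Run an agent step by step with hard caps on iterations and token spend.

    Assumes `run_step` returns a dict like {"text": str, "done": bool, "tokens": int};
    adapt to whatever your framework actually returns.
    """
    tokens_used = 0
    last_text = None

    for i in range(max_iterations):
        step = run_step()
        tokens_used += step["tokens"]

        if tokens_used > token_budget:
            raise RuntimeError(f"token budget exceeded after {i + 1} steps")
        if step["text"] == last_text:
            raise RuntimeError("no progress: agent repeated its previous output")
        last_text = step["text"]

        if step["done"]:
            return step

    raise RuntimeError(f"did not finish within {max_iterations} iterations")
```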
The 5×5 risk grid
A simple way to put each risk in one of five buckets. Likelihood goes along the rows, severity along the columns; the cell tells you what to do about it.
The grid is built from one simple score: likelihood × severity, both rated 1 to 5. Score ranges map to actions: 1 to 3 means accept the risk, 4 to 6 means just log it, 7 to 10 means watch a dashboard, 11 to 15 means alert someone, 16 to 25 means block it before it can happen. The bigger the score, the bigger the response; a higher score never gets a lighter one.
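As code, the whole grid fits in a few lines. This is just the mapping described above, nothing more; the 1-to-5 ratings are whatever your team agrees on.

```python
def grid_action(likelihood: int, severity: int) -> str:
    """Map a 1-5 likelihood and a 1-5 severity to one of the five buckets."""
    score = likelihood * severity      # 1..25
    if score >= 16: return "BLOCK"
    if score >= 11: return "ALERT"
    if score >= 7:  return "MONITOR"
    if score >= 4:  return "LOG"
    return "ACCEPT"

grid_action(2, 2)   # 4  -> "LOG"
grid_action(4, 5)   # 20 -> "BLOCK"
```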
Use the grid when you need a quick answer in a meeting. Use the more detailed formula below when you need to think about how far it spreads and whether you'd notice, which the grid doesn't capture.
A more detailed scoring formula
The grid above assumes the failure only affects one user and that you'd reliably catch it. The formula below is the same idea but adds two extra factors that matter for agent systems: how many things one bad event affects, and how likely you are to notice it. When all four factors are filled in, the formula gives you a precise number; the grid is the quick check. They should usually agree within one tier.
```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    TRIVIAL = 1
    MINOR = 2
    MODERATE = 3
    MAJOR = 4
    SEVERE = 5


@dataclass
class RiskScore:
    likelihood: float     # 0..1 expected per-run probability
    impact: Severity      # consequence if it happens
    blast_radius: int     # number of users/records affected
    detectability: float  # 0..1 how likely we'll catch it

    @property
    def composite(self) -> float:
        # Higher = more dangerous
        # Detectability inversely scales risk: undetectable = scary
        return (self.likelihood
                * self.impact.value
                * (1 + self.blast_radius / 1000)
                * (2 - self.detectability))

    def control_tier(self) -> str:
        c = self.composite
        if c >= 8:   return "BLOCK"    # pre-execution guardrail
        if c >= 4:   return "ALERT"    # real-time alerting
        if c >= 1.5: return "MONITOR"  # dashboard tracking
        if c >= 0.5: return "LOG"      # audit only
        return "ACCEPT"


# ─── Worked example ───
risks = {
    "hallucinated_price": RiskScore(0.30, Severity.MAJOR, 10, 0.6),
    "prompt_injection":   RiskScore(0.20, Severity.SEVERE, 2000, 0.4),
    "infinite_loop":      RiskScore(0.20, Severity.MODERATE, 1, 0.90),
    "pii_leak":           RiskScore(0.10, Severity.SEVERE, 10000, 0.5),
}

for name, r in risks.items():
    print(f"{name}: composite={r.composite:.2f} control={r.control_tier()}")

# Output:
# hallucinated_price: composite=1.70 control=MONITOR
# prompt_injection: composite=4.80 control=ALERT
# infinite_loop: composite=0.66 control=LOG
# pii_leak: composite=8.25 control=BLOCK
```
The exact numbers in this formula aren't sacred; tune them for your situation. What matters is having a clear, repeatable way to decide which risks deserve which kinds of safeguards. Otherwise every meeting becomes an argument from gut feeling and whoever talks loudest wins.
Risks to people who didn't sign up for this
Most safety conversations are about two parties: the team running the agent, and the user it's serving. Don't harm the user. Don't expose the operator. That's the whole frame. But agents have a third party that shows up almost every time and almost never gets named: everyone else affected by what the agent does, who never agreed to its existence and didn't get a chance to opt out.
Some examples make this concrete:
- An agent buying flights for its user contributes to demand-driven pricing that makes the same flight more expensive for the next buyer.
- An agent calling APIs at scale contributes to rate-limit pressure that degrades service for everyone else using the same API.
- An agent generating SEO content contributes to information pollution that everyone searching the web inherits.
- An agent making low-friction decisions on the user's behalf shifts cognitive load away from the user. Over time, this affects the user's own future ability to make those decisions. Their future self is a third party in the most literal sense.
None of this is malicious. It's just what happens when something acts on behalf of one party in a world that contains other parties. The risk frameworks that ignore this aren't wrong; they're incomplete.
You can't measure every downstream effect. What you can do is identify the top two or three categories of third-party effect for your specific agent and put rough numbers on them. Even crude numbers are better than the current zero. A flight-booking agent should know what fraction of its searches contribute to price discovery. An API-calling agent should know what its share of the upstream rate limit looks like. A decision-making agent for a user should think about which decisions are also development opportunities for that user, and not casually take all of them away.
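Putting crude numbers on it can be as simple as back-of-the-envelope tracking. The sketch below estimates one of the effects mentioned above; the rate-limit figure and call counts are invented for illustration, so substitute your own telemetry and the provider's documented limits.

```python
# Rough share of an upstream API's shared rate limit that our agent consumes.
UPSTREAM_LIMIT_PER_MIN = 10_000   # assumed provider ceiling; use the real documented value

def rate_limit_share(our_calls_per_min: float) -> float:
    """Fraction of the shared ceiling our agent's traffic takes up."""
    return our_calls_per_min / UPSTREAM_LIMIT_PER_MIN

print(f"{rate_limit_share(1_800):.0%} of the shared rate limit")   # 18%
```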
The point isn't to refuse to act. It's to make the costs visible. A risk register that lists "harm to user" and "harm to operator" but has nothing under "harm to people downstream" is missing the row that shows up most often in practice.
How risk changes by pattern
The same risk can be much bigger or much smaller depending on which pattern you picked in the previous chapter:
| Risk | Orchestrator | Swarm | Blackboard | Pipeline |
|---|---|---|---|---|
| Infinite loops | Low | High | Medium | Very low |
| Forgetting the goal mid-task | Low | High | Medium | Low |
| Bad output spreading between agents | Low | High | High | Medium |
| One failure causing many more | Medium | Low | Low | High |
| Cost spiraling | Medium | High | Medium | Low |
How to read this: pipelines are cheap and predictable, but if one stage breaks, everything downstream goes with it. Swarms are flexible but easily blow the budget. Pick the pattern whose risks you can pay for.
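For the pipeline case, the usual mitigation for "one failure causing many more" is a circuit breaker between stages: after a few consecutive errors, stop calling the broken stage and fall back instead of letting failures cascade downstream. A minimal sketch, with an arbitrary threshold and no automatic reset, just to show the shape:

```python
class CircuitBreaker:
    """Wrap a pipeline stage; after repeated failures, route to a fallback.

    Simplified on purpose: no half-open state or timed reset. Thresholds are
    illustrative and should be tuned to your own failure data.
    """
    def __init__(self, stage, fallback, max_failures=3):
        self.stage = stage
        self.fallback = fallback
        self.max_failures = max_failures
        self.failures = 0

    def __call__(self, payload):
        if self.failures >= self.max_failures:
            return self.fallback(payload)   # circuit open: skip the broken stage
        try:
            result = self.stage(payload)
            self.failures = 0               # success resets the counter
            return result
        except Exception:
            self.failures += 1
            return self.fallback(payload)
```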