Risk is just how likely × how bad × how far it spreads.
You can't apply maximum safeguards to every possible failure; if you tried, you'd never ship anything. Risk modeling is how you decide which failures need strict safety checks, which just need monitoring, and which are fine to accept. There are three pieces to think about:
- How likely is it? How often does this failure happen per 1,000 runs? You'll learn the real number from your own logs once you're in production. Before that, look at similar systems.
- How bad is it when it happens? Money lost, customers hurt, regulatory trouble, damage to your reputation.
- How far does it spread? One user? One team? One company? All your customers? A small problem that hits everyone can be worse than a big problem that hits one person.
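To make that last point concrete with purely illustrative numbers: a minor glitch (severity 2 on a 1-to-5 scale) that hits 10,000 users outweighs a severe failure (severity 5) that hits one user, since 2 × 10,000 = 20,000 versus 5 × 1 = 5, holding likelihood equal.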
Failures specific to agent systems
| What can go wrong | How often | How bad | What helps |
|---|---|---|---|
| Agent confidently states something false | High | Varies | Have it look things up; cross-check; validate the output format |
| Hidden instructions inside a document | Medium | High | Strict output formats, separate roles, sandboxing |
| Agent calls a destructive tool | Low | Critical | Restrict tools per agent, require human approval, allow dry-run mode |
| Infinite loops or runaway token spend | Medium | High | Cap iterations, set a token budget, detect when no progress is happening |
| Agent forgets the original goal mid-task | Medium | Medium | Re-state the goal periodically, check final answer against the original ask |
| Personal data or secrets leak in output | Medium | Critical | Output filters, redaction, isolate sensitive data in separate roles |
| One agent's bad output poisons another | High | Medium | Curated handoffs, typed message formats |
| An agent earlier in the chain has been compromised | Low | Critical | Layered defenses, have a different agent verify critical outputs |
| Agent gaming its own metrics | Medium | Varies | Use multiple metrics, have a separate auditor agent |
| One failure causing many more | Low | High | Circuit breakers, isolation between agents, graceful fallback |
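The loop and budget mitigations in the table are easy to wire in around whatever step function your framework exposes. Here's a minimal sketch; `run_step`, its return shape, and the limits are all placeholder assumptions, not any specific framework's API.

```python
def run_with_budget(run_step, max_iterations=20, token_budget=50_000):
    """Run an agent step by step with hard caps on iterations and token spend.

    Assumes `run_step` returns a dict like {"text": str, "done": bool, "tokens": int};
    adapt to whatever your framework actually returns.
    """
    tokens_used = 0
    last_text = None

    for i in range(max_iterations):
        step = run_step()
        tokens_used += step["tokens"]

        if tokens_used > token_budget:
            raise RuntimeError(f"token budget exceeded after {i + 1} steps")
        if step["text"] == last_text:
            raise RuntimeError("no progress: agent repeated its previous output")
        last_text = step["text"]

        if step["done"]:
            return step

    raise RuntimeError(f"did not finish within {max_iterations} iterations")
```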
The 5×5 risk grid
A simple way to put each risk in one of five buckets. Likelihood goes along the rows, severity along the columns; the cell tells you what to do about it.
The grid is built from one simple score: likelihood × severity, both rated 1 to 5. Score ranges map to actions: 1 to 3 means accept the risk, 4 to 6 means just log it, 7 to 10 means watch a dashboard, 11 to 15 means alert someone, 16 to 25 means block it before it can happen. The bigger the score, the bigger the response; a higher score never gets a lighter one.
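As code, the whole grid fits in a few lines. This is just the mapping described above, nothing more; the 1-to-5 ratings are whatever your team agrees on.

```python
def grid_action(likelihood: int, severity: int) -> str:
    """Map a 1-5 likelihood and a 1-5 severity to one of the five buckets."""
    score = likelihood * severity      # 1..25
    if score >= 16: return "BLOCK"
    if score >= 11: return "ALERT"
    if score >= 7:  return "MONITOR"
    if score >= 4:  return "LOG"
    return "ACCEPT"

grid_action(2, 2)   # 4  -> "LOG"
grid_action(4, 5)   # 20 -> "BLOCK"
```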
Use the grid when you need a quick answer in a meeting. Use the more detailed formula below when you need to think about how far it spreads and whether you'd notice, which the grid doesn't capture.
A more detailed scoring formula
The grid above assumes the failure only affects one user and that you'd reliably catch it. The formula below is the same idea but adds two extra factors that matter for agent systems: how many things one bad event affects, and how likely you are to notice it. When all four factors are filled in, the formula gives you a precise number; the grid is the quick check. They should usually agree within one tier.
```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    TRIVIAL = 1
    MINOR = 2
    MODERATE = 3
    MAJOR = 4
    SEVERE = 5


@dataclass
class RiskScore:
    likelihood: float     # 0..1 expected per-run probability
    impact: Severity      # consequence if it happens
    blast_radius: int     # number of users/records affected
    detectability: float  # 0..1 how likely we'll catch it

    @property
    def composite(self) -> float:
        # Higher = more dangerous
        # Detectability inversely scales risk: undetectable = scary
        return (self.likelihood
                * self.impact.value
                * (1 + self.blast_radius / 1000)
                * (2 - self.detectability))

    def control_tier(self) -> str:
        c = self.composite
        if c >= 8:   return "BLOCK"    # pre-execution guardrail
        if c >= 4:   return "ALERT"    # real-time alerting
        if c >= 1.5: return "MONITOR"  # dashboard tracking
        if c >= 0.5: return "LOG"      # audit only
        return "ACCEPT"


# ─── Worked example ───
risks = {
    "hallucinated_price": RiskScore(0.30, Severity.MAJOR, 10, 0.6),
    "prompt_injection":   RiskScore(0.20, Severity.SEVERE, 2000, 0.4),
    "infinite_loop":      RiskScore(0.20, Severity.MODERATE, 1, 0.90),
    "pii_leak":           RiskScore(0.10, Severity.SEVERE, 10000, 0.5),
}

for name, r in risks.items():
    print(f"{name}: composite={r.composite:.2f} control={r.control_tier()}")

# Output:
# hallucinated_price: composite=1.70 control=MONITOR
# prompt_injection: composite=4.80 control=ALERT
# infinite_loop: composite=0.66 control=LOG
# pii_leak: composite=8.25 control=BLOCK
```
The exact numbers in this formula aren't sacred; tune them for your situation. What matters is having a clear, repeatable way to decide which risks deserve which kinds of safeguards. Otherwise every meeting becomes an argument from gut feeling and whoever talks loudest wins.
Risks to people who didn't sign up for this
Most safety conversations are about two parties: the team running the agent, and the user it's serving. Don't harm the user. Don't expose the operator. That's the whole frame. But agents have a third party that shows up almost every time and almost never gets named: everyone else affected by what the agent does, who never agreed to its existence and didn't get a chance to opt out.
Some examples make this concrete:
- An agent buying flights for its user contributes to demand-driven pricing that makes the same flight more expensive for the next buyer.
- An agent calling APIs at scale contributes to rate-limit pressure that degrades service for everyone else using the same API.
- An agent generating SEO content contributes to information pollution that everyone searching the web inherits.
- An agent making low-friction decisions on the user's behalf shifts cognitive load away from the user. Over time, this affects the user's own future ability to make those decisions. Their future self is a third party in the most literal sense.
None of this is malicious. It's just what happens when something acts on behalf of one party in a world that contains other parties. The risk frameworks that ignore this aren't wrong; they're incomplete.
You can't measure every downstream effect. What you can do is identify the top two or three categories of third-party effect for your specific agent and put rough numbers on them. Even crude numbers are better than the current zero. A flight-booking agent should know what fraction of its searches contribute to price discovery. An API-calling agent should know what its share of the upstream rate limit looks like. A decision-making agent for a user should think about which decisions are also development opportunities for that user, and not casually take all of them away.
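Putting crude numbers on it can be as simple as back-of-the-envelope tracking. The sketch below estimates one of the effects mentioned above; the rate-limit figure and call counts are invented for illustration, so substitute your own telemetry and the provider's documented limits.

```python
# Rough share of an upstream API's shared rate limit that our agent consumes.
UPSTREAM_LIMIT_PER_MIN = 10_000   # assumed provider ceiling; use the real documented value

def rate_limit_share(our_calls_per_min: float) -> float:
    """Fraction of the shared ceiling our agent's traffic takes up."""
    return our_calls_per_min / UPSTREAM_LIMIT_PER_MIN

print(f"{rate_limit_share(1_800):.0%} of the shared rate limit")   # 18%
```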
The point isn't to refuse to act. It's to make the costs visible. A risk register that lists "harm to user" and "harm to operator" but has nothing under "harm to people downstream" is missing the row that shows up most often in practice.
How risk changes by pattern
The same risk can be much bigger or much smaller depending on which pattern you picked in the previous chapter:
| Risk | Orchestrator | Swarm | Blackboard | Pipeline |
|---|---|---|---|---|
| Infinite loops | Low | High | Medium | Very low |
| Forgetting the goal mid-task | Low | High | Medium | Low |
| Bad output spreading between agents | Low | High | High | Medium |
| One failure causing many more | Medium | Low | Low | High |
| Cost spiraling | Medium | High | Medium | Low |
How to read this: pipelines are cheap and predictable, but if one stage breaks, everything downstream goes with it. Swarms are flexible but easily blow the budget. Pick the pattern whose risks you can pay for.
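For the pipeline case, the usual mitigation for "one failure causing many more" is a circuit breaker between stages: after a few consecutive errors, stop calling the broken stage and fall back instead of letting failures cascade downstream. A minimal sketch, with an arbitrary threshold and no automatic reset, just to show the shape:

```python
class CircuitBreaker:
    """Wrap a pipeline stage; after repeated failures, route to a fallback.

    Simplified on purpose: no half-open state or timed reset. Thresholds are
    illustrative and should be tuned to your own failure data.
    """
    def __init__(self, stage, fallback, max_failures=3):
        self.stage = stage
        self.fallback = fallback
        self.max_failures = max_failures
        self.failures = 0

    def __call__(self, payload):
        if self.failures >= self.max_failures:
            return self.fallback(payload)   # circuit open: skip the broken stage
        try:
            result = self.stage(payload)
            self.failures = 0               # success resets the counter
            return result
        except Exception:
            self.failures += 1
            return self.fallback(payload)
```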