The thing brains do that most agents do not.
The previous chapter covered predicting the agent's own behavior. This chapter covers the prospective complement: predicting the system the agent operates in or observes. The two are different problems with different math. Predicting the agent is about constraining what the agent might do next. Predicting the environment is about constraining what the world is about to look like, before the agent has to act on it.
Open with a fact from neuroscience because it makes the engineering point cleanly. Your brain does not passively receive what your eyes see. It runs a continuous predictive model of what the next visual frame should look like, and only the differences from prediction get full attention. The brain is fundamentally a prediction machine, not a passive receiver. There is an instructive corollary: people who are congenitally blind never build a visual predictive model, and one prominent hypothesis (Pollak and Corlett, 2020) is that this is one reason their rates of schizophrenia are dramatically lower than in the sighted population. Schizophrenia in the predictive-coding view is a hierarchical predictive system that has started trusting its own predictions more than reality. No predictive system, no hierarchical predictive failure.
Take the engineering analogy seriously. A reactive agent with no predictive model is bounded by ground-truth latency but cannot hallucinate predictions. An agent with a strong predictive model is fast but can drift into prediction-driven confabulation. The whole engineering content of this chapter is in that tradeoff: building the predictive model is the easy part; designing against the failure mode is the hard part.
What a predictive world model actually gives you
Four concrete things, in order of how often they pay off:
- Anomaly detection for free. A model that predicts the next observation produces a prediction error on every step. Large prediction errors are by definition the surprising events. You did not have to write a rule for "what counts as anomalous"; you got it as a side effect of running the predictor. This is the single biggest reason to build a predictive layer for an observer agent: you replace hand-written alerting rules with one number, the surprise.
- Lower latency through speculative execution. If you predict the user's next request before they make it, you can pre-fetch the data, pre-compute the answer, and have it ready when the request arrives. The user's perceived latency drops from network plus model plus retrieval to network only. CPUs have done this for decades; it is called branch prediction. Agents can do the same at the action level.
- Compressed memory. Storing every observation is expensive at scale. Storing the predictions plus the deltas where prediction failed is much cheaper, because reality usually matches the prediction. A clinical monitoring agent watching a patient's vitals does not need to store every heartbeat; it stores the model and the moments where the model was wrong.
- Faster failure detection. When the world diverges from prediction, you know on the first step that something is off, not three steps later when the consequences have cascaded. This is the single most useful property in production: prediction errors are an early warning system that fires before any other monitor.
A common confusion is worth clearing up before going further. LLMs are next-token predictors, but next-token prediction is not what this chapter is about. The LLM predicts the next token in a language sequence. A world model predicts the next state of an external system: the next reading from a sensor, the next row in a database, the next request from a user, the next price in a feed. The two can be combined (an LLM can be the implementation of a world model if you give it the right prompt and history), but they are conceptually distinct. Mixing them up causes operators to assume their agent already has a world model when it does not.
Three kinds of predictive models, by what they predict
Different agents need different things predicted. Naming the three kinds keeps the engineering choices honest.
| Type | What it predicts | Used by | Typical implementation |
|---|---|---|---|
| State-transition model | Given the current state and the action the agent is about to take, what state will the world be in next? | Agents that act and need to plan: trading agents, robots, multi-step workflow agents | Markov chain, neural state-space model, foundation-model-with-history |
| Observation model | Given a stream of past observations, what is the next observation likely to be? | Pure observer agents: monitoring, clinical, industrial sensors, fraud detection | Time-series forecaster, LSTM, lightweight transformer |
| Intent model | Given the user's recent behavior, what are they likely to want or do next? | Conversational agents, support agents, autonomous assistants | LLM with conversation history, learned-preference classifier, clustering over user trajectories |
Most production agent systems benefit from at least one of these; many benefit from two. The mistake is trying to build a single predictor that does all three. They have different inputs, different update rates, and different acceptable error tolerances, and bundling them produces a model that does each of them poorly. Pick the kind you actually need, build it well, ship it, then decide whether to add another.
The math, kept simple
Most of the math for predictive models is one idea written four different ways. The idea is surprise minimization: a good predictor assigns high probability to what actually happens, and low probability to what does not. Over time the predictor learns by adjusting itself to reduce surprise on future observations.
The single equation you need to make this operational is the surprise of a single observation given the predictor's distribution:
surprise(o) = -log p(o | model, history)
Read it like this. If your model said this observation was likely (probability close to 1), the log is close to 0 and the surprise is small. If your model said this observation was very unlikely (probability close to 0), the log goes very negative and the negative of it is very large; the surprise is large. Surprise is just how unexpected the actual observation was, in units of "how many bits the model lost." This is the same as cross-entropy, the same as negative log-likelihood, and the same as what training loss is computing every time you fine-tune a model. The vocabulary changes by field; the math is one equation.
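To make the equation concrete, here is a minimal sketch of computing surprise from the probability the predictor assigned to the observation that actually arrived. The function name and the choice of bits as the unit are illustrative, not a kit API.

```python
import math

def surprise(prob_of_actual: float, eps: float = 1e-12) -> float:
    """Surprise in bits: -log2 p(o | model, history).

    prob_of_actual is the probability the predictor assigned to the
    observation that actually arrived. The epsilon floor keeps a
    zero-probability observation from producing infinite surprise.
    """
    return -math.log2(max(prob_of_actual, eps))

# An observation the model was confident in costs almost nothing...
assert surprise(0.99) < 0.02
# ...while one the model thought nearly impossible costs many bits.
assert surprise(0.001) > 9.9
```

Averaged over a stream, this is the cross-entropy that training loss computes; tracked per observation, it is the anomaly signal from the list above.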
Three useful corollaries fall out of this:
- Bayesian update. When a new observation arrives, the model's belief about the world's state updates. The posterior is proportional to the prior times the likelihood: p(state | observation) ∝ p(state) · p(observation | state). In production this is rarely computed in closed form; it is approximated by a learned model or by sampling. The discipline that matters is keeping the prior separable from the likelihood, so you can audit which one was wrong when the prediction misses.
- Information gain. The value of a new observation is how much it changed the model's beliefs. The KL divergence between the prior and the posterior is the formal version of this. Operationally: an observation with high information gain is one your model did not see coming. Track the running mean of information gain and you have a usable proxy for "how surprising the system has been lately." A short sketch of this and the Bayesian update follows the list.
- Free energy. The unifying framing from Friston is that brains and adaptive systems minimize a quantity called free energy, which is approximately the expected surprise over future observations. You do not need the full theory to build an agent. You need the practical translation: choose actions and update beliefs so that the model is rarely surprised. When it is surprised, learn fast; when it is not, act fast.
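Here is a minimal sketch of the first two corollaries over a discrete state space: a Bayesian update that keeps the prior and the likelihood separable, and the KL divergence between prior and posterior as information gain. The two-state example and the function names are illustrative, not from the kit.

```python
import math

def bayes_update(prior: dict, likelihood: dict) -> dict:
    """Posterior over discrete states: p(s | o) ∝ p(s) * p(o | s).

    prior maps state -> p(state); likelihood maps state -> p(observation | state).
    Keeping the two inputs separate is what lets you audit, after a miss,
    whether the prior or the likelihood was the wrong one.
    """
    unnormalized = {s: prior[s] * likelihood.get(s, 0.0) for s in prior}
    z = sum(unnormalized.values()) or 1.0
    return {s: v / z for s, v in unnormalized.items()}

def information_gain(prior: dict, posterior: dict, eps: float = 1e-12) -> float:
    """KL(posterior || prior) in bits: how far this observation moved the beliefs."""
    return sum(
        p * math.log2(p / max(prior[s], eps))
        for s, p in posterior.items() if p > 0
    )

prior = {"healthy": 0.95, "degraded": 0.05}
likelihood = {"healthy": 0.10, "degraded": 0.90}   # p(observed latency spike | state)
posterior = bayes_update(prior, likelihood)
print(posterior, information_gain(prior, posterior))
```

In this example a single surprising latency reading moves the belief in "degraded" from 5% to roughly 32%, an information gain of about half a bit.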
Three production patterns, lightest to heaviest
A predictive model is not a single technique. Three patterns cover almost every real production case; pick the lightest one that gets you the property you need.
- Lightweight: rolling statistics plus a small forecaster. Keep a fixed-size buffer of recent observations. Fit a small model (linear regression, exponential smoothing, ARIMA, a one-layer LSTM, an n-gram over discretized state tokens) on the buffer at low cost. Predict the next observation; compute surprise. Re-fit every few minutes. Total memory and compute are tiny, and this is enough for most observation-model use cases (industrial sensors, system metrics, low-stakes fraud detection). The NGramWorldModel from chapter 15 is an example of this pattern; a minimal sketch of the same idea follows this list. Start here.
- Medium: foundation model as predictor. Give a frontier LLM a structured prompt containing the last N observations plus a system prompt that says "predict the next observation as JSON conforming to this schema." The LLM produces a prediction; you compute surprise against reality when it arrives. This works because LLMs have absorbed enormous priors about how systems behave; you get a strong predictor without training a custom model. The cost is per-call inference latency (typically 200ms to 2s) and the small risk of hallucination, which you defend against with structured output (chapter 02). Use this pattern when the observation space is high-dimensional or semantic and a small forecaster would not capture the structure.
- Heavy: dedicated world-model training. The Dreamer line of work (DreamerV3 in particular, Hafner et al., 2023) trains a recurrent state-space model jointly with a policy, in an end-to-end loop. The agent learns to imagine futures and pick actions whose imagined outcomes are good. This is the right answer for robotics, control problems with expensive real-world interaction, and any setting where the agent must plan rather than react. It is not the right answer for most production software agents in 2026, because the training infrastructure is heavyweight and the simpler patterns above usually suffice.
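A minimal sketch of the lightweight pattern, assuming a scalar observation stream: exponential smoothing as the predictor, and a Gaussian centred on the prediction (with the buffer's own variance as the spread) as the probability model for surprise. The class and its interface are illustrative, not the chapter 15 NGramWorldModel or any kit API.

```python
from collections import deque
import math

class RollingForecaster:
    """Lightweight observation model: exponential smoothing over a rolling buffer."""

    def __init__(self, window: int = 200, alpha: float = 0.2):
        self.buffer = deque(maxlen=window)
        self.alpha = alpha
        self.level = None  # current smoothed estimate; doubles as the next prediction

    def predict(self):
        return self.level

    def observe(self, x: float) -> float:
        """Ingest one observation; return its surprise in bits (0.0 while warming up)."""
        s = 0.0
        if self.level is not None and len(self.buffer) >= 10:
            var = max(self._variance(), 1e-6)
            # Negative log of the Gaussian density at x, converted to bits.
            # (A density can exceed 1, so very tight fits can dip below zero.)
            nll_nats = 0.5 * math.log(2 * math.pi * var) + (x - self.level) ** 2 / (2 * var)
            s = nll_nats / math.log(2)
        self.buffer.append(x)
        self.level = x if self.level is None else self.alpha * x + (1 - self.alpha) * self.level
        return s

    def _variance(self) -> float:
        m = sum(self.buffer) / len(self.buffer)
        return sum((v - m) ** 2 for v in self.buffer) / len(self.buffer)
```

Swapping in ARIMA or a one-layer LSTM changes only predict and observe; the surprise bookkeeping, and everything downstream of it, stays the same.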
Two patterns operators sometimes mistake for predictive models, and what makes them different:
- Caching is not prediction. A cache returns the same answer for the same input. A predictive model returns its best guess about what is about to happen, even for inputs it has never seen. They both reduce latency, but caching is a memoization technique and prediction is a generalization technique. Production systems often need both.
- RAG is not prediction. Retrieval-augmented generation pulls relevant past information into context. A predictive model produces a forward-looking guess about future information. Both can use the same backing store, but they serve opposite directions in time. RAG looks backward; prediction looks forward.
Pure observer agents: when prediction is the entire product
Most of this manual covers agents that act. A growing class of agents only watches, and for those the predictive model is not an optimization, it is the agent. Three examples worth naming, because they teach the pattern:
- Clinical monitoring. An agent watches a patient's continuous vitals (heart rate, SpO2, blood pressure, respiratory rate) and predicts the next reading every five seconds. When prediction error exceeds a threshold for more than a few seconds, the agent alerts. The threshold is patient-specific because each patient's baseline differs. This is essentially what a good ICU nurse does cognitively. The model does not need to understand cardiology; it needs to know what is normal for this patient, right now, and notice when reality stops matching.
- Industrial sensor streams. An agent watches a few thousand sensors on a manufacturing line and predicts the next reading on each. Most predictions are boring (the system is stable). A small fraction are large prediction errors, which the agent surfaces to a human operator. This is anomaly detection without anomaly rules: the agent learned what normal looks like and treats everything that is not normal as worth attention. The shipped product is the rank-ordered list of "weirdest things happening on the line right now."
- Network traffic monitoring. An agent predicts the next minute of traffic per host based on the last few minutes. Hosts that diverge from prediction are flagged. This catches both attacks (sudden surge to an unusual destination) and outages (sudden silence where there should be traffic). The classification rules are not in the agent; they are downstream of the surprise signal.
Pure observer agents have a clean architecture. The decide step in the perceive-decide-act loop is replaced by "compute surprise; if surprise exceeds threshold, route to a human or to a higher-level acting agent." The act step is just "emit alert" or "do nothing." The whole loop is dominated by the predictor and the threshold logic. This is the cleanest production deployment of a predictive model, and the manual's existing chapter on agents that watch instead of answer is the right reading after this one.
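A minimal sketch of that loop, assuming a forecaster with the observe interface from the earlier sketch; read_sensor and emit_alert are illustrative callables standing in for whatever the deployment actually wires up.

```python
import time

def observer_loop(forecaster, read_sensor, emit_alert,
                  threshold_bits: float = 8.0, patience: int = 3,
                  interval_s: float = 5.0):
    """Pure observer: the decide step is just surprise versus threshold.

    An alert fires only after `patience` consecutive surprising readings,
    which filters one-off sensor glitches without hiding sustained drift.
    """
    consecutive = 0
    while True:
        reading = read_sensor()                 # perceive
        s = forecaster.observe(reading)         # predict and score the surprise
        consecutive = consecutive + 1 if s > threshold_bits else 0
        if consecutive >= patience:             # act: emit alert, or do nothing
            emit_alert(reading=reading, surprise_bits=s)
            consecutive = 0
        time.sleep(interval_s)
```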
Agent schizophrenia: the failure mode you must design against
Return to the brain analogy from the opening, because this is where it earns its keep. The predictive-coding view of schizophrenia argues that the positive symptoms (hallucinations, delusions) arise when the brain's predictive system places too much weight on its predictions and not enough on the sensory data contradicting them. The brain "knows" what should be there and supplies it, even when the eyes and ears disagree. Confabulation is the system trusting its priors over its inputs.
Agents with predictive models can fail the same way. Concretely:
- The agent predicts the user wants X; pre-fetches X; never actually checks what the user said; serves X confidently when the user wanted Y. The prediction was the input.
- The agent's monitoring model predicts the system is healthy; reads the next observation; finds it slightly anomalous; explains it away with "this matches my prediction within noise"; misses an actual outage. The prediction was a filter on reality.
- The agent caches predictions and serves them on subsequent calls without re-checking, because the cache hit rate is high. The prediction has replaced ground truth in the data flow.
The defenses against this are not exotic. They are discipline, applied at the right places:
- Reality-check ratio. Set a minimum frequency at which the agent must ignore the prediction and actually read the world. For low-stakes monitoring, "every Nth observation, ground-truth regardless of prediction confidence" is enough. For high-stakes settings, ground-truth more often. The rule is non-negotiable: an agent that never reality-checks has stopped being an observer.
- Prediction confidence intervals. The predictor should not just emit a point estimate; it should emit a range with a confidence (a credible interval). Actions that rely on the prediction should require the interval to be narrow. A wide interval means "I don't know," and the agent should fall back to reading reality rather than acting on a low-confidence prediction.
- Surprise threshold escalation. When prediction error exceeds a high threshold for several consecutive observations, the agent should not just alert; it should temporarily stop trusting the predictor and act reactively until the predictor is re-fit. The system has changed; the old model is wrong; do not let it filter the new reality. A sketch combining this defense with the two above follows this list.
- Audit trail on every predicted action. Every action the agent takes based on a prediction (rather than a direct observation) must be recorded as "predicted, not observed." When the agent is wrong six months later, the audit log says which decisions were on predictions and which were on reality. Without this, post-incident analysis becomes guesswork. This is the same audit-log discipline as the rest of the manual; it just gets a new field.
- Prediction-quality as a reputation signal. The trust engine from chapter 13 already accepts signals from many sources. The running mean of the predictor's surprise is a clean reputation signal: a predictor whose surprise is rising is one whose predictions should be trusted less. This closes the loop: the predictor's own reputation is part of how much weight the rest of the system gives to its outputs.
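A minimal sketch of the first three defenses wired together. The predictor is assumed to expose predict_interval() and observe(); every name here is illustrative rather than a kit API.

```python
import random

class GuardedPredictor:
    """Reality-check ratio, interval-width gating, and surprise escalation in one wrapper."""

    def __init__(self, predictor, read_world,
                 reality_check_ratio: float = 0.1,   # minimum fraction of steps that read ground truth
                 max_interval_width: float = 5.0,    # wider than this means "I don't know"
                 escalation_bits: float = 10.0,      # sustained surprise above this stops the trusting
                 escalation_patience: int = 5):
        self.predictor = predictor
        self.read_world = read_world
        self.reality_check_ratio = reality_check_ratio
        self.max_interval_width = max_interval_width
        self.escalation_bits = escalation_bits
        self.escalation_patience = escalation_patience
        self.surprising_streak = 0
        self.trust_predictor = True              # restored only by an explicit re-fit, outside this sketch

    def next_value(self):
        """Return (value, provenance); provenance goes straight into the audit log."""
        low, high = self.predictor.predict_interval()
        must_ground_truth = (
            not self.trust_predictor
            or (high - low) > self.max_interval_width
            or random.random() < self.reality_check_ratio
        )
        if must_ground_truth:
            actual = self.read_world()
            s = self.predictor.observe(actual)
            self.surprising_streak = self.surprising_streak + 1 if s > self.escalation_bits else 0
            if self.surprising_streak >= self.escalation_patience:
                self.trust_predictor = False     # the world changed; act reactively until re-fit
            return actual, "observed"
        return (low + high) / 2, "predicted"
```

The provenance string is the audit-trail defense; the running mean of the surprise values this wrapper sees is the reputation signal the trust engine can consume.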
The honest framing is that adding a predictive model adds a new privileged sub-system. Every privileged sub-system needs its own audit, its own reputation accounting, and its own fail-safe. The defenses above are not optional; they are what makes the predictive layer safe to deploy. Skip them and you have built an agent that can hallucinate consistency with predictions that no longer hold, which is precisely the failure mode the brain analogy warned about.
How prediction connects to the rest of the manual
A predictive layer touches almost every earlier chapter. Naming the connections keeps the architecture coherent.
- Memory (chapter 08). Memory is what happened; prediction is what is about to happen. Together they compute surprise (memory of prediction versus observation of reality), which is the cleanest learning signal for the agent. The same memory chapter's warning about memory poisoning applies doubly to prediction: a poisoned memory shapes the predictor, which shapes future actions; the audit trail must record both.
- Generalists and specialists (chapter 09). A specialist agent sees a narrower range of expected observations and is therefore easier to build a predictor for. The same logic that makes specialists easier to guard makes them easier to predict. Fine-tuned specialists (chapter 10) can have their predictor fine-tuned on the same domain corpus and get sharper predictions cheaply.
- Predictability (chapter 15). That chapter predicts the agent's own behavior; this one predicts the world. They feed each other: knowing the agent's likely action narrows the space of next world states; knowing the next world state narrows the agent's likely next action. The drift-detection world model in chapter 15 is a specific instance of the general pattern in this chapter.
- Alerting (chapter 17). Prediction error is the cleanest alerting signal you can get because it requires no rules. Hand-written alert thresholds drift; a predictor's surprise signal adapts to the world automatically. Both are useful; predictors usually catch novel failures earlier than rules do.
- Trust (chapter 13). The predictor itself is a sub-agent with a reputation. Track surprise as a signal; let the trust engine downgrade a predictor whose predictions are not holding up. This is how the system avoids agent schizophrenia at the architectural layer, not just the runtime layer.
- Adversarial scenarios (chapter 21). An attacker who can shape what the predictor sees can shape what the agent expects to see. Memory poisoning attacks (chapter 08) generalize directly to predictor poisoning. The defense is the same: provenance on every training observation; consistency checks on retrieval-time predictions; reality-check ratio above zero so the attacker cannot fully replace ground truth with prediction.
Practical guidance
- Build a reactive agent first. The predictive layer is an add-on, not a foundation. Get the perceive-decide-act loop working with ground truth on every step. When it is solid, then ask whether prediction would lower latency or surface anomalies that the reactive loop cannot.
- Start with rolling statistics. A buffer of recent observations plus exponential smoothing or a one-layer model is enough for most production use cases. Build it, run it for a week, look at the surprise distribution. If the answers are useful, stay there. If they are not, climb to a foundation-model-as-predictor next.
- Predict one thing, well. Resist the urge to build a model that predicts state and observations and intent simultaneously. Pick one type; ship it; learn from it; add another later if the first one earned its keep.
- Treat the predictor as an agent in its own right. It has a profile, a fingerprint that changes when you re-train it, a reputation that rises and falls with surprise, and an audit log of what it predicted and what reality did. The kit's AgentProfile and verification modules apply to it as much as to any other agent.
- Set the reality-check ratio before you ship. Decide upfront how often the agent must ignore the predictor and read the world. Make it a deployment-time configuration, not a runtime decision the agent can override. The whole point is that the agent does not get to vote on whether it has gone schizophrenic.
- Surface the surprise to humans. Even when the agent acts autonomously, the running surprise of the predictor should be on a dashboard a human watches. It is the cheapest single number you can track for "is this system behaving normally," and it requires zero domain knowledge to interpret.
- Re-fit the predictor on a fixed cadence. The world drifts. A predictor fit three months ago is predicting a world that no longer exists. Schedule re-fits; record the surprise distribution before and after; alert if the new fit's held-out surprise is much higher than the old fit's was. The predictor has its own staleness, separate from the agent's.
- The blind agent is fine. If your agent works as a pure reactive loop, you do not have to add prediction. On the opening's hypothesis, the congenitally blind person never builds the predictive model that can fail; a reactive agent never confabulates a prediction. The cost of "blindness" is latency and lack of free anomaly detection; the benefit is structural simplicity. For some agents that is the right tradeoff. Build prediction when its specific benefits earn the specific failure mode it introduces.