An agent system is mostly routing, state, and rules.
Once you start building real agent systems, the same handful of pieces show up in every project: who the agent is, where requests get routed, where state lives, how messages flow, what the agent remembers, what tools it can use, and how you watch it work. The way these pieces are carved up here is informed by surveys including Architectures & Taxonomies (arXiv 2026) and Prompt-Response to Goal-Directed (arXiv 2026), plus the design choices in AutoGen (Wu et al. 2023) and MetaGPT (Hong et al., ICLR 2024). There's no agreed-upon standard the way networking has the OSI model, so the seven layers below are one working model that has been useful in practice. The value isn't the specific layers; it's having names for things. Once each piece has a name, problems become easier to find. Without names, every bug feels like a mystery.
A practical seven-layer way to think about it
- L1 Identity: who each agent is and what it's allowed to do.
- L2 Router: which agent handles a given task.
- L3 State: the canonical record.
- L4 Bus: how messages move between agents.
- L5 Memory: what agents read and write across turns.
- L6 Tool Gateway: the audited boundary to the outside world.
- L7 Observer: what watches all of it.
Guardrails (the right-hand ribbon) are checkpoints between layers, not a layer of their own. When one layer fails, the observer sees it and the router picks the next move; the system doesn't crash.
The smallest version that's still real
Here's the simplest honest version of an orchestrator. The big frameworks, from LangGraph and CrewAI to AutoGen (Wu et al. 2023) and OpenAI Swarm, are all richer versions of this same idea.
```python
from dataclasses import dataclass, field
import uuid, time


@dataclass
class Message:
    sender: str
    recipient: str
    kind: str          # 'task' | 'result' | 'error' | 'vote'
    payload: dict
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    ts: float = field(default_factory=time.time)


@dataclass
class WorkflowState:
    """L3: single source of truth."""
    goal: str
    decisions: list = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)
    history: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    iteration: int = 0
    max_iterations: int = 25
    token_budget: int = 100_000
    tokens_used: int = 0


class Agent:
    """L1: identity + scoped capability."""

    def __init__(self, name, role, tools, llm):
        self.name = name
        self.role = role
        self.tools = {t.__name__: t for t in tools}   # allow-list
        self.llm = llm

    def handle(self, msg, state):
        ctx = self._project_context(msg, state)        # curated, not full
        decision = self.llm(ctx)
        # Schema matches the tutorial chapter: {"action": "call_tool" | "final_answer", ...}
        if decision["action"] == "call_tool":
            if decision["tool"] not in self.tools:
                raise PermissionError(f"{self.name} denied: {decision['tool']}")
            result = self.tools[decision["tool"]](**decision["args"])
            return Message(self.name, msg.sender, "result", {"data": result})
        # action == "final_answer"
        return Message(self.name, msg.sender, "result", decision["payload"])

    def _project_context(self, msg, state):
        # Each role declares which artifact keys it cares about (role isolation)
        keymap = {"researcher": ["sources"],
                  "writer": ["outline", "sources"],
                  "reviewer": ["draft"]}
        keys = keymap.get(self.role, [])
        return {
            "role": self.role,
            "goal": state.goal,
            "task": msg.payload,
            "artifacts": {k: state.artifacts.get(k) for k in keys},
            "tools": list(self.tools.keys()),
        }


class Orchestrator:
    """L2 router + L7 observer wired together."""

    def __init__(self, agents, router, observer=None):
        self.agents = agents
        self.router = router       # state -> next agent name | 'done'
        self.observer = observer

    def run(self, goal):
        state = WorkflowState(goal=goal)
        while state.iteration < state.max_iterations:
            state.iteration += 1
            # Token-budget halt (the field is enforced, not decorative)
            if state.tokens_used >= state.token_budget:
                state.errors.append({"reason": "token_budget_exhausted"})
                break
            next_agent = self.router(state)
            if next_agent == "done":
                break
            agent = self.agents[next_agent]
            msg = Message("orchestrator", next_agent, "task",
                          {"instruction": self._task_for(next_agent, state)})
            try:
                reply = agent.handle(msg, state)
                self._merge(state, next_agent, reply)
            except Exception as e:
                state.errors.append({"agent": next_agent, "error": str(e)})
            if self.observer:
                self.observer(state)   # trace · alert · halt
        return state

    def _task_for(self, agent_name, state):
        # Build the natural-language instruction for the next agent.
        # Real implementations templatize per-role; this is the simplest version.
        last_step = state.history[-1] if state.history else None
        return {
            "goal": state.goal,
            "prior_step": last_step,
            "iteration": state.iteration,
        }

    def _merge(self, state, agent_name, reply):
        # Persist the agent's reply into shared state.
        state.history.append({"agent": agent_name, "payload": reply.payload})
        if isinstance(reply.payload, dict):
            if "artifact_key" in reply.payload:
                state.artifacts[reply.payload["artifact_key"]] = reply.payload.get("data")
            # Account for token usage if the agent reports it
            state.tokens_used += int(reply.payload.get("tokens_used", 0))
```
The idea of separating "decide what to do" from "actually do it" comes from two well-known papers: ReAct (Yao et al., ICLR 2023) and Toolformer (Schick et al., NeurIPS 2023). Four habits to keep:
- Each agent gets its own allow-list of tools, not access to everything.
- Each agent sees only the context it actually needs, not the whole history.
- Errors are normal events, not crashes.
- The observer is a separate piece, not mixed into the agents.
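Here is a minimal sketch of wiring the pieces above into a run. The stub_llm, search_web, router, and observer below are illustrative placeholders, not part of any framework; a real deployment puts a model call behind stub_llm and real routing logic behind router.

```python
# Hypothetical wiring of the classes above; every name here is a placeholder.
def search_web(query):
    # A trivial "tool" standing in for a real integration.
    return [{"title": "example", "url": "https://example.com", "query": query}]

def stub_llm(ctx):
    # A real system calls a model here; this stub always finishes the task.
    return {"action": "final_answer",
            "payload": {"artifact_key": "sources",
                        "data": ctx["task"],
                        "tokens_used": 120}}

def router(state):
    # L2: a fixed pipeline -- researcher first, then stop.
    return "researcher" if state.iteration <= 1 else "done"

def observer(state):
    # L7: the observer only watches; halting stays the orchestrator's job.
    print(f"iter={state.iteration} errors={len(state.errors)} tokens={state.tokens_used}")

agents = {"researcher": Agent("researcher", "researcher", [search_web], stub_llm)}
final_state = Orchestrator(agents, router, observer).run("Collect sources on topic X")
print(final_state.artifacts.keys())   # dict_keys(['sources'])
```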
Deterministic or stochastic? The choice shapes everything downstream
The loop runs a model. Models are not pure functions. Given the same prompt, the same model can produce different outputs on two calls, and on most setups it does. This single property of the underlying engine reaches into every later chapter, so it is worth being clear about what it means before composition makes the question harder to think about.
The step that changes is decide. Perceive reads from a tool or a state store; that is deterministic up to whatever the source returns. Act calls a tool; the tool itself may be deterministic or not, but the call is intentional. Decide is the only place stochasticity enters the loop, because decide is where the model gets invoked, and the model samples from a probability distribution rather than always picking the single most likely token.
Operators have three knobs that control how stochastic decide actually is.
- Temperature. A scalar that controls how peaked the sampling distribution is. Temperature zero means "pick the highest-probability token every time," which makes the model effectively deterministic given a fixed prompt. Higher temperatures spread probability across more tokens, and the model produces different outputs from run to run.
- Top-p (or top-k) sampling. Cuts off the long tail of the distribution. Even at non-zero temperature, top-p of 0.9 means the model only ever samples from the tokens that together cover 90% of the probability mass. Reduces variance without eliminating it.
- Structured output mode. Constrained decoding that forces the model's output to match a schema. Reduces variance dramatically for the parts of the output that the schema constrains. The structured fields are deterministic in shape; the values inside them still vary.
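To make the knobs concrete, here is roughly what they look like at the call site. The complete function below is a placeholder, not a real provider client; parameter names differ across vendors, but they map onto the same three controls.

```python
# Hypothetical call-site sketch: `complete`, the prompts, and TOOL_CALL_SCHEMA
# are placeholders, not a real provider API.
TOOL_CALL_SCHEMA = {"type": "object",
                    "properties": {"tool": {"type": "string"},
                                   "args": {"type": "object"}}}

def complete(prompt, *, temperature, top_p=1.0, response_schema=None):
    """Stand-in for a model call; exists only to show where the knobs live."""
    return {"prompt": prompt, "temperature": temperature,
            "top_p": top_p, "schema": response_schema}

# Temperature zero + schema: effectively deterministic for a fixed prompt
# and model version, and constrained in shape.
decision = complete("Should we call issue_refund for this case?",
                    temperature=0.0, response_schema=TOOL_CALL_SCHEMA)

# Modest temperature plus top-p 0.9: sample, but only from the tokens that
# cover 90% of the probability mass, which bounds variance without removing it.
reply = complete("Draft a reply to the customer.", temperature=0.5, top_p=0.9)
```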
Most production systems use a fourth pattern that the three knobs above do not name: mixed determinism. Different decisions in the same loop run with different settings.
- Tool calls run at temperature zero. When the model decides whether to call issue_refund with these arguments, you want the same answer every time given the same context.
- Goal restatement (the confirmation step from chapter 03) runs at temperature zero. You want the agent's restatement to be stable so a similarity check has something stable to compare against.
- Prose generation runs at modest temperature (often 0.3 to 0.7). A reply to a customer should not be word-for-word identical across sessions; that reads as canned and triggers fatigue in human reviewers.
- Brainstorming and exploration run at higher temperature. When the model is generating candidates that another step will filter, variety is what you want.
Mixed determinism is implemented as multiple model calls in the same loop iteration, each with different decoding settings, with the outputs composed by orchestration code. It is not a setting on a single call; it is a discipline about which decisions get which settings.
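A sketch of one loop iteration under mixed determinism, assuming a placeholder call_model function; the prompts, settings, and three-way split are illustrative, not a prescribed policy.

```python
# Mixed determinism inside one loop iteration. call_model is a stand-in for
# whatever client the stack uses; orchestration code composes the outputs.
def call_model(prompt, *, temperature, schema=None):
    """Placeholder for a real model call with per-call decoding settings."""
    return {"prompt": prompt, "temperature": temperature, "schema": schema}

def run_one_iteration(goal, context):
    # 1. Tool-call decision: temperature zero + schema, so the same context
    #    yields the same action every time.
    action = call_model(f"Decide the next tool call for: {goal}",
                        temperature=0.0,
                        schema={"action": "string", "args": "object"})

    # 2. Customer-facing prose: modest temperature, so replies vary in wording
    #    without drifting in substance.
    prose = call_model(f"Write the reply for: {context}", temperature=0.5)

    # 3. Candidate generation: higher temperature, because a later step filters.
    candidates = [call_model(f"Propose an alternative approach to: {goal}",
                             temperature=0.9) for _ in range(3)]

    # Orchestration code, not the model, composes the three outputs.
    return {"action": action, "reply": prose, "candidates": candidates}
```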
What changes downstream when the loop is stochastic
Six places where this matters, with the implication for each.
| Subsystem | What changes | What you do about it |
|---|---|---|
| Goal confirmation (ch 03) | The agent's restatement of the goal is not deterministic, so hash equality is brittle | Run confirmation at temperature zero, or use semantic similarity above a threshold rather than hash equality |
| Reputation math (ch 12) | The same agent on the same task can produce different outcomes; the Beta counter measures the agent's output distribution, not a single deterministic ability | Treat reputation as a distribution estimate. Set thresholds aware of the agent's variance, not just its mean. The credible-lower-bound rule already does this; it works correctly here, but you should be intentional about the variance |
| Caching | You cannot cache stochastic outputs by input alone. Two calls with identical input may legitimately produce different outputs | Cache only at the deterministic boundary (temperature zero tool-call decisions, structured output schemas). Cache the prose only if your operators have agreed that one canned reply is acceptable |
| Reproducing failures | "Run the same input again" does not reproduce a failure. The audit log has to record the actual output, not the intent to produce one | Log the full model response on every call, not just the parsed action. The audit log from chapter 12 already does this if implemented correctly. If yours does not, fix that first |
| Evaluation harness (ch 17) | Pass rates are distributions, not numbers. "95% accuracy" with a stochastic agent and "95% accuracy" with a deterministic agent are different claims | Run each evaluation case multiple times (5 to 20 typically). Report mean and confidence interval, not a single number. Block merges on the lower bound dropping, not on the mean |
| Adversarial robustness (ch 20) | "The agent always refuses this prompt" cannot be proven from a single test. A 1-in-50 refusal failure looks like full compliance over five tests | Adversarial test suites run each case at least 50 times and flag any failure, not just majority-rule. Set the bar for refusal at "never failed in 50 runs" rather than "passed once" |
The pattern across the table is that every place you treated the agent's output as a single value, you now have to treat it as a sample from a distribution. The math gets a little more careful; the practice gets a lot more honest about what the system is actually doing.
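For the evaluation-harness row, a minimal sketch of "report the interval, gate on the lower bound." The run_case callable, the 20-run default, and the normal-approximation interval are assumptions for illustration; a Wilson interval is a common upgrade when n is small or the pass rate is near 0 or 1.

```python
import math

def evaluate_case(run_case, n=20, z=1.96):
    """Run one evaluation case n times and summarize the pass rate as a
    distribution, not a single number. run_case() -> bool is supplied by the
    caller; z=1.96 gives an approximate 95% interval."""
    passes = sum(1 for _ in range(n) if run_case())
    p = passes / n
    half_width = z * math.sqrt(p * (1 - p) / n)   # normal approximation
    return {"mean": p,
            "lower_bound": max(0.0, p - half_width),
            "upper_bound": min(1.0, p + half_width)}

def gate(summary, bar=0.90):
    # Block the merge when the lower bound drops below the bar,
    # even if the mean still looks fine.
    return summary["lower_bound"] >= bar
```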
Common mistakes
- Hash-equality on stochastic output. Comparing two model outputs for exact byte-equality fails as soon as temperature is non-zero. Use semantic similarity above a threshold for prose, exact match only for outputs that ran at temperature zero or through a constrained schema.
- Single-run regression tests. A test that passes once with a stochastic agent has not really passed. Run it n times; require it to pass on every run, or on at least k of n when you have explicitly decided that the occasional failure is acceptable.
- Caching prose by input. Even if the prose looks deterministic for the first hundred runs, it is not. Cache decisions, schemas, and tool calls; do not cache prose unless you genuinely want every user to see the exact same words.
- Treating temperature zero as fully deterministic. Temperature zero is deterministic given a fixed prompt and a fixed model version. A model upgrade can change behavior under temperature zero too. Pin the model version where this matters, and treat any model version change as a regression risk.
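A sketch of the comparison rule behind the first two mistakes: exact match only where the output was produced deterministically, a similarity threshold for prose. The embed callable and the 0.9 threshold are placeholders; any sentence-embedding model and a tuned threshold go there in practice.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def outputs_match(expected, actual, *, deterministic, embed=None, threshold=0.9):
    """Exact comparison only where the output was produced deterministically
    (temperature zero or schema-constrained); semantic similarity otherwise.
    `embed` maps text -> vector and is supplied by the caller."""
    if deterministic:
        return expected == actual
    return cosine(embed(expected), embed(actual)) >= threshold
```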
The AgentProfile from the kit's agent_profile/ module records this choice as a DecodingPolicy field, with separate settings for tool-call decisions, confirmation steps, and prose. Operators set the policy at deploy time; the audit log records which policy was active when each action was taken. When something goes wrong, the policy is the first thing to check after the model version.
Composing two agents creates a new system
Two agents that pass their own tests don't automatically pass the test of being chained together. This trips up almost every team the first time. Three things go wrong predictably:
- Reliability multiplies, not adds. A 95% reliable agent feeding a 95% reliable agent gives you 90.25% end to end, not 95%. Stack three of them and you're under 86% (the product is sketched after this list). Most teams don't feel this until production traffic shows them.
- The handoff breaks the test set. Agent A's outputs become Agent B's inputs. The shape of A's actual outputs is almost always wider than the shape of inputs B was tested on. B did fine on the test set; the test set didn't include "what A actually produces."
- Attackers turn one agent into a tool against the next. An input A correctly rejects can be subtly reshaped by an upstream agent into something B accepts. Each agent's safety boundary is not the system's safety boundary. The system has its own boundary that has to be tested as one piece.
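A two-line check of the compounding, assuming stages fail independently; real failures often correlate, which makes the product an optimistic bound rather than a guarantee.

```python
from math import prod

# End-to-end reliability of a chain, assuming independent failures per stage.
def chain_reliability(stage_reliabilities):
    return prod(stage_reliabilities)

print(chain_reliability([0.95, 0.95]))        # 0.9025
print(chain_reliability([0.95, 0.95, 0.95]))  # ~0.857
```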
The rule: every composed system is a new system. Test it as one. "It's just A plus B plus a connector" is wrong in the same way "two single-process programs glued together is just a single-process program twice" was wrong twenty years ago.
The loop is correct as far as it goes, but it leaves three questions unanswered: where does the work come from in the first place, how does the agent know its world, and where does the agent really live as a software entity? Chapter 03 (Where the work comes from) is the honest answer to all three. Read it before chapter 04 (Protocols), because protocols only make sense once you know what is being routed and who is asking.