Common questions, with follow-ups.
Short answer: when the task has (a) genuinely different sub-skills, (b) too much context for one window, or (c) needs adversarial cross-checking. Outside those, a single capable model + tools usually wins on cost, latency, and complexity.
Honest test: can you describe each agent's job in one sentence and that sentence isn't a euphemism for "and other stuff"? If not, you have one agent in a trench coat pretending to be five.
Three layers of safety: (1) hard iteration limit: a maximum number of turns. If the agents hit it, the workflow stops. (2) progress detection: if nothing has changed in the last N turns, stop or escalate. (3) budget tracking: limits on tokens, time, and tool calls per workflow.
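A minimal sketch of those three layers in Python, assuming a simple turn loop. The names and limits (MAX_TURNS, STALL_WINDOW, the budget numbers) are placeholders, and run_turn stands in for whatever function advances your agents by one step.

```python
import time

MAX_TURNS = 20            # (1) hard iteration limit
STALL_WINDOW = 3          # (2) halt if nothing has changed for N turns
BUDGET = {"tokens": 200_000, "seconds": 300, "tool_calls": 100}

def run_workflow(run_turn, state):
    """run_turn(state) -> (new_state, usage_dict) is your agent step."""
    spent = {"tokens": 0, "tool_calls": 0}
    start = time.monotonic()
    recent = []
    for _ in range(MAX_TURNS):                          # layer 1: hard cap on turns
        state, usage = run_turn(state)
        spent["tokens"] += usage.get("tokens", 0)
        spent["tool_calls"] += usage.get("tool_calls", 0)

        recent.append(hash(repr(state)))                 # layer 2: progress detection
        if len(recent) >= STALL_WINDOW and len(set(recent[-STALL_WINDOW:])) == 1:
            return state, "halted: no progress"

        if (spent["tokens"] > BUDGET["tokens"]           # layer 3: budget tracking
                or spent["tool_calls"] > BUDGET["tool_calls"]
                or time.monotonic() - start > BUDGET["seconds"]):
            return state, "halted: budget exceeded"
    return state, "halted: iteration limit"
```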
Keep one shared store of all the data, then give each agent only the slice it needs. The shared object has all the fields; each agent is handed a smaller view based on its role.
Customer support example: the full record has customer_personal_info, conversation_history, internal_notes, billing_data. The empathy agent sees only conversation_history. The billing agent sees billing_data plus the customer ID. The supervisor sees internal_notes. Personal info never reaches an agent that doesn't need it.
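A sketch of that slicing, assuming the record is a plain dictionary. The field names mirror the example above; the role-to-fields mapping and the sample values are illustrative.

```python
full_record = {
    "customer_personal_info": {"name": "A. Customer", "email": "a@example.com"},
    "conversation_history": ["Customer: my invoice doubled this month."],
    "internal_notes": ["Long-time account; last escalation resolved in their favor."],
    "billing_data": {"customer_id": "C-1042", "last_invoice": 129.00},
}

# Each role maps to the only fields it is allowed to see.
ROLE_VIEWS = {
    "empathy":    ["conversation_history"],
    "billing":    ["billing_data"],            # customer ID rides inside billing_data
    "supervisor": ["internal_notes"],
}

def view_for(role: str, record: dict) -> dict:
    """Hand an agent the slice its role needs; everything else never reaches it."""
    return {field: record[field] for field in ROLE_VIEWS[role]}

empathy_view = view_for("empathy", full_record)   # no personal info, no billing data
```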
{"need": "customer_billing", "reason": "..."}. The orchestrator decides whether to grant, log, or refuse. Now context creep is visible and you can put policy on it.Use multiple layers, in roughly this order of effectiveness:
(1) Strict output formats: even an injected agent can only fill in the fields you defined. This alone reduces damage a lot.
(2) Restrict tools per agent: an agent that only reads things has no write tools to abuse, so the injection has nowhere to go.
(3) Curated handoffs: when one agent passes work to the next, only specific fields move forward, not the full untrusted text.
(4) Cross-check critical actions: for important decisions, run a second agent and compare answers.
(5) Sandboxing: any code execution runs in an isolated environment with no credentials and no network.
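A sketch of layers (1) through (3) under simple assumptions: the agent names, field names, and tool names are invented for illustration, not taken from the companion code.

```python
import json

# (1) Strict output format: even an injected agent can only fill these fields.
EXPECTED_FIELDS = {"summary": str, "severity": str, "next_action": str}

def parse_output(raw: str) -> dict:
    data = json.loads(raw)
    out = {}
    for field, typ in EXPECTED_FIELDS.items():
        value = data[field]                        # missing field raises -> reject
        if not isinstance(value, typ):
            raise ValueError(f"unexpected type for {field}")
        out[field] = value
    return out                                     # extra injected fields are dropped

# (2) Restrict tools per agent: a read-only agent has no write tools to abuse.
TOOLS = {
    "researcher": {"search_docs", "read_ticket"},
    "resolver":   {"read_ticket", "update_ticket"},
}

def tool_allowed(agent: str, tool: str) -> bool:
    return tool in TOOLS.get(agent, set())

# (3) Curated handoff: only specific fields move forward, not the full untrusted text.
HANDOFF_FIELDS = ("summary", "severity")

def handoff(validated_output: dict) -> dict:
    return {k: validated_output[k] for k in HANDOFF_FIELDS}
```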
Disagreement is data. Three resolution paths:
Factual disagreement ("is this API rate-limited?") → ground in a tool. Run the call, read the docs, check the spec.
Judgment disagreement ("is this risky enough to escalate?") → debate pattern with a judge agent or a human.
Persistent disagreement after both → human review. Disagreement that survives evidence and debate signals you're outside the system's competence.
The wrong move: silently picking one or averaging. Both lose information and bury the issue for later.
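One way to wire the three paths together, sketched as a router that takes the resolution strategies as plain callables. The "kind" labels are an assumption for illustration, not a fixed taxonomy.

```python
def resolve(disagreement, ground_in_tool, run_debate, escalate_to_human):
    """Route a disagreement instead of silently picking one answer or averaging."""
    if disagreement["kind"] == "factual":
        # Ground in a tool: run the call, read the docs, check the spec.
        return ground_in_tool(disagreement)
    if disagreement["kind"] == "judgment":
        # Debate pattern with a judge agent; None means the judge couldn't decide.
        verdict = run_debate(disagreement)
        if verdict is not None:
            return verdict
    # Survived evidence and debate: outside the system's competence, hand to a human.
    return escalate_to_human(disagreement)
```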
Per-workflow token budget: observer-enforced. Alert at 80%, halt at 100%.
Right-size the model per role. Intake parser = small/fast model. Adjudicator = strongest model. Match capability to cost.
Cache aggressively for stable lookups (policy data, product info).
Compress between handoffs: pass summaries + structured artifacts, not full transcripts.
For reasoning-model calls specifically: batch related queries into a single shared-context call rather than firing them sequentially. Empirical work on DeepSeek-R1 and OpenAI-o1 across thirteen benchmarks reports roughly 76% fewer reasoning tokens at preserved or slightly improved accuracy (Srivastava et al., ICLR 2026). The mechanism is that shared context suppresses the recursive self-doubt loops ("wait, let me reconsider") that drive overthinking. This applies to chain-of-thought reasoning models; it's less relevant if your agents call non-reasoning models.
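A rough sketch of the budget thresholds, per-role model table, and compressed handoff described above. The numbers and model names are placeholders, and the truncation stands in for a real summarization step.

```python
TOKEN_BUDGET = 500_000          # per-workflow budget; the number is a placeholder

def budget_status(tokens_used: int) -> str:
    frac = tokens_used / TOKEN_BUDGET
    if frac >= 1.0:
        return "halt"           # halt at 100%
    if frac >= 0.8:
        return "alert"          # alert at 80%
    return "ok"

# Right-size the model per role (names are placeholders, not real model IDs).
MODEL_BY_ROLE = {
    "intake_parser": "small-fast-model",
    "adjudicator":   "strongest-model",
}

def compress_for_handoff(transcript: list[str], artifacts: dict) -> dict:
    """Pass a summary plus structured artifacts forward, not the full transcript."""
    summary = " ".join(transcript)[-500:]   # stand-in for a real summarization step
    return {"summary": summary, "artifacts": artifacts}
```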
Both, on purpose. Three levels: a per-agent scratchpad (used during one task, then discarded), workflow memory (shared for the duration of one run, then cleared), and long-term memory (kept across many runs).
Common mistake to avoid: letting agents write to long-term memory whenever they want. One bad write becomes a "fact" that future agents trust forever. Treat long-term memory writes the way you'd treat database migrations: versioned, reviewed, possible to roll back.
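A sketch of the three tiers with the write restriction made explicit. The structure is illustrative; the proposal queue stands in for whatever versioned, reviewed process you actually use to approve long-term writes.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    scratchpads: dict = field(default_factory=dict)       # per-agent, discarded after the task
    workflow: dict = field(default_factory=dict)           # shared for one run, then cleared
    long_term_proposals: list = field(default_factory=list)

    def propose_long_term_write(self, agent: str, key: str, value, evidence: str):
        """Agents never write long-term memory directly: they propose, and a
        versioned, reviewed, reversible process (like a database migration)
        decides what actually gets persisted."""
        self.long_term_proposals.append(
            {"agent": agent, "key": key, "value": value, "evidence": evidence}
        )
```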
Three levels of checking. Per-agent tests: small sets of inputs with the outputs you'd consider correct, run for each agent. End-to-end tests: success rate, iterations, cost, and time for the whole workflow. Production monitoring: log every workflow, sample some for human review, watch for behavior changes after model upgrades.
The trap: tuning each agent's individual numbers until they look great while the overall workflow success rate drops. Each agent doing its part well doesn't mean the team wins. Always have one whole-system metric you trust above all the others.
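A minimal sketch of the first two levels, assuming each per-agent case is an (input, expected output) pair and each workflow result reports success, iterations, and cost. The names are illustrative; the single number to trust above the rest is the end-to-end success rate.

```python
def agent_pass_rate(agent_fn, cases):
    """cases: (input, expected_output) pairs for one agent."""
    passed = sum(1 for x, expected in cases if agent_fn(x) == expected)
    return passed / len(cases)

def end_to_end_metrics(workflow_fn, tasks):
    """Whole-workflow numbers; success_rate is the one metric to trust above the rest."""
    results = [workflow_fn(t) for t in tasks]
    n = len(results)
    return {
        "success_rate":   sum(r["success"] for r in results) / n,
        "avg_iterations": sum(r["iterations"] for r in results) / n,
        "avg_cost":       sum(r["cost"] for r in results) / n,
    }
```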
The same reason TCP/IP, HTTP, and TLS aren't patented: at the protocol layer, network effects beat patent rents by a wide margin. A patented messaging protocol that nobody adopts is worth zero. An open one that the whole industry implements becomes the substrate everyone builds on, and the money flows one layer up.
The right analogy isn't a patentable invention, it's payments. ACH as a protocol is free and boring; Cash App, Venmo, and Zelle all sit on top of it and make money on flow, fees, network effects, and adjacent services (debit cards, savings, lending). The protocol enables the market; the participants on it are where value accrues. A2A is shaping up the same way. Expect to see money in: identity and reputation registries (who is this agent, can I trust them), payment rails for agent-to-agent transactions (who handles the actual money movement), marketplaces of specialized agents (where do I find an agent that does X), auditing and compliance services (was this agent allowed to do that), and infrastructure for running agents reliably. None of these require owning the protocol.
That a syndicate of model providers published it openly is itself a signal that they expect the value to be in the upper layers (their models, their hosting, their tooling) and that fragmenting the protocol layer into competing standards would shrink the market for everyone.
Architecturally, no, and this is worth understanding clearly. The wrappers we discuss (input filters, output validators, tool gates, capability tokens, audit logs) all run on your infrastructure, not the model provider's. The LLM receives a prompt, returns a reply, and that's the whole interaction it's aware of. Everything that happens before the prompt goes in and after the reply comes out is invisible to the provider. There's no API surface they could lock down to prevent this without breaking their product for every legitimate user.
The deeper insurance against gatekeeping is that the constraint stack is model-agnostic. The same wrapper code that runs against a hosted Claude or GPT also runs against open-weight models like Llama, Mistral, Qwen, or DeepSeek, where there's no provider in the loop at all. If a hosted provider made wrapping their model harder, the rational response is to switch to an open model that performs nearly as well. This dynamic constrains how much friction a provider can introduce.
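A sketch of why the constraint stack is model-agnostic: it is ordinary code around a backend callable, so a hosted API and a local open-weight model are interchangeable. The function and argument names are invented for illustration.

```python
def call_with_constraints(backend, prompt, *, input_filter, output_validator, audit_log):
    """The whole constraint stack runs on your side; the backend only ever sees
    prompt in, reply out, whether it's a hosted API or a local open-weight model."""
    safe_prompt = input_filter(prompt)       # e.g. strip secrets, flag disallowed requests
    reply = backend(safe_prompt)             # swap backends freely; the wrapper doesn't care
    audit_log.append({"prompt": safe_prompt, "reply": reply})
    return output_validator(reply)           # enforce the output contract before anything downstream
```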
The "verified agent" framing in the manual is more like ISO certifications or SOC 2 audits than like App Store review: a market signal that auditors, insurers, and counterparties will start asking for, not a permission a single company grants. The closer parallel isn't Apple's App Store, it's TLS, where the protocol is open and the verification ecosystem grew around it without anyone owning it.
This manual is a manual. It's a focused attempt to write down what's actually working in agent systems in 2025-2026, in plain language, with the math where the math matters and the citations where claims need backing. The companion code is companion code: small, runnable, learning-oriented. Whether cenpie eventually builds something on top of these ideas is a separate question from whether the ideas are useful, and we're trying to keep the manual honest about the second question regardless of how the first one resolves.
That said, the patterns you'd expect to show up in any serious agent-economy infrastructure (capability tokens, reputation tracking, layered guardrails, low-code composition, evaluation harnesses) are the same patterns the manual describes, because those are the patterns the field is converging on. If we (or anyone) build something in this space, the bones will look like what's in here. That's not a coincidence; it's why the manual exists.