Real agents live in different environments. Each has its own rules.
Most tutorials end with "the agent works on my laptop". This chapter is about everything that happens after that: how agents move through dev, test, staging, and production; what happens when the infrastructure underneath fails; how to coordinate agents across regions; and what changes when the traffic grows by 1000x.
The four environments
A real agent system runs in at least three environments, sometimes four. Each one has different goals, different data, and different things at stake. Mixing them up is one of the fastest ways to ship a production outage.
What changes between environments
| What | Dev | Test | Staging | Prod |
|---|---|---|---|---|
| Model | Cheap, fast | Pinned version, sometimes mocked | Same as prod | Pinned with a fallback ready |
| Tools | Stubs are fine | Mocked, repeatable | Real, in a sandbox | Real, behind safety checks |
| Data | Made up | Fixed test data | Anonymized real-shape data | Real data |
| Safety checks | Optional | Same as prod | Same as prod | All on; fail closed |
| Human approval steps | Skipped | Mocked | Mocked | Real humans |
| Token budget | Loose | Tight, enforced | Production limits | Production limits |
| How fast you can roll back | N/A | N/A | Minutes (re-deploy) | Seconds (feature flag) |
How to roll out agent changes safely
Standard service deployment patterns (blue-green, canary, shadow/dark launch, feature flags) all apply, with one twist: agent outputs aren't fully predictable. The same input on a new model version might produce slightly different outputs. Your rollout plan needs to account for that.
Blue-green: run two full copies of production, "blue" (the current version) and "green" (the new version). Deploy to green, run quick checks, then flip the load balancer to point at green. If green misbehaves, flip back to blue in seconds.
For agents, the gotcha is shared state. If both versions read the same workflow state, the new version might choke on data the old version wrote. Two fixes: tag each piece of stored data with a schema version, or run separate state stores during the transition window.
```yaml
# Kubernetes Service definition. Flip the version selector to switch traffic.
apiVersion: v1
kind: Service
metadata:
  name: agent-orchestrator
spec:
  selector:
    app: agent-orchestrator
    version: blue  # change to "green" to cut over
  ports:
    - port: 80
      targetPort: 8080
```
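The schema-version fix for the shared-state gotcha can be as simple as stamping every stored record with the version that wrote it, and refusing (or migrating) on mismatch. A minimal sketch; the field name, the `MIGRATIONS` table, and the error type are assumptions:

```python
CURRENT_SCHEMA_VERSION = 7  # the version this deployment writes

def load_workflow_state(raw: dict) -> dict:
    """Read state that may have been written by the other color."""
    version = raw.get("schema_version", 1)
    if version > CURRENT_SCHEMA_VERSION:
        # Written by a newer deployment (e.g. green while we are still blue):
        # don't guess what the new fields mean, hand the workflow back.
        raise IncompatibleStateError(version)
    while version < CURRENT_SCHEMA_VERSION:
        raw = MIGRATIONS[version](raw)  # hypothetical per-version upgrade functions
        version += 1
    return raw

def save_workflow_state(state: dict) -> dict:
    return {**state, "schema_version": CURRENT_SCHEMA_VERSION}
```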
Canary: start with the new version handling 5% of traffic. Watch your key metrics. If everything looks good, ramp to 25%, then 100%. The point is to limit how many users get hit if something is wrong.
For agents, watch output quality too, not just uptime and errors. Response time and error rate can look fine while the actual answers get worse. Run a held-out test set against both versions and alert if quality starts to diverge.
```python
import hashlib

def route_to_version(user_id: str, canary_pct: float) -> str:
    """Use a stable hash so the same user always hits the same version."""
    h = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(h[:8], 16) / 0xFFFFFFFF
    return "canary" if bucket < canary_pct else "stable"

# Increase gradually as you gain confidence
# Day 1: canary_pct = 0.05 (5%)
# Day 3: canary_pct = 0.25 (if metrics still look good)
# Day 7: canary_pct = 1.0 (full rollout)
```
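To act on the "watch output quality too" point, a scheduled job can score a held-out evaluation set against both versions and alert when they diverge. A rough sketch; `run_agent`, `score_answer`, and the alerting threshold are assumptions:

```python
def quality_gap(eval_set, run_agent, score_answer) -> float:
    """Mean quality of the canary minus mean quality of stable on a held-out set."""
    stable = [score_answer(ex, run_agent("stable", ex)) for ex in eval_set]
    canary = [score_answer(ex, run_agent("canary", ex)) for ex in eval_set]
    return sum(canary) / len(canary) - sum(stable) / len(stable)

# Example wiring: hold the rollout if the canary scores noticeably worse.
# if quality_gap(EVAL_SET, run_agent, score_answer) < -0.05:
#     alert("canary quality regression - pause the ramp")
```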
Shadow (dark) launch: run the new agent version alongside the current one, but only the current version's output goes to the user. Log both. Compare them offline to find quality problems before any user sees the new version.
This is especially useful when switching the underlying LLM (e.g., from GPT-4 to Claude). Run shadow mode for two weeks and you'll have thousands of real-world comparisons to look at. Costs roughly double during the window, but it catches silent regressions you'd otherwise discover from user complaints.
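A shadow-mode handler can be small: serve the stable answer, run the candidate in the background, and log both for offline comparison. A sketch under assumed names (`run_stable`, `run_candidate`, `log_shadow`, `request.id`):

```python
import asyncio

async def handle_request(request) -> str:
    # Start the candidate in the background; the user never waits on it.
    shadow_task = asyncio.create_task(run_candidate(request))
    answer = await run_stable(request)  # only the stable answer is returned
    asyncio.create_task(record_shadow(request, answer, shadow_task))
    return answer

async def record_shadow(request, stable_answer, shadow_task) -> None:
    try:
        shadow_answer = await shadow_task
    except Exception as exc:
        # A failing candidate must never affect the user-facing path.
        shadow_answer = f"<error: {exc}>"
    log_shadow(request_id=request.id, stable=stable_answer, shadow=shadow_answer)
```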
Feature flags: for agent systems, flags aren't just "show this button to that user". They control which version of the prompt, which set of tools, which safety checks, which model. Every meaningful agent change ships behind a flag, and the flag history doubles as an audit log of what was running for whom, and when.
```python
def get_agent_config(user_id: str, tenant: str) -> AgentConfig:
    flags = feature_flags.evaluate(user_id, tenant)
    return AgentConfig(
        prompt_version=flags.get("agent_prompt", "v3"),
        model=flags.get("agent_model", "gpt-4o-2024-08"),
        tools_enabled=flags.get("tools", ["search", "calc"]),
        max_iterations=flags.get("max_iter", 10),
        guardrails=flags.get("guardrails", ["intent", "output", "approval"]),
    )
```
Coordination problems when you scale up
Once you're running more than one instance of your agent system, a different category of problems shows up. These aren't about the agents disagreeing (we cover that in the adversarial chapter). They're about the underlying machinery: who's in charge, who has the latest state, what happens when networks fail.
Problem 1: who's the orchestrator right now?
- You run 3 copies of the orchestrator for reliability. But only one of them should be making decisions for a given workflow at a time. Otherwise you get duplicated work and race conditions.
- What works: elect one instance as the leader and have it make all the decisions. The others stay in sync but wait. If the leader dies, hold a quick election to pick a new one. Raft is a popular algorithm for this; a minimal lease-based sketch follows this list.
- Tools that do this for you: etcd, ZooKeeper, Consul, or your cloud provider's coordination service (DynamoDB conditional writes, Spanner, etc).
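A lease-based sketch of the same idea, using Redis as the shared store (an assumption; etcd leases or ZooKeeper ephemeral nodes play the same role). Whoever holds the lease acts as the orchestrator; if it stops renewing, the lease expires and another instance can take over:

```python
import uuid
import redis

r = redis.Redis()
INSTANCE_ID = str(uuid.uuid4())
LEASE_KEY = "orchestrator:leader"
LEASE_TTL_MS = 10_000  # the leader must renew well before this expires

def try_acquire_leadership() -> bool:
    # SET ... NX succeeds only if nobody currently holds the lease.
    return bool(r.set(LEASE_KEY, INSTANCE_ID, nx=True, px=LEASE_TTL_MS))

def renew_leadership() -> bool:
    # Renew only if we are still the recorded leader. (In production the
    # check-and-renew should be a single Lua script so it stays atomic.)
    if r.get(LEASE_KEY) == INSTANCE_ID.encode():
        return bool(r.set(LEASE_KEY, INSTANCE_ID, xx=True, px=LEASE_TTL_MS))
    return False
```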
Problem 2: two workflows updating the same data
- Two workflows both want to update the same record (a customer profile, a shared document). Without some kind of coordination, the second write silently overwrites the first and you lose data.
- What works: distributed locks (Redis Redlock is common), or version numbers with "compare-and-swap" updates. Compare-and-swap means: "update this record only if its version is still the one I read; otherwise tell me and I'll re-read and retry".
- Trade-off: locks are simpler but slower, and you have to handle "what if the lock holder crashes?" Compare-and-swap is faster on average but pushes retry logic into your code.
```python
# Compare-and-swap update for workflow state
def update_state_with_retry(workflow_id: str, mutator, max_retries=5):
    for attempt in range(max_retries):
        state, version = state_store.get_with_version(workflow_id)
        new_state = mutator(state)
        ok = state_store.compare_and_swap(workflow_id, version, new_state)
        if ok:
            return new_state
        # Someone else updated first; reload and try again
    raise ConflictError(f"could not update {workflow_id} after {max_retries} tries")
```
Problem 3: messages getting duplicated or lost
- Your agents talk to each other through a message system (Kafka, RabbitMQ, SQS). When the network has a hiccup, some messages get sent twice; rarely, some get lost. State across agents starts to drift.
- What works: make every message handler safe to run twice (idempotent). Use whatever "exactly-once" features your message system offers. Add background reconciliation jobs that periodically compare states and fix drift.
- The pattern: give every event a unique ID. Each handler checks "have I already processed this ID?" If yes, skip. If no, process and remember. Two deliveries of the same event become safe (see the sketch after this list).
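A minimal version of that pattern, assuming Redis holds the set of processed IDs (a database table with a unique constraint on the event ID works just as well); `apply_event` is a hypothetical helper:

```python
import redis

r = redis.Redis()

def handle_event(event: dict) -> None:
    # SADD returns 0 if the ID was already in the set, i.e. a duplicate delivery.
    if r.sadd("events:processed", event["id"]) == 0:
        return  # already handled; a second delivery is a no-op
    apply_event(event)  # hypothetical: the actual side effect

# Note: if apply_event can fail after the ID is recorded, record the ID in the
# same transaction as the side effect so a crash doesn't silently drop the event.
```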
Problem 4: hitting your LLM provider's rate limits
- Your LLM provider limits how many requests you can send per minute. At peak load, requests start getting queued or rejected. If all your instances retry at the same moment, you make the recovery worse.
- What works: a shared rate limiter (a Redis-backed token bucket works) so all your instances coordinate. Exponential backoff with random jitter so retries don't pile up; a sketch follows the router code below. A circuit breaker that gives up quickly when the provider is fully down.
- Be defensive: have a fallback provider. If GPT is rate-limited, fall back to Claude, or to a smaller local model. The quality might be a little lower, but the system stays up.
```python
from dataclasses import dataclass

@dataclass
class ModelProvider:
    name: str
    weight: float  # preference; primary gets 1.0
    healthy: bool = True
    consecutive_errors: int = 0

class ModelRouter:
    def __init__(self, providers: list[ModelProvider]):
        self.providers = providers
        self.error_threshold = 5
        self.cooldown_s = 30

    async def call(self, prompt: str) -> str:
        for provider in sorted(self.providers, key=lambda p: -p.weight):
            if not provider.healthy:
                continue
            try:
                result = await call_provider(provider.name, prompt)
                provider.consecutive_errors = 0
                return result
            except RateLimitError:
                provider.consecutive_errors += 1
                if provider.consecutive_errors >= self.error_threshold:
                    provider.healthy = False
                    schedule_recovery(provider, self.cooldown_s)
                continue  # try the next provider
        raise AllProvidersDownError()
```
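The router above covers the "provider is down" case. The "provider is briefly rate-limiting" case is better handled by retrying with exponential backoff and random jitter, so instances don't all retry in lockstep. A small sketch (the `RateLimitError` type is the same one used above; the 30-second cap is an assumption):

```python
import asyncio
import random

async def call_with_backoff(call, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to an exponentially growing cap.
            await asyncio.sleep(random.uniform(0, min(30, 2 ** attempt)))
```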
Multi-region deployment
Once you serve users globally, latency forces you to deploy in multiple regions. This brings new problems: EU user data shouldn't be processed outside the EU (compliance), state stores become geo-distributed (consistency vs. latency), and a regional outage shouldn't take the whole system down.
- Active-active: each region serves its own users from local infra. State is replicated, but writes prefer local. Best latency, complex consistency story.
- Active-passive: the primary region serves all traffic; a secondary takes over on failover. Simpler, but users far from the primary always pay cross-region latency, and failover itself takes time.
- Region-pinned data: GDPR and similar regulations require that EU user data stays in the EU. Your routing layer must know each user's region and pin their workflows there (see the sketch after this list).
- Cross-region consensus: if you need globally-consistent state (rare for agents), use Spanner-class systems with TrueTime or accept that consistency operations cost 100ms+.
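A minimal region-pinning sketch; the residency field, region names, and error type are assumptions. The key property is that a user's workflows only ever run in regions allowed for their data:

```python
ALLOWED_REGIONS = {
    "eu": ["eu-west-1", "eu-central-1"],  # EU data never leaves these
    "us": ["us-east-1", "us-west-2"],
}

def pick_region(data_residency: str, healthy_regions: set[str]) -> str:
    for region in ALLOWED_REGIONS[data_residency]:
        if region in healthy_regions:
            return region
    # Better to fail than to fail over into a non-compliant region.
    raise NoCompliantRegionError(data_residency)  # hypothetical error type
```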
Real-world case studies
Five short studies from different industries, each showing the deployment and infra choices that mattered most. These are composites drawn from public reports and common patterns; specifics adjusted for clarity.
Deployment shape: single region (engineers were US-only), Kubernetes pods auto-scaling on queue depth, shared Redis for rate-limiting against the LLM provider, feature flags per team to roll out new prompts.
Critical decision: shadow mode for 3 weeks when swapping models. Caught a regression where the new model was 30% faster but produced subtly wrong code suggestions about 4% of the time. The old model stayed primary for another month while prompts were tuned.
Lesson: for internal tools, latency matters less than developer trust. Once devs stop trusting the agent, adoption craters and never recovers. Shadow-then-canary is worth the extra cost.
Deployment shape: active-active across 3 regions, RAFT-elected orchestrator per region, geo-pinned customer data for regulatory compliance (PCI DSS), kill switch tested weekly.
What they did: separate the agents that decide what to do from the agents that actually do it. The deciding agents use the LLM, so their outputs vary. The doing side is plain Python that takes a structured decision and runs it. This kept the regulatory boundary clean: only the predictable Python side touches money.
Lesson: in regulated industries, regulators want to know exactly what your system did and why. Variable LLM outputs make audits painful; clear separation between "decide" and "act" makes them manageable. Don't let an LLM directly call a financial API. Have it produce a typed action that a fixed-logic gate checks and runs.
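One way to picture that split: the LLM-driven "decide" side emits a typed action, and a fixed-logic gate validates and executes it. Everything here is illustrative; the action type, the threshold, `ApprovalRequired`, and `payments_api` are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransferAction:  # what the LLM-driven "decide" side produces
    from_account: str
    to_account: str
    amount_cents: int
    reason: str

APPROVAL_THRESHOLD_CENTS = 50_000_00  # hypothetical: above this, a human signs off

def execute(action: TransferAction) -> None:
    """Plain, auditable Python: the only code allowed to touch money."""
    if action.amount_cents <= 0:
        raise ValueError("amount must be positive")
    if action.amount_cents > APPROVAL_THRESHOLD_CENTS:
        raise ApprovalRequired(action)  # hypothetical: routes to a human queue
    payments_api.transfer(action.from_account, action.to_account, action.amount_cents)
```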
Deployment shape: ran as nightly batch jobs initially, promoted to hourly runs after 6 months. Single region (US only). Used the blackboard pattern: each agent posted forecasts and proposals to a shared store; a coordinator read them all and decided.
Critical decision: bound the agents' authority. The agents recommended transfers; humans approved batches above $50k. After 3 months of high agreement, the threshold was raised to $200k. Trust was earned by demonstrated quality, not declared up front.
Lesson: for high-stakes operational decisions, start with the agent as advisor, not actor. Track its recommendations against human decisions for weeks. Where they consistently agree, automate that range. The trust-building period takes months, and it's worth it.
Deployment shape: Kafka-backed message bus, agents as consumer groups, exactly-once semantics on writes via idempotent handlers. Throughput of 500k descriptions per hour at peak. Cost-driven model selection: a small model for the drafter, a larger model only for the fact-checker.
Critical decision: the SEO agent was disabled in 4 markets where its outputs occasionally contained banned keyword stuffing patterns. Per-market feature flags let them roll forward in stable markets while iterating in problem markets, instead of blocking the entire feature.
Lesson: at scale, the cost ratio between model sizes dominates the architecture. A 10x-cheaper model on 90% of the path, with a smarter model only on quality-critical steps, can be the difference between profitable and not. And per-geography feature flags keep the worst-case region from deciding the global rollout.
Deployment shape: on-premises in each hospital's data center (HIPAA: data never leaves the building). Air-gapped from public LLM APIs; used a self-hosted open-weights model. Strict approval gate: every generated note shown to the doctor before saving to EHR.
Critical decision: the agent's output was framed as a draft, not a recommendation. Doctors edited 70% of notes; this was expected and tracked. The metric was "time saved per visit", not "accept rate". The drafts that counted as accepted were the ones doctors didn't have to rewrite from scratch.
Lesson: in high-stakes domains, the right framing is "this saved you typing", not "this is your answer". The metric follows the framing. Track time saved, error reduction, and clinician satisfaction. Don't chase accept rate; that incentivizes generating bland, uncontroversial output that adds no value.
Pre-production checklist
Before any agent system goes live, every item below should have a real answer:
- Identity & access: which agent has which credentials? How often are they rotated? Are they scoped to the bare minimum each agent needs?
- State safety: if one server dies, is your workflow state safe? Replicated and recoverable?
- Rate limiting: per-user, per-tenant, and per-workflow limits in place? Shared across all your servers so they don't all retry at once?
- Cost limits: per-workflow token budget? Daily company-wide budget? Both enforced, with automatic stop at 100%?
- Kill switch: can someone outside the system shut agents down within seconds? Tested at least monthly?
- Rollback: can you undo a deploy in under 60 seconds with a feature flag, without rebuilding code?
- Audit log: is every decision, tool call, and blocked request logged with workflow ID, agent name, timestamp, input, and output?
- Observability: can you trace a request through the whole workflow, including LLM and tool calls, with the cost of each step?
- Drift detection: are you sampling model behavior over time and comparing to a baseline? Alerts when behavior shifts?
- Disaster recovery: documented and rehearsed plans for losing your primary region, your LLM provider, or your state store?