19 Infra & deployment · dev, test, prod, consensus

Real agents live in different environments. Each has its own rules.

Most tutorials end with "the agent works on my laptop". This chapter is about everything that happens after that: how agents move through dev, test, staging, and production; what happens when the infrastructure underneath fails; how to coordinate agents across regions; and what changes when the traffic grows by 1000x.

The four environments

A real agent system runs in at least three environments, sometimes four. Each one has different goals, different data, and different things at stake. Mixing them up is one of the fastest ways to ship a production outage.

Dev
Your laptop or a shared dev server. Fake or anonymized data. Cheap models. Code reloads instantly. The point: iterate fast on prompts, tools, and how the agents fit together.
Test / CI
Runs automatically on every code change. Fixed inputs, fixed expected outputs, fixed random seeds so runs are repeatable. Catches problems before they reach staging or production. (A minimal test sketch follows this list of environments.)
Staging
Same shape as production, with anonymized production-like data. Real models and real tools, but isolated from real users. This is where you test the whole flow end-to-end.
Prod
Real users, real data, real consequences. Tight permissions, full monitoring, gradual rollouts behind feature flags, and a kill switch you can hit at any time.
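
For the Test / CI environment, here is a minimal sketch of what such a test can look like; the cases, the fake model, and the tool-selection convention are all illustrative rather than taken from a specific framework:

# test_agent_replies.py -- runs in CI on every change; no network, no real model.
import random

FIXED_CASES = [
    {"input": "refund order 1234", "expected_tool": "refunds"},
    {"input": "what's your uptime SLA?", "expected_tool": "search"},
]

def fake_model(prompt: str) -> str:
    # Deterministic stand-in for the LLM: same prompt, same answer, every run.
    return "TOOL: refunds" if "refund" in prompt else "TOOL: search"

def test_tool_selection_is_stable():
    random.seed(42)                      # pin any sampling the agent itself does
    for case in FIXED_CASES:
        reply = fake_model(case["input"])
        assert reply == f"TOOL: {case['expected_tool']}"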

What changes between environments

What · Dev · Test · Staging · Prod
Model · cheap, fast · pinned version, sometimes mocked · same as prod · pinned, with a fallback ready
Tools · stubs are fine · mocked, repeatable · real, in a sandbox · real, behind safety checks
Data · made up · fixed test data · anonymized real-shape data · real data
Safety checks · optional · same as prod · same as prod · all on; fail closed
Human approval steps · skipped · mocked · mocked · real humans
Token budget · loose · tight, enforced · production limits · production limits
How fast you can roll back · n/a · n/a · minutes (re-deploy) · seconds (feature flag)

A common mistake: safety checks get turned off or loosened in staging because "we want to see what the agent really does". Then production is the first place those checks actually run, and behaviors that staging never noticed start breaking things. Fix: keep safety checks identical between staging and production. Always.
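
One way to enforce that parity is to define the guardrail set in exactly one place and have both environment configs import it, so they cannot drift apart. A minimal sketch (module and field names are illustrative):

# guardrails.py -- the single source of truth for safety checks.
GUARDRAILS = ["intent", "output", "approval"]     # names are illustrative

# config.py -- staging and prod differ in data and traffic, never in guardrails.
STAGING = {"data_source": "anonymized-replica", "guardrails": GUARDRAILS}
PROD    = {"data_source": "live",               "guardrails": GUARDRAILS}

assert STAGING["guardrails"] == PROD["guardrails"]   # cheap drift check in CI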

How to roll out agent changes safely

Standard service deployment patterns (blue-green, canary, dark launch) all apply, with one twist: agent outputs aren't fully predictable. The same input on a new model version might produce slightly different outputs. Your rollout plan needs to account for that.

1 Blue-green: switch over with no downtime

Run two full copies of production, called "blue" (the current version) and "green" (the new version). Deploy to green, run quick checks, then flip the load balancer to point at green. If green misbehaves, flip back to blue in seconds.

For agents, the gotcha is shared state. If both versions read the same workflow state, the new version might choke on data the old version wrote. Two fixes: tag each piece of stored data with a schema version, or run separate state stores during the transition window.
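
A minimal sketch of the schema-version approach (state_store and migrate are stand-ins for whatever storage and migration code you already have):

STATE_SCHEMA_VERSION = 2    # bumped whenever the green (new) version changes the shape

def write_state(workflow_id: str, state: dict) -> None:
    state_store.put(workflow_id, {"schema": STATE_SCHEMA_VERSION, "data": state})

def read_state(workflow_id: str) -> dict:
    record = state_store.get(workflow_id)
    if record["schema"] < STATE_SCHEMA_VERSION:
        record = migrate(record)     # upgrade records written by the blue version
    return record["data"]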

# Kubernetes service definition. Flip the version selector to switch traffic.
apiVersion: v1
kind: Service
metadata:
  name: agent-orchestrator
spec:
  selector:
    app: agent-orchestrator
    version: blue   # change to "green" to cut over
  ports:
    - port: 80
      targetPort: 8080
2 Canary: send a small fraction of traffic first

Start with the new version handling 5% of traffic. Watch your key metrics. If everything looks good, ramp to 25%, then 100%. The point is to limit how many users get hit if something is wrong.

For agents, watch output quality too, not just uptime and errors. Response time and error rate can look fine while the actual answers get worse. Run a held-out test set against both versions and alert if quality starts to diverge.

import hashlib

def route_to_version(user_id: str, canary_pct: float) -> str:
    """Use a stable hash so the same user always hits the same version."""
    h = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(h[:8], 16) / 0xFFFFFFFF
    return "canary" if bucket < canary_pct else "stable"

# Increase gradually as you gain confidence
# Day 1: canary_pct = 0.05  (5%)
# Day 3: canary_pct = 0.25  (if metrics still look good)
# Day 7: canary_pct = 1.0   (full rollout)
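
One way to watch quality, sketched below, is to replay a held-out set against both versions on a schedule and alert when the canary falls behind (grade, call_version, and page_oncall are placeholders for your own eval and paging code):

from statistics import mean

async def compare_quality(eval_set: list[dict], alert_threshold: float = 0.03):
    """Replay held-out prompts through stable and canary; alert if canary lags."""
    stable, canary = [], []
    for case in eval_set:
        stable.append(grade(await call_version("stable", case["prompt"]), case))
        canary.append(grade(await call_version("canary", case["prompt"]), case))
    gap = mean(stable) - mean(canary)
    if gap > alert_threshold:
        page_oncall(f"canary quality down {gap:.1%} on the held-out set")
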
3 Shadow mode: run the new version silently

Run the new agent version alongside the current one, but only the current version's output goes to the user. Log both. Compare them offline to find quality problems before any user sees the new version.

This is especially useful when switching the underlying LLM (e.g., from GPT-4 to Claude). Run shadow mode for two weeks and you'll have thousands of real-world comparisons to look at. Costs roughly double during the window, but it catches silent regressions you'd otherwise discover from user complaints.
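
Here is a sketch of the request path during a shadow window (run_agent and shadow_log are placeholders): the user only ever sees the stable answer, and the shadow result is logged in the background for offline comparison.

import asyncio

async def handle_request(prompt: str) -> str:
    stable_task = asyncio.create_task(run_agent("stable", prompt))
    shadow_task = asyncio.create_task(run_agent("shadow", prompt))

    answer = await stable_task                       # only this goes to the user
    asyncio.create_task(record_pair(prompt, answer, shadow_task))
    return answer

async def record_pair(prompt: str, stable_answer: str, shadow_task) -> None:
    try:
        shadow_answer = await asyncio.wait_for(shadow_task, timeout=30)
    except Exception:
        shadow_answer = None                         # shadow failures never block users
    shadow_log.write(prompt=prompt, stable=stable_answer, shadow=shadow_answer)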

4 Feature flags: turn things on for specific users

For agent systems, feature flags aren't just "show this button to that user". They control which version of the prompt, which set of tools, which safety checks, which model. Every meaningful agent change ships behind a flag. The flag history doubles as an audit log of what was running for whom, and when.

def get_agent_config(user_id: str, tenant: str) -> AgentConfig:
    flags = feature_flags.evaluate(user_id, tenant)
    return AgentConfig(
        prompt_version = flags.get("agent_prompt", "v3"),
        model          = flags.get("agent_model", "gpt-4o-2024-08"),
        tools_enabled  = flags.get("tools", ["search", "calc"]),
        max_iterations = flags.get("max_iter", 10),
        guardrails     = flags.get("guardrails", ["intent", "output", "approval"]),
    )

Coordination problems when you scale up

Once you're running more than one instance of your agent system, a different category of problems shows up. These aren't about the agents disagreeing (we cover that in the adversarial chapter). They're about the underlying machinery: who's in charge, who has the latest state, what happens when networks fail.

Problem 1: who's the orchestrator right now?
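
Exactly one instance should act as orchestrator at a time, and every instance needs to agree on which. A common answer is a lease: the current leader holds a short-lived lock and keeps renewing it; if it dies, the lease expires and another instance takes over. Heavier options are Raft-based election (as in case study 2 below) or your platform's built-in leader election. A minimal sketch with a Redis lease (key name, TTL, and instance ID are illustrative; a production version would renew atomically, e.g. via a Lua script):

import redis

r = redis.Redis()
INSTANCE_ID = "orchestrator-7f3a"               # unique per running instance
LEASE_KEY, LEASE_TTL_S = "orchestrator-lease", 15

def i_am_leader() -> bool:
    # SET NX succeeds only if nobody holds the lease; EX expires it
    # automatically if the current leader dies without releasing it.
    if r.set(LEASE_KEY, INSTANCE_ID, nx=True, ex=LEASE_TTL_S):
        return True
    return r.get(LEASE_KEY) == INSTANCE_ID.encode()

def renew_lease() -> None:                      # call this well inside the TTL
    if r.get(LEASE_KEY) == INSTANCE_ID.encode():
        r.expire(LEASE_KEY, LEASE_TTL_S)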

Problem 2: two workflows updating the same data
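
If two workflow runs read the same record, change it, and write it back, the second write silently erases the first. The usual fix is optimistic concurrency: read the record along with a version number and only write back if that version is still current, retrying otherwise, as in the sketch below (state_store and ConflictError are assumed to exist elsewhere).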

# Compare-and-swap update for workflow state
def update_state_with_retry(workflow_id: str, mutator, max_retries=5):
    for attempt in range(max_retries):
        state, version = state_store.get_with_version(workflow_id)
        new_state = mutator(state)
        ok = state_store.compare_and_swap(workflow_id, version, new_state)
        if ok:
            return new_state
        # Someone else updated first; reload and try again
    raise ConflictError(f"could not update {workflow_id} after {max_retries} tries")

Problem 3: messages getting duplicated or lost
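
Most message buses give at-least-once delivery, so the same task can arrive twice, and a consumer that acknowledges before finishing can lose work when it crashes. The usual pattern is an idempotent handler: deduplicate on message ID and acknowledge only after the work is durably done. A minimal sketch (processed_ids, bus, and run_agent_step are placeholders):

def handle_task_message(msg) -> None:
    # At-least-once delivery means duplicates will happen; dedupe on message ID.
    if not processed_ids.add_if_new(msg.id):    # atomic "insert if absent"
        bus.ack(msg)                            # already handled: just re-acknowledge
        return
    run_agent_step(msg.payload)                 # the actual work
    bus.ack(msg)                                # ack last, so a crash means redelivery, not loss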

Problem 4: hitting your LLM provider's rate limits
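
When every agent instance shares one LLM provider, a burst from any of them can push the whole fleet into rate-limit errors. One common mitigation, sketched below, is a router that prefers the primary provider, counts consecutive rate-limit errors, and temporarily marks a provider unhealthy so traffic fails over to the next one (call_provider, schedule_recovery, and the error types are assumed to be defined elsewhere).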

from dataclasses import dataclass

@dataclass
class ModelProvider:
    name: str
    weight: float           # preference; primary gets 1.0
    healthy: bool = True
    consecutive_errors: int = 0

class ModelRouter:
    def __init__(self, providers: list[ModelProvider]):
        self.providers = providers
        self.error_threshold = 5
        self.cooldown_s = 30

    async def call(self, prompt: str) -> str:
        for provider in sorted(self.providers, key=lambda p: -p.weight):
            if not provider.healthy:
                continue
            try:
                result = await call_provider(provider.name, prompt)
                provider.consecutive_errors = 0
                return result
            except RateLimitError:
                provider.consecutive_errors += 1
                if provider.consecutive_errors >= self.error_threshold:
                    provider.healthy = False
                    schedule_recovery(provider, self.cooldown_s)
                continue          # try the next provider
        raise AllProvidersDownError()

Multi-region deployment

Once you serve users globally, latency forces you to deploy in multiple regions. This brings new problems: agents in EU shouldn't see US data (compliance), state stores are now geo-distributed (consistency vs latency), and a regional outage shouldn't take the whole system down.
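
Here is a sketch of one routing policy under those constraints (region names and helpers like region_healthy and latency_ms are illustrative): unpinned tenants go to the nearest healthy region, while residency-pinned tenants never fail over outside their home region.

REGIONS = ["us-east", "eu-west", "ap-south"]

def pick_region(tenant) -> str:
    if tenant.residency:                         # e.g. "eu-west" for an EU-pinned tenant
        if region_healthy(tenant.residency):
            return tenant.residency
        raise RegionUnavailableError(tenant.residency)   # fail closed rather than leak data
    # Unpinned tenants: nearest healthy region wins.
    for region in sorted(REGIONS, key=lambda r: latency_ms(tenant, r)):
        if region_healthy(region):
            return region
    raise AllRegionsDownError()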

Real-world case studies

Five short studies from different industries, each showing the deployment and infra choices that mattered most. These are composites drawn from public reports and common patterns; specifics adjusted for clarity.

Case study 1 · Internal developer assistant · Tech & SaaS
A B2B SaaS company built a coding-assistant agent for their 800 internal engineers. Started as a side project on one engineer's laptop. Within 6 months it served 10k requests/day across the company.

Deployment shape: single region (engineers were US-only), Kubernetes pods auto-scaling on queue depth, shared Redis for rate-limiting against the LLM provider, feature flags per team to roll out new prompts.

Critical decision: shadow mode for 3 weeks when swapping models. Caught a regression where the new model was 30% faster but produced subtly wrong code suggestions about 4% of the time. The old model stayed primary for another month while prompts were tuned.

Lesson: for internal tools, latency matters less than developer trust. Once devs stop trusting the agent, adoption craters and never recovers. Shadow-then-canary is worth the extra cost.
Case study 2 · Fraud detection on payment flows · Fintech
A digital bank deployed an agent system to triage suspicious transactions in real time. Three agents per transaction: a profile agent (reads user history), a risk agent (scores the transaction), and an adjudicator (combines the two and calls external APIs to verify).

Deployment shape: active-active across 3 regions, Raft-elected orchestrator per region, geo-pinned customer data for regulatory compliance (PCI DSS), kill switch tested weekly.

Critical decision: separate the agents that decide what to do from the agents that actually do it. The deciding agents use the LLM, so their outputs vary. The doing side is plain Python that takes a structured decision and runs it. This kept the regulatory boundary clean: only the predictable Python side touches money.

Lesson: in regulated industries, regulators want to know exactly what your system did and why. Variable LLM outputs make audits painful; clear separation between "decide" and "act" makes them manageable. Don't let an LLM directly call a financial API. Have it produce a typed action that a fixed-logic gate checks and runs.
Case study 3 · Inventory rebalancing across stores · Retail
A multi-brand retailer used a swarm of agents to recommend hourly inventory transfers between stores: one agent per region forecasting demand, a coordinator proposing transfer plans, and a logistics agent costing the moves.

Deployment shape: ran as nightly batch jobs initially; promoted to hourly real-time after 6 months. Single region (US only). Used the blackboard pattern: each agent posted forecasts and proposals to a shared store; coordinator read all and decided.

Critical decision: bound the agents' authority. The agents recommended transfers; humans approved batches above $50k. After 3 months of a consistently high agreement rate, the threshold was raised to $200k. Trust was earned by demonstrated quality, not declared up front.

Lesson: for high-stakes operational decisions, start with the agent as advisor, not actor. Track its recommendations against human decisions for weeks. Where they consistently agree, automate that range. The trust-building period takes months, and it is worth it.
Case study 4 · Product description generation at scale · E-commerce
A marketplace with 12 million listings deployed agents to generate localized product descriptions in 14 languages. A pipeline pattern: extractor agent reads structured product data, drafter agent writes in target language, fact-checker agent verifies, SEO agent tunes for search.

Deployment shape: Kafka-backed message bus, agents as consumer groups, exactly-once semantics on writes via idempotent handlers. Throughput of 500k descriptions per hour at peak. Cost-driven model selection: a small model for the drafter, a larger model only for the fact-checker.

Critical decision: the SEO agent was disabled in 4 markets where its outputs occasionally contained banned keyword stuffing patterns. Per-market feature flags let them roll forward in stable markets while iterating in problem markets, instead of blocking the entire feature.

Lesson: at scale, the cost ratio between model sizes dominates the architecture. A 10x-cheaper model on 90% of the path, with a smarter model only on quality-critical steps, can be the difference between profitable and not. And per-geography feature flags keep the worst-case region from dictating the global rollout.
Case study 5 · Clinical documentation assistant · Healthcare
A hospital network deployed an agent to draft post-visit clinical notes from doctor-patient conversation transcripts. Two agents: a transcription agent (medical-tuned ASR) and a structured-note agent (turns the transcript into SOAP format).

Deployment shape: on-premises in each hospital's data center (HIPAA: data never leaves the building). Air-gapped from public LLM APIs; used a self-hosted open-weights model. Strict approval gate: every generated note shown to the doctor before saving to EHR.

Critical decision: the agent's output was framed as a draft, not a recommendation. Doctors edited 70% of notes; this was expected and tracked. The metric was "time saved per visit", not "accept rate". Drafts that counted as "accepted" were the ones doctors didn't have to rewrite from scratch.

Lesson: in high-stakes domains, the right framing is "this saved you typing", not "this is your answer". The metric follows the framing. Track time saved, error reduction, and clinician satisfaction. Don't chase accept rate; that incentivizes generating bland, uncontroversial output that adds no value.

Pre-production checklist

Before any agent system goes live, every item below should have a real answer:

Rollback: can you revert to the previous agent version in seconds (feature flag), not just minutes (re-deploy)?
Kill switch: does one exist, and when was it last tested?
Model fallback: what happens when the primary LLM provider is down or rate-limiting?
Safety checks: are guardrails identical in staging and production, and do they fail closed?
Quality: is there a held-out test set and a shadow or canary step before any full rollout?
Cost: are token budgets enforced, and will someone be alerted before a cost surprise?
Approvals: which actions require a human, and is that gate real in production?
State: are stored workflow records schema-versioned and safe to update concurrently?
Data residency: does each region see only the data it is allowed to see?

The takeaway: agent systems multiply the failure modes of normal services. They have all the usual problems (network, state, scale) plus model variability, model outages, prompt drift, and cost surprises. Don't ship one to production until every line on the checklist above has a real answer, not a "we'll figure that out later".
The sturdiest production agent systems aren't the ones with the cleverest agents. They're the ones with boring, dependable infrastructure underneath unpredictable models. Make the infra dull. The agents will look much smarter for it.
Further reading on production agent systems:
Prompt-Response to Goal-Directed (arXiv, 2026) surveys reference architectures used by Kore.ai, Salesforce Agentforce, TrueFoundry, ZenML, and LangChain.
Agentic AI Frameworks (IEEE / arXiv, 2025) catalogs framework choices and the design problems they solve.
MAS Outlook (arXiv, 2025) sizes the agent market ($3.66B today, projected to grow to $139B by 2033) and lays out the trade-offs between making agents work well and making them safe.