Real agents live in different environments. Each has its own rules.
Most tutorials end with "the agent works on my laptop". This chapter is about everything that happens after that: how agents move through dev, test, staging, and production; what happens when the infrastructure underneath fails; how to coordinate agents across regions; and what changes when the traffic grows by 1000x.
The four environments
A real agent system runs in at least three environments, sometimes four. Each one has different goals, different data, and different things at stake. Mixing them up is one of the fastest ways to ship a production outage.
What changes between environments
| What | Dev | Test | Staging | Prod |
|---|---|---|---|---|
| Model | Cheap, fast | Pinned version, sometimes mocked | Same as prod | Pinned with a fallback ready |
| Tools | Stubs are fine | Mocked, repeatable | Real, in a sandbox | Real, behind safety checks |
| Data | Made up | Fixed test data | Anonymized real-shape data | Real data |
| Safety checks | Optional | Same as prod | Same as prod | All on; fail closed |
| Human approval steps | Skipped | Mocked | Mocked | Real humans |
| Token budget | Loose | Tight, enforced | Production limits | Production limits |
| How fast you can roll back | N/A | N/A | Minutes (re-deploy) | Seconds (feature flag) |
How to roll out agent changes safely
Standard service deployment patterns (blue-green, canary, shadow/dark launch, feature flags) all apply, with one twist: agent outputs aren't fully predictable. The same input on a new model version might produce slightly different outputs. Your rollout plan needs to account for that.
Blue-green: run two full copies of production, "blue" (the current version) and "green" (the new version). Deploy to green, run quick checks, then flip the load balancer to point at green. If green misbehaves, flip back to blue in seconds.
For agents, the gotcha is shared state. If both versions read the same workflow state, the new version might choke on data the old version wrote. Two fixes: tag each piece of stored data with a schema version, or run separate state stores during the transition window.
```yaml
# Kubernetes Service definition. Flip the version selector to switch traffic.
apiVersion: v1
kind: Service
metadata:
  name: agent-orchestrator
spec:
  selector:
    app: agent-orchestrator
    version: blue  # change to "green" to cut over
  ports:
    - port: 80
      targetPort: 8080
```
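The schema-version fix for the shared-state gotcha can be as simple as stamping every stored record with the version that wrote it, and refusing (or migrating) on mismatch. A minimal sketch; the field name, the `MIGRATIONS` table, and the error type are assumptions:

```python
CURRENT_SCHEMA_VERSION = 7  # the version this deployment writes

def load_workflow_state(raw: dict) -> dict:
    """Read state that may have been written by the other color."""
    version = raw.get("schema_version", 1)
    if version > CURRENT_SCHEMA_VERSION:
        # Written by a newer deployment (e.g. green while we are still blue):
        # don't guess what the new fields mean, hand the workflow back.
        raise IncompatibleStateError(version)
    while version < CURRENT_SCHEMA_VERSION:
        raw = MIGRATIONS[version](raw)  # hypothetical per-version upgrade functions
        version += 1
    return raw

def save_workflow_state(state: dict) -> dict:
    return {**state, "schema_version": CURRENT_SCHEMA_VERSION}
```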
Canary: start with the new version handling 5% of traffic. Watch your key metrics. If everything looks good, ramp to 25%, then 100%. The point is to limit how many users get hit if something is wrong.
For agents, watch output quality too, not just uptime and errors. Response time and error rate can look fine while the actual answers get worse. Run a held-out test set against both versions and alert if quality starts to diverge.
```python
import hashlib

def route_to_version(user_id: str, canary_pct: float) -> str:
    """Use a stable hash so the same user always hits the same version."""
    h = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(h[:8], 16) / 0xFFFFFFFF
    return "canary" if bucket < canary_pct else "stable"

# Increase gradually as you gain confidence
# Day 1: canary_pct = 0.05 (5%)
# Day 3: canary_pct = 0.25 (if metrics still look good)
# Day 7: canary_pct = 1.0 (full rollout)
```
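To act on the "watch output quality too" point, a scheduled job can score a held-out evaluation set against both versions and alert when they diverge. A rough sketch; `run_agent`, `score_answer`, and the alerting threshold are assumptions:

```python
def quality_gap(eval_set, run_agent, score_answer) -> float:
    """Mean quality of the canary minus mean quality of stable on a held-out set."""
    stable = [score_answer(ex, run_agent("stable", ex)) for ex in eval_set]
    canary = [score_answer(ex, run_agent("canary", ex)) for ex in eval_set]
    return sum(canary) / len(canary) - sum(stable) / len(stable)

# Example wiring: hold the rollout if the canary scores noticeably worse.
# if quality_gap(EVAL_SET, run_agent, score_answer) < -0.05:
#     alert("canary quality regression - pause the ramp")
```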
Shadow (dark) launch: run the new agent version alongside the current one, but only the current version's output goes to the user. Log both. Compare them offline to find quality problems before any user sees the new version.
This is especially useful when switching the underlying LLM (e.g., from GPT-4 to Claude). Run shadow mode for two weeks and you'll have thousands of real-world comparisons to look at. Costs roughly double during the window, but it catches silent regressions you'd otherwise discover from user complaints.
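A shadow-mode handler can be small: serve the stable answer, run the candidate in the background, and log both for offline comparison. A sketch under assumed names (`run_stable`, `run_candidate`, `log_shadow`, `request.id`):

```python
import asyncio

async def handle_request(request) -> str:
    # Start the candidate in the background; the user never waits on it.
    shadow_task = asyncio.create_task(run_candidate(request))
    answer = await run_stable(request)  # only the stable answer is returned
    asyncio.create_task(record_shadow(request, answer, shadow_task))
    return answer

async def record_shadow(request, stable_answer, shadow_task) -> None:
    try:
        shadow_answer = await shadow_task
    except Exception as exc:
        # A failing candidate must never affect the user-facing path.
        shadow_answer = f"<error: {exc}>"
    log_shadow(request_id=request.id, stable=stable_answer, shadow=shadow_answer)
```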
Feature flags: for agent systems, flags aren't just "show this button to that user". They control which version of the prompt, which set of tools, which safety checks, which model. Every meaningful agent change ships behind a flag, and the flag history doubles as an audit log of what was running for whom, and when.
```python
def get_agent_config(user_id: str, tenant: str) -> AgentConfig:
    flags = feature_flags.evaluate(user_id, tenant)
    return AgentConfig(
        prompt_version=flags.get("agent_prompt", "v3"),
        model=flags.get("agent_model", "gpt-4o-2024-08"),
        tools_enabled=flags.get("tools", ["search", "calc"]),
        max_iterations=flags.get("max_iter", 10),
        guardrails=flags.get("guardrails", ["intent", "output", "approval"]),
    )
```
Coordination problems when you scale up
Once you're running more than one instance of your agent system, a different category of problems shows up. These aren't about the agents disagreeing (we cover that in the adversarial chapter). They're about the underlying machinery: who's in charge, who has the latest state, what happens when networks fail.
Problem 1: who's the orchestrator right now?
- You run 3 copies of the orchestrator for reliability. But only one of them should be making decisions for a given workflow at a time. Otherwise you get duplicated work and race conditions.
- What works: elect one instance as the leader and have it make all the decisions. The others stay in sync but wait. If the leader dies, hold a quick election to pick a new one. Raft is a popular algorithm for this; a minimal lease-based sketch follows this list.
- Tools that do this for you: etcd, ZooKeeper, Consul, or your cloud provider's coordination service (DynamoDB conditional writes, Spanner, etc).
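A lease-based sketch of the same idea, using Redis as the shared store (an assumption; etcd leases or ZooKeeper ephemeral nodes play the same role). Whoever holds the lease acts as the orchestrator; if it stops renewing, the lease expires and another instance can take over:

```python
import uuid
import redis

r = redis.Redis()
INSTANCE_ID = str(uuid.uuid4())
LEASE_KEY = "orchestrator:leader"
LEASE_TTL_MS = 10_000  # the leader must renew well before this expires

def try_acquire_leadership() -> bool:
    # SET ... NX succeeds only if nobody currently holds the lease.
    return bool(r.set(LEASE_KEY, INSTANCE_ID, nx=True, px=LEASE_TTL_MS))

def renew_leadership() -> bool:
    # Renew only if we are still the recorded leader. (In production the
    # check-and-renew should be a single Lua script so it stays atomic.)
    if r.get(LEASE_KEY) == INSTANCE_ID.encode():
        return bool(r.set(LEASE_KEY, INSTANCE_ID, xx=True, px=LEASE_TTL_MS))
    return False
```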
Problem 2: two workflows updating the same data
- Two workflows both want to update the same record (a customer profile, a shared document). Without some kind of coordination, the second write silently overwrites the first and you lose data.
- What works: distributed locks (Redis Redlock is common), or version numbers with "compare-and-swap" updates. Compare-and-swap means: "update this record only if its version is still the one I read; otherwise tell me and I'll re-read and retry".
- Trade-off: locks are simpler but slower, and you have to handle "what if the lock holder crashes?" Compare-and-swap is faster on average but pushes retry logic into your code.
```python
# Compare-and-swap update for workflow state
def update_state_with_retry(workflow_id: str, mutator, max_retries=5):
    for attempt in range(max_retries):
        state, version = state_store.get_with_version(workflow_id)
        new_state = mutator(state)
        ok = state_store.compare_and_swap(workflow_id, version, new_state)
        if ok:
            return new_state
        # Someone else updated first; reload and try again
    raise ConflictError(f"could not update {workflow_id} after {max_retries} tries")
```
Problem 3: messages getting duplicated or lost
- Your agents talk to each other through a message system (Kafka, RabbitMQ, SQS). When the network has a hiccup, some messages get sent twice; rarely, some get lost. State across agents starts to drift.
- What works: make every message handler safe to run twice (idempotent). Use whatever "exactly-once" features your message system offers. Add background reconciliation jobs that periodically compare states and fix drift.
- The pattern: give every event a unique ID. Each handler checks "have I already processed this ID?" If yes, skip. If no, process and remember. Two deliveries of the same event become safe (see the sketch after this list).
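A minimal version of that pattern, assuming Redis holds the set of processed IDs (a database table with a unique constraint on the event ID works just as well); `apply_event` is a hypothetical helper:

```python
import redis

r = redis.Redis()

def handle_event(event: dict) -> None:
    # SADD returns 0 if the ID was already in the set, i.e. a duplicate delivery.
    if r.sadd("events:processed", event["id"]) == 0:
        return  # already handled; a second delivery is a no-op
    apply_event(event)  # hypothetical: the actual side effect

# Note: if apply_event can fail after the ID is recorded, record the ID in the
# same transaction as the side effect so a crash doesn't silently drop the event.
```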
Problem 4: hitting your LLM provider's rate limits
- Your LLM provider limits how many requests you can send per minute. At peak load, requests start getting queued or rejected. If all your instances retry at the same moment, you make the recovery worse.
- What works: a shared rate limiter (a Redis-backed token bucket works) so all your instances coordinate. Exponential backoff with random jitter so retries don't pile up; a sketch follows the router code below. A circuit breaker that gives up quickly when the provider is fully down.
- Be defensive: have a fallback provider. If GPT is rate-limited, fall back to Claude, or to a smaller local model. The quality might be a little lower, but the system stays up.
```python
from dataclasses import dataclass

@dataclass
class ModelProvider:
    name: str
    weight: float  # preference; primary gets 1.0
    healthy: bool = True
    consecutive_errors: int = 0

class ModelRouter:
    def __init__(self, providers: list[ModelProvider]):
        self.providers = providers
        self.error_threshold = 5
        self.cooldown_s = 30

    async def call(self, prompt: str) -> str:
        for provider in sorted(self.providers, key=lambda p: -p.weight):
            if not provider.healthy:
                continue
            try:
                result = await call_provider(provider.name, prompt)
                provider.consecutive_errors = 0
                return result
            except RateLimitError:
                provider.consecutive_errors += 1
                if provider.consecutive_errors >= self.error_threshold:
                    provider.healthy = False
                    schedule_recovery(provider, self.cooldown_s)
                continue  # try the next provider
        raise AllProvidersDownError()
```
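The router above covers the "provider is down" case. The "provider is briefly rate-limiting" case is better handled by retrying with exponential backoff and random jitter, so instances don't all retry in lockstep. A small sketch (the `RateLimitError` type is the same one used above; the 30-second cap is an assumption):

```python
import asyncio
import random

async def call_with_backoff(call, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to an exponentially growing cap.
            await asyncio.sleep(random.uniform(0, min(30, 2 ** attempt)))
```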
Multi-region deployment
Once you serve users globally, latency forces you to deploy in multiple regions. This brings new problems: EU user data shouldn't be processed outside the EU (compliance), state stores become geo-distributed (consistency vs. latency), and a regional outage shouldn't take the whole system down.
- Active-active: each region serves its own users from local infra. State is replicated, but writes prefer local. Best latency, complex consistency story.
- Active-passive: the primary region serves all traffic; a secondary takes over on failover. Simpler, but users far from the primary always pay cross-region latency, and failover itself takes time.
- Region-pinned data: GDPR and similar regulations require that EU user data stays in the EU. Your routing layer must know each user's region and pin their workflows there (see the sketch after this list).
- Cross-region consensus: if you need globally-consistent state (rare for agents), use Spanner-class systems with TrueTime or accept that consistency operations cost 100ms+.
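A minimal region-pinning sketch; the residency field, region names, and error type are assumptions. The key property is that a user's workflows only ever run in regions allowed for their data:

```python
ALLOWED_REGIONS = {
    "eu": ["eu-west-1", "eu-central-1"],  # EU data never leaves these
    "us": ["us-east-1", "us-west-2"],
}

def pick_region(data_residency: str, healthy_regions: set[str]) -> str:
    for region in ALLOWED_REGIONS[data_residency]:
        if region in healthy_regions:
            return region
    # Better to fail than to fail over into a non-compliant region.
    raise NoCompliantRegionError(data_residency)  # hypothetical error type
```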
Real-world case studies
Five short studies from different industries, each showing the deployment and infra choices that mattered most. These are composites drawn from public reports and common patterns; specifics adjusted for clarity.
Deployment shape: single region (engineers were US-only), Kubernetes pods auto-scaling on queue depth, shared Redis for rate-limiting against the LLM provider, feature flags per team to roll out new prompts.
Critical decision: shadow mode for 3 weeks when swapping models. Caught a regression where the new model was 30% faster but produced subtly wrong code suggestions about 4% of the time. The old model stayed primary for another month while prompts were tuned.
Lesson: for internal tools, latency matters less than developer trust. Once devs stop trusting the agent, adoption craters and never recovers. Shadow-then-canary is worth the extra cost.
Deployment shape: active-active across 3 regions, RAFT-elected orchestrator per region, geo-pinned customer data for regulatory compliance (PCI DSS), kill switch tested weekly.
What they did: separate the agents that decide what to do from the agents that actually do it. The deciding agents use the LLM, so their outputs vary. The doing side is plain Python that takes a structured decision and runs it. This kept the regulatory boundary clean: only the predictable Python side touches money.
Lesson: in regulated industries, regulators want to know exactly what your system did and why. Variable LLM outputs make audits painful; clear separation between "decide" and "act" makes them manageable. Don't let an LLM directly call a financial API. Have it produce a typed action that a fixed-logic gate checks and runs.
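One way to picture that split: the LLM-driven "decide" side emits a typed action, and a fixed-logic gate validates and executes it. Everything here is illustrative; the action type, the threshold, `ApprovalRequired`, and `payments_api` are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransferAction:  # what the LLM-driven "decide" side produces
    from_account: str
    to_account: str
    amount_cents: int
    reason: str

APPROVAL_THRESHOLD_CENTS = 50_000_00  # hypothetical: above this, a human signs off

def execute(action: TransferAction) -> None:
    """Plain, auditable Python: the only code allowed to touch money."""
    if action.amount_cents <= 0:
        raise ValueError("amount must be positive")
    if action.amount_cents > APPROVAL_THRESHOLD_CENTS:
        raise ApprovalRequired(action)  # hypothetical: routes to a human queue
    payments_api.transfer(action.from_account, action.to_account, action.amount_cents)
```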
Deployment shape: ran as nightly batch jobs initially, promoted to hourly runs after 6 months. Single region (US only). Used the blackboard pattern: each agent posted forecasts and proposals to a shared store; a coordinator read them all and decided.
Critical decision: bound the agents' authority. The agents recommended transfers; humans approved batches above $50k. After 3 months of high agreement, the threshold was raised to $200k. Trust was earned by demonstrated quality, not declared up front.
Lesson: for high-stakes operational decisions, start with the agent as advisor, not actor. Track its recommendations against human decisions for weeks. Where they consistently agree, automate that range. The trust-building period takes months, and it's worth it.
Deployment shape: Kafka-backed message bus, agents as consumer groups, exactly-once semantics on writes via idempotent handlers. Throughput of 500k descriptions per hour at peak. Cost-driven model selection: a small model for the drafter, a larger model only for the fact-checker.
Critical decision: the SEO agent was disabled in 4 markets where its outputs occasionally contained banned keyword stuffing patterns. Per-market feature flags let them roll forward in stable markets while iterating in problem markets, instead of blocking the entire feature.
Lesson: at scale, the cost ratio between model sizes dominates the architecture. A 10x-cheaper model on 90% of the path, with a smarter model only on quality-critical steps, can be the difference between profitable and not. And per-geography feature flags keep the worst-case region from deciding the global rollout.
Deployment shape: on-premises in each hospital's data center (HIPAA: data never leaves the building). Air-gapped from public LLM APIs; used a self-hosted open-weights model. Strict approval gate: every generated note shown to the doctor before saving to EHR.
Critical decision: the agent's output was framed as a draft, not a recommendation. Doctors edited 70% of notes; this was expected and tracked. The metric was "time saved per visit", not "accept rate". The drafts that counted as accepted were the ones doctors didn't have to rewrite from scratch.
Lesson: in high-stakes domains, the right framing is "this saved you typing", not "this is your answer". The metric follows the framing. Track time saved, error reduction, and clinician satisfaction. Don't chase accept rate; that incentivizes generating bland, uncontroversial output that adds no value.
Pre-production checklist
Before any agent system goes live, every item below should have a real answer:
- Identity & access: which agent has which credentials? How often are they rotated? Are they scoped to the bare minimum each agent needs?
- State safety: if one server dies, is your workflow state safe? Replicated and recoverable?
- Rate limiting: per-user, per-tenant, and per-workflow limits in place? Shared across all your servers so they don't all retry at once?
- Cost limits: per-workflow token budget? Daily company-wide budget? Both enforced, with automatic stop at 100%?
- Kill switch: can someone outside the system shut agents down within seconds? Tested at least monthly?
- Rollback: can you undo a deploy in under 60 seconds with a feature flag, without rebuilding code?
- Audit log: is every decision, tool call, and blocked request logged with workflow ID, agent name, timestamp, input, and output?
- Observability: can you trace a request through the whole workflow, including LLM and tool calls, with the cost of each step?
- Drift detection: are you sampling model behavior over time and comparing to a baseline? Alerts when behavior shifts?
- Disaster recovery: documented and rehearsed plans for losing your primary region, your LLM provider, or your state store?