What's actually running in production right now.
Every chapter so far has argued for a pattern. This chapter ties those patterns to systems that have shipped. The cases below were chosen because the engineering teams have written about them in detail: postmortems, engineering blogs, conference talks, or technical reports anyone can read. Where a team's account is paywalled or NDA'd, the case is not included.
Two warnings before we start. First, public material is curated material. Companies write about their wins; failures get less coverage. Read each case as "here is one team's reported approach," not "here is the universal best practice." Second, agent systems evolve quickly. A 2024 system runs differently in 2026; cite carefully and verify dates.
GitHub Copilot Workspace and the agent-mode pivot
GitHub's transition from Copilot autocomplete (single-shot, in-editor) to Copilot Workspace and agent mode is one of the most public examples of agentic-pattern adoption at scale. The product team published their architecture and design decisions through GitHub's engineering blog and at GitHub Universe 2024. The pattern: an agent that takes an issue, generates a plan, edits code across multiple files, runs tests, and proposes a PR.
What the engineering team reported publicly: separate specification, plan, and implementation stages, each with its own model call and its own human-editable artifact. Tests are run inside a sandboxed dev environment. The "specification" stage exists specifically to surface the agent's understanding of the problem before it starts coding, because letting the agent jump straight to code produced a high rate of rework. This is exactly the planner-executor split this manual recommends in the patterns chapter.
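A minimal sketch of what that staged split can look like, assuming a generic LLM client behind a `call_model` stub; the stage names and helpers here are illustrative, not GitHub's internals. The structural point is that each stage produces an artifact a human can edit before the next stage consumes it.

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    """A stage output that a human can review and edit before the next stage runs."""
    name: str
    content: str

def call_model(prompt: str) -> str:
    # Placeholder for any LLM client; not a real API.
    raise NotImplementedError("plug in a model client here")

def human_review(artifact: Artifact) -> Artifact:
    # In a real product this is a UI step; here it is a console hook.
    edited = input(f"[{artifact.name}] edit, or press Enter to accept:\n{artifact.content}\n> ")
    return Artifact(artifact.name, edited or artifact.content)

def run_staged_agent(issue_text: str) -> Artifact:
    # Stage 1: surface the agent's understanding of the problem before any code is written.
    spec = human_review(Artifact("spec", call_model(f"Write a specification for:\n{issue_text}")))
    # Stage 2: turn the approved spec into a concrete, reviewable plan of edits.
    plan = human_review(Artifact("plan", call_model(f"Plan the code changes for:\n{spec.content}")))
    # Stage 3: only now generate the implementation, grounded in the approved plan.
    return Artifact("implementation", call_model(f"Implement this plan:\n{plan.content}"))
```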
Anthropic's Claude with computer use
Anthropic shipped computer-use capability with Claude 3.5 Sonnet in October 2024 (with later iterations across Sonnet 4 and Opus 4.x). The release was unusually transparent: Anthropic published not just capability examples but the failure modes, including documented instances where the model misinterpreted a UI element, clicked the wrong button, or drifted off-task during long sessions. The published numbers showed strong improvement on screen-grounded benchmarks such as OSWorld (Xie 2024) and ScreenSpot over prior models, with explicit caveats that error rates were still too high for fully autonomous deployment.
What the engineering blog described: a tight loop of screenshot, action, screenshot, with reasoning text the model could re-read on each step. Action grounding (knowing where on the screen to click) was the specific bottleneck that improvements targeted. The company recommended customers run it in sandboxed environments and treat actions as suggestions to be approved.
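A minimal sketch of that loop, assuming hypothetical `take_screenshot`, `propose_action`, and `apply_action` helpers; the real API shapes are different, and this is not Anthropic's code. The two properties from the writeup are that the model's reasoning stays in the transcript where later steps can re-read it, and that each action passes through an approval gate before it touches the environment.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # e.g. "click", "type", "scroll"
    target: str      # UI element description or screen coordinates
    reasoning: str   # model-written rationale, kept in the transcript

# Hypothetical environment helpers; a real harness would wrap a sandboxed VM or browser.
def take_screenshot() -> bytes: ...
def propose_action(transcript: list[str], screenshot: bytes) -> Action | None: ...
def apply_action(action: Action) -> None: ...

def computer_use_loop(goal: str, max_steps: int = 20) -> list[str]:
    transcript = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        shot = take_screenshot()
        action = propose_action(transcript, shot)   # model sees its own earlier reasoning
        if action is None:                          # model reports the task is done
            break
        # Treat the action as a suggestion: a human (or policy) approves before execution.
        if input(f"Approve {action.kind} on {action.target}? [y/N] ").lower() != "y":
            transcript.append(f"REJECTED: {action.kind} on {action.target}")
            continue
        apply_action(action)
        transcript.append(f"ACTION: {action.kind} on {action.target}. Why: {action.reasoning}")
    return transcript
```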
Cursor's agent loop and the rise of "background agents"
Cursor's "Agent" feature (later "Background Agent") is a long-running agent that takes a higher-level goal, plans a series of edits, runs tests, and presents the user with a diff to accept. The Cursor team's public blog and changelog describe the shape of the loop: a planner produces a high-level edit plan, individual editor "tools" make targeted changes, a test runner validates, and the loop iterates with a fixed retry budget.
The aspect worth studying: Cursor's published failure analysis described two specific operational issues. First, the agent would sometimes "hallucinate" file paths that did not exist; the fix was a strict tool that listed real files before allowing edits. Second, agents would sometimes cycle on the same fix; the fix was a hard retry budget per phase. Both are mechanisms recommended in the guardrails chapter, surfaced because real agents in real codebases hit those exact failure modes.
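Those two fixes are mechanical enough to sketch. The snippet below assumes a generic edit-and-test loop; `propose_edit` and `apply_edit` are hypothetical stand-ins, not Cursor internals. The file listing rejects hallucinated paths before any edit is attempted, and the hard per-phase retry budget stops the loop from cycling on the same failing fix.

```python
import subprocess
from pathlib import Path

MAX_RETRIES_PER_PHASE = 3  # hard budget: stop the loop from cycling on the same fix

def list_real_files(repo_root: str) -> set[str]:
    """Ground the agent in files that actually exist before any edit is allowed."""
    root = Path(repo_root)
    return {str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()}

# Hypothetical stand-ins for the model-driven parts of the loop.
def propose_edit(goal: str, allowed_files: set[str], feedback: str) -> dict: ...
def apply_edit(repo_root: str, edit: dict) -> None: ...

def run_tests(repo_root: str) -> tuple[bool, str]:
    proc = subprocess.run(["pytest", "-q"], cwd=repo_root, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def edit_phase(repo_root: str, goal: str) -> bool:
    allowed = list_real_files(repo_root)
    feedback = ""
    for _ in range(MAX_RETRIES_PER_PHASE):
        edit = propose_edit(goal, allowed, feedback)
        if edit["path"] not in allowed:
            # Reject hallucinated paths instead of silently creating new files.
            feedback = f"{edit['path']} does not exist; choose from the listed files."
            continue
        apply_edit(repo_root, edit)
        passed, output = run_tests(repo_root)
        if passed:
            return True
        feedback = output  # feed the failure back; the budget bounds the iteration
    return False  # budget exhausted: surface the diff and the failure to the user
```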
The Devin / Cognition agent benchmark controversy
In March 2024, Cognition Labs released demo videos of Devin, claiming a 13.86% pass rate on SWE-bench (Jimenez 2024), against the previous best reported result of roughly 1.96%. The numbers spread widely. By April, an independent analysis by Carl Brown (the "Internet of Bugs" YouTube channel) walked through the actual session traces Cognition had published and showed several scenarios where Devin's "successful completion" did not match what the demo implied (e.g., bugs that were not actually fixed, tasks that required follow-up steps that were skipped).
Cognition's response, published on their blog, accepted some of the criticism, contested other parts, and committed to publishing more rigorous evaluation results. The episode is a useful precedent for any team about to publish agent capability numbers: the demos went further than the numbers supported, and a single careful analyst caused a substantial credibility correction.
Stripe and Anthropic on customer support agents
Anthropic published a customer story about an enterprise deployment for support automation. The technical pattern described publicly: a tier-1 agent fielded common questions, with a clearly defined escalation path to humans on anything outside a pre-vetted set of intents. The published win was reduction in time-to-first-response on routine tickets; the published guardrail was that the agent could not take any account-modifying action without explicit human confirmation.
The interesting design choice from the engineering writeup: the agent did not have direct access to customer accounts. Instead, it had access to a set of read-only data tools and a single "create a draft response for human review" tool. The human reviewer could approve, edit, or reject. This is the privilege-broker pattern from the trust chapter, deployed in production with explicit hand-off to a human at every action boundary.
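A minimal sketch of that tool surface, with hypothetical tool names (`lookup_order`, `lookup_subscription`, `create_draft_reply`); the actual deployment's tools are not public. The design point is that the agent's only write-shaped tool produces a draft for a human review queue, never a direct change to a customer account.

```python
from dataclasses import dataclass, field

@dataclass
class DraftReply:
    ticket_id: str
    body: str
    status: str = "pending_review"  # a human approves, edits, or rejects

@dataclass
class SupportToolbox:
    """Read-only data tools plus exactly one write-shaped tool that drafts, never acts."""
    review_queue: list[DraftReply] = field(default_factory=list)

    # Read-only tools: safe to call without confirmation (stubbed data sources here).
    def lookup_order(self, order_id: str) -> dict:
        return {"order_id": order_id, "status": "shipped"}

    def lookup_subscription(self, customer_id: str) -> dict:
        return {"customer_id": customer_id, "plan": "pro"}

    # The only write-shaped tool: it creates work for a human, not a change to the account.
    def create_draft_reply(self, ticket_id: str, body: str) -> DraftReply:
        draft = DraftReply(ticket_id=ticket_id, body=body)
        self.review_queue.append(draft)
        return draft

# The agent never holds credentials that can modify an account, so approval is a
# human action on the review queue, outside anything the model can call.
toolbox = SupportToolbox()
toolbox.create_draft_reply("T-1042", "Your last invoice was refunded on Monday.")
```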
Shopify's Sidekick and merchant agents
Shopify Sidekick (announced 2023, expanded through 2024) is a merchant-facing agent embedded in the Shopify admin. The published architecture: a planner that decomposes merchant requests into discrete tool calls (query orders, generate report, draft product description, configure shipping), each tool tightly scoped to a single Shopify admin function. The merchant sees the proposed changes before they take effect.
Shopify's engineering blog described the design constraint: Sidekick had to feel like an extension of the merchant's intentions, not an independent agent acting on the store. The mechanism: every tool that mutates merchant data emits a confirmation step. Read-only tools (queries, report generation) run without confirmation. Mutation tools surface a "review and approve" UI before any change is committed.
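A minimal sketch of that read/write split as a tool dispatcher, with hypothetical tool names; this is not Sidekick's code. Tools are registered with a mutates flag, and the dispatcher commits a mutation only after an explicit review-and-approve step.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    fn: Callable[..., object]
    mutates: bool  # mutation tools must surface a review-and-approve step

REGISTRY: dict[str, Tool] = {}

def register(name: str, mutates: bool):
    def wrap(fn):
        REGISTRY[name] = Tool(name, fn, mutates)
        return fn
    return wrap

@register("query_orders", mutates=False)
def query_orders(since: str) -> list[dict]:
    return [{"id": 1, "total": "19.99"}]          # stubbed read-only query

@register("configure_shipping", mutates=True)
def configure_shipping(zone: str, rate: str) -> str:
    return f"shipping for {zone} set to {rate}"   # stubbed mutation

def dispatch(name: str, approve: Callable[[str], bool], **kwargs):
    tool = REGISTRY[name]
    if tool.mutates:
        # Surface the proposed change before anything is committed to the store.
        if not approve(f"{name}({kwargs})"):
            return "change rejected by merchant"
    return tool.fn(**kwargs)

# A read-only call runs freely; the mutation call goes through the approval hook.
dispatch("query_orders", approve=lambda p: True, since="2024-01-01")
dispatch("configure_shipping",
         approve=lambda p: input(f"Apply {p}? [y/N] ").lower() == "y",
         zone="EU", rate="9.00")
```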
The Berkeley RDI scanning-agent results
The Berkeley work cited in the evaluation chapter built a "scanning agent" and pointed it at eight major agent benchmarks: SWE-bench, OSWorld, WebArena, GAIA (Mialon 2023), and others. Each was probed for harness vulnerabilities. The findings, paraphrased: SWE-bench could be cheated with a 10-line conftest.py file; OSWorld and WebArena both eval()'d agent output text; GAIA's answer key was downloadable.
What this case study shows is harder to celebrate but more useful. None of these benchmarks are "broken" in the sense of being unusable. They are useful when run in sealed evaluation harnesses with isolated environments. They are misleading when run as published, because the published harness was designed for honest agents, not agents that explore the harness itself. The fix is harness isolation, not new benchmarks.
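A minimal sketch of what that isolation difference looks like, assuming a benchmark that needs to execute agent-produced code to grade it. The unsafe pattern evaluates the text inside the grading process, next to the answer key; the isolated pattern runs it in a separate interpreter with a timeout and grades only the observable output. A production harness would add a container, a read-only filesystem, and no network on top of this.

```python
import subprocess
import sys
import tempfile

def grade_unsafe(agent_output: str) -> object:
    # Anti-pattern: the agent's text executes inside the grader, with the answer key in scope.
    return eval(agent_output)  # shown only to illustrate the hazard

def grade_isolated(agent_output: str, timeout_s: int = 10) -> str:
    # The agent's code runs in a separate interpreter with no grader state available to it.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(agent_output)
        path = f.name
    proc = subprocess.run(
        [sys.executable, "-I", path],  # -I: isolated mode, ignores env vars and user site dirs
        capture_output=True, text=True, timeout=timeout_s,
    )
    return proc.stdout  # grade only the observable output, not anything the code could reach
```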
Recurring patterns across the cases
Reading these together, several patterns repeat across companies and product types:
- Multi-stage decomposition is universal. Every shipped agent that handles non-trivial work splits the problem into discrete stages with checkpoints between them: spec, plan, act, validate. Single-shot end-to-end agents do not survive contact with real workloads.
- Human-in-the-loop is not a fallback, it is the architecture. The successful deployments treat human approval as a first-class step at every action boundary, not an emergency override for failed agents.
- Read and write are different privilege classes. Across GitHub, Anthropic, Cursor, and Shopify, the consistent pattern is: read-only tools run freely; write-side tools surface a confirmation. This is not security theater; users do not trust agents that can change their world without showing them what they are about to do.
- Honest failure-mode reporting is rare and valuable. Anthropic, Cursor, and Cognition (after correction) all published the things that went wrong. This is what lets the next team plan around those failures. Vendor decks that show only successes are useless for engineering planning.
- Benchmark numbers are lower-bound evidence, not certificates. The Devin/Cognition episode and the Berkeley scanning-agent results both reinforce this. A reported score is a starting point for investigation, not a closing argument.
- Tool grounding matters more than tool count. Cursor's path-hallucination fix and GitHub's spec-stage are examples of the same insight: the agent needs to be anchored to real artifacts before it can reason about them. More tools without grounding is more failure modes.
None of these patterns are surprising once you have read the rest of the manual. The point of this chapter is that they are not theoretical. They show up, in writing, from teams shipping at scale, in the same shape this manual has described. When the next agent system you ship hits a similar problem, the case studies above are evidence that the standard answer is the standard answer for a reason.
The bottom line. Every pattern this manual recommends has shown up in at least one production system whose engineering team wrote about it in public. The patterns are not opinions; they are convergent solutions teams keep landing on independently. When in doubt about a design choice, look for a case study where someone shipped the same kind of system and read what they ended up doing.