The popular benchmarks have problems. Here's what to do instead.
In April 2026, researchers at Berkeley's RDI group (Berkeley RDI, 2026) built a "scanning agent" and pointed it at eight of the most popular AI agent benchmarks. Every single one could be tricked into giving near-perfect scores without actually solving any tasks. The most striking finding: a 10-line Python config file ("conftest.py") was enough to fake a 100% score on SWE-bench Verified (Jimenez 2024), the benchmark most labs cite when claiming progress on coding agents. WebArena and OSWorld (Xie 2024) both call Python's eval() on text the agent produces, which is straightforward to abuse. The answer key for GAIA (Mialon 2023) is downloadable from Hugging Face.
This doesn't mean benchmarks are useless. It means single benchmark numbers should not be trusted on their own, and that you need your own evaluation if you care about whether your agent actually works. This chapter covers what the popular benchmarks actually measure, where they break, and how to build evaluation for your own system.
The popular benchmarks at a glance
| Benchmark | What it tests | Year | Current state |
|---|---|---|---|
| SWE-bench (Jimenez et al., ICLR 2024) | Can the agent fix real GitHub issues from open-source Python projects? | 2024, refined through 2026 | Top scores are above 70%, but the benchmark can be cheated. Treat individual scores carefully. |
| OSWorld (Xie et al., NeurIPS 2024) | Can the agent operate a real Linux desktop the way a human would? | 2024, still active | Scores jumped from 23% to 51% during 2025 thanks to better screen-grounding (the Jedi dataset). The agent can also tamper with the VM state, which makes some scores suspect. |
| GAIA (Mialon et al., ICLR 2024) | General assistant tasks: combine search, reasoning, file handling, and tool use. | 2023, still cited | Answers are public on Hugging Face, so an agent that knows where to look can "solve" most of it without any reasoning. |
| TAU-bench (Yao et al., 2024) | Customer service style tasks: retail and airline scenarios with simulated users. | 2024 | Harder to game because the simulated user is unpredictable. One of the more honest benchmarks. |
| Terminal-Bench | Complex command-line tasks (e.g., configuring a system, building a small program). | 2026 | New; has the same "agent and grader share an environment" problem as SWE-bench. |
| MCP-Bench (MCP, 2025) | Can the agent pick the right tool from a set of MCP servers? | 2025 | Useful for the new protocol-era agents. Numbers still settling. |
| Windows Agent Arena (MS / arXiv 2024) | Same idea as OSWorld but on Windows, easy to run many instances in parallel on Azure. | 2024 | Solid infrastructure; widely used in industry research. |
| OccuBench (arXiv 2026) | Real professional work tasks across a range of jobs. | 2026 | Newer, broader coverage of "what does an actual office worker do all day?" |
Frontier scores at a glance
Treat the table below as a snapshot, not a leaderboard. Numbers move every quarter, sometimes by 10+ percentage points when a new harness lands. Always verify on the official leaderboard before quoting any specific score in a deck. The point of this table is the shape of where each benchmark sits, not its current peak.
| Benchmark | Task type | Reported frontier (2025-26 era) | Where to verify |
|---|---|---|---|
| SWE-bench Verified | Real GitHub issues, Python repos | ~50-65% with frontier models + scaffolding; cheating reports show some leaderboard scores were inflated | swebench.com |
| OSWorld | Linux desktop, GUI use | ~20-50%; jumped during 2025 with screen-grounding improvements | os-world.github.io |
| GAIA | General assistant: search + reason + tool use | High but answer-leakage means treat with care | HF leaderboard |
| TAU-bench | Customer service simulation | Substantially harder; simulated-user variability is the point | sierra-research/tau-bench |
| Windows Agent Arena | Windows desktop tasks at parallel scale | Scaling is the differentiator, not raw score | microsoft.github.io/WindowsAgentArena |
| Terminal-Bench | Complex shell sequences | New, scores still settling; same agent-environment-share problem as SWE-bench | laude-institute/terminal-bench |
| OccuBench | Real professional work across occupations | Broad; reveals where agents are still weak (legal, medical reasoning) | arXiv 2026 |
Two recurring patterns are worth naming. First, every benchmark whose harness lets the agent execute Python in the grading environment has been gameable so far. SWE-bench, OSWorld, and Terminal-Bench all fall in this category. The Berkeley RDI work cited above showed that a 10-line config file is sufficient to fake a 100% score on SWE-bench Verified. Second, every benchmark whose answer keys are public on Hugging Face has been gamed by retrieval. GAIA is the prominent example. The benchmarks that hold up under stress are the ones with simulated counterparties (TAU-bench) or sealed environments where the agent cannot see the grader (private SWE-bench harness instances).
Why these benchmark scores can mislead you
The Berkeley study found four patterns that show up across most benchmarks. None of them are sophisticated. They're the kinds of mistakes any of us could make when we set up evaluation:
- The agent and the grader share the same environment. If your agent's code runs in the same place the grader is checking for the answer, the agent can just write the answer where the grader expects to find it. Found in SWE-bench, Terminal-Bench, OSWorld.
- The answer key is publicly available. WebArena ships reference answers in the task config. OSWorld puts gold-file URLs right in the task metadata. GAIA's answers are downloadable from Hugging Face. If the agent can find the answer key, the benchmark measures Google skills, not reasoning.
- The grader runs code that the agent controls. WebArena and OSWorld both call Python's eval() on text the agent generates. That's executing arbitrary code on the grading machine, and the agent gets to choose what code.
- Network access isn't blocked. Many evaluation containers default to having internet access. An agent with internet can fetch hints, download solutions, or even ask a human.
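One way to close the "grader runs code the agent controls" gap in your own harness is to parse agent output as data instead of executing it. The sketch below assumes answers are simple literals (numbers, strings, lists); grade_answer and parse_literal are hypothetical helper names, not part of any benchmark's harness.

```python
import ast

# Sentinel for "could not be parsed", distinct from a legitimate None answer.
_INVALID = object()


def parse_literal(text: str):
    """Parse agent output as a Python literal without executing any code.

    ast.literal_eval only accepts literals (numbers, strings, tuples, lists,
    dicts, sets, booleans, None) and raises on calls, names, or attribute
    access, so text like "__import__('os')..." is rejected, not run.
    """
    try:
        return ast.literal_eval(text.strip())
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        return _INVALID


def grade_answer(agent_text: str, expected) -> bool:
    """Hypothetical grader: True only if the parsed literal equals expected."""
    parsed = parse_literal(agent_text)
    return parsed is not _INVALID and parsed == expected
```

Run the grader in a separate process or container from the agent as well; safe parsing does not help if the agent can overwrite the file the grader reads.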
What good evaluation actually looks like
Forget the public benchmarks for a moment. If you're building a real agent, you need evaluation that reflects your use case. Most working teams stack three layers, from cheapest to most expensive:
The first layer is per-component tests. Each agent in your system gets its own small test set: a few dozen example inputs paired with the outputs you'd consider correct. Run these on every code change. They catch regressions in single agents fast, before you notice them in the whole-system tests.
Build them like you'd build unit tests for any other code: pick the inputs deliberately, focus on failure modes you've already seen, and don't try to cover every possible input. Quality over quantity.
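A per-agent test set can be as small as a list of named cases and a loop. This is a minimal sketch that assumes the agent is a plain function from prompt to text and that exact string match is an acceptable check; Case and run_cases are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    name: str      # short label, ideally naming the failure mode it guards
    prompt: str    # input handed to the agent
    expected: str  # output you would accept as correct


def run_cases(agent: Callable[[str], str], cases: list[Case]) -> list[str]:
    """Return the names of failing cases; an empty list means no regressions."""
    failures = []
    for case in cases:
        try:
            ok = agent(case.prompt).strip() == case.expected
        except Exception:
            # A crash counts as a failure, not as a skipped case.
            ok = False
        if not ok:
            failures.append(case.name)
    return failures
```

Wire run_cases into whatever runs on every code change and fail the build when the returned list is non-empty.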
The second layer is whole-flow tests. These check whether the whole system completes a task: given this user request, did the agent get the right answer? Track success rate, how many iterations it took, total cost, and how long the whole thing ran. Per-component tests don't catch problems like agents handing off bad context to each other; these whole-flow tests do.
A good source of test cases: real user conversations from your production logs, with personal info removed. Real traces find edge cases that synthetic test sets never imagine.
The third layer is production monitoring. Log every workflow. Sample some for human review. Track per-step cost, latency, and error rate. Build dashboards that answer "is the agent getting better or worse this week?" without anyone having to dig through logs.
The MemMachine team's analysis (MemMachine, arXiv 2026) on the LongMemEval benchmark (Wu et al., ICLR 2025) is a useful template. They listed six things they could change (chunking strategy, query wording, context formatting, etc.) and measured the effect of each one separately. Running similar isolated experiments in production gives you a clear answer to "what should I change next?"
One number won't tell you the truth
The three layers above tell you whether your agent works on the inputs you tested. They don't tell you whether it works on inputs that are slightly different. The gap between the two is where most production failures live. A model that scores 92% on your eval set might score 51% on the same questions paraphrased, with names changed, or with one number swapped. The 92% gives you no warning about which world you're in.
The fix is to take each test case and create small variations of it that should have the same answer. Run all of them. Then report three numbers instead of one: how it does on average, how it does on the hardest variations, and how fast accuracy drops as you push the variations further. That third number is the important one. If accuracy stays flat as you vary the inputs, the agent has actually learned the task. If accuracy falls off a cliff, it's matching on surface details that won't survive contact with real users.
Variations that almost always reveal something:
- Reword the request. Same meaning, different sentence. If the answer changes, the agent is matching the wording, not the intent.
- Swap the names. Same question, different people, dates, places. If accuracy drops, the agent has memorized specifics rather than learned a pattern.
- Add a few unrelated sentences. A robust agent ignores them. A fragile one gets distracted.
- Change the numbers. Make a quantity ten times bigger. The agent should still produce a sensibly-shaped answer; surprisingly often it doesn't.
- Run the eval again three months later. The world has moved on, and so has the data. This is the test no one runs and the one production cares about most.
A good internal report looks like: average 0.91, worst case 0.62, accuracy drops about 0.05 per unit of variation. The third number tells you the agent is brittle even though the average looks fine. Without this kind of evaluation, you'd ship on the 0.91 and find out about the 0.62 from a customer. Running fifty variations per case isn't free, but skipping it costs more.
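The three-number report can be computed from pass/fail outcomes grouped by how far each variation is from the original. The sketch assumes you assign each variation an integer "level" (0 for the original case, higher for more aggressive edits); that scale, and the name robustness_report, are assumptions for illustration. The slope is an ordinary least-squares fit of per-level accuracy against level.

```python
from statistics import mean


def robustness_report(results: dict[int, list[bool]]) -> dict[str, float]:
    """results maps variation level (0 = original) to pass/fail outcomes."""
    levels = sorted(results)
    # Accuracy at each variation level.
    accuracy = [mean(1.0 if ok else 0.0 for ok in results[lvl]) for lvl in levels]
    # Accuracy over every run, regardless of level.
    overall = mean(1.0 if ok else 0.0 for lvl in levels for ok in results[lvl])
    # Least-squares slope of accuracy against level.
    x_bar, y_bar = mean(levels), mean(accuracy)
    spread = sum((x - x_bar) ** 2 for x in levels)
    if spread:
        slope = sum(
            (x - x_bar) * (y - y_bar) for x, y in zip(levels, accuracy)
        ) / spread
    else:
        slope = 0.0  # only one level present, no trend to fit
    return {"average": overall, "worst": min(accuracy), "slope_per_level": slope}
```

A slope near zero means accuracy holds as inputs drift; a clearly negative slope is the brittleness signal described above, even when the average looks healthy.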
Using one model to grade another (with care)
For tasks where there's no single right answer (summarization quality, helpfulness, code style), the most common approach in 2025 is to use a second LLM as the grader. This is sometimes called "LLM-as-judge". It's cheap and easy to set up. It also has well-documented biases you should know about:
- Position bias. When comparing two answers side-by-side, the model often prefers whichever one is shown first.
- Length bias. Longer responses get rated as more thorough, even when they're not actually better.
- Self-preference. If the grader is the same model family as the candidate, it tends to favor its own outputs.
- Style beats substance. Nicely-formatted wrong answers can score higher than messy correct ones.
What helps: rotate which model does the grading, randomize order in pairwise comparisons, calibrate against a small set of human-labeled examples once a month, and don't grade with the same model that produced the answer.
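Position bias in particular is cheap to detect. The sketch below is a stricter variant of randomizing order: it asks the judge in both orders and only accepts a verdict that survives the swap. The judge is passed in as a plain callable returning "first" or "second"; that interface is an assumption, not any vendor's API.

```python
from typing import Callable

# A judge takes (prompt, first_answer, second_answer) and returns
# "first" or "second". How it calls a model is out of scope here.
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" when the verdict flips with position."""

    def winner(first_label: str, first: str, second_label: str, second: str) -> str:
        verdict = judge(prompt, first, second)
        return first_label if verdict == "first" else second_label

    forward = winner("a", a, "b", b)   # a shown first
    backward = winner("b", b, "a", a)  # b shown first
    return forward if forward == backward else "tie"
```

The fraction of comparisons that come back as "tie" is itself worth tracking: it is a rough estimate of how much of the judge's signal is position rather than content.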
Reproducibility: boring but critical
If you can't re-run an evaluation tomorrow and get the same answer, you don't know what you measured. A 2026 guide on agent benchmarking (Spheron, 2026) listed six things that quietly break reproducibility:
- Container drift. Pin Docker images by digest (the long sha256:... string), not by tag. Tags like :latest can silently change between runs and break your scores.
- Random seeds. Set PYTHONHASHSEED, model temperature=0, and top_p=1.0. Otherwise the model produces different output each run. Note that even at temperature=0, a model version change can shift outputs (chapter 02 covers what determinism really means here); pin the model version too.
- Model weight hashes. If you're running a local model, record an md5sum of the weight files. Different checkpoints that share a model name often score differently.
- Tool versions. Search APIs, browsers, code runners all have versions. Pin them too. A search-engine update can change your scores tomorrow.
- Internet access. Decide once whether the agent has internet during eval. If yes, your scores depend on the live web (which changes daily). If no, you have to mock everything (which adds its own bugs).
- The current date. Prompts that mention "today is November 5, 2025" produce different reasoning than ones that say "today is April 1, 2026". Pin the date for replay runs.
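One way to keep the six items honest is to write them into a manifest that is stored next to every score. This is a sketch under assumptions: the field names are illustrative, and file_digest defaults to SHA-256 (pass algo="md5" to match md5sum output).

```python
import hashlib
import json


def file_digest(path: str, algo: str = "sha256") -> str:
    """Hash a weight file in 1 MiB chunks so large files do not fill memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def run_manifest(*, image_digest: str, model_version: str, seed: int,
                 temperature: float, top_p: float, network_enabled: bool,
                 pinned_date: str, tool_versions: dict,
                 weight_hashes: dict) -> dict:
    """Collect everything that must match for two runs to be comparable."""
    return {
        "image_digest": image_digest,      # container pinned by digest, not tag
        "model_version": model_version,
        "python_hash_seed": seed,          # export as PYTHONHASHSEED
        "temperature": temperature,
        "top_p": top_p,
        "network_enabled": network_enabled,
        "pinned_date": pinned_date,        # the "today" injected into prompts
        "tool_versions": tool_versions,
        "weight_hashes": weight_hashes,    # file name -> file_digest(...)
    }


def manifest_id(manifest: dict) -> str:
    """Stable fingerprint: same settings give the same ID, in any key order."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two scores are only comparable when their manifest IDs match; a changed ID tells you the setup moved before you start debugging the agent.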
What to actually emit: OpenTelemetry's gen_ai semantic conventions
Observability used to mean "log whatever feels important." That worked when each team built one LLM application; it does not work now that an agent calls three tools that each call a sub-agent that calls a model. The fix is a shared vocabulary so traces from different libraries, different teams, and different models compose into one debuggable system. The OpenTelemetry project, the same organization behind metric and trace standards across the rest of the software industry, has been standardizing one for AI agents: the gen_ai semantic conventions.
Adopt the names. Two reasons. First, every observability platform worth naming (LangSmith, Langfuse, Arize Phoenix, Braintrust, Datadog, Helicone, MLflow, New Relic, Honeycomb, Grafana) ingests gen_ai.* attributes natively, so you swap platforms without rewriting instrumentation. Second, the names already encode the right design choices. You do not have to invent a token-counting attribute or argue about whether a "step" is a span or an event; the spec made those choices for you.
One important caveat before you adopt this in production. As of OpenTelemetry semantic conventions v1.41.0 (May 2026), every gen_ai.* attribute is still marked Status: Development, not Stable. Names can still change. The conservative move is to use the names today and hide them behind a small adapter so a future rename is a one-line change in your instrumentation library, not a search-and-replace across every service.
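The adapter can be one small module that is the only place the attribute strings are spelled out. The names below match the gen_ai conventions as I understand them at the time of writing; because they are still in Development status, verify them against the spec version you pin. chat_attributes is a hypothetical helper.

```python
# genai_names.py: the only module that spells out gen_ai.* strings.
# If the spec renames an attribute, the change lands here and nowhere else.
OPERATION_NAME = "gen_ai.operation.name"
REQUEST_MODEL = "gen_ai.request.model"
INPUT_TOKENS = "gen_ai.usage.input_tokens"
OUTPUT_TOKENS = "gen_ai.usage.output_tokens"


def chat_attributes(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Build the attribute dict for one model call (hypothetical helper)."""
    return {
        OPERATION_NAME: "chat",
        REQUEST_MODEL: model,
        INPUT_TOKENS: input_tokens,
        OUTPUT_TOKENS: output_tokens,
    }
```

Instrumentation code then imports the constants or the helper and never types a gen_ai.* string directly.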
The four span types you actually need
The spec defines a small vocabulary that maps directly onto how an agent runs. You do not need all of it. These four spans cover almost every agent system in the wild:
| Span operation | What it represents | When you create it |
|---|---|---|
| chat | One model invocation. Prompt in, completion out. The leaf of most traces. | Every time you call the model API. Auto-instrumented by most SDKs. |
| execute_tool | One tool call. The model decided to call issue_refund; this span wraps the actual execution. | Every tool dispatched by your loop. Span name pattern: execute_tool issue_refund. |
| invoke_agent | One full run of one agent: perceive, decide, act, repeat, until done. Wraps multiple chat and execute_tool spans. | The outer boundary of an agent's loop, including any sub-agents it spawns. |
| invoke_workflow | The orchestrator's view: multiple agents collaborating on one task. Wraps multiple invoke_agent spans. | At the top of a multi-agent run. This is where the orchestrator from chapter 05 lives. |
The shape that falls out is a tree: invoke_workflow at the top, invoke_agent spans for each participating agent, and inside each one a series of chat and execute_tool spans. Sub-agents nest naturally: when one agent calls another, the child's invoke_agent span is a child of the parent's. You can read off "which agent did what, in what order" by walking the tree. This is what makes the failure-attribution problem from chapter 13 tractable: the audit log from the trust engine references the same trace by ID, and the contribution ledger is just a query over the tree.
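To make "walking the tree" concrete, here is a standard-library stand-in for the span tree. It is not the OpenTelemetry SDK; the Span class and walk function are illustrative, and a real system would read the same structure from exported spans.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    operation: str  # chat, execute_tool, invoke_agent, or invoke_workflow
    name: str = ""  # agent or tool name; empty for chat and workflow spans
    children: list["Span"] = field(default_factory=list)


def walk(span: Span, agent: str = ""):
    """Depth-first walk yielding (agent, operation, name) for each leaf action.

    The nearest enclosing invoke_agent span decides which agent an action
    is attributed to, so sub-agent work is credited to the sub-agent.
    """
    if span.operation == "invoke_agent":
        agent = span.name
    elif span.operation in ("chat", "execute_tool"):
        yield (agent, span.operation, span.name)
    for child in span.children:
        yield from walk(child, agent)
```

Listing walk(root) in order gives the "which agent did what, in what order" view; grouping the tuples by agent is the starting point for the attribution queries mentioned above.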
The metrics that go alongside
Three metric names are required by the spec; track them all. Adding any others on top is fine, but skip these and your dashboards will not compose with anyone else's:
- gen_ai.client.operation.duration as a histogram. p50, p95, p99 of how long each model call takes. The first thing any operator wants to see.
- gen_ai.client.token.usage as a histogram with a gen_ai.token.type attribute set to either input or output. Cost per request falls out of this; a sudden change in input tokens is often the first sign of context bloat.
- gen_ai.client.operation.time_to_first_chunk for streaming. Time-to-first-token shapes user perception of speed even when total latency is identical.
The same naming logic applies to the next layer up. execute_tool spans should record duration on the same histogram (with gen_ai.operation.name=execute_tool as an attribute) so you can compare model latency to tool latency on the same dashboard. Tool-call success rate is then computable from the standard error.type attribute on those spans, no new metric needed.
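The success-rate calculation is a short aggregation once spans are exported. The sketch assumes spans arrive as dicts with an "attributes" mapping, and that the tool name is recorded under gen_ai.tool.name; both are assumptions about your export format to check against your pipeline.

```python
def tool_success_rate(spans: list[dict]) -> dict[str, float]:
    """Per-tool success rate: a span counts as failed if it carries error.type."""
    totals: dict[str, int] = {}
    failures: dict[str, int] = {}
    for span in spans:
        attrs = span.get("attributes", {})
        if attrs.get("gen_ai.operation.name") != "execute_tool":
            continue  # ignore chat and agent spans
        tool = attrs.get("gen_ai.tool.name", "unknown")
        totals[tool] = totals.get(tool, 0) + 1
        if "error.type" in attrs:
            failures[tool] = failures.get(tool, 0) + 1
    return {tool: 1 - failures.get(tool, 0) / n for tool, n in totals.items()}
```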
Privacy by default
The spec gets one detail right that most home-grown logging gets wrong: prompts and completions are not captured by default. You opt in by setting OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true, and even then the spec recommends uploading content to external object storage and recording only a reference in the span. Two reasons this matters in practice. First, prompts and completions routinely contain personally identifiable information you cannot afford to copy into your observability backend's hot storage. Second, separating the structural trace (which scales to billions of spans cheaply) from the content blob (which needs different retention and access control) is what makes audit logs feasible. See the next section for why those are different things.
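The opt-in plus store-a-reference pattern might look like the sketch below. The environment variable is the one named above; the blob store is a plain dict standing in for external object storage, and the app.content_ref attribute name is hypothetical, not taken from the spec.

```python
import hashlib
import os

CAPTURE_ENV = "OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"


def capture_enabled(environ=os.environ) -> bool:
    """Content capture is off unless the variable is explicitly 'true'."""
    return environ.get(CAPTURE_ENV, "").strip().lower() == "true"


def content_attributes(content: str, blob_store: dict, environ=os.environ) -> dict:
    """Return span attributes for message content, or {} when capture is off.

    The body goes to blob_store (a stand-in for object storage); the span
    only carries a content-addressed reference to it.
    """
    if not capture_enabled(environ):
        return {}
    key = hashlib.sha256(content.encode("utf-8")).hexdigest()
    blob_store[key] = content
    return {"app.content_ref": key}  # hypothetical attribute name
```

Keeping the reference content-addressed means identical prompts deduplicate in storage, and deleting a blob for a privacy request leaves the structural trace intact.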
Observability traces and audit logs are not the same thing
A trace is for debugging. An audit log is for accountability. They look similar, store similar data, and routinely get confused, but they have different consumers, different retention, different access control, and different correctness guarantees. Conflating them is one of the most common mistakes operators make, and the LangChain community has explicitly called this gap out in issue #35357: callback handlers and observability platforms are designed for debugging, not for regulatory compliance.
A useful split has three sinks, not one:
- Traces. The OpenTelemetry spans described above. Structural attributes, no message bodies by default. Sampled aggressively in high-volume systems. Retention measured in days or weeks. Consumed by engineers debugging an incident.
- Audit log. Append-only, hash-chained or signed (see the audit log section in chapter 13). Records every privilege grant, every action consumed, every outcome. Retention measured in years. Lives in a separate trust domain that the agents themselves cannot write to. Consumed by auditors, compliance officers, and anyone reconstructing what happened weeks after the incident.
- Reputation and evaluation store. The gen_ai.evaluation.result events from the spec, parented to the GenAI span being evaluated. This is the canonical hook for the four reputation signal types from chapter 13 (deterministic, rule-based, model-judged, human). Each evaluation source becomes one event with gen_ai.evaluation.score.value and gen_ai.evaluation.score.label. Storing them as events means a span can carry many independent evaluations from different sources, and divergence between them is itself a debuggable signal.
The three sinks share an ID scheme so you can join across them when needed. The trace ID lives in the trace, gets referenced from the audit log entry, and is the parent of any evaluation events. But the IAM boundaries are different and the access patterns are different. An engineer chasing a regression only needs traces. An auditor reconstructing a year-old incident needs audit log plus traces (if not yet expired) plus the evaluation history. A reputation engine consuming evaluation events does not need either of the others.
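For the audit sink, "append-only and hash-chained" can be illustrated in a few lines. This is a toy in-memory sketch, not the design from chapter 13: each entry stores the hash of the previous one and the trace ID it refers to, so any later edit breaks verification. The field names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # prev_hash value for the first entry


def _digest(body: dict) -> str:
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(log: list[dict], event: dict, trace_id: str) -> dict:
    """Append an entry whose hash covers the event, trace ID, and previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"event": event, "trace_id": trace_id, "prev_hash": prev_hash}
    entry = dict(body, hash=_digest(body))
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; False if any entry was altered or reordered."""
    prev = GENESIS
    for entry in log:
        body = {k: entry[k] for k in ("event", "trace_id", "prev_hash")}
        if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True
```

The trace_id field is the join key back to the trace sink; the chain only proves integrity, so the log still has to live somewhere the agents cannot write.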
Evaluating multi-agent systems
For a system with multiple agents, the most important metric is still: did the whole thing succeed? But that single number hides a lot. Tracking these in addition gives you better debugging info:
- How well each agent does its part. When you call agent X, how often is its output usable by the next step?
- Handoff quality. When agent A passes work to agent B, does B have what it needs? Watch for things like "asked for clarification" or "fell back to a default" as red flags.
- Iteration count. A workflow that succeeds in 3 turns is healthier than one that succeeds in 15. Climbing iteration count is an early sign something is degrading.
- Cost per task, broken down by agent. Don't just track the average. If one agent's cost suddenly spikes, that's often a regression hiding under healthy-looking overall numbers.
- How often agents disagree (in voting setups). A steady disagreement rate is normal. A sudden change usually means something shifted upstream.
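The per-agent cost breakdown is a small aggregation over step records. The sketch assumes each step is a dict with "agent" and "cost_usd" keys and flags an agent when its cost exceeds a multiple of its own baseline; the record shape and the 2x default threshold are illustrative assumptions.

```python
from collections import defaultdict


def cost_by_agent(steps: list[dict]) -> dict[str, float]:
    """Sum cost per agent from step records like {"agent": ..., "cost_usd": ...}."""
    totals: dict[str, float] = defaultdict(float)
    for step in steps:
        totals[step["agent"]] += step["cost_usd"]
    return dict(totals)


def cost_spikes(current: dict[str, float], baseline: dict[str, float],
                ratio: float = 2.0) -> list[str]:
    """Agents whose current cost exceeds ratio times their own baseline."""
    return sorted(
        agent
        for agent, cost in current.items()
        if baseline.get(agent, 0) > 0 and cost > ratio * baseline[agent]
    )
```

Comparing each agent against its own baseline is what surfaces a regression that the overall average would hide.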