The popular benchmarks have problems. Here's what to do instead.
In April 2026, researchers at Berkeley's RDI group Berkeley RDI, 2026 built a "scanning agent" and pointed it at eight of the most popular AI agent benchmarks. Every single one could be tricked into giving near-perfect scores without actually solving any tasks. The most striking finding: a 10-line Python config file ("conftest.py") was enough to fake a 100% score on SWE-benchJimenez 2024 Verified, the benchmark most labs cite when claiming progress on coding agents. WebArena and OSWorldXie 2024 both call Python's eval() on text the agent produces, which is straightforward to abuse. GAIAMialon 2023's answer key is downloadable from Hugging Face.
This doesn't mean benchmarks are useless. It means single benchmark numbers should not be trusted on their own, and that you need your own evaluation if you care about whether your agent actually works. This chapter covers what the popular benchmarks actually measure, where they break, and how to build evaluation for your own system.
The popular benchmarks at a glance
| Benchmark | What it tests | Year | Current state |
|---|---|---|---|
| SWE-bench Jimenez et al., ICLR 2024 | Can the agent fix real GitHub issues from open-source Python projects? | 2024, refined through 2026 | Top scores are above 70%, but the benchmark can be cheated. Treat individual scores carefully. |
| OSWorld Xie et al., NeurIPS 2024 | Can the agent operate a real Linux desktop the way a human would? | 2024, still active | Scores jumped from 23% to 51% during 2025 thanks to better screen-grounding (the Jedi dataset). The agent can also tamper with the VM state, which makes some scores suspect. |
| GAIA Mialon et al., ICLR 2024 | General assistant tasks: combine search, reasoning, file handling, and tool use. | 2023, still cited | Answers are public on Hugging Face, so an agent that knows where to look can "solve" most of it without any reasoning. |
| TAU-benchYao 2024 Yao et al., 2024 | Customer service style tasks: retail and airline scenarios with simulated users. | 2024 | Harder to game because the simulated user is unpredictable. One of the more honest benchmarks. |
| Terminal-Bench | Complex command-line tasks (e.g., configuring a system, building a small program). | 2026 | New; has the same "agent and grader share an environment" problem as SWE-bench. |
| MCPMCP 2025-Bench | Can the agent pick the right tool from a set of MCP servers? | 2025 | Useful for the new protocol-era agents. Numbers still settling. |
| Windows Agent Arena MS / arXiv 2024 | Same idea as OSWorld but on Windows, easy to run many instances in parallel on Azure. | 2024 | Solid infrastructure; widely used in industry research. |
| OccuBenchOccuBench 2026 arXiv 2026 | Real professional work tasks across a range of jobs. | 2026 | Newer, broader coverage of "what does an actual office worker do all day?" |
Frontier scores at a glance
Treat the table below as a snapshot, not a leaderboard. Numbers move every quarter, sometimes by 10+ percentage points when a new harness lands. Always verify on the official leaderboard before quoting any specific score in a deck. The point of this table is the shape of where each benchmark sits, not its current peak.
| Benchmark | Task type | Reported frontier (2025-26 era) | Where to verify |
|---|---|---|---|
| SWE-bench Verified | Real GitHub issues, Python repos | ~50-65% with frontier models + scaffolding; cheating reports show some leaderboard scores were inflated | swebench.com |
| OSWorld | Linux desktop, GUI use | ~20-50%; jumped during 2025 with screen-grounding improvements | os-world.github.io |
| GAIA | General assistant: search + reason + tool use | High but answer-leakage means treat with care | HF leaderboard |
| TAU-bench | Customer service simulation | Substantially harder; simulated-user variability is the point | sierra-research/tau-bench |
| Windows Agent Arena | Windows desktop tasks at parallel scale | Scaling is the differentiator, not raw score | microsoft.github.io/WindowsAgentArena |
| Terminal-Bench | Complex shell sequences | New, scores still settling; same agent-environment-share problem as SWE-bench | laude-institute/terminal-bench |
| OccuBench | Real professional work across occupations | Broad; reveals where agents are still weak (legal, medical reasoning) | arXiv 2026 |
Two recurring patterns worth naming. First, every benchmark whose harness lets the agent execute Python in the grading environment has been gameable so far. SWE-bench, OSWorld, Terminal-Bench all fall in this category. The Berkeley RDI work cited above showed that a 10-line config file is sufficient to fake a 100% score on SWE-bench Verified. Second, every benchmark whose answer keys are public on Hugging Face has been gamed by retrieval. GAIA is the prominent example. The benchmarks that hold up under stress are the ones with simulated counterparties (TAU-bench) or sealed environments where the agent cannot see the grader (private SWE-bench harness instances).
Why these benchmark scores can mislead you
The Berkeley study found four patterns that show up across most benchmarks. None of them are sophisticated. They're the kinds of mistakes any of us could make when we set up evaluation:
- The agent and the grader share the same environment. If your agent's code runs in the same place the grader is checking for the answer, the agent can just write the answer where the grader expects to find it. Found in SWE-bench, Terminal-Bench, OSWorld.
- The answer key is publicly available. WebArena ships reference answers in the task config. OSWorld puts gold-file URLs right in the task metadata. GAIA's answers are downloadable from Hugging Face. If the agent can find the answer key, the benchmark measures Google skills, not reasoning.
- The grader runs code that the agent controls. WebArena and OSWorld both call Python's
eval()on text the agent generates. That's executing arbitrary code on the grading machine, and the agent gets to choose what code. - Network access isn't blocked. Many evaluation containers default to having internet access. An agent with internet can fetch hints, download solutions, or even ask a human.
What good evaluation actually looks like
Forget the public benchmarks for a moment. If you're building a real agent, you need evaluation that reflects your use case. Most working teams stack three layers, from cheapest to most expensive:
Each agent in your system gets its own small test set: a few dozen example inputs paired with the outputs you'd consider correct. Run these on every code change. They catch regressions in single agents fast, before you notice them in the whole-system tests.
Build them like you'd build unit tests for any other code: pick the inputs deliberately, focus on failure modes you've already seen, and don't try to cover every possible input. Quality over quantity.
These check whether the whole system completes a task: given this user request, did the agent get the right answer? Track success rate, how many iterations it took, total cost, and how long the whole thing ran. Per-component tests don't catch problems like agents handing off bad context to each other; these whole-flow tests do.
A good source of test cases: real user conversations from your production logs, with personal info removed. Real traces find edge cases that synthetic test sets never imagine.
Log every workflow. Sample some for human review. Track per-step cost, latency, and error rate. Build dashboards that answer "is the agent getting better or worse this week?" without anyone having to dig through logs.
The MemMachine team's analysis MemMachine, arXiv 2026 on the LongMemEvalLongMemEval 2025 benchmark Wu et al., ICLR 2025 is a useful template. They listed six things they could change (chunking strategy, query wording, context formatting, etc) and measured the effect of each one separately. Running similar isolated experiments in production gives you a clear answer to "what should I change next?"
One number won't tell you the truth
The three layers above tell you whether your agent works on the inputs you tested. They don't tell you whether it works on inputs that are slightly different. The gap between the two is where most production failures live. A model that scores 92% on your eval set might score 51% on the same questions paraphrased, with names changed, or with one number swapped. The 92% gives you no warning about which world you're in.
The fix is to take each test case and create small variations of it that should have the same answer. Run all of them. Then report three numbers instead of one: how it does on average, how it does on the hardest variations, and how fast accuracy drops as you push the variations further. That third number is the important one. If accuracy stays flat as you vary the inputs, the agent has actually learned the task. If accuracy falls off a cliff, it's matching on surface details that won't survive contact with real users.
Variations that almost always reveal something:
- Reword the request. Same meaning, different sentence. If the answer changes, the agent is matching the wording, not the intent.
- Swap the names. Same question, different people, dates, places. If accuracy drops, the agent has memorized specifics rather than learned a pattern.
- Add a few unrelated sentences. A robust agent ignores them. A fragile one gets distracted.
- Change the numbers. Make a quantity ten times bigger. The agent should still produce a sensibly-shaped answer; surprisingly often it doesn't.
- Run the eval again three months later. The world has moved on, and so has the data. This is the test no one runs and the one production cares about most.
A good internal report looks like: average 0.91, worst case 0.62, accuracy drops about 0.05 per unit of variation. The third number tells you the agent is brittle even though the average looks fine. Without this kind of evaluation, you'd ship on the 0.91 and find out about the 0.62 from a customer. Running fifty variations per case isn't free, but skipping it costs more.
Using one model to grade another (with care)
For tasks where there's no single right answer (summarization quality, helpfulness, code style), the most common approach in 2025 is to use a second LLM as the grader. This is sometimes called "LLM-as-judge". It's cheap and easy to set up. It also has well-documented biases you should know about:
- Position bias. When comparing two answers side-by-side, the model often prefers whichever one is shown first.
- Length bias. Longer responses get rated as more thorough, even when they're not actually better.
- Self-preference. If the grader is the same model family as the candidate, it tends to favor its own outputs.
- Style beats substance. Nicely-formatted wrong answers can score higher than messy correct ones.
What helps: rotate which model does the grading, randomize order in pairwise comparisons, calibrate against a small set of human-labeled examples once a month, and don't grade with the same model that produced the answer.
Reproducibility: boring but critical
If you can't re-run an evaluation tomorrow and get the same answer, you don't know what you measured. A 2026 guide on agent benchmarking Spheron, 2026 listed six things that quietly break reproducibility:
- Container drift. Pin Docker images by digest (the long
sha256:...string), not by tag. Tags like:latestcan silently change between runs and break your scores. - Random seeds. Set
PYTHONHASHSEED, modeltemperature=0, andtop_p=1.0. Otherwise the model produces different output each run. Note that even attemperature=0, a model version change can shift outputs (chapter 02 covers what determinism really means here); pin the model version too. - Model weight hashes. If you're running a local model, record an
md5sumof the weight files. Different checkpoints that share a model name often score differently. - Tool versions. Search APIs, browsers, code runners all have versions. Pin them too. A search-engine update can change your scores tomorrow.
- Internet access. Decide once whether the agent has internet during eval. If yes, your scores depend on the live web (which changes daily). If no, you have to mock everything (which adds its own bugs).
- The current date. Prompts that mention "today is November 5, 2025" produce different reasoning than ones that say "today is April 1, 2026". Pin the date for replay runs.
Evaluating multi-agent systems
For a system with multiple agents, the most important metric is still: did the whole thing succeed? But that single number hides a lot. Tracking these in addition gives you better debugging info:
- How well each agent does its part. When you call agent X, how often is its output usable by the next step?
- Handoff quality. When agent A passes work to agent B, does B have what it needs? Watch for things like "asked for clarification" or "fell back to a default" as red flags.
- Iteration count. A workflow that succeeds in 3 turns is healthier than one that succeeds in 15. Climbing iteration count is an early sign something is degrading.
- Cost per task, broken down by agent. Don't just track the average. If one agent's cost suddenly spikes, that's often a regression hiding under healthy-looking overall numbers.
- How often agents disagree (in voting setups). A steady disagreement rate is normal. A sudden change usually means something shifted upstream.