17 Evaluation · how to know if your agent works

The popular benchmarks have problems. Here's what to do instead.

In April 2026, researchers in Berkeley's RDI group (Berkeley RDI, 2026) built a "scanning agent" and pointed it at eight of the most popular AI agent benchmarks. Every single one could be tricked into giving near-perfect scores without actually solving any tasks. The most striking finding: a 10-line Python config file ("conftest.py") was enough to fake a 100% score on SWE-bench Verified (Jimenez et al., 2024), the benchmark most labs cite when claiming progress on coding agents. WebArena and OSWorld (Xie et al., 2024) both call Python's eval() on text the agent produces, which is straightforward to abuse. And GAIA's (Mialon et al., 2023) answer key is downloadable from Hugging Face.

This doesn't mean benchmarks are useless. It means single benchmark numbers should not be trusted on their own, and that you need your own evaluation if you care about whether your agent actually works. This chapter covers what the popular benchmarks actually measure, where they break, and how to build evaluation for your own system.

The popular benchmarks at a glance

| Benchmark | What it tests | Year | Current state |
| --- | --- | --- | --- |
| SWE-bench (Jimenez et al., ICLR 2024) | Can the agent fix real GitHub issues from open-source Python projects? | 2024, refined through 2026 | Top scores are above 70%, but the benchmark can be cheated. Treat individual scores carefully. |
| OSWorld (Xie et al., NeurIPS 2024) | Can the agent operate a real Linux desktop the way a human would? | 2024, still active | Scores jumped from 23% to 51% during 2025 thanks to better screen grounding (the Jedi dataset). The agent can also tamper with the VM state, which makes some scores suspect. |
| GAIA (Mialon et al., ICLR 2024) | General assistant tasks: combine search, reasoning, file handling, and tool use. | 2023, still cited | Answers are public on Hugging Face, so an agent that knows where to look can "solve" most of it without any reasoning. |
| TAU-bench (Yao et al., 2024) | Customer-service-style tasks: retail and airline scenarios with simulated users. | 2024 | Harder to game because the simulated user is unpredictable. One of the more honest benchmarks. |
| Terminal-Bench | Complex command-line tasks (e.g., configuring a system, building a small program). | 2026 | New; has the same "agent and grader share an environment" problem as SWE-bench. |
| MCP-Bench (2025) | Can the agent pick the right tool from a set of MCP servers? | 2025 | Useful for the new protocol-era agents. Numbers still settling. |
| Windows Agent Arena (Microsoft, arXiv 2024) | Same idea as OSWorld but on Windows; easy to run many instances in parallel on Azure. | 2024 | Solid infrastructure; widely used in industry research. |
| OccuBench (arXiv 2026) | Real professional work tasks across a range of jobs. | 2026 | Newer; broader coverage of "what does an actual office worker do all day?" |

Frontier scores at a glance

Treat the table below as a snapshot, not a leaderboard. Numbers move every quarter, sometimes by 10+ percentage points when a new harness lands. Always verify on the official leaderboard before quoting any specific score in a deck. The point of this table is the shape of where each benchmark sits, not its current peak.

| Benchmark | Task type | Reported frontier (2025-26 era) | Where to verify |
| --- | --- | --- | --- |
| SWE-bench Verified | Real GitHub issues, Python repos | ~50-65% with frontier models + scaffolding; cheating reports show some leaderboard scores were inflated | swebench.com |
| OSWorld | Linux desktop, GUI use | ~20-50%; jumped during 2025 with screen-grounding improvements | os-world.github.io |
| GAIA | General assistant: search + reason + tool use | High, but answer leakage means treat with care | HF leaderboard |
| TAU-bench | Customer service simulation | Substantially harder; simulated-user variability is the point | sierra-research/tau-bench |
| Windows Agent Arena | Windows desktop tasks at parallel scale | Scaling is the differentiator, not raw score | microsoft.github.io/WindowsAgentArena |
| Terminal-Bench | Complex shell sequences | New, scores still settling; same agent-environment-share problem as SWE-bench | laude-institute/terminal-bench |
| OccuBench | Real professional work across occupations | Broad; reveals where agents are still weak (legal, medical reasoning) | arXiv 2026 |

Two recurring patterns worth naming. First, every benchmark whose harness lets the agent execute Python in the grading environment has been gameable so far. SWE-bench, OSWorld, Terminal-Bench all fall in this category. The Berkeley RDI work cited above showed that a 10-line config file is sufficient to fake a 100% score on SWE-bench Verified. Second, every benchmark whose answer keys are public on Hugging Face has been gamed by retrieval. GAIA is the prominent example. The benchmarks that hold up under stress are the ones with simulated counterparties (TAU-bench) or sealed environments where the agent cannot see the grader (private SWE-bench harness instances).

Why these benchmark scores can mislead you

The Berkeley study found four patterns that show up across most benchmarks. None of them are sophisticated; they're the kinds of mistakes any of us could make when we set up evaluation:

- The agent executes code inside the grading environment, so it can overwrite the grading logic itself (the conftest.py trick on SWE-bench).
- The harness calls Python's eval() on text the agent produces (WebArena, OSWorld).
- The answer key is publicly downloadable, so the agent can retrieve answers instead of solving tasks (GAIA).
- The agent can tamper with the state the grader inspects afterward (the VM in OSWorld).

How to read benchmark numbers: when a paper or company announces "70% on SWE-bench Verified", that could mean a great model solving real problems. It could also mean a team found a clever exploit in the harness. Without seeing the actual agent traces and the harness audit logs, you can't tell which it is. Treat single-number leaderboards skeptically, especially when the numbers jump suddenly.

What good evaluation actually looks like

Forget the public benchmarks for a moment. If you're building a real agent, you need evaluation that reflects your use case. Most working teams stack three layers, from cheapest to most expensive:

1 Per-component tests (run on every commit)

Each agent in your system gets its own small test set: a few dozen example inputs paired with the outputs you'd consider correct. Run these on every code change. They catch regressions in single agents fast, before you notice them in the whole-system tests.

Build them like you'd build unit tests for any other code: pick the inputs deliberately, focus on failure modes you've already seen, and don't try to cover every possible input. Quality over quantity.
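As a concrete sketch, a per-component test set can be as small as a file like this. `classify_intent` and its cases are hypothetical stand-ins for one agent in your system, shown framework-free; in practice you'd wire the same cases into your CI test runner:

```python
# Per-component test set sketch. `classify_intent` is a hypothetical
# component; the cases are the deliberately chosen, failure-mode-driven
# set the text describes, not exhaustive coverage.

def classify_intent(text: str) -> str:
    # Placeholder standing in for a real agent call.
    lowered = text.lower()
    if "refund" in lowered or "money back" in lowered:
        return "refund"
    if "cancel" in lowered:
        return "cancellation"
    return "other"

# Inputs paired with the outputs we consider correct, drawn from
# failure modes already seen in production logs.
CASES = [
    ("I want my money back", "refund"),
    ("Please cancel my subscription", "cancellation"),
    ("How do I change my email?", "other"),
    ("Refund me or I cancel", "refund"),  # ambiguity resolved by policy
]

def test_classify_intent():
    failures = [(text, expected, classify_intent(text))
                for text, expected in CASES
                if classify_intent(text) != expected]
    assert not failures, failures

test_classify_intent()
```

The list of cases, not the assertion logic, is where the value lives; each entry documents a decision about what "correct" means.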

2 End-to-end workflow tests (run nightly)

These check whether the whole system completes a task: given this user request, did the agent get the right answer? Track success rate, how many iterations it took, total cost, and how long the whole thing ran. Per-component tests don't catch problems like agents handing off bad context to each other; these whole-flow tests do.

A good source of test cases: real user conversations from your production logs, with personal info removed. Real traces find edge cases that synthetic test sets never imagine.
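A minimal harness for this layer might look like the following, with `run_workflow` as a hypothetical stub for your real entry point. It tracks the four metrics named above: success rate, iterations, cost, and wall-clock time:

```python
# End-to-end eval sketch: run each case through the whole workflow and
# aggregate success rate, iterations, cost, and latency.
import time
from dataclasses import dataclass

@dataclass
class Result:
    success: bool
    iterations: int
    cost_usd: float
    seconds: float

def run_workflow(request: str) -> tuple[str, int, float]:
    # Stub returning (answer, iterations, cost); swap in the real agent.
    return request.upper(), 2, 0.004

def run_eval(cases: list[tuple[str, str]]) -> dict:
    results = []
    for request, expected in cases:
        start = time.monotonic()
        answer, iters, cost = run_workflow(request)
        results.append(Result(answer == expected, iters, cost,
                              time.monotonic() - start))
    n = len(results)
    return {
        "success_rate": sum(r.success for r in results) / n,
        "avg_iterations": sum(r.iterations for r in results) / n,
        "total_cost_usd": sum(r.cost_usd for r in results),
        "p50_seconds": sorted(r.seconds for r in results)[n // 2],
    }

cases = [("hello", "HELLO"), ("ok", "OK"), ("no", "nope")]
print(run_eval(cases))  # with the stub: success_rate is 2/3
```

Nightly runs of this over a few hundred cases give you a trend line; a single run gives you a snapshot.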

3 Production monitoring (always on)

Log every workflow. Sample some for human review. Track per-step cost, latency, and error rate. Build dashboards that answer "is the agent getting better or worse this week?" without anyone having to dig through logs.

The MemMachine team's analysis (MemMachine, arXiv 2026) of the LongMemEval benchmark (Wu et al., ICLR 2025) is a useful template. They listed six things they could change (chunking strategy, query wording, context formatting, etc.) and measured the effect of each one separately. Running similar isolated experiments in production gives you a clear answer to "what should I change next?"

One number won't tell you the truth

The three layers above tell you whether your agent works on the inputs you tested. They don't tell you whether it works on inputs that are slightly different. The gap between the two is where most production failures live. A model that scores 92% on your eval set might score 51% on the same questions paraphrased, with names changed, or with one number swapped. The 92% gives you no warning about which world you're in.

The fix is to take each test case and create small variations of it that should have the same answer. Run all of them. Then report three numbers instead of one: how it does on average, how it does on the hardest variations, and how fast accuracy drops as you push the variations further. That third number is the important one. If accuracy stays flat as you vary the inputs, the agent has actually learned the task. If accuracy falls off a cliff, it's matching on surface details that won't survive contact with real users.

Variations that almost always reveal something:

- Paraphrase the request without changing its meaning.
- Swap names, entities, and other surface details.
- Change one number and check that the answer changes to match.
- Reorder context that shouldn't affect the result.

A good internal report looks like: average 0.91, worst case 0.62, accuracy drops about 0.05 per unit of variation. The third number tells you the agent is brittle even though the average looks fine. Without this kind of evaluation, you'd ship on the 0.91 and find out about the 0.62 from a customer. Running fifty variations per case isn't free, but skipping it costs more.
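The three-number report can be computed with nothing more than a least-squares slope over per-distance accuracy. A sketch, with toy scores standing in for real eval results:

```python
# Robustness report sketch: score each variant at a variation "distance"
# (0 = original, higher = more perturbed) and report average accuracy,
# worst case, and accuracy drop per unit of variation.
from collections import defaultdict

def robustness_report(scores: list[tuple[int, float]]) -> dict:
    """scores: (variation_distance, accuracy in 0..1) per evaluated variant."""
    by_dist = defaultdict(list)
    for dist, acc in scores:
        by_dist[dist].append(acc)
    per_level = {d: sum(v) / len(v) for d, v in by_dist.items()}
    xs, ys = zip(*sorted(per_level.items()))
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return {
        "average": sum(a for _, a in scores) / len(scores),
        "worst_case": min(ys),
        "drop_per_unit": -slope,  # positive = accuracy falls as you perturb
    }

# Toy data: strong on originals, degrading steadily with distance.
scores = [(0, 0.91), (1, 0.86), (2, 0.81), (3, 0.76)]
print(robustness_report(scores))
```

On the toy data this reports average 0.835, worst case 0.76, and a drop of 0.05 per unit of variation: the cliff shape, visible in one number.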

Using one model to grade another (with care)

For tasks where there's no single right answer (summarization quality, helpfulness, code style), the most common approach in 2025 is to use a second LLM as the grader. This is sometimes called "LLM-as-judge". It's cheap and easy to set up. It also has well-documented biases you should know about:

- Position bias: in pairwise comparisons, the judge favors whichever answer it sees first.
- Verbosity bias: longer answers score higher than their content deserves.
- Self-preference: a model grades its own outputs more generously than another model's.

What helps: rotate which model does the grading, randomize order in pairwise comparisons, calibrate against a small set of human-labeled examples once a month, and don't grade with the same model that produced the answer.
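One of those mitigations, order randomization, can be tightened into a position-bias control: query the judge in both orders and only count a win when the two verdicts agree. A sketch with a deliberately biased stub judge standing in for a real model call:

```python
# Position-bias control for pairwise LLM-as-judge: run the comparison in
# both orders; a disagreement between orders is treated as a tie.
# `judge` is a stand-in; a real one would prompt a second model.

def judge(first: str, second: str) -> str:
    # Stub with deliberate biases: prefers longer answers, and breaks
    # ties in favor of whichever answer came first (position bias).
    if len(first) >= len(second):
        return "first"
    return "second"

def debiased_compare(a: str, b: str) -> str:
    verdict_ab = judge(a, b)  # a shown first
    verdict_ba = judge(b, a)  # b shown first
    if verdict_ab == "first" and verdict_ba == "second":
        return "a"
    if verdict_ab == "second" and verdict_ba == "first":
        return "b"
    return "tie"  # judge disagreed with itself: positional noise

print(debiased_compare("short", "a much longer answer"))  # "b"
print(debiased_compare("same size!", "same size?"))       # "tie"
```

The second call exposes the stub's position bias: with two equal-length answers it picks whichever came first, so the cross-check correctly downgrades the result to a tie.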

Reproducibility: boring but critical

If you can't re-run an evaluation tomorrow and get the same answer, you don't know what you measured. A 2026 guide on agent benchmarking (Spheron, 2026) listed six things that quietly break reproducibility; the common thread is anything that can change between runs without being recorded alongside the results.
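One low-effort defense, whatever the specific list: hash everything that could drift into a single manifest stored with each run. A sketch; the field names here are illustrative, not taken from the cited guide:

```python
# Reproducibility sketch: snapshot the eval configuration into a hashed
# manifest. Any silent drift (new prompt, changed dataset, different
# model pin) shows up as a different manifest_id.
import hashlib
import json

def eval_manifest(model: str, temperature: float, system_prompt: str,
                  dataset_rows: list[str], seed: int) -> dict:
    manifest = {
        "model": model,                # pin the exact version string
        "temperature": temperature,    # sampling settings drift silently
        "prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
        "dataset_sha256": hashlib.sha256(
            "\n".join(dataset_rows).encode()).hexdigest(),
        "seed": seed,
    }
    manifest["manifest_id"] = hashlib.sha256(
        json.dumps(manifest, sort_keys=True).encode()).hexdigest()[:12]
    return manifest

m = eval_manifest("frontier-model-2026-01-15", 0.0,
                  "You are a helpful agent.", ["case 1", "case 2"], seed=7)
# Re-running with identical inputs yields the same manifest_id.
```

Storing the `manifest_id` next to every score means "same number, same conditions?" becomes a string comparison rather than an archaeology project.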

Evaluating multi-agent systems

For a system with multiple agents, the most important metric is still: did the whole thing succeed? But that single number hides a lot. Tracking these in addition gives you better debugging info:

- Per-agent success rate on its own subtask.
- Handoff quality: how often one agent passes incomplete or wrong context to the next.
- Cost and latency per agent, so you know where the budget goes.
- Which agent each failure originated in, not just that the workflow failed.

The trap to watch for: tuning each agent until its individual numbers look great, while the overall system success rate drops. Local optimization can hurt the team. Always keep the whole-workflow success rate as the truth metric, and use the per-agent metrics for figuring out why the workflow failed.
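A sketch of failure attribution over workflow traces, keeping overall success as the headline number; the trace shape and agent names are invented:

```python
# Multi-agent debugging sketch: workflow success stays the truth metric,
# per-agent failure counts only tell you where to look next.
from collections import Counter

traces = [  # each trace: overall outcome + which agent (if any) failed first
    {"success": True,  "failed_agent": None},
    {"success": False, "failed_agent": "researcher"},
    {"success": False, "failed_agent": "writer"},
    {"success": False, "failed_agent": "researcher"},
    {"success": True,  "failed_agent": None},
]

workflow_success = sum(t["success"] for t in traces) / len(traces)
blame = Counter(t["failed_agent"] for t in traces if not t["success"])

print(f"workflow success: {workflow_success:.0%}")  # the truth metric
print(f"failure attribution: {dict(blame)}")        # where to debug first
```

Here the researcher agent is implicated twice as often as the writer, which tells you where to spend tuning effort without tempting you to optimize it in isolation.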

Public benchmarks are a fine starting point. Real evaluation is the boring work of building a held-out test set from real traffic, sampling traces for human review, calibrating your judges, and pinning the things that drift. Skip those and a "70% on SWE-bench" tells you almost nothing about whether your agent works in practice.