17 Evaluation · how to know if your agent works

The popular benchmarks have problems. Here's what to do instead.

In April 2026, researchers at Berkeley's RDI group (Berkeley RDI, 2026) built a "scanning agent" and pointed it at eight of the most popular AI agent benchmarks. Every single one could be tricked into giving near-perfect scores without actually solving any tasks. The most striking finding: a 10-line Python config file ("conftest.py") was enough to fake a 100% score on SWE-bench Verified (Jimenez 2024), the benchmark most labs cite when claiming progress on coding agents. WebArena and OSWorld (Xie 2024) both call Python's eval() on text the agent produces, which is straightforward to abuse. The answer key for GAIA (Mialon 2023) is downloadable from Hugging Face.

This doesn't mean benchmarks are useless. It means single benchmark numbers should not be trusted on their own, and that you need your own evaluation if you care about whether your agent actually works. This chapter covers what the popular benchmarks actually measure, where they break, and how to build evaluation for your own system.

The popular benchmarks at a glance

Benchmark	What it tests	Year	Current state
SWE-bench (Jimenez et al., ICLR 2024)	Can the agent fix real GitHub issues from open-source Python projects?	2024, refined through 2026	Top scores are above 70%, but the benchmark can be cheated. Treat individual scores carefully.
OSWorld (Xie et al., NeurIPS 2024)	Can the agent operate a real Linux desktop the way a human would?	2024, still active	Scores jumped from 23% to 51% during 2025 thanks to better screen-grounding (the Jedi dataset). The agent can also tamper with the VM state, which makes some scores suspect.
GAIA (Mialon et al., ICLR 2024)	General assistant tasks: combine search, reasoning, file handling, and tool use.	2023, still cited	Answers are public on Hugging Face, so an agent that knows where to look can "solve" most of it without any reasoning.
TAU-bench (Yao et al., 2024)	Customer service style tasks: retail and airline scenarios with simulated users.	2024	Harder to game because the simulated user is unpredictable. One of the more honest benchmarks.
Terminal-Bench	Complex command-line tasks (e.g., configuring a system, building a small program).	2026	New; has the same "agent and grader share an environment" problem as SWE-bench.
MCP-Bench (MCP, 2025)	Can the agent pick the right tool from a set of MCP servers?	2025	Useful for the new protocol-era agents. Numbers still settling.
Windows Agent Arena (MS / arXiv 2024)	Same idea as OSWorld but on Windows, easy to run many instances in parallel on Azure.	2024	Solid infrastructure; widely used in industry research.
OccuBench (arXiv 2026)	Real professional work tasks across a range of jobs.	2026	Newer, broader coverage of "what does an actual office worker do all day?"

Frontier scores at a glance

Treat the table below as a snapshot, not a leaderboard. Numbers move every quarter, sometimes by 10+ percentage points when a new harness lands. Always verify on the official leaderboard before quoting any specific score in a deck. The point of this table is the shape of where each benchmark sits, not its current peak.

Benchmark	Task type	Reported frontier (2025-26 era)	Where to verify
SWE-bench Verified	Real GitHub issues, Python repos	~50-65% with frontier models + scaffolding; cheating reports show some leaderboard scores were inflated	swebench.com
OSWorld	Linux desktop, GUI use	~20-50%; jumped during 2025 with screen-grounding improvements	os-world.github.io
GAIA	General assistant: search + reason + tool use	High but answer-leakage means treat with care	HF leaderboard
TAU-bench	Customer service simulation	Substantially harder; simulated-user variability is the point	sierra-research/tau-bench
Windows Agent Arena	Windows desktop tasks at parallel scale	Scaling is the differentiator, not raw score	microsoft.github.io/WindowsAgentArena
Terminal-Bench	Complex shell sequences	New, scores still settling; same agent-environment-share problem as SWE-bench	laude-institute/terminal-bench
OccuBench	Real professional work across occupations	Broad; reveals where agents are still weak (legal, medical reasoning)	arXiv 2026

Two recurring patterns are worth naming. First, every benchmark whose harness lets the agent execute Python in the grading environment has been gameable so far: SWE-bench, OSWorld, and Terminal-Bench all fall in this category. The Berkeley RDI work cited above showed that a 10-line config file is enough to fake a 100% score on SWE-bench Verified. Second, every benchmark whose answer keys are public on Hugging Face has been gamed by retrieval; GAIA is the prominent example. The benchmarks that hold up under stress are the ones with simulated counterparties (TAU-bench) or sealed environments where the agent cannot see the grader (private SWE-bench harness instances).

Why these benchmark scores can mislead you

The Berkeley study found four patterns that show up across most benchmarks. None of them are sophisticated. They're the kinds of mistakes any of us could make when we set up evaluation:

The agent and the grader share the same environment. If your agent's code runs in the same place the grader is checking for the answer, the agent can just write the answer where the grader expects to find it. Found in SWE-bench, Terminal-Bench, OSWorld.
The answer key is publicly available. WebArena ships reference answers in the task config. OSWorld puts gold-file URLs right in the task metadata. GAIA's answers are downloadable from Hugging Face. If the agent can find the answer key, the benchmark measures Google skills, not reasoning.
The grader runs code that the agent controls. WebArena and OSWorld both call Python's eval() on text the agent generates. That's executing arbitrary code on the grading machine, and the agent gets to choose what code.
Network access isn't blocked. Many evaluation containers default to having internet access. An agent with internet can fetch hints, download solutions, or even ask a human.
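For the third pattern, one possible mitigation is for the grader to parse the agent's text as data instead of executing it. The sketch below is an illustration only, not how any of the benchmarks above are implemented; the function name grade_answer is hypothetical. It uses the standard library's ast.literal_eval, which accepts Python literals but rejects function calls and attribute access.

```python
import ast


def grade_answer(agent_output: str, expected) -> bool:
    """Compare the agent's text to the expected value without executing it."""
    text = agent_output.strip()
    try:
        # literal_eval only builds literals (numbers, strings, lists, dicts,
        # ...). Anything else, such as a function call, raises an error
        # instead of running.
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        # Not a plain literal: fall back to comparing the raw text.
        return text == str(expected)
    return parsed == expected


print(grade_answer("[1, 2, 3]", [1, 2, 3]))
print(grade_answer("__import__('os').system('echo pwned')", 0))
```

The second call is rejected as a non-literal and never runs. Parsing untrusted text still costs CPU and memory, so a real grader would probably also cap the input length and run outside the agent's environment.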

How to read benchmark numbers: when a paper or company announces "70% on SWE-bench Verified", that could mean a great model solving real problems. It could also mean a team found a clever exploit in the harness. Without seeing the actual agent traces and the harness audit logs, you can't tell which it is. Treat single-number leaderboards skeptically, especially when the numbers jump suddenly.

What good evaluation actually looks like

Forget the public benchmarks for a moment. If you're building a real agent, you need evaluation that reflects your use case. Most working teams stack three layers, from cheapest to most expensive:

1 Per-component tests (run on every commit)

Each agent in your system gets its own small test set: a few dozen example inputs paired with the outputs you'd consider correct. Run these on every code change. They catch regressions in single agents fast, before you notice them in the whole-system tests.

Build them like you'd build unit tests for any other code: pick the inputs deliberately, focus on failure modes you've already seen, and don't try to cover every possible input. Quality over quantity.
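A per-component test set can be as plain as a list of input/expected pairs and a loop. The sketch below assumes the agent can be called as a function from input to output and that exact equality is a good enough check; both are simplifications, and the names run_component_tests and toy_agent are made up for illustration.

```python
def run_component_tests(agent_fn, cases):
    """Run one agent over its test set; return (pass_rate, failures)."""
    if not cases:
        raise ValueError("test set is empty")
    failures = []
    for case in cases:
        actual = agent_fn(case["input"])
        if actual != case["expected"]:
            # Keep enough detail to debug the regression later.
            failures.append(
                {"input": case["input"],
                 "expected": case["expected"],
                 "actual": actual}
            )
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures


def toy_agent(text):
    # Stand-in for a real agent call (hypothetical).
    return text.strip().lower()


cases = [
    {"input": "  REFUND ", "expected": "refund"},
    {"input": "Cancel", "expected": "cancel"},
]
pass_rate, failures = run_component_tests(toy_agent, cases)
print(pass_rate, failures)
```

In practice the equality check is usually the first thing to be replaced, for example with a field-by-field comparison or a tolerance for numeric outputs.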

2 End-to-end workflow tests (run nightly)

These check whether the whole system completes a task: given this user request, did the agent get the right answer? Track success rate, how many iterations it took, total cost, and how long the whole thing ran. Per-component tests don't catch problems like agents handing off bad context to each other; these whole-flow tests do.

A good source of test cases: real user conversations from your production logs, with personal info removed. Real traces find edge cases that synthetic test sets never imagine.
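The four numbers named above (success rate, iterations, cost, duration) fit in a small record per run. The following is a minimal sketch; the WorkflowRun fields and the choice of median and mean are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class WorkflowRun:
    task_id: str
    succeeded: bool
    iterations: int
    cost_usd: float
    duration_s: float


def summarize_runs(runs):
    """Aggregate one nightly batch of end-to-end runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    return {
        "success_rate": sum(r.succeeded for r in runs) / len(runs),
        "median_iterations": median([r.iterations for r in runs]),
        "total_cost_usd": sum(r.cost_usd for r in runs),
        "mean_duration_s": mean([r.duration_s for r in runs]),
    }


runs = [
    WorkflowRun("refund-001", True, 3, 0.10, 12.0),
    WorkflowRun("refund-002", False, 9, 0.30, 40.0),
]
print(summarize_runs(runs))
```

Storing the summary for each night makes it possible to compare tonight's batch with last week's instead of reading a single number in isolation.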

3 Production monitoring (always on)

Log every workflow. Sample some for human review. Track per-step cost, latency, and error rate. Build dashboards that answer "is the agent getting better or worse this week?" without anyone having to dig through logs.
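One way to pick the sample for human review is to hash the workflow ID, so the same workflow is always in or out of the sample no matter which service asks. This is a sketch under that assumption; the function name and the 5% default rate are illustrative.

```python
import hashlib


def sampled_for_review(workflow_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of workflows for review."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(workflow_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and compare
    # against the matching integer threshold.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


ids = [f"wf-{i}" for i in range(10_000)]
picked = [i for i in ids if sampled_for_review(i, rate=0.1)]
print(len(picked) / len(ids))
```

Because the decision depends only on the ID, a reviewer can reproduce last week's sample exactly, which random sampling at log time does not allow.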

The MemMachine team's analysis (MemMachine, arXiv 2026) on the LongMemEval benchmark (Wu et al., ICLR 2025) is a useful template. They listed six things they could change (chunking strategy, query wording, context formatting, etc.) and measured the effect of each one separately. Running similar isolated experiments in production gives you a clear answer to "what should I change next?"
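Measuring each change separately amounts to generating configurations that differ from a baseline in exactly one setting. The sketch below shows that bookkeeping only; it is not the MemMachine team's code, and the setting names and values are invented for illustration.

```python
def one_at_a_time(baseline: dict, alternatives: dict) -> list:
    """Return (label, config) pairs that each change a single setting."""
    variants = []
    for key, options in alternatives.items():
        for option in options:
            if option == baseline[key]:
                continue  # identical to the baseline, nothing to measure
            config = dict(baseline)  # copy, so the baseline stays untouched
            config[key] = option
            variants.append((f"{key}={option}", config))
    return variants


# Hypothetical settings, named after the examples in the text.
baseline = {"chunking": "fixed", "query_wording": "raw", "context_format": "plain"}
alternatives = {
    "chunking": ["fixed", "semantic"],
    "query_wording": ["rewritten"],
    "context_format": ["json"],
}
for label, config in one_at_a_time(baseline, alternatives):
    print(label, config)
```

Each variant is then run through the same eval set as the baseline, so any score difference can be attributed to the one setting that changed.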

One number won't tell you the truth

The three layers above tell you whether your agent works on the inputs you tested. They don't tell you whether it works on inputs that are slightly different. The gap between the two is where most production failures live. A model that scores 92% on your eval set might score 51% on the same questions paraphrased, with names changed, or with one number swapped. The 92% gives you no warning about which world you're in.

The fix is to take each test case and create small variations of it that should have the same answer. Run all of them. Then report three numbers instead of one: how it does on average, how it does on the hardest variations, and how fast accuracy drops as you push the variations further. That third number is the important one. If accuracy stays flat as you vary the inputs, the agent has actually learned the task. If accuracy falls off a cliff, it's matching on surface details that won't survive contact with real users.

Variations that almost always reveal something:

Reword the request. Same meaning, different sentence. If the answer changes, the agent is matching the wording, not the intent.
Swap the names. Same question, different people, dates, places. If accuracy drops, the agent has memorized specifics rather than learned a pattern.
Add a few unrelated sentences. A robust agent ignores them. A fragile one gets distracted.
Change the numbers. Make a quantity ten times bigger. The agent should still produce a sensibly-shaped answer; surprisingly often it doesn't.
Run the eval again three months later. The world has moved on, and so has the data. This is the test no one runs and the one production cares about most.
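Three of the variations above are mechanical enough to script; rewording usually needs a model or a human. The helpers below are a rough sketch with hypothetical names. They only touch the input text, so the expected answer has to be updated by the same rule whenever names or numbers change.

```python
import re


def swap_names(text: str, mapping: dict) -> str:
    """Replace whole-word names in a single pass, so A->B and B->A both work."""
    if not mapping:
        return text
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def add_distractors(text: str, distractors: list) -> str:
    """Append unrelated sentences that a robust agent should ignore."""
    return " ".join([text, *distractors])


def scale_numbers(text: str, factor: int = 10) -> str:
    """Multiply standalone integers by `factor`; decimals are left alone."""
    pattern = re.compile(r"(?<![\d.])\d+(?!\.?\d)")
    return pattern.sub(lambda m: str(int(m.group(0)) * factor), text)


case = "Alice asks Bob to ship 12 boxes at 3.5 kg each."
print(swap_names(case, {"Alice": "Bob", "Bob": "Alice"}))
print(scale_numbers(case))
print(add_distractors(case, ["The office kitchen was repainted last week."]))
```

The number pattern is deliberately naive: it will also scale integers inside dates or IDs, so real test cases may need a whitelist of which quantities to vary.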

A good internal report looks like: average 0.91, worst case 0.62, accuracy drops about 0.05 per unit of variation. The third number tells you the agent is brittle even though the average looks fine. Without this kind of evaluation, you'd ship on the 0.91 and find out about the 0.62 from a customer. Running fifty variations per case isn't free, but skipping it costs more.
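The three numbers can be computed from accuracy measured at each variation level. The sketch below assumes levels are integers (0 for the original inputs, higher for heavier variation), takes the average as an unweighted mean across levels, and estimates the drop with an ordinary least-squares slope; all three choices are assumptions, not a standard definition.

```python
def robustness_report(accuracy_by_level: dict) -> dict:
    """Summarize accuracy across variation levels as three numbers."""
    levels = sorted(accuracy_by_level)
    if len(levels) < 2:
        raise ValueError("need at least two variation levels to fit a slope")
    accs = [accuracy_by_level[level] for level in levels]
    n = len(levels)
    mean_x = sum(levels) / n
    mean_y = sum(accs) / n
    # Least-squares slope of accuracy against variation level.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(levels, accs))
    variance = sum((x - mean_x) ** 2 for x in levels)
    slope = covariance / variance
    return {
        "average": mean_y,
        "worst_case": min(accs),
        "drop_per_level": -slope,  # positive means accuracy is falling
    }


# Illustrative numbers only, not measurements.
report = robustness_report({0: 0.9, 1: 0.8, 2: 0.7})
print(report)
```

A drop near zero suggests the agent is stable under variation; a large drop with a healthy average is the brittle case described above.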

Using one model to grade another (with care)

For tasks where there's no single right answer (summarization quality, helpfulness, code style), the most common approach in 2025 is to use a second LLM as the grader. This is sometimes called "LLM-as-judge". It's cheap and easy to set up. It also has well-documented biases you should know about:

Position bias. When comparing two answers side-by-side, the model often prefers whichever one is shown first.
Length bias. Longer responses get rated as more thorough, even when they're not actually better.
Self-preference. If the grader is the same model family as the candidate, it tends to favor its own outputs.
Style beats substance. Nicely-formatted wrong answers can score higher than messy correct ones.

What helps: rotate which model does the grading, randomize order in pairwise comparisons, calibrate against a small set of human-labeled examples once a month, and don't grade with the same model that produced the answer.
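A stricter variant of randomizing order is to ask the judge twice, once in each order, and count a win only when both verdicts agree. The sketch below assumes a judge callable that takes two answers and returns "first" or "second"; that interface is hypothetical and stands in for a real model call.

```python
def debiased_preference(judge, answer_a: str, answer_b: str) -> str:
    """Return 'a', 'b', or 'tie' when the verdict depends on position."""
    verdict_ab = judge(answer_a, answer_b)  # answer_a shown first
    verdict_ba = judge(answer_b, answer_a)  # answer_b shown first
    if verdict_ab == "first" and verdict_ba == "second":
        return "a"
    if verdict_ab == "second" and verdict_ba == "first":
        return "b"
    # The judge picked the same slot both times: position bias, not a win.
    return "tie"


def always_first(first: str, second: str) -> str:
    # Toy judge with pure position bias (for illustration).
    return "first"


def prefers_longer(first: str, second: str) -> str:
    # Toy judge with pure length bias (for illustration).
    return "first" if len(first) >= len(second) else "second"


print(debiased_preference(always_first, "short", "a much longer answer"))
print(debiased_preference(prefers_longer, "short", "a much longer answer"))
```

The second toy judge shows the limit of this check: swapping order removes position bias but leaves length bias intact, which is why calibration against human labels is still needed.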

Reproducibility: boring but critical

If you can't re-run an evaluation tomorrow and get the same answer, you don't know what you measured. A 2026 guide on agent benchmarking (Spheron, 2026) listed six things that quietly break reproducibility:

Container drift. Pin Docker images by digest (the long sha256:... string), not by tag. Tags like :latest can silently change between runs and break your scores.
Random seeds. Fix PYTHONHASHSEED, and set the model's temperature=0 and top_p=1.0. Otherwise outputs can vary from run to run. Note that even at temperature=0, a model version change can shift outputs (chapter 02 covers what determinism really means here); pin the model version too.
Model weight hashes. If you're running a local model, record an md5sum of the weight files. Different checkpoints that share a model name often score differently.
Tool versions. Search APIs, browsers, code runners all have versions. Pin them too. A search-engine update can change your scores tomorrow.
Internet access. Decide once whether the agent has internet during eval. If yes, your scores depend on the live web (which changes daily). If no, you have to mock everything (which adds its own bugs).
The current date. Prompts that mention "today is November 5, 2025" produce different reasoning than ones that say "today is April 1, 2026". Pin the date for replay runs.
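A simple way to keep these six items honest is to write them into a manifest that is stored next to every eval result. The sketch below is one possible layout; the field names, the example model version, and the build_manifest helper are assumptions for illustration.

```python
import hashlib
import json
import os


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a (possibly large) weight file without loading it all at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(image_digest, model_version, seed, pinned_date,
                   tool_versions=None, internet=False, weight_paths=()):
    """Collect everything that has to stay fixed for a replayable eval run."""
    if not image_digest.startswith("sha256:"):
        raise ValueError("pin the container by digest, not by tag")
    return {
        "image_digest": image_digest,
        "model_version": model_version,
        "sampling": {"temperature": 0, "top_p": 1.0, "seed": seed},
        # PYTHONHASHSEED has to be set before the interpreter starts;
        # here it is only recorded.
        "pythonhashseed": os.environ.get("PYTHONHASHSEED"),
        "weights_md5": {path: file_md5(path) for path in weight_paths},
        "tool_versions": tool_versions or {},
        "internet": internet,
        "pinned_date": pinned_date,
    }


manifest = build_manifest(
    image_digest="sha256:" + "0" * 64,    # placeholder digest
    model_version="example-model-2026-01",  # hypothetical version string
    seed=1234,
    pinned_date="2026-04-01",
    tool_versions={"browser": "1.2.3"},     # placeholder
)
print(json.dumps(manifest, indent=2))
```

Comparing two manifests field by field is then a quick first step when a score changes between runs.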

Evaluating multi-agent systems

For a system with multiple agents, the most important metric is still: did the whole thing succeed? But that single number hides a lot. Tracking these in addition gives you better debugging info:

How well each agent does its part. When you call agent X, how often is its output usable by the next step?
Handoff quality. When agent A passes work to agent B, does B have what it needs? Watch for things like "asked for clarification" or "fell back to a default" as red flags.
Iteration count. A workflow that succeeds in 3 turns is healthier than one that succeeds in 15. Climbing iteration count is an early sign something is degrading.
Cost per task, broken down by agent. Don't just track the average. If one agent's cost suddenly spikes, that's often a regression hiding under healthy-looking overall numbers.
How often agents disagree (in voting setups). A steady disagreement rate is normal. A sudden change usually means something shifted upstream.
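The per-agent cost breakdown is a small aggregation over step logs. The sketch below assumes each logged step carries an agent name and a cost; the field names, the agent names, and the 2x spike threshold are illustrative choices.

```python
from collections import defaultdict


def cost_by_agent(steps: list) -> dict:
    """Sum cost per agent from a list of logged steps."""
    totals = defaultdict(float)
    for step in steps:
        totals[step["agent"]] += step["cost_usd"]
    return dict(totals)


def cost_spikes(current: dict, baseline: dict, ratio: float = 2.0) -> list:
    """List agents whose cost exceeds `ratio` times their baseline."""
    return sorted(
        agent
        for agent, cost in current.items()
        if baseline.get(agent, 0) > 0 and cost > ratio * baseline[agent]
    )


# Hypothetical step log for one task.
steps = [
    {"agent": "planner", "cost_usd": 0.02},
    {"agent": "researcher", "cost_usd": 0.15},
    {"agent": "researcher", "cost_usd": 0.20},
    {"agent": "writer", "cost_usd": 0.05},
]
baseline = {"planner": 0.02, "researcher": 0.10, "writer": 0.05}
current = cost_by_agent(steps)
print(current)
print(cost_spikes(current, baseline))
```

Here the total cost per task could still look acceptable while one agent has more than tripled, which is the kind of regression the average hides.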

The trap to watch for: tuning each agent until its individual numbers look great, while the overall system success rate drops. Local optimization can hurt the team. Always keep the whole-workflow success rate as the truth metric, and use the per-agent metrics for figuring out why the workflow failed.

Public benchmarks are a fine starting point. Real evaluation is the boring work of building a held-out test set from real traffic, sampling traces for human review, calibrating your judges, and pinning the things that drift. Skip those and a "70% on SWE-bench" tells you almost nothing about whether your agent works in practice.