Getting Up to Speed on Multi-Agent Systems, Part 7: Benchmarks and What They Miss
If you’ve read this far, you’ve noticed that every paper I’ve discussed has a number next to it. 85.9 percent on HumanEval. 12.5 percent on SWE-bench. 25 percent on TravelPlanner. These numbers do a lot of work in the multi-agent literature, and they also do a surprising amount of harm. This post is about the benchmarks themselves. What they measure. What they don’t. And why ChatDev and MetaGPT can report contradictory results on each other without either one being obviously wrong.
- Part 1. The Landscape
- Part 2. The Vocabulary
- Part 3. Wave 1: Can Agents Coordinate At All?
- Part 4. Wave 2: Why It Breaks
- Part 5. Debate, State, and Coordination
- Part 6. Verification Patterns
- Part 7. Benchmarks and What They Miss (you are here)
- Part 8. Open Questions
The Landscape
Here’s every benchmark that’s come up in the series so far, plus a few that haven’t.
| Benchmark | Domain | What It Tests | Scale | Multi-Agent? | Notable Results |
|---|---|---|---|---|---|
| HumanEval | Code generation | Write a correct Python function from a docstring | 164 tasks | No, single function | MetaGPT 85.9%, AutoDev 91.5% |
| MBPP | Code generation | Entry-level Python from description | 974 tasks | No, single function | MetaGPT 87.7% |
| SWE-bench | Software engineering | Resolve real GitHub issues in real repos | 2,294 (Verified: 500) | Designed for single agent | SWE-agent 12.5%, Devin 13.9% |
| GAIA | General assistant | Multi-step reasoning with tools, web, files | 466 tasks | No, but benefits from parallel tools | AutoGen #1, Magentic-One 38% |
| WebArena | Web tasks | Real websites: shopping, forums, CMS | 812 tasks | Designed for single agent | Magentic-One 32.8% |
| AssistantBench | Assistant tasks | Open-ended web browsing plus reasoning | 214 tasks | Designed for single agent | Magentic-One 13.3% |
| BrowseComp | Web retrieval | Hard information retrieval via deep browsing | ~1,500 tasks | Benefits from parallel search | Anthropic +90% multi vs single |
| TravelPlanner | Constrained planning | Multi-constraint travel planning | 1,225 tasks | Explicitly tests coordination | Ou et al. 25% with notebook + orchestrator |
| Silo-Bench | Distributed coordination | Algorithmic tasks requiring cross-agent synthesis | 30 tasks, 54 configs | Designed for MAS evaluation | Agents fail at synthesis |
Two of these, TravelPlanner and Silo-Bench, were designed with multi-agent evaluation in mind. The other seven were designed for single agents. Multi-agent systems get evaluated on them anyway.
Why This Matters
When you run a multi-agent system on a single-agent benchmark, you’re measuring the wrong thing. HumanEval gives you a pass@1 score. It doesn’t tell you how many tokens you burned to get there. It doesn’t tell you how many agent turns were redundant. It doesn’t tell you what happened when one of your agents got stuck. If you care about coordination quality, none of that shows up in the score.
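For concreteness, here is a minimal sketch of what a pass@1 pipeline actually computes. The estimator is the standard unbiased pass@k formula from the HumanEval paper; the token and turn fields are hypothetical additions I’ve bolted on to show what a coordination-aware score would need and what the benchmark never records.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
    n = samples drawn for a task, c = samples that passed the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Per-task results. Only n and c feed the score; the cost fields are
# hypothetical extras a coordination-aware evaluation would need but a
# pass@1 pipeline simply never records.
tasks = [
    {"id": "HumanEval/0", "n": 5, "c": 4, "tokens_used": 48_000, "agent_turns": 9},
    {"id": "HumanEval/1", "n": 5, "c": 0, "tokens_used": 71_000, "agent_turns": 14},
]

pass_at_1 = sum(pass_at_k(t["n"], t["c"], 1) for t in tasks) / len(tasks)
print(f"pass@1 = {pass_at_1:.3f}")  # same number whether the run cost 10k or 10M tokens
```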
This is why ChatDev and MetaGPT can report contradictory numbers on similar tasks. ChatDev’s paper claims 88 percent executability. MetaGPT’s paper claims 41 percent executability. Different benchmarks, different metrics, different evaluation criteria. Neither paper is obviously lying. Neither paper is obviously right. And the field has no standard way to resolve the contradiction.
When Multi-Agent Actually Helps
If you look across all the benchmark results, a pattern emerges about when multi-agent systems earn their coordination overhead.
Where the Multi-Agent Premium Pays Off
This is the benchmark-level version of the conclusion I’ve been building toward across the whole series. Multi-agent earns its overhead on specific task shapes: breadth-first, parallel-decomposable, state-sharing-friendly. On other task shapes, it costs more than it delivers. The benchmarks, taken together, are unambiguous about this. It’s just that most individual benchmarks can’t show it.
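If you want that conclusion as something executable, here is a toy routing heuristic. Every attribute name and threshold below is my own invention for illustration; none of the papers propose this rule. The only point is that the decision turns on decomposability and state coupling, not on how hard the task is.

```python
from dataclasses import dataclass

@dataclass
class TaskShape:
    """Illustrative task attributes; names and scales are my own, not from any paper."""
    parallel_subtasks: int    # independent pieces the task splits into
    shared_state_writes: int  # how often those pieces must write to common state
    breadth_first: bool       # wide search or retrieval rather than one deep chain

def use_multi_agent(shape: TaskShape) -> bool:
    # Multi-agent tends to earn its overhead on breadth-first work that splits
    # into several loosely coupled pieces (the GAIA / BrowseComp shape).
    if shape.breadth_first and shape.parallel_subtasks >= 3:
        # Tightly coupled shared state eats the parallelism gain, so stay single-agent.
        return shape.shared_state_writes < shape.parallel_subtasks
    return False

print(use_multi_agent(TaskShape(8, 1, True)))   # True: broad, decomposable, loosely coupled
print(use_multi_agent(TaskShape(2, 5, False)))  # False: narrow, sequential, state-heavy
```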
The Benchmark Problem, Stated Plainly
Chen et al.’s survey names three evaluation-level challenges: no standardized benchmarks, no objective metrics, no common framework for individual vs aggregate evaluation. All three are symptoms of the same underlying issue. The field hasn’t agreed on what it’s measuring.
There are a few ways this gets resolved. One is that better MAS-specific benchmarks emerge (TravelPlanner and Silo-Bench are early signs). Another is that production telemetry replaces synthetic benchmarks (Anthropic’s internal research eval). A third is that the field matures enough to distinguish “this benchmark tests single-agent capability” from “this benchmark tests multi-agent capability,” and stops reporting contradictory single-agent numbers as if they were MAS comparisons.
None of these are fully here yet. If you’re reading a paper and the headline number is HumanEval Pass@1, you’re probably looking at a single-agent capability test dressed up as a MAS evaluation. Calibrate accordingly.
What I Look For
When I read a paper with benchmark numbers now, here’s what I check:
- Is this benchmark designed for single agents or multi-agent systems?
- If it’s single-agent, are they comparing against single-agent baselines, or are they using it to claim MAS superiority?
- What’s the token cost of the system? If that number isn’t reported, I assume it’s high.
- Do they report failure rates or just success rates? MAST data tells us this matters.
- Is the cost or failure data the first number they cite, or is it buried behind a friendlier headline score?
Most wave-1 papers fail several of these checks. The agentic coding papers pass them, and wave-2 papers are starting to. This is partly why the post-2024 literature is more trustworthy than the 2023 literature.
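To make the checklist concrete, this is roughly the per-task record I wish MAS papers reported alongside the headline score. All field names are illustrative; they don’t come from any existing benchmark harness, though the failure-mode labels gesture at MAST’s taxonomy.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRunRecord:
    """What a headline benchmark score hides; all field names are illustrative."""
    task_id: str
    solved: bool                   # the only thing pass@1 reports
    total_tokens: int              # coordination overhead, made visible
    agent_turns: int               # how many handoffs it took
    redundant_turns: int           # turns that added no new information
    failure_modes: list[str] = field(default_factory=list)  # MAST-style labels

def summarize(records: list[AgentRunRecord]) -> dict:
    solved = [r for r in records if r.solved]
    return {
        "success_rate": len(solved) / len(records),
        "tokens_per_solve": sum(r.total_tokens for r in solved) / max(len(solved), 1),
        "redundancy": sum(r.redundant_turns for r in records)
                      / max(sum(r.agent_turns for r in records), 1),
        "failure_modes": sorted({m for r in records for m in r.failure_modes}),
    }
```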
Next post, the last in the series: open questions. What’s missing. What’s next. What I’d read if I were doing this again. And the research gap that I keep tripping over: the absence of a rigorous distributed systems foundation underneath the multi-agent AI work.