Getting Up to Speed on Multi-Agent Systems, Part 7: Benchmarks and What They Miss
If you’ve read this far, you’ve noticed that every paper I’ve discussed has a number next to it. 85.9 percent on HumanEval. 12.5 percent on SWE-bench. 25 percent on TravelPlanner. These numbers do a lot of work in the multi-agent literature, and they also do a surprising amount of harm. This post is about the benchmarks themselves. What they measure. What they don’t. And why ChatDev and MetaGPT can report contradictory results on each other without either one being obviously wrong.
- Part 1. The Landscape
- Part 2. The Vocabulary
- Part 3. Wave 1: Can Agents Coordinate At All?
- Part 4. Wave 2: Why It Breaks
- Part 5. Debate, State, and Coordination
- Part 6. Verification Patterns
- Part 7. Benchmarks and What They Miss (you are here)
- Part 8. Open Questions
The Landscape
Here’s every benchmark that’s come up in the series so far, plus a few that haven’t.
| Benchmark | Domain | What It Tests | Scale | Multi-Agent? | Notable Results |
|---|---|---|---|---|---|
| HumanEval | Code generation | Write a correct Python function from a docstring | 164 tasks | No, single function | MetaGPT 85.9%, AutoDev 91.5% |
| MBPP | Code generation | Entry-level Python from description | 974 tasks | No, single function | MetaGPT 87.7% |
| SWE-bench | Software engineering | Resolve real GitHub issues in real repos | 2,294 (Verified: 500) | Designed for single agent | SWE-agent 12.5%, Devin 13.9% |
| GAIA | General assistant | Multi-step reasoning with tools, web, files | 466 tasks | No, but benefits from parallel tools | AutoGen #1, Magentic-One 38% |
| WebArena | Web tasks | Real websites: shopping, forums, CMS | 812 tasks | Designed for single agent | Magentic-One 32.8% |
| AssistantBench | Assistant tasks | Open-ended web browsing plus reasoning | 214 tasks | Designed for single agent | Magentic-One 13.3% |
| BrowseComp | Web retrieval | Hard information retrieval via deep browsing | ~1,500 tasks | Benefits from parallel search | Anthropic +90% multi vs single |
| TravelPlanner | Constrained planning | Multi-constraint travel planning | 1,225 tasks | Explicitly tests coordination | Ou et al. 25% with notebook + orchestrator |
| Silo-Bench | Distributed coordination | Algorithmic tasks requiring cross-agent synthesis | 30 tasks, 54 configs | Designed for MAS evaluation | Agents fail at synthesis |
Two of these, TravelPlanner and Silo-Bench, were designed with multi-agent evaluation in mind. The other seven were designed for single agents. Multi-agent systems get evaluated on them anyway.
Why This Matters
When you run a multi-agent system on a single-agent benchmark, you’re measuring the wrong thing. HumanEval gives you a pass@1 score. It doesn’t tell you how many tokens you burned to get there. It doesn’t tell you how many agent turns were redundant. It doesn’t tell you what happened when one of your agents got stuck. If you care about coordination quality, none of that shows up in the score.
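For concreteness, here is a minimal sketch of what a pass@1 pipeline actually computes. The estimator is the standard unbiased pass@k formula from the HumanEval paper; the token and turn fields are hypothetical additions I’ve bolted on to show what a coordination-aware score would need and what the benchmark never records.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
    n = samples drawn for a task, c = samples that passed the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Per-task results. Only n and c feed the score; the cost fields are
# hypothetical extras a coordination-aware evaluation would need but a
# pass@1 pipeline simply never records.
tasks = [
    {"id": "HumanEval/0", "n": 5, "c": 4, "tokens_used": 48_000, "agent_turns": 9},
    {"id": "HumanEval/1", "n": 5, "c": 0, "tokens_used": 71_000, "agent_turns": 14},
]

pass_at_1 = sum(pass_at_k(t["n"], t["c"], 1) for t in tasks) / len(tasks)
print(f"pass@1 = {pass_at_1:.3f}")  # same number whether the run cost 10k or 10M tokens
```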
This is why ChatDev and MetaGPT can report contradictory numbers on similar tasks. ChatDev’s paper claims 88 percent executability. MetaGPT’s paper claims 41 percent executability. Different benchmarks, different metrics, different evaluation criteria. Neither paper is obviously lying. Neither paper is obviously right. And the field has no standard way to resolve the contradiction.
When Multi-Agent Actually Helps
If you look across all the benchmark results, a pattern emerges about when multi-agent systems earn their coordination overhead.
Where the Multi-Agent Premium Pays Off
This is the benchmark-level version of the conclusion I’ve been building toward across the whole series. Multi-agent earns its overhead on specific task shapes: breadth-first, parallel-decomposable, state-sharing-friendly. On other task shapes, it costs more than it delivers. The benchmarks, taken together, are unambiguous about this. It’s just that most individual benchmarks can’t show it.
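If you want that conclusion as something executable, here is a toy routing heuristic. Every attribute name and threshold below is my own invention for illustration; none of the papers propose this rule. The only point is that the decision turns on decomposability and state coupling, not on how hard the task is.

```python
from dataclasses import dataclass

@dataclass
class TaskShape:
    """Illustrative task attributes; names and scales are my own, not from any paper."""
    parallel_subtasks: int    # independent pieces the task splits into
    shared_state_writes: int  # how often those pieces must write to common state
    breadth_first: bool       # wide search or retrieval rather than one deep chain

def use_multi_agent(shape: TaskShape) -> bool:
    # Multi-agent tends to earn its overhead on breadth-first work that splits
    # into several loosely coupled pieces (the GAIA / BrowseComp shape).
    if shape.breadth_first and shape.parallel_subtasks >= 3:
        # Tightly coupled shared state eats the parallelism gain, so stay single-agent.
        return shape.shared_state_writes < shape.parallel_subtasks
    return False

print(use_multi_agent(TaskShape(8, 1, True)))   # True: broad, decomposable, loosely coupled
print(use_multi_agent(TaskShape(2, 5, False)))  # False: narrow, sequential, state-heavy
```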
The Benchmark Problem, Stated Plainly
Chen et al.’s survey names three evaluation-level challenges: no standardized benchmarks, no objective metrics, no common framework for individual vs aggregate evaluation. All three are symptoms of the same underlying issue. The field hasn’t agreed on what it’s measuring.
There are a few ways this gets resolved. One is that better MAS-specific benchmarks emerge (TravelPlanner and Silo-Bench are early signs). Another is that production telemetry replaces synthetic benchmarks (Anthropic’s internal research eval). A third is that the field matures enough to distinguish “this benchmark tests single-agent capability” from “this benchmark tests multi-agent capability,” and stops reporting contradictory single-agent numbers as if they were MAS comparisons.
None of these are fully here yet. If you’re reading a paper and the headline number is HumanEval Pass@1, you’re probably looking at a single-agent capability test dressed up as a MAS evaluation. Calibrate accordingly.
What I Look For
When I read a paper with benchmark numbers now, here’s what I check:
- Is this benchmark designed for single agents or multi-agent systems?
- If it’s single-agent, are they comparing against single-agent baselines, or are they using it to claim MAS superiority?
- What’s the token cost of the system? If that number isn’t reported, I assume it’s high.
- Do they report failure rates or just success rates? MAST data tells us this matters.
- Is the cost or failure data the first number they cite, or is it buried behind a friendlier headline score?
Most wave-1 papers fail several of these checks. The agentic coding papers pass them, and wave-2 papers are starting to. This is partly why the post-2024 literature is more trustworthy than the 2023 literature.
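To make the checklist concrete, this is roughly the per-task record I wish MAS papers reported alongside the headline score. All field names are illustrative; they don’t come from any existing benchmark harness, though the failure-mode labels gesture at MAST’s taxonomy.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRunRecord:
    """What a headline benchmark score hides; all field names are illustrative."""
    task_id: str
    solved: bool                   # the only thing pass@1 reports
    total_tokens: int              # coordination overhead, made visible
    agent_turns: int               # how many handoffs it took
    redundant_turns: int           # turns that added no new information
    failure_modes: list[str] = field(default_factory=list)  # MAST-style labels

def summarize(records: list[AgentRunRecord]) -> dict:
    solved = [r for r in records if r.solved]
    return {
        "success_rate": len(solved) / len(records),
        "tokens_per_solve": sum(r.total_tokens for r in solved) / max(len(solved), 1),
        "redundancy": sum(r.redundant_turns for r in records)
                      / max(sum(r.agent_turns for r in records), 1),
        "failure_modes": sorted({m for r in records for m in r.failure_modes}),
    }
```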
Next post, the last in the series: open questions. What’s missing. What’s next. What I’d read if I were doing this again. And the research gap that I keep tripping over: the absence of a rigorous distributed systems foundation underneath the multi-agent AI work.