Getting Up to Speed on Multi-Agent Systems, Part 7: Benchmarks and What They Miss

30 Apr 2026

If you’ve read this far, you’ve noticed that every paper I’ve discussed has a number next to it. 85.9 percent on HumanEval. 12.5 percent on SWE-bench. 25 percent on TravelPlanner. These numbers do a lot of work in the multi-agent literature, and they also do a surprising amount of harm. This post is about the benchmarks themselves. What they measure. What they don’t. And why ChatDev and MetaGPT can report contradictory results on each other without either one being obviously wrong.

The Landscape

Here’s every benchmark that’s come up in the series so far, plus a few that haven’t.

| Benchmark | Domain | What It Tests | Scale | Multi-Agent? | Notable Results |
|---|---|---|---|---|---|
| HumanEval | Code generation | Write a correct Python function from a docstring | 164 tasks | No, single function | MetaGPT 85.9%, AutoDev 91.5% |
| MBPP | Code generation | Entry-level Python from description | 974 tasks | No, single function | MetaGPT 87.7% |
| SWE-bench | Software engineering | Resolve real GitHub issues in real repos | 2,294 (Verified: 500) | Designed for single agent | SWE-agent 12.5%, Devin 13.9% |
| GAIA | General assistant | Multi-step reasoning with tools, web, files | 466 tasks | Yes, benefits from parallel tools | AutoGen #1, Magentic-One 38% |
| WebArena | Web tasks | Real websites: shopping, forums, CMS | 812 tasks | Designed for single agent | Magentic-One 32.8% |
| AssistantBench | Assistant tasks | Open-ended web browsing plus reasoning | 214 tasks | Designed for single agent | Magentic-One 13.3% |
| BrowseComp | Web retrieval | Hard information retrieval via deep browsing | ~1,500 tasks | Benefits from parallel search | Anthropic +90% multi vs single |
| TravelPlanner | Constrained planning | Multi-constraint travel planning | 1,225 tasks | Explicitly tests coordination | Ou et al. 25% with notebook + orchestrator |
| Silo-Bench | Distributed coordination | Algorithmic tasks requiring cross-agent synthesis | 30 tasks, 54 configs | Designed for MAS evaluation | Agents fail at synthesis |

Two of these, TravelPlanner and Silo-Bench, were designed with multi-agent evaluation in mind. The other seven were designed for single agents. Multi-agent systems get evaluated on them anyway.

Why This Matters

When you run a multi-agent system on a single-agent benchmark, you’re measuring the wrong thing. HumanEval gives you a pass@1 score. It doesn’t tell you how many tokens you burned to get there. It doesn’t tell you how many agent turns were redundant. It doesn’t tell you what happened when one of your agents got stuck. If you care about coordination quality, none of this information is in the score.
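For a sense of how little survives the reduction, here is the standard unbiased pass@k estimator from the original HumanEval paper, as it is commonly implemented. Given n generated samples per problem, of which c pass the unit tests, it collapses an entire run into one probability; tokens, turns, and coordination never enter the calculation.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples drawn from n generated solutions passes,
    given that c of the n solutions pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

With n = 1 and k = 1, this is just the fraction of problems solved on the first try. That single number is the entire output of the benchmark.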

This is why ChatDev and MetaGPT can report contradictory numbers on similar tasks. ChatDev’s paper claims 88 percent executability. MetaGPT’s paper claims 41 percent executability. Different benchmarks, different metrics, different evaluation criteria. Neither paper is obviously lying. Neither paper is obviously right. And the field has no standard way to resolve the contradiction.

What single-agent benchmarks can't measure
Coordination quality. Communication overhead. Redundant work between agents. Recovery behavior when one agent fails. The token cost of the coordination itself. How performance degrades with scale. These are the things that distinguish multi-agent systems from single agents. And they're invisible to HumanEval, SWE-bench, and every other benchmark designed around "does the output match the expected answer."
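To make the gap concrete, here is a rough sketch of the kind of per-run record a coordination-aware benchmark would have to emit. The field names are illustrative, mine rather than taken from any existing harness; the point is how much of it single-agent scoring throws away.

```python
from dataclasses import dataclass

@dataclass
class MASRunRecord:
    """Hypothetical per-task record for a multi-agent evaluation.
    Single-agent benchmarks keep only `solved`; everything else is discarded."""
    task_id: str
    solved: bool                # the only field a pass@1 score ever sees
    total_tokens: int           # what it cost to get there
    coordination_tokens: int    # tokens spent on inter-agent messages
    agent_turns: int            # total turns across all agents
    redundant_turns: int        # turns that duplicated another agent's work
    agent_failures: int         # agents that stalled, looped, or derailed
    recovered_failures: int     # of those, how many the system recovered from
    wall_clock_seconds: float = 0.0
```

Aggregate metrics like coordination overhead (coordination_tokens / total_tokens) or recovery rate (recovered_failures / agent_failures) fall straight out of a record like this. None of them can be recovered from a pass/fail score.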

When Multi-Agent Actually Helps

If you look across all the benchmark results, a pattern emerges about when multi-agent systems earn their coordination overhead.

Where the Multi-Agent Premium Pays Off

Helps
- Breadth-first search (BrowseComp: +90%)
- Hard multi-step reasoning (GAIA Level 3: 2x)
- Constrained planning with state sharing (TravelPlanner: 3.3x)
- Independent parallel subtasks

Doesn't help
- Focused coding (SWE-bench: single agents win)
- Tasks needing shared context (most coding)
- Simple function generation (HumanEval: overhead not worth it)
- Distributed reasoning / synthesis (Silo-Bench)

This is the benchmark-level version of the conclusion I’ve been building toward across the whole series. Multi-agent earns its overhead on specific task shapes: breadth-first, parallel-decomposable, state-sharing-friendly. On other task shapes, it costs more than it delivers. The benchmarks, taken together, are unambiguous about this. It’s just that most individual benchmarks can’t show it.
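If you want that pattern as something you can argue with, here is a toy decision heuristic distilled from the two lists above. The factor names and the order of the checks are mine; a real decision would also weigh token cost and latency budgets.

```python
def recommend_architecture(
    breadth_first: bool,
    parallel_decomposable: bool,
    needs_shared_context: bool,
    needs_cross_agent_synthesis: bool,
) -> str:
    """Toy heuristic mirroring the Helps / Doesn't help lists above."""
    # Shared context and synthesis-heavy tasks are where MAS loses
    # (SWE-bench-style coding, Silo-Bench-style distributed reasoning).
    if needs_shared_context or needs_cross_agent_synthesis:
        return "single agent"
    # Breadth-first or independently parallel work is where coordination
    # overhead pays off (BrowseComp, GAIA Level 3, TravelPlanner).
    if breadth_first or parallel_decomposable:
        return "multi-agent"
    # Default: don't pay for coordination the task shape doesn't reward.
    return "single agent"
```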

The Benchmark Problem, Stated Plainly

The benchmark gap
Most widely used benchmarks (HumanEval, MBPP, SWE-bench, WebArena) were designed for single agents. Multi-agent systems get shoehorned into them, but the benchmarks can't measure coordination quality, communication overhead, or failure recovery, which are exactly the things that distinguish MAS from single agents. TravelPlanner and Silo-Bench are rare exceptions that explicitly test multi-agent dynamics. ChatDev and MetaGPT reporting contradictory results on each other is a direct consequence of this gap.

Chen et al.’s survey names three evaluation-level challenges: no standardized benchmarks, no objective metrics, no common framework for individual vs aggregate evaluation. All three are symptoms of the same underlying issue. The field hasn’t agreed on what it’s measuring.

There are a few ways this gets resolved. One is that better MAS-specific benchmarks emerge (TravelPlanner and Silo-Bench are early signs). Another is that production telemetry replaces synthetic benchmarks (Anthropic’s internal research eval is an early example). A third is that the field matures enough to distinguish “this benchmark tests single-agent capability” from “this benchmark tests multi-agent capability,” and stops reporting contradictory single-agent numbers as if they were MAS comparisons.

None of these are fully here yet. If you’re reading a paper and the headline number is HumanEval Pass@1, you’re probably looking at a single-agent capability test dressed up as a MAS evaluation. Calibrate accordingly.

What I Look For

When I read a paper with benchmark numbers now, here’s what I check:

  1. Is this benchmark designed for single agents or multi-agent systems?
  2. If it’s single-agent, are they comparing against single-agent baselines, or are they using it to claim MAS superiority?
  3. What’s the token cost of the system? If that number isn’t reported, I assume it’s high.
  4. Do they report failure rates or just success rates? MAST data tells us this matters.
  5. Is this the first number they cite, or are they hiding it behind a friendlier number?

Most wave-1 papers fail several of these checks. The agentic coding papers pass them, and wave-2 papers are starting to. This is partly why the post-2024 literature is more trustworthy than the 2023 literature.

Next post, the last in the series: open questions. What’s missing. What’s next. What I’d read if I were doing this again. And the research gap that I keep tripping over: the absence of a rigorous distributed systems foundation underneath the multi-agent AI work.