Getting Up to Speed on Multi-Agent Systems, Part 1: The Landscape

24 Apr 2026

I’ve been reading multi-agent systems papers for weeks trying to figure out where the field actually is, and the honest answer is that it moves fast enough that any single paper is a snapshot, not a map. So this is the map I wish I’d had when I started. It’s a short series of posts meant to get someone up to speed on multi-agent LLM systems without having to read thirty papers first.

Who this is for: you already ship or evaluate LLM agents (tools, long context, basic eval loops) and want the research landscape in view. This is not an on-ramp to transformers or prompting fundamentals.

Before I start, a note on the frame I’m going to use. I’m going to talk about two “waves” of multi-agent research, and one outside disruption that happened between them. I want to be upfront that the waves are a reader aid, not a historical claim. Nobody in 2023 was writing “wave 1” papers, and the field did not convene to name its generation. These aren’t named movements like French New Wave cinema or second-wave feminism, where participants self-consciously defined their work against a prior cohort. What I’m doing is retrospective grouping: the kind you use to keep thirty papers straight in your head.

What the grouping captures is that certain clusters of papers share assumptions, benchmarks, and failure modes. What it misses is that parallel threads exist (debate, simulation, distributed-systems-adjacent work) that don’t fit the wave structure at all, and that plenty of individual papers sit awkwardly between waves. If the framing helps you navigate the literature, keep it. If it gets in the way, drop it. The papers are what matter; the waves are scaffolding.

With that caveat in mind: two rough clusters, one outside disruption that reshaped both, and the two questions the MAS field has been trying to answer.

Wave 1: Can Multiple LLMs Coordinate At All? (2023)

The first wave is the one most people have heard of. A cluster of papers came out in roughly a six-month window in 2023, all answering some version of the same question: if you put multiple LLMs together, can they do something one LLM cannot?

Not every paper in that cluster is “coordination theory” in the same sense: some are explicit software pipelines (ChatDev, MetaGPT), others foreground simulation and believable social dynamics (Generative Agents). I group them anyway because they share 2023-era benchmarks and a similar loose trust that multi-agent structure will carry the task.

Wave 1 · Theory and Architecture
Can multiple LLMs coordinate at all? What's the right shape?

  • Mar 2023 · CAMEL: two agents role-play
  • Apr 2023 · Generative Agents: memory and reflection
  • May 2023 · Debate (Du): competition as coordination
  • Jul 2023 · ChatDev: pairwise chat pipeline
  • Aug 2023 · MetaGPT: artifacts and test execution
  • Aug 2023 · AutoGen: configurable framework
  • Aug 2023 · AgentVerse: dynamic group composition

These papers are all proofs of concept. They show that multi-agent coordination is viable for some task, demonstrate the idea works on a benchmark, and argue that their particular coordination structure beats simpler baselines. CAMEL uses two agents role-playing. Generative Agents uses memory streams and reflection in a social simulation. Du et al. uses multi-round debate between identical model instances. ChatDev chains pairwise dialogues into a software development pipeline. MetaGPT replaces dialogue with structured artifacts and real test execution. AutoGen is a framework for building whatever topology you want. AgentVerse emphasizes dynamic group composition: the system can recruit specialists and reshape the team as the task proceeds.
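
The simplest of these shapes, the CAMEL-style two-agent loop, is easy to sketch. The stub below is illustrative only: `fake_llm`, the `<TASK_DONE>` token handling, and the loop details are my assumptions for a runnable toy, not CAMEL's actual implementation (CAMEL does use a termination token, but a real system would back each turn with a model call and a role-specific system prompt).

```python
# Minimal sketch of a CAMEL-style role-play loop: a "user" agent instructs,
# an "assistant" agent responds, and the loop ends when the user agent
# emits a termination token or the turn budget runs out.

def fake_llm(role: str, history: list[str]) -> str:
    # Stub standing in for a chat-model call with a role-specific prompt.
    turn = len(history)
    if role == "user" and turn >= 4:
        return "<TASK_DONE>"
    return f"{role} message {turn}"

def role_play(task: str, max_turns: int = 10) -> list[str]:
    history = [f"Task: {task}"]
    for _ in range(max_turns):
        instruction = fake_llm("user", history)
        history.append(instruction)
        if "<TASK_DONE>" in instruction:  # user agent signals completion
            break
        history.append(fake_llm("assistant", history))
    return history

transcript = role_play("design a trading bot")
```

The `max_turns` cap is the only guard against non-convergence, which previews the wave-1 assumption below: when the loop stops making progress, it just stops.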

The wave-1 papers share three assumptions that look very different in hindsight. First, they assume the benchmarks they run on are the task. Second, they treat failure as a termination condition: when the system stops converging, it just stops. Third, they trust agents to coordinate without formal concurrency control, shared memory protocols, or recovery paths. These assumptions are where wave 2 would later start to push back.

What Happened Next Door (2024)

Before I get to wave 2, I have to account for a thing that happened in parallel that isn’t really MAS research but reshaped what MAS has to answer for.

Through 2024, a wave of agentic coding systems shipped: Devin, SWE-agent, OpenHands, AutoDev, and Microsoft’s Magentic-One. These are not multi-agent systems in the wave-1 sense. Most of them are single agents with well-designed tool interfaces. The SWE-agent paper in particular showed that interface quality matters more than adding agents: its authors report a 10.7-percentage-point improvement on SWE-bench from interface design alone, without changing the model.

I bring this up because you cannot read the MAS papers from 2025 onward without this context. Wave 1 had implicitly assumed that multi-agent coordination was the default way to solve complex agentic tasks. By late 2024, the agentic coding community had accumulated strong evidence against that default for at least one large, benchmarked slice of the space: autonomous patch-style software engineering (the kind of workflow SWE-bench and its successors measure). That is narrower than “all coding forever,” but it was the center of gravity in 2024 discourse, and it shifted what MAS papers need to beat. Anthropic’s research system post from June 2025 states the conclusion plainly: multi-agent earns its overhead on “breadth-first queries with independent parallel subtasks” and underperforms on “tasks needing shared context, including most coding tasks.”

Why this matters for MAS readers
The agentic coding papers aren't MAS research. But they narrowed the MAS claim. After 2024, "multi-agent for coding" became harder to defend without evidence, and the MAS field's next wave is partly a response to that. If you're reading a post-2024 MAS paper, it's almost certainly arguing implicitly against the single-agent-with-tools baseline that Devin and SWE-agent established. I won't spend a whole post on these systems because they're not MAS, but they belong on your mental map of the landscape.

Magentic-One is the interesting exception. It’s a real multi-agent system with an orchestrator coordinating four specialized workers. It earns its overhead on hard multi-step reasoning (38 percent on GAIA) but not on focused coding. The stuck-counter mechanism it introduces (if an agent loops more than twice, reflect and replan) is one of the few MAS design patterns to surface clearly in the shipped agentic-coding systems of this period. I’ll come back to it in later posts.
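
The stuck-counter idea fits in a few lines. The sketch below is my own toy rendering of the pattern, not Magentic-One's code: the class name, the threshold, and what counts as a "progress state" are all illustrative assumptions.

```python
# Toy version of the stuck-counter pattern: the orchestrator tracks whether
# the observed progress state has changed since the last step, and once the
# counter exceeds a threshold it switches from executing to replanning.

class Orchestrator:
    def __init__(self, stuck_threshold: int = 2):
        self.stuck_threshold = stuck_threshold
        self.stuck_count = 0
        self.last_state = None

    def observe(self, progress_state: str) -> str:
        if progress_state == self.last_state:
            self.stuck_count += 1          # no visible progress this step
        else:
            self.stuck_count = 0
            self.last_state = progress_state
        if self.stuck_count > self.stuck_threshold:
            self.stuck_count = 0
            return "replan"                # reflect and produce a new plan
        return "continue"
```

The point of the pattern is that looping is detected at the system level, from observable progress, rather than trusting any single agent to notice it is stuck.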

Wave 2: Why Does It Break? (2025 and Beyond)

The second MAS wave is where the field is now. With wave-1 systems running in production and the agentic coding turn having clarified when MAS is and isn’t the right tool, people started asking: when MAS does fail, why? And how do we even test that?

The Wave 2 timeline below is illustrative: a handful of papers that typify a shift toward measurement, taxonomies, and fault injection, not an attempt to enumerate every 2025–2026 contribution.

Wave 2 · Why Does It Break?
MAS works sometimes. Now: why does it fail? How do you test reliability?

  • Mar 2025 · MAST (Cemri): 14 failure modes, 1,600 traces
  • Jun 2025 · Anthropic Research: production orchestrator-worker
  • Aug 2025 · Info Sharing in Planning: shared notebook on travel planning
  • Feb 2026 · MAS-FIRE: systematic fault injection for MAS
  • Mar 2026 · Silo-Bench: communication-reasoning gap

The MAST paper from Cemri and collaborators is the one I keep coming back to. They annotated 1,600 traces across seven popular multi-agent frameworks and built a taxonomy of 14 failure modes. Every framework they tested had failure rates between 41 and 87 percent. The top three failures are step repetition, reasoning-action mismatch, and being unaware of termination conditions. These are not model capability problems. They are system design problems.

MAS-FIRE goes the other direction. Instead of observing failures in the wild, they inject them on purpose. Fifteen fault types across intra-agent and inter-agent categories, three injection mechanisms, and a dual-level reliability metric. The most interesting result is what they call the capability paradox: GPT-5’s strict instruction compliance becomes a liability under “Blind Trust” faults, where DeepSeek-V3’s less compliant behavior holds up better.

Silo-Bench adds the third leg: 1,620 experiments showing that agents successfully form coordination topologies and actively exchange information, yet systematically fail to synthesize distributed state into correct answers. The bottleneck is not communication. The bottleneck is reasoning over distributed state.

What Each Wave Trusts

If you want a one-sentence read on each wave, this is it.

  • Wave 1: trusts that role structure and dialogue are enough.
  • Outside (agentic coding): trusts that good tools beat agent count.
  • Wave 2: trusts nothing and measures what breaks.

Read as a progression, the rough arc is: tell agents to coordinate, then notice that sometimes you don’t need them to, then measure what happens when you do. Each step trusts the agents less and verifies more than the one before.

What the Waves Don’t Capture

The two-wave framing is tidy, which should make you suspicious. Some things it flattens or misses:

Parallel threads run outside the waves entirely. Du et al. appears on the Wave 1 timeline because it landed in the same window and poses the same headline question (many LLMs versus one), but the debate line of work is its own cluster: multiple instances of the same model arguing, with a citation graph that barely touches coordination-theory papers and rarely gets cited back by them. The timeline entry is chronology, not a claim that debate belongs in ChatDev’s intellectual neighborhood; Part 5 treats debate as that parallel thread. Generative Agents is a social simulation paper that sits uncomfortably in wave 1 because it came out in 2023, but its descendants (game agents, persona simulations) are their own research community. Distributed-systems-adjacent work on shared state and coordination avoidance is a separate thread that’s only now starting to touch the MAS literature.

Individual papers don’t fit cleanly. AutoGen is a 2023 paper that kept evolving through 2024. The Anthropic research post is engineering content, not a research paper. Magentic-One sits between the agentic coding turn and the reliability wave. Calling any of these “wave 1” or “wave 2” is a judgment call, not a fact.

Keep all of that in mind as you read the rest of the series. The waves are a way to group papers for comprehension, not a claim about how the field evolved.

What’s Coming

The next seven posts build on this landscape.

Part 2 covers the vocabulary the field uses for itself. Three surveys have done the work of consolidating the shared terms, and once you know the vocabulary you can read any paper in the field at a glance.

Parts 3 and 4 go deep on the two waves. Part 3 is the canonical coordination-theory papers: CAMEL, ChatDev, MetaGPT, AutoGen, AgentVerse, what each one actually builds and where each one quietly disagrees with the others. Part 4 is the reliability wave: MAST, MAS-FIRE, Silo-Bench, and what happens when you try to measure a multi-agent system honestly.

Part 5 covers the parallel threads. Multi-agent debate (Du, Liang), shared state as coordination (Ou et al.), and the CALM theorem as a bridge between distributed systems and multi-agent AI.

Parts 6 and 7 are cross-cutting. Part 6 is verification patterns, including Cursor’s visual feedback loop, which is the most interesting production-scale verification pattern I’ve seen and isn’t in any of the papers. Part 7 is benchmarks, what they measure, what they miss, and why ChatDev and MetaGPT can report contradictory results on each other without either being obviously wrong.

Part 8 is what I think is still missing, what’s worth stealing from adjacent fields, and what I’d read if I had to start over.

Next post: the vocabulary.

Errata and revisions

First published 2026-04-24. The list below logs substantive edits so early readers can see what moved.

  • 2026-04-26 · Agentic coding claim: Replaced “empirically falsified” framing with scoped wording: strong evidence on benchmarked autonomous patch-style software engineering (SWE-bench-style workflows), not a universal verdict on multi-agent for all coding.
  • 2026-04-26 · Audience: Added a short “who this is for” note (assumes you already work with LLM agents).
  • 2026-04-26 · Debate vs. timeline: Clarified that Du et al. on the Wave 1 timeline reflects release window and headline question; the broader debate thread is parallel and is how Part 5 uses “debate.”
  • 2026-04-26 · AgentVerse: Described in the narrative, not only in the timeline table (with arXiv link). The Part 3 preview in “What’s Coming” now names it with the other coordination papers.
  • 2026-04-26 · Wave 2 timeline: Explicitly labeled as illustrative examples, not a complete survey of the period.
  • 2026-04-26 · Magentic-One stuck counter: Scoped the “design pattern” remark to shipped agentic-coding systems of the period (not a claim about all of CS or all MAS research).
  • 2026-04-26 · Wave 1 heterogeneity: Added a paragraph distinguishing pipeline-style coordination papers from simulation-heavy work in the same calendar cluster.
  • 2026-04-26 · Waves caveat: Tightened prose; no change to the underlying claim that waves are retrospective scaffolding.

Discussion on Bluesky

Replies to this post on Bluesky appear below. Reply there to join the conversation.

Loading discussion…