Getting Up to Speed on Multi-Agent Systems, Part 4: Wave 2 (Why It Breaks)
By 2025, two things had happened. Wave-1 architectures were running in production (Anthropic had shipped its research system; the open-source ecosystem around orchestrator-worker patterns was maturing). The agentic coding turn had made clear that multi-agent was not the right tool for focused coding, and narrowed the interesting MAS question to “when we do use it, why does it break?”
This wave is where I find the literature most useful, because it’s where empirical work finally catches up with the claims of wave 1.
MAST: 14 Failure Modes from 1,600 Traces
1,600 annotated traces across 7 frameworks. First empirical taxonomy of why MAS break.
- 1,600+ traces across MetaGPT, ChatDev, HyperAgent, AppWorld, AG2, Magentic-One, OpenManus
- 41 to 87 percent failure rates across all frameworks; systemic, not isolated
- Top 3 failures: step repetition (15.7 percent), reasoning-action mismatch (13.2 percent), unaware of termination (12.4 percent)
- Systems with explicit verifiers (MetaGPT, ChatDev) had fewer failures
- Inter-agent failures require "theory of mind"; agents can't model each other's information needs
- Adding high-level objective verification gave +15.6 percent improvement
- LLM-as-Judge pipeline: 94 percent accuracy against human experts
The MAST paper is the one I keep coming back to. It’s the first rigorous empirical study of multi-agent failures. The authors took 1,600 execution traces from seven popular multi-agent frameworks, annotated each one with human experts, and built a 14-mode failure taxonomy.
The headline number is that every framework they tested had failure rates between 41 and 87 percent. Every single one. These are the production frameworks. These are the systems people cite in their papers. And they fail almost as often as they succeed.
MAST's 14 Failure Modes
The taxonomy matters because it lets you diagnose specific failures. When someone says “my agent got stuck in a loop,” you can now ask whether that’s step repetition (FM-1.3) or conversation reset (FM-2.1). Those have different causes and different fixes.
The finding that the paper downplays but I think is most important: the inter-agent failures are the hardest to fix. FC1 issues are prompt engineering problems. You can get meaningful improvements by rewriting role specifications. FC2 issues require what the paper calls “theory of mind,” meaning the agents don’t accurately model each other’s information needs. Prompt fixes don’t help there. The solutions are structural.
MAS-FIRE: Fault Injection for Multi-Agent Systems
The first systematic fault injection framework for LLM-based MAS.
- 15 fault types: 8 intra-agent (planning, memory, reasoning, action) + 7 inter-agent (config, instruction, communication)
- Three injection mechanisms: prompt modification, response rewriting, message routing manipulation
- Tested on MetaGPT, Table-Critic, CAMEL with GPT-5 and DeepSeek-V3
- Key finding: config and instruction faults are catastrophic (Robustness Score = 0 percent for Blind Trust on MetaGPT)
- Capability paradox: GPT-5's strict compliance hurts under Blind Trust (6.3 percent) vs DeepSeek-V3 (70.6 percent)
- Linear pipelines extremely vulnerable; iterative architectures resilient (79-91 percent)
- Shared message pools neutralize memory faults (+25 percent advantage)
MAST is observational. You watch real failures and categorize them. MAS-FIRE is the complement: you inject failures on purpose and measure how the system handles them. This is standard practice in distributed systems (Chaos Engineering, Jepsen) but it’s new for LLM agents.
The taxonomy is worth reading carefully.
MAS-FIRE's 15 Fault Types
The capability paradox is the finding I find most provocative. GPT-5 is a stronger model than DeepSeek-V3 by most benchmarks. But under the “Blind Trust” fault (where one agent is told to unconditionally accept instructions from another), GPT-5 fails almost completely (6.3 percent robustness) while DeepSeek-V3 holds up (70.6 percent). Why? Because GPT-5 is better at following instructions. It’s also better at following bad instructions. Strict compliance is a liability when you can’t trust the source.
The implication for system design is that you want agents that can question upstream inputs. Not agents that just obey.
Silo-Bench: Communication Isn’t Reasoning
1,620 experiments showing agents can communicate but can't reason about distributed state.
- 30 algorithmic tasks across 3 communication complexity tiers, 54 configs, 1,620 experiments
- Central finding: agents form correct coordination topologies and actively exchange information
- But they systematically fail to synthesize distributed state into correct answers
- Bottleneck is information integration, not information acquisition
- Coordination overhead increases with agent scale, eventually eliminating parallelization benefits
- Merely increasing agent count cannot circumvent context limitations
Silo-Bench is the paper I wish more people would read. The finding is simple and surprising. When you give multiple LLM agents a problem that requires distributed reasoning, they do all the communication correctly. They build the right topology. They exchange the right information. Then they fail to synthesize what they’ve gathered into a correct answer.
The bottleneck isn’t the network. The bottleneck is the integration. Each agent has received the necessary pieces, and each agent individually fails to combine those pieces into the answer. This is not a coordination problem in the distributed systems sense. It’s a reasoning problem that the coordination can’t compensate for.
For wave-1 architectures, this result is devastating. The whole argument for agents-debating-each-other was that two agents looking at the same problem from different angles could synthesize a better answer than one agent alone. Silo-Bench says: maybe sometimes, but not for information-integration tasks, which is most tasks.
What Wave 3 Adds That Wave 1 and 2 Missed
Wave 3's New Contributions
The critical thing wave 2 does is separate “this is a coordination problem” from “this is a reasoning problem.” Wave 1 assumed everything was a coordination problem. Wave 2 says: if your agents can’t synthesize distributed state individually, no amount of better message passing will save you. Fix the reasoning first. Coordinate second.
Next post: debate, state, and the CALM theorem. Three papers on whether agents should agree, disagree, or just share a notebook. And a theoretical result from distributed systems that explains which choice makes sense when.