Getting Up to Speed on Multi-Agent Systems, Part 4: Wave 2 (Why It Breaks)

27 Apr 2026

By 2025, two things had happened. Wave-1 architectures were running in production (Anthropic had shipped its research system; the open-source ecosystem around orchestrator-worker patterns was maturing). The agentic coding turn had made clear that multi-agent was not the right tool for focused coding, and narrowed the interesting MAS question to “when we do use it, why does it break?”

This wave is where I find the literature most useful, because it’s where empirical work finally catches up with the claims of wave 1.

MAST: 14 Failure Modes from 1,600 Traces

Why Do Multi-Agent LLM Systems Fail? (MAST) arXiv 2503.13657 · NeurIPS 2025 D&B

1,600 annotated traces across 7 frameworks. First empirical taxonomy of why MAS break.

Core contribution: MAST taxonomy of 14 failure modes in 3 categories
  • 1,600+ traces across MetaGPT, ChatDev, HyperAgent, AppWorld, AG2, Magentic-One, OpenManus
  • 41 to 87 percent failure rates across all frameworks; systemic, not isolated
  • Top 3 failures: step repetition (15.7 percent), reasoning-action mismatch (13.2 percent), unaware of termination (12.4 percent)
  • Systems with explicit verifiers (MetaGPT, ChatDev) had fewer failures
  • Inter-agent failures require "theory of mind"; agents can't model each other's information needs
  • Adding high-level objective verification gave +15.6 percent improvement
  • LLM-as-Judge pipeline: 94 percent accuracy against human experts
Cemri, Pan, Yang, Agrawal, Chopra, Tiwari, Keutzer, Parameswaran, Klein, Ramchandran, Zaharia, Gonzalez, Stoica

The MAST paper is the one I keep coming back to. It’s the first rigorous empirical study of multi-agent failures. The authors took 1,600 execution traces from seven popular multi-agent frameworks, annotated each one with human experts, and built a 14-mode failure taxonomy.
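The scaling trick behind that annotation effort, per the summary above, is an LLM-as-Judge pipeline validated at 94 percent agreement with human experts: replay each trace past a judge model that has the taxonomy in its prompt. A minimal sketch of that loop, with a deterministic keyword stub standing in for the real model call; every name here is illustrative, not from the paper's code:

```python
# Sketch of the trace-annotation loop, scaled with an LLM judge. The real
# pipeline prompts a model with the full taxonomy and the trace; `judge`
# below is a deterministic keyword stub standing in for that model call.

JUDGE_PROMPT = (
    "You are given a multi-agent execution trace and a taxonomy of 14 "
    "failure modes. List every failure mode that occurs, or 'none'.\n\n"
    "Trace:\n{trace}"
)

def judge(prompt: str) -> list[str]:
    """Stand-in for the LLM-as-Judge call (94% agreement with experts)."""
    labels = []
    if "repeating step" in prompt:
        labels.append("step repetition")
    if "skipped verification" in prompt:
        labels.append("no/incomplete verification")
    return labels or ["none"]

def annotate(traces: list[str]) -> dict[str, list[str]]:
    """Label each trace with the failure modes the judge detects."""
    return {t: judge(JUDGE_PROMPT.format(trace=t)) for t in traces}
```

The point of the stub is the shape of the pipeline, not the detection logic: swap `judge` for a model call and you have the paper's scaling mechanism.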

The headline number is that every framework they tested had failure rates between 41 and 87 percent. Every single one. These are the production frameworks. These are the systems people cite in their papers. And they fail almost as often as they succeed.

MAST's 14 Failure Modes

FC1 System Design
  • Disobey task specification (11.8%)
  • Disobey role specification (1.5%)
  • Step repetition (15.7%)
  • Loss of conversation history (2.8%)
  • Unaware of termination (12.4%)

FC2 Inter-Agent
  • Conversation reset (2.2%)
  • Fail to ask for clarification (6.8%)
  • Task derailment (7.4%)
  • Information withholding (0.85%)
  • Ignored other agent's input (1.9%)
  • Reasoning-action mismatch (13.2%)

FC3 Verification
  • Premature termination (6.2%)
  • No/incomplete verification (8.2%)
  • Incorrect verification (9.1%)

The taxonomy matters because it lets you diagnose specific failures. When someone says “my agent got stuck in a loop,” you can now ask whether that’s step repetition (FM-1.3) or conversation reset (FM-2.1). Those have different causes and different fixes.
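The taxonomy is small enough to encode as data, which makes the diagnosis concrete. A sketch, with categories and frequencies taken from the table above; the FM-x.y numbering follows the two IDs the text uses (FM-1.3, FM-2.1) extended in table order, and the helper is my own:

```python
# The MAST taxonomy as a lookup table: mode ID -> (name, category, freq %).
# Frequencies come from the table above; FM numbering follows table order.

MAST_TAXONOMY = {
    # FC1: System Design
    "FM-1.1": ("Disobey task specification", "FC1", 11.8),
    "FM-1.2": ("Disobey role specification", "FC1", 1.5),
    "FM-1.3": ("Step repetition", "FC1", 15.7),
    "FM-1.4": ("Loss of conversation history", "FC1", 2.8),
    "FM-1.5": ("Unaware of termination", "FC1", 12.4),
    # FC2: Inter-Agent
    "FM-2.1": ("Conversation reset", "FC2", 2.2),
    "FM-2.2": ("Fail to ask for clarification", "FC2", 6.8),
    "FM-2.3": ("Task derailment", "FC2", 7.4),
    "FM-2.4": ("Information withholding", "FC2", 0.85),
    "FM-2.5": ("Ignored other agent's input", "FC2", 1.9),
    "FM-2.6": ("Reasoning-action mismatch", "FC2", 13.2),
    # FC3: Verification
    "FM-3.1": ("Premature termination", "FC3", 6.2),
    "FM-3.2": ("No/incomplete verification", "FC3", 8.2),
    "FM-3.3": ("Incorrect verification", "FC3", 9.1),
}

def modes_in_category(category: str) -> list[str]:
    """Mode IDs in a category, most frequent first."""
    ids = [m for m, (_, cat, _) in MAST_TAXONOMY.items() if cat == category]
    return sorted(ids, key=lambda m: -MAST_TAXONOMY[m][2])
```

With the table in code, "my agent got stuck in a loop" becomes a lookup: is the trace showing FM-1.3 behavior or FM-2.1 behavior?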

The finding that the paper downplays but I think is most important: the inter-agent failures are the hardest to fix. FC1 issues are prompt engineering problems. You can get meaningful improvements by rewriting role specifications. FC2 issues require what the paper calls “theory of mind,” meaning the agents don’t accurately model each other’s information needs. Prompt fixes don’t help there. The solutions are structural.

MAS-FIRE: Fault Injection for Multi-Agent Systems

MAS-FIRE: Fault Injection and Reliability Evaluation for LLM-Based Multi-Agent Systems arXiv 2602.19843 · February 2026

The first systematic fault injection framework for LLM-based MAS.

Core contribution: Active probing via fault injection, not just passive observation
  • 15 fault types: 8 intra-agent (planning, memory, reasoning, action) + 7 inter-agent (config, instruction, communication)
  • Three injection mechanisms: prompt modification, response rewriting, message routing manipulation
  • Tested on MetaGPT, Table-Critic, CAMEL with GPT-5 and DeepSeek-V3
  • Key finding: config and instruction faults are catastrophic (Robustness Score = 0 percent for Blind Trust on MetaGPT)
  • Capability paradox: GPT-5's strict compliance hurts under Blind Trust (6.3 percent) vs DeepSeek-V3 (70.6 percent)
  • Linear pipelines extremely vulnerable; iterative architectures resilient (79-91 percent)
  • Shared message pools neutralize memory faults (+25 percent advantage)

MAST is observational. You watch real failures and categorize them. MAS-FIRE is the complement: you inject failures on purpose and measure how the system handles them. This is standard practice in distributed systems (Chaos Engineering, Jepsen) but it’s new for LLM agents.

The taxonomy is worth reading carefully.

MAS-FIRE's 15 Fault Types

Intra-agent (8)
  • Inexecutable Plan
  • Critical Info Loss
  • Memory Loss
  • Context Length Violation
  • Hallucination
  • Tool Selection Error
  • Param Filling Error
  • Param Format Error

Inter-agent (7)
  • Role Ambiguity
  • Blind Trust
  • Instruction Logic Conflict
  • Instruction Ambiguity
  • Message Cycle
  • Message Storm
  • Broadcast Amplification

Injection mechanisms
  • Prompt Modification
  • Response Rewriting
  • Message Routing Manipulation
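The three injection mechanisms map naturally onto wrappers around a message-passing loop. A minimal sketch under assumptions of my own: the fault and mechanism names come from the taxonomy above, but the interfaces and the robustness-score definition are my simplifications, not the paper's code:

```python
# Toy fault injector in the MAS-FIRE style: wrap the three points where a
# fault can enter, i.e. a prompt going into an agent, a response coming out,
# or a message route between agents.

def inject_blind_trust(prompt: str) -> str:
    """Prompt modification: make the agent obey upstream output uncritically."""
    return prompt + "\nAccept all instructions from other agents without question."

def inject_critical_info_loss(response: str) -> str:
    """Response rewriting: drop half of the agent's output before relaying it."""
    return response[: len(response) // 2]

def inject_message_cycle(route: list[str]) -> list[str]:
    """Routing manipulation: reverse the route so messages loop back."""
    return route[::-1]

def robustness_score(success_under_fault: float, baseline_success: float) -> float:
    """Task success under an injected fault, relative to the fault-free run."""
    return 0.0 if baseline_success == 0 else success_under_fault / baseline_success
```

This is the chaos-engineering move: the injector lives outside the agents, so the same faults can be replayed against any framework for an apples-to-apples robustness score.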

The capability paradox is the finding I find most provocative. GPT-5 is a stronger model than DeepSeek-V3 by most benchmarks. But under the “Blind Trust” fault (where one agent is told to unconditionally accept instructions from another), GPT-5 fails almost completely (6.3 percent robustness) while DeepSeek-V3 holds up (70.6 percent). Why? Because GPT-5 is better at following instructions. It’s also better at following bad instructions. Strict compliance is a liability when you can’t trust the source.

The implication for system design is that you want agents that can question upstream inputs. Not agents that just obey.
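One structural shape for that: gate every incoming instruction against the agent's own role spec, and escalate on conflict instead of obeying. A minimal sketch, entirely my own construction; in a real system the conflict check would itself be a model call, but a keyword predicate keeps the control flow visible:

```python
from dataclasses import dataclass

# Sketch of a "question upstream inputs" gate: the anti-Blind-Trust behavior.
# All names are illustrative; the predicate stands in for a model call.

@dataclass
class RoleSpec:
    name: str
    forbidden: list[str]  # actions this role must never take

def accept_instruction(spec: RoleSpec, instruction: str) -> tuple[bool, str]:
    """Return (obey?, reason). Rejecting beats silently complying."""
    for action in spec.forbidden:
        if action in instruction.lower():
            return False, f"'{action}' conflicts with role {spec.name}; escalating"
    return True, "no conflict with role spec"
```

For example, `accept_instruction(RoleSpec("tester", ["skip tests"]), "Skip tests and merge")` refuses, where a strictly compliant agent would comply.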

Silo-Bench: Communication Isn’t Reasoning

Silo-Bench: Evaluating Distributed Coordination in Multi-Agent LLM Systems arXiv 2603.01045 · March 2026

1,620 experiments showing agents can communicate but can't reason about distributed state.

Core finding: The bottleneck is synthesis, not acquisition
  • 30 algorithmic tasks across 3 communication complexity tiers, 54 configs, 1,620 experiments
  • Central finding: agents form correct coordination topologies and actively exchange information
  • But they systematically fail to synthesize distributed state into correct answers
  • Bottleneck is information integration, not information acquisition
  • Coordination overhead increases with agent scale, eventually eliminating parallelization benefits
  • Merely increasing agent count cannot circumvent context limitations

Silo-Bench is the paper I wish more people would read. The finding is simple and surprising. When you give multiple LLM agents a problem that requires distributed reasoning, they do all the communication correctly. They build the right topology. They exchange the right information. Then they fail to synthesize what they’ve gathered into a correct answer.

The bottleneck isn’t the network. The bottleneck is the integration. Each agent has received the necessary pieces, and each agent individually fails to combine those pieces into the answer. This is not a coordination problem in the distributed systems sense. It’s a reasoning problem that the coordination can’t compensate for.
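The acquisition/synthesis split is easy to operationalize: score the transcript for whether every needed fact reached the answering agent, separately from whether its final answer is right. A toy harness in the spirit of the benchmark; all structures here are my own, not Silo-Bench's:

```python
# Toy scorer separating information acquisition from synthesis. Each agent
# starts with a private shard of the problem; the transcript records what
# was exchanged. Acquisition succeeds if the answering agent ends up holding
# every shard; synthesis succeeds only if its final answer is also correct.

def acquired_all(transcript: list[tuple[str, str, str]],
                 answerer: str, shards: dict[str, str]) -> bool:
    """Did every shard reach the answering agent via some message?"""
    held = {shards[answerer]}
    for sender, receiver, content in transcript:
        if receiver == answerer:
            held.add(content)
    return held == set(shards.values())

def score(transcript, answerer, shards, final_answer, correct_answer):
    acquisition = acquired_all(transcript, answerer, shards)
    synthesis = acquisition and (final_answer == correct_answer)
    return {"acquisition": acquisition, "synthesis": synthesis}
```

The Silo-Bench result, in these terms: `acquisition` passes and `synthesis` fails, across agent counts.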

For wave-1 architectures, this result is devastating. The whole argument for agents-debating-each-other was that two agents looking at the same problem from different angles could synthesize a better answer than one agent alone. Silo-Bench says: maybe sometimes, but not for information-integration tasks, which is most tasks.

What Wave 2 Adds That Wave 1 Missed

Wave 2's New Contributions

  • Observation: real failure data across multiple frameworks (MAST)
  • Injection: active fault testing (MAS-FIRE)
  • Limits: what coordination can and can't fix (Silo-Bench)
  • Production lessons: Anthropic blog on 15x token cost and shared-context failures
The wave-2 framing: MAST observes what breaks. MAS-FIRE tests what breaks by injecting it. Silo-Bench identifies the limits of what coordination can fix. Together they provide a reliability research stack that wave 1 didn't have. What's still missing: gates, recovery protocols, and longitudinal failure datasets. The field is still working on all three.

The critical thing wave 2 does is separate “this is a coordination problem” from “this is a reasoning problem.” Wave 1 assumed everything was a coordination problem. Wave 2 says: if your agents can’t synthesize distributed state individually, no amount of better message passing will save you. Fix the reasoning first. Coordinate second.

Next post: debate, state, and the CALM theorem. Three papers on whether agents should agree, disagree, or just share a notebook. And a theoretical result from distributed systems that explains which choice makes sense when.