Getting Up to Speed on Multi-Agent Systems, Part 2: The Vocabulary
If you try to read multi-agent systems papers without the vocabulary, you will get nowhere. The field has settled on a shared set of words for the pieces of a system, and every paper now slots into those categories even when it pretends to be doing something novel. This post is about those words. Once you know them, you can read any paper in the field and know what it is and isn’t claiming.
Three surveys have done the work of consolidating the vocabulary. Each one cuts the space slightly differently, but together they give you the conceptual toolkit.
Tran et al.: Actors, Types, Structures, Strategies
The most useful single survey is Tran et al. (2025). It defines a multi-agent system formally as a tuple of agents, collaboration channels, collective goals, and an environment.
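In code, that tuple is nothing exotic. Here is a loose transcription; the field names are my shorthand, not the survey’s notation.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class MultiAgentSystem:
    """Tran et al.'s formal tuple, loosely transcribed.

    Field names are illustrative shorthand, not the survey's notation.
    """
    agents: list[Any]     # the set of participating agents
    channels: list[Any]   # collaboration channels between agents
    goals: list[str]      # collective goals the system pursues
    environment: Any      # the shared environment agents act in
```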
Then it taxonomizes the space along four axes:

- Type: cooperation, competition, or coopetition
- Structure: centralized, decentralized, or hierarchical
- Strategy: role-based, rule-based, or model-based
- Architecture: static or dynamic
Most of the famous wave-1 papers are in one box: cooperative, hierarchical, role-based, static. Everyone is doing roughly the same thing, with small variations in how agents pass messages and what they produce at each step. The survey’s most useful claim is that the optimal structure varies with the task. There is no universal topology.
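The four axes are small enough to write down as plain enums. This is my encoding, not the survey’s, with ChatDev’s box filled in to match the comparison table later in this post:

```python
from enum import Enum

class Type(Enum):
    COOPERATION = "cooperation"
    COMPETITION = "competition"
    COOPETITION = "coopetition"

class Structure(Enum):
    CENTRALIZED = "centralized"
    DECENTRALIZED = "decentralized"
    HIERARCHICAL = "hierarchical"

class Strategy(Enum):
    ROLE_BASED = "role-based"
    RULE_BASED = "rule-based"
    MODEL_BASED = "model-based"

class Architecture(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"

# ChatDev's box, matching the table below:
chatdev = (Type.COOPERATION, Structure.HIERARCHICAL,
           Strategy.ROLE_BASED, Architecture.STATIC)
```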
Zhou et al.: The Five-Component Agent
Zhou et al. (2024) take a different cut. Instead of asking how agents coordinate, they ask what each agent actually has inside it. They propose a five-component model that applies to any LLM-based agent.
The five components:

- Profile: the agent’s defined identity and role
- Perception: how the agent takes in information
- Self-Action: what the agent does on its own
- Mutual Interaction: how agents communicate with one another
- Evolution: how the agent changes over time
If you come from distributed systems, the labels sound like things you’d recognize from any actor system. Profile is identity. Perception is input. Self-Action is local state plus computation. Mutual Interaction is message passing. Evolution is the weakest piece, because nobody has really figured out what “agent learning from its own history” looks like in production.
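Spelled out as a skeletal actor, the model looks something like this. It is my sketch of Zhou’s five components, not code from the paper:

```python
class Agent:
    """Zhou et al.'s five components, read as an actor. A sketch only."""

    def __init__(self, profile: str):
        self.profile = profile    # Profile: identity / role description
        self.history: list = []   # local state backing Self-Action

    def perceive(self, observation: str) -> None:
        # Perception: take in input from the environment or other agents.
        self.history.append(("obs", observation))

    def act(self) -> str:
        # Self-Action: local computation over local state. In a real
        # agent, this is where the LLM call conditioned on profile
        # plus history would go.
        return f"[{self.profile}] acting on {len(self.history)} events"

    def interact(self, other: "Agent", message: str) -> None:
        # Mutual Interaction: plain message passing between actors.
        other.perceive(message)

    def evolve(self) -> None:
        # Evolution: learn from one's own history. Deliberately a stub;
        # no wave-1 system has a production answer here.
        pass
```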
Chen et al.: Applications and Unsolved Challenges
The third survey, Chen et al. (2024), is the one I’d skim rather than read in full. The applications chapter is useful, but what you actually want is the challenges section.
Chen et al. group the open problems into levels; the two this post leans on are the interaction level and the evaluation level.
The interaction-level challenges are the ones that most concern me. Efficiency explosion is the observation that multi-agent systems scale worse than linearly because each agent’s autoregressive generation multiplies the token cost. Accumulative error is what it sounds like: errors made in round one propagate and amplify in rounds two, three, four.
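A toy cost model makes the superlinearity visible. Assume a debate where every agent rereads the full transcript each round; every number here is invented, for shape only.

```python
def debate_tokens(n: int, r: int, m: int = 500) -> int:
    """Toy model: n agents, r rounds, ~m tokens per message.

    Each round, every agent rereads the whole transcript (input tokens)
    and writes one message (output tokens).
    """
    total = 0
    transcript = 0  # tokens accumulated in the shared transcript
    for _ in range(r):
        total += n * transcript  # every agent rereads everything so far
        transcript += n * m      # n new messages land this round
        total += n * m           # output tokens this round
    return total

for n in (1, 2, 4, 8):
    print(n, debate_tokens(n, r=3))
# 1 3000 / 2 9000 / 4 30000 / 8 108000: doubling the agents roughly
# triples the bill, because the transcript they all reread grows with n too.
```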
Mapping Papers Into These Taxonomies
The payoff of the vocabulary is that you can now categorize any paper in the field at a glance.
| System | Type | Structure | Strategy | Architecture |
|---|---|---|---|---|
| CAMEL | Cooperation | Decentralized pair | Role-based | Static |
| ChatDev | Cooperation | Hierarchical pipeline | Role-based | Static |
| MetaGPT | Cooperation | Centralized pool | Role + Rule-based | Static |
| Debate (Du) | Competition | Decentralized all-to-all | Rule-based rounds | Static |
| Generative Agents | Coopetition | Decentralized open env | Model-based retrieval | Dynamic |
| Anthropic Research | Cooperation | Centralized orchestrator | Role-based | Dynamic |
| AutoGen | Configurable | Configurable | Configurable | Static or Dynamic |
Most of the canonical papers sit in the cooperative, role-based, static quadrant. The interesting ones are the exceptions. Du et al. is the rare competitive debate paper. Generative Agents is the rare fully dynamic system. AutoGen tries to be everything at once, which is its whole thesis.
The Gap the Vocabulary Exposes
The taxonomies do something else besides categorize papers. They make gaps visible.
Zhou’s “Evolution” component is the weakest across every system. Nobody has a real story for how agents learn from their own history in production. MetaGPT’s “test-driven retry” is the closest wave-1 paper to Evolution, and it’s still just a bounded retry loop with no memory of past attempts.
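The gap is easy to state in code. The first function below is the bounded-retry shape as this post describes it; the second is a hypothetical minimum for what Evolution would actually require, which no wave-1 system ships.

```python
def bounded_retry(generate, test, max_attempts: int = 3):
    # The wave-1 shape: regenerate until the tests pass or the budget
    # runs out. Every attempt starts blind; nothing carries over.
    for _ in range(max_attempts):
        code = generate()
        if test(code):
            return code
    return None

def retry_with_memory(generate, test, max_attempts: int = 3):
    # Hypothetical minimum for real Evolution: condition each attempt
    # on what already failed.
    failures = []
    for _ in range(max_attempts):
        code = generate(failures)
        if test(code):
            return code
        failures.append(code)
    return None
```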
Tran’s “dynamic architecture” category is almost empty. The wave-1 papers all fix their topology at design time. AutoGen makes topology configurable, but it’s configured by the developer, not adjusted at runtime. The only system that truly adjusts at runtime is Generative Agents, and that’s a simulation, not a production framework.
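Reduced to code, the static/dynamic distinction is about when the topology gets decided. The planner below is a hypothetical stand-in; a real dynamic orchestrator would make an LLM planning call where the keyword heuristic sits.

```python
# Static (the wave-1 default): the topology is fixed at design time.
STATIC_PIPELINE = ["analyst", "architect", "coder", "tester"]

# Dynamic: the topology is derived per task, at runtime.
def plan_topology(task: str) -> list[str]:
    # Stand-in heuristic; a real dynamic orchestrator would plan with
    # an LLM call here and could also respawn agents mid-task.
    team = ["lead"]
    if "research" in task:
        team += ["searcher", "searcher"]  # fan out for breadth
    if "code" in task:
        team += ["coder", "reviewer"]
    return team

print(plan_topology("research rival tools and code a comparison"))
# ['lead', 'searcher', 'searcher', 'coder', 'reviewer']
```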
Chen’s “evaluation-level” challenges are unsolved in a way that’s embarrassing for the field. When ChatDev claims 88 percent executability and MetaGPT claims 41 percent on a comparable benchmark, you’re not looking at a performance difference. You’re looking at two papers measuring different things with different tools and calling them the same.
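To see how two papers can measure “the same” metric differently, consider two hypothetical readings of executability. Neither function is from either paper; the point is that both are defensible and they are not comparable.

```python
import subprocess

def executability_as_runs(path: str) -> bool:
    # Reading 1: the generated program starts and exits cleanly.
    try:
        result = subprocess.run(["python", path],
                                capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def executability_as_compiles(path: str) -> bool:
    # Reading 2: the generated program merely parses.
    result = subprocess.run(["python", "-m", "py_compile", path],
                            capture_output=True)
    return result.returncode == 0

# Score the same batch of generated projects with both readings and
# you get two different percentages; neither says the software does
# what was asked. That is the evaluation-level gap.
```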
Next post: the wave-1 theory papers in detail. CAMEL, Generative Agents, ChatDev, MetaGPT, AutoGen. What each one actually builds, what each one trusts, and where each one breaks.