Getting Up to Speed on Multi-Agent Systems, Part 3: Wave 1 (Can Agents Coordinate At All?)
Wave 1 is the cluster of papers from 2023 that people actually cite. When someone says “I read the multi-agent papers,” they usually mean these. In this post I’m going to walk through the canonical five, explain what each one actually builds, and show where they agree and where they quietly disagree with each other.
- Part 1. The Landscape
- Part 2. The Vocabulary
- Part 3. Wave 1: Can Agents Coordinate At All? (you are here)
- Part 4. Wave 2: Why It Breaks
- Part 5. Debate, State, and Coordination
- Part 6. Verification Patterns
- Part 7. Benchmarks and What They Miss
- Part 8. Open Questions
CAMEL: Two Agents Role-Playing
Two LLMs role-play until the task is done.
- AI User (instructor) and AI Assistant in a structured dialogue loop
- Inception prompting: symmetric system prompts with explicit constraints like "Never flip roles"
- Task specifier agent elaborates vague human input into concrete tasks
- Documented failure modes: role flipping, instruction repetition, vague responses, conversational loops
- All mitigations are prompt-level, no structural enforcement
CAMEL is the simplest of the wave-1 papers and also the most honest. It’s two LLMs. One plays a user, one plays an assistant. They talk. The paper’s main contribution is inception prompting, which is a way of writing system prompts that keep agents from breaking character. The failure modes the paper documents are the failure modes you’d expect: agents flip roles, agents repeat themselves, agents give vague answers.
What’s missing from CAMEL is any structural enforcement. If the agent flips roles, nothing stops it except a prompt instruction that says “don’t flip roles.” There’s no protocol-level guarantee. This is a pattern you’ll see repeated across wave-1: trust the prompt, hope for the best.
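To make the "trust the prompt" point concrete, here is a minimal sketch of a CAMEL-style role-play loop. The prompt text paraphrases the inception-prompting idea rather than quoting the paper, and `call_llm` is a hypothetical stand-in for whatever chat-completion client you use.

```python
def call_llm(system_prompt: str, transcript: list[tuple[str, str]]) -> str:
    """Stand-in for a chat-completion call; plug in your model client here."""
    raise NotImplementedError

USER_SYSTEM = (
    "You are the AI User. Give one instruction at a time toward the task. "
    "Never flip roles. Say <TASK_DONE> when the task is complete."
)
ASSISTANT_SYSTEM = (
    "You are the AI Assistant. Carry out the instruction you are given. "
    "Never flip roles. Never instruct the user."
)

def role_play(task: str, max_turns: int = 40) -> list[tuple[str, str]]:
    transcript: list[tuple[str, str]] = [("task", task)]
    for _ in range(max_turns):
        # AI User produces the next instruction.
        instruction = call_llm(USER_SYSTEM, transcript)
        transcript.append(("user", instruction))
        if "<TASK_DONE>" in instruction:
            break
        # AI Assistant responds. Nothing here prevents a role flip or a vague
        # answer; the only guard is the sentence in the system prompt.
        response = call_llm(ASSISTANT_SYSTEM, transcript)
        transcript.append(("assistant", response))
    return transcript
```

Notice that the loop itself enforces nothing beyond a turn cap. Every constraint the paper cares about lives inside the two system strings.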
Generative Agents: Memory, Reflection, and Planning
Give agents memory, reflection, and planning so they behave believably over time.
- Memory stream: every observation stored with timestamp and importance score (LLM-rated 1 to 10)
- Retrieval: weighted sum of recency (exponential decay), relevance (cosine similarity), and importance
- Reflection: triggered when accumulated importance crosses a threshold (roughly 2-3 times per day)
- Planning: top-down recursive (day, hour, 5-15 minute blocks); replans on unexpected events
- Emergent behaviors: a Valentine's Day party self-organized from one suggestion; info diffusion from 4 percent to 32 percent awareness in two game days
This paper is the outlier in wave-1, because it isn’t trying to build software. It’s a social simulation. 25 agents living in a Sims-style town. What makes it interesting for multi-agent systems is that it’s the only wave-1 paper that takes memory seriously. Every observation gets stored with a timestamp and an importance score. When an agent needs to act, it retrieves memories using a weighted combination of recency, relevance, and importance. Reflections are higher-level thoughts synthesized from clusters of observations.
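Here is a minimal sketch of that retrieval scoring, assuming equal weighting of the three terms and an illustrative decay constant. The class and function names are mine, not the paper's.

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    created_at: float        # game-time timestamp, in hours
    importance: float        # LLM-rated 1..10
    embedding: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(memories: list[Memory], query_embedding: list[float],
             now: float, k: int = 5, decay: float = 0.99) -> list[Memory]:
    """Score = recency + relevance + importance, each roughly normalized to
    [0, 1]. Equal weights and the exact decay constant are illustrative."""
    def score(m: Memory) -> float:
        recency = decay ** (now - m.created_at)           # exponential decay
        relevance = cosine(m.embedding, query_embedding)  # similarity to the query
        importance = m.importance / 10.0                  # rescale 1..10
        return recency + relevance + importance
    return sorted(memories, key=score, reverse=True)[:k]
```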
None of the software engineering papers in wave-1 do anything like this. They don’t need to, because their tasks have clear start and end conditions. But when you look at what production multi-agent systems are starting to need, the Generative Agents architecture has more of the right pieces than MetaGPT does.
ChatDev: Pairwise Chat as a Software Pipeline
Chain pairwise dialogues into a software development pipeline.
- Fixed pipeline: Design, then Coding, then Testing; each phase is an instructor-assistant dialogue
- Communicative dehallucination: the assistant flips role and asks clarifying questions before committing to an answer
- Short-term memory (full dialogue within a phase) and long-term memory (extracted solutions across phases)
- Termination: 10 rounds max, or two consecutive rounds without changes
- No escalation path when convergence fails, it just stops
ChatDev is the paper that got me into this literature in the first place. It’s a pipeline of pairwise dialogues. Pairs of agents talk about design, then pairs of agents talk about coding, then pairs of agents talk about testing. The most interesting mechanism is communicative dehallucination, which is a prompt pattern where the assistant asks clarifying questions before answering. This is the closest any wave-1 paper gets to backpressure.
The structural problem with ChatDev is that when agents can’t converge in 10 rounds, the system just stops. There’s no fallback. No mechanism for the system to notice that it’s stuck and escalate. No concurrency control on the shared artifacts. I wrote about how this breaks down in practice a few weeks ago.
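A minimal sketch of that phase loop makes the problem visible. The callables here are hypothetical stand-ins for the instructor and assistant LLM turns; the termination rules (round cap, two quiet rounds) follow the bullets above.

```python
from typing import Callable

def run_phase(
    artifact: str,
    instructor_turn: Callable[[str], str],      # produces review comments
    assistant_turn: Callable[[str, str], str],  # produces a revised artifact
    max_rounds: int = 10,
) -> str:
    unchanged = 0
    for _ in range(max_rounds):
        instruction = instructor_turn(artifact)
        revised = assistant_turn(artifact, instruction)
        if revised == artifact:
            unchanged += 1
            if unchanged >= 2:   # two consecutive quiet rounds: treat as converged
                break
        else:
            unchanged = 0
            artifact = revised
    # If we hit max_rounds without converging, we simply return whatever we
    # have. There is no escalation path, which is the gap described above.
    return artifact
```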
MetaGPT: Structured Artifacts and Test Execution
MetaGPT is the most ambitious wave-1 paper. Instead of dialogue, the agents produce structured documents. Instead of relying on agents to agree, the framework executes tests. The paper’s strongest claim is the structural one: if agents produce artifacts with defined schemas, and those artifacts are validated by execution, you get better coordination than you get from unconstrained dialogue.
I think that claim holds up. But the coordination model is still a shared mutable pool, and it runs straight into the causality problem that distributed databases like Riak addressed years ago with version vectors. MetaGPT has nothing comparable. Agents publish to the pool. Agents subscribe. Nobody tracks causality.
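A minimal sketch of that kind of pool, with names of my own invention, shows what is missing. Messages only append, and nothing records which earlier messages a new one was derived from.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    author: str
    artifact: str          # e.g. "design_doc", "api_spec", "tests"
    content: str

@dataclass
class MessagePool:
    messages: list[Message] = field(default_factory=list)
    subscriptions: dict[str, set[str]] = field(default_factory=dict)

    def subscribe(self, agent: str, artifact: str) -> None:
        self.subscriptions.setdefault(agent, set()).add(artifact)

    def publish(self, msg: Message) -> None:
        # Append-only: the pool grows monotonically. If two agents publish
        # conflicting versions of the same artifact, both land here, and a
        # later reader sees whichever it happens to pick up. No version
        # vector, no causal ordering, no conflict detection.
        self.messages.append(msg)

    def read(self, agent: str) -> list[Message]:
        wanted = self.subscriptions.get(agent, set())
        return [m for m in self.messages if m.artifact in wanted]
```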
AutoGen: A Framework, Not a System
A configurable framework. Build whatever multi-agent system you want.
- ConversableAgent base class: any entity that sends and receives messages (LLM, human, tool, code executor)
- Pluggable reply functions via register_reply(); agent behavior is what it does when it gets a message
- GroupChatManager: selects next speaker via LLM role-play prompting or an FSM
- Human-in-the-loop as a dial: per-agent config of ALWAYS, TERMINATE, or NEVER
- Number one on GAIA at time of publication, roughly 2x performance on the hardest level
AutoGen is the odd paper in wave-1 because it’s not a system, it’s a framework. The contribution is that every agent, whether it’s an LLM, a human, a tool, or a code executor, speaks the same message protocol. You compose them however you want. The GroupChatManager can pick the next speaker via an LLM or via a finite state machine you define.
AutoGen is more honest than the others about the fact that there’s no “right” multi-agent architecture. It doesn’t try to tell you what your agents should be. It gives you the plumbing and assumes you know what you’re doing. For that reason it’s probably aged better than CAMEL or ChatDev.
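For flavor, here is a sketch of the composition model using the pyautogen-style API. Exact argument names and defaults have shifted across versions, so treat this as illustrative rather than copy-paste ready.

```python
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

llm_config = {"config_list": [{"model": "gpt-4"}]}  # your model config here

coder = AssistantAgent("coder", llm_config=llm_config)
reviewer = AssistantAgent("reviewer", llm_config=llm_config)
user = UserProxyAgent(
    "user",
    human_input_mode="NEVER",   # the human-in-the-loop dial: ALWAYS / TERMINATE / NEVER
    code_execution_config={"work_dir": "scratch", "use_docker": False},
)

# The GroupChatManager picks the next speaker, by default via LLM prompting;
# you can also constrain transitions with an explicit graph or FSM.
chat = GroupChat(agents=[user, coder, reviewer], messages=[], max_round=12)
manager = GroupChatManager(groupchat=chat, llm_config=llm_config)

user.initiate_chat(manager, message="Write and review a small CSV de-duplication script.")
```

The point is not this particular trio of agents; it is that everything in the list passed to GroupChat speaks the same message protocol, whether its replies come from a model, a human, or a code executor.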
What Wave 1 Got Right
Every one of these papers took LLMs out of single-user chat and put them into multi-step coordination tasks. That’s a real contribution. Role specialization, structured dialogue, tool use patterns, task decomposition, memory and reflection as first-class primitives. These ideas came out of wave-1 and the field is still using them.
What Wave 1 Got Wrong: Shared Assumptions That Didn’t Survive
Every wave-1 paper treats failure as a termination condition. When ChatDev can’t converge, it stops. When MetaGPT’s tests fail three times, it stops. When AutoGen hits max_round, it stops. None of these systems have a model for what happens next. This is the gap wave-2 papers would later start trying to fill.
None of the wave-1 papers have concurrency control on their shared state. MetaGPT’s message pool grows monotonically and nobody tracks causality. ChatDev discards dialogue at phase boundaries. Generative Agents’ memory is per-agent with no sharing. If you had two agents in MetaGPT trying to edit the same file, nothing in the framework would stop them from overwriting each other’s work.
And all of them evaluate against benchmarks that were designed for single agents: HumanEval, MBPP, SWE-bench. These benchmarks measure whether the output is correct. They don’t measure coordination quality, communication overhead, or recovery behavior, which are exactly the things that distinguish a multi-agent system from a single agent.
Next post: wave-2 papers, which measure what actually breaks in these systems. With wave-1 architectures running in production, and the agentic coding turn having clarified when MAS isn’t the right tool at all, the field started asking why MAS fails when you do use it and how to test that honestly.