Getting Up to Speed on Multi-Agent Systems, Part 3: Wave 1 (Can Agents Coordinate At All?)

26 Apr 2026

Wave 1 is the cluster of papers from 2023 that people actually cite. When someone says “I read the multi-agent papers,” they usually mean these. In this post I’m going to walk through the canonical five, explain what each one actually builds, and show where they agree and where they quietly disagree with each other.

CAMEL: Two Agents Role-Playing

CAMEL: Communicative Agents for "Mind" Exploration arXiv 2303.17760 · NeurIPS 2023

Two LLMs role-play until the task is done.

Core bet: Prompt constraints keep agents on task
2 agents role-play inception prompting
  • AI User (instructor) and AI Assistant in a structured dialogue loop
  • Inception prompting: symmetric system prompts with explicit constraints like "Never flip roles"
  • Task specifier agent elaborates vague human input into concrete tasks
  • Documented failure modes: role flipping, instruction repetition, vague responses, conversational loops
  • All mitigations are prompt-level, no structural enforcement

CAMEL is the simplest of the wave-1 papers and also the most honest. It’s two LLMs. One plays a user, one plays an assistant. They talk. The paper’s main contribution is inception prompting, which is a way of writing system prompts that keep agents from breaking character. The failure modes the paper documents are the failure modes you’d expect: agents flip roles, agents repeat themselves, agents give vague answers.

What’s missing from CAMEL is any structural enforcement. If the agent flips roles, nothing stops it except a prompt instruction that says “don’t flip roles.” There’s no protocol-level guarantee. This is a pattern you’ll see repeated across wave-1: trust the prompt, hope for the best.
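To make the prompt-only-enforcement point concrete, here's a minimal sketch of the inception-prompting loop. The prompt text is paraphrased from the paper's templates, `chat` is a stand-in for any LLM call, and the function names are mine. Notice that every constraint, including termination, lives in prompt text; the loop itself enforces nothing.

```python
# Sketch of CAMEL-style inception prompting. The role constraints exist
# only as prompt instructions; nothing in the loop checks or enforces them.
# `chat` is any callable: list of prompt strings -> reply string.

ASSISTANT_SYS = (
    "Never forget you are a Python Programmer and I am a Product Manager. "
    "Never flip roles! Never instruct me! "
    "Always answer with: Solution: <YOUR_SOLUTION>."
)
USER_SYS = (
    "Never forget you are a Product Manager and I am a Python Programmer. "
    "Never flip roles! You must instruct me to complete the task. "
    "Always phrase instructions as: Instruction: <YOUR_INSTRUCTION>. "
    "When the task is done, reply only with <CAMEL_TASK_DONE>."
)

def role_play(chat, task, max_turns=10):
    history = [f"Task: {task}"]
    for _ in range(max_turns):
        instruction = chat([USER_SYS] + history)    # AI user instructs
        history.append(instruction)
        solution = chat([ASSISTANT_SYS] + history)  # AI assistant responds
        history.append(solution)
        if "<CAMEL_TASK_DONE>" in instruction:      # prompt-level stop token
            break
    return history
```

If the assistant decides to start instructing the user anyway, the only thing pushing back is the text "Never flip roles!" in its system prompt.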

Generative Agents: Memory, Reflection, and Planning

Generative Agents: Interactive Simulacra of Human Behavior arXiv 2304.03442 · UIST 2023

Give agents memory, reflection, and planning so they behave believably over time.

Core bet: Retrieval scoring produces believable behavior
25 agents memory stream reflection planning
  • Memory stream: every observation stored with timestamp and importance score (LLM-rated 1 to 10)
  • Retrieval: weighted sum of recency (exponential decay), relevance (cosine similarity), and importance
  • Reflection: triggered when accumulated importance crosses a threshold (roughly 2-3 times per day)
  • Planning: top-down recursive (day, hour, 5-15 minute blocks); replans on unexpected events
  • Emergent behaviors: a Valentine's Day party self-organized from one suggestion; info diffusion from 4 percent to 32 percent awareness in two game days

This paper is the outlier in wave-1, because it isn’t trying to build software. It’s a social simulation. 25 agents living in a Sims-style town. What makes it interesting for multi-agent systems is that it’s the only wave-1 paper that takes memory seriously. Every observation gets stored with a timestamp and an importance score. When an agent needs to act, it retrieves memories using a weighted combination of recency, relevance, and importance. Reflections are higher-level thoughts synthesized from clusters of observations.
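The retrieval rule is simple enough to sketch. This is an illustrative implementation with equal weights and an assumed decay factor per game hour, not the paper's exact code; the memory tuple layout is mine.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(memories, query_vec, now, k=3, decay=0.995):
    # Each memory is (embedding, importance 1-10, last_access time in hours).
    # Score = recency (exponential decay) + importance + relevance (cosine),
    # each term normalized to [0, 1] and weighted equally here.
    def score(m):
        emb, importance, last_access = m
        recency = decay ** (now - last_access)
        relevance = cosine(emb, query_vec)
        return recency + importance / 10 + relevance
    return sorted(memories, key=score, reverse=True)[:k]
```

The interesting property is that no single term dominates: a very important but stale memory can lose to a mundane but recent and relevant one.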

None of the software engineering papers in wave-1 do anything like this. They don’t need to, because their tasks have clear start and end conditions. But when you look at what production multi-agent systems are starting to need, the Generative Agents architecture has more of the right pieces than MetaGPT does.

ChatDev: Pairwise Chat as a Software Pipeline

ChatDev: Communicative Agents for Software Development arXiv 2307.07924

Chain pairwise dialogues into a software development pipeline.

Core bet: Dialogue convergence equals correct output
pairwise chat phase pipeline dehallucination
  • Fixed pipeline: Design, then Coding, then Testing; each phase is an instructor-assistant dialogue
  • Communicative dehallucination: the assistant flips role and asks clarifying questions before committing to an answer
  • Short-term memory (full dialogue within a phase) and long-term memory (extracted solutions across phases)
  • Termination: 10 rounds max, or two consecutive rounds without changes
  • No escalation path when convergence fails; the system just stops

ChatDev is the paper that got me into this literature in the first place. It’s a pipeline of pairwise dialogues. Pairs of agents talk about design, then pairs of agents talk about coding, then pairs of agents talk about testing. The most interesting mechanism is communicative dehallucination, which is a prompt pattern where the assistant asks clarifying questions before answering. This is the closest any wave-1 paper gets to backpressure.

The structural problem with ChatDev is that when agents can’t converge in 10 rounds, the system just stops. There’s no fallback. No mechanism for the system to notice that it’s stuck and escalate. No concurrency control on the shared artifacts. I wrote about how this breaks down in practice a few weeks ago.
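A sketch of the phase loop makes the structural problem visible. The function and parameter names are mine and `chat` is a stand-in LLM call, but the termination logic follows the paper's description: a round cap, or two consecutive rounds where the artifact doesn't change.

```python
def run_phase(chat, instructor_prompt, assistant_prompt, max_rounds=10):
    """ChatDev-style instructor/assistant loop for one phase.
    Terminates on the round cap or after two consecutive rounds with no
    change to the artifact. There is no escalation path: whatever state
    the artifact is in when the loop exits is what the next phase gets."""
    artifact, unchanged = "", 0
    for _ in range(max_rounds):
        instruction = chat(instructor_prompt, artifact)
        new_artifact = chat(assistant_prompt, instruction + "\n" + artifact)
        unchanged = unchanged + 1 if new_artifact == artifact else 0
        artifact = new_artifact
        if unchanged >= 2:          # converged (or stuck -- same signal)
            break
    return artifact                  # returned even if never converged
```

Note that "converged" and "stuck repeating itself" produce the same exit condition, which is exactly the problem: the system can't tell the difference.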

MetaGPT: Structured Artifacts and Test Execution

MetaGPT: Meta Programming for Multi-Agent Collaborative Framework arXiv 2308.00352 · ICLR 2024 Oral

Replace dialogue with structured documents and real test execution.

Core bet: Schemas plus passing tests equal correct output
5 roles artifact pub-sub executable feedback
  • Roles: PM, Architect, Project Manager, Engineer, QA (waterfall)
  • No dialogue; agents produce structured documents (PRD, system design, task list, code, tests)
  • Pub-sub message pool: agents publish artifacts, subscribe by role to relevant messages
  • Executable feedback: unit tests actually run; failures trigger up to 3 retries referencing PRD and design docs
  • Results: 85.9 percent Pass@1 on HumanEval, 100 percent task completion, 0.83 human revisions vs ChatDev's 2.5

MetaGPT is the most ambitious wave-1 paper. Instead of dialogue, the agents produce structured documents. Instead of relying on agents to agree, the framework executes tests. The paper’s strongest claim is the structural one: if agents produce artifacts with defined schemas, and those artifacts are validated by execution, you get better coordination than you get from unconstrained dialogue.

I think that claim holds up. But the coordination model is still a shared mutable pool, the same problem Dynamo-style stores like Riak addressed years ago with version vectors. MetaGPT has nothing comparable. Agents publish to the pool. Agents subscribe. Nobody tracks causality.
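Here's roughly what that pool looks like, as a sketch; the class and method names are mine, not MetaGPT's. The point is what's absent: no versions, no causal context, just an append-only list where the newest artifact of each type wins.

```python
from collections import defaultdict

class MessagePool:
    """Sketch of a MetaGPT-style shared pool: agents publish typed
    artifacts, roles subscribe by artifact type. The pool only grows,
    and a later publish silently shadows an earlier one."""
    def __init__(self):
        self.messages = []                      # append-only history
        self.subscriptions = defaultdict(list)  # artifact type -> roles

    def subscribe(self, role, artifact_type):
        self.subscriptions[artifact_type].append(role)

    def publish(self, sender, artifact_type, content):
        # No version stamp, no causal context: last write wins.
        self.messages.append((sender, artifact_type, content))

    def latest(self, role, artifact_type):
        # Subscribed roles see only the newest artifact of each type.
        if role not in self.subscriptions[artifact_type]:
            return None
        for sender, kind, content in reversed(self.messages):
            if kind == artifact_type:
                return content
        return None
```

Two agents publishing the same artifact type concurrently would never know about each other; whichever append lands second is the one everyone reads.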

AutoGen: A Framework, Not a System

AutoGen: Next-Gen LLM Applications via Multi-Agent Conversation arXiv 2308.08155

A configurable framework. Build whatever multi-agent system you want.

Core bet: Developers will build the right topology
framework ConversableAgent GroupChat human-in-loop
  • ConversableAgent base class: any entity that sends and receives messages (LLM, human, tool, code executor)
  • Pluggable reply functions via register_reply(); agent behavior is what it does when it gets a message
  • GroupChatManager: selects next speaker via LLM role-play prompting or an FSM
  • Human-in-the-loop as a dial: per-agent config of ALWAYS, SOMETIMES, or NEVER
  • Number one on GAIA at time of publication, roughly 2x performance on the hardest level

AutoGen is the odd paper in wave-1 because it’s not a system, it’s a framework. The contribution is that every agent, whether it’s an LLM, a human, a tool, or a code executor, speaks the same message protocol. You compose them however you want. The GroupChatManager can pick the next speaker via an LLM or via a finite state machine you define.

AutoGen is more honest than the others about the fact that there’s no “right” multi-agent architecture. It doesn’t try to tell you what your agents should be. It gives you the plumbing and assumes you know what you’re doing. For that reason it’s probably aged better than CAMEL or ChatDev.
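The core idea is small enough to sketch. This mirrors the concept of a uniform message protocol with pluggable reply functions; it is not AutoGen's actual class signatures, just the shape of the abstraction.

```python
class Agent:
    """Conceptual sketch of the AutoGen idea: an agent is anything that
    receives a message and produces a reply via registered reply
    functions. An LLM wrapper, a human proxy, a tool, and a code
    executor all fit the same interface."""
    def __init__(self, name):
        self.name = name
        self.reply_funcs = []

    def register_reply(self, func):
        # Later registrations take priority: they are checked first.
        self.reply_funcs.insert(0, func)

    def receive(self, message, sender):
        # Each reply function returns (handled, reply); first handler wins.
        for func in self.reply_funcs:
            handled, reply = func(self, message, sender)
            if handled:
                return reply
        return None

# A tool speaks the same protocol an LLM or a human would:
calc = Agent("calculator")
calc.register_reply(lambda agent, msg, sender: (
    msg.startswith("add "),
    sum(int(x) for x in msg.split()[1:]),
))
```

Because every participant is just a `receive`-able entity, topology is whatever you wire up, which is exactly the framework's bet.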

What Wave 1 Got Right

Every one of these papers took LLMs out of single-user chat and put them into multi-step coordination tasks. That’s a real contribution. Role specialization, structured dialogue, tool use patterns, task decomposition, memory and reflection as first-class primitives. These ideas came out of wave-1 and the field is still using them.

What Wave 1 Got Wrong

Shared Assumptions That Didn't Survive

  • Failure model: treated as termination, not a system state
  • Concurrency control: shared state with no causality tracking
  • Evaluation: benchmarks designed for single agents
  • Escalation: no path when convergence fails
  • Topology: fixed at design time

Every wave-1 paper treats failure as a termination condition. When ChatDev can’t converge, it stops. When MetaGPT’s tests fail three times, it stops. When AutoGen hits max_round, it stops. None of these systems have a model for what happens next. This is the gap wave-2 papers would later start trying to fill.

None of the wave-1 papers have concurrency control on their shared state. MetaGPT’s message pool grows monotonically and nobody tracks causality. ChatDev discards dialogue at phase boundaries. Generative Agents’ memory is per-agent with no sharing. If you had two agents in MetaGPT trying to edit the same file, nothing in the framework would stop them from overwriting each other’s work.
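For contrast, here's the check none of these systems perform, sketched minimally: stamp each edit with a version vector (a map of agent name to edit counter) and only allow an overwrite when one stamp dominates the other. This is a textbook construction, not code from any of the papers.

```python
def dominates(a, b):
    """True if version vector a has seen every event recorded in b."""
    return all(a.get(agent, 0) >= count for agent, count in b.items())

def classify(a, b):
    # Compare two edits to the same artifact, each stamped with a
    # version vector {agent: counter}. If neither vector dominates,
    # the edits are concurrent: they must be merged or escalated,
    # never silently overwritten.
    if dominates(a, b) and dominates(b, a):
        return "identical"
    if dominates(a, b):
        return "a_newer"
    if dominates(b, a):
        return "b_newer"
    return "concurrent"
```

In MetaGPT's pool or ChatDev's shared artifacts, the "concurrent" branch is where a coordination layer would kick in; today that branch simply doesn't exist.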

And all of them evaluate against benchmarks that were designed for single agents: HumanEval, MBPP, SWE-bench. These benchmarks measure whether the output is correct. They don't measure coordination quality, communication overhead, or recovery behavior, which are the things that distinguish a multi-agent system from a single agent.

Next post: wave-2 papers, which measure what actually breaks in these systems. With wave-1 architectures running in production, and the agentic coding turn having clarified when MAS isn’t the right tool at all, the field started asking why MAS fails when you do use it and how to test that honestly.