Getting Up to Speed on Multi-Agent Systems, Part 8: Open Questions

01 May 2026

I started this series because I’d been reading multi-agent papers for weeks and wanted the map I wish I’d had on day one. This is the last post. I want to close it by laying out what the field still hasn’t figured out, what I think is worth stealing from adjacent fields, and what I’d read if I had to start over.

Stealable Ideas

Some ideas are not yet general patterns in the field, but they’re battle-tested in individual papers, and any new multi-agent system should probably adopt them. These are the things I’d take from one paper and apply in a different context.

Things Any New Multi-Agent System Should Adopt

| Pattern | Idea (source) |
| --- | --- |
| Artifacts | Structured documents between stages (MetaGPT) |
| Clarification | Agents can ask before they act (ChatDev dehallucination) |
| Reflection | Importance-triggered synthesis (Generative Agents) |
| Memory retrieval | Recency × relevance × importance, sketched in code below (Generative Agents) |
| Shared state | Append-only notebook for structured info (Ou et al.) |
| Tool interface | ACI-quality commands with guardrails (SWE-agent) |
| Stuck detection | Count loops, trigger replanning (Magentic-One) |
| Sandboxing | Docker plus permission configs (AutoDev, OpenHands) |
| Verification | Modality shift: code to visual, code to tests (Cursor, MetaGPT) |

None of these are hard to implement. None of them require a research breakthrough. They just haven’t been brought together in a single system yet.
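
The memory retrieval row is the most formula-shaped of these, so here's a minimal sketch of a Generative Agents-style scorer in Python. The paper combines normalized recency, relevance, and importance scores; the exact decay constant, the 0-to-10 importance scale, and the hand-rolled cosine below are my fill-ins, not values to treat as canonical.

```python
import math
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    importance: float  # 0-10, scored once at write time by the LLM
    embedding: list[float] = field(default_factory=list)
    last_access: float = field(default_factory=time.time)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieval_score(m: Memory, query_emb: list[float],
                    decay: float = 0.995) -> float:
    """Recency x relevance x importance, combined as an equal-weight sum.
    The per-hour decay constant here is chosen for illustration."""
    hours = (time.time() - m.last_access) / 3600
    recency = decay ** hours            # exponential decay into [0, 1]
    relevance = cosine(query_emb, m.embedding)
    importance = m.importance / 10      # normalize to [0, 1]
    return recency + relevance + importance

def retrieve(memories: list[Memory], query_emb: list[float],
             k: int = 5) -> list[Memory]:
    # Top-k memories by combined score; ties broken arbitrarily.
    return sorted(memories, key=lambda m: retrieval_score(m, query_emb),
                  reverse=True)[:k]
```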

Open Research Questions

The gaps are bigger. These are the questions I don’t see anyone answering yet.

1. Topology-to-reliability mapping

CAMEL, ChatDev, and MetaGPT all fix their topology at design time. AutoGen makes it configurable but doesn't study the effects. Nobody has varied topology systematically and measured reliability outcomes on the same task set. Hub-and-spoke vs. mesh vs. layered control: do they have different error rates, recovery times, incident severities? We don't know. Magentic-One's architecture lessons and MAS-FIRE's fault taxonomy are each one step away from this kind of study.
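
To make the shape of that study concrete, here's a hypothetical harness. The topology definitions, metric names, and the `run_task` stub are all mine; the point is that once someone commits to a shared task set, the only experimental variable is the routing graph.

```python
# Hypothetical study harness: same agents, same task set, varied topology.
# The only thing that changes between conditions is the routing graph.

TOPOLOGIES = {
    # Orchestrator talks to every worker; workers never talk to each other.
    "hub_and_spoke": {"orchestrator": ["dev", "qa", "reviewer"],
                      "dev": ["orchestrator"], "qa": ["orchestrator"],
                      "reviewer": ["orchestrator"]},
    # Every agent talks to every other agent.
    "mesh": {a: [b for b in ("dev", "qa", "reviewer") if b != a]
             for a in ("dev", "qa", "reviewer")},
    # Strict pipeline, MetaGPT-style layered control.
    "pipeline": {"dev": ["qa"], "qa": ["reviewer"], "reviewer": []},
}

def run_task(task: str, topology: dict[str, list[str]]) -> dict:
    """Stand-in for a real agent runner. Would return incident-level data:
    success, rounds consumed, loops detected, time to recovery."""
    raise NotImplementedError

def study(tasks: list[str]) -> dict[str, dict[str, float]]:
    results = {}
    for name, topo in TOPOLOGIES.items():
        incidents = [run_task(t, topo) for t in tasks]
        results[name] = {
            "error_rate": sum(not i["success"] for i in incidents) / len(incidents),
            "mean_rounds": sum(i["rounds"] for i in incidents) / len(incidents),
        }
    return results
```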

2. CRDTs for multi-agent shared state

MetaGPT's shared pool grows monotonically with no conflict resolution. ChatDev discards dialogue at phase boundaries. Generative Agents' memories are per-agent, with no sharing. Nobody has applied CRDT merge semantics to multi-agent shared state. The CALM theorem predicts when coordination-free execution works and when it doesn't, but the engineering work of building CRDT-backed agent state hasn't been done.
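
For a sense of what that engineering work looks like, here is a minimal last-writer-wins map, about the simplest CRDT there is, pressed into service as agent shared state. The agent-ID tiebreak and the field names are my choices; a real design would have to decide which state is last-writer-wins, which is grow-only, and which needs semantic conflict resolution that an LLM has to arbitrate.

```python
import time
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entry:
    value: str
    timestamp: float
    agent_id: str  # tiebreaker so merges are deterministic

@dataclass
class LWWMap:
    """Last-writer-wins map: each agent keeps a replica and writes locally.
    Merge is commutative, associative, and idempotent, so replicas converge
    no matter what order the messages arrive in."""
    data: dict[str, Entry] = field(default_factory=dict)

    def put(self, key: str, value: str, agent_id: str) -> None:
        self.data[key] = Entry(value, time.time(), agent_id)

    def merge(self, other: "LWWMap") -> None:
        for key, theirs in other.data.items():
            ours = self.data.get(key)
            if ours is None or (theirs.timestamp, theirs.agent_id) > \
                               (ours.timestamp, ours.agent_id):
                self.data[key] = theirs

# Two agents diverge, then merge; both replicas end up identical.
a, b = LWWMap(), LWWMap()
a.put("api_schema", "v1: /users returns list", agent_id="architect")
b.put("api_schema", "v2: /users returns page", agent_id="engineer")
a.merge(b); b.merge(a)
assert a.data == b.data
```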

3. Failure recovery, not just failure detection

Every wave-1 system stops on failure. ChatDev stops after 10 rounds. MetaGPT stops after 3 test failures. AutoGen stops at max_round. None of them model recovery. Can a multi-agent system degrade gracefully, reassign work, escalate, or fall back to a simpler approach? MAS-FIRE's fault-injection framework is the closest thing to a testbed for this, but the recovery strategies it would test don't exist in print yet.
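
A sketch of the missing layer, with all strategy names and placeholder outcomes invented for illustration: an ordered chain of recovery strategies the supervisor walks before giving up, instead of a hard stop at a round limit.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Attempt:
    success: bool
    output: Optional[str] = None
    error: Optional[str] = None

# Cheapest first, most drastic last. Placeholder bodies: a real system
# would wire these to actual agents.
def retry_with_context(task: str, last: Attempt) -> Attempt:
    # Re-run the same agent with the failure appended to its context.
    return Attempt(success=False, error=last.error)

def reassign_to_other_agent(task: str, last: Attempt) -> Attempt:
    # Hand the task to a differently prompted agent.
    return Attempt(success=False, error=last.error)

def simplify_task(task: str, last: Attempt) -> Attempt:
    # Degrade gracefully: attempt a reduced version of the task.
    return Attempt(success=False, error=last.error)

def escalate_to_human(task: str, last: Attempt) -> Attempt:
    # Last resort: surface the failure instead of silently stopping.
    return Attempt(success=True, output=f"escalated: {task}")

RECOVERY_CHAIN: list[Callable[[str, Attempt], Attempt]] = [
    retry_with_context, reassign_to_other_agent,
    simplify_task, escalate_to_human,
]

def recover(task: str, failure: Attempt) -> Attempt:
    """Walk the chain until something succeeds. Every wave-1 system
    effectively runs this loop with a chain of length zero."""
    last = failure
    for strategy in RECOVERY_CHAIN:
        last = strategy(task, last)
        if last.success:
            return last
    return last
```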

4. Reflection for software engineering agents

Generative Agents proved that periodic reflection produces better long-term behavior in simulation. No software engineering paper has tried this. After a Dev-to-QA loop cycles three times, can the system synthesize "this is an architectural issue, not a code issue" and change strategy? That's a reflection primitive adapted to the SE domain. The MAST data on step repetition (15.7 percent) suggests this would help directly.
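
Adapted to SE, the trigger could be as simple as hashing each Dev-to-QA failure signature and reflecting when the same one recurs. The threshold and the signature scheme below are assumptions, and the reflection call itself is left as a comment.

```python
import hashlib
from collections import Counter

class ReflectionTrigger:
    """Fire a reflection step when the same failure signature recurs:
    an SE analogue of Generative Agents' importance-sum trigger."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.signatures: Counter[str] = Counter()

    def observe(self, failing_tests: list[str], error_class: str) -> bool:
        sig = hashlib.sha256(
            ("|".join(sorted(failing_tests)) + error_class).encode()
        ).hexdigest()
        self.signatures[sig] += 1
        return self.signatures[sig] >= self.threshold

trigger = ReflectionTrigger()
for _ in range(3):
    if trigger.observe(["test_auth_flow"], "IntegrationError"):
        # Hypothetical next step: ask the model to synthesize a
        # higher-level diagnosis ("architectural, not a code bug")
        # and replan instead of looping a fourth time.
        print("reflect: same failure three times; change strategy")
```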

5. Benchmark reliability

ChatDev and MetaGPT report contradictory results on each other. Different benchmarks, different metrics, no reproducibility. Incident-level logging against real codebases might provide more trustworthy reliability measurement than self-reported aggregate benchmarks. This is infrastructure work. It's expensive. But the alternative is a field that can't actually tell you which system is better.
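
Starting on that infrastructure doesn't require much; the hard part is agreeing on the record. Here's one hypothetical minimal schema for an incident, with the field names mine and the failure modes ideally drawn from MAST's taxonomy.

```python
from dataclasses import dataclass, asdict
import json, time

@dataclass
class Incident:
    """One failure event in a real run: what broke, where, at what cost.
    Aggregated across runs, these give a reliability measure that doesn't
    depend on self-reported benchmark scores."""
    run_id: str
    agent: str
    phase: str          # e.g. "design", "implement", "test"
    failure_mode: str   # ideally a MAST taxonomy category
    rounds_consumed: int
    recovered: bool
    timestamp: float

def log_incident(incident: Incident, path: str = "incidents.jsonl") -> None:
    # Append-only JSONL: cheap, diffable, easy to aggregate later.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(incident)) + "\n")

log_incident(Incident(
    run_id="run-042", agent="qa", phase="test",
    failure_mode="step_repetition", rounds_consumed=7,
    recovered=False, timestamp=time.time(),
))
```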

6. Backpressure and escalation protocols

MetaGPT's Architect can hallucinate an impossible interface; the Engineer just tries to implement it. ChatDev's dehallucination is the closest thing to backpressure, but it operates at the prompt level. Can agents formally reject or request revision of upstream artifacts? What's the protocol? Does it improve outcomes, or does it just add latency? This is where distributed systems vocabulary (flow control, rejection, retry) maps most directly onto multi-agent AI, and it's barely been explored.
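
A first cut at that protocol might be a typed reply channel: the downstream agent either accepts an artifact or rejects it with machine-readable reasons the upstream agent must address, with a bounded retry so rejection can't loop forever. All the message types and checks here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Accept:
    artifact_id: str

@dataclass
class Reject:
    artifact_id: str
    reason: str          # e.g. "interface references an undefined type"
    must_fix: list[str]  # concrete items the upstream agent must address

Verdict = Union[Accept, Reject]

def engineer_reviews(artifact_id: str, interface_spec: str) -> Verdict:
    """Downstream validation before any implementation starts. In MetaGPT
    today, the Engineer would simply try to implement the spec."""
    problems = [line for line in interface_spec.splitlines()
                if "TODO" in line or "???" in line]
    if problems:
        return Reject(artifact_id, "spec is underspecified", problems)
    return Accept(artifact_id)

MAX_REVISIONS = 2  # bounded retry: backpressure must not become livelock

def handoff(artifact_id: str, spec: str,
            revise: Callable[[str, list[str]], str]) -> Verdict:
    verdict = engineer_reviews(artifact_id, spec)
    for _ in range(MAX_REVISIONS):
        if isinstance(verdict, Accept):
            return verdict
        spec = revise(spec, verdict.must_fix)  # upstream agent revises
        verdict = engineer_reviews(artifact_id, spec)
    return verdict  # still rejected: escalate rather than implement
```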

The Distributed Systems Bridge

The research gap I find most interesting is the one I’ve been flagging throughout this series. The multi-agent AI field has reinvented several problems that distributed systems solved twenty or thirty years ago.

| Distributed systems problem | Multi-agent equivalent | Status in MAS literature |
| --- | --- | --- |
| Lost updates | Two agents overwriting each other's work | Not addressed |
| Causal consistency | Ordering agent actions across a pipeline | Not addressed |
| Coordination avoidance (CALM) | When agents can work without synchronization | Not applied |
| CRDTs | Merging divergent agent views of shared state | Not applied |
| Fault injection (Jepsen) | MAS-FIRE (starting to emerge) | Early work |
| Backpressure | Rejecting upstream inputs | Not formalized |
| Escalation / circuit breaking | What happens when an agent fails | Not addressed |

These aren’t one-to-one mappings. LLM agents have features that distributed systems nodes don’t (they hallucinate, their behavior is probabilistic, their errors are semantic rather than syntactic). But the underlying coordination problems are the same. The right move is to take what worked in distributed systems, adapt it to the semantic messiness of LLMs, and build from there.

Where I Think the Field Is Going

Wave 1 asked whether agents could coordinate at all. The agentic coding turn showed that for a lot of tasks you don't need them to. Wave 2 is about why MAS breaks when you do need it. What comes next, I think, is the wave where multi-agent AI stops pretending it isn't a distributed systems problem and starts applying the full toolkit: CRDTs for shared state, causal ordering for handoffs, fault injection for reliability testing, coordination-avoidance theorems for knowing when to bother synchronizing at all. The groundwork is there. The application hasn't happened.

What I’d Read If I Were Starting Over

If you have limited time and want to get the core of the field fast, here’s the reading list I’d give my past self.

  1. Tran et al. survey (2025) for vocabulary.
  2. CAMEL for the simplest wave-1 mental model.
  3. MetaGPT for the ambitious wave-1 mental model.
  4. SWE-agent for the interface-design lesson from the agentic coding turn.
  5. Magentic-One for a real multi-agent system from the same period.
  6. MAST for what actually goes wrong in the wild.
  7. Anthropic’s research system post for production lessons.
  8. Ou et al. on information sharing for what state sharing actually buys you.
  9. CALM theorem for the theoretical bridge to distributed systems.

Nine papers. If you read those in that order, you have a working model of the field. You won’t have read everything, but you’ll have read enough to understand where new papers fit when you encounter them.

Closing

When I started reading this literature, I thought I was looking at a niche subfield of LLM research. What I found was the multi-agent AI community quietly rediscovering distributed systems, usually without the vocabulary to name what they were rediscovering. Every paper has pieces of the answer. None of them have the full picture. And the full picture, I think, will come from someone who knows both fields well enough to actually bridge them.

That’s the work I’m doing in Caucus. It’s also the work I think the field needs more of, and the reason I wrote this series in the first place. If I’ve saved you a few weeks of reading, that was the whole point.

Thanks for reading.