Getting Up to Speed on Multi-Agent Systems, Part 8: Open Questions
I started this series because I’d been reading multi-agent papers for weeks and wanted a map I wished I’d had on day one. This is the last post. I want to close it by laying out what the field still hasn’t figured out, what I think is worth stealing from adjacent fields, and what I’d read if I had to start over.
Stealable Ideas
Some ideas are not yet general patterns in the field, but they’re battle-tested in individual papers, and any new multi-agent system should probably adopt them. These are the things I’d take from one paper and apply in a different context.
Things Any New Multi-Agent System Should Adopt
None of these are hard to implement. None of them require a research breakthrough. They just haven’t been brought together in a single system yet.
Open Research Questions
The gaps here are bigger: these are the questions I don't see anyone answering yet.
CAMEL, ChatDev, and MetaGPT all fix their topology at design time. AutoGen makes it configurable but doesn't study the effects. Nobody has varied topology systematically and measured reliability outcomes on the same task set. Hub-and-spoke vs. mesh vs. layered control: do they have different error rates, recovery times, incident severities? We don't know. Magentic-One's architecture lessons and MAS-FIRE's fault taxonomy are each one step away from this kind of study.
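The harness for such a study is not the hard part. Each topology is just a set of permitted communication edges over the same agent roster; the study would run an identical task set over each edge set and log incidents. A minimal sketch, where the role names and the edge-set encoding are my own illustration, not from any of the papers:

```python
from itertools import permutations

def hub_and_spoke(agents, hub):
    """Every agent talks only to the hub (orchestrator-style control)."""
    spokes = [a for a in agents if a != hub]
    return {(a, hub) for a in spokes} | {(hub, a) for a in spokes}

def mesh(agents):
    """Every agent may talk to every other agent directly."""
    return set(permutations(agents, 2))

def layered(layers):
    """Agents talk only to the adjacent layer (pipeline-style control)."""
    edges = set()
    for upper, lower in zip(layers, layers[1:]):
        edges |= {(u, d) for u in upper for d in lower}
        edges |= {(d, u) for u in upper for d in lower}
    return edges
```

The point of the encoding is that the rest of the system (task set, fault injection, logging) stays fixed while only the edge set varies, which is exactly the controlled comparison the literature lacks.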
MetaGPT's shared pool grows monotonically with no conflict resolution. ChatDev discards dialogue at phase boundaries. Generative Agents' memories are per-agent with no sharing. Nobody has applied CRDT merge semantics to multi-agent shared state. The CALM theorem predicts when coordination-free execution works and when it doesn't. The engineering work of building CRDT-backed agent state hasn't been done.
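To make concrete what "CRDT-backed agent state" would mean: each agent holds a replica it can write to freely, and merging any two replicas is commutative, associative, and idempotent, so replicas converge without coordination. Here is a minimal last-writer-wins map sketch; the artifact keys are hypothetical, and a real system would need a more careful clock than a single-process monotonic timer:

```python
import time
from dataclasses import dataclass, field

@dataclass
class LWWMap:
    """Last-writer-wins map CRDT: each key carries a timestamp,
    and merge keeps the newer write for every key."""
    entries: dict = field(default_factory=dict)  # key -> (timestamp, value)

    def put(self, key, value):
        self.entries[key] = (time.monotonic(), value)

    def get(self, key):
        entry = self.entries.get(key)
        return entry[1] if entry else None

    def merge(self, other):
        # Commutative, associative, idempotent: replicas can merge
        # in any order, any number of times, and still converge.
        for key, (ts, val) in other.entries.items():
            if key not in self.entries or ts > self.entries[key][0]:
                self.entries[key] = (ts, val)
```

Two agents writing to disjoint keys never conflict; writes to the same key resolve deterministically. That is precisely the property MetaGPT's monotonic pool lacks.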
Every wave-1 system stops on failure. ChatDev stops after 10 rounds. MetaGPT stops after 3 test failures. AutoGen stops at max_round. None of them model recovery. Can a multi-agent system degrade gracefully, reassign work, escalate, fall back to a simpler approach? MAS-FIRE's fault injection framework is the closest thing to a way to test this, but the recovery strategies it would test don't exist in print yet.
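None of the published systems implement recovery, but the shape of a recovery ladder is easy to sketch: retry the current strategy, fall back to a simpler one, and escalate only when every rung is exhausted. Everything here (the strategy-list protocol, the exception handling) is my own assumption about what such a mechanism could look like:

```python
def run_with_recovery(task, strategies, retries_per_strategy=2):
    """Degrade gracefully: walk down a ladder of (name, strategy) pairs
    instead of stopping at the first failure. Raises only when every
    rung of the ladder is exhausted."""
    errors = []
    for name, strategy in strategies:
        for attempt in range(retries_per_strategy):
            try:
                return strategy(task)
            except Exception as exc:  # a real system wants a fault taxonomy here
                errors.append((name, attempt, str(exc)))
    # Final rung: escalate with the full failure history attached.
    raise RuntimeError(f"escalating {task!r}; attempts: {errors}")
```

The interesting research question is what goes in the ladder: a cheaper model, a single-agent fallback, a human. The control flow itself is the easy part.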
Generative Agents proved that periodic reflection produces better long-term behavior in simulation. No software engineering paper has tried this. After a Dev-to-QA loop cycles three times, can the system synthesize "this is an architectural issue, not a code issue" and change strategy? That's a reflection primitive adapted to the SE domain. The MAST data on step repetition (15.7 percent) suggests this would help directly.
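A minimal version of that reflection primitive is just repetition detection over the loop's failure history. The signature format below is my invention; the idea is only that three identical failures should change the system's strategy rather than trigger a fourth identical attempt:

```python
def should_escalate_to_architecture(failure_log, window=3):
    """If the last `window` Dev->QA cycles produce the same failure
    signature, flag the issue as structural instead of retrying
    the same code-level fix."""
    if len(failure_log) < window:
        return False
    recent = failure_log[-window:]
    return len(set(recent)) == 1
```

Generative Agents' reflection is far richer than this (it synthesizes high-level observations from memory), but even this trivial trigger would directly attack the step-repetition failure mode MAST documents.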
ChatDev and MetaGPT report contradictory results on each other. Different benchmarks, different metrics, no reproducibility. Incident-level logging against real codebases might provide more trustworthy reliability measurement than self-reported aggregate benchmarks. This is infrastructure work. It's expensive. But the alternative is a field that can't actually tell you which system is better.
MetaGPT's Architect can hallucinate an impossible interface; the Engineer just tries to implement it. ChatDev's dehallucination is the closest thing to backpressure, but it's prompt-level. Can agents formally reject or request revision of upstream artifacts? What's the protocol? Does it improve outcomes, or does it just add latency? This is where distributed systems vocabulary (flow control, rejection, retry) maps most directly onto multi-agent AI, and it's barely been explored.
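The protocol question is concrete enough to sketch. A downstream agent returns an explicit verdict rather than silently implementing whatever arrives, and rejections flow back upstream as revision requests with a bounded budget. The verdict/review types and the handoff loop are my own illustration of what such a protocol could look like, not anything from the papers:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Verdict(Enum):
    ACCEPT = auto()
    REVISE = auto()

@dataclass
class Review:
    verdict: Verdict
    reason: str = ""

def handoff(artifact, validate, revise, max_revisions=2):
    """Pass an upstream artifact downstream only after it survives
    validation; rejections flow back as revision requests instead of
    being silently implemented."""
    for _ in range(max_revisions + 1):
        review = validate(artifact)
        if review.verdict is Verdict.ACCEPT:
            return artifact
        artifact = revise(artifact, review.reason)
    raise RuntimeError("revision budget exhausted; escalate upstream")
```

The latency question the paragraph raises lives in `max_revisions`: set it to zero and you recover today's fire-and-forget pipelines; set it too high and backpressure becomes a livelock.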
The Distributed Systems Bridge
The research gap I find most interesting is the one I’ve been flagging throughout this series. The multi-agent AI field has reinvented several problems that distributed systems solved twenty or thirty years ago.
| Distributed systems problem | Multi-agent equivalent | Status in MAS literature |
|---|---|---|
| Lost updates | Two agents overwriting each other's work | Not addressed |
| Causal consistency | Ordering agent actions across a pipeline | Not addressed |
| Coordination avoidance (CALM) | When agents can work without synchronization | Not applied |
| CRDTs | Merging divergent agent views of shared state | Not applied |
| Fault injection (Jepsen) | MAS-FIRE (starting to emerge) | Early work |
| Backpressure | Rejecting upstream inputs | Not formalized |
| Escalation / circuit breaking | What happens when an agent fails | Not addressed |
These aren’t one-to-one mappings. LLM agents have features that distributed systems nodes don’t (they hallucinate, their behavior is probabilistic, their errors are semantic rather than syntactic). But the underlying coordination problems are the same. The right move is to take what worked in distributed systems, adapt it to the semantic messiness of LLMs, and build from there.
What I’d Read If I Were Starting Over
If you have limited time and want to get the core of the field fast, here’s the reading list I’d give my past self.
- Tran et al. survey (2025) for vocabulary.
- CAMEL for the simplest wave-1 mental model.
- MetaGPT for the ambitious wave-1 mental model.
- SWE-agent for the interface-design lesson from the agentic coding turn.
- Magentic-One for a real multi-agent system from the same period.
- MAST for what actually goes wrong in the wild.
- Anthropic’s research system post for production lessons.
- Ou et al. on information sharing for what state sharing actually buys you.
- CALM theorem for the theoretical bridge to distributed systems.
Nine papers. If you read those in that order, you have a working model of the field. You won’t have read everything, but you’ll have read enough to understand where new papers fit when you encounter them.
Closing
When I started reading this literature, I thought I was looking at a niche subfield of LLM research. What I found was the multi-agent AI community quietly rediscovering distributed systems, usually without the vocabulary to name what they were rediscovering. Every paper has pieces of the answer. None of them have the full picture. And the full picture, I think, will come from someone who knows both fields well enough to actually bridge them.
That’s the work I’m doing in Caucus. It’s also the work I think the field needs more of, and the reason I wrote this series in the first place. If I’ve saved you a few weeks of reading, that was the whole point.
Thanks for reading.