Getting Up to Speed on Multi-Agent Systems, Part 6: Verification Patterns
Every agent system has to answer the same question eventually: how does it know it did the right thing? Wave-1 papers mostly don’t. Wave-2 papers get serious about it. Wave-3 papers measure what happens when they don’t. And the most interesting verification pattern in the field right now is one that isn’t in any paper at all. It’s in a commercial product.
Three Verification Architectures
Every verification pattern in the field fits into one of three categories. The difference is who checks the work and how.
- Self-verification. The agent that produced the work checks it, or a peer with the same capabilities (and the same blind spots) does.
- Separate verifier. A distinct agent or component, playing a different role, reviews the work.
- External verification. The environment supplies the signal: a compiler, a test runtime, a rendered screen.
Wave-1 papers are mostly in the first category. MetaGPT introduces the second with its QA agent. Wave-2 papers and production systems are moving toward the third.
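To make the taxonomy concrete, here is one way the three shapes could look in code. This is a hypothetical sketch, not an interface from any of these systems; the `ask` and `review` agent methods and the `pytest` invocation are stand-ins for whatever check each verifier actually runs.

```python
import pathlib
import subprocess
from typing import Protocol


class Verifier(Protocol):
    def check(self, work: str) -> bool: ...


class SelfVerifier:
    """1. Self-verification: the producing agent re-reads its own output."""
    def __init__(self, agent):
        self.agent = agent  # hypothetical handle to the same LLM agent

    def check(self, work: str) -> bool:
        # Same model, same modality, same blind spots.
        return self.agent.ask(f"Is this output correct?\n{work}") == "yes"


class ReviewerVerifier:
    """2. Separate verifier: a distinct agent with a reviewer role."""
    def __init__(self, reviewer):
        self.reviewer = reviewer  # hypothetical second agent

    def check(self, work: str) -> bool:
        return self.reviewer.review(work).approved


class TestSuiteVerifier:
    """3. External verification: the environment supplies the signal."""
    def check(self, work: str) -> bool:
        pathlib.Path("candidate.py").write_text(work)
        result = subprocess.run(["pytest", "-q"], capture_output=True)
        return result.returncode == 0  # the test runtime decides, not an LLM
```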
What Each System Actually Does
| System | Pattern | Feedback Signal | Verifier | Modality Shift |
|---|---|---|---|---|
| CAMEL | Dialogue consensus | Partner agrees | Peer (same capability) | No (text → text) |
| ChatDev | Code review + compiler | Reviewer approval + compile | Reviewer agent + compiler | Partial |
| MetaGPT | Unit test execution | Tests pass or fail | Test runtime (external) | Yes (code → result) |
| Generative Agents | None at runtime | N/A | Post-hoc human eval | N/A |
| SWE-agent | ACI feedback + tests | Command output + test results | Environment | Yes |
| Magentic-One | Orchestrator inner loop | Progress assessment | Orchestrator (separate) | Partial |
| Cursor Agent | Visual feedback loop | Screenshot of rendered UI | Same agent (self-verify) | Yes (code → visual) |
The last row is the interesting one.
Cursor’s Visual Feedback Loop
Cursor’s agent mode has a pattern that isn’t in any of the papers. When it’s implementing a UI change, it runs this loop (a minimal sketch in code follows the list):
- Writes code.
- Starts the app or preview.
- Takes a screenshot.
- Looks at the screenshot.
- Decides whether the output matches the intent.
- Fixes or ships.
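Written as a loop, the shape is simple. This is a minimal sketch, assuming hypothetical `write_code`, `render`, `judge`, and `revise` callables; Cursor’s real internals are not public.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    matches_intent: bool
    critique: str


def visual_feedback_loop(
    intent: str,
    write_code: Callable[[str], str],        # intent -> source code
    render: Callable[[str], bytes],          # source code -> screenshot (PNG bytes)
    judge: Callable[[str, bytes], Verdict],  # (intent, screenshot) -> verdict
    revise: Callable[[str, str], str],       # (code, critique) -> fixed code
    max_iters: int = 3,
) -> str:
    """Write code, render it, judge the pixels against the intent, repeat."""
    code = write_code(intent)
    for _ in range(max_iters):
        screenshot = render(code)            # modality shift: text -> pixels
        verdict = judge(intent, screenshot)
        if verdict.matches_intent:
            return code                      # ship
        code = revise(code, verdict.critique)  # fix and go around again
    raise RuntimeError("Loop budget exhausted; escalate to a human.")
```

Everything hinges on the `render` step: the check runs on pixels, not on the text the agent just wrote.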
The pattern is self-verification, which, going by the table above, should be the weakest category. The agent is checking its own work. And MAST data tells us self-verification has a 13.2 percent failure rate in the form of reasoning-action mismatch: the agent thinks it did the right thing but didn’t.
What saves Cursor’s approach is the modality shift. The agent wrote code (text). The verification happens on a screenshot (pixels). Re-reading your own code in the same modality you wrote it is a weak check; looking at the rendered output of that code is a much stronger one. The blind spot that produced the bug rarely survives the translation into pixels, because the agent is now judging a different representation of the work.
The Design Space
Once you’ve internalized the three architectures and the modality-shift principle, you can think about verification design more clearly.
[Figure: Verification Choices Across the Field]
Most wave-1 systems sit in the weakest tier: self- or peer-verification with no modality shift. The agentic coding turn added some modality shift (tests, screenshots). Wave-2 systems and production systems are starting to combine patterns: Cursor pairs self-verification with a modality shift, and the hybrid opportunity is to layer a separate verifier or structural gate on top of that fast inner loop (sketched below).
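One way to picture that hybrid, reusing the hypothetical `Verifier` protocol from the earlier sketch: keep the cheap self-check in the inner loop, and gate shipping on a separate verifier.

```python
from typing import Callable
# `Verifier` is the Protocol from the earlier sketch.


def hybrid_verify(
    work: str,
    inner: "Verifier",   # fast self-check with a modality shift
    gate: "Verifier",    # separate verifier or structural gate
    revise: Callable[[str], str],
    max_inner: int = 3,
) -> str:
    # Fast inner loop: cheap self-verification, run on every iteration.
    for _ in range(max_inner):
        if inner.check(work):
            break
        work = revise(work)
    # Slow outer gate: a separate verifier must approve before shipping.
    if not gate.check(work):
        raise RuntimeError("Gate rejected the work; escalate.")
    return work
```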
Why Verification Is Undertheorized
The multi-agent papers spend a lot of time on coordination and almost no time on verification. This is backwards. The MAST data from wave 2 shows that verification failures (FC3: premature termination, incomplete verification, incorrect verification) account for 23.5 percent of all observed failures across seven frameworks. Grouped together, that is more than any other single failure category.
If verification were a first-class concern, you’d expect to see papers titled “How Agents Should Check Their Work” or “Verification Protocols for Multi-Agent Systems.” Those papers don’t really exist. What we have is a lot of papers that casually mention their verification mechanism in a subsection and move on.
The papers that take verification most seriously are the ones from the agentic coding turn, because they had to. You can’t fake SWE-bench. If your tests don’t pass, you don’t get credit. Interface design, guardrails, and structured feedback all exist in those systems because the benchmark forces them to. When the benchmark is a transcript of agents talking to each other, you don’t need real verification. When the benchmark is a working piece of software, you do.
Next post: benchmarks. What the standard evaluations measure, what they miss, and why ChatDev and MetaGPT can report contradictory results on each other without either being obviously wrong.