Getting Up to Speed on Multi-Agent Systems, Part 6: Verification Patterns

29 Apr 2026

Every agent system has to answer the same question eventually: how does it know it did the right thing? Wave-1 papers mostly don’t. Wave-2 papers get serious about it. Wave-3 papers measure what happens when they don’t. And the most interesting verification pattern in the field right now is one that isn’t in any paper at all. It’s in a commercial product.

Getting Up to Speed on MAS

Part 1. The Landscape
Part 2. The Vocabulary
Part 3. Wave 1: Can Agents Coordinate At All?
Part 4. Wave 2: Why It Breaks
Part 5. Debate, State, and Coordination
Part 6. Verification Patterns (you are here)
Part 7. Benchmarks and What They Miss
Part 8. Open Questions

Three Architectures

Every verification pattern in the field fits into one of three categories. The difference is who checks the work and how.

Three Verification Architectures

Self-Verify

Same agent checks its own work Fast, no coordination overhead Blind to its own mistakes

Separate Verifier

Different agent or system checks the work Catches blind spots the author can't see

Structural Gate

Work cannot proceed without passing a gate Strongest: not advisory, blocking

Wave-1 papers are mostly in the first category. MetaGPT introduces the second with its QA agent. Wave-2 papers and production systems are moving toward the third.

What Each System Actually Does

System	Pattern	Feedback Signal	Verifier	Modality Shift
CAMEL	Dialogue consensus	Partner agrees	Peer (same capability)	No (text → text)
ChatDev	Code review plus compiler	Reviewer approval + compile	Reviewer agent + compiler	Partial
MetaGPT	Unit test execution	Tests pass or fail	Test runtime (external)	Yes (code → result)
Gen. Agents	None at runtime	N/A	Post-hoc human eval	N/A
SWE-agent	ACI feedback + tests	Command output + test results	Environment	Yes
Magentic-One	Orchestrator inner loop	Progress assessment	Orchestrator (separate)	Partial
Cursor Agent	Visual feedback loop	Screenshot of rendered UI	Same agent (self-verify)	Yes (code → visual)

The last row is the interesting one.

Cursor’s Visual Feedback Loop

Cursor’s agent mode has a pattern that isn’t in any of the papers. When it’s implementing a UI change, it does this:

Writes code.
Starts the app or preview.
Takes a screenshot.
Looks at the screenshot.
Decides whether the output matches the intent.
Fixes or ships.

This is self-verification, which from the table above should be the weakest category. The agent is checking its own work. And MAST data tells us self-verification has a 13.2 percent failure rate in the form of reasoning-action mismatch: the agent thinks it did the right thing but didn’t.

What saves Cursor’s approach is the modality shift. The agent wrote code (text). The verification happens on a screenshot (pixels). Re-reading your own code in the same modality you wrote it is a weak check. Looking at the rendered output of your code is a much stronger one. You can’t make the same mistake twice because you’re looking at a different representation of the work.

The modality shift principle

The stronger the modality shift between the work and the verification, the more bugs you catch. Code to test execution (MetaGPT) is a modality shift. Code to screenshot (Cursor) is a modality shift. Code to executable proof (structural gates) is a modality shift. Re-reading your own code is not. This is why wave-1 papers that rely on dialogue consensus score so poorly on reasoning-action-mismatch failures: text to text is not a real check.

The Design Space

Once you’ve internalized the three architectures and the modality-shift principle, you can think about verification design more clearly.

Verification Choices Across the Field

Strongest

Structural gate with modality shift Separate verifier with modality shift

Useful

Self-verify with modality shift Separate verifier without modality shift

Weakest

Self-verify without modality shift (dialogue consensus) No verification at runtime

Most wave-1 systems are in the bottom category. The agentic coding turn added some modality shift (tests, screenshots). Wave-2 systems and production systems are starting to combine patterns: Cursor uses self-verify with modality shift, and the hybrid opportunity is to layer a separate verifier or structural gate on top of that fast inner loop.

Hybrid opportunity

Use visual feedback as the dev agent's inner loop for fast iteration, but gate the output with a separate QA agent or executable proof to catch self-verification blind spots. This is a pattern you can build today, and it's more robust than either approach alone.

Why Verification Is Undertheorized

The multi-agent papers spend a lot of time on coordination and almost no time on verification. This is backwards. The MAST data from wave 2 shows that verification failures (FC3: premature termination, incomplete verification, incorrect verification) account for 23.5 percent of all observed failures across seven frameworks. That’s more than any other single failure category if you group them.

If verification were a first-class concern, you’d expect to see papers titled “How Agents Should Check Their Work” or “Verification Protocols for Multi-Agent Systems.” Those papers don’t really exist. What we have is a lot of papers that casually mention their verification mechanism in a subsection and move on.

The papers that take verification most seriously are the ones from the agentic coding turn, because they had to. You can’t fake SWE-bench. If your tests don’t pass, you don’t get credit. Interface design, guardrails, and structured feedback all exist in those systems because the benchmark forces them to. When the benchmark is a transcript of agents talking to each other, you don’t need real verification. When the benchmark is a working piece of software, you do.

Next post: benchmarks. What the standard evaluations measure, what they miss, and why ChatDev and MetaGPT can report contradictory results on each other without either being obviously wrong.