Getting Up to Speed on Multi-Agent Systems, Part 6: Verification Patterns

29 Apr 2026

Every agent system has to answer the same question eventually: how does it know it did the right thing? Wave-1 papers mostly don’t. Wave-2 papers get serious about it. Wave-3 papers measure what happens when they don’t. And the most interesting verification pattern in the field right now is one that isn’t in any paper at all. It’s in a commercial product.

Three Architectures

Every verification pattern in the field fits into one of three categories. The difference is who checks the work and how.

  - Self-Verify: the same agent checks its own work. Fast, with no coordination overhead, but blind to its own mistakes.
  - Separate Verifier: a different agent or system checks the work. Catches blind spots the author can't see.
  - Structural Gate: work cannot proceed without passing the gate. Strongest of the three: not advisory, but blocking.

Wave-1 papers are mostly in the first category. MetaGPT introduces the second with its QA agent. Wave-2 papers and production systems are moving toward the third.
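The three categories can be sketched in a few lines of toy Python. Everything here is illustrative, not the API of any framework: the buggy producer, the blind self-review, and the external check are invented stand-ins chosen to show how the three architectures react to the same mistake.

```python
def self_verify(produce, review, task):
    # Category 1: the producing agent reviews its own output (advisory).
    work = produce(task)
    return work, review(task, work)

def separate_verifier(produce, verify, task):
    # Category 2: an independent system reviews the output (advisory).
    work = produce(task)
    return work, verify(task, work)

def structural_gate(produce, gate, task, max_tries=3):
    # Category 3: nothing ships until the gate passes (blocking).
    for _ in range(max_tries):
        work = produce(task)
        if gate(task, work):
            return work
    raise RuntimeError("gate never passed; nothing ships")

# A toy producer with a blind spot: it silently drops half the task.
def produce(task):
    return task[: len(task) // 2]

def self_review(task, work):
    # Self-review shares the producer's blind spot, so it approves.
    return work == produce(task)

def external_check(task, work):
    # An external check compares the work against the task itself.
    return work == task
```

Run against the same buggy producer, self-verify approves the truncated output, the separate verifier flags it, and the structural gate refuses to ship anything at all.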

What Each System Actually Does

| System | Pattern | Feedback Signal | Verifier | Modality Shift |
| --- | --- | --- | --- | --- |
| CAMEL | Dialogue consensus | Partner agrees | Peer (same capability) | No (text → text) |
| ChatDev | Code review plus compiler | Reviewer approval + compile | Reviewer agent + compiler | Partial |
| MetaGPT | Unit test execution | Tests pass or fail | Test runtime (external) | Yes (code → result) |
| Gen. Agents | None at runtime | N/A | Post-hoc human eval | N/A |
| SWE-agent | ACI feedback + tests | Command output + test results | Environment | Yes |
| Magentic-One | Orchestrator inner loop | Progress assessment | Orchestrator (separate) | Partial |
| Cursor Agent | Visual feedback loop | Screenshot of rendered UI | Same agent (self-verify) | Yes (code → visual) |

The last row is the interesting one.

Cursor’s Visual Feedback Loop

Cursor’s agent mode has a pattern that isn’t in any of the papers. When it’s implementing a UI change, it does this:

  1. Writes code.
  2. Starts the app or preview.
  3. Takes a screenshot.
  4. Looks at the screenshot.
  5. Decides whether the output matches the intent.
  6. Fixes or ships.
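The six steps above can be sketched as a single loop. This is a hypothetical reconstruction, not Cursor's actual implementation: `write_code`, `render`, and `matches_intent` are invented stand-ins for the model call, the app preview plus screenshot, and the vision check.

```python
def visual_feedback_loop(write_code, render, matches_intent, intent, max_rounds=3):
    code = write_code(intent, feedback=None)       # 1. write code
    for _ in range(max_rounds):
        pixels = render(code)                      # 2-3. run the app, screenshot it
        if matches_intent(intent, pixels):         # 4-5. look, compare to intent
            return code                            # 6. ship
        code = write_code(intent, feedback=pixels) # 6. fix and go around again
    return code

# Toy stubs: the first attempt is wrong, the second matches the intent.
attempts = iter(['<button style="color:red">', '<button style="color:blue">'])

def write_code(intent, feedback):
    return next(attempts)

def render(code):
    return code  # stand-in: the "pixels" here are just the markup

def matches_intent(intent, pixels):
    return intent in pixels

shipped = visual_feedback_loop(write_code, render, matches_intent, "blue")
```

The structure is the point: the retry signal comes from looking at the rendered output, not from re-reading the code.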

This is self-verification, which from the table above should be the weakest category. The agent is checking its own work. And MAST data tells us self-verification has a 13.2 percent failure rate in the form of reasoning-action mismatch: the agent thinks it did the right thing but didn’t.

What saves Cursor’s approach is the modality shift. The agent wrote code (text). The verification happens on a screenshot (pixels). Re-reading your own code in the same modality you wrote it is a weak check. Looking at the rendered output of that code is a much stronger one: the blind spot that produced the bug doesn’t carry over, because you’re now looking at a different representation of the work.

The modality shift principle
The stronger the modality shift between the work and the verification, the more bugs you catch. Code to test execution (MetaGPT) is a modality shift. Code to screenshot (Cursor) is a modality shift. Code to executable proof (structural gates) is a modality shift. Re-reading your own code is not. This is why wave-1 papers that rely on dialogue consensus score so poorly on reasoning-action-mismatch failures: text to text is not a real check.

The Design Space

Once you’ve internalized the three architectures and the modality-shift principle, you can think about verification design more clearly.

Verification Choices Across the Field

Strongest
  - Structural gate with modality shift
  - Separate verifier with modality shift

Useful
  - Self-verify with modality shift
  - Separate verifier without modality shift

Weakest
  - Self-verify without modality shift (dialogue consensus)
  - No verification at runtime

Most wave-1 systems are in the bottom category. The agentic coding turn added some modality shift (tests, screenshots). Wave-2 systems and production systems are starting to combine patterns: Cursor uses self-verify with modality shift, and the hybrid opportunity is to layer a separate verifier or structural gate on top of that fast inner loop.

Hybrid opportunity
Use visual feedback as the dev agent's inner loop for fast iteration, but gate the output with a separate QA agent or executable proof to catch self-verification blind spots. This is a pattern you can build today, and it's more robust than either approach alone.
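One way that layering could look, as a sketch under stated assumptions: `write_code`, `looks_right`, and `gate` are all hypothetical stand-ins for the dev agent, its cheap self-check (the screenshot loop), and an independent blocking check (a QA agent or test suite).

```python
def hybrid(write_code, looks_right, gate, intent, inner_rounds=3, outer_rounds=3):
    for _ in range(outer_rounds):
        code = write_code(intent)
        # Inner loop: fast, cheap self-verification (e.g. the visual check).
        for _ in range(inner_rounds):
            if looks_right(intent, code):
                break
            code = write_code(intent)
        # Outer gate: independent and blocking (e.g. a QA agent or tests).
        if gate(code):
            return code
    raise RuntimeError("gate rejected every candidate; nothing ships")

# Toy stubs: self-verify approves everything; only the gate discriminates.
candidates = iter(["v1", "v2", "v3"])

def write_code(intent):
    return next(candidates)

def looks_right(intent, code):
    return True  # the blind inner check: approves whatever it wrote

def gate(code):
    return code == "v2"  # the independent check; only "v2" passes here
```

The inner loop keeps iteration cheap; the outer gate means a self-verification blind spot can waste a round, but can't ship.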

Why Verification Is Undertheorized

The multi-agent papers spend a lot of time on coordination and almost no time on verification. This is backwards. The MAST data from wave 2 shows that verification failures (FC3: premature termination, incomplete verification, incorrect verification) account for 23.5 percent of all observed failures across seven frameworks. Grouped together, that is a larger share than any other failure category.

If verification were a first-class concern, you’d expect to see papers titled “How Agents Should Check Their Work” or “Verification Protocols for Multi-Agent Systems.” Those papers don’t really exist. What we have is a lot of papers that casually mention their verification mechanism in a subsection and move on.

The papers that take verification most seriously are the ones from the agentic coding turn, because they had to. You can’t fake SWE-bench. If your tests don’t pass, you don’t get credit. Interface design, guardrails, and structured feedback all exist in those systems because the benchmark forces them to. When the benchmark is a transcript of agents talking to each other, you don’t need real verification. When the benchmark is a working piece of software, you do.

Next post: benchmarks. What the standard evaluations measure, what they miss, and why ChatDev and MetaGPT can report contradictory results on each other without either being obviously wrong.