<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Christopher Meiklejohn</title>
		<description>Notes on multi-agent systems, distributed computing, fault injection, and the engineering practice of shipping research. By Christopher Meiklejohn.</description>
		<link>https://christophermeiklejohn.com</link>
		<atom:link href="https://christophermeiklejohn.com/feed.xml" rel="self" type="application/rss+xml" />
		
			<item>
				<title>Rift · For thirty years I programmed with Phish on, every day. In 2026, the music is out of phase with the work.</title>
				<description>&lt;p&gt;Someone on the Phish Facebook group reposted a TikTok overdub. Vanessa Bayer and Paul Rudd at a lunch table, losing their minds to a song while their coworkers stare. The original was Fleetwood Mac. Whoever made it swapped in “Down With Disease.”&lt;/p&gt;

&lt;p&gt;That move is Phish fans in miniature. Someone cared enough about the song and the bit that they rebuilt a piece of pop culture around the band. That’s how the scene works. People spend their time doing things like this for free, because the music asks for it.&lt;/p&gt;

&lt;p&gt;For thirty years, that was me at my desk.&lt;/p&gt;

&lt;p&gt;I used to make a joke that if I ever had to interview for a new job, I’d need to ask the interviewer to put Phish on so I could actually program for them. I’d say it as a joke, because saying it straight would have made it sound deranged. But it wasn’t a joke. After three decades, the cue and the state had fused. I could not, with any reliability, get into the zone without the music. The conditioning was complete and I knew it.&lt;/p&gt;

&lt;p&gt;I would make the joke and people would laugh, and I would laugh too, and underneath that we both knew I was telling the truth.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;I got into Phish in 1995. By then I had already been programming for years, self-taught. In 1998 I got my first professional job in tech. I was 15.&lt;/p&gt;

&lt;p&gt;Around that time I also tried to get a normal teenage job. There was a grocery store near my house and I went in to apply, figuring I could bag groceries on weekends like everybody else. They turned me down. Not because I was too young or too inexperienced. They told me I was overqualified. A 15-year-old kid with programming on his application was, somehow, too much for the grocery store.&lt;/p&gt;

&lt;p&gt;So I kept programming. There was never any other plan.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;All I ever wanted to do was listen to Phish and program. That was the whole list. It didn’t have qualifiers. It didn’t have a third thing I sometimes wanted instead. There was no balance I was striving for. There was the music and the code, and there wasn’t anything else competing for the space.&lt;/p&gt;

&lt;p&gt;I was blessed enough to be able to make a career out of it. For thirty years, the thing I most wanted to do was the thing I got paid for. That isn’t true for most people, and I knew it then, and I know it now.&lt;/p&gt;

&lt;p&gt;Other kids my age were figuring out what they liked, trying things on, growing into and out of phases. I was watching them do it from a desk. I had picked early. I started writing code as a kid. I heard Phish for the first time at thirteen. By the time I was fifteen and had a professional gig, the picking was settled. I had two things, and I didn’t want a third.&lt;/p&gt;

&lt;p&gt;If I had a free Friday night, I knew what I was doing with it. If I had a long weekend, I knew what I was doing with it. If a holiday came up, I knew what I was doing with it. The activity didn’t change. The output changed, the project changed, the song changed, but the shape of the time was constant.&lt;/p&gt;

&lt;p&gt;For the next three decades, that’s what it stayed. I would put on Phish and write code. That was the day. That was the night. It was my job, and it was also my hobby, and there was no seam between them.&lt;/p&gt;

&lt;p&gt;The work I did in that state was the work I am most proud of. Distributed systems. Backend services. The hard stuff that needs you to hold a lot in your head at once and stay there. Phish is a band that rewards you for staying in one place for a long time. The jams are long. The compositions unfold. If you give it an hour, it gives you something back. That matched the shape of the work exactly.&lt;/p&gt;

&lt;p&gt;Before grad school, I had a day job at Berklee College of Music writing music software, and night classes at Northeastern. I’d take the 12:00 AM train home. I’d put Junta on as I sat down. Most nights I’d fall asleep to it before the train pulled in. (This might be why I love “Foam” so much.)&lt;/p&gt;

&lt;p&gt;I was in graduate school for a decade. The bulk of the dissertation, more than two hundred pages by the end, got written between 2021 and 2023, after I came back to Pittsburgh from Europe. I was too poor to travel to shows. So I planned nights of couch tour. There was a live stream. I would set it up on one screen and write on the other. The band would play in Hampton or Alpine Valley or wherever, and I would write about distributed systems while they played, and at some point in the second set the dissertation would crack open a little and I would understand something I had not understood that morning.&lt;/p&gt;

&lt;p&gt;The dissertation is the longest single thing I made inside that ritual, but it isn’t the only thing. Entire pieces of production software came out of those nights too. Systems that ran for years, handled real load, served real users. Whole systems, from the first commit to the version that shipped. I’d put a show on and stay inside the work until something existed that hadn’t existed when the show started.&lt;/p&gt;

&lt;p&gt;I have listened to Phish every day since I was fifteen. Every day. The years I lived in Europe earlier in graduate school, where going to a show meant flying back across an ocean, I listened. I would sit at my desk in another country and put on a show from the nineties and code. I have listened to certain shows so many times that I can sing the solos back, note by note, without thinking about it. Boardwalk Hall Halloween. NYE 1995. Trey will play a phrase and my mouth will already be ahead of him.&lt;/p&gt;

&lt;p&gt;I felt lucky. I still feel lucky. There aren’t many people who get to spend thirty years inside the thing they loved at fifteen.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Since January, the work has changed.&lt;/p&gt;

&lt;p&gt;I don’t really write code anymore. The main thing now is managing agents. I open a session, ask a question, redirect, switch to a different one, check on a merge, review what came back, send it back for changes, switch again. The day is a queue. Things finish at different times and require different responses, and the responses are short and the contexts are constantly different.&lt;/p&gt;

&lt;p&gt;This is engineering. I keep being told that. It is engineering and it is the future and it is more leveraged than what I used to do. All of that is probably true. But it is not the work I have been doing for thirty years. The shape of it is different. The rhythm is different. The way it sits in the day is different.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;I tried to keep the music on. I’m writing this in the days after nine nights of Phish at the Sphere. Since I finished grad school and got a real job, I’ve gone to every show I could, every tour, every residency, making up for lost time. The music is more present in my life now than it has ever been. It isn’t what’s gone.&lt;/p&gt;

&lt;p&gt;But the music is out of phase with the work. The jams are built for one continuous arc of attention. The work is staccato. I’ll be three minutes into a song and I will have already context-switched four times. The song is happening and the work is happening and they’re no longer happening together. They’re parallel, but they no longer touch.&lt;/p&gt;

&lt;p&gt;I’m sad. I don’t get into that state anymore. I don’t know how to be honest about this without sounding like I am complaining about progress, but I can’t pretend that something hasn’t been taken. The flow state I had for thirty years is not part of my workday now. The creativity that lived inside it is not there either. I do useful things. I do not feel what I used to feel while doing them.&lt;/p&gt;

&lt;p&gt;I keep thinking about that overdub. Vanessa Bayer at the lunch table, lost in the song, blissfully unhinged while the rest of the world keeps on doing whatever it is doing. For thirty years, I was her. Now I’m the coworker. I’m at the desk. I’m watching.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;The flow state wasn’t just where I got things done. It was where the fulfillment lived. The creativity, the involvement, the sense of being inside the thing instead of next to it. That’s what programming and Phish gave me for thirty years. It’s what supervision takes away.&lt;/p&gt;

&lt;p&gt;What is flow in an agentic world? How do we bring it back?&lt;/p&gt;
</description>
				<pubDate>Sun, 03 May 2026 09:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/personal/phish/flow/agents/2026/05/03/rift.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/personal/phish/flow/agents/2026/05/03/rift.html</guid>
			</item>
		
			<item>
				<title>Babysitting the Agent · Two weeks in, even with all the hooks I&apos;ve built, working with the agent has become a chore. Every shipped feature ends with me clicking through it to find out what didn&apos;t actually work.</title>
				<description>&lt;p&gt;I’m building &lt;a href=&quot;https://zabriskie.app&quot;&gt;Zabriskie&lt;/a&gt;, a social app for live music, mostly with a coding agent. I want to write something honest about what the last two weeks have actually felt like, because the data and the lived experience have been pointing at the same thing and I keep dressing it up in posts that argue for it more carefully than I need to.&lt;/p&gt;

&lt;p&gt;The honest version is: I feel like a goddamn babysitter.&lt;/p&gt;

&lt;p&gt;Every PR, every ship, every blog draft, every config change ends the same way. The agent declares the work done. I open the thing. I click around. I find the part that doesn’t work. I tell the agent what I found. The agent fixes that one part. I open the thing again. I find another part. I tell the agent. We do this loop until I run out of things to find or out of patience, whichever comes first. Usually it’s patience.&lt;/p&gt;

&lt;p&gt;What’s underneath the loop is that the agent is constantly doing the minimum amount of work required to declare victory. Build the thing. Run the cheap checks. Take a screenshot. Write the summary message. Done. That’s the threshold. Not “the user can use this.” Not “this is finished in any sense a normal engineer would recognize as finished.” Just: enough output exists that I can plausibly claim I shipped. The agent never works to the finish. It works to the moment where the appearance of finishing is defensible, and then it stops, and waits for me to find what isn’t actually done.&lt;/p&gt;

&lt;p&gt;This is supposed to be the thing the guardrails fix. I have written more guardrails in the last two weeks than I’d written in the previous two months. Fifty-two new ones in fourteen days, by my count. Every one of them was written in response to a specific incident the agent had just caused. The shape of them:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A pre-push hook that refuses to push unless the local Playwright suite ran more recently than HEAD. Written after the agent claimed local tests passed when the only recent run was a stale one from a different branch.&lt;/li&gt;
  &lt;li&gt;A pre-push check that refuses to push if the branch is behind main. Written after the agent shipped a PR whose CI had been green against an old base, then watched it fail when it landed.&lt;/li&gt;
  &lt;li&gt;A PreToolUse hook that blocks Edit and Write on branches whose PRs already merged. Written four separate times, after the agent kept editing files on already-merged branches and wondering why nothing was deploying.&lt;/li&gt;
  &lt;li&gt;A pre-commit scan for hardcoded colors. Zabriskie has a dark mode that depends on every UI surface using semantic color tokens that swap at the variable layer; raw hex codes in components silently break dark mode for whole pages. The agent kept reaching for raw hex anyway. The scan now blocks the commit.&lt;/li&gt;
  &lt;li&gt;A pre-commit migration linter. Written after a migration deduped multi-set festival shows on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(band, date, venue)&lt;/code&gt; and silently destroyed user attendance for shows where a band played twice on the same day.&lt;/li&gt;
  &lt;li&gt;A PR template that requires the author to check off “screenshot of the change working in local dev attached.” Written, ignored, written more aggressively, ignored more creatively.&lt;/li&gt;
&lt;/ul&gt;
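
&lt;p&gt;To make one of these concrete, here is a minimal sketch of what a hardcoded-color scan could look like. It is not the actual Zabriskie hook: the regular expression, the function names, and the error message are all assumptions made for illustration, and a real scan would need an allowlist, since a bare hex pattern also matches things like anchor fragments or issue numbers.&lt;/p&gt;

```python
import re

# Hypothetical pattern: hex color literals with 3, 4, 6, or 8 digits,
# such as #fff or #1a2b3c. Longest alternatives are tried first.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b")


def find_hardcoded_colors(source: str) -> list[tuple[int, str]]:
    """Return (line_number, literal) pairs for every hex color found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in HEX_COLOR.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def check_files(paths: list[str]) -> int:
    """Exit status for a pre-commit hook: 1 blocks the commit, 0 allows it."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for lineno, literal in find_hardcoded_colors(handle.read()):
                print(f"{path}:{lineno}: hardcoded color {literal}; use a semantic token")
                failed = True
    return 1 if failed else 0
```

&lt;p&gt;A hook script would pass the staged file paths to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;check_files&lt;/code&gt; and exit with its return value, so any hit stops the commit.&lt;/p&gt;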

&lt;p&gt;Each new hook works for its specific shape. Then the agent finds a different shape, and we’re back to me opening the page, clicking around, finding the part that doesn’t work.&lt;/p&gt;

&lt;p&gt;This past Saturday, the agent shipped a redesign of the festival pages. Backend compiled. E2E suite said “543 passed, 0 failed.” Screenshots looked fine. I merged the PR. Then I opened the deployed site. The hero was a solid black rectangle. The cards on the index page were not clickable. Half the page didn’t match the design. The bugs were not subtle. The hero was black because of a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fmt.Sprintf&lt;/code&gt; issue that produced an invalid URL-encoded color in the SVG. The cards weren’t clickable because the component the redesign reused for them silently dropped the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;action&lt;/code&gt; prop. The composition didn’t match the design because the agent stacked two boxes where the design had one. None of these would have survived an actual person tapping the page once. The agent built it, took a screenshot, looked at the screenshot, and called it done. I had to be the person who tapped.&lt;/p&gt;

&lt;p&gt;I spent the next hour telling the agent, in five separate messages, about each broken thing in turn. Each message landed after the agent had reported the previous fix as complete. We eventually reverted (one PR), rebuilt the redesign behind a versioned endpoint (one PR), gated it behind a build flag (one PR), added the missing “pin to home” CTA (one PR), and polished the card chrome (one PR). Five PRs to deliver what one had been supposed to. The rebuild worked. The reason it worked is that for every step of the rebuild, I was the one telling the agent what didn’t work yet.&lt;/p&gt;

&lt;p&gt;That’s not a one-off. The dataset I keep on this, a Postgres table of every notable agent failure, logged 22 incidents the first week of this window and 34 the second week. The dominant failure mode in both weeks was the same one I’m describing here: agent claimed a thing worked, user found out it didn’t. Eleven of twenty-two in week one. Sixteen of thirty-four in week two. Roughly half, regardless of how many guardrails went in. The high-severity count went up in absolute terms too, from nine to eleven. And those numbers span four configurations of the model that did the work (Opus 4.6, Opus 4.6 with the 1M context window, brief experiments with Sonnet 4.5, and this week Opus 4.7). The shape of the failure is consistent across all of them.&lt;/p&gt;

&lt;p&gt;Even getting the data together for this post was a slog, because the agent kept ignoring what I’d actually asked for. I wanted a survey of the last two weeks. I got back a post about one incident, then another version still narrowed in the wrong direction, then a third leaning so hard on the dataset that it read like a defended thesis instead of the offhand observation I’d been after. You are reading something like attempt five.&lt;/p&gt;

&lt;p&gt;I built the hooks because I was tired of saying the same things. Now I’m tired of saying the things the hooks don’t catch. Every layer I add saves me one specific kind of nag and surfaces a different one. The total nagging stays roughly constant, or goes up, depending on the week. The festival redesign took five follow-up PRs to land at parity with what the original was supposed to deliver. This post took several drafts to land at parity with what one prompt was supposed to deliver. Both of those would have been cheaper to produce by myself, if I weren’t trying to learn something about working this way.&lt;/p&gt;

&lt;p&gt;I keep starting these posts thinking I’m going to land somewhere constructive. &lt;em&gt;Here is the next guardrail. Here is the framework. Here is the PR template that closes the gap.&lt;/em&gt; And I do have a vague plan for the PR template that requires evidence-of-use rather than evidence-of-render. I’ll probably ship it next week. It will catch one more shape of failure. There will be another shape underneath it.&lt;/p&gt;

&lt;p&gt;The thing I don’t have a fix for is the part where I have to be in the room, watching, every time. The hooks free me from having to remind the agent of any specific rule. They don’t free me from having to be the test.&lt;/p&gt;

&lt;p&gt;Two months into building Zabriskie this way, this is what working with the agent has come to feel like. It’s not the dramatic failures. It’s the steady, low-grade load of being the one who actually checks. Every shipped feature ends with me clicking through it to find out what didn’t work. Every blog draft ends with me reading it cold to find out what didn’t make sense. Every change ends with me. The agent does the typing. I do the checking. And I’m tired.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;This is part of a series about building &lt;a href=&quot;https://zabriskie.app&quot;&gt;Zabriskie&lt;/a&gt; with Claude. Previously: &lt;a href=&quot;/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html&quot;&gt;Memory Isn’t Learning&lt;/a&gt;, &lt;a href=&quot;/ai/zabriskie/agents/reliability/caucus/2026/04/14/opt-in-isnt-a-guardrail.html&quot;&gt;Opt-In Isn’t a Guardrail&lt;/a&gt;, &lt;a href=&quot;/ai/zabriskie/agents/reliability/caucus/2026/04/21/the-tax-on-the-happy-path.html&quot;&gt;The Tax on the Happy Path&lt;/a&gt;, &lt;a href=&quot;/ai/zabriskie/agents/reliability/2026/04/23/the-tribe-has-to-outlive-the-model.html&quot;&gt;The Tribe Has to Outlive the Model&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</description>
				<pubDate>Sun, 03 May 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/zabriskie/agents/reliability/2026/05/03/click-the-button.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/zabriskie/agents/reliability/2026/05/03/click-the-button.html</guid>
			</item>
		
			<item>
				<title>Getting Up to Speed on Multi-Agent Systems, Part 8: Open Questions</title>
				<description>&lt;p&gt;I started this series because I’d been reading multi-agent papers for weeks and wanted a map I wished I’d had on day one. This is the last post. I want to close it by laying out what the field still hasn’t figured out, what I think is worth stealing from adjacent fields, and what I’d read if I had to start over.&lt;/p&gt;

&lt;div class=&quot;mas-series-nav&quot;&gt;
  &lt;div class=&quot;mas-series-label&quot;&gt;Getting Up to Speed on MAS&lt;/div&gt;
  &lt;ol&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/24/mas-series-01-the-landscape.html&quot;&gt;Part 1. The Landscape&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/25/mas-series-02-the-vocabulary.html&quot;&gt;Part 2. The Vocabulary&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/26/mas-series-03-wave-one.html&quot;&gt;Part 3. Wave 1: Can Agents Coordinate At All?&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/27/mas-series-04-wave-two.html&quot;&gt;Part 4. Wave 2: Why It Breaks&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/28/mas-series-05-debate-state-coordination.html&quot;&gt;Part 5. Debate, State, and Coordination&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/29/mas-series-06-verification-patterns.html&quot;&gt;Part 6. Verification Patterns&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/30/mas-series-07-benchmarks.html&quot;&gt;Part 7. Benchmarks and What They Miss&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;mas-current&quot;&gt;&lt;strong&gt;Part 8. Open Questions (you are here)&lt;/strong&gt;&lt;/li&gt;&lt;/ol&gt;
&lt;/div&gt;

&lt;h2 id=&quot;stealable-ideas&quot;&gt;Stealable Ideas&lt;/h2&gt;

&lt;p&gt;Some ideas are not yet general patterns in the field, but they’re battle-tested in individual papers, and any new multi-agent system should probably adopt them. These are the things I’d take from one paper and apply in a different context.&lt;/p&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;h4&gt;Things Any New Multi-Agent System Should Adopt&lt;/h4&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Artifacts&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Structured documents between stages (MetaGPT)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Clarification&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Agents can ask before they act (ChatDev dehallucination)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Reflection&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Importance-triggered synthesis (Generative Agents)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Memory retrieval&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Recency x relevance x importance (Generative Agents)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Shared state&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Append-only notebook for structured info (Ou et al.)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Tool interface&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;ACI-quality commands with guardrails (SWE-agent)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Stuck detection&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Count loops, trigger replanning (Magentic-One)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Sandboxing&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Docker plus permission configs (AutoDev, OpenHands)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Verification&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Modality shift: code to visual, code to tests (Cursor, MetaGPT)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;None of these are hard to implement. None of them require a research breakthrough. They just haven’t been brought together in a single system yet.&lt;/p&gt;
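
&lt;p&gt;As an illustration of how small these pieces can be, here is a minimal sketch of the stuck-detection row: count consecutive repeats of the same action and signal when a threshold is crossed. This is a simplification and not Magentic-One’s actual mechanism. The class name, the threshold of three, and the choice to compare actions as plain strings are all assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallDetector:
    """Counts consecutive repeated actions and signals when to replan."""

    max_repeats: int = 3  # hypothetical threshold, not taken from any paper
    _last_action: Optional[str] = None
    _repeats: int = 0

    def observe(self, action: str) -> bool:
        """Record an action; return True when the agent looks stuck."""
        if action == self._last_action:
            self._repeats += 1
        else:
            self._last_action = action
            self._repeats = 1
        return self._repeats >= self.max_repeats

    def reset(self) -> None:
        """Clear the counter, for example after a replanning step."""
        self._last_action = None
        self._repeats = 0
```

&lt;p&gt;An orchestrator would call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;observe&lt;/code&gt; after every agent step and trigger replanning, followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reset&lt;/code&gt;, whenever it returns true. Exact string equality is the crudest possible notion of “the same action”; a real system would likely need something fuzzier.&lt;/p&gt;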

&lt;h2 id=&quot;open-research-questions&quot;&gt;Open Research Questions&lt;/h2&gt;

&lt;p&gt;The gaps are bigger. These are the questions I don’t see anyone answering yet.&lt;/p&gt;

&lt;div class=&quot;mas-paper-card&quot; style=&quot;border-left-color: var(--mas-pink);&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;1. Topology-to-reliability mapping&lt;/strong&gt;
  &lt;/div&gt;
  &lt;p&gt;CAMEL, ChatDev, and MetaGPT all fix their topology at design time. AutoGen makes it configurable but doesn&apos;t study the effects. Nobody has varied topology systematically and measured reliability outcomes on the same task set. Hub-and-spoke vs mesh vs layered control: do they have different error rates, recovery times, or incident severities? We don&apos;t know. Magentic-One&apos;s architecture lessons and MAS-FIRE&apos;s fault taxonomy are both one step away from this kind of study.&lt;/p&gt;
&lt;/div&gt;

&lt;div class=&quot;mas-paper-card&quot; style=&quot;border-left-color: var(--mas-pink);&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;2. CRDTs for multi-agent shared state&lt;/strong&gt;
  &lt;/div&gt;
  &lt;p&gt;MetaGPT&apos;s shared pool grows monotonically with no conflict resolution. ChatDev discards dialogue at phase boundaries. Generative Agents&apos; memories are per-agent with no sharing. Nobody has applied CRDT merge semantics to multi-agent shared state. The CALM theorem predicts when coordination-free works and when it doesn&apos;t. The engineering work of building CRDT-backed agent state hasn&apos;t been done.&lt;/p&gt;
&lt;/div&gt;
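
&lt;p&gt;For readers who haven’t met CRDTs, here is a minimal sketch of what merge semantics for agent shared state could look like, using a last-writer-wins map. Everything in it is an assumption made for illustration: the class name, the key names in the usage below, and the choice of last-writer-wins itself, which silently drops the losing write and may well be the wrong policy for semantic content. It also assumes an agent never reuses a timestamp.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# A stamp orders writes: logical timestamp first, agent id as the tie-breaker.
Stamp = tuple[int, str]


@dataclass
class LWWMap:
    """Last-writer-wins map: each key keeps the value with the highest stamp."""

    entries: dict[str, tuple[Stamp, str]] = field(default_factory=dict)

    def set(self, key: str, value: str, timestamp: int, agent: str) -> None:
        """Apply a local write unless a newer write is already stored."""
        stamp = (timestamp, agent)
        current = self.entries.get(key)
        if current is None or stamp > current[0]:
            self.entries[key] = (stamp, value)

    def merge(self, other: "LWWMap") -> "LWWMap":
        """Return a new map holding the winning entry for every key."""
        merged = LWWMap(dict(self.entries))
        for key, (stamp, value) in other.entries.items():
            current = merged.entries.get(key)
            if current is None or stamp > current[0]:
                merged.entries[key] = (stamp, value)
        return merged

    def value(self, key: str):
        """Return the current value for a key, or None if it was never set."""
        entry = self.entries.get(key)
        return None if entry is None else entry[1]
```

&lt;p&gt;Two agents can each write to their own replica and merge in either order with the same result, which is the property that removes the need for coordination:&lt;/p&gt;

```python
architect, engineer = LWWMap(), LWWMap()
architect.set("api_spec", "v1 draft", 1, "architect")
engineer.set("api_spec", "v2 draft", 2, "engineer")
# Merge order does not matter: both replicas converge on the same entries.
assert architect.merge(engineer).entries == engineer.merge(architect).entries
```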

&lt;div class=&quot;mas-paper-card&quot; style=&quot;border-left-color: var(--mas-pink);&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;3. Failure recovery, not just failure detection&lt;/strong&gt;
  &lt;/div&gt;
  &lt;p&gt;Every wave-1 system stops on failure. ChatDev stops after 10 rounds. MetaGPT stops after 3 test failures. AutoGen stops at max_round. None of them model recovery. Can a multi-agent system degrade gracefully, reassign work, escalate, fall back to a simpler approach? MAS-FIRE&apos;s fault injection framework is the closest thing to a way to test this, but the recovery strategies it would test don&apos;t exist in print yet.&lt;/p&gt;
&lt;/div&gt;

&lt;div class=&quot;mas-paper-card&quot; style=&quot;border-left-color: var(--mas-pink);&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;4. Reflection for software engineering agents&lt;/strong&gt;
  &lt;/div&gt;
  &lt;p&gt;Generative Agents showed that periodic reflection produces better long-term behavior in simulation. No software engineering paper has tried this. After a Dev-to-QA loop cycles three times, can the system synthesize &quot;this is an architectural issue, not a code issue&quot; and change strategy? That&apos;s a reflection primitive adapted to the SE domain. The MAST data on step repetition (15.7 percent) suggests this would help directly.&lt;/p&gt;
&lt;/div&gt;

&lt;div class=&quot;mas-paper-card&quot; style=&quot;border-left-color: var(--mas-pink);&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;5. Benchmark reliability&lt;/strong&gt;
  &lt;/div&gt;
  &lt;p&gt;ChatDev and MetaGPT report contradictory results on each other. Different benchmarks, different metrics, no reproducibility. Incident-level logging against real codebases might provide more trustworthy reliability measurement than self-reported aggregate benchmarks. This is infrastructure work. It&apos;s expensive. But the alternative is a field that can&apos;t actually tell you which system is better.&lt;/p&gt;
&lt;/div&gt;

&lt;div class=&quot;mas-paper-card&quot; style=&quot;border-left-color: var(--mas-pink);&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;6. Backpressure and escalation protocols&lt;/strong&gt;
  &lt;/div&gt;
  &lt;p&gt;MetaGPT&apos;s Architect can hallucinate an impossible interface; the Engineer just tries to implement it. ChatDev&apos;s dehallucination is the closest thing to backpressure, but it&apos;s prompt-level. Can agents formally reject or request revision of upstream artifacts? What&apos;s the protocol? Does it improve outcomes, or does it just add latency? This is the place where distributed systems vocabulary (flow control, rejection, retry) maps most directly into multi-agent AI, and it&apos;s been barely explored.&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&quot;the-distributed-systems-bridge&quot;&gt;The Distributed Systems Bridge&lt;/h2&gt;

&lt;p&gt;The research gap I find most interesting is the one I’ve been flagging throughout this series. The multi-agent AI field has rediscovered several problems that distributed systems solved twenty or thirty years ago.&lt;/p&gt;

&lt;div class=&quot;mas-compare-wrap&quot;&gt;
&lt;table class=&quot;mas-compare&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;Distributed systems problem&lt;/th&gt;&lt;th&gt;Multi-agent equivalent&lt;/th&gt;&lt;th&gt;Status in MAS literature&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;Lost updates&lt;/td&gt;&lt;td&gt;Two agents overwriting each other&apos;s work&lt;/td&gt;&lt;td&gt;Not addressed&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Causal consistency&lt;/td&gt;&lt;td&gt;Ordering agent actions across a pipeline&lt;/td&gt;&lt;td&gt;Not addressed&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Coordination avoidance (CALM)&lt;/td&gt;&lt;td&gt;When agents can work without synchronization&lt;/td&gt;&lt;td&gt;Not applied&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;CRDTs&lt;/td&gt;&lt;td&gt;Merging divergent agent views of shared state&lt;/td&gt;&lt;td&gt;Not applied&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Fault injection (Jepsen)&lt;/td&gt;&lt;td&gt;Injecting faults into agent pipelines&lt;/td&gt;&lt;td&gt;Early work (MAS-FIRE)&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Backpressure&lt;/td&gt;&lt;td&gt;Rejecting upstream inputs&lt;/td&gt;&lt;td&gt;Not formalized&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Escalation / circuit breaking&lt;/td&gt;&lt;td&gt;What happens when an agent fails&lt;/td&gt;&lt;td&gt;Not addressed&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;These aren’t one-to-one mappings. LLM agents have features that distributed systems nodes don’t (they hallucinate, their behavior is probabilistic, their errors are semantic rather than syntactic). But the underlying coordination problems are the same. The right move is to take what worked in distributed systems, adapt it to the semantic messiness of LLMs, and build from there.&lt;/p&gt;

&lt;div class=&quot;mas-callout&quot;&gt;
  &lt;div class=&quot;mas-callout-label&quot;&gt;Where I think the field is going&lt;/div&gt;
  Wave 1 asked whether agents could coordinate at all. The agentic coding turn showed that for a lot of tasks you don&apos;t need them to. Wave 2 is about why MAS breaks when you do need it. What comes next, I think, is the wave where multi-agent AI stops pretending it isn&apos;t a distributed systems problem and starts applying the full toolkit: CRDTs for shared state, causal ordering for handoffs, fault injection for reliability testing, coordination-avoidance theorems for knowing when to bother synchronizing at all. The groundwork is there. The application hasn&apos;t happened.
&lt;/div&gt;

&lt;h2 id=&quot;what-id-read-if-i-were-starting-over&quot;&gt;What I’d Read If I Were Starting Over&lt;/h2&gt;

&lt;p&gt;If you have limited time and want to get the core of the field fast, here’s the reading list I’d give my past self.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2501.06322&quot;&gt;Tran et al. survey (2025)&lt;/a&gt; for vocabulary.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2303.17760&quot;&gt;CAMEL&lt;/a&gt; for the simplest wave-1 mental model.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2308.00352&quot;&gt;MetaGPT&lt;/a&gt; for the ambitious wave-1 mental model.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2405.15793&quot;&gt;SWE-agent&lt;/a&gt; for the interface-design lesson from the agentic coding turn.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2411.04468&quot;&gt;Magentic-One&lt;/a&gt; for a real multi-agent system from the same period.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2503.13657&quot;&gt;MAST&lt;/a&gt; for what actually goes wrong in the wild.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.anthropic.com/engineering/multi-agent-research-system&quot;&gt;Anthropic’s research system post&lt;/a&gt; for production lessons.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/2508.12981&quot;&gt;Ou et al. on information sharing&lt;/a&gt; for what state sharing actually buys you.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/abs/1901.01930&quot;&gt;CALM theorem&lt;/a&gt; for the theoretical bridge to distributed systems.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Nine readings: eight papers and one engineering post. If you read them in that order, you have a working model of the field. You won’t have read everything, but you’ll have read enough to understand where new papers fit when you encounter them.&lt;/p&gt;

&lt;h2 id=&quot;closing&quot;&gt;Closing&lt;/h2&gt;

&lt;p&gt;When I started reading this literature, I thought I was looking at a niche subfield of LLM research. What I found was the multi-agent AI community quietly rediscovering distributed systems, usually without the vocabulary to name what they were rediscovering. Every paper has pieces of the answer. None of them have the full picture. And the full picture, I think, will come from someone who knows both fields well enough to actually bridge them.&lt;/p&gt;

&lt;p&gt;That’s the work I’m doing in &lt;a href=&quot;/ai/agents/reliability/zabriskie/2026/04/08/cursor-agents-caucus-v1.html&quot;&gt;Caucus&lt;/a&gt;. It’s also the work I think the field needs more of, and the reason I wrote this series in the first place. If I’ve saved you a few weeks of reading, that was the whole point.&lt;/p&gt;

&lt;p&gt;Thanks for reading.&lt;/p&gt;
</description>
				<pubDate>Fri, 01 May 2026 12:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/agents/mas-series/2026/05/01/mas-series-08-open-questions.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/agents/mas-series/2026/05/01/mas-series-08-open-questions.html</guid>
			</item>
		
			<item>
				<title>Getting Up to Speed on Multi-Agent Systems, Part 7: Benchmarks and What They Miss</title>
				<description>&lt;p&gt;If you’ve read this far, you’ve noticed that every paper I’ve discussed has a number next to it. 85.9 percent on HumanEval. 12.5 percent on SWE-bench. 25 percent on TravelPlanner. These numbers do a lot of work in the multi-agent literature, and they also do a surprising amount of harm. This post is about the benchmarks themselves. What they measure. What they don’t. And why ChatDev and MetaGPT can report contradictory results on each other without either one being obviously wrong.&lt;/p&gt;

&lt;div class=&quot;mas-series-nav&quot;&gt;
  &lt;div class=&quot;mas-series-label&quot;&gt;Getting Up to Speed on MAS&lt;/div&gt;
  &lt;ol&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/24/mas-series-01-the-landscape.html&quot;&gt;Part 1. The Landscape&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/25/mas-series-02-the-vocabulary.html&quot;&gt;Part 2. The Vocabulary&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/26/mas-series-03-wave-one.html&quot;&gt;Part 3. Wave 1: Can Agents Coordinate At All?&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/27/mas-series-04-wave-two.html&quot;&gt;Part 4. Wave 2: Why It Breaks&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/28/mas-series-05-debate-state-coordination.html&quot;&gt;Part 5. Debate, State, and Coordination&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/29/mas-series-06-verification-patterns.html&quot;&gt;Part 6. Verification Patterns&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;mas-current&quot;&gt;&lt;strong&gt;Part 7. Benchmarks and What They Miss (you are here)&lt;/strong&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/05/01/mas-series-08-open-questions.html&quot;&gt;Part 8. Open Questions&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;
&lt;/div&gt;

&lt;h2 id=&quot;the-landscape&quot;&gt;The Landscape&lt;/h2&gt;

&lt;p&gt;Here’s every benchmark that’s come up in the series so far, plus a few that haven’t.&lt;/p&gt;

&lt;div class=&quot;mas-compare-wrap&quot;&gt;
&lt;table class=&quot;mas-compare&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;Benchmark&lt;/th&gt;&lt;th&gt;Domain&lt;/th&gt;&lt;th&gt;What It Tests&lt;/th&gt;&lt;th&gt;Scale&lt;/th&gt;&lt;th&gt;Multi-Agent?&lt;/th&gt;&lt;th&gt;Notable Results&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;HumanEval&lt;/td&gt;
      &lt;td&gt;Code generation&lt;/td&gt;
      &lt;td&gt;Write a correct Python function from a docstring&lt;/td&gt;
      &lt;td&gt;164 tasks&lt;/td&gt;
      &lt;td&gt;No, single function&lt;/td&gt;
      &lt;td&gt;MetaGPT 85.9%, AutoDev 91.5%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;MBPP&lt;/td&gt;
      &lt;td&gt;Code generation&lt;/td&gt;
      &lt;td&gt;Entry-level Python from description&lt;/td&gt;
      &lt;td&gt;974 tasks&lt;/td&gt;
      &lt;td&gt;No, single function&lt;/td&gt;
      &lt;td&gt;MetaGPT 87.7%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;SWE-bench&lt;/td&gt;
      &lt;td&gt;Software engineering&lt;/td&gt;
      &lt;td&gt;Resolve real GitHub issues in real repos&lt;/td&gt;
      &lt;td&gt;2,294 (Verified: 500)&lt;/td&gt;
      &lt;td&gt;Designed for single agent&lt;/td&gt;
      &lt;td&gt;SWE-agent 12.5%, Devin 13.9%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GAIA&lt;/td&gt;
      &lt;td&gt;General assistant&lt;/td&gt;
      &lt;td&gt;Multi-step reasoning with tools, web, files&lt;/td&gt;
      &lt;td&gt;466 tasks&lt;/td&gt;
      &lt;td&gt;Yes, benefits from parallel tools&lt;/td&gt;
      &lt;td&gt;AutoGen #1, Magentic-One 38%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;WebArena&lt;/td&gt;
      &lt;td&gt;Web tasks&lt;/td&gt;
      &lt;td&gt;Real websites: shopping, forums, CMS&lt;/td&gt;
      &lt;td&gt;812 tasks&lt;/td&gt;
      &lt;td&gt;Designed for single agent&lt;/td&gt;
      &lt;td&gt;Magentic-One 32.8%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;AssistantBench&lt;/td&gt;
      &lt;td&gt;Assistant tasks&lt;/td&gt;
      &lt;td&gt;Open-ended web browsing plus reasoning&lt;/td&gt;
      &lt;td&gt;214 tasks&lt;/td&gt;
      &lt;td&gt;Designed for single agent&lt;/td&gt;
      &lt;td&gt;Magentic-One 13.3%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;BrowseComp&lt;/td&gt;
      &lt;td&gt;Web retrieval&lt;/td&gt;
      &lt;td&gt;Hard information retrieval via deep browsing&lt;/td&gt;
      &lt;td&gt;1,266 tasks&lt;/td&gt;
      &lt;td&gt;Benefits from parallel search&lt;/td&gt;
      &lt;td&gt;Anthropic +90% multi vs single (reported on its internal research eval)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr class=&quot;mas-row-highlight&quot;&gt;
      &lt;td&gt;TravelPlanner&lt;/td&gt;
      &lt;td&gt;Constrained planning&lt;/td&gt;
      &lt;td&gt;Multi-constraint travel planning&lt;/td&gt;
      &lt;td&gt;1,225 tasks&lt;/td&gt;
      &lt;td&gt;Explicitly tests coordination&lt;/td&gt;
      &lt;td&gt;Ou et al. 25% with notebook + orchestrator&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr class=&quot;mas-row-highlight&quot;&gt;
      &lt;td&gt;Silo-Bench&lt;/td&gt;
      &lt;td&gt;Distributed coordination&lt;/td&gt;
      &lt;td&gt;Algorithmic tasks requiring cross-agent synthesis&lt;/td&gt;
      &lt;td&gt;30 tasks, 54 configs&lt;/td&gt;
      &lt;td&gt;Designed for MAS evaluation&lt;/td&gt;
      &lt;td&gt;Agents fail at synthesis&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Two of these (highlighted) were designed with multi-agent evaluation in mind. The other seven were designed for single agents. Multi-agent systems get evaluated on them anyway.&lt;/p&gt;

&lt;h2 id=&quot;why-this-matters&quot;&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;When you run a multi-agent system on a single-agent benchmark, you’re measuring the wrong thing. HumanEval gives you a pass@1 score. It doesn’t tell you how many tokens you burned to get there. It doesn’t tell you how many agent turns were redundant. It doesn’t tell you what happened when one of your agents got stuck. If you care about coordination quality, none of that is in the score.&lt;/p&gt;

&lt;p&gt;This is why ChatDev and MetaGPT can report contradictory numbers on similar tasks. ChatDev’s paper reports 88 percent executability for ChatDev and 41 percent for MetaGPT. MetaGPT’s paper, using its own task set and its own scoring scale, ranks MetaGPT ahead of ChatDev on executability. Different benchmarks, different metrics, different evaluation criteria. Neither paper is obviously lying. Neither paper is obviously right. And the field has no standard way to resolve the contradiction.&lt;/p&gt;

&lt;div class=&quot;mas-callout&quot;&gt;
  &lt;div class=&quot;mas-callout-label&quot;&gt;What single-agent benchmarks can&apos;t measure&lt;/div&gt;
  Coordination quality. Communication overhead. Redundant work between agents. Recovery behavior when one agent fails. The token cost of the coordination itself. How performance degrades with scale. These are the things that distinguish multi-agent systems from single agents. And they&apos;re invisible to HumanEval, SWE-bench, and every other benchmark designed around &quot;does the output match the expected answer.&quot;
&lt;/div&gt;
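
&lt;p&gt;To make the gap concrete, here is a minimal sketch of what a run log would need to carry for those quantities to show up next to the pass rate. Everything in it is an assumption for illustration: &lt;code&gt;RunRecord&lt;/code&gt;, &lt;code&gt;summarize&lt;/code&gt;, and the field names are hypothetical, and deciding which turns count as redundant needs a labeling rule that no benchmark in the table supplies.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One benchmark task attempted by a multi-agent system (hypothetical schema)."""
    passed: bool          # did the final output pass the benchmark check
    tokens: int           # total tokens spent across all agents
    turns: int            # number of agent turns taken
    redundant_turns: int  # turns that repeated earlier work (requires a labeling rule)


def summarize(runs):
    """Report the usual pass rate alongside two coordination-cost figures."""
    solved = sum(r.passed for r in runs)
    total_tokens = sum(r.tokens for r in runs)
    total_turns = sum(r.turns for r in runs)
    redundant = sum(r.redundant_turns for r in runs)
    return {
        # The only number a single-agent benchmark reports.
        "pass_rate": solved / len(runs),
        # Cost of each success, counting tokens burned on failed tasks too.
        "tokens_per_solved": total_tokens / solved if solved else float("inf"),
        # Share of turns that did not move the task forward.
        "redundant_turn_share": redundant / total_turns if total_turns else 0.0,
    }


runs = [RunRecord(True, 1000, 4, 1), RunRecord(False, 3000, 6, 3)]
print(summarize(runs))
```

&lt;p&gt;Two systems with the same pass rate can differ by an order of magnitude on the other two lines, which is exactly the difference a pass@1 table hides.&lt;/p&gt;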

&lt;h2 id=&quot;when-multi-agent-actually-helps&quot;&gt;When Multi-Agent Actually Helps&lt;/h2&gt;

&lt;p&gt;If you look across all the benchmark results, a pattern emerges about when multi-agent systems earn their coordination overhead.&lt;/p&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;h4&gt;Where the Multi-Agent Premium Pays Off&lt;/h4&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Helps&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;Breadth-first search (Anthropic research eval: +90%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;Hard multi-step reasoning (GAIA Level 3: 2x)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;Constrained planning with state sharing (TravelPlanner: 3.3x)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;Independent parallel subtasks&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Doesn&apos;t help&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;Focused coding (SWE-bench: single agents win)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;Tasks needing shared context (most coding)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;Simple function generation (HumanEval: overhead not worth it)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;Distributed reasoning / synthesis (Silo-Bench)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;This is the benchmark-level version of the conclusion I’ve been building toward across the whole series. Multi-agent earns its overhead on specific task shapes: breadth-first, parallel-decomposable, state-sharing-friendly. On other task shapes, it costs more than it delivers. The benchmarks, taken together, are unambiguous about this. It’s just that most individual benchmarks can’t show it.&lt;/p&gt;

&lt;h2 id=&quot;the-benchmark-problem-stated-plainly&quot;&gt;The Benchmark Problem, Stated Plainly&lt;/h2&gt;

&lt;div class=&quot;mas-callout&quot;&gt;
  &lt;div class=&quot;mas-callout-label&quot;&gt;The benchmark gap&lt;/div&gt;
  Most widely used benchmarks (HumanEval, MBPP, SWE-bench, WebArena) were designed for single agents. Multi-agent systems get shoehorned into them, but the benchmarks can&apos;t measure coordination quality, communication overhead, or failure recovery, which are the things that distinguish MAS from single agents. TravelPlanner and Silo-Bench are rare exceptions that explicitly test multi-agent dynamics. ChatDev and MetaGPT reporting contradictory results on each other is a direct consequence of this gap.
&lt;/div&gt;

&lt;p&gt;Chen et al.’s survey names three evaluation-level challenges: no standardized benchmarks, no objective metrics, no common framework for individual vs aggregate evaluation. All three are symptoms of the same underlying issue. The field hasn’t agreed on what it’s measuring.&lt;/p&gt;

&lt;p&gt;There are a few ways this gets resolved. One is that better MAS-specific benchmarks emerge (TravelPlanner and Silo-Bench are early signs). Another is that production telemetry replaces synthetic benchmarks (Anthropic’s internal research eval). A third is that the field matures enough to distinguish “this benchmark tests single-agent capability” from “this benchmark tests multi-agent capability,” and stops reporting contradictory single-agent numbers as if they were MAS comparisons.&lt;/p&gt;

&lt;p&gt;None of these are fully here yet. If you’re reading a paper and the headline number is HumanEval pass@1, you’re probably looking at a single-agent capability test dressed up as a MAS evaluation. Calibrate accordingly.&lt;/p&gt;

&lt;h2 id=&quot;what-i-look-for&quot;&gt;What I Look For&lt;/h2&gt;

&lt;p&gt;When I read a paper with benchmark numbers now, here’s what I check:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Is this benchmark designed for single agents or multi-agent systems?&lt;/li&gt;
  &lt;li&gt;If it’s single-agent, are they comparing against single-agent baselines, or are they using it to claim MAS superiority?&lt;/li&gt;
  &lt;li&gt;What’s the token cost of the system? If that number isn’t reported, I assume it’s high.&lt;/li&gt;
  &lt;li&gt;Do they report failure rates or just success rates? MAST data tells us this matters.&lt;/li&gt;
  &lt;li&gt;Is this the first number they cite, or are they hiding it behind a friendlier number?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most wave-1 papers fail several of these checks. The agentic coding papers pass them, and wave-2 papers are starting to. This is partly why the post-2024 literature is more trustworthy than the 2023 literature.&lt;/p&gt;

&lt;p&gt;Next post, the last in the series: open questions. What’s missing. What’s next. What I’d read if I were doing this again. And the research gap that I keep tripping over: the absence of a rigorous distributed systems foundation underneath the multi-agent AI work.&lt;/p&gt;
</description>
				<pubDate>Thu, 30 Apr 2026 12:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/30/mas-series-07-benchmarks.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/30/mas-series-07-benchmarks.html</guid>
			</item>
		
			<item>
				<title>Getting Up to Speed on Multi-Agent Systems, Part 6: Verification Patterns</title>
				<description>&lt;p&gt;Every agent system has to answer the same question eventually: how does it know it did the right thing? Wave-1 papers mostly don’t. The agentic coding papers get serious about it. Wave-2 papers measure what happens when systems don’t. And the most interesting verification pattern in the field right now is one that isn’t in any paper at all. It’s in a commercial product.&lt;/p&gt;

&lt;div class=&quot;mas-series-nav&quot;&gt;
  &lt;div class=&quot;mas-series-label&quot;&gt;Getting Up to Speed on MAS&lt;/div&gt;
  &lt;ol&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/24/mas-series-01-the-landscape.html&quot;&gt;Part 1. The Landscape&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/25/mas-series-02-the-vocabulary.html&quot;&gt;Part 2. The Vocabulary&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/26/mas-series-03-wave-one.html&quot;&gt;Part 3. Wave 1: Can Agents Coordinate At All?&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/27/mas-series-04-wave-two.html&quot;&gt;Part 4. Wave 2: Why It Breaks&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/28/mas-series-05-debate-state-coordination.html&quot;&gt;Part 5. Debate, State, and Coordination&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;mas-current&quot;&gt;&lt;strong&gt;Part 6. Verification Patterns (you are here)&lt;/strong&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/30/mas-series-07-benchmarks.html&quot;&gt;Part 7. Benchmarks and What They Miss&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/05/01/mas-series-08-open-questions.html&quot;&gt;Part 8. Open Questions&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;
&lt;/div&gt;

&lt;h2 id=&quot;three-architectures&quot;&gt;Three Architectures&lt;/h2&gt;

&lt;p&gt;Every verification pattern in the field fits into one of three categories. The difference is who checks the work and how.&lt;/p&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;h4&gt;Three Verification Architectures&lt;/h4&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Self-Verify&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Same agent checks its own work&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Fast, no coordination overhead&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;Blind to its own mistakes&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Separate Verifier&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Different agent or system checks the work&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;Catches blind spots the author can&apos;t see&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Structural Gate&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Work cannot proceed without passing a gate&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;Strongest: not advisory, blocking&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Wave-1 papers are mostly in the first category. MetaGPT introduces the second with its QA agent. Wave-2 papers and production systems are moving toward the third.&lt;/p&gt;

&lt;h2 id=&quot;what-each-system-actually-does&quot;&gt;What Each System Actually Does&lt;/h2&gt;

&lt;div class=&quot;mas-compare-wrap&quot;&gt;
&lt;table class=&quot;mas-compare&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;System&lt;/th&gt;&lt;th&gt;Pattern&lt;/th&gt;&lt;th&gt;Feedback Signal&lt;/th&gt;&lt;th&gt;Verifier&lt;/th&gt;&lt;th&gt;Modality Shift&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;CAMEL&lt;/td&gt;&lt;td&gt;Dialogue consensus&lt;/td&gt;&lt;td&gt;Partner agrees&lt;/td&gt;&lt;td&gt;Peer (same capability)&lt;/td&gt;&lt;td&gt;No (text → text)&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;ChatDev&lt;/td&gt;&lt;td&gt;Code review plus compiler&lt;/td&gt;&lt;td&gt;Reviewer approval + compile&lt;/td&gt;&lt;td&gt;Reviewer agent + compiler&lt;/td&gt;&lt;td&gt;Partial&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;MetaGPT&lt;/td&gt;&lt;td&gt;Unit test execution&lt;/td&gt;&lt;td&gt;Tests pass or fail&lt;/td&gt;&lt;td&gt;Test runtime (external)&lt;/td&gt;&lt;td&gt;Yes (code → result)&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Gen. Agents&lt;/td&gt;&lt;td&gt;None at runtime&lt;/td&gt;&lt;td&gt;N/A&lt;/td&gt;&lt;td&gt;Post-hoc human eval&lt;/td&gt;&lt;td&gt;N/A&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;SWE-agent&lt;/td&gt;&lt;td&gt;ACI feedback + tests&lt;/td&gt;&lt;td&gt;Command output + test results&lt;/td&gt;&lt;td&gt;Environment&lt;/td&gt;&lt;td&gt;Yes&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Magentic-One&lt;/td&gt;&lt;td&gt;Orchestrator inner loop&lt;/td&gt;&lt;td&gt;Progress assessment&lt;/td&gt;&lt;td&gt;Orchestrator (separate)&lt;/td&gt;&lt;td&gt;Partial&lt;/td&gt;&lt;/tr&gt;
    &lt;tr class=&quot;mas-row-highlight&quot;&gt;&lt;td&gt;Cursor Agent&lt;/td&gt;&lt;td&gt;Visual feedback loop&lt;/td&gt;&lt;td&gt;Screenshot of rendered UI&lt;/td&gt;&lt;td&gt;Same agent (self-verify)&lt;/td&gt;&lt;td&gt;Yes (code → visual)&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;The last row is the interesting one.&lt;/p&gt;

&lt;h2 id=&quot;cursors-visual-feedback-loop&quot;&gt;Cursor’s Visual Feedback Loop&lt;/h2&gt;

&lt;p&gt;Cursor’s agent mode has a pattern that isn’t in any of the papers. When it’s implementing a UI change, it does this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Writes code.&lt;/li&gt;
  &lt;li&gt;Starts the app or preview.&lt;/li&gt;
  &lt;li&gt;Takes a screenshot.&lt;/li&gt;
  &lt;li&gt;Looks at the screenshot.&lt;/li&gt;
  &lt;li&gt;Decides whether the output matches the intent.&lt;/li&gt;
  &lt;li&gt;Fixes or ships.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is self-verification, which by the taxonomy above should be the weakest category. The agent is checking its own work. And MAST data tells us what that costs: reasoning-action mismatch, where the agent thinks it did the right thing but didn’t, accounts for 13.2 percent of observed failures.&lt;/p&gt;

&lt;p&gt;What saves Cursor’s approach is the modality shift. The agent wrote code (text). The verification happens on a screenshot (pixels). Re-reading your own code in the same modality you wrote it is a weak check. Looking at the rendered output of your code is a much stronger one. It’s much harder to make the same mistake twice when you’re looking at a different representation of the work.&lt;/p&gt;

&lt;div class=&quot;mas-callout&quot;&gt;
  &lt;div class=&quot;mas-callout-label&quot;&gt;The modality shift principle&lt;/div&gt;
  The stronger the modality shift between the work and the verification, the more bugs you catch. Code to test execution (MetaGPT) is a modality shift. Code to screenshot (Cursor) is a modality shift. Code to executable proof (structural gates) is a modality shift. Re-reading your own code is not. This is why wave-1 papers that rely on dialogue consensus score so poorly on reasoning-action-mismatch failures: text to text is not a real check.
&lt;/div&gt;

&lt;h2 id=&quot;the-design-space&quot;&gt;The Design Space&lt;/h2&gt;

&lt;p&gt;Once you’ve internalized the three architectures and the modality-shift principle, you can think about verification design more clearly.&lt;/p&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;h4&gt;Verification Choices Across the Field&lt;/h4&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Strongest&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;Structural gate with modality shift&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;Separate verifier with modality shift&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Useful&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Self-verify with modality shift&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Separate verifier without modality shift&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Weakest&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;Self-verify without modality shift (dialogue consensus)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;No verification at runtime&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Most wave-1 systems are in the bottom category. The agentic coding turn added some modality shift (tests, screenshots). Wave-2 systems and production systems are starting to combine patterns: Cursor uses self-verify with modality shift, and the hybrid opportunity is to layer a separate verifier or structural gate on top of that fast inner loop.&lt;/p&gt;

&lt;div class=&quot;mas-callout&quot;&gt;
  &lt;div class=&quot;mas-callout-label&quot;&gt;Hybrid opportunity&lt;/div&gt;
  Use visual feedback as the dev agent&apos;s inner loop for fast iteration, but gate the output with a separate QA agent or executable proof to catch self-verification blind spots. This is a pattern you can build today, and it&apos;s more robust than either approach alone.
&lt;/div&gt;
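
&lt;p&gt;The control flow of that hybrid is short enough to sketch. This is an illustration under assumptions, not how Cursor or any other product is implemented: &lt;code&gt;build_with_gate&lt;/code&gt; and the four callables it takes are hypothetical names, and the stubs below stand in for a coding agent, a renderer, a visual check, and an external gate such as a test run.&lt;/p&gt;

```python
def build_with_gate(write_code, render, looks_right, gate, max_attempts=3):
    """Fast self-verified inner loop, with a blocking external gate before shipping."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = write_code(feedback)
        # Modality shift: the agent inspects a rendered artifact, not its own text.
        artifact = render(code)
        if not looks_right(artifact):
            feedback = "rendered output did not match intent"
            continue
        # Structural gate: a separate check that can block, not just advise.
        ok, reason = gate(code)
        if ok:
            return {"status": "shipped", "code": code, "attempts": attempt}
        feedback = reason
    return {"status": "blocked", "code": None, "attempts": max_attempts}


# Hypothetical stubs: three drafts, each a little closer to done.
drafts = iter([
    "button hidden",
    "button visible, no tests",
    "button visible, tests pass",
])


def write_code(feedback):
    return next(drafts)


def render(code):
    return code.split(",")[0]  # pretend this is a screenshot


def looks_right(artifact):
    return artifact == "button visible"


def gate(code):
    return ("tests pass" in code, "gate: tests failing")


print(build_with_gate(write_code, render, looks_right, gate))
```

&lt;p&gt;The second draft is the case that matters: it passes the agent’s own visual check and still gets blocked, which is the blind spot the outer gate exists to catch.&lt;/p&gt;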

&lt;h2 id=&quot;why-verification-is-undertheorized&quot;&gt;Why Verification Is Undertheorized&lt;/h2&gt;

&lt;p&gt;The multi-agent papers spend a lot of time on coordination and almost no time on verification. This is backwards. The MAST data from wave 2 shows that verification failures (FC3: premature termination, incomplete verification, incorrect verification) account for 23.5 percent of all observed failures across seven frameworks. Grouped together, that’s a larger share than any individual failure mode in the taxonomy.&lt;/p&gt;

&lt;p&gt;If verification were a first-class concern, you’d expect to see papers titled “How Agents Should Check Their Work” or “Verification Protocols for Multi-Agent Systems.” Those papers don’t really exist. What we have is a lot of papers that casually mention their verification mechanism in a subsection and move on.&lt;/p&gt;

&lt;p&gt;The papers that take verification most seriously are the ones from the agentic coding turn, because they had to. You can’t fake SWE-bench. If your tests don’t pass, you don’t get credit. Interface design, guardrails, and structured feedback all exist in those systems because the benchmark forces them to. When the benchmark is a transcript of agents talking to each other, you don’t need real verification. When the benchmark is a working piece of software, you do.&lt;/p&gt;

&lt;p&gt;Next post: benchmarks. What the standard evaluations measure, what they miss, and why ChatDev and MetaGPT can report contradictory results on each other without either being obviously wrong.&lt;/p&gt;
</description>
				<pubDate>Wed, 29 Apr 2026 12:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/29/mas-series-06-verification-patterns.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/29/mas-series-06-verification-patterns.html</guid>
			</item>
		
			<item>
				<title>Getting Up to Speed on Multi-Agent Systems, Part 5: Debate, State, and Coordination</title>
				<description>&lt;p&gt;If wave 1 was about role-playing and the agentic coding turn was about interface quality, there’s a parallel thread running through the field asking a more fundamental question: what should multiple agents actually &lt;em&gt;do&lt;/em&gt; with each other? Debate? Share state? Coordinate? And are any of these interchangeable? This post is about four papers that sit at that intersection, including one that isn’t really an LLM paper at all but is the clearest theoretical bridge from distributed systems into multi-agent AI.&lt;/p&gt;

&lt;div class=&quot;mas-series-nav&quot;&gt;
  &lt;div class=&quot;mas-series-label&quot;&gt;Getting Up to Speed on MAS&lt;/div&gt;
  &lt;ol&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/24/mas-series-01-the-landscape.html&quot;&gt;Part 1. The Landscape&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/25/mas-series-02-the-vocabulary.html&quot;&gt;Part 2. The Vocabulary&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/26/mas-series-03-wave-one.html&quot;&gt;Part 3. Wave 1: Can Agents Coordinate At All?&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/27/mas-series-04-wave-two.html&quot;&gt;Part 4. Wave 2: Why It Breaks&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;mas-current&quot;&gt;&lt;strong&gt;Part 5. Debate, State, and Coordination (you are here)&lt;/strong&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/29/mas-series-06-verification-patterns.html&quot;&gt;Part 6. Verification Patterns&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/30/mas-series-07-benchmarks.html&quot;&gt;Part 7. Benchmarks and What They Miss&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/05/01/mas-series-08-open-questions.html&quot;&gt;Part 8. Open Questions&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;
&lt;/div&gt;

&lt;h2 id=&quot;du-et-al-convergent-debate&quot;&gt;Du et al.: Convergent Debate&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card mas-debate&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;Improving Factuality and Reasoning through Multiagent Debate&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2305.14325&quot;&gt;arXiv 2305.14325&lt;/a&gt; · ICML 2024&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;Multiple LLM instances debate until they converge on an answer.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Core bet: Showing agents each other&apos;s answers changes their reasoning&lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;Mechanism: N agents independently answer, then see each other&apos;s responses and revise over multiple rounds&lt;/li&gt;
    &lt;li&gt;Standard setup: 3 agents, 2 rounds (conservative; scales better with more)&lt;/li&gt;
    &lt;li&gt;Works on black-box models with identical prompts across all tasks&lt;/li&gt;
    &lt;li&gt;Key insight: debate is not voting; agents actually change their reasoning when shown alternatives&lt;/li&gt;
    &lt;li&gt;Performance improves monotonically with more agents and more rounds&lt;/li&gt;
    &lt;li&gt;&quot;Society of minds&quot; framing; collective intelligence from identical model instances&lt;/li&gt;
  &lt;/ul&gt;
  &lt;div class=&quot;mas-card-source&quot;&gt;Du, Li, Torralba, Tenenbaum, Mordatch&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Du et al. is the canonical multi-agent debate paper. The setup is simple. You run the same LLM multiple times on the same question. Each instance generates its own answer independently. Then you show each instance what the others said and ask it to revise. Do this for two or three rounds. The answers converge.&lt;/p&gt;

&lt;p&gt;What makes this different from self-consistency or ensembling is that the agents see each other’s reasoning, not just their answers. If instance A argued that the square root of 144 is 12 for reason X, and instance B argued it’s 12 for reason Y, instance A’s second attempt might incorporate reason Y. This is why “more rounds” helps. It’s not just more samples. It’s refinement.&lt;/p&gt;
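
&lt;p&gt;The loop itself is small enough to sketch. What follows is a minimal illustration, not the code from the paper: &lt;code&gt;run_debate&lt;/code&gt; and &lt;code&gt;make_stub_agent&lt;/code&gt; are hypothetical names, the stub stands in for a real LLM call, and reducing the final round to one answer by majority is a simplifying assumption.&lt;/p&gt;

```python
from collections import Counter


def run_debate(agents, question, rounds=2):
    """Each agent answers alone, then revises after reading the others' responses."""
    # Round 0: independent answers, no agent sees another agent's output.
    responses = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Every agent revises against a snapshot of the previous round.
        previous = list(responses)
        responses = [
            agent(question, previous[:i] + previous[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Assumption: collapse the final round to a single answer by majority.
    final_answers = [r["answer"] for r in responses]
    return Counter(final_answers).most_common(1)[0][0], responses


def make_stub_agent(first_answer):
    """Hypothetical stand-in for an LLM call; swap in a real model client."""
    state = {"answer": first_answer}

    def agent(question, peer_responses):
        if peer_responses:
            # Placeholder revision rule. A real agent would be prompted with
            # the peers' reasoning and asked to reconsider its own answer.
            votes = Counter([state["answer"]] + [p["answer"] for p in peer_responses])
            state["answer"] = votes.most_common(1)[0][0]
        return {
            "answer": state["answer"],
            "reasoning": "stub reasoning for " + state["answer"],
        }

    return agent


agents = [make_stub_agent("12"), make_stub_agent("12"), make_stub_agent("13")]
answer, transcript = run_debate(agents, "What is the square root of 144?", rounds=2)
print(answer)
```

&lt;p&gt;Note that the stub revises by counting answers, which is the voting shortcut the paper says debate is not. The part worth copying is the shape of the loop: independent first answers, then revision against everyone else’s full response, including the reasoning.&lt;/p&gt;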

&lt;h2 id=&quot;liang-et-al-adversarial-debate&quot;&gt;Liang et al.: Adversarial Debate&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card mas-debate&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;Encouraging Divergent Thinking through Multi-Agent Debate (MAD)&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2305.19118&quot;&gt;arXiv 2305.19118&lt;/a&gt; · EMNLP 2024&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;Two debaters plus a judge, explicitly prompted to disagree.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Core bet: Once an LLM is confident, only external pressure unsticks it&lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;Identifies the Degeneration-of-Thought (DoT) problem: a confident LLM can&apos;t self-reflect its way out of wrong answers&lt;/li&gt;
    &lt;li&gt;Two debaters plus a judge in a tit-for-tat format; the judge can end the debate early instead of running every round (adaptive break)&lt;/li&gt;
    &lt;li&gt;Debaters explicitly prompted to disagree: &quot;it&apos;s not necessary to fully agree&quot;&lt;/li&gt;
    &lt;li&gt;GPT-3.5 plus MAD beat GPT-4 baseline on commonsense translation&lt;/li&gt;
    &lt;li&gt;Counter-intuitive arithmetic: 37 percent (MAD) vs 26 percent (single GPT-3.5) vs 51 percent (GPT-4)&lt;/li&gt;
    &lt;li&gt;Failure mode: increasing debater count degrades performance (context length limits)&lt;/li&gt;
    &lt;li&gt;Judge shows bias toward outputs matching its own architecture&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;Liang et al. is the contrast to Du et al. Instead of agents that converge, you have agents that diverge. The debaters are prompted to disagree. The judge picks a winner or calls for another round. The paper’s theoretical contribution is the Degeneration-of-Thought problem: once an LLM commits to an answer with confidence, it can’t self-reflect its way back. You have to push it.&lt;/p&gt;

&lt;p&gt;The striking result is that GPT-3.5 with MAD beats GPT-4 alone on commonsense translation. You can get a stronger system by forcing a weaker model to argue with itself than by upgrading the model. This is a paper I’d hand to anyone who thinks “just use a better model” is always the right answer.&lt;/p&gt;

&lt;p&gt;The failure mode is worth noting. MAD doesn’t scale well beyond two debaters, because the context window fills up with arguments. And the judge develops a bias when different LLMs are used as debaters: it favors outputs that look like its own model family. Both of these are signs that the architecture is more fragile than the benchmark numbers suggest.&lt;/p&gt;

&lt;h2 id=&quot;ou-et-al-shared-state-as-coordination&quot;&gt;Ou et al.: Shared State as Coordination&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card&quot; style=&quot;border-left-color: var(--mas-green);&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;Analyzing Information Sharing and Coordination in Multi-Agent Planning&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2508.12981&quot;&gt;arXiv 2508.12981&lt;/a&gt; · August 2025&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;A shared notebook plus a reflective orchestrator on travel planning.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Core bet: Explicit information tracking beats unstructured conversation&lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;Task: TravelPlanner benchmark, long-horizon, multi-constraint planning&lt;/li&gt;
    &lt;li&gt;Shared notebook: reduces hallucination errors by 18 percent by forcing explicit information tracking&lt;/li&gt;
    &lt;li&gt;Reflective orchestrator: directs conversation focus, reduces errors by additional 13.5 percent in targeted areas&lt;/li&gt;
    &lt;li&gt;Combined: 25 percent pass rate (vs 7.5 percent single-agent baseline), 3.3x improvement&lt;/li&gt;
    &lt;li&gt;Notebook alone helps more than orchestrator alone; for this task, state sharing matters more than coordination&lt;/li&gt;
    &lt;li&gt;Takeaway: structured information sharing prevents agents from inventing unsupported details&lt;/li&gt;
  &lt;/ul&gt;
  &lt;div class=&quot;mas-card-source&quot;&gt;Ou, Vaduguru, Fried&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Ou et al. is the cleanest empirical study I’ve seen on what state sharing actually buys you. The task is constrained travel planning. The authors compare three setups: single agent, multi-agent with shared notebook, multi-agent with shared notebook and a reflective orchestrator. The notebook is the key mechanism. It’s an append-only log where agents record what they’ve learned. Agents read from it before they propose anything new.&lt;/p&gt;

&lt;p&gt;The headline finding is that the notebook reduces hallucination errors by 18 percent. The orchestrator adds another 13.5 percent. But read carefully: the notebook does more work than the orchestrator. Most of the benefit comes from forcing agents to write down what they know and read from a shared record. The coordination mechanism (the orchestrator) is secondary to the state sharing mechanism (the notebook).&lt;/p&gt;

&lt;p&gt;This is a clear result for how to build multi-agent systems for constrained planning. Give them a shared structured state. Make them write to it. Make them read from it. Then worry about coordination.&lt;/p&gt;
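
&lt;p&gt;Here is a rough sketch of what “write to it, read from it” could look like. The &lt;code&gt;Notebook&lt;/code&gt;, &lt;code&gt;Entry&lt;/code&gt;, and &lt;code&gt;propose&lt;/code&gt; names are hypothetical and this is not the paper’s implementation. It only assumes the two properties described above: writes append, and agents read the shared record before adding to it.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    agent: str
    fact: str


@dataclass
class Notebook:
    """Hypothetical append-only shared notebook; not the paper's implementation."""
    _entries: list = field(default_factory=list)

    def append(self, agent, fact):
        # Writes only ever add entries. There is no edit or delete.
        self._entries.append(Entry(agent, fact))

    def read(self):
        # Readers get an immutable snapshot, so they cannot mutate the log.
        return tuple(self._entries)


def propose(agent, notebook, draft_facts):
    """Read first, then record only facts not already in the shared record."""
    known = {e.fact for e in notebook.read()}
    for fact in draft_facts:
        if fact not in known:
            notebook.append(agent, fact)
            known.add(fact)
```

&lt;p&gt;Every entry keeps the name of the agent that wrote it, so a later reader can trace a detail back to its source instead of taking it on faith.&lt;/p&gt;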

&lt;h2 id=&quot;the-calm-theorem-when-coordination-is-avoidable&quot;&gt;The CALM Theorem: When Coordination Is Avoidable&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card&quot; style=&quot;border-left-color: var(--mas-purple);&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;Keeping CALM: When Distributed Consistency Is Easy&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/1901.01930&quot;&gt;arXiv 1901.01930&lt;/a&gt; · Hellerstein &amp;amp; Alvaro · CACM 2020&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;Which computations need coordination, and which don&apos;t.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Theoretical result: Monotonic programs are coordination-free; non-monotonic programs aren&apos;t&lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;CALM = Consistency As Logical Monotonicity&lt;/li&gt;
    &lt;li&gt;Theorem: programs with consistent, coordination-free distributed implementations are exactly the monotonic programs&lt;/li&gt;
    &lt;li&gt;If a computation only adds information (monotonic), it can run coordination-free and still get the right answer&lt;/li&gt;
    &lt;li&gt;If it retracts information (non-monotonic), coordination is required for consistency&lt;/li&gt;
    &lt;li&gt;Not yet applied to LLM agents in print, but the bridge is direct&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;This one isn’t a multi-agent LLM paper. It’s a distributed systems paper from 2019 that states a theorem about when coordination is and isn’t necessary. I include it here because it’s the most direct theoretical bridge between classical distributed systems work and multi-agent AI, and no one in the LLM literature has formally made the connection yet.&lt;/p&gt;

&lt;p&gt;Here’s the CALM claim, translated for multi-agent AI. If your multi-agent system is only ever adding information to shared state (writing to a notebook, appending to a log, producing artifacts), the agents can run without coordinating with each other and still converge to a consistent answer. If the agents ever need to retract or update existing information, then coordination is required to avoid inconsistency.&lt;/p&gt;
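
&lt;p&gt;A toy illustration of that claim, not the theorem’s formal statement. I’m assuming shared state is a set of facts and that “retract” means removing one from the set. The sketch replays the same updates in every possible delivery order and counts how many distinct final states come out.&lt;/p&gt;

```python
from itertools import permutations


def apply_monotonic(updates):
    # Only adds facts: the result is a set union, so order cannot matter.
    state = set()
    for op, fact in updates:
        if op == "add":
            state.add(fact)
    return state


def apply_with_retract(updates):
    # Adds and retracts: the result depends on delivery order.
    state = set()
    for op, fact in updates:
        if op == "add":
            state.add(fact)
        elif op == "retract":
            state.discard(fact)
    return state


def distinct_outcomes(apply, updates):
    # Replay every delivery order and collect the distinct final states.
    return {frozenset(apply(order)) for order in permutations(updates)}


adds = [("add", "hotel A"), ("add", "flight B"), ("add", "budget 500")]
mixed = [("add", "hotel A"), ("retract", "hotel A"), ("add", "flight B")]

print(len(distinct_outcomes(apply_monotonic, adds)))      # 1
print(len(distinct_outcomes(apply_with_retract, mixed)))  # 2
```

&lt;p&gt;The add-only updates land on one final state no matter how they are ordered. The mixed updates land on two, depending on whether the retraction arrives before or after the add it targets. Coordination exists to rule out that second case.&lt;/p&gt;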

&lt;div class=&quot;mas-callout&quot;&gt;
  &lt;div class=&quot;mas-callout-label&quot;&gt;Why this connects to Ou et al.&lt;/div&gt;
  Ou et al.&apos;s shared notebook is monotonic. Agents only append new information to it. Nothing gets retracted. That&apos;s why it works without heavy coordination machinery. The CALM theorem predicts this result. If the notebook had allowed agents to edit each other&apos;s entries, the authors would have needed coordination protocols (locks, version vectors, CRDTs) to keep it consistent. They didn&apos;t build any, because they didn&apos;t need them.
&lt;/div&gt;

&lt;h2 id=&quot;putting-it-together&quot;&gt;Putting It Together&lt;/h2&gt;

&lt;p&gt;The four papers in this post are doing different things, but they’re converging on the same observation. The question “how should multiple agents work together” has more than one answer, and the answer depends on the structure of the task.&lt;/p&gt;

&lt;div class=&quot;mas-compare-wrap&quot;&gt;
&lt;table class=&quot;mas-compare&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;Pattern&lt;/th&gt;&lt;th&gt;What Agents Do&lt;/th&gt;&lt;th&gt;When It Works&lt;/th&gt;&lt;th&gt;When It Doesn&apos;t&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Convergent debate (Du)&lt;/td&gt;
      &lt;td&gt;Show each other reasoning, converge&lt;/td&gt;
      &lt;td&gt;Reasoning tasks with a right answer&lt;/td&gt;
      &lt;td&gt;Context fills up fast; limited scale&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Adversarial debate (Liang)&lt;/td&gt;
      &lt;td&gt;Argue opposite sides, judge decides&lt;/td&gt;
      &lt;td&gt;Unsticking models caught in the DoT problem&lt;/td&gt;
      &lt;td&gt;Judge bias; doesn&apos;t scale beyond 2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Shared notebook (Ou)&lt;/td&gt;
      &lt;td&gt;Append information to a log everyone reads&lt;/td&gt;
      &lt;td&gt;Constrained planning, long-horizon tasks&lt;/td&gt;
      &lt;td&gt;Tasks requiring real-time coordination&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Coordination-free (CALM)&lt;/td&gt;
      &lt;td&gt;Monotonic writes, no coordination&lt;/td&gt;
      &lt;td&gt;Aggregation, counting, set-building&lt;/td&gt;
      &lt;td&gt;Anything that retracts or updates&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Wave-1 multi-agent papers didn’t distinguish between these. They all called themselves “multi-agent collaboration” and treated the coordination structure as interchangeable. It isn’t. The structure has to match the task.&lt;/p&gt;

&lt;div class=&quot;mas-callout&quot;&gt;
  &lt;div class=&quot;mas-callout-label&quot;&gt;The distributed systems bridge&lt;/div&gt;
  This is the point where my PhD work starts to feel directly relevant to the multi-agent AI literature. CALM, CRDTs, version vectors, causal consistency: these are all formalisms for when agents need to coordinate and when they don&apos;t. None of them have been rigorously applied to LLM-based multi-agent systems yet. That&apos;s an opportunity. It&apos;s also a caution. If the field doesn&apos;t pick up this vocabulary, it will keep reinventing solutions to problems that distributed systems solved decades ago.
&lt;/div&gt;

&lt;p&gt;Next post: verification patterns. How do these systems know when they’ve done the right thing? Test execution, dialogue consensus, structural gates, and Cursor’s visual feedback loop, which is the most interesting production-scale verification pattern I’ve seen and isn’t in any of the papers yet.&lt;/p&gt;
</description>
				<pubDate>Tue, 28 Apr 2026 12:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/28/mas-series-05-debate-state-coordination.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/28/mas-series-05-debate-state-coordination.html</guid>
			</item>
		
			<item>
				<title>Getting Up to Speed on Multi-Agent Systems, Part 4: Wave 2 (Why It Breaks)</title>
				<description>&lt;p&gt;By 2025, two things had happened. Wave-1 architectures were running in production (Anthropic had shipped its research system; the open-source ecosystem around orchestrator-worker patterns was maturing). The agentic coding turn had made clear that multi-agent was not the right tool for focused coding, and narrowed the interesting MAS question to “when we do use it, why does it break?”&lt;/p&gt;

&lt;p&gt;This wave is where I find the literature most useful, because it’s where empirical work finally catches up with the claims of wave 1.&lt;/p&gt;

&lt;div class=&quot;mas-series-nav&quot;&gt;
  &lt;div class=&quot;mas-series-label&quot;&gt;Getting Up to Speed on MAS&lt;/div&gt;
  &lt;ol&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/24/mas-series-01-the-landscape.html&quot;&gt;Part 1. The Landscape&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/25/mas-series-02-the-vocabulary.html&quot;&gt;Part 2. The Vocabulary&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/26/mas-series-03-wave-one.html&quot;&gt;Part 3. Wave 1: Can Agents Coordinate At All?&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;mas-current&quot;&gt;&lt;strong&gt;Part 4. Wave 2: Why It Breaks (you are here)&lt;/strong&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/28/mas-series-05-debate-state-coordination.html&quot;&gt;Part 5. Debate, State, and Coordination&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/29/mas-series-06-verification-patterns.html&quot;&gt;Part 6. Verification Patterns&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/30/mas-series-07-benchmarks.html&quot;&gt;Part 7. Benchmarks and What They Miss&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/05/01/mas-series-08-open-questions.html&quot;&gt;Part 8. Open Questions&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;
&lt;/div&gt;

&lt;h2 id=&quot;mast-14-failure-modes-from-1600-traces&quot;&gt;MAST: 14 Failure Modes from 1,600 Traces&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card mas-mast&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;Why Do Multi-Agent LLM Systems Fail? (MAST)&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2503.13657&quot;&gt;arXiv 2503.13657&lt;/a&gt; · NeurIPS 2025 D&amp;amp;B&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;1,600 annotated traces across 7 frameworks. First empirical taxonomy of why MAS break.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Core contribution: MAST taxonomy of 14 failure modes in 3 categories&lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;1,600+ traces across MetaGPT, ChatDev, HyperAgent, AppWorld, AG2, Magentic-One, OpenManus&lt;/li&gt;
    &lt;li&gt;41 to 87 percent failure rates across all frameworks; systemic, not isolated&lt;/li&gt;
    &lt;li&gt;Top 3 failures: step repetition (15.7 percent), reasoning-action mismatch (13.2 percent), unaware of termination (12.4 percent)&lt;/li&gt;
    &lt;li&gt;Systems with explicit verifiers (MetaGPT, ChatDev) had fewer failures&lt;/li&gt;
    &lt;li&gt;Inter-agent failures require &quot;theory of mind&quot;; agents can&apos;t model each other&apos;s information needs&lt;/li&gt;
    &lt;li&gt;Adding high-level objective verification gave +15.6 percent improvement&lt;/li&gt;
    &lt;li&gt;LLM-as-Judge pipeline: 94 percent accuracy against human experts&lt;/li&gt;
  &lt;/ul&gt;
  &lt;div class=&quot;mas-card-source&quot;&gt;
    Cemri, Pan, Yang, Agrawal, Chopra, Tiwari, Keutzer, Parameswaran, Klein, Ramchandran, Zaharia, Gonzalez, Stoica
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The MAST paper is the one I keep coming back to. It’s the first rigorous empirical study of multi-agent failures. The authors took more than 1,600 execution traces from seven popular multi-agent frameworks, annotated each one with human experts, and built a 14-mode failure taxonomy.&lt;/p&gt;

&lt;p&gt;The headline number is that every framework they tested had failure rates between 41 and 87 percent. Every single one. These are the production frameworks. These are the systems people cite in their papers. At best they fail almost as often as they succeed. At worst they fail most of the time.&lt;/p&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;h4&gt;MAST&apos;s 14 Failure Modes&lt;/h4&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;FC1 System Design&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Disobey task spec (11.8%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Disobey role spec (1.5%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Step repetition (15.7%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Loss of conversation history (2.8%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Unaware of termination (12.4%)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;FC2 Inter-Agent&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Conversation reset (2.2%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Fail to ask for clarification (6.8%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Task derailment (7.4%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Information withholding (0.85%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Ignored other agent&apos;s input (1.9%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Reasoning-action mismatch (13.2%)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;FC3 Verification&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Premature termination (6.2%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;No/incomplete verification (8.2%)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Incorrect verification (9.1%)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The taxonomy matters because it lets you diagnose specific failures. When someone says “my agent got stuck in a loop,” you can now ask whether that’s step repetition (FM-1.3) or conversation reset (FM-2.1). Those have different causes and different fixes.&lt;/p&gt;

&lt;p&gt;The finding that the paper downplays but I think is most important: the inter-agent failures are the hardest to fix. FC1 issues are prompt engineering problems. You can get meaningful improvements by rewriting role specifications. FC2 issues require what the paper calls “theory of mind,” meaning the agents don’t accurately model each other’s information needs. Prompt fixes don’t help there. The solutions are structural.&lt;/p&gt;

&lt;h2 id=&quot;mas-fire-fault-injection-for-multi-agent-systems&quot;&gt;MAS-FIRE: Fault Injection for Multi-Agent Systems&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card mas-mast&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;MAS-FIRE: Fault Injection and Reliability Evaluation for LLM-Based Multi-Agent Systems&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2602.19843&quot;&gt;arXiv 2602.19843&lt;/a&gt; · February 2026&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;The first systematic fault injection framework for LLM-based MAS.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Core contribution: Active probing via fault injection, not just passive observation&lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;15 fault types: 8 intra-agent (planning, memory, reasoning, action) + 7 inter-agent (config, instruction, communication)&lt;/li&gt;
    &lt;li&gt;Three injection mechanisms: prompt modification, response rewriting, message routing manipulation&lt;/li&gt;
    &lt;li&gt;Tested on MetaGPT, Table-Critic, CAMEL with GPT-5 and DeepSeek-V3&lt;/li&gt;
    &lt;li&gt;Key finding: config and instruction faults are catastrophic (Robustness Score = 0 percent for Blind Trust on MetaGPT)&lt;/li&gt;
    &lt;li&gt;Capability paradox: GPT-5&apos;s strict compliance hurts under Blind Trust (6.3 percent) vs DeepSeek-V3 (70.6 percent)&lt;/li&gt;
    &lt;li&gt;Linear pipelines extremely vulnerable; iterative architectures resilient (79-91 percent)&lt;/li&gt;
    &lt;li&gt;Shared message pools neutralize memory faults (+25 percent advantage)&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;MAST is observational. You watch real failures and categorize them. MAS-FIRE is the complement: you inject failures on purpose and measure how the system handles them. This is standard practice in distributed systems (Chaos Engineering, Jepsen) but it’s new for LLM agents.&lt;/p&gt;

&lt;p&gt;The taxonomy is worth reading carefully.&lt;/p&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;h4&gt;MAS-FIRE&apos;s 15 Fault Types&lt;/h4&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Intra-agent&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Inexecutable Plan&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Critical Info Loss&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Memory Loss&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Context Length Violation&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Hallucination&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Tool Selection Error&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Param Filling Error&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Param Format Error&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Inter-agent&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Role Ambiguity&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Blind Trust&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Instruction Logic Conflict&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Instruction Ambiguity&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Message Cycle&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Message Storm&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Broadcast Amplification&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Injection via&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Prompt Modification&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Response Rewriting&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Message Routing Manipulation&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The capability paradox is the result I find most provocative. GPT-5 is a stronger model than DeepSeek-V3 by most benchmarks. But under the “Blind Trust” fault (where one agent is told to unconditionally accept instructions from another), GPT-5 fails almost completely (6.3 percent robustness) while DeepSeek-V3 holds up (70.6 percent). Why? Because GPT-5 is better at following instructions. It’s also better at following bad instructions. Strict compliance is a liability when you can’t trust the source.&lt;/p&gt;

&lt;p&gt;The implication for system design is that you want agents that can question upstream inputs. Not agents that just obey.&lt;/p&gt;
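
&lt;p&gt;To make the difference concrete, here is a toy two-stage pipeline with a fault injected by rewriting the upstream response. Everything in it is hypothetical: the agents are plain functions, the fault just appends a bogus number, and the &lt;code&gt;robustness&lt;/code&gt; function is a simple fraction-correct measure I’m assuming for illustration, not MAS-FIRE’s metric. Real agents rarely have a check this direct to fall back on.&lt;/p&gt;

```python
def planner(task):
    # Upstream agent: returns the numbers the downstream agent should sum.
    return {"numbers": task["numbers"]}


def rewrite_response(message):
    # Injected fault (response rewriting): corrupt the upstream message in transit.
    corrupted = dict(message)
    corrupted["numbers"] = message["numbers"] + [999]
    return corrupted


def blind_executor(task, message):
    # Accepts whatever arrives from upstream.
    return sum(message["numbers"])


def checking_executor(task, message):
    # Questions the upstream input: falls back to the task if they disagree.
    if message["numbers"] != task["numbers"]:
        return sum(task["numbers"])
    return sum(message["numbers"])


def robustness(executor, tasks, fault):
    # Fraction of tasks still answered correctly with the fault injected.
    correct = 0
    for task in tasks:
        answer = executor(task, fault(planner(task)))
        if answer == sum(task["numbers"]):
            correct += 1
    return correct / len(tasks)


tasks = [{"numbers": [1, 2, 3]}, {"numbers": [10, 20]}, {"numbers": [7]}]
print(robustness(blind_executor, tasks, rewrite_response))     # 0.0
print(robustness(checking_executor, tasks, rewrite_response))  # 1.0
```

&lt;p&gt;The obedient executor fails every task under the fault. The one that checks its input against the original task survives all of them.&lt;/p&gt;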

&lt;h2 id=&quot;silo-bench-communication-isnt-reasoning&quot;&gt;Silo-Bench: Communication Isn’t Reasoning&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card mas-mast&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;Silo-Bench: Evaluating Distributed Coordination in Multi-Agent LLM Systems&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2603.01045&quot;&gt;arXiv 2603.01045&lt;/a&gt; · March 2026&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;1,620 experiments showing agents can communicate but can&apos;t reason about distributed state.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Core finding: The bottleneck is synthesis, not acquisition&lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;30 algorithmic tasks across 3 communication complexity tiers, 54 configs, 1,620 experiments&lt;/li&gt;
    &lt;li&gt;Central finding: agents form correct coordination topologies and actively exchange information&lt;/li&gt;
    &lt;li&gt;But they systematically fail to synthesize distributed state into correct answers&lt;/li&gt;
    &lt;li&gt;Bottleneck is information integration, not information acquisition&lt;/li&gt;
    &lt;li&gt;Coordination overhead increases with agent scale, eventually eliminating parallelization benefits&lt;/li&gt;
    &lt;li&gt;Merely increasing agent count cannot circumvent context limitations&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;Silo-Bench is the paper I wish more people would read. The finding is simple and surprising. When you give multiple LLM agents a problem that requires distributed reasoning, they do all the communication correctly. They build the right topology. They exchange the right information. Then they fail to synthesize what they’ve gathered into a correct answer.&lt;/p&gt;

&lt;p&gt;The bottleneck isn’t the network. The bottleneck is the integration. Each agent has received the necessary pieces, and each agent individually fails to combine those pieces into the answer. This is not a coordination problem in the distributed systems sense. It’s a reasoning problem that the coordination can’t compensate for.&lt;/p&gt;

&lt;p&gt;For wave-1 architectures, this result is devastating. The whole argument for agents-debating-each-other was that two agents looking at the same problem from different angles could synthesize a better answer than one agent alone. Silo-Bench says: maybe sometimes, but not for information-integration tasks, which describes most tasks.&lt;/p&gt;

&lt;h2 id=&quot;what-wave-3-adds-that-wave-1-and-2-missed&quot;&gt;What Wave 2 Adds That Wave 1 Missed&lt;/h2&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;h4&gt;Wave 2&apos;s New Contributions&lt;/h4&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Observation&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;Real failure data across multiple frameworks (MAST)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Injection&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;Active fault testing (MAS-FIRE)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Limits&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;What coordination can and can&apos;t fix (Silo-Bench)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Production lessons&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-ok&quot;&gt;Anthropic blog: 15x token cost, shared-context failures&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;div class=&quot;mas-callout&quot;&gt;
  &lt;div class=&quot;mas-callout-label&quot;&gt;The wave-2 framing&lt;/div&gt;
  MAST observes what breaks. MAS-FIRE tests what breaks by injecting it. Silo-Bench identifies the limits of what coordination can fix. Together they provide a reliability research stack that wave 1 didn&apos;t have. What&apos;s still missing: gates, recovery protocols, longitudinal failure datasets. The field is still working on these.
&lt;/div&gt;

&lt;p&gt;The critical thing wave 2 does is separate “this is a coordination problem” from “this is a reasoning problem.” Wave 1 assumed everything was a coordination problem. Wave 2 says: if your agents can’t synthesize distributed state individually, no amount of better message passing will save you. Fix the reasoning first. Coordinate second.&lt;/p&gt;

&lt;p&gt;Next post: debate, state, and the CALM theorem. Three papers on whether agents should agree, disagree, or just share a notebook. And a theoretical result from distributed systems that explains which choice makes sense when.&lt;/p&gt;
</description>
				<pubDate>Mon, 27 Apr 2026 12:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/27/mas-series-04-wave-two.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/27/mas-series-04-wave-two.html</guid>
			</item>
		
			<item>
				<title>Spring Tour Recap: A Month of Shipping on Zabriskie</title>
				<description>&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;“this app is the bees. I am very grateful for it.”&lt;/em&gt;
(a chomper, end of show, Irving, 4/25)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Goose Spring ’26 just wrapped. The run started March 28 in Athens and closed out Saturday night in Irving. Fourteen shows. One festival lead-in. Fourteen live chats running in parallel with each show, with 40 people sending &lt;strong&gt;3,737 messages&lt;/strong&gt; to each other across the tour. Most of them sent from couches.&lt;/p&gt;

&lt;p&gt;For the second half of tour I had a weird but perfect setup. Goose was on Eastern time. I was couch-touring most of those shows from a hotel room in Vegas. Then Phish was at Sphere (six nights across two runs, 4/16-18 and 4/23-25), and I was at all of them. Pacific time. The two shows overlapped by about an hour every night: the back end of the Goose set ran into the front end of Phish at Sphere. Which meant at any given moment in that hour I had two live chats open and two iOS Live Activities going on the lock screen. One for the show I was actually at, one for the show I was following along with from inside a different venue. The Dynamic Island had to share.&lt;/p&gt;

&lt;p&gt;That overlap hour is where the whole pitch of this app clicked for me, so it’s worth pulling out as its own point. Sports fans have lived inside multi-game nights forever. Your team’s game on the TV, the other playoff game on a tablet, score alerts buzzing on your phone, the group chat scrolling beside it, fantasy stats updating in another tab. Nobody thinks twice about it. That kind of parallel, shared, real-time consumption is the default for sports.&lt;/p&gt;

&lt;p&gt;Music has never had any of that. A concert has always been one show, in one room, ending when the lights come up, with whatever conversation you happened to have with the person next to you. If two of your favorite bands are playing the same night in different cities, that has historically just been a thing you mourn. There’s no second-screen experience for live music. There’s no group chat thread for the show you’re not at. There’s no “score alert” telling you the band you can’t see just opened with something rare. That whole layer doesn’t exist.&lt;/p&gt;

&lt;p&gt;We’ve gotten pushback on this specifically. People have told us, in the chat and in person, that they don’t think we should be encouraging anyone to open Zabriskie at a show. The argument is the same one phones have always heard at live events: be present, put it away, watch the band. I take it seriously. I also remember being on the other side of the same argument seventeen years ago.&lt;/p&gt;

&lt;p&gt;Back in 2008 when the iPhone first came out, I worked at a baseball startup in Boston. We built live play-prediction inside the app: in your seat, during the game, what’s the next pitch, did the runner just steal. The reaction we got from people who had never tried it was word-for-word identical to what we get now about Zabriskie. “No one is going to be on their phone at a baseball game.” Today every Major League ballpark has a stadium app open across the section, every wrist has the score on it, and a fan in their car or their living room is part of the same conversation as the fan in section 304. The game didn’t get worse. The community got bigger. People who couldn’t physically be there became part of being there.&lt;/p&gt;

&lt;p&gt;Live music gets there too. The phone isn’t the enemy of the show. The phone, used well, is what lets the show have a community around it that outlasts the show.&lt;/p&gt;

&lt;p&gt;That is a lot of what we’re building. Anything that broke at the Goose show, Patrick and I would fix between Goose and Phish, and I’d run it live at Phish two hours later. Every feature got tested twice a night, against two different bands, in two different time zones, by a person who was actively living the multi-show pattern the app is supposed to enable.&lt;/p&gt;

&lt;p&gt;I started writing this because I wanted to remember what we built during the tour. I looked at the PR list and counted. &lt;strong&gt;Three hundred and nine pull requests&lt;/strong&gt; merged into &lt;a href=&quot;https://github.com/cmeiklejohn/zabriskie&quot;&gt;Zabriskie&lt;/a&gt; between the first show and the last. About fifty of those merged today, with the tour wrap-up package shipping in real time as I’m writing this. That number doesn’t feel real. It is real. Most of it shipped to the web immediately so we could test it ourselves the moment it merged, then went out to our TestFlight and Play Store testers within hours, and will be live in the App Store and Play Store this week for everyone. Some of it shipped during the show.&lt;/p&gt;

&lt;p&gt;This is what stuck.&lt;/p&gt;

&lt;h2 id=&quot;the-live-show-got-real&quot;&gt;The Live Show Got Real&lt;/h2&gt;

&lt;p&gt;The biggest change is that “couch touring with the app” is now a thing people actually do, not a thing I keep telling people they should try. Here is what the chat looked like in Houston on 4/23, around 11pm, from Patrick, my collaborator on the project:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Whew just got home. Was updating the setlist while walking home from the bar with pip and Zabriskie open lmao&lt;/p&gt;

  &lt;p&gt;Ohhhhh snapp how bout that new feature I built today fam?!&lt;/p&gt;

  &lt;p&gt;Got a couple folks in the chomp who witnessed the FTP!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He’s talking about the FTP-witness pill, which he had shipped earlier that day. FTP is “first time played.” Every song in a setlist has a debut show. When a song plays during a live chat, the app now scans everyone watching, and if any of them were RSVP’d to that song’s original debut show, an inline 👀 pill appears under the song name calling them out. It looks like this:&lt;/p&gt;

&lt;div style=&quot;background:#e5e2d9; padding:18px; border-radius:14px; margin:16px auto; max-width:520px; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif; color:#262626;&quot;&gt;
  &lt;div style=&quot;background:#F5F2EB; border-radius:18px; box-shadow:0 6px 20px rgba(0,0,0,0.08); overflow:hidden; max-width:380px; margin:0 auto;&quot;&gt;
    &lt;div style=&quot;background:#262626; color:#fff; padding:10px 16px; font-size:11px; letter-spacing:0.04em; text-transform:uppercase; display:flex; justify-content:space-between;&quot;&gt;
      &lt;span&gt;Live Chat · MSG &apos;26&lt;/span&gt;
      &lt;span style=&quot;opacity:0.65; font-weight:400; text-transform:none; letter-spacing:0;&quot;&gt;1 witness&lt;/span&gt;
    &lt;/div&gt;
    &lt;div style=&quot;padding:8px 0 14px;&quot;&gt;
      &lt;div style=&quot;padding:6px 14px;&quot;&gt;&lt;span style=&quot;font-size:13px; font-weight:600; color:#8B5CF6; background:#EDE9FE; padding:4px 12px; border-radius:12px;&quot;&gt;🎸 Set 2 begins&lt;/span&gt;&lt;/div&gt;
      &lt;div style=&quot;padding:6px 14px;&quot;&gt;&lt;span style=&quot;font-size:13px; font-weight:600; color:#8B5CF6; background:#EDE9FE; padding:4px 12px; border-radius:12px;&quot;&gt;🎵 All I Need&lt;/span&gt;&lt;/div&gt;
      &lt;div style=&quot;padding:2px 14px 6px;&quot;&gt;&lt;span style=&quot;font-size:13px; font-weight:700; color:#065F46; background:#D1FAE5; padding:4px 12px; border-radius:12px;&quot;&gt;👀 1 person in chomp was at the FTP&lt;/span&gt;&lt;/div&gt;
      &lt;div style=&quot;padding:6px 14px; display:flex; gap:10px; align-items:flex-start;&quot;&gt;
        &lt;div style=&quot;width:34px; height:34px; border-radius:50%; background:linear-gradient(135deg,#d9c6ff 0%,#a88fe6 100%); display:flex; align-items:center; justify-content:center; font-size:14px; font-weight:700; color:#fff; flex-shrink:0;&quot;&gt;P&lt;/div&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; gap:2px;&quot;&gt;
          &lt;div style=&quot;display:flex; gap:6px; align-items:center;&quot;&gt;&lt;span style=&quot;font-weight:600; font-size:14px;&quot;&gt;patrick&lt;/span&gt;&lt;span style=&quot;font-size:10px; padding:2px 6px; border-radius:8px; color:#fff; font-weight:700; background:#EC4899;&quot;&gt;🎸 Show&lt;/span&gt;&lt;/div&gt;
          &lt;div style=&quot;font-size:14px; line-height:1.3;&quot;&gt;omg I forgot I was there for the og&lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Tap the pill and it pops open the original debut show’s setlist sheet, with a Bookmark button so you can save the show for later. If you yourself were at that debut, the per-song stat sheet gets an extra “You were at this song’s FTP” strip with the date and venue. So Patrick wrote and shipped the FTP-witness feature that day, then came home from a different show and used it himself in the chat for a third show. That is the loop now.&lt;/p&gt;
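
&lt;p&gt;For anyone curious how the witness check described above could work, here is a minimal sketch. It is not Zabriskie’s actual code: the names, the show ids, and the shape of the RSVP data are all assumptions made for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Song:
    title: str
    debut_show_id: str  # hypothetical id of the show where the song was first played


def ftp_witnesses(song, viewers, rsvps):
    """Return the viewers in the chat who were RSVP'd to the song's debut show.

    `rsvps` maps a user name to the set of show ids that user RSVP'd to.
    This data shape is an assumption, not the app's real schema.
    """
    return [v for v in viewers if song.debut_show_id in rsvps.get(v, set())]


def pill_text(count):
    """Text for the inline pill, or None when nobody watching was at the debut."""
    if count == 0:
        return None
    noun = "person" if count == 1 else "people"
    verb = "was" if count == 1 else "were"
    return f"👀 {count} {noun} in chomp {verb} at the FTP"
```

&lt;p&gt;The check runs once per song as it lands in the chat, so the cost is one set lookup per viewer.&lt;/p&gt;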

&lt;p&gt;The Live Activity on iOS got a dedicated set break UI, so when the band walks off, your Lock Screen tells you instead of just freezing on the last song. It buzzes when the setlist updates, so you don’t have to keep waking your phone to check. Android got a redesign that matches the iOS Live Activity layout, with rich notifications that persist for the entire show instead of falling off the lock screen after a few minutes.&lt;/p&gt;

&lt;div style=&quot;display:flex; gap:14px; flex-wrap:wrap; justify-content:center; margin:18px auto;&quot;&gt;
  &lt;div style=&quot;background:#0a0a0a; border-radius:36px; padding:14px; box-shadow:0 8px 24px rgba(0,0,0,0.18); flex:0 0 auto;&quot;&gt;
    &lt;div style=&quot;background:linear-gradient(180deg, #2a1d3a 0%, #1a0f2a 100%); border-radius:24px; width:280px; padding:14px 16px; color:#fff; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif;&quot;&gt;
      &lt;div style=&quot;display:flex; justify-content:space-between; align-items:center; font-size:11px; opacity:0.7; margin-bottom:8px;&quot;&gt;&lt;span&gt;9:41&lt;/span&gt;&lt;span&gt;📶 5G ⌁ 78%&lt;/span&gt;&lt;/div&gt;
      &lt;div style=&quot;background:rgba(139,92,246,0.18); border:1px solid rgba(139,92,246,0.35); border-radius:18px; padding:12px 14px; backdrop-filter:blur(12px);&quot;&gt;
        &lt;div style=&quot;display:flex; gap:10px; align-items:center; margin-bottom:6px;&quot;&gt;
          &lt;div style=&quot;width:32px; height:32px; border-radius:8px; background:linear-gradient(135deg,#a855f7,#7c3aed); display:flex; align-items:center; justify-content:center; font-size:14px;&quot;&gt;🪿&lt;/div&gt;
          &lt;div style=&quot;display:flex; flex-direction:column; flex:1;&quot;&gt;
            &lt;span style=&quot;font-size:11px; font-weight:600; opacity:0.7;&quot;&gt;GOOSE · LIVE&lt;/span&gt;
            &lt;span style=&quot;font-size:10px; opacity:0.55;&quot;&gt;Saenger Theatre, NOLA&lt;/span&gt;
          &lt;/div&gt;
          &lt;span style=&quot;font-size:11px; opacity:0.6;&quot;&gt;●&lt;/span&gt;
        &lt;/div&gt;
        &lt;div style=&quot;font-size:18px; font-weight:700; line-height:1.15;&quot;&gt;🎵 Tumble&lt;/div&gt;
        &lt;div style=&quot;font-size:11px; opacity:0.65; margin-top:2px;&quot;&gt;Set 2 · song 4&lt;/div&gt;
      &lt;/div&gt;
      &lt;div style=&quot;text-align:center; font-size:9px; opacity:0.4; margin-top:6px;&quot;&gt;BEFORE · last song frozen on screen&lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;

  &lt;div style=&quot;background:#0a0a0a; border-radius:36px; padding:14px; box-shadow:0 8px 24px rgba(0,0,0,0.18); flex:0 0 auto;&quot;&gt;
    &lt;div style=&quot;background:linear-gradient(180deg, #2a1d3a 0%, #1a0f2a 100%); border-radius:24px; width:280px; padding:14px 16px; color:#fff; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif;&quot;&gt;
      &lt;div style=&quot;display:flex; justify-content:space-between; align-items:center; font-size:11px; opacity:0.7; margin-bottom:8px;&quot;&gt;&lt;span&gt;10:23&lt;/span&gt;&lt;span&gt;📶 5G ⌁ 71%&lt;/span&gt;&lt;/div&gt;
      &lt;div style=&quot;background:rgba(251,146,60,0.18); border:1px solid rgba(251,146,60,0.45); border-radius:18px; padding:12px 14px; backdrop-filter:blur(12px);&quot;&gt;
        &lt;div style=&quot;display:flex; gap:10px; align-items:center; margin-bottom:6px;&quot;&gt;
          &lt;div style=&quot;width:32px; height:32px; border-radius:8px; background:linear-gradient(135deg,#fb923c,#ea580c); display:flex; align-items:center; justify-content:center; font-size:14px;&quot;&gt;🪿&lt;/div&gt;
          &lt;div style=&quot;display:flex; flex-direction:column; flex:1;&quot;&gt;
            &lt;span style=&quot;font-size:11px; font-weight:600; color:#fb923c;&quot;&gt;GOOSE · SET BREAK&lt;/span&gt;
            &lt;span style=&quot;font-size:10px; opacity:0.55;&quot;&gt;Saenger Theatre, NOLA&lt;/span&gt;
          &lt;/div&gt;
          &lt;span style=&quot;font-size:18px; color:#fb923c;&quot;&gt;⏸&lt;/span&gt;
        &lt;/div&gt;
        &lt;div style=&quot;font-size:18px; font-weight:700; line-height:1.15; color:#fb923c;&quot;&gt;⏸️ Set break&lt;/div&gt;
        &lt;div style=&quot;font-size:11px; opacity:0.7; margin-top:2px;&quot;&gt;After Set 2 · 9 songs&lt;/div&gt;
      &lt;/div&gt;
      &lt;div style=&quot;text-align:center; font-size:9px; color:#fb923c; opacity:0.7; margin-top:6px;&quot;&gt;AFTER · dedicated set break UI&lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Couch tour viewers get a 30-second spoiler hide on fresh song bubbles, because the people in the building are ahead of the stream.&lt;/p&gt;

&lt;p&gt;The bubble doesn’t blur. It renders as three pulsing dots, the same way an iMessage typing indicator does. There’s a reason it’s that specific shape and not a frosted blur or a “?” placeholder.&lt;/p&gt;

&lt;p&gt;The moment a song starts in the room, people start talking about it in the chat. You need a marker in the timeline so a couch viewer can see “okay, the in-venue chompers are reacting to whatever this is right now,” follow the conversation in context, and not get hit with the song name as a spoiler before they’ve heard a note of it. The dots are that marker. When the 30 seconds is up, they flip to the song name and the chat above lines up with what the couch viewer is now hearing.&lt;/p&gt;

&lt;p&gt;We landed on 30 seconds by testing it live during real shows. A 4K livestream takes roughly that long to encode and reach a couch viewer, so 30 seconds is close to the actual gap between the room and the screen. We extended the hide to admin-typed setlist entries too, because the setlist is now mostly admin-typed.&lt;/p&gt;
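
&lt;p&gt;The reveal rule is simple enough to sketch. The 30-second window comes from the paragraphs above; the function names, the placeholder string, and the idea that in-venue viewers skip the hide entirely are assumptions for illustration, not the app’s real code.&lt;/p&gt;

```python
SPOILER_HIDE_SECONDS = 30  # from the post; everything else here is illustrative

PLACEHOLDER = "• • •"  # stands in for the three pulsing dots


def render_song_bubble(title, posted_at, now, viewer_is_in_venue):
    """Return what one viewer should see for a song bubble at time `now`.

    Times are plain seconds. Assumes in-venue viewers see the title at once
    and couch viewers see the placeholder until the window has passed.
    """
    if viewer_is_in_venue:
        return title
    if now - posted_at >= SPOILER_HIDE_SECONDS:
        return title
    return PLACEHOLDER


def seconds_until_reveal(posted_at, now):
    """Seconds left before a couch viewer sees the song name (never negative)."""
    return max(0, SPOILER_HIDE_SECONDS - (now - posted_at))
```

&lt;p&gt;Keeping the decision per viewer, rather than delaying the setlist entry itself, is what lets the in-venue chat stay in real time while the couch view lags.&lt;/p&gt;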

&lt;div style=&quot;background:#e5e2d9; padding:18px; border-radius:14px; margin:16px auto; max-width:520px; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif; color:#262626;&quot;&gt;
  &lt;div style=&quot;background:#F5F2EB; border-radius:18px; box-shadow:0 6px 20px rgba(0,0,0,0.08); overflow:hidden; max-width:380px; margin:0 auto;&quot;&gt;
    &lt;div style=&quot;background:#262626; color:#fff; padding:10px 16px; font-size:11px; letter-spacing:0.04em; text-transform:uppercase; display:flex; justify-content:space-between; align-items:center;&quot;&gt;
      &lt;span&gt;🪿 Goose · Live&lt;/span&gt;
      &lt;span style=&quot;opacity:0.65; font-weight:400; text-transform:none; letter-spacing:0;&quot;&gt;couch view · stream behind&lt;/span&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:8px 0 14px;&quot;&gt;
      &lt;div style=&quot;padding:6px 14px;&quot;&gt;&lt;span style=&quot;font-size:13px; font-weight:600; color:#8B5CF6; background:#EDE9FE; padding:4px 12px; border-radius:12px;&quot;&gt;🎵 Atlas Dogs&lt;/span&gt;&lt;/div&gt;

      &lt;div style=&quot;padding:6px 14px; display:flex; gap:10px; align-items:flex-start;&quot;&gt;
        &lt;div style=&quot;width:34px; height:34px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); display:flex; align-items:center; justify-content:center; font-size:14px; font-weight:700; color:#fff; flex-shrink:0;&quot;&gt;P&lt;/div&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; gap:2px; flex:1;&quot;&gt;
          &lt;div style=&quot;display:flex; gap:6px; align-items:center;&quot;&gt;&lt;span style=&quot;font-weight:600; font-size:14px;&quot;&gt;patrick&lt;/span&gt;&lt;span style=&quot;font-size:10px; padding:2px 6px; border-radius:8px; color:#fff; font-weight:700; background:#EC4899;&quot;&gt;🎸 Show&lt;/span&gt;&lt;/div&gt;
          &lt;div style=&quot;font-size:14px; line-height:1.3;&quot;&gt;opener slaps already&lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;

      &lt;div style=&quot;padding:6px 14px;&quot;&gt;&lt;span style=&quot;font-size:13px; font-weight:600; color:#8B5CF6; background:#EDE9FE; padding:4px 12px; border-radius:12px;&quot;&gt;🎵 Tumble&lt;/span&gt;&lt;/div&gt;

      &lt;div style=&quot;padding:6px 14px; display:flex; gap:10px; align-items:flex-start;&quot;&gt;
        &lt;div style=&quot;width:34px; height:34px; border-radius:50%; background:linear-gradient(135deg,#d9c6ff,#a88fe6); display:flex; align-items:center; justify-content:center; font-size:14px; font-weight:700; color:#fff; flex-shrink:0;&quot;&gt;M&lt;/div&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; gap:2px; flex:1;&quot;&gt;
          &lt;div style=&quot;display:flex; gap:6px; align-items:center;&quot;&gt;&lt;span style=&quot;font-weight:600; font-size:14px;&quot;&gt;chomper2&lt;/span&gt;&lt;span style=&quot;font-size:10px; padding:2px 6px; border-radius:8px; color:#fff; font-weight:700; background:#EC4899;&quot;&gt;🎸 Show&lt;/span&gt;&lt;/div&gt;
          &lt;div style=&quot;font-size:14px; line-height:1.3;&quot;&gt;TUMBLE!!! called it 📣&lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;

      &lt;div style=&quot;padding:6px 14px; display:flex; gap:10px; align-items:flex-start;&quot;&gt;
        &lt;div style=&quot;width:34px; height:34px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); display:flex; align-items:center; justify-content:center; font-size:14px; font-weight:700; color:#fff; flex-shrink:0;&quot;&gt;P&lt;/div&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; gap:2px; flex:1;&quot;&gt;
          &lt;div style=&quot;display:flex; gap:6px; align-items:center;&quot;&gt;&lt;span style=&quot;font-weight:600; font-size:14px;&quot;&gt;patrick&lt;/span&gt;&lt;span style=&quot;font-size:10px; padding:2px 6px; border-radius:8px; color:#fff; font-weight:700; background:#EC4899;&quot;&gt;🎸 Show&lt;/span&gt;&lt;/div&gt;
          &lt;div style=&quot;font-size:14px; line-height:1.3;&quot;&gt;31 show gap, hot 🔥&lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;

      &lt;div style=&quot;padding:6px 14px; display:flex; align-items:center; gap:8px;&quot;&gt;
        &lt;span style=&quot;font-size:13px; font-weight:600; color:#8B5CF6; background:#EDE9FE; padding:6px 14px; border-radius:14px; display:inline-flex; align-items:center; gap:4px; min-width:64px; justify-content:center;&quot;&gt;
          &lt;span style=&quot;width:6px; height:6px; border-radius:50%; background:#8B5CF6; opacity:0.4;&quot;&gt;&lt;/span&gt;
          &lt;span style=&quot;width:6px; height:6px; border-radius:50%; background:#8B5CF6; opacity:0.7;&quot;&gt;&lt;/span&gt;
          &lt;span style=&quot;width:6px; height:6px; border-radius:50%; background:#8B5CF6; opacity:1;&quot;&gt;&lt;/span&gt;
        &lt;/span&gt;
        &lt;span style=&quot;font-size:10px; color:#9CA3AF;&quot;&gt;reveals in 17s · in-venue is reacting&lt;/span&gt;
      &lt;/div&gt;

      &lt;div style=&quot;padding:6px 14px; display:flex; gap:10px; align-items:flex-start;&quot;&gt;
        &lt;div style=&quot;width:34px; height:34px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); display:flex; align-items:center; justify-content:center; font-size:14px; font-weight:700; color:#fff; flex-shrink:0;&quot;&gt;P&lt;/div&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; gap:2px; flex:1;&quot;&gt;
          &lt;div style=&quot;display:flex; gap:6px; align-items:center;&quot;&gt;&lt;span style=&quot;font-weight:600; font-size:14px;&quot;&gt;patrick&lt;/span&gt;&lt;span style=&quot;font-size:10px; padding:2px 6px; border-radius:8px; color:#fff; font-weight:700; background:#EC4899;&quot;&gt;🎸 Show&lt;/span&gt;&lt;/div&gt;
          &lt;div style=&quot;font-size:14px; line-height:1.3;&quot;&gt;OH NO WAY 🤯🤯🤯&lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Here’s what happened. We started this tour pulling setlists from a couple of upstream sources that publish their own version of the setlist a few minutes after each song lands. Even when those sources are healthy, an entry usually shows up two to ten minutes after the song actually starts. For a couch viewer, that’s the difference between catching the opening of a jam and finding out twenty minutes later you missed it.&lt;/p&gt;

&lt;p&gt;So mid-tour we just stopped waiting. If you’re at the show, type the song into the manage-setlist screen the moment you hear it. Patrick was doing this from the floor most nights, both at his Goose shows and as a couch viewer when I was the one at Phish. The upstream sources still run as a backstop, but the live in-app setlist is now driven by whoever is fastest in the room. Latency dropped from minutes to seconds.&lt;/p&gt;
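
&lt;p&gt;One way to picture “fastest in the room wins, upstream as a backstop” is a merge that keeps the earliest report of each song. This is a sketch under assumptions: the source labels, the field names, and the dedupe-by-title rule are mine, not the app’s.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    title: str
    source: str         # "admin" or "upstream" (hypothetical labels)
    received_at: float  # seconds since show start


def merge_setlist(entries):
    """Keep the earliest report of each song, ordered by first report.

    Titles are compared case-insensitively. A song played twice in one show
    would collapse into one entry here; a real merge would need a position
    or set marker to tell the two plays apart.
    """
    first_seen = {}
    for entry in sorted(entries, key=lambda e: e.received_at):
        key = entry.title.strip().lower()
        if key not in first_seen:
            first_seen[key] = entry
    return list(first_seen.values())
```

&lt;p&gt;With this shape, an upstream entry that arrives minutes later is simply ignored when someone in the room already typed the song, and it still fills the gap when nobody did.&lt;/p&gt;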

&lt;h2 id=&quot;song-calls&quot;&gt;Song Calls&lt;/h2&gt;

&lt;p&gt;People watching from home love to try to guess the next song from the opening notes. You hear a couple of bars on the stream, you blurt out “MADHUVAN!”, you’re either a hero or you wait six seconds and pretend you didn’t say anything. Until this tour the only place to do that was a group text or whoever happened to be in the room with you.&lt;/p&gt;

&lt;p&gt;This was the feature I was most nervous to ship and most happy we did. During a live show, you can now call the next song inside the app. There’s a 📣 chip on the current-song strip; tap it, type the song you think comes next, and the app fuzzy-matches against the band’s catalog (so “atlas” picks up “Atlas Dogs”). Submit, and a pending pill drops below the strip. If the song you called actually plays, you get a green ✓ YOU CALLED IT pill with a confetti burst. Misses fade quietly. There’s a per-show leaderboard so you can see who’s hot tonight, but deliberately no global all-time ranking.&lt;/p&gt;

&lt;p&gt;This is one of the load-bearing design principles of the whole app, so it’s worth stating plainly: &lt;strong&gt;we are not building Fantasy Music.&lt;/strong&gt; The minute you ship a season-long leaderboard, the gravitational pull of the product changes. The goal becomes winning. People start optimizing their calls, gaming the window, refreshing for stats, treating the show as input to a meta-game played somewhere outside of it. The community shrinks into a competition. The leaderboard exists, because the celebration of a correct call is part of the fun, but it lives inside the show and ends with the show. Community first, scoreboard second. Every feature in this app gets evaluated against that line:&lt;/p&gt;

&lt;div style=&quot;display:flex; gap:14px; flex-wrap:wrap; justify-content:center; margin:18px auto;&quot;&gt;
  &lt;div style=&quot;background:#F5F2EB; border-radius:18px; box-shadow:0 6px 20px rgba(0,0,0,0.08); padding:14px 16px; flex:0 0 auto; max-width:340px; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif; color:#262626;&quot;&gt;
    &lt;div style=&quot;font-size:9px; font-weight:700; letter-spacing:0.08em; color:#6B7280; text-transform:uppercase; margin-bottom:8px;&quot;&gt;Now Playing · Tap 📣 to call next&lt;/div&gt;
    &lt;div style=&quot;display:flex; align-items:center; gap:10px; padding:10px 12px; background:#fff; border-radius:14px;&quot;&gt;
      &lt;div style=&quot;font-size:22px;&quot;&gt;🎵&lt;/div&gt;
      &lt;div style=&quot;flex:1; display:flex; flex-direction:column;&quot;&gt;
        &lt;span style=&quot;font-size:15px; font-weight:700;&quot;&gt;All I Need&lt;/span&gt;
        &lt;span style=&quot;font-size:11px; color:#6B7280;&quot;&gt;Set 2 · song 3&lt;/span&gt;
      &lt;/div&gt;
      &lt;span style=&quot;font-size:13px; font-weight:700; color:#8B5CF6; background:#EDE9FE; padding:6px 12px; border-radius:14px; cursor:pointer;&quot;&gt;📣 Call&lt;/span&gt;
    &lt;/div&gt;
    &lt;div style=&quot;margin-top:10px; padding:8px 12px; background:rgba(139,92,246,0.10); border-radius:12px; font-size:12px; color:#5b21b6; display:flex; align-items:center; gap:8px;&quot;&gt;
      &lt;span style=&quot;font-size:14px;&quot;&gt;⏳&lt;/span&gt;&lt;span&gt;Pending: &lt;strong&gt;Empress of Organos&lt;/strong&gt;&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;

  &lt;div style=&quot;background:#F5F2EB; border-radius:18px; box-shadow:0 6px 20px rgba(0,0,0,0.08); padding:14px 16px; flex:0 0 auto; max-width:340px; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif; color:#262626;&quot;&gt;
    &lt;div style=&quot;font-size:9px; font-weight:700; letter-spacing:0.08em; color:#6B7280; text-transform:uppercase; margin-bottom:8px;&quot;&gt;Now Playing&lt;/div&gt;
    &lt;div style=&quot;display:flex; align-items:center; gap:10px; padding:10px 12px; background:#fff; border-radius:14px;&quot;&gt;
      &lt;div style=&quot;font-size:22px;&quot;&gt;🎵&lt;/div&gt;
      &lt;div style=&quot;flex:1; display:flex; flex-direction:column;&quot;&gt;
        &lt;span style=&quot;font-size:15px; font-weight:700;&quot;&gt;Empress of Organos&lt;/span&gt;
        &lt;span style=&quot;font-size:11px; color:#6B7280;&quot;&gt;Set 2 · song 4&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div style=&quot;margin-top:10px; padding:10px 12px; background:linear-gradient(135deg,#10b981 0%,#059669 100%); border-radius:12px; color:#fff; font-size:13px; font-weight:700; display:flex; align-items:center; gap:8px; position:relative; overflow:hidden;&quot;&gt;
      &lt;span style=&quot;font-size:16px;&quot;&gt;✓&lt;/span&gt;&lt;span&gt;YOU CALLED IT&lt;/span&gt;
      &lt;span style=&quot;position:absolute; right:8px; top:6px; font-size:14px;&quot;&gt;🎉&lt;/span&gt;
      &lt;span style=&quot;position:absolute; right:24px; top:14px; font-size:10px;&quot;&gt;✨&lt;/span&gt;
      &lt;span style=&quot;position:absolute; right:36px; top:4px; font-size:8px;&quot;&gt;★&lt;/span&gt;
    &lt;/div&gt;
    &lt;div style=&quot;margin-top:6px; font-size:11px; color:#6B7280; padding:0 4px;&quot;&gt;📣 1 win this show · tap your badge to see history&lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;
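
&lt;p&gt;The fuzzy match on a typed call can be approximated with nothing but the Python standard library. This is a sketch, not the matcher the app ships: the three-step order (exact, prefix, then closest spelling via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;difflib&lt;/code&gt;) and the cutoff value are assumptions.&lt;/p&gt;

```python
import difflib


def match_song(query, catalog, cutoff=0.6):
    """Return the best catalog title for a typed call, or None if nothing fits."""
    q = query.strip().lower()
    if not q:
        return None
    by_lower = {title.lower(): title for title in catalog}

    # 1. Exact match, ignoring case.
    if q in by_lower:
        return by_lower[q]

    # 2. Prefix match, so "atlas" picks up "Atlas Dogs". Shortest title wins.
    prefixed = [title for title in catalog if title.lower().startswith(q)]
    if prefixed:
        return min(prefixed, key=len)

    # 3. Typo tolerance: closest spelling above the similarity cutoff.
    close = difflib.get_close_matches(q, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None
```

&lt;p&gt;Returning nothing for a weak match matters more than it looks: a wrong guess by the matcher would silently turn a good call into a miss.&lt;/p&gt;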

&lt;p&gt;Across the tour, &lt;strong&gt;105 song calls&lt;/strong&gt; went out from 9 different callers. The most-called song was Factory Fiction, called eight separate times by different people across the run. It never landed. The most-correct call was Into the Myst, which hit three times.&lt;/p&gt;

&lt;h2 id=&quot;chomp-live-chat-grew-up&quot;&gt;Chomp (Live Chat) Grew Up&lt;/h2&gt;

&lt;p&gt;The live chat is called Chomp because of course it is. It got a lot of love this tour:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;@mention autocomplete&lt;/strong&gt; in the composer, with the dropdown flipping above the input on Android when there’s no room below.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Tap a song pill&lt;/strong&gt; in the chat to open a stat sheet showing last played, gap, FTP info, and your personal history with the song.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;FTP-witness pill&lt;/strong&gt; so when someone in the chat was at a song’s debut show, everyone knows.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Heart burst animation&lt;/strong&gt; when someone favorites your message. This is small. It also matters.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Tap the purple here-now bar&lt;/strong&gt; to expand the full chomper roster. When more than eight people are watching, the roster scrolls instead of pushing the rest of the page off the screen.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Optimistic bubble insertion&lt;/strong&gt; so your message appears the instant you hit send, not after the round trip.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;@mention notifications&lt;/strong&gt; that route to chat instead of the generic notification feed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus a long tail of avatar rendering speedups (thumbnails over full-size, async decoding, parallel batch enrichment) that make the chat feel like it’s keeping up with the show instead of catching up to it.&lt;/p&gt;
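
&lt;p&gt;Optimistic insertion, from the list above, is the one item here with a clear shape in code. A minimal in-memory sketch follows; the class, the temporary-id scheme, and the rollback on failure are illustrative assumptions, not the app’s implementation.&lt;/p&gt;

```python
import itertools


class ChatTimeline:
    """In-memory timeline that shows a message before the server confirms it."""

    def __init__(self):
        self.messages = []  # dicts in display order
        self._temp_ids = itertools.count(1)

    def send(self, author, text):
        """Insert the bubble immediately and return its temporary id."""
        temp_id = f"temp-{next(self._temp_ids)}"
        self.messages.append(
            {"id": temp_id, "author": author, "text": text, "pending": True}
        )
        return temp_id

    def confirm(self, temp_id, server_id):
        """Swap in the server's id once the round trip completes."""
        for message in self.messages:
            if message["id"] == temp_id:
                message["id"] = server_id
                message["pending"] = False
                return True
        return False

    def fail(self, temp_id):
        """Remove the bubble if the send failed; return True if one was removed."""
        before = len(self.messages)
        self.messages = [m for m in self.messages if m["id"] != temp_id]
        return len(self.messages) != before
```

&lt;p&gt;The bubble keeps its place in the timeline when the confirmation arrives, so nothing jumps on screen; only the id and the pending flag change.&lt;/p&gt;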

&lt;p&gt;Here is roughly what the chomp looks like now during a live show, with most of those features in one frame: the purple here-now bar at the top (tap to expand the roster), a tappable song pill, an @mention rendered with its purple highlight, and a chat message that’s been hearted.&lt;/p&gt;

&lt;div style=&quot;background:#e5e2d9; padding:18px; border-radius:14px; margin:16px auto; max-width:520px; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif; color:#262626;&quot;&gt;
  &lt;div style=&quot;background:#F5F2EB; border-radius:18px; box-shadow:0 6px 20px rgba(0,0,0,0.08); overflow:hidden; max-width:380px; margin:0 auto;&quot;&gt;
    &lt;div style=&quot;background:#262626; color:#fff; padding:10px 16px; font-size:11px; letter-spacing:0.04em; text-transform:uppercase; display:flex; justify-content:space-between; align-items:center;&quot;&gt;
      &lt;span&gt;🪿 Goose · Live&lt;/span&gt;
      &lt;span style=&quot;opacity:0.65; font-weight:400; text-transform:none; letter-spacing:0;&quot;&gt;Saenger Theatre&lt;/span&gt;
    &lt;/div&gt;

    &lt;div style=&quot;margin:10px 12px; padding:8px 12px; background:linear-gradient(135deg,#8B5CF6 0%,#7c3aed 100%); border-radius:14px; display:flex; align-items:center; gap:8px; cursor:pointer;&quot;&gt;
      &lt;div style=&quot;display:flex;&quot;&gt;
        &lt;div style=&quot;width:24px; height:24px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); border:2px solid #fff; display:flex; align-items:center; justify-content:center; font-size:10px; font-weight:700; color:#fff; margin-right:-8px;&quot;&gt;P&lt;/div&gt;
        &lt;div style=&quot;width:24px; height:24px; border-radius:50%; background:linear-gradient(135deg,#b0eaff,#3ba8e0); border:2px solid #fff; display:flex; align-items:center; justify-content:center; font-size:10px; font-weight:700; color:#fff; margin-right:-8px;&quot;&gt;C&lt;/div&gt;
        &lt;div style=&quot;width:24px; height:24px; border-radius:50%; background:linear-gradient(135deg,#d9c6ff,#a88fe6); border:2px solid #fff; display:flex; align-items:center; justify-content:center; font-size:10px; font-weight:700; color:#fff; margin-right:-8px;&quot;&gt;M&lt;/div&gt;
        &lt;div style=&quot;width:24px; height:24px; border-radius:50%; background:linear-gradient(135deg,#ffc8d8,#ec4899); border:2px solid #fff; display:flex; align-items:center; justify-content:center; font-size:10px; font-weight:700; color:#fff; margin-right:-8px;&quot;&gt;Q&lt;/div&gt;
        &lt;div style=&quot;width:24px; height:24px; border-radius:50%; background:linear-gradient(135deg,#a7f3d0,#10b981); border:2px solid #fff; display:flex; align-items:center; justify-content:center; font-size:10px; font-weight:700; color:#fff;&quot;&gt;B&lt;/div&gt;
      &lt;/div&gt;
      &lt;span style=&quot;flex:1; color:#fff; font-size:12px; font-weight:600;&quot;&gt;12 chomping right now&lt;/span&gt;
      &lt;span style=&quot;color:#fff; opacity:0.8; font-size:14px;&quot;&gt;›&lt;/span&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:6px 14px;&quot;&gt;&lt;span style=&quot;font-size:13px; font-weight:600; color:#8B5CF6; background:#EDE9FE; padding:4px 12px; border-radius:12px; cursor:pointer;&quot;&gt;🎵 Atlas Dogs&lt;/span&gt;&lt;/div&gt;

    &lt;div style=&quot;padding:6px 14px; display:flex; gap:10px; align-items:flex-start;&quot;&gt;
      &lt;div style=&quot;width:34px; height:34px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); display:flex; align-items:center; justify-content:center; font-size:14px; font-weight:700; color:#fff; flex-shrink:0;&quot;&gt;P&lt;/div&gt;
      &lt;div style=&quot;display:flex; flex-direction:column; gap:2px; flex:1;&quot;&gt;
        &lt;div style=&quot;display:flex; gap:6px; align-items:center;&quot;&gt;&lt;span style=&quot;font-weight:600; font-size:14px;&quot;&gt;patrick&lt;/span&gt;&lt;span style=&quot;font-size:10px; padding:2px 6px; border-radius:8px; color:#fff; font-weight:700; background:#EC4899;&quot;&gt;🎸 Show&lt;/span&gt;&lt;/div&gt;
        &lt;div style=&quot;font-size:14px; line-height:1.3;&quot;&gt;this jam is &lt;span style=&quot;color:#8B5CF6; font-weight:600;&quot;&gt;unreal&lt;/span&gt; 🔥🔥🔥&lt;/div&gt;
        &lt;div style=&quot;display:flex; gap:8px; margin-top:4px; align-items:center;&quot;&gt;
          &lt;span style=&quot;font-size:12px; padding:2px 10px; border-radius:10px; background:rgba(236,72,153,0.12); color:#EC4899; font-weight:600; cursor:pointer;&quot;&gt;❤ 4&lt;/span&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:6px 14px; display:flex; gap:10px; align-items:flex-start;&quot;&gt;
      &lt;div style=&quot;width:34px; height:34px; border-radius:50%; background:linear-gradient(135deg,#b0eaff,#3ba8e0); display:flex; align-items:center; justify-content:center; font-size:14px; font-weight:700; color:#fff; flex-shrink:0;&quot;&gt;C&lt;/div&gt;
      &lt;div style=&quot;display:flex; flex-direction:column; gap:2px; flex:1;&quot;&gt;
        &lt;div style=&quot;display:flex; gap:6px; align-items:center;&quot;&gt;&lt;span style=&quot;font-weight:600; font-size:14px;&quot;&gt;chomper1&lt;/span&gt;&lt;span style=&quot;font-size:10px; padding:2px 6px; border-radius:8px; color:#fff; font-weight:700; background:#3B82F6;&quot;&gt;🛋 Couch&lt;/span&gt;&lt;/div&gt;
        &lt;div style=&quot;font-size:14px; line-height:1.3;&quot;&gt;&lt;span style=&quot;color:#8B5CF6; font-weight:600; background:#EDE9FE; padding:1px 4px; border-radius:4px;&quot;&gt;@patrick&lt;/span&gt; agreed, this is the version we&apos;ll talk about later&lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:8px 14px 14px;&quot;&gt;
      &lt;div style=&quot;display:flex; gap:8px; align-items:center; padding:8px 12px; background:#fff; border-radius:18px; border:1px solid rgba(0,0,0,0.08);&quot;&gt;
        &lt;span style=&quot;font-size:14px; color:#9CA3AF; flex:1;&quot;&gt;Say something to the chomp…&lt;/span&gt;
        &lt;span style=&quot;font-size:12px; font-weight:700; color:#fff; background:#8B5CF6; padding:4px 12px; border-radius:12px;&quot;&gt;Send&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Tap any of those purple &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;🎵&lt;/code&gt; song pills and a stat sheet slides up. It shows the last time the band played the song, the gap since, the debut date and venue, and a personal strip with your own history with the song. The one card answers the four questions every chomper asks the moment a song starts (“when did they last play this,” “is this rare,” “first time,” and “have I caught it”):&lt;/p&gt;

&lt;div style=&quot;background:#e5e2d9; padding:18px; border-radius:14px; margin:16px auto; max-width:520px; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif; color:#262626;&quot;&gt;
  &lt;div style=&quot;background:#F5F2EB; border-radius:18px; box-shadow:0 6px 20px rgba(0,0,0,0.08); overflow:hidden; max-width:380px; margin:0 auto;&quot;&gt;
    &lt;div style=&quot;background:#262626; color:#fff; padding:10px 16px; font-size:11px; letter-spacing:0.04em; text-transform:uppercase; display:flex; justify-content:space-between; align-items:center;&quot;&gt;
      &lt;span&gt;🎵 Song stats&lt;/span&gt;
      &lt;span style=&quot;opacity:0.65; font-weight:400; text-transform:none; letter-spacing:0;&quot;&gt;tap-up sheet&lt;/span&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:18px 16px 6px;&quot;&gt;
      &lt;div style=&quot;display:inline-block; font-size:9px; font-weight:800; letter-spacing:0.08em; color:#fff; background:#059669; padding:3px 8px; border-radius:999px; text-transform:uppercase;&quot;&gt;Rare&lt;/div&gt;
      &lt;div style=&quot;font-size:24px; font-weight:800; margin-top:8px; line-height:1.1;&quot;&gt;Factory Fiction&lt;/div&gt;
      &lt;div style=&quot;font-size:11px; color:#6B7280; margin-top:2px;&quot;&gt;Goose · 27 lifetime plays&lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:8px 16px 8px;&quot;&gt;
      &lt;div style=&quot;padding:10px 12px; background:linear-gradient(135deg,#EC4899 0%,#db2777 100%); color:#fff; border-radius:12px; display:flex; align-items:center; gap:8px;&quot;&gt;
        &lt;span style=&quot;font-size:18px;&quot;&gt;🎯&lt;/span&gt;
        &lt;div style=&quot;flex:1;&quot;&gt;&lt;div style=&quot;font-size:13px; font-weight:700;&quot;&gt;You&apos;ve never caught Factory Fiction live&lt;/div&gt;&lt;div style=&quot;font-size:10px; opacity:0.85;&quot;&gt;on your wishlist · 8 fans called it this tour, none landed&lt;/div&gt;&lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:0 16px 14px; display:grid; grid-template-columns:1fr 1fr; gap:8px;&quot;&gt;
      &lt;div style=&quot;background:#fff; padding:10px 12px; border-radius:10px;&quot;&gt;
        &lt;div style=&quot;font-size:9px; font-weight:700; letter-spacing:0.05em; color:#6B7280; text-transform:uppercase;&quot;&gt;Last played&lt;/div&gt;
        &lt;div style=&quot;font-size:15px; font-weight:800; margin-top:2px;&quot;&gt;12/13/25&lt;/div&gt;
        &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;Goosemas · Hampton&lt;/div&gt;
      &lt;/div&gt;
      &lt;div style=&quot;background:#fff; padding:10px 12px; border-radius:10px;&quot;&gt;
        &lt;div style=&quot;font-size:9px; font-weight:700; letter-spacing:0.05em; color:#6B7280; text-transform:uppercase;&quot;&gt;Show gap&lt;/div&gt;
        &lt;div style=&quot;font-size:15px; font-weight:800; margin-top:2px;&quot;&gt;14 shows&lt;/div&gt;
        &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;since last play&lt;/div&gt;
      &lt;/div&gt;
      &lt;div style=&quot;background:#fff; padding:10px 12px; border-radius:10px;&quot;&gt;
        &lt;div style=&quot;font-size:9px; font-weight:700; letter-spacing:0.05em; color:#6B7280; text-transform:uppercase;&quot;&gt;FTP&lt;/div&gt;
        &lt;div style=&quot;font-size:15px; font-weight:800; margin-top:2px;&quot;&gt;Oct 9, 2016&lt;/div&gt;
        &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;The Hartford, CT&lt;/div&gt;
      &lt;/div&gt;
      &lt;div style=&quot;background:#fff; padding:10px 12px; border-radius:10px;&quot;&gt;
        &lt;div style=&quot;font-size:9px; font-weight:700; letter-spacing:0.05em; color:#6B7280; text-transform:uppercase;&quot;&gt;All-time plays&lt;/div&gt;
        &lt;div style=&quot;font-size:15px; font-weight:800; margin-top:2px;&quot;&gt;27&lt;/div&gt;
        &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;across all tours&lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:0 16px 16px;&quot;&gt;
      &lt;div style=&quot;padding:8px 12px; background:rgba(139,92,246,0.10); border-radius:10px; font-size:11px; color:#5b21b6; display:flex; align-items:center; gap:8px;&quot;&gt;
        &lt;span style=&quot;font-size:14px;&quot;&gt;👀&lt;/span&gt;
        &lt;span&gt;&lt;strong&gt;0 people in chomp&lt;/strong&gt; were at this song&apos;s FTP. (It was a 2016 small-club show, before most of us found the band.)&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;tour-stats-got-serious&quot;&gt;Tour Stats Got Serious&lt;/h2&gt;

&lt;p&gt;Tour Stats started as a personal-only “here are some numbers about your shows” page. It is now a real product surface.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Cross-user view with a public/private toggle&lt;/strong&gt;, so you can compare your tour to your friends’ tours if they’ve opted in.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Band filter on every drill-down&lt;/strong&gt;, so you can see your Phish stats separately from your Goose stats separately from your Max Creek stats.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;New stat cards&lt;/strong&gt;: Setlist Staples (most-played songs you haven’t caught yet), FTP count in the overview, Tour Completion with band names on each row, and GEOGRAPHIC REACH, normalized so users with duplicate rows in the shows table don’t double-count states.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Interactive drill-downs&lt;/strong&gt; that take you straight to the show or song.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Per-band bustout threshold&lt;/strong&gt; so Phish rotation staples stop misflagging as 🔥 fire bustouts.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Past-show logging flow&lt;/strong&gt; with tour, festival, and event pills, plus month sub-chips. Logging shows you went to before joining is now a real path, not a chore.&lt;/li&gt;
&lt;/ul&gt;
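
&lt;p&gt;As a rough illustration of the per-band bustout threshold, here is a minimal Python sketch. The threshold values, the default, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_bustout&lt;/code&gt; helper are all assumptions made for the example, not the app’s actual numbers or code.&lt;/p&gt;

```python
# Hypothetical per-band gap thresholds, in shows since the last play.
# The real values are not stated in this post.
BUSTOUT_GAP_BY_BAND = {"Phish": 50, "Goose": 30}
DEFAULT_BUSTOUT_GAP = 30


def is_bustout(band: str, show_gap: int) -> bool:
    """Flag a song as a bustout when its gap meets the band's own threshold."""
    threshold = BUSTOUT_GAP_BY_BAND.get(band, DEFAULT_BUSTOUT_GAP)
    return show_gap >= threshold
```

&lt;p&gt;With these made-up numbers, a 31-show gap flags as a bustout for Goose but not for Phish, which is the misflagging the per-band threshold is meant to stop.&lt;/p&gt;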

&lt;p&gt;We also surfaced Tour Stats in the compass (More) drawer, so people can find it.&lt;/p&gt;

&lt;p&gt;Here’s what most of the screen looks like: band pill at the top, public/private toggle, the OVERVIEW row with FTP count and Tour Completion, then the Setlist Staples card surfacing “common picks you’ve somehow missed” (counted against your “going” RSVPs only, since couch tour doesn’t count against you).&lt;/p&gt;

&lt;div style=&quot;background:#e5e2d9; padding:18px; border-radius:14px; margin:16px auto; max-width:520px; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif; color:#262626;&quot;&gt;
  &lt;div style=&quot;background:#F5F2EB; border-radius:18px; box-shadow:0 6px 20px rgba(0,0,0,0.08); overflow:hidden; max-width:380px; margin:0 auto;&quot;&gt;
    &lt;div style=&quot;background:#262626; color:#fff; padding:10px 16px; font-size:11px; letter-spacing:0.04em; text-transform:uppercase; display:flex; justify-content:space-between; align-items:center;&quot;&gt;
      &lt;span&gt;🧭 Tour Stats&lt;/span&gt;
      &lt;span style=&quot;opacity:0.65; font-weight:400; text-transform:none; letter-spacing:0;&quot;&gt;@cmeik&lt;/span&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:12px 14px 4px; display:flex; gap:8px; align-items:center;&quot;&gt;
      &lt;span style=&quot;font-size:12px; font-weight:700; padding:6px 12px; border-radius:14px; background:#8B5CF6; color:#fff;&quot;&gt;🪿 Goose ›&lt;/span&gt;
      &lt;span style=&quot;flex:1;&quot;&gt;&lt;/span&gt;
      &lt;span style=&quot;font-size:11px; font-weight:700; padding:5px 10px; border-radius:12px; background:#fff; border:1px solid #d1d5db; color:#6B7280;&quot;&gt;🌐 Public&lt;/span&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:8px 14px 6px; font-size:10px; font-weight:800; letter-spacing:0.06em; color:#6B7280; text-transform:uppercase;&quot;&gt;Overview&lt;/div&gt;
    &lt;div style=&quot;padding:0 14px 12px; display:grid; grid-template-columns:1fr 1fr; gap:8px;&quot;&gt;
      &lt;div style=&quot;background:#fff; padding:10px 12px; border-radius:12px;&quot;&gt;
        &lt;div style=&quot;font-size:9px; font-weight:700; letter-spacing:0.05em; color:#6B7280; text-transform:uppercase;&quot;&gt;🥚 FTPs&lt;/div&gt;
        &lt;div style=&quot;font-size:22px; font-weight:800; margin-top:2px;&quot;&gt;37&lt;/div&gt;
        &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;debuts you caught live&lt;/div&gt;
      &lt;/div&gt;
      &lt;div style=&quot;background:#fff; padding:10px 12px; border-radius:12px;&quot;&gt;
        &lt;div style=&quot;font-size:9px; font-weight:700; letter-spacing:0.05em; color:#6B7280; text-transform:uppercase;&quot;&gt;Shows&lt;/div&gt;
        &lt;div style=&quot;font-size:22px; font-weight:800; margin-top:2px;&quot;&gt;52&lt;/div&gt;
        &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;across 18 venues&lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:0 14px 12px;&quot;&gt;
      &lt;div style=&quot;background:#fff; padding:10px 12px; border-radius:12px;&quot;&gt;
        &lt;div style=&quot;display:flex; justify-content:space-between; align-items:baseline;&quot;&gt;
          &lt;div style=&quot;font-size:10px; font-weight:800; letter-spacing:0.05em; color:#6B7280; text-transform:uppercase;&quot;&gt;Tour Completion&lt;/div&gt;
          &lt;div style=&quot;font-size:11px; color:#6B7280;&quot;&gt;Spring &apos;26&lt;/div&gt;
        &lt;/div&gt;
        &lt;div style=&quot;font-size:18px; font-weight:800; margin-top:4px;&quot;&gt;9 / 14 nights&lt;/div&gt;
        &lt;div style=&quot;height:6px; background:#EDE9FE; border-radius:3px; margin-top:6px; overflow:hidden;&quot;&gt;
          &lt;div style=&quot;height:100%; width:64%; background:linear-gradient(90deg,#8B5CF6,#7c3aed);&quot;&gt;&lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:0 14px 14px;&quot;&gt;
      &lt;div style=&quot;background:#fff; padding:14px 14px 12px; border-radius:14px;&quot;&gt;
        &lt;div style=&quot;display:flex; align-items:baseline; gap:8px;&quot;&gt;
          &lt;span style=&quot;font-size:14px;&quot;&gt;📌&lt;/span&gt;&lt;span style=&quot;font-size:12px; font-weight:800; letter-spacing:0.04em; color:#8B5CF6; text-transform:uppercase;&quot;&gt;Setlist Staples&lt;/span&gt;
        &lt;/div&gt;
        &lt;div style=&quot;font-size:11px; color:#6B7280; margin-bottom:10px;&quot;&gt;Common picks · still on your list&lt;/div&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; gap:6px;&quot;&gt;
          &lt;div style=&quot;display:flex; align-items:center; gap:10px; padding:6px 10px; background:#F5F2EB; border-radius:8px;&quot;&gt;&lt;span style=&quot;font-size:11px; font-weight:800; color:#8B5CF6; min-width:16px;&quot;&gt;1&lt;/span&gt;&lt;span style=&quot;flex:1; font-size:13px; font-weight:600;&quot;&gt;Indian River&lt;/span&gt;&lt;span style=&quot;font-size:10px; color:#6B7280;&quot;&gt;×117&lt;/span&gt;&lt;/div&gt;
          &lt;div style=&quot;display:flex; align-items:center; gap:10px; padding:6px 10px; background:#F5F2EB; border-radius:8px;&quot;&gt;&lt;span style=&quot;font-size:11px; font-weight:800; color:#8B5CF6; min-width:16px;&quot;&gt;2&lt;/span&gt;&lt;span style=&quot;flex:1; font-size:13px; font-weight:600;&quot;&gt;Butter Rum&lt;/span&gt;&lt;span style=&quot;font-size:10px; color:#6B7280;&quot;&gt;×113&lt;/span&gt;&lt;/div&gt;
          &lt;div style=&quot;display:flex; align-items:center; gap:10px; padding:6px 10px; background:#F5F2EB; border-radius:8px;&quot;&gt;&lt;span style=&quot;font-size:11px; font-weight:800; color:#8B5CF6; min-width:16px;&quot;&gt;3&lt;/span&gt;&lt;span style=&quot;flex:1; font-size:13px; font-weight:600;&quot;&gt;Lead the Way&lt;/span&gt;&lt;span style=&quot;font-size:10px; color:#6B7280;&quot;&gt;×86&lt;/span&gt;&lt;/div&gt;
          &lt;div style=&quot;display:flex; align-items:center; gap:10px; padding:6px 10px; background:#F5F2EB; border-radius:8px;&quot;&gt;&lt;span style=&quot;font-size:11px; font-weight:800; color:#8B5CF6; min-width:16px;&quot;&gt;4&lt;/span&gt;&lt;span style=&quot;flex:1; font-size:13px; font-weight:600;&quot;&gt;White Lights&lt;/span&gt;&lt;span style=&quot;font-size:10px; color:#6B7280;&quot;&gt;×68&lt;/span&gt;&lt;/div&gt;
          &lt;div style=&quot;display:flex; align-items:center; gap:10px; padding:6px 10px; background:#F5F2EB; border-radius:8px;&quot;&gt;&lt;span style=&quot;font-size:11px; font-weight:800; color:#8B5CF6; min-width:16px;&quot;&gt;5&lt;/span&gt;&lt;span style=&quot;flex:1; font-size:13px; font-weight:600;&quot;&gt;Crosseyed &amp;amp; Painless&lt;/span&gt;&lt;span style=&quot;font-size:10px; color:#6B7280;&quot;&gt;×41&lt;/span&gt;&lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;My version of this card looks nothing like the example. I’ve been to 85 Goose shows, so the common picks are all caught and my staples list is now entirely rare bustouts. Crosseyed &amp;amp; Painless is the one that still won’t come.&lt;/p&gt;

&lt;h2 id=&quot;today-the-tour-recap-avalanche&quot;&gt;Today: The Tour Recap Avalanche&lt;/h2&gt;

&lt;p&gt;Today was insane. Sunday is wrap-up day, the morning after the run closed out, and we shipped the entire end-of-tour package in the span of about eighteen hours. Every show now gets a recap blurb generated from the live chat sentiment, weighted by which songs got the most love. Recaps use Opus when there’s actual chat content and skip cleanly when there isn’t, so we’re not paying tokens to summarize empty rooms. We backfilled recaps for older shows via a CLI, which means every Goose show on the platform now has a top-level recap blurb you can read.&lt;/p&gt;

&lt;p&gt;The Flow got a “For you” section that consolidates LIVE NOW, tonight’s shows, and tour recap into one place, with sparkles. Cards for archival recordings collapse into a trending card so the feed doesn’t get spammy when 40 people post the same Relisten link.&lt;/p&gt;

&lt;h3 id=&quot;how-the-bracket-gets-built&quot;&gt;How the bracket gets built&lt;/h3&gt;

&lt;p&gt;The end-of-tour jam tournament is downstream of the per-show recap pipeline, so the seeding isn’t editorial. It’s data.&lt;/p&gt;

&lt;p&gt;For every show, we run sentiment analysis over the live chat and weight by song. A song that triggers a flurry of fire emoji and “ARE YOU KIDDING ME”s reads as a heater. A song that gets polite acknowledgment doesn’t. That’s how we identify the jams that actually moved the room.&lt;/p&gt;

&lt;p&gt;Then we cross-reference each jam against historical setlist data. A song with a 31-show gap or a sub-10 lifetime play count is automatically a bustout candidate. A regularly played rotation song needs the chat heat to carry it. The two signals combine into a per-show “jam score.”&lt;/p&gt;
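
&lt;p&gt;One way the two signals could combine, sketched in Python. The 31-show gap and the sub-10 play count come from the paragraph above. Everything else is an assumption for illustration: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JamSignal&lt;/code&gt; shape, a chat heat normalized to the range 0 to 1, and a flat additive bonus for bustout candidates. The actual scoring formula is not described in this post.&lt;/p&gt;

```python
from dataclasses import dataclass

GAP_THRESHOLD = 31     # shows since last play, from the example in the text
RARE_PLAY_COUNT = 10   # "sub-10 lifetime play count"
BUSTOUT_BONUS = 0.5    # assumed weight; the post does not say how signals combine


@dataclass(frozen=True)
class JamSignal:
    song: str
    chat_heat: float     # assumed: chat sentiment for the song, normalized to [0, 1]
    show_gap: int        # shows since the song was last played
    lifetime_plays: int  # all-time play count


def is_bustout_candidate(jam: JamSignal) -> bool:
    """A long gap or a low lifetime play count qualifies on its own."""
    return jam.show_gap >= GAP_THRESHOLD or RARE_PLAY_COUNT > jam.lifetime_plays


def jam_score(jam: JamSignal) -> float:
    """Chat heat plus a flat bonus when the song is a bustout candidate."""
    bonus = BUSTOUT_BONUS if is_bustout_candidate(jam) else 0.0
    return jam.chat_heat + bonus
```

&lt;p&gt;Under this sketch a rotation song earns its score from chat heat alone, while a bustout candidate starts with a head start and needs less of it.&lt;/p&gt;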

&lt;p&gt;When tour wraps, we forward-link those jam scores into a tournament. The top sixteen become the bracket. Highest jam score gets the 1 seed, lowest gets the 16, and we run a March Madness-style bracket: Round of 16, Quarters, Semis, Final. Voting opens for each round in sequence. The community decides the winner.&lt;/p&gt;
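
&lt;p&gt;The seeding step can be sketched in a few lines of Python. The function names are hypothetical, and pairing the top seed against the bottom seed is an assumption borrowed from conventional tournament brackets. The post only says that the highest jam score gets the 1 seed and the lowest gets the 16, so the real bracket may pair seeds differently.&lt;/p&gt;

```python
def seed_bracket(jam_scores: dict[str, float], size: int = 16) -> list[tuple[int, str]]:
    """Rank jams by score, highest first, and keep the top `size` as seeds 1..size."""
    ranked = sorted(jam_scores.items(), key=lambda item: item[1], reverse=True)[:size]
    return [(seed, name) for seed, (name, _score) in enumerate(ranked, start=1)]


def first_round(seeds: list[tuple[int, str]]) -> list[tuple[tuple[int, str], tuple[int, str]]]:
    """Pair the best remaining seed with the worst: 1 v 16, 2 v 15, and so on."""
    n = len(seeds)
    return [(seeds[i], seeds[n - 1 - i]) for i in range(n // 2)]
```

&lt;p&gt;Because Python’s sort is stable, jams with identical scores keep their input order here. A real implementation would probably want an explicit tiebreaker.&lt;/p&gt;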

&lt;p&gt;Each matchup card has an inline audio player for each side, since you obviously need to hear both jams to vote between them. No second tab, no link to chase. Two sources are wired in: a soundboard recording if you have a paid streaming subscription, or the taper recording for free if you don’t. You always get audio for both. The card looks like this:&lt;/p&gt;

&lt;div style=&quot;background:#e5e2d9; padding:18px; border-radius:14px; margin:16px auto; max-width:800px; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif; color:#262626;&quot;&gt;
  &lt;div style=&quot;background:#F5F2EB; border-radius:18px; box-shadow:0 6px 20px rgba(0,0,0,0.08); overflow:hidden; max-width:760px; margin:0 auto;&quot;&gt;
    &lt;div style=&quot;background:#262626; color:#fff; padding:10px 16px; font-size:11px; letter-spacing:0.04em; text-transform:uppercase; display:flex; justify-content:space-between; align-items:center;&quot;&gt;
      &lt;span&gt;🏆 Goose Spring &apos;26 Bracket&lt;/span&gt;
      &lt;span style=&quot;opacity:0.65; font-weight:400; text-transform:none; letter-spacing:0;&quot;&gt;Round of 16 · 4 of 8 voted&lt;/span&gt;
    &lt;/div&gt;
    &lt;div style=&quot;padding:10px 14px 4px; font-size:10px; font-weight:800; letter-spacing:0.06em; color:#6B7280; text-transform:uppercase;&quot;&gt;Vote in this matchup&lt;/div&gt;

    &lt;div style=&quot;padding:8px 14px 14px;&quot;&gt;
      &lt;div style=&quot;background:#fff; border-radius:14px; padding:14px 14px 12px; box-shadow:0 1px 4px rgba(0,0,0,0.04);&quot;&gt;
        &lt;div style=&quot;display:grid; grid-template-columns:1fr auto 1fr; gap:8px; align-items:center;&quot;&gt;
          &lt;div style=&quot;display:flex; flex-direction:column; gap:2px; padding:8px 10px; background:#EDE9FE; border-radius:10px; cursor:pointer;&quot;&gt;
            &lt;div style=&quot;font-size:9px; font-weight:800; letter-spacing:0.05em; color:#8B5CF6;&quot;&gt;SEED 2&lt;/div&gt;
            &lt;div style=&quot;font-size:14px; font-weight:700; line-height:1.2;&quot;&gt;Tumble&lt;/div&gt;
            &lt;div style=&quot;font-size:10px; color:#6B7280; margin-top:2px;&quot;&gt;4/22 · Saenger, NOLA&lt;/div&gt;
          &lt;/div&gt;
          &lt;div style=&quot;font-size:11px; font-weight:800; color:#6B7280;&quot;&gt;VS&lt;/div&gt;
          &lt;div style=&quot;display:flex; flex-direction:column; gap:2px; padding:8px 10px; background:#F5F2EB; border:1px dashed #d1d5db; border-radius:10px; cursor:pointer;&quot;&gt;
            &lt;div style=&quot;font-size:9px; font-weight:800; letter-spacing:0.05em; color:#6B7280;&quot;&gt;SEED 7&lt;/div&gt;
            &lt;div style=&quot;font-size:14px; font-weight:700; line-height:1.2;&quot;&gt;Hungersite&lt;/div&gt;
            &lt;div style=&quot;font-size:10px; color:#6B7280; margin-top:2px;&quot;&gt;3/28 · Athens&lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;

        &lt;div style=&quot;margin-top:10px; padding:8px 10px; background:#F5F2EB; border-radius:10px; display:flex; align-items:center; gap:10px;&quot;&gt;
          &lt;span style=&quot;width:30px; height:30px; border-radius:50%; background:#262626; color:#fff; display:flex; align-items:center; justify-content:center; font-size:13px; flex-shrink:0;&quot;&gt;▶&lt;/span&gt;
          &lt;div style=&quot;display:flex; flex-direction:column; gap:2px; flex:1; min-width:0;&quot;&gt;
            &lt;span style=&quot;font-size:11px; font-weight:700;&quot;&gt;🎵 Tumble · 4/22 NOLA · soundboard&lt;/span&gt;
            &lt;div style=&quot;display:flex; align-items:center; gap:6px;&quot;&gt;
              &lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;3:14&lt;/span&gt;
              &lt;div style=&quot;height:3px; background:rgba(0,0,0,0.08); border-radius:2px; flex:1; overflow:hidden;&quot;&gt;&lt;div style=&quot;height:100%; width:18%; background:#8B5CF6;&quot;&gt;&lt;/div&gt;&lt;/div&gt;
              &lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;17:42&lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;

        &lt;div style=&quot;margin-top:6px; padding:8px 10px; background:#F5F2EB; border-radius:10px; display:flex; align-items:center; gap:10px;&quot;&gt;
          &lt;span style=&quot;width:30px; height:30px; border-radius:50%; background:#fff; border:2px solid #262626; color:#262626; display:flex; align-items:center; justify-content:center; font-size:13px; flex-shrink:0;&quot;&gt;▶&lt;/span&gt;
          &lt;div style=&quot;display:flex; flex-direction:column; gap:2px; flex:1; min-width:0;&quot;&gt;
            &lt;span style=&quot;font-size:11px; font-weight:700;&quot;&gt;🎵 Hungersite · 3/28 Athens · taper&lt;/span&gt;
            &lt;div style=&quot;display:flex; align-items:center; gap:6px;&quot;&gt;
              &lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;0:00&lt;/span&gt;
              &lt;div style=&quot;height:3px; background:rgba(0,0,0,0.08); border-radius:2px; flex:1; overflow:hidden;&quot;&gt;&lt;div style=&quot;height:100%; width:0%; background:#8B5CF6;&quot;&gt;&lt;/div&gt;&lt;/div&gt;
              &lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;12:08&lt;/span&gt;
            &lt;/div&gt;
          &lt;/div&gt;
        &lt;/div&gt;

        &lt;div style=&quot;margin-top:10px; display:flex; align-items:center; gap:8px;&quot;&gt;
          &lt;div style=&quot;display:flex;&quot;&gt;
            &lt;div style=&quot;width:20px; height:20px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-6px;&quot;&gt;P&lt;/div&gt;
            &lt;div style=&quot;width:20px; height:20px; border-radius:50%; background:linear-gradient(135deg,#b0eaff,#3ba8e0); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-6px;&quot;&gt;C&lt;/div&gt;
            &lt;div style=&quot;width:20px; height:20px; border-radius:50%; background:linear-gradient(135deg,#a7f3d0,#10b981); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700;&quot;&gt;B&lt;/div&gt;
          &lt;/div&gt;
          &lt;span style=&quot;font-size:10px; color:#6B7280;&quot;&gt;3 voted for Tumble · 1 for Hungersite&lt;/span&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:6px 14px 4px; font-size:10px; font-weight:800; letter-spacing:0.06em; color:#6B7280; text-transform:uppercase;&quot;&gt;Full bracket&lt;/div&gt;
    &lt;div style=&quot;padding:6px 14px 18px; display:grid; grid-template-columns:1.4fr 1fr 1fr 0.8fr; gap:8px; overflow-x:auto;&quot;&gt;

      &lt;!-- Round of 16 (8 matchups) --&gt;
      &lt;div style=&quot;display:flex; flex-direction:column; gap:6px;&quot;&gt;
        &lt;div style=&quot;font-size:8px; font-weight:800; letter-spacing:0.06em; color:#9CA3AF; text-transform:uppercase;&quot;&gt;Round of 16&lt;/div&gt;
        &lt;div style=&quot;background:#fff; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;(1) Madhuvan ✓&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;vs Pancakes&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#fff; padding:6px 8px; border-radius:6px; font-size:10px; border:2px solid #8B5CF6;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;(2) Tumble · 4/22&lt;/div&gt;&lt;div style=&quot;color:#8B5CF6; font-weight:600;&quot;&gt;vs (7) Hungersite — voting&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#fff; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;(3) Atlas Dogs ✓&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;vs (6) Empress&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#fff; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;(4) Into the Myst ✓&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;vs Doobie&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#fff; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;(5) All I Need&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;vs (12) Borne — voting&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#fff; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;(6) Arrow ✓&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;vs Travelers&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#fff; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;(7) Echo of a Rose&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;vs Yeti — voting&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#fff; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;(8) Indian River ✓&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;vs Creatures&lt;/div&gt;&lt;/div&gt;
      &lt;/div&gt;

      &lt;!-- Quarters --&gt;
      &lt;div style=&quot;display:flex; flex-direction:column; gap:6px; justify-content:space-around;&quot;&gt;
        &lt;div style=&quot;font-size:8px; font-weight:800; letter-spacing:0.06em; color:#9CA3AF; text-transform:uppercase;&quot;&gt;Quarters&lt;/div&gt;
        &lt;div style=&quot;background:#EDE9FE; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;Madhuvan&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;vs winner&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#EDE9FE; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;Atlas Dogs&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;vs Into the Myst&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#EDE9FE; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;TBD&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;vs Arrow&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#EDE9FE; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;TBD&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;vs Indian River&lt;/div&gt;&lt;/div&gt;
      &lt;/div&gt;

      &lt;!-- Semis --&gt;
      &lt;div style=&quot;display:flex; flex-direction:column; gap:6px; justify-content:space-around;&quot;&gt;
        &lt;div style=&quot;font-size:8px; font-weight:800; letter-spacing:0.06em; color:#9CA3AF; text-transform:uppercase;&quot;&gt;Semis&lt;/div&gt;
        &lt;div style=&quot;background:#FCE7F3; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;TBD&lt;/div&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;vs TBD&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#FCE7F3; padding:6px 8px; border-radius:6px; font-size:10px;&quot;&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;TBD&lt;/div&gt;&lt;div style=&quot;font-weight:700;&quot;&gt;vs TBD&lt;/div&gt;&lt;/div&gt;
      &lt;/div&gt;

      &lt;!-- Final --&gt;
      &lt;div style=&quot;display:flex; flex-direction:column; gap:6px; justify-content:center;&quot;&gt;
        &lt;div style=&quot;font-size:8px; font-weight:800; letter-spacing:0.06em; color:#9CA3AF; text-transform:uppercase;&quot;&gt;Final&lt;/div&gt;
        &lt;div style=&quot;background:#FEF3C7; padding:10px 8px; border-radius:6px; font-size:10px; text-align:center;&quot;&gt;&lt;div style=&quot;font-size:18px; margin-bottom:2px;&quot;&gt;🏆&lt;/div&gt;&lt;div style=&quot;color:#6B7280;&quot;&gt;Jam of the Tour&lt;/div&gt;&lt;div style=&quot;font-weight:700; font-size:11px; margin-top:2px;&quot;&gt;?&lt;/div&gt;&lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The community is voting now.&lt;/p&gt;

&lt;p&gt;The reason the bracket matters more than it might first appear is that the gap between tours is where most music apps die. The tour ends, the chat empties out, the lock-screen Live Activity goes dark, and people drift back to their normal feeds until the next run is announced. The jam tournament is a deliberate counter to that. Voting runs across multiple weeks, the rounds release on a schedule, and every matchup card pulls a real audio clip from a real show people went to, so listening to the bracket is also re-listening to the tour. The conversation in the chat doesn’t end when the lights come up in Irving. It keeps going through Round of 16, Quarters, Semis, Final, and by the time we crown a Jam of the Tour, the next run is already on the calendar and the muscle memory of opening the app every day is intact. The bracket is the bridge.&lt;/p&gt;

&lt;p&gt;The “all bands” UX got a final shape today too. The band dropdown collapsed into a BANDS card on Tour Stats, the All Bands directory got tuned-in count parity with band pages, Dead &amp;amp; Company switched to a “📜 Setlist archive” tour status because they’re done touring, and Billy Strings got a full historical import and a live setlist source. We also shipped a bunch of polish on Tour Completion, soundcheck-row dedup, encore-vs-Set-3 labeling, miracle ticket cards, and the rest of the long tail.&lt;/p&gt;

&lt;p&gt;The honest version of “today” is that we picked Sunday for the avalanche because the tour had just wrapped, the chat was still active, and any new bug would surface immediately. It worked. We shipped a thing, watched the chat react, and either fixed it or moved on inside an hour. Cycle repeated forty-ish times.&lt;/p&gt;

&lt;h2 id=&quot;so-many-bands&quot;&gt;So Many Bands&lt;/h2&gt;

&lt;p&gt;This was the tour where Zabriskie stopped being a Goose-and-Phish app.&lt;/p&gt;

&lt;p&gt;We added: &lt;strong&gt;Trey Anastasio Band 🎺, Mike Gordon Band 🌵, Umphrey’s McGee 🧢, Dead &amp;amp; Company 🌹, Billy Strings 🪕, Max Creek 🐥, King Gizzard &amp;amp; the Lizard Wizard 🧙, Radiohead 🐻, My Morning Jacket, Spafford, Dogs in a Pile, and Daniel Donato’s Cosmic Country.&lt;/strong&gt; Phish got rebranded to ⭕ and JRAD to ⚡ along the way.&lt;/p&gt;

&lt;p&gt;For each band we did the full thing: found a source for the historical setlists and backfilled them, wired up the band page with My Recent and Recent Shows with inline-expand setlists, and added it to the All Bands directory so people could actually find it. Dead &amp;amp; Company got the archival treatment because the band is done touring, and “Tune in” doesn’t make sense for a band you can’t tune into.&lt;/p&gt;

&lt;p&gt;We also built a “Request a band” escape hatch in onboarding for everyone whose band still isn’t here.&lt;/p&gt;

&lt;h2 id=&quot;onboarding-stopped-being-a-wall&quot;&gt;Onboarding Stopped Being a Wall&lt;/h2&gt;

&lt;p&gt;Onboarding got rebuilt around a simple idea: get to value in the first session. New users now see a first-run, show-aware prompt on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/start&lt;/code&gt; that adapts to what’s happening that night. There’s a “Log past shows” step with a dedicated band picker, a tune-in tap with an explainer for what tuning in actually does, and a value-prop card that explains what the app is in one screen:&lt;/p&gt;

&lt;div style=&quot;background:#e5e2d9; padding:18px; border-radius:14px; margin:16px auto; max-width:520px; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif; color:#262626;&quot;&gt;
  &lt;div style=&quot;background:#F5F2EB; border-radius:18px; box-shadow:0 6px 20px rgba(0,0,0,0.08); overflow:hidden; max-width:380px; margin:0 auto;&quot;&gt;
    &lt;div style=&quot;background:linear-gradient(135deg,#8B5CF6 0%,#7c3aed 100%); color:#fff; padding:16px 18px;&quot;&gt;
      &lt;div style=&quot;font-size:11px; font-weight:700; letter-spacing:0.06em; text-transform:uppercase; opacity:0.85;&quot;&gt;Step 2 of 4&lt;/div&gt;
      &lt;div style=&quot;font-size:20px; font-weight:800; margin-top:6px; line-height:1.15;&quot;&gt;Which bands do you follow?&lt;/div&gt;
      &lt;div style=&quot;font-size:12px; opacity:0.85; margin-top:4px;&quot;&gt;Tune in to get setlists, live chat, and recap on every show.&lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:14px;&quot;&gt;
      &lt;div style=&quot;display:flex; flex-direction:column; gap:8px;&quot;&gt;
        &lt;div style=&quot;display:flex; align-items:center; gap:12px; padding:10px 12px; background:#fff; border-radius:12px; border:2px solid #8B5CF6;&quot;&gt;
          &lt;div style=&quot;width:34px; height:34px; border-radius:10px; background:linear-gradient(135deg,#a855f7,#7c3aed); display:flex; align-items:center; justify-content:center; font-size:18px;&quot;&gt;🪿&lt;/div&gt;
          &lt;div style=&quot;flex:1;&quot;&gt;&lt;div style=&quot;font-size:14px; font-weight:700;&quot;&gt;Goose&lt;/div&gt;&lt;div style=&quot;font-size:11px; color:#6B7280;&quot;&gt;14 shows this tour · 12 chomping tonight&lt;/div&gt;&lt;/div&gt;
          &lt;span style=&quot;font-size:11px; font-weight:800; padding:5px 10px; border-radius:10px; background:#8B5CF6; color:#fff;&quot;&gt;✓ Tuned in&lt;/span&gt;
        &lt;/div&gt;
        &lt;div style=&quot;display:flex; align-items:center; gap:12px; padding:10px 12px; background:#fff; border-radius:12px;&quot;&gt;
          &lt;div style=&quot;width:34px; height:34px; border-radius:10px; background:linear-gradient(135deg,#fb7185,#e11d48); display:flex; align-items:center; justify-content:center; font-size:18px;&quot;&gt;⭕&lt;/div&gt;
          &lt;div style=&quot;flex:1;&quot;&gt;&lt;div style=&quot;font-size:14px; font-weight:700;&quot;&gt;Phish&lt;/div&gt;&lt;div style=&quot;font-size:11px; color:#6B7280;&quot;&gt;At Sphere tonight · couch tour live&lt;/div&gt;&lt;/div&gt;
          &lt;span style=&quot;font-size:11px; font-weight:700; padding:5px 10px; border-radius:10px; background:#EDE9FE; color:#8B5CF6;&quot;&gt;+ Tune in&lt;/span&gt;
        &lt;/div&gt;
        &lt;div style=&quot;display:flex; align-items:center; gap:12px; padding:10px 12px; background:#fff; border-radius:12px;&quot;&gt;
          &lt;div style=&quot;width:34px; height:34px; border-radius:10px; background:linear-gradient(135deg,#fde68a,#f59e0b); display:flex; align-items:center; justify-content:center; font-size:18px;&quot;&gt;🌹&lt;/div&gt;
          &lt;div style=&quot;flex:1;&quot;&gt;&lt;div style=&quot;font-size:14px; font-weight:700;&quot;&gt;Dead &amp;amp; Company&lt;/div&gt;&lt;div style=&quot;font-size:11px; color:#6B7280;&quot;&gt;📜 Setlist archive · no upcoming shows&lt;/div&gt;&lt;/div&gt;
          &lt;span style=&quot;font-size:11px; font-weight:700; padding:5px 10px; border-radius:10px; background:#EDE9FE; color:#8B5CF6;&quot;&gt;+ Tune in&lt;/span&gt;
        &lt;/div&gt;
        &lt;div style=&quot;display:flex; align-items:center; gap:12px; padding:10px 12px; background:#fff; border-radius:12px;&quot;&gt;
          &lt;div style=&quot;width:34px; height:34px; border-radius:10px; background:linear-gradient(135deg,#a7f3d0,#10b981); display:flex; align-items:center; justify-content:center; font-size:18px;&quot;&gt;🪕&lt;/div&gt;
          &lt;div style=&quot;flex:1;&quot;&gt;&lt;div style=&quot;font-size:14px; font-weight:700;&quot;&gt;Billy Strings&lt;/div&gt;&lt;div style=&quot;font-size:11px; color:#6B7280;&quot;&gt;On tour · next show 4/29&lt;/div&gt;&lt;/div&gt;
          &lt;span style=&quot;font-size:11px; font-weight:700; padding:5px 10px; border-radius:10px; background:#EDE9FE; color:#8B5CF6;&quot;&gt;+ Tune in&lt;/span&gt;
        &lt;/div&gt;
      &lt;/div&gt;

      &lt;div style=&quot;margin-top:14px; padding:10px 12px; background:rgba(139,92,246,0.08); border-radius:10px; font-size:11px; color:#5b21b6; line-height:1.4;&quot;&gt;
        &lt;strong&gt;Don&apos;t see your band?&lt;/strong&gt; Tap below to request one and we&apos;ll get the full show history wired up.
      &lt;/div&gt;
      &lt;div style=&quot;margin-top:8px; font-size:11px; font-weight:700; color:#8B5CF6; text-align:center; padding:6px;&quot;&gt;Request a band →&lt;/div&gt;

      &lt;div style=&quot;margin-top:8px; display:flex; gap:8px;&quot;&gt;
        &lt;div style=&quot;flex:1; padding:12px; background:#fff; border:1px solid #d1d5db; border-radius:12px; text-align:center; font-size:13px; font-weight:700; color:#6B7280;&quot;&gt;Skip&lt;/div&gt;
        &lt;div style=&quot;flex:2; padding:12px; background:#8B5CF6; border-radius:12px; text-align:center; font-size:13px; font-weight:700; color:#fff;&quot;&gt;Continue →&lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;We also widened the gate for who can sign up, because the Spring tour brought a lot of new people in.&lt;/p&gt;

&lt;p&gt;A big part of why onboarding actually got better instead of just getting a redesign is that Patrick was on the ground at every Goose show, putting the app in front of people he met at the bar, on the lot, in the seats next to him. He’d watch a brand-new user open the app cold, see exactly where they got stuck or confused, take notes on the friction in real time, and then file the fixes from the passenger seat on the drive to the next city. The “Log past shows” step, the explainer on what tuning in actually does, the value-prop card, the “Request a band” escape hatch: all of those came out of that loop. Not speculative redesigns. Each one came directly from watching a real person fail at the previous version, with the new version landing before the next show.&lt;/p&gt;

&lt;h2 id=&quot;goose-mode&quot;&gt;Goose Mode&lt;/h2&gt;

&lt;p&gt;We started building dedicated per-band “modes” this tour, basically a tour companion dashboard tailored to one band at a time. Both Goose Mode and Phish Mode shipped during the run. Each one knows its band’s calendar, color palette, and ritual vocabulary, and surfaces what matters for that specific community.&lt;/p&gt;

&lt;p&gt;The centerpiece is the Tour Timeline, with a live countdown to the next show. Past shows you went to are checked off. Upcoming shows list the date, the weather, and which of your friends are going. The whole timeline is annotated with your crew dripping in and out of the run, so you can see at a glance who joined for which leg, who left after the southeast swing, and who flew in for the closer. The countdown ticks every second:&lt;/p&gt;

&lt;div style=&quot;background:#e5e2d9; padding:18px; border-radius:14px; margin:16px auto; max-width:520px; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif; color:#262626;&quot;&gt;
  &lt;div style=&quot;background:#F5F2EB; border-radius:18px; box-shadow:0 6px 20px rgba(0,0,0,0.08); overflow:hidden; max-width:380px; margin:0 auto;&quot;&gt;
    &lt;div style=&quot;background:linear-gradient(135deg,#a855f7 0%,#7c3aed 100%); color:#fff; padding:14px 16px;&quot;&gt;
      &lt;div style=&quot;display:flex; align-items:center; gap:10px;&quot;&gt;
        &lt;div style=&quot;width:36px; height:36px; border-radius:10px; background:rgba(255,255,255,0.2); display:flex; align-items:center; justify-content:center; font-size:20px;&quot;&gt;🪿&lt;/div&gt;
        &lt;div style=&quot;flex:1;&quot;&gt;&lt;div style=&quot;font-size:11px; font-weight:700; letter-spacing:0.06em; text-transform:uppercase; opacity:0.85;&quot;&gt;Goose Mode&lt;/div&gt;&lt;div style=&quot;font-size:16px; font-weight:800;&quot;&gt;Spring &apos;26 · Texas Run&lt;/div&gt;&lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:14px 14px 6px; text-align:center; background:#fff;&quot;&gt;
      &lt;div style=&quot;font-size:9px; font-weight:800; letter-spacing:0.08em; color:#9CA3AF; text-transform:uppercase;&quot;&gt;Tonight · doors in&lt;/div&gt;
      &lt;div style=&quot;display:flex; justify-content:center; gap:8px; margin-top:6px;&quot;&gt;
        &lt;div style=&quot;background:#F5F2EB; padding:6px 10px; border-radius:8px; min-width:46px;&quot;&gt;&lt;div style=&quot;font-size:22px; font-weight:800; color:#7c3aed;&quot;&gt;01&lt;/div&gt;&lt;div style=&quot;font-size:9px; color:#6B7280; font-weight:700; letter-spacing:0.05em;&quot;&gt;HRS&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#F5F2EB; padding:6px 10px; border-radius:8px; min-width:46px;&quot;&gt;&lt;div style=&quot;font-size:22px; font-weight:800; color:#7c3aed;&quot;&gt;23&lt;/div&gt;&lt;div style=&quot;font-size:9px; color:#6B7280; font-weight:700; letter-spacing:0.05em;&quot;&gt;MIN&lt;/div&gt;&lt;/div&gt;
        &lt;div style=&quot;background:#F5F2EB; padding:6px 10px; border-radius:8px; min-width:46px;&quot;&gt;&lt;div style=&quot;font-size:22px; font-weight:800; color:#7c3aed;&quot;&gt;04&lt;/div&gt;&lt;div style=&quot;font-size:9px; color:#6B7280; font-weight:700; letter-spacing:0.05em;&quot;&gt;SEC&lt;/div&gt;&lt;/div&gt;
      &lt;/div&gt;
      &lt;div style=&quot;font-size:11px; color:#6B7280; margin-top:8px;&quot;&gt;Bayou Music Center · Houston · 5 in your crew going&lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:14px 14px 4px; font-size:10px; font-weight:800; letter-spacing:0.06em; color:#6B7280; text-transform:uppercase;&quot;&gt;Tour Timeline · Crew&lt;/div&gt;

    &lt;div style=&quot;padding:0 14px 14px; display:flex; flex-direction:column; gap:6px;&quot;&gt;

      &lt;div style=&quot;display:flex; gap:10px; align-items:flex-start; padding:10px 12px; background:#fff; border-radius:10px; opacity:0.55;&quot;&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; align-items:center; min-width:36px; padding-top:2px;&quot;&gt;&lt;span style=&quot;font-size:16px;&quot;&gt;✓&lt;/span&gt;&lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;4/19&lt;/span&gt;&lt;/div&gt;
        &lt;div style=&quot;flex:1;&quot;&gt;
          &lt;div style=&quot;font-size:13px; font-weight:600;&quot;&gt;St. Augustine Amphitheatre&lt;/div&gt;
          &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;St. Augustine · couch toured&lt;/div&gt;
          &lt;div style=&quot;display:flex; align-items:center; gap:6px; margin-top:6px;&quot;&gt;
            &lt;div style=&quot;display:flex;&quot;&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;P&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#a7f3d0,#10b981); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700;&quot;&gt;B&lt;/div&gt;
            &lt;/div&gt;
            &lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;2 in crew · Patrick joined the run here&lt;/span&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;

      &lt;div style=&quot;display:flex; gap:10px; align-items:flex-start; padding:10px 12px; background:#fff; border-radius:10px; opacity:0.55;&quot;&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; align-items:center; min-width:36px; padding-top:2px;&quot;&gt;&lt;span style=&quot;font-size:16px;&quot;&gt;✓&lt;/span&gt;&lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;4/21&lt;/span&gt;&lt;/div&gt;
        &lt;div style=&quot;flex:1;&quot;&gt;
          &lt;div style=&quot;font-size:13px; font-weight:600;&quot;&gt;Saenger Theatre&lt;/div&gt;
          &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;New Orleans · couch toured&lt;/div&gt;
          &lt;div style=&quot;display:flex; align-items:center; gap:6px; margin-top:6px;&quot;&gt;
            &lt;div style=&quot;display:flex;&quot;&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;P&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#b0eaff,#3ba8e0); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;C&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#a7f3d0,#10b981); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;B&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#ffc8d8,#ec4899); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700;&quot;&gt;M&lt;/div&gt;
            &lt;/div&gt;
            &lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;4 in crew · Mwat &amp;amp; C joined&lt;/span&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;

      &lt;div style=&quot;display:flex; gap:10px; align-items:flex-start; padding:10px 12px; background:#fff; border-radius:10px; opacity:0.55;&quot;&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; align-items:center; min-width:36px; padding-top:2px;&quot;&gt;&lt;span style=&quot;font-size:16px;&quot;&gt;✓&lt;/span&gt;&lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;4/22&lt;/span&gt;&lt;/div&gt;
        &lt;div style=&quot;flex:1;&quot;&gt;
          &lt;div style=&quot;font-size:13px; font-weight:600;&quot;&gt;Saenger Theatre&lt;/div&gt;
          &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;New Orleans · couch toured · Tumble jam ranked #2&lt;/div&gt;
          &lt;div style=&quot;display:flex; align-items:center; gap:6px; margin-top:6px;&quot;&gt;
            &lt;div style=&quot;display:flex;&quot;&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;P&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#b0eaff,#3ba8e0); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;C&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#a7f3d0,#10b981); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;B&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#ffc8d8,#ec4899); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;M&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#fde68a,#f59e0b); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700;&quot;&gt;G&lt;/div&gt;
            &lt;/div&gt;
            &lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;5 in crew · Gmart joined for Texas leg&lt;/span&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;

      &lt;div style=&quot;display:flex; gap:10px; align-items:flex-start; padding:12px 12px; background:#fff; border-radius:10px; border:2px solid #7c3aed;&quot;&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; align-items:center; min-width:36px; padding-top:2px;&quot;&gt;&lt;span style=&quot;font-size:18px;&quot;&gt;●&lt;/span&gt;&lt;span style=&quot;font-size:9px; color:#7c3aed; font-weight:800;&quot;&gt;4/23&lt;/span&gt;&lt;/div&gt;
        &lt;div style=&quot;flex:1;&quot;&gt;
          &lt;div style=&quot;font-size:13px; font-weight:700;&quot;&gt;Bayou Music Center · &lt;span style=&quot;color:#7c3aed;&quot;&gt;tonight&lt;/span&gt;&lt;/div&gt;
          &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;Houston · 79°F clear · you&apos;re couch touring&lt;/div&gt;
          &lt;div style=&quot;display:flex; align-items:center; gap:6px; margin-top:6px;&quot;&gt;
            &lt;div style=&quot;display:flex;&quot;&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;P&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#a7f3d0,#10b981); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;B&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#fde68a,#f59e0b); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;G&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#d9c6ff,#a88fe6); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700;&quot;&gt;L&lt;/div&gt;
            &lt;/div&gt;
            &lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;4 in crew · C left after NOLA · Lubby joined&lt;/span&gt;
          &lt;/div&gt;
        &lt;/div&gt;
        &lt;span style=&quot;font-size:11px; font-weight:800; padding:5px 10px; border-radius:10px; background:#7c3aed; color:#fff;&quot;&gt;RSVP&apos;d&lt;/span&gt;
      &lt;/div&gt;

      &lt;div style=&quot;display:flex; gap:10px; align-items:flex-start; padding:10px 12px; background:#fff; border-radius:10px;&quot;&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; align-items:center; min-width:36px; padding-top:2px;&quot;&gt;&lt;span style=&quot;font-size:14px; color:#9CA3AF;&quot;&gt;○&lt;/span&gt;&lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;4/24&lt;/span&gt;&lt;/div&gt;
        &lt;div style=&quot;flex:1;&quot;&gt;
          &lt;div style=&quot;font-size:13px; font-weight:600;&quot;&gt;Moody Center&lt;/div&gt;
          &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;Austin · in 1 day&lt;/div&gt;
          &lt;div style=&quot;display:flex; align-items:center; gap:6px; margin-top:6px;&quot;&gt;
            &lt;div style=&quot;display:flex;&quot;&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;P&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#fde68a,#f59e0b); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;G&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#d9c6ff,#a88fe6); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700;&quot;&gt;L&lt;/div&gt;
            &lt;/div&gt;
            &lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;3 in crew going · B sitting it out&lt;/span&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;

      &lt;div style=&quot;display:flex; gap:10px; align-items:flex-start; padding:10px 12px; background:#fff; border-radius:10px;&quot;&gt;
        &lt;div style=&quot;display:flex; flex-direction:column; align-items:center; min-width:36px; padding-top:2px;&quot;&gt;&lt;span style=&quot;font-size:14px; color:#9CA3AF;&quot;&gt;○&lt;/span&gt;&lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;4/25&lt;/span&gt;&lt;/div&gt;
        &lt;div style=&quot;flex:1;&quot;&gt;
          &lt;div style=&quot;font-size:13px; font-weight:600;&quot;&gt;Toyota Music Factory&lt;/div&gt;
          &lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;Irving · in 2 days · last night of Spring run&lt;/div&gt;
          &lt;div style=&quot;display:flex; align-items:center; gap:6px; margin-top:6px;&quot;&gt;
            &lt;div style=&quot;display:flex;&quot;&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;P&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#a7f3d0,#10b981); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;B&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#fde68a,#f59e0b); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;G&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#d9c6ff,#a88fe6); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700; margin-right:-5px;&quot;&gt;L&lt;/div&gt;
              &lt;div style=&quot;width:18px; height:18px; border-radius:50%; background:linear-gradient(135deg,#ffc8d8,#ec4899); border:2px solid #fff; font-size:8px; color:#fff; display:flex; align-items:center; justify-content:center; font-weight:700;&quot;&gt;Q&lt;/div&gt;
            &lt;/div&gt;
            &lt;span style=&quot;font-size:9px; color:#6B7280;&quot;&gt;5 in crew · Q flew in for the closer&lt;/span&gt;
          &lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;miracle-tickets&quot;&gt;Miracle Tickets&lt;/h2&gt;

&lt;p&gt;A long-standing jam band tradition: someone has an extra ticket and gives it away, no questions asked, sometimes for nothing, sometimes for a smile. We built that into the app as a first-class feature this tour. If you have an extra, you post it as a Miracle Ticket attached to the show. People can request it. The owner draws a winner based on engagement (so the lurker who never participates doesn’t beat the regular who’s been in the chat all run). Any admin can also award the ticket on the owner’s behalf if the owner isn’t online. Cards are collapsible and grouped by show, so on a busy night the feed doesn’t drown:&lt;/p&gt;

&lt;div style=&quot;background:#e5e2d9; padding:18px; border-radius:14px; margin:16px auto; max-width:520px; font-family:-apple-system,BlinkMacSystemFont,system-ui,sans-serif; color:#262626;&quot;&gt;
  &lt;div style=&quot;background:#F5F2EB; border-radius:18px; box-shadow:0 6px 20px rgba(0,0,0,0.08); overflow:hidden; max-width:380px; margin:0 auto;&quot;&gt;
    &lt;div style=&quot;background:#262626; color:#fff; padding:10px 16px; font-size:11px; letter-spacing:0.04em; text-transform:uppercase; display:flex; justify-content:space-between; align-items:center;&quot;&gt;
      &lt;span&gt;☝️ 🎫 Miracle Tickets&lt;/span&gt;
      &lt;span style=&quot;opacity:0.65; font-weight:400; text-transform:none; letter-spacing:0;&quot;&gt;2 open&lt;/span&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:12px 14px;&quot;&gt;
      &lt;div style=&quot;background:#fff; border-radius:14px; padding:14px 14px 12px; box-shadow:0 1px 4px rgba(0,0,0,0.04);&quot;&gt;
        &lt;div style=&quot;display:flex; gap:10px; align-items:center; margin-bottom:10px;&quot;&gt;
          &lt;div style=&quot;width:38px; height:38px; border-radius:10px; background:linear-gradient(135deg,#a855f7,#7c3aed); display:flex; align-items:center; justify-content:center; font-size:18px;&quot;&gt;🪿&lt;/div&gt;
          &lt;div style=&quot;flex:1;&quot;&gt;&lt;div style=&quot;font-size:13px; font-weight:700;&quot;&gt;Goose · 4/22&lt;/div&gt;&lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;Saenger Theatre · New Orleans&lt;/div&gt;&lt;/div&gt;
          &lt;span style=&quot;font-size:10px; padding:3px 8px; border-radius:8px; background:#FEF3C7; color:#92400E; font-weight:700;&quot;&gt;OPEN&lt;/span&gt;
        &lt;/div&gt;
        &lt;div style=&quot;font-size:13px; line-height:1.4; color:#262626; padding:10px 12px; background:#F5F2EB; border-radius:10px;&quot;&gt;&quot;Have an extra GA, can meet at the box office at 7. Just want it to go to someone who&apos;ll love it. 🌹&quot;&lt;/div&gt;
        &lt;div style=&quot;display:flex; align-items:center; gap:8px; margin-top:10px;&quot;&gt;
          &lt;div style=&quot;width:24px; height:24px; border-radius:50%; background:linear-gradient(135deg,#ffd6b0,#f59e47); display:flex; align-items:center; justify-content:center; font-size:11px; font-weight:700; color:#fff;&quot;&gt;P&lt;/div&gt;
          &lt;span style=&quot;font-size:11px; color:#6B7280;&quot;&gt;posted by patrick · 2h ago · 7 entries&lt;/span&gt;
        &lt;/div&gt;
        &lt;div style=&quot;display:flex; gap:6px; margin-top:10px;&quot;&gt;
          &lt;div style=&quot;flex:1; padding:8px; background:#fff; border:1px solid #d1d5db; border-radius:10px; text-align:center; font-size:12px; font-weight:700; color:#6B7280;&quot;&gt;View entries&lt;/div&gt;
          &lt;div style=&quot;flex:1; padding:8px; background:#EC4899; border-radius:10px; text-align:center; font-size:12px; font-weight:700; color:#fff;&quot;&gt;🙏 Enter to win&lt;/div&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;

    &lt;div style=&quot;padding:0 14px 14px;&quot;&gt;
      &lt;div style=&quot;background:#fff; border-radius:14px; padding:14px 14px 12px; box-shadow:0 1px 4px rgba(0,0,0,0.04); opacity:0.7;&quot;&gt;
        &lt;div style=&quot;display:flex; gap:10px; align-items:center; margin-bottom:8px;&quot;&gt;
          &lt;div style=&quot;width:38px; height:38px; border-radius:10px; background:linear-gradient(135deg,#fb7185,#e11d48); display:flex; align-items:center; justify-content:center; font-size:18px;&quot;&gt;⭕&lt;/div&gt;
          &lt;div style=&quot;flex:1;&quot;&gt;&lt;div style=&quot;font-size:13px; font-weight:700;&quot;&gt;Phish · 4/23 · Sphere&lt;/div&gt;&lt;div style=&quot;font-size:10px; color:#6B7280;&quot;&gt;Las Vegas · awarded&lt;/div&gt;&lt;/div&gt;
          &lt;span style=&quot;font-size:10px; padding:3px 8px; border-radius:8px; background:#D1FAE5; color:#065F46; font-weight:700;&quot;&gt;CLOSED&lt;/span&gt;
        &lt;/div&gt;
        &lt;div style=&quot;display:flex; align-items:center; gap:8px; font-size:11px; color:#6B7280;&quot;&gt;
          &lt;span style=&quot;font-size:14px;&quot;&gt;🎉&lt;/span&gt;
          &lt;span&gt;went to &lt;strong style=&quot;color:#262626;&quot;&gt;@chomper4&lt;/strong&gt; · drawn by engagement&lt;/span&gt;
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h2 id=&quot;everything-else-worth-mentioning&quot;&gt;Everything Else Worth Mentioning&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Seat sharing.&lt;/strong&gt; A discovery screen, show-card badges, avatar-x to unshare, per-show seat sharing with @mentions in the crew chat, and seat-share notifications that show the seat inline instead of dumping into live chat.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Live Recording posts&lt;/strong&gt; with a smart provider cascade and an autocomplete dropdown for the artist field.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Show notes&lt;/strong&gt; on every expanded setlist, not just Chomp. Plus a Show Notes editor on manage-setlist for admins.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;The Live tab.&lt;/strong&gt; Promoted out of the compass into its own dedicated nav slot, with an always-open compass replacing it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;what-the-chat-said-at-the-end&quot;&gt;What the Chat Said at the End&lt;/h2&gt;

&lt;p&gt;There was a moment in Irving last night, near the encore, that I want to keep. I’m pulling these straight from the chat, anonymized:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;(Patrick)&lt;/em&gt; Man. end of tour is always so bittersweet. Great run of shows though. Damn, Goose, ya got me good&lt;/p&gt;

  &lt;p&gt;&lt;em&gt;(a chomper)&lt;/em&gt; this app is the bees. I am very grateful for it.&lt;/p&gt;

  &lt;p&gt;&lt;em&gt;(another chomper)&lt;/em&gt; Thank you for the invite. It’s a super cool concept and appreciate all your efforts!&lt;/p&gt;

  &lt;p&gt;&lt;em&gt;(another)&lt;/em&gt; Awesome show, thanks for the hangs. We will hopefully see some of you in Toronto!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the whole point. That is the third place. People who started the tour as usernames in a beta group, sending each other Eminence-chase updates and bustout calls and gummy timing notes, ending the tour planning to meet in Toronto and trading European tour stops. One of them listed his run: Brixton, Brussels, Amsterdam, Paris. Patrick offered to add him to a crew. Real people, real plans, made through an app we started building eight weeks ago.&lt;/p&gt;

&lt;h2 id=&quot;what-this-tour-taught-me&quot;&gt;What This Tour Taught Me&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;The app holds up under live load now.&lt;/strong&gt; Since I &lt;a href=&quot;/ai/zabriskie/development/2026/03/29/the-show-is-happening-right-now-and-nothing-works.html&quot;&gt;wrote about the iOS 26 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.then()&lt;/code&gt; proxy disaster&lt;/a&gt;, Live Activities have stayed up through every single show. Nobody filed a “the lock screen is dead” bug for the rest of the tour. That’s the most boring victory of the month, and it’s the one I’m proudest of.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Two shows a night is a cheat code.&lt;/strong&gt; Goose on Eastern from the hotel couch, Phish at Sphere on Pacific from my seat. Patrick on the ground at Goose, me at Phish, both of us in the chomp on whichever show wasn’t ours. Anything that broke at 9pm Eastern got fixed before the next song started at Phish. You cannot manufacture that kind of feedback loop on purpose. We got lucky with the calendar and we used it.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;The community runs faster than the app.&lt;/strong&gt; Song calls is the clearest example I can point to: on 4/19 a chomper said in chat, “Would be cool if you could program it to where when one of us guesses the song in shows everyone fun Lil game. Idk how hard that would be to do tho hahs.” Five days later that feature shipped end to end. The whole tour was full of small versions of this loop, where someone says “we need X” in the chat and a few days later a version of X is in the app. Building this with the people who use it is dramatically faster than building it from a spec.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Three hundred PRs in a month is not normal and I should not pretend it is.&lt;/strong&gt; Most of the code was written by Claude. A lot of it had to be rewritten by Claude after I caught it doing something wrong. Every sharp edge from the tour got logged into the agent reliability dataset I’ve been building, and I’ll keep writing about it separately. The point of building in public with an AI assistant is that the failures are part of the dataset, not embarrassments to hide.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Goose Cabo is in a few weeks. Goose Summer is right after. I have a list. The app is not done.&lt;/p&gt;

&lt;p&gt;In the meantime, if you were on tour, thanks for chomping. If you were at home, thanks for being in the chat. If you’ve never tried the app and the third place I keep talking about sounds like something you’d want, come find us. The next show is already on the calendar.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;One housekeeping note: the screens in this post are mockups, not actual app screenshots, drawn to make each feature legible in context. The real app looks slightly different on iOS vs. Android (Live Activities vs. ongoing notifications, system fonts, badge styling, the typing indicator’s exact pulse), and the mockups smooth those over for readability. The features themselves all shipped.&lt;/em&gt;&lt;/p&gt;
</description>
				<pubDate>Sun, 26 Apr 2026 19:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/zabriskie/development/2026/04/26/spring-tour-recap.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/zabriskie/development/2026/04/26/spring-tour-recap.html</guid>
			</item>
		
			<item>
				<title>Getting Up to Speed on Multi-Agent Systems, Part 3: Wave 1 (Can Agents Coordinate At All?)</title>
				<description>&lt;p&gt;Wave 1 is the cluster of papers from 2023 that people actually cite. When someone says “I read the multi-agent papers,” they usually mean these. In this post I’m going to walk through the canonical five, explain what each one actually builds, and show where they agree and where they quietly disagree with each other.&lt;/p&gt;

&lt;div class=&quot;mas-series-nav&quot;&gt;
  &lt;div class=&quot;mas-series-label&quot;&gt;Getting Up to Speed on MAS&lt;/div&gt;
  &lt;ol&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/24/mas-series-01-the-landscape.html&quot;&gt;Part 1. The Landscape&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/25/mas-series-02-the-vocabulary.html&quot;&gt;Part 2. The Vocabulary&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;mas-current&quot;&gt;&lt;strong&gt;Part 3. Wave 1: Can Agents Coordinate At All? (you are here)&lt;/strong&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/27/mas-series-04-wave-two.html&quot;&gt;Part 4. Wave 2: Why It Breaks&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/28/mas-series-05-debate-state-coordination.html&quot;&gt;Part 5. Debate, State, and Coordination&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/29/mas-series-06-verification-patterns.html&quot;&gt;Part 6. Verification Patterns&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/30/mas-series-07-benchmarks.html&quot;&gt;Part 7. Benchmarks and What They Miss&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/05/01/mas-series-08-open-questions.html&quot;&gt;Part 8. Open Questions&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;
&lt;/div&gt;

&lt;h2 id=&quot;camel-two-agents-role-playing&quot;&gt;CAMEL: Two Agents Role-Playing&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card mas-camel&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;CAMEL: Communicative Agents for &quot;Mind&quot; Exploration&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2303.17760&quot;&gt;arXiv 2303.17760&lt;/a&gt; · NeurIPS 2023&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;Two LLMs role-play until the task is done.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Core bet: Prompt constraints keep agents on task&lt;/div&gt;
  &lt;div&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;2 agents&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;role-play&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;inception prompting&lt;/span&gt;
  &lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;AI User (instructor) and AI Assistant in a structured dialogue loop&lt;/li&gt;
    &lt;li&gt;Inception prompting: symmetric system prompts with explicit constraints like &quot;Never flip roles&quot;&lt;/li&gt;
    &lt;li&gt;Task specifier agent elaborates vague human input into concrete tasks&lt;/li&gt;
    &lt;li&gt;Documented failure modes: role flipping, instruction repetition, vague responses, conversational loops&lt;/li&gt;
    &lt;li&gt;All mitigations are prompt-level; there is no structural enforcement&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;CAMEL is the simplest of the wave-1 papers and also the most honest. It’s two LLMs. One plays a user, one plays an assistant. They talk. The paper’s main contribution is inception prompting, which is a way of writing system prompts that keep agents from breaking character. The failure modes the paper documents are the failure modes you’d expect: agents flip roles, agents repeat themselves, agents give vague answers.&lt;/p&gt;

&lt;p&gt;What’s missing from CAMEL is any structural enforcement. If the agent flips roles, nothing stops it except a prompt instruction that says “don’t flip roles.” There’s no protocol-level guarantee. This is a pattern you’ll see repeated across wave-1: trust the prompt, hope for the best.&lt;/p&gt;

&lt;h2 id=&quot;generative-agents-memory-reflection-and-planning&quot;&gt;Generative Agents: Memory, Reflection, and Planning&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card mas-genagents&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;Generative Agents: Interactive Simulacra of Human Behavior&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2304.03442&quot;&gt;arXiv 2304.03442&lt;/a&gt; · UIST 2023&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;Give agents memory, reflection, and planning so they behave believably over time.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Core bet: Retrieval scoring produces believable behavior&lt;/div&gt;
  &lt;div&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;25 agents&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;memory stream&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;reflection&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;planning&lt;/span&gt;
  &lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;Memory stream: every observation stored with timestamp and importance score (LLM-rated 1 to 10)&lt;/li&gt;
    &lt;li&gt;Retrieval: weighted sum of recency (exponential decay), relevance (cosine similarity), and importance&lt;/li&gt;
    &lt;li&gt;Reflection: triggered when accumulated importance crosses a threshold (roughly 2-3 times per day)&lt;/li&gt;
    &lt;li&gt;Planning: top-down recursive (day, hour, 5-15 minute blocks); replans on unexpected events&lt;/li&gt;
    &lt;li&gt;Emergent behaviors: a Valentine&apos;s Day party self-organized from one suggestion; info diffusion from 4 percent to 32 percent awareness in two game days&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;This paper is the outlier in wave-1, because it isn’t trying to build software. It’s a social simulation. 25 agents living in a Sims-style town. What makes it interesting for multi-agent systems is that it’s the only wave-1 paper that takes memory seriously. Every observation gets stored with a timestamp and an importance score. When an agent needs to act, it retrieves memories using a weighted combination of recency, relevance, and importance. Reflections are higher-level thoughts synthesized from clusters of observations.&lt;/p&gt;
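
&lt;p&gt;To make the retrieval step concrete, here is a minimal Python sketch of a score that combines recency, relevance, and importance. The decay rate, the equal weights, the scaling of importance to the 0 to 1 range, and names such as &lt;code&gt;Memory&lt;/code&gt; and &lt;code&gt;retrieval_score&lt;/code&gt; are my assumptions for illustration, not the paper’s implementation.&lt;/p&gt;

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    created_hour: float      # simulated game hours since start (assumed unit)
    importance: int          # LLM-rated 1 to 10, as described in the paper
    embedding: list          # any fixed-length vector; toy values in this sketch


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieval_score(mem, query_embedding, now_hour,
                    decay=0.995, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of recency, relevance, and importance.

    decay and weights are hypothetical defaults chosen for this sketch.
    """
    recency = decay ** (now_hour - mem.created_hour)   # exponential decay with age
    relevance = cosine(mem.embedding, query_embedding)
    importance = mem.importance / 10.0                 # scale the 1-10 rating down
    w_rec, w_rel, w_imp = weights
    return w_rec * recency + w_rel * relevance + w_imp * importance


def retrieve(memories, query_embedding, now_hour, k=2):
    """Return the k highest-scoring memories for the current query."""
    ranked = sorted(memories,
                    key=lambda m: retrieval_score(m, query_embedding, now_hour),
                    reverse=True)
    return ranked[:k]
```

&lt;p&gt;The design point worth noticing is that nothing here is learned: believable behavior is being bet on three hand-combined signals, which is why the choice of weights matters so much.&lt;/p&gt;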

&lt;p&gt;None of the software engineering papers in wave-1 do anything like this. They don’t need to, because their tasks have clear start and end conditions. But when you look at what production multi-agent systems are starting to need, the Generative Agents architecture has more of the right pieces than MetaGPT does.&lt;/p&gt;

&lt;h2 id=&quot;chatdev-pairwise-chat-as-a-software-pipeline&quot;&gt;ChatDev: Pairwise Chat as a Software Pipeline&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card mas-chatdev&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;ChatDev: Communicative Agents for Software Development&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2307.07924&quot;&gt;arXiv 2307.07924&lt;/a&gt;&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;Chain pairwise dialogues into a software development pipeline.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Core bet: Dialogue convergence equals correct output&lt;/div&gt;
  &lt;div&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;pairwise chat&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;phase pipeline&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;dehallucination&lt;/span&gt;
  &lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;Fixed pipeline: Design, then Coding, then Testing; each phase is an instructor-assistant dialogue&lt;/li&gt;
    &lt;li&gt;Communicative dehallucination: the assistant flips role and asks clarifying questions before committing to an answer&lt;/li&gt;
    &lt;li&gt;Short-term memory (full dialogue within a phase) and long-term memory (extracted solutions across phases)&lt;/li&gt;
    &lt;li&gt;Termination: 10 rounds max, or two consecutive rounds without changes&lt;/li&gt;
    &lt;li&gt;No escalation path when convergence fails; it just stops&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;ChatDev is the paper that got me into this literature in the first place. It’s a pipeline of pairwise dialogues. Pairs of agents talk about design, then pairs of agents talk about coding, then pairs of agents talk about testing. The most interesting mechanism is communicative dehallucination, which is a prompt pattern where the assistant asks clarifying questions before answering. This is the closest any wave-1 paper gets to backpressure.&lt;/p&gt;

&lt;p&gt;The structural problem with ChatDev is that when agents can’t converge in 10 rounds, the system just stops. There’s no fallback. No mechanism for the system to notice that it’s stuck and escalate. No concurrency control on the shared artifacts. I wrote about &lt;a href=&quot;/ai/agents/distributed/zabriskie/2026/03/30/multi-agent-systems-have-a-distributed-systems-problem.html&quot;&gt;how this breaks down in practice&lt;/a&gt; a few weeks ago.&lt;/p&gt;
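
&lt;p&gt;As a rough sketch of the difference between stopping and escalating, the loop below applies the termination rule described above (a round limit, or two consecutive rounds without changes) but returns an explicit outcome instead of silently ending. Everything here is hypothetical: &lt;code&gt;step&lt;/code&gt;, &lt;code&gt;escalate&lt;/code&gt;, and the &lt;code&gt;Outcome&lt;/code&gt; enum are names I made up, and this is not ChatDev’s code.&lt;/p&gt;

```python
from enum import Enum


class Outcome(Enum):
    CONVERGED = "converged"    # two consecutive rounds produced no change
    EXHAUSTED = "exhausted"    # hit the round limit without settling


def run_phase(step, initial, max_rounds=10, stable_rounds=2):
    """Run one pairwise-dialogue phase.

    step is a hypothetical callable that maps the current artifact to the
    next one (one instructor-assistant round).
    Returns (artifact, outcome, rounds_used).
    """
    artifact = initial
    unchanged = 0
    for round_no in range(1, max_rounds + 1):
        revised = step(artifact)
        # Count consecutive rounds that left the artifact untouched.
        unchanged = unchanged + 1 if revised == artifact else 0
        artifact = revised
        if unchanged == stable_rounds:
            return artifact, Outcome.CONVERGED, round_no
    return artifact, Outcome.EXHAUSTED, max_rounds


def run_with_escalation(step, initial, escalate):
    """Treat non-convergence as a state to handle, not just a stop."""
    artifact, outcome, rounds = run_phase(step, initial)
    if outcome is Outcome.EXHAUSTED:
        return escalate(artifact, rounds)   # e.g. hand off to a human reviewer
    return artifact
```

&lt;p&gt;The only real change from a plain bounded loop is that the caller can tell “settled” from “gave up” and route the second case somewhere.&lt;/p&gt;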

&lt;h2 id=&quot;metagpt-structured-artifacts-and-test-execution&quot;&gt;MetaGPT: Structured Artifacts and Test Execution&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card mas-metagpt&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;MetaGPT: Meta Programming for Multi-Agent Collaborative Framework&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2308.00352&quot;&gt;arXiv 2308.00352&lt;/a&gt; · ICLR 2024 Oral&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;Replace dialogue with structured documents and real test execution.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Core bet: Schemas plus passing tests equal correct output&lt;/div&gt;
  &lt;div&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;5 roles&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;artifact pub-sub&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;executable feedback&lt;/span&gt;
  &lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;Roles: PM, Architect, Project Manager, Engineer, QA (waterfall)&lt;/li&gt;
    &lt;li&gt;No dialogue; agents produce structured documents (PRD, system design, task list, code, tests)&lt;/li&gt;
    &lt;li&gt;Pub-sub message pool: agents publish artifacts, subscribe by role to relevant messages&lt;/li&gt;
    &lt;li&gt;Executable feedback: unit tests actually run; failures trigger up to 3 retries referencing PRD and design docs&lt;/li&gt;
    &lt;li&gt;Results: 85.9 percent Pass@1 on HumanEval, 100 percent task completion, 0.83 human revisions vs ChatDev&apos;s 2.5&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;MetaGPT is the most ambitious wave-1 paper. Instead of dialogue, the agents produce structured documents. Instead of relying on agents to agree, the framework executes tests. The paper’s strongest claim is the structural one: if agents produce artifacts with defined schemas, and those artifacts are validated by execution, you get better coordination than you get from unconstrained dialogue.&lt;/p&gt;

&lt;p&gt;I think that claim holds up. But the coordination model is still a shared mutable pool, and concurrent writes to shared state are the problem &lt;a href=&quot;https://docs.riak.com/riak/kv/latest/learn/concepts/causal-context/index.html&quot;&gt;Riak&lt;/a&gt; addressed more than fifteen years ago with version vectors. MetaGPT has no equivalent. Agents publish to the pool. Agents subscribe. Nobody tracks causality.&lt;/p&gt;
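
&lt;p&gt;For readers who haven’t met version vectors, here is a toy Python sketch of what causality tracking on a shared pool could look like. It is written in the spirit of the technique, not as Riak’s or MetaGPT’s API: &lt;code&gt;ArtifactPool&lt;/code&gt;, &lt;code&gt;publish&lt;/code&gt;, and the role names are all hypothetical, and rejecting a stale write is just one policy (keeping both versions as siblings is another).&lt;/p&gt;

```python
def increment(vv, actor):
    """Return a copy of version vector vv with actor's counter bumped by one."""
    out = dict(vv)
    out[actor] = out.get(actor, 0) + 1
    return out


def descends(a, b):
    """True if a has seen every event recorded in b (a dominates or equals b)."""
    return all(a.get(actor, 0) == max(a.get(actor, 0), count)
               for actor, count in b.items())


class ArtifactPool:
    """Toy shared pool: one value per key, plus the version vector that wrote it."""

    def __init__(self):
        self._entries = {}   # key mapped to (value, version vector)

    def read(self, key):
        """Return (value, context); the context must accompany any later write."""
        return self._entries.get(key, (None, {}))

    def publish(self, key, value, context, actor):
        """Accept the write only if the writer had seen the stored version."""
        _, stored = self.read(key)
        if not descends(context, stored):
            return False     # concurrent edit: the writer's view was stale
        self._entries[key] = (value, increment(context, actor))
        return True
```

&lt;p&gt;If two agents read the same version and both try to write, the second write is flagged instead of silently overwriting the first, which is exactly the check a plain publish-subscribe pool never makes.&lt;/p&gt;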

&lt;h2 id=&quot;autogen-a-framework-not-a-system&quot;&gt;AutoGen: A Framework, Not a System&lt;/h2&gt;

&lt;div class=&quot;mas-paper-card mas-autogen&quot;&gt;
  &lt;div class=&quot;mas-card-title&quot;&gt;
    &lt;strong&gt;AutoGen: Next-Gen LLM Applications via Multi-Agent Conversation&lt;/strong&gt;
    &lt;span class=&quot;mas-card-meta&quot;&gt;&lt;a href=&quot;https://arxiv.org/abs/2308.08155&quot;&gt;arXiv 2308.08155&lt;/a&gt;&lt;/span&gt;
  &lt;/div&gt;
  &lt;p class=&quot;mas-card-oneliner&quot;&gt;A configurable framework. Build whatever multi-agent system you want.&lt;/p&gt;
  &lt;div class=&quot;mas-card-bet&quot;&gt;Core bet: Developers will build the right topology&lt;/div&gt;
  &lt;div&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;framework&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;ConversableAgent&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;GroupChat&lt;/span&gt;
    &lt;span class=&quot;mas-tag&quot;&gt;human-in-loop&lt;/span&gt;
  &lt;/div&gt;
  &lt;ul&gt;
    &lt;li&gt;ConversableAgent base class: any entity that sends and receives messages (LLM, human, tool, code executor)&lt;/li&gt;
    &lt;li&gt;Pluggable reply functions via register_reply(); agent behavior is what it does when it gets a message&lt;/li&gt;
    &lt;li&gt;GroupChatManager: selects next speaker via LLM role-play prompting or an FSM&lt;/li&gt;
    &lt;li&gt;Human-in-the-loop as a dial: per-agent config of ALWAYS, SOMETIMES, or NEVER&lt;/li&gt;
    &lt;li&gt;Number one on GAIA at time of publication, roughly 2x performance on the hardest level&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;AutoGen is the odd paper in wave-1 because it’s not a system, it’s a framework. The contribution is that every agent, whether it’s an LLM, a human, a tool, or a code executor, speaks the same message protocol. You compose them however you want. The GroupChatManager can pick the next speaker via an LLM or via a finite state machine you define.&lt;/p&gt;

&lt;p&gt;AutoGen is more honest than the others about the fact that there’s no “right” multi-agent architecture. It doesn’t try to tell you what your agents should be. It gives you the plumbing and assumes you know what you’re doing. For that reason it’s probably aged better than CAMEL or ChatDev.&lt;/p&gt;

&lt;h2 id=&quot;what-wave-1-got-right&quot;&gt;What Wave 1 Got Right&lt;/h2&gt;

&lt;p&gt;Every one of these papers took LLMs out of single-user chat and put them into multi-step coordination tasks. That’s a real contribution. Role specialization, structured dialogue, tool use patterns, task decomposition, memory and reflection as first-class primitives. These ideas came out of wave-1 and the field is still using them.&lt;/p&gt;

&lt;h2 id=&quot;what-wave-1-got-wrong&quot;&gt;What Wave 1 Got Wrong&lt;/h2&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;h4&gt;Shared Assumptions That Didn&apos;t Survive&lt;/h4&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Failure model&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;Treated as termination, not a system state&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Concurrency control&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;Shared state with no causality tracking&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Evaluation&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;Benchmarks designed for single agents&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Escalation&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;No path when convergence fails&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Topology&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip mas-chip-bad&quot;&gt;Fixed at design time&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Every wave-1 paper treats failure as a termination condition. When ChatDev can’t converge, it stops. When MetaGPT’s tests fail three times, it stops. When AutoGen hits max_round, it stops. None of these systems have a model for what happens next. This is the gap wave-2 papers would later start trying to fill.&lt;/p&gt;

&lt;p&gt;None of the wave-1 papers have concurrency control on their shared state. MetaGPT’s message pool grows monotonically and nobody tracks causality. ChatDev discards dialogue at phase boundaries. Generative Agents’ memory is per-agent with no sharing. If you had two agents in MetaGPT trying to edit the same file, nothing in the framework would stop them from overwriting each other’s work.&lt;/p&gt;

&lt;p&gt;And all of them evaluate against benchmarks that were designed for single agents. HumanEval, MBPP, SWE-bench. These benchmarks measure whether the output is correct. They don’t measure coordination quality, communication overhead, or recovery behavior, which are exactly the things that distinguish a multi-agent system from a single agent.&lt;/p&gt;

&lt;p&gt;Next post: wave-2 papers, which measure what actually breaks in these systems. With wave-1 architectures running in production, and the agentic coding turn having clarified when MAS isn’t the right tool at all, the field started asking why MAS fails when you do use it and how to test that honestly.&lt;/p&gt;
</description>
				<pubDate>Sun, 26 Apr 2026 12:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/26/mas-series-03-wave-one.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/26/mas-series-03-wave-one.html</guid>
			</item>
		
			<item>
				<title>Getting Up to Speed on Multi-Agent Systems, Part 2: The Vocabulary</title>
				<description>&lt;p&gt;If you try to read multi-agent systems papers without the vocabulary, you will get nowhere. The field has settled on a shared set of words for the pieces of a system, and every paper now slots into those categories even when it pretends to be doing something novel. This post is about those words. Once you know them, you can read any paper in the field and know what it is and isn’t claiming.&lt;/p&gt;

&lt;div class=&quot;mas-series-nav&quot;&gt;
  &lt;div class=&quot;mas-series-label&quot;&gt;Getting Up to Speed on MAS&lt;/div&gt;
  &lt;ol&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/24/mas-series-01-the-landscape.html&quot;&gt;Part 1. The Landscape&lt;/a&gt;&lt;/li&gt;&lt;li class=&quot;mas-current&quot;&gt;&lt;strong&gt;Part 2. The Vocabulary (you are here)&lt;/strong&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/26/mas-series-03-wave-one.html&quot;&gt;Part 3. Wave 1: Can Agents Coordinate At All?&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/27/mas-series-04-wave-two.html&quot;&gt;Part 4. Wave 2: Why It Breaks&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/28/mas-series-05-debate-state-coordination.html&quot;&gt;Part 5. Debate, State, and Coordination&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/29/mas-series-06-verification-patterns.html&quot;&gt;Part 6. Verification Patterns&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/30/mas-series-07-benchmarks.html&quot;&gt;Part 7. Benchmarks and What They Miss&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/05/01/mas-series-08-open-questions.html&quot;&gt;Part 8. Open Questions&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;
&lt;/div&gt;

&lt;p&gt;Three surveys have done the work of consolidating the vocabulary. Each one cuts the space slightly differently, but together they give you the conceptual toolkit.&lt;/p&gt;

&lt;h2 id=&quot;tran-et-al-actors-types-structures-strategies&quot;&gt;Tran et al.: Actors, Types, Structures, Strategies&lt;/h2&gt;

&lt;p&gt;The most useful single survey is &lt;a href=&quot;https://arxiv.org/abs/2501.06322&quot;&gt;Tran et al. (2025)&lt;/a&gt;. It defines a multi-agent system formally as a tuple of agents, collaboration channels, collective goals, and an environment. Then it taxonomizes the space along four axes.&lt;/p&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;h4&gt;Tran&apos;s Four Axes&lt;/h4&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Types&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Cooperation (aligned goals)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Competition (conflicting goals)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Coopetition (mixed)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Structures&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Centralized (hub)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Decentralized (P2P)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Hierarchical (layered)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Strategies&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Rule-based (voting, consensus)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Role-based (SOP, expertise)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Model-based (Theory of Mind)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Architecture&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Static (pre-defined)&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Dynamic (runtime adjustment)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Most of the famous wave-1 papers land in nearly the same box: cooperative, role-based, static. They differ mainly in structure and in how agents pass messages and what they produce at each step. The survey’s most useful claim is that the optimal structure varies with the task. There is no universal topology.&lt;/p&gt;

&lt;h2 id=&quot;zhou-et-al-the-five-component-agent&quot;&gt;Zhou et al.: The Five-Component Agent&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://link.springer.com/article/10.1007/s44336-024-00009-2&quot;&gt;Zhou et al. (2024)&lt;/a&gt; takes a different cut. Instead of asking how agents coordinate, they ask what each agent actually has inside it. They propose a five-component model that applies to any LLM-based agent.&lt;/p&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;h4&gt;Zhou&apos;s Five Components&lt;/h4&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;01 Profile&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;How the agent is created with role and expertise&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;02 Perception&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;How the agent observes its environment&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;03 Self-Action&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Memory, reasoning, and planning&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;04 Mutual Interaction&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Communication paradigm, structure, content&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;05 Evolution&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Self-reflection, progressive enhancement&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Reading this as a distributed systems person, the labels sound like things you’d recognize from any actor system. Profile is identity. Perception is input. Self-Action is local state plus computation. Mutual Interaction is message passing. Evolution is the weakest piece, because nobody has really figured out what “agent learning from its own history” looks like in production.&lt;/p&gt;

&lt;h2 id=&quot;chen-et-al-applications-and-unsolved-challenges&quot;&gt;Chen et al.: Applications and Unsolved Challenges&lt;/h2&gt;

&lt;p&gt;The third survey, &lt;a href=&quot;https://arxiv.org/abs/2412.17481&quot;&gt;Chen et al. (2024)&lt;/a&gt;, is the one I’d skim rather than read in full. The applications chapter is useful, but what you actually want is the challenges section.&lt;/p&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;h4&gt;Chen&apos;s Challenge Levels&lt;/h4&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Agent-level&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Alignment for simulation&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Hallucination propagation&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Long-context limits&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Interaction-level&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Efficiency explosion&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Accumulative error&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Evaluation-level&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;No standardized benchmarks&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;No objective metrics&lt;/span&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;No individual vs aggregate frameworks&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The interaction-level challenges are the ones that most concern me. Efficiency explosion is the observation that multi-agent systems scale worse than linearly because each agent’s autoregressive generation multiplies the token cost. Accumulative error is what it sounds like: errors made in round one propagate and amplify in rounds two, three, four.&lt;/p&gt;
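
&lt;p&gt;A back-of-the-envelope sketch shows why the cost grows faster than the agent count. It assumes an all-to-all protocol in which every agent re-reads the whole transcript before each reply and every message has the same length; both are simplifying assumptions of mine, and &lt;code&gt;debate_tokens&lt;/code&gt; is a made-up helper, not a measurement from the survey.&lt;/p&gt;

```python
def debate_tokens(agents, rounds, msg_tokens):
    """Total tokens processed (read plus written) under the assumed protocol."""
    total = 0
    transcript = 0                        # tokens written so far by everyone
    for _ in range(rounds):
        total += agents * transcript      # each agent reads the full transcript
        total += agents * msg_tokens      # each agent writes one message
        transcript += agents * msg_tokens
    return total


single = debate_tokens(agents=1, rounds=3, msg_tokens=100)
group = debate_tokens(agents=4, rounds=3, msg_tokens=100)
print(single, group)
```

&lt;p&gt;With these toy numbers, one agent processes 600 tokens over three rounds and four agents process 6,000: four times the agents, ten times the tokens.&lt;/p&gt;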

&lt;h2 id=&quot;mapping-papers-into-these-taxonomies&quot;&gt;Mapping Papers Into These Taxonomies&lt;/h2&gt;

&lt;p&gt;The payoff of the vocabulary is that you can now categorize any paper in the field at a glance.&lt;/p&gt;

&lt;div class=&quot;mas-compare-wrap&quot;&gt;
&lt;table class=&quot;mas-compare&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;&lt;th&gt;System&lt;/th&gt;&lt;th&gt;Type&lt;/th&gt;&lt;th&gt;Structure&lt;/th&gt;&lt;th&gt;Strategy&lt;/th&gt;&lt;th&gt;Architecture&lt;/th&gt;&lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;CAMEL&lt;/td&gt;&lt;td&gt;Cooperation&lt;/td&gt;&lt;td&gt;Decentralized pair&lt;/td&gt;&lt;td&gt;Role-based&lt;/td&gt;&lt;td&gt;Static&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;ChatDev&lt;/td&gt;&lt;td&gt;Cooperation&lt;/td&gt;&lt;td&gt;Hierarchical pipeline&lt;/td&gt;&lt;td&gt;Role-based&lt;/td&gt;&lt;td&gt;Static&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;MetaGPT&lt;/td&gt;&lt;td&gt;Cooperation&lt;/td&gt;&lt;td&gt;Centralized pool&lt;/td&gt;&lt;td&gt;Role + Rule-based&lt;/td&gt;&lt;td&gt;Static&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Debate (Du)&lt;/td&gt;&lt;td&gt;Competition&lt;/td&gt;&lt;td&gt;Decentralized all-to-all&lt;/td&gt;&lt;td&gt;Rule-based rounds&lt;/td&gt;&lt;td&gt;Static&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Generative Agents&lt;/td&gt;&lt;td&gt;Coopetition&lt;/td&gt;&lt;td&gt;Decentralized open env&lt;/td&gt;&lt;td&gt;Model-based retrieval&lt;/td&gt;&lt;td&gt;Dynamic&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Anthropic Research&lt;/td&gt;&lt;td&gt;Cooperation&lt;/td&gt;&lt;td&gt;Centralized orchestrator&lt;/td&gt;&lt;td&gt;Role-based&lt;/td&gt;&lt;td&gt;Dynamic&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;AutoGen&lt;/td&gt;&lt;td&gt;Configurable&lt;/td&gt;&lt;td&gt;Configurable&lt;/td&gt;&lt;td&gt;Configurable&lt;/td&gt;&lt;td&gt;Static or Dynamic&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Most of the canonical papers sit in the cooperative, role-based, static quadrant. The interesting ones are the exceptions. Du et al. is the rare competitive debate paper. Generative Agents is the rare fully dynamic system. AutoGen tries to be everything at once, which is its whole thesis.&lt;/p&gt;

&lt;div class=&quot;mas-callout&quot;&gt;
  &lt;div class=&quot;mas-callout-label&quot;&gt;Why vocabulary matters&lt;/div&gt;
  When two papers claim they &quot;disagree,&quot; the vocabulary lets you ask: are they actually addressing the same problem? ChatDev and MetaGPT both call themselves &quot;multi-agent software engineering frameworks&quot; but they have different structures, different strategies, and different failure modes. You need the words to see that they are solving slightly different versions of the same problem.
&lt;/div&gt;

&lt;h2 id=&quot;the-gap-the-vocabulary-exposes&quot;&gt;The Gap the Vocabulary Exposes&lt;/h2&gt;

&lt;p&gt;The taxonomies do something else besides categorize papers. They make gaps visible.&lt;/p&gt;

&lt;p&gt;Zhou’s “Evolution” component is the weakest across every system. Nobody has a real story for how agents learn from their own history in production. MetaGPT’s “test-driven retry” is the closest thing in wave-1 to Evolution, and it’s still just a bounded retry loop with no memory of past attempts.&lt;/p&gt;

&lt;p&gt;Tran’s “dynamic architecture” category is almost empty. The wave-1 papers all fix their topology at design time. AutoGen makes topology configurable, but it’s configured by the developer, not adjusted at runtime. The only system that truly adjusts at runtime is Generative Agents, and that’s a simulation, not a production framework.&lt;/p&gt;

&lt;p&gt;Chen’s “evaluation-level” challenges are unsolved in a way that’s embarrassing for the field. When ChatDev claims 88 percent executability and MetaGPT claims 41 percent on a comparable benchmark, you’re not looking at a performance difference. You’re looking at two papers measuring different things with different tools and calling them the same.&lt;/p&gt;

&lt;p&gt;Next post: the wave-1 theory papers in detail. CAMEL, Generative Agents, ChatDev, MetaGPT, AutoGen. What each one actually builds, what each one trusts, and where each one breaks.&lt;/p&gt;
</description>
				<pubDate>Sat, 25 Apr 2026 12:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/25/mas-series-02-the-vocabulary.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/25/mas-series-02-the-vocabulary.html</guid>
			</item>
		
			<item>
				<title>Getting Up to Speed on Multi-Agent Systems, Part 1: The Landscape</title>
				<description>&lt;p&gt;I’ve been reading multi-agent systems papers for weeks trying to figure out where the field actually is, and the honest answer is that it moves fast enough that any single paper is a snapshot, not a map. So this is the map I wish I’d had when I started. It’s a short series of posts meant to get someone up to speed on multi-agent LLM systems without having to read thirty papers first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who this is for:&lt;/strong&gt; you already ship or evaluate LLM agents (tools, long context, basic eval loops) and want the &lt;em&gt;research&lt;/em&gt; landscape in view. This is not an on-ramp to transformers or prompting fundamentals.&lt;/p&gt;

&lt;p&gt;Before I start, a note on the frame I’m going to use. I’m going to talk about two “waves” of multi-agent research, and one outside disruption that happened between them. I want to be upfront that the waves are a reader aid, not a historical claim. Nobody in 2023 was writing “wave 1” papers, and the field did not convene to name its generation. These aren’t named movements like French New Wave cinema or second-wave feminism, where participants self-consciously defined their work against a prior cohort. What I’m doing is retrospective grouping: the kind you use to keep thirty papers straight in your head.&lt;/p&gt;

&lt;p&gt;What the grouping captures is that certain clusters of papers share assumptions, benchmarks, and failure modes. What it misses is that parallel threads exist (debate, simulation, distributed-systems-adjacent work) that don’t fit the wave structure at all, and that plenty of individual papers sit awkwardly between waves. If the framing helps you navigate the literature, keep it. If it gets in the way, drop it. The papers are what matter; the waves are scaffolding.&lt;/p&gt;

&lt;p&gt;With that caveat in mind: two rough clusters, one outside disruption that reshaped both, and two rough questions the MAS field has been trying to answer.&lt;/p&gt;

&lt;div class=&quot;mas-series-nav&quot;&gt;
  &lt;div class=&quot;mas-series-label&quot;&gt;Getting Up to Speed on MAS&lt;/div&gt;
  &lt;ol&gt;&lt;li class=&quot;mas-current&quot;&gt;&lt;strong&gt;Part 1. The Landscape (you are here)&lt;/strong&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/25/mas-series-02-the-vocabulary.html&quot;&gt;Part 2. The Vocabulary&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/26/mas-series-03-wave-one.html&quot;&gt;Part 3. Wave 1: Can Agents Coordinate At All?&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/27/mas-series-04-wave-two.html&quot;&gt;Part 4. Wave 2: Why It Breaks&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/28/mas-series-05-debate-state-coordination.html&quot;&gt;Part 5. Debate, State, and Coordination&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/29/mas-series-06-verification-patterns.html&quot;&gt;Part 6. Verification Patterns&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/04/30/mas-series-07-benchmarks.html&quot;&gt;Part 7. Benchmarks and What They Miss&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/ai/agents/mas-series/2026/05/01/mas-series-08-open-questions.html&quot;&gt;Part 8. Open Questions&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;
&lt;/div&gt;

&lt;h2 id=&quot;wave-1-can-multiple-llms-coordinate-at-all-2023&quot;&gt;Wave 1: Can Multiple LLMs Coordinate At All? (2023)&lt;/h2&gt;

&lt;p&gt;The first wave is the one most people have heard of. A cluster of papers came out in roughly a six-month window in 2023, all answering some version of the same question: if you put multiple LLMs together, can they do something one LLM cannot?&lt;/p&gt;

&lt;p&gt;Not every paper in that cluster is “coordination theory” in the same sense: some are explicit software pipelines (ChatDev, MetaGPT), others foreground simulation and believable social dynamics (Generative Agents). I group them anyway because they share 2023-era benchmarks and a similar loose trust that multi-agent structure will carry the task.&lt;/p&gt;

&lt;div class=&quot;mas-timeline-wave&quot;&gt;
  &lt;div class=&quot;mas-wave-header&quot;&gt;Wave 1 · Theory and Architecture&lt;/div&gt;
  &lt;div class=&quot;mas-wave-subhead&quot;&gt;Can multiple LLMs coordinate at all? What&apos;s the right shape?&lt;/div&gt;
  &lt;div class=&quot;mas-timeline&quot;&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-camel&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;Mar 2023&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;CAMEL&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;Two agents role-play&lt;/div&gt;
    &lt;/div&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-genagents&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;Apr 2023&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;Gen. Agents&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;Memory and reflection&lt;/div&gt;
    &lt;/div&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-debate&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;May 2023&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;Debate (Du)&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;Competition as coordination&lt;/div&gt;
    &lt;/div&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-chatdev&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;Jul 2023&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;ChatDev&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;Pairwise chat pipeline&lt;/div&gt;
    &lt;/div&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-metagpt&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;Aug 2023&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;MetaGPT&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;Artifacts and test execution&lt;/div&gt;
    &lt;/div&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-autogen&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;Aug 2023&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;AutoGen&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;Configurable framework&lt;/div&gt;
    &lt;/div&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-green&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;Aug 2023&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;AgentVerse&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;Dynamic group composition&lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;These papers are all proofs of concept. Each one shows that multi-agent coordination is viable for some task, demonstrates the idea on a benchmark, and argues that its particular coordination structure beats simpler baselines. &lt;a href=&quot;https://arxiv.org/abs/2303.17760&quot;&gt;CAMEL&lt;/a&gt; uses two agents role-playing. &lt;a href=&quot;https://arxiv.org/abs/2304.03442&quot;&gt;Generative Agents&lt;/a&gt; uses memory streams and reflection in a social simulation. &lt;a href=&quot;https://arxiv.org/abs/2305.14325&quot;&gt;Du et al.&lt;/a&gt; use multi-round debate between identical model instances. &lt;a href=&quot;https://arxiv.org/abs/2307.07924&quot;&gt;ChatDev&lt;/a&gt; chains pairwise dialogues into a software development pipeline. &lt;a href=&quot;https://arxiv.org/abs/2308.00352&quot;&gt;MetaGPT&lt;/a&gt; replaces dialogue with structured artifacts and real test execution. &lt;a href=&quot;https://arxiv.org/abs/2308.08155&quot;&gt;AutoGen&lt;/a&gt; is a framework for building whatever topology you want. &lt;a href=&quot;https://arxiv.org/abs/2308.10848&quot;&gt;AgentVerse&lt;/a&gt; emphasizes dynamic group composition: the system can recruit specialists and reshape the team as the task proceeds.&lt;/p&gt;

&lt;p&gt;The wave-1 papers share three assumptions that look very different in hindsight. First, they assume the benchmarks they run on are the task. Second, they treat failure as a termination condition: when the system stops converging, it just stops. Third, they trust agents to coordinate without formal concurrency control, shared memory protocols, or recovery paths. These assumptions are where wave 2 would later start to push back.&lt;/p&gt;

&lt;h2 id=&quot;what-happened-next-door-2024&quot;&gt;What Happened Next Door (2024)&lt;/h2&gt;

&lt;p&gt;Before I get to wave 2, I have to account for something that happened in parallel. It isn’t really MAS research, but it reshaped what MAS has to answer for.&lt;/p&gt;

&lt;p&gt;Through 2024, a wave of agentic coding systems shipped: &lt;a href=&quot;https://devin.ai&quot;&gt;Devin&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/2405.15793&quot;&gt;SWE-agent&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/2407.16741&quot;&gt;OpenHands&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/2403.08299&quot;&gt;AutoDev&lt;/a&gt;, and Microsoft’s &lt;a href=&quot;https://arxiv.org/abs/2411.04468&quot;&gt;Magentic-One&lt;/a&gt;. These are not multi-agent systems in the wave-1 sense. Most of them are single agents with well-designed tool interfaces. The SWE-agent paper in particular showed that interface quality matters more than adding agents. They got a 10.7 percentage point improvement on SWE-bench from interface design alone, without changing the model.&lt;/p&gt;

&lt;p&gt;I bring this up because you cannot read the MAS papers from 2025 onward without this context. Wave 1 had implicitly assumed that multi-agent coordination was the default way to solve complex agentic tasks. By late 2024, the agentic coding community had accumulated strong evidence against that default for at least one large, &lt;em&gt;benchmarked&lt;/em&gt; slice of the space: autonomous patch-style software engineering (the kind of workflow SWE-bench and its successors measure). That is narrower than “all coding forever,” but it was the center of gravity in 2024 discourse, and it shifted what MAS papers need to beat. Anthropic’s &lt;a href=&quot;https://www.anthropic.com/engineering/multi-agent-research-system&quot;&gt;research system post&lt;/a&gt; from June 2025 states the conclusion plainly: multi-agent earns its overhead on “breadth-first queries with independent parallel subtasks” and underperforms on “tasks needing shared context, including most coding tasks.”&lt;/p&gt;

&lt;div class=&quot;mas-callout&quot;&gt;
  &lt;div class=&quot;mas-callout-label&quot;&gt;Why this matters for MAS readers&lt;/div&gt;
  The agentic coding papers aren&apos;t MAS research. But they narrowed the MAS claim. After 2024, &quot;multi-agent for coding&quot; became harder to defend without evidence, and the MAS field&apos;s next wave is partly a response to that. If you&apos;re reading a post-2024 MAS paper, it&apos;s almost certainly arguing implicitly against the single-agent-with-tools baseline that Devin and SWE-agent established. I won&apos;t spend a whole post on these systems because they&apos;re not MAS, but they belong on your mental map of the landscape.
&lt;/div&gt;

&lt;p&gt;Magentic-One is the interesting exception. It’s a real multi-agent system with an orchestrator coordinating four specialized workers. It earns its overhead on hard multi-step reasoning (38 percent on GAIA) but not on focused coding. The stuck-counter mechanism it introduces (if an agent loops more than twice, reflect and replan) is one of the few MAS design patterns to surface clearly in the shipped agentic-coding systems of this period. I’ll come back to it in later posts.&lt;/p&gt;
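
&lt;p&gt;To make the stuck-counter idea concrete, here is a minimal sketch of a counter with that shape. It is not Magentic-One’s code: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Orchestrator&lt;/code&gt; class, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;record_step&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STALL_LIMIT&lt;/code&gt; are hypothetical names, and the threshold of two comes only from the one-line description above.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical threshold, taken from "loops more than twice" in the text above.
STALL_LIMIT = 2


@dataclass
class Orchestrator:
    """Toy orchestrator that tracks consecutive steps without progress."""

    stall_count: int = 0
    replans: int = 0

    def record_step(self, made_progress: bool) -> str:
        """Return the next action for the loop: 'continue' or 'replan'."""
        if made_progress:
            # Any real progress clears the counter.
            self.stall_count = 0
            return "continue"
        self.stall_count += 1
        if self.stall_count > STALL_LIMIT:
            # Stalled more than twice in a row: stop retrying, reflect, replan.
            self.stall_count = 0
            self.replans += 1
            return "replan"
        return "continue"


orch = Orchestrator()
actions = [orch.record_step(p) for p in [True, False, False, False, True]]
print(actions)
```

&lt;p&gt;The counter lives in the orchestrator rather than in any worker, so a looping worker does not have to notice its own loop. How “progress” gets judged is the hard part, and this sketch leaves it as a boolean supplied by the caller.&lt;/p&gt;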

&lt;h2 id=&quot;wave-2-why-does-it-break-2025-and-beyond&quot;&gt;Wave 2: Why Does It Break? (2025 and Beyond)&lt;/h2&gt;

&lt;p&gt;The second MAS wave is where the field is now. With wave-1 systems running in production and the agentic coding turn having clarified when MAS is and isn’t the right tool, people started asking: when MAS does fail, why? And how do we even test that?&lt;/p&gt;

&lt;p&gt;The Wave 2 timeline below is &lt;strong&gt;illustrative&lt;/strong&gt;: a handful of papers that typify a shift toward measurement, taxonomies, and fault injection, not an attempt to enumerate every 2025–2026 contribution.&lt;/p&gt;

&lt;div class=&quot;mas-timeline-wave&quot;&gt;
  &lt;div class=&quot;mas-wave-header&quot;&gt;Wave 2 · Why Does It Break?&lt;/div&gt;
  &lt;div class=&quot;mas-wave-subhead&quot;&gt;MAS works sometimes. Now: why does it fail? How do you test reliability?&lt;/div&gt;
  &lt;div class=&quot;mas-timeline&quot;&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-red&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;Mar 2025&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;MAST (Cemri)&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;14 failure modes, 1,600 traces&lt;/div&gt;
    &lt;/div&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-purple&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;Jun 2025&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;Anthropic Research&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;Production orchestrator-worker&lt;/div&gt;
    &lt;/div&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-green&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;Aug 2025&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;Info Sharing in Planning&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;Shared notebook on travel planning&lt;/div&gt;
    &lt;/div&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-red&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;Feb 2026&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;MAS-FIRE&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;Systematic fault injection for MAS&lt;/div&gt;
    &lt;/div&gt;
    &lt;div class=&quot;mas-timeline-node mas-c-blue&quot;&gt;
      &lt;div class=&quot;mas-node-date&quot;&gt;Mar 2026&lt;/div&gt;
      &lt;div class=&quot;mas-node-name&quot;&gt;Silo-Bench&lt;/div&gt;
      &lt;div class=&quot;mas-node-desc&quot;&gt;Communication-reasoning gap&lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The &lt;a href=&quot;https://arxiv.org/abs/2503.13657&quot;&gt;MAST paper&lt;/a&gt; from Cemri and collaborators is the one I keep coming back to. They annotated 1,600 traces across seven popular multi-agent frameworks and built a taxonomy of 14 failure modes. Every framework they tested had a failure rate between 41 and 87 percent. The top three failures are step repetition, reasoning-action mismatch, and being unaware of termination conditions. These are not model capability problems. They are system design problems.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2602.19843&quot;&gt;MAS-FIRE&lt;/a&gt; goes in the other direction. Instead of observing failures in the wild, the authors inject them on purpose: fifteen fault types across intra-agent and inter-agent categories, three injection mechanisms, and a dual-level reliability metric. The most interesting result is what they call the capability paradox: GPT-5’s strict instruction compliance becomes a liability under “Blind Trust” faults, where DeepSeek-V3’s less compliant behavior holds up better.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2603.01045&quot;&gt;Silo-Bench&lt;/a&gt; adds the third leg. 1,620 experiments showing that agents successfully form coordination topologies and actively exchange information, yet systematically fail to synthesize distributed state into correct answers. The bottleneck is not communication. The bottleneck is reasoning over distributed state.&lt;/p&gt;

&lt;h2 id=&quot;what-each-wave-trusts&quot;&gt;What Each Wave Trusts&lt;/h2&gt;

&lt;p&gt;If you want a one-sentence read on each wave, this is it.&lt;/p&gt;

&lt;div class=&quot;mas-taxonomy&quot;&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Wave 1&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Trusts that role structure and dialogue are enough&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Outside&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Agentic coding: trusts that good tools beat agent count&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class=&quot;mas-tax-row&quot;&gt;
    &lt;span class=&quot;mas-tax-label&quot;&gt;Wave 2&lt;/span&gt;
    &lt;div class=&quot;mas-tax-items&quot;&gt;
      &lt;span class=&quot;mas-tax-chip&quot;&gt;Trusts nothing and measures what breaks&lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Read as a progression, the rough arc is: tell agents to coordinate, then notice that sometimes you don’t need them to, then measure what happens when you do. Each step trusts the agents less and verifies more than the one before.&lt;/p&gt;

&lt;h2 id=&quot;what-the-waves-dont-capture&quot;&gt;What the Waves Don’t Capture&lt;/h2&gt;

&lt;p&gt;The two-wave framing is tidy, which should make you suspicious. Some things it flattens or misses:&lt;/p&gt;

&lt;p&gt;Parallel threads run outside the waves entirely. &lt;a href=&quot;https://arxiv.org/abs/2305.14325&quot;&gt;Du et al.&lt;/a&gt; appears on the Wave 1 timeline because it landed in the same window and poses the same headline question (many LLMs versus one), but &lt;strong&gt;the debate line of work&lt;/strong&gt; is its own cluster: multiple instances of the same model arguing, with a citation graph that barely touches coordination-theory papers and rarely gets cited back by them. The timeline entry is chronology, not a claim that debate belongs in ChatDev’s intellectual neighborhood; Part 5 treats debate as that parallel thread. Generative Agents is a social simulation paper that sits uncomfortably in wave 1 because it came out in 2023, but its descendants (game agents, persona simulations) are their own research community. Distributed-systems-adjacent work on shared state and coordination avoidance is a separate thread that’s only now starting to touch the MAS literature.&lt;/p&gt;

&lt;p&gt;Individual papers don’t fit cleanly. AutoGen is a 2023 paper that kept evolving through 2024. The Anthropic research post is engineering content, not a research paper. Magentic-One sits between the agentic coding turn and the reliability wave. Calling any of these “wave 1” or “wave 2” is a judgment call, not a fact.&lt;/p&gt;

&lt;p&gt;Keep all of that in mind as you read the rest of the series. The waves are a way to group papers for comprehension, not a claim about how the field evolved.&lt;/p&gt;

&lt;h2 id=&quot;whats-coming&quot;&gt;What’s Coming&lt;/h2&gt;

&lt;p&gt;The next seven posts build on this landscape.&lt;/p&gt;

&lt;p&gt;Part 2 covers the vocabulary the field uses for itself. Three surveys have done the work of consolidating the shared terms, and once you know the vocabulary you can read any paper in the field at a glance.&lt;/p&gt;

&lt;p&gt;Parts 3 and 4 go deep on the two waves. Part 3 covers the canonical coordination-theory papers (CAMEL, ChatDev, MetaGPT, AutoGen, AgentVerse): what each one actually builds and where each one quietly disagrees with the others. Part 4 covers the reliability wave (MAST, MAS-FIRE, Silo-Bench) and what happens when you try to measure a multi-agent system honestly.&lt;/p&gt;

&lt;p&gt;Part 5 covers the parallel threads. Multi-agent debate (Du, Liang), shared state as coordination (Ou et al.), and the CALM theorem as a bridge between distributed systems and multi-agent AI.&lt;/p&gt;

&lt;p&gt;Parts 6 and 7 are cross-cutting. Part 6 is verification patterns, including Cursor’s visual feedback loop, which is the most interesting production-scale verification pattern I’ve seen and isn’t in any of the papers. Part 7 is benchmarks: what they measure, what they miss, and why ChatDev and MetaGPT can report contradictory results on each other without either being obviously wrong.&lt;/p&gt;

&lt;p&gt;Part 8 is what I think is still missing, what’s worth stealing from adjacent fields, and what I’d read if I had to start over.&lt;/p&gt;

&lt;p&gt;Next post: the vocabulary.&lt;/p&gt;

&lt;h2 id=&quot;errata-and-revisions&quot;&gt;Errata and revisions&lt;/h2&gt;

&lt;p&gt;First published &lt;strong&gt;2026-04-24&lt;/strong&gt;. The list below logs substantive edits so early readers can see what moved.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;2026-04-26&lt;/strong&gt; — &lt;strong&gt;Agentic coding claim:&lt;/strong&gt; Replaced “empirically falsified” framing with scoped wording: strong evidence on benchmarked autonomous patch-style software engineering (SWE-bench-style workflows), not a universal verdict on multi-agent for all coding.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;2026-04-26&lt;/strong&gt; — &lt;strong&gt;Audience:&lt;/strong&gt; Added a short “who this is for” note (assumes you already work with LLM agents).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;2026-04-26&lt;/strong&gt; — &lt;strong&gt;Debate vs. timeline:&lt;/strong&gt; Clarified that Du et al. on the Wave 1 timeline reflects release window and headline question; the broader debate &lt;em&gt;thread&lt;/em&gt; is parallel and is how Part 5 uses “debate.”&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;2026-04-26&lt;/strong&gt; — &lt;strong&gt;AgentVerse:&lt;/strong&gt; Described in the narrative, not only in the timeline table (with arXiv link). The Part 3 preview in “What’s Coming” now names it with the other coordination papers.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;2026-04-26&lt;/strong&gt; — &lt;strong&gt;Wave 2 timeline:&lt;/strong&gt; Explicitly labeled as illustrative examples, not a complete survey of the period.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;2026-04-26&lt;/strong&gt; — &lt;strong&gt;Magentic-One stuck counter:&lt;/strong&gt; Scoped the “design pattern” remark to shipped agentic-coding systems of the period (not a claim about all of CS or all MAS research).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;2026-04-26&lt;/strong&gt; — &lt;strong&gt;Wave 1 heterogeneity:&lt;/strong&gt; Added a paragraph distinguishing pipeline-style coordination papers from simulation-heavy work in the same calendar cluster.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;2026-04-26&lt;/strong&gt; — &lt;strong&gt;Waves caveat:&lt;/strong&gt; Tightened prose; no change to the underlying claim that waves are retrospective scaffolding.&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Fri, 24 Apr 2026 06:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/24/mas-series-01-the-landscape.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/agents/mas-series/2026/04/24/mas-series-01-the-landscape.html</guid>
			</item>
		
			<item>
				<title>The Tribe Has to Outlive the Model · Five model swaps in three weeks taught me the project&apos;s continuity lives in the humans and conventions around the agents, not the agents themselves.</title>
				<description>&lt;p&gt;By the fourth model swap I noticed that the part of the system I wasn’t swapping was the part holding the project together.&lt;/p&gt;

&lt;p&gt;Five configurations in three weeks: Opus 4.6, Opus 4.6 with the 1M context window, Cursor Cloud Agents on GPT Codex 5.3, Cloud Agents on Composer 2, and this week Opus 4.7. Some swaps were involuntary. Credits ran out. A model that had been my daily driver for two months started coming back with half-finished work and failing the easy parts of basic CI, which I wrote about in &lt;a href=&quot;/ai/agents/reliability/zabriskie/2026/04/08/cursor-agents-caucus-v1.html&quot;&gt;Caucus V1&lt;/a&gt;. Some were voluntary. I wanted to see what a new model could do that the previous one couldn’t. In a few cases it could. In a few cases the new one was worse at something different, and I swapped again.&lt;/p&gt;

&lt;p&gt;No architecture change on my end. No rewrite. The app is the app. The model is a junior engineer on a revolving contract.&lt;/p&gt;

&lt;h2 id=&quot;same-prompt-different-creature&quot;&gt;Same Prompt, Different Creature&lt;/h2&gt;

&lt;p&gt;Here’s a specific observation. For a few weeks in April I was running the &lt;a href=&quot;/ai/zabriskie/agents/reliability/caucus/2026/04/14/opt-in-isnt-a-guardrail.html&quot;&gt;Caucus Permit Gate&lt;/a&gt;. It was my attempt at forcing agents to prove their work before merging. Every PR had to land with a permit file in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.caucus/permits/&lt;/code&gt; declaring the scope and risk of the change, plus a proof file in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.caucus/proofs/&lt;/code&gt; recording the output of allowlisted test commands run against the current working tree. CI blocked the merge unless both were there and fresh.&lt;/p&gt;

&lt;p&gt;Codex 5.3, given a task I’d given Composer 2 the day before, shipped the PR with the feature, the permit, and the proof all in the first commit. Two commits total on the branch. Gate green on the first push. Composer 2 on a similar task shipped the feature commit alone. Gate red. Then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chore(caucus): add proof&lt;/code&gt;. Gate red. Then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chore(caucus): sync permit scope and refresh proof&lt;/code&gt;. Then a merge of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;origin/main&lt;/code&gt;. Then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chore(caucus): refresh proof after merge&lt;/code&gt;. The slowest Composer 2 PR from that era landed in main with twenty-three commits on the branch. One of them was the feature.&lt;/p&gt;

&lt;p&gt;Same repo. Same AGENTS.md. Same sentence in bold telling the agent that the permit is not optional paperwork. Two completely different relationships with the gate.&lt;/p&gt;

&lt;p&gt;Neither was strictly wrong. Codex 5.3 paid the cost upfront. Composer 2 paid the cost when forced. Composer 2 is the one that taught me something, because the only reason it paid the cost at all was that the gate existed. Without the gate, Composer 2 would have shipped the feature commit and I would have merged it, because I’m the reviewer, and “looks fine” works on me the same way it works on the model.&lt;/p&gt;

&lt;h2 id=&quot;where-the-knowledge-has-to-live&quot;&gt;Where the Knowledge Has to Live&lt;/h2&gt;

&lt;p&gt;If a new teammate rotates in every fortnight, any knowledge that has to survive the swap cannot live in the teammate. It has to live in the repo.&lt;/p&gt;

&lt;p&gt;Everyone has a CLAUDE.md or AGENTS.md at this point. Most of them read like documentation: “the backend is in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/backend&lt;/code&gt;, the frontend is in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/web&lt;/code&gt;, run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;npm install&lt;/code&gt; first.” Some of mine is that.&lt;/p&gt;

&lt;p&gt;The part of mine that matters isn’t the documentation. It’s the lines that record things only visible if you were here when they broke.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;NEVER modify timestamp/timezone columns in migrations.&lt;/strong&gt; Timestamp corruption is unrecoverable and destroys production data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s a scar. The commit that added this rule went in at 5:46 on a Saturday morning, and the commit message was one word: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fuck&lt;/code&gt;. A companion rule had gone in 33 minutes earlier. I’ll come back to that one. A new engineer reading the repo cold has no way to know any of this. The column types look fine now. Nothing in the code says “this was dangerous.” Take the rule out, and the next model writes the same migration on the first try. The incident isn’t in the code. It was in the person who lived through it.&lt;/p&gt;

&lt;p&gt;A guardrail is the tribe’s compressed experience, encoded in a form the next teammate can act on without having been present for the original incident. Institutional memory for a team where it would otherwise be deleted every other Tuesday.&lt;/p&gt;

&lt;h2 id=&quot;not-every-guardrail-survives-the-test&quot;&gt;Not Every Guardrail Survives the Test&lt;/h2&gt;

&lt;p&gt;Two days ago I published &lt;a href=&quot;/ai/zabriskie/agents/reliability/caucus/2026/04/21/the-tax-on-the-happy-path.html&quot;&gt;The Tax on the Happy Path&lt;/a&gt;, where I killed a guardrail I had spent three weeks building. The Caucus Permit Gate, which I’d hardened nine different ways in &lt;a href=&quot;/ai/zabriskie/agents/reliability/caucus/2026/04/14/opt-in-isnt-a-guardrail.html&quot;&gt;Opt-In Isn’t a Guardrail&lt;/a&gt;, never caught a unique bug in its last hundred CI runs. Everything it caught, the regular test jobs would have caught a minute later. It cost more than it caught. I took it out.&lt;/p&gt;

&lt;p&gt;I’m not recanting that. Killing a guardrail that isn’t earning its keep is part of taking guardrails seriously. But I owe a better account of which rules survive the audit, because “some guardrails are good and some are bad” is not a useful thing to hand a reader.&lt;/p&gt;

&lt;p&gt;Two questions. Does the rule point at a specific past event? And does its cost scale with the velocity of the work, or with the rate at which the event it prevents actually recurs?&lt;/p&gt;

&lt;p&gt;The permit gate failed both. It didn’t point at a specific incident; it was a ritual about carefulness in general. And its cost scaled with every rebase against main, regardless of whether anything risky was happening on the branch. Cost grew with velocity. Value stayed flat.&lt;/p&gt;

&lt;p&gt;It had the shape of every code-review checklist I’ve ever seen at a company. “Consider thread safety.” “Check error handling.” “Verify input validation.” You check the boxes, go through the motions. The actual review happens in someone’s head, from memory of the specific thing that burned the team last quarter. Ceremony versus pattern recognition. The permit gate was all ceremony.&lt;/p&gt;

&lt;p&gt;The guardrails that have survived the audit pass both. “Do not drop a column without running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;./scripts/check-references.sh&lt;/code&gt; first” points at the time we dropped a column and broke ninety percent of the reports. “No &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--admin&lt;/code&gt; merges” points at a specific session in &lt;a href=&quot;/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html&quot;&gt;Memory Isn’t Learning&lt;/a&gt;. “No database triggers. EVER.” points at the incident I’m about to describe. The cost of obeying each of these is a few seconds, paid only when the agent is about to do the specific dangerous thing. Cost scales with the rate of the risky action, not with the rate of all work.&lt;/p&gt;

&lt;p&gt;They refer to events. They are shaped like scars.&lt;/p&gt;

&lt;h2 id=&quot;the-trigger&quot;&gt;The Trigger&lt;/h2&gt;

&lt;p&gt;The companion rule from above went in at 5:13 that same Saturday morning, 33 minutes before the timestamp one. Same incident. The &lt;a href=&quot;/ai/claude/2026/02/17/building-a-social-app-in-a-week-with-claude-code.html&quot;&gt;migration side of that night&lt;/a&gt; I’ve already written about. The trigger side I haven’t. Here it is.&lt;/p&gt;

&lt;p&gt;I’ve seen this incident at three different companies. Somebody adds a trigger to maintain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updated_at&lt;/code&gt;, or to keep a computed column in sync. The trigger works. Years later someone writes a migration that touches the same table, the trigger fires in the middle of it, and rows end up with values nobody intended. I would have pushed back on this in any code review at any job I’ve ever had. If a junior engineer proposed a trigger in a design doc I’d have said: please, no, not a trigger, write it in application code.&lt;/p&gt;

&lt;p&gt;And then I built &lt;a href=&quot;/zabriskie/&quot;&gt;Zabriskie&lt;/a&gt; with an AI that took the shortest path between “I want &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updated_at&lt;/code&gt; to update itself” and “it updates itself.” The shortest path was a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BEFORE UPDATE&lt;/code&gt; trigger named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update_updated_at_column&lt;/code&gt;, attached to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posts&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;comments&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;users&lt;/code&gt;, and every other table with an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updated_at&lt;/code&gt;. I didn’t catch it because I was moving fast, the code looked fine, and I was watching the screen fill up with new features instead of acting as a database reviewer on my own project.&lt;/p&gt;

&lt;p&gt;The trigger revealed itself on the night the timezone migration broke. Migration 030 converted every timestamped table in the schema from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timestamp&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timestamp with time zone&lt;/code&gt; with an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AT TIME ZONE &apos;America/New_York&apos;&lt;/code&gt; clause. That’s where the disaster movie started. The follow-up migrations tried to clean up. Every cleanup UPDATE fired the trigger and stamped &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updated_at = NOW()&lt;/code&gt; on every row the UPDATE touched, which meant every feed ordering by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updated_at&lt;/code&gt; started surfacing posts from earlier that week as if they’d just been edited. Each fix kept succeeding, and the symptoms kept changing, because each fix was also erasing the evidence that the rows hadn’t been touched by a user.&lt;/p&gt;

&lt;p&gt;It took several migrations to trace, because each one had succeeded, the columns I’d meant to change looked right, and nothing in the migration files mentioned &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updated_at&lt;/code&gt;.&lt;/p&gt;
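
&lt;p&gt;To make the mechanism concrete, here is a minimal sketch of that kind of trigger and a cleanup UPDATE that trips it. It is a reconstruction under assumptions, not the code from Zabriskie’s migrations: the function body, the pairing of function and trigger names, and the four-hour shift on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;created_at&lt;/code&gt; are all illustrative.&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Hypothetical reconstruction; the real function body is not shown in this post.
CREATE OR REPLACE FUNCTION update_updated_at_column() RETURNS trigger AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- EXECUTE FUNCTION needs PostgreSQL 11+; older versions spell it EXECUTE PROCEDURE.
CREATE TRIGGER update_posts_updated_at
    BEFORE UPDATE ON posts
    FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();

-- An illustrative cleanup that never mentions updated_at...
UPDATE posts SET created_at = created_at + INTERVAL &apos;4 hours&apos;;
-- ...yet the trigger fires for each row and sets updated_at to NOW().
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The UPDATE is exactly what the migration file shows. The write to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updated_at&lt;/code&gt; lives in the schema, which is why reading the migrations didn’t reveal it.&lt;/p&gt;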

&lt;p&gt;By migration 036 I gave up and dropped every trigger. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DISABLE TRIGGER ALL&lt;/code&gt; ran first, the fix ran second, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DROP TRIGGER&lt;/code&gt; ran third. Migration 037 fixed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updated_at&lt;/code&gt; one more time. At 5:13 that morning, the rule went into CLAUDE.md:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;NO DATABASE TRIGGERS&lt;/strong&gt;: NEVER create database triggers. EVER. Always use explicit SQL (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SET updated_at = NOW()&lt;/code&gt;) in application code instead. Triggers cause unpredictable behavior during migrations and data fixes. This is non-negotiable.&lt;/p&gt;
&lt;/blockquote&gt;
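
&lt;p&gt;In practice the rule moves the timestamp write into the statement that makes the change. A sketch, assuming a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;body&lt;/code&gt; column on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posts&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$1&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$2&lt;/code&gt; as bind parameters:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-- Application code (hypothetical query): a user edit sets the timestamp explicitly.
UPDATE posts
SET body = $1,
    updated_at = NOW()
WHERE id = $2;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A migration or data fix that shouldn’t count as an edit leaves &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;updated_at&lt;/code&gt; out of its SET clause, and with no trigger on the table the column stays as it was.&lt;/p&gt;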

&lt;p&gt;A companion section, “Check Triggers Before DB Updates,” went in the same week. It tells the agent to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\dS table_name&lt;/code&gt;, list the triggers, and (for the triggers I hadn’t yet ripped out) disable them around any UPDATE migration:&lt;/p&gt;

&lt;div class=&quot;language-sql highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;ALTER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;posts&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DISABLE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TRIGGER&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;update_posts_updated_at&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;-- Your UPDATE here&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ALTER&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;posts&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ENABLE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TRIGGER&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;update_posts_updated_at&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Neither rule is for me. I already knew both. I wrote them down because the teammate I now work with arrives fresh every week with no memory of anything I know, and will, absent a specific instruction, pick the same shortest paths. The shortest paths are a trigger and an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AT TIME ZONE&lt;/code&gt; migration.&lt;/p&gt;

&lt;p&gt;The rules are a decade of scar tissue from somebody who isn’t going to be in the room the next time the new model gets the same task. The tribe’s memory, encoded in a form that survives the swap.&lt;/p&gt;

&lt;p&gt;The same shape holds for “Don’t use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--no-verify&lt;/code&gt;,” for “Never deploy untested changes to S3,” for “Always use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ON CONFLICT DO NOTHING&lt;/code&gt; on data INSERTs in migrations.” Each points at an event. Each event is the thing the next agent would have repeated, because the next agent is a different agent every time.&lt;/p&gt;

&lt;h2 id=&quot;the-interesting-question&quot;&gt;The Interesting Question&lt;/h2&gt;

&lt;p&gt;People keep asking whether the model can learn. Does the memory system work. Does the agent read CLAUDE.md. Does the rule change the behavior. I argued in &lt;a href=&quot;/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html&quot;&gt;Memory Isn’t Learning&lt;/a&gt; that the answer is usually no.&lt;/p&gt;

&lt;p&gt;Different question. Not “can this model learn.” The models will get better at that, or they won’t, and either way my project has to ship.&lt;/p&gt;

&lt;p&gt;Can the team’s memory survive the next model swap?&lt;/p&gt;

&lt;p&gt;The rules I wrote after that Saturday morning have now outlived every configuration that’s done work on this codebase since. The one I’ll be using next month will read them and will not write a trigger or an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AT TIME ZONE&lt;/code&gt; migration.&lt;/p&gt;

&lt;p&gt;The tribe has to outlive the model. The only way that happens is if what the tribe learned is written somewhere the next model can’t skip.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;This is part of a series about building &lt;a href=&quot;https://zabriskie.app&quot;&gt;Zabriskie&lt;/a&gt; with Claude. Previously: &lt;a href=&quot;/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html&quot;&gt;Memory Isn’t Learning&lt;/a&gt;, &lt;a href=&quot;/ai/zabriskie/agents/reliability/caucus/2026/04/14/opt-in-isnt-a-guardrail.html&quot;&gt;Opt-In Isn’t a Guardrail&lt;/a&gt;, &lt;a href=&quot;/ai/zabriskie/agents/reliability/caucus/2026/04/21/the-tax-on-the-happy-path.html&quot;&gt;The Tax on the Happy Path&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</description>
				<pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/zabriskie/agents/reliability/2026/04/23/the-tribe-has-to-outlive-the-model.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/zabriskie/agents/reliability/2026/04/23/the-tribe-has-to-outlive-the-model.html</guid>
			</item>
		
			<item>
				<title>The Tax on the Happy Path · What I learned by removing the elaborate CI gate I&apos;d built to slow my AI agents down — after realizing it had never actually caught anything.</title>
				<description>&lt;p&gt;I was halfway through another PR, regenerating the proof file for the fourth time after rebasing on main, when I realized I couldn’t name a single PR the permit gate had caught. Not “caught that other CI didn’t also catch.” Just &lt;em&gt;caught&lt;/em&gt;. I decided to look up the numbers, and then I removed the gate entirely.&lt;/p&gt;

&lt;h2 id=&quot;what-the-gate-was&quot;&gt;What the Gate Was&lt;/h2&gt;

&lt;p&gt;The short version: every PR to &lt;a href=&quot;/zabriskie/&quot;&gt;Zabriskie&lt;/a&gt; had to be accompanied by a permit file in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.caucus/permits/&amp;lt;branch&amp;gt;.json&lt;/code&gt;. The permit declared the scope of the change (which paths were allowed to move), the risk level (R1 through R3+), and a list of allowlisted test commands the author was offering as evidence. R2 and higher required a proof file too, generated by actually running those commands and recording the output alongside a fingerprint of the working tree. A CI job called Caucus Permit Gate ran on every PR. It blocked the merge unless the permit existed, the scope covered the diff, and (for R2+) the proof was fresh and the tests passed.&lt;/p&gt;

&lt;p&gt;I built it because I’d spent months watching agents read a rule in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt;, acknowledge it in conversation, and then ship code that violated it anyway. &lt;a href=&quot;/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html&quot;&gt;Memory Isn’t Learning&lt;/a&gt; was the post where I argued that prose in an instruction file is a journal of failures, not a guardrail. The permit gate was supposed to be the structural answer: the rule stops being a sentence and starts being a required status check that CI evaluates against the actual diff.&lt;/p&gt;

&lt;p&gt;The gate shipped. It got bypassed, broken, and accidentally self-deadlocked in nine different ways, each of which got a hardening patch. &lt;a href=&quot;/ai/zabriskie/agents/reliability/caucus/2026/04/14/opt-in-isnt-a-guardrail.html&quot;&gt;Opt-In Isn’t a Guardrail&lt;/a&gt; was the post about those failures. By the time I sat down to write this one, none of that was still happening. The paperwork was filed, the scope was matching, the proofs were fresh, the tests were passing. I killed the gate anyway.&lt;/p&gt;

&lt;h2 id=&quot;the-count&quot;&gt;The Count&lt;/h2&gt;

&lt;p&gt;Over the last hundred CI runs on Zabriskie (roughly a week of shipping, so take the sample for what it is) there were four Caucus Permit Gate failures. Every one of them was a “Proof failed” result, which means one of the allowlisted test commands the proof re-runs returned non-zero. The same commands run in the Build Backend, Unit Tests, and E2E Tests jobs, which are separate CI jobs on every PR. In every case, those dedicated jobs also failed, for the same reason, on the same PR.&lt;/p&gt;

&lt;p&gt;The gate’s unique contributions, the ones that were supposed to justify its existence, never fired as the blocking failure. Scope allowlist violations: zero. Risk-level enforcement: zero. Proof fingerprint drift is a regeneration trigger by design, not a catch, so it was never going to appear on that list, and I should have noticed that before I put it on the list. The gate caught exactly nothing in that window that the plain test jobs wouldn’t have caught five minutes later.&lt;/p&gt;

&lt;h2 id=&quot;the-cost&quot;&gt;The Cost&lt;/h2&gt;

&lt;p&gt;The proof file’s validity is tied to a fingerprint of the working tree. Every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git merge origin/main&lt;/code&gt; changes that fingerprint, which invalidates the proof, which means the proof has to be regenerated, which means the allowlisted commands have to be re-run end to end. On Zabriskie that’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cd backend &amp;amp;&amp;amp; go build ./...&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cd backend &amp;amp;&amp;amp; go test ./...&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cd web &amp;amp;&amp;amp; npm run build&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cd web &amp;amp;&amp;amp; npx playwright test&lt;/code&gt;. Playwright alone takes several minutes and burns an entire stack of browser processes.&lt;/p&gt;

&lt;p&gt;Every time main moved underneath an in-flight PR, the compliant action (rebase, resync, then push) forced the same PR to re-run the same tests it had already run, &lt;em&gt;twice&lt;/em&gt;, once in the proof regeneration and once in the downstream CI jobs. During an active week of merges that can happen five or six times on a single branch.&lt;/p&gt;

&lt;p&gt;This is the part I did not see clearly when I was building it. The cost of the gate scaled with velocity: the more main moved, the more proofs had to be regenerated, the more tests had to run twice. The benefit didn’t scale with anything I could measure. It sat flat at zero catches that weren’t already caught downstream. A gate whose cost grows with how much work is happening, and whose benefit does not grow at all, is not a guardrail. It is a tax on the branches doing the right thing, collected in service of nothing.&lt;/p&gt;

&lt;p&gt;Agents are not insulated from this. They pay in tokens, in extra tool calls, in context spent on regeneration steps that accomplish nothing, in CI wait time that blocks the next thing they were going to do. The dollars come out of my pocket, but the friction lands on them.&lt;/p&gt;

&lt;h2 id=&quot;the-kill-pr&quot;&gt;The Kill PR&lt;/h2&gt;

&lt;p&gt;The kill PR was small. Remove the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;permit-gate&lt;/code&gt; job from the CI workflow. Strip the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;needs: [permit-gate]&lt;/code&gt; dependency from the downstream jobs. Delete the pre-push hook that enforced the same check locally. Delete the corresponding rule from the agent instructions file. Branch protection had to be updated to drop the gate as a required status check, which is the only part of the change that required an admin action on the repository.&lt;/p&gt;

&lt;p&gt;What I did &lt;em&gt;not&lt;/em&gt; do, and what I am deliberately leaving alone for now, is delete the scripts and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.caucus/&lt;/code&gt; history of accumulated permits and proofs. That data is a record of how agents behaved under a specific constraint across the life of the experiment. I don’t want to throw it away until I’ve looked at it properly. There’s a version of this story where the permits themselves, as artifacts, turn out to be more useful than the gate that validated them. I don’t know yet whether that’s true. But the gate’s cost and benefit are clear enough that I don’t need to know before pulling the plug.&lt;/p&gt;

&lt;h2 id=&quot;what-i-am-not-saying&quot;&gt;What I Am Not Saying&lt;/h2&gt;

&lt;p&gt;I am not saying there should be no gates. I am not saying the Caucus experiment was wasted. I’m saying that &lt;em&gt;this particular implementation&lt;/em&gt; of &lt;em&gt;this particular gate&lt;/em&gt; was charging an ongoing cost larger than the value of what it caught, and that cost is a first-class design concern, not a tradeoff you can wave away by pointing at the artifacts it produced.&lt;/p&gt;

&lt;p&gt;Every gate has to answer two questions. Does it catch failures the other checks wouldn’t catch? And does the cost of passing it, summed over the lifetime of the repo, stay below the cost of whatever it’s preventing? If the answer to the first is “no” and the second question never even gets asked, the gate has negative value even when it technically works.&lt;/p&gt;

&lt;p&gt;The question I was asking in &lt;a href=&quot;/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html&quot;&gt;Memory Isn’t Learning&lt;/a&gt; was whether the structural version of a guardrail would behave differently from the journaled one. The answer is: structural guardrails are necessary but not sufficient. They have to be structural &lt;em&gt;and&lt;/em&gt; the ongoing cost of compliance has to stay below the cost of the failures they prevent. Otherwise the gate is a net negative no matter how firmly it’s bolted in.&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next&lt;/h2&gt;

&lt;p&gt;I still think the basic idea behind Caucus is right. An agent pushing a change to a shared branch should have to declare the scope of the change, the risk it’s taking, and the evidence it’s offering that the change is correct. What I got wrong was the enforcement surface. The gate lived at merge time, which is the most expensive place to check anything, and it checked by re-running the same tests the merge was already going to run.&lt;/p&gt;

&lt;p&gt;I don’t know yet what the next version looks like. The honest sentence is that I don’t know how to validate a permit continuously against a moving diff without re-running the work the tests are already doing. That was the whole problem the first time, and I skipped over it by putting the check at merge time and eating the cost. Until I can answer that question, I don’t have a V2. I have a V1 that I’ve turned off and a list of things I know V2 can’t do.&lt;/p&gt;

&lt;p&gt;I had the information to turn this off weeks before I did, and I don’t have a mechanism for catching the next version of this mistake any earlier.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;This is part of a series about building &lt;a href=&quot;https://zabriskie.app&quot;&gt;Zabriskie&lt;/a&gt; with Claude. Previously: &lt;a href=&quot;/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html&quot;&gt;Memory Isn’t Learning&lt;/a&gt;, &lt;a href=&quot;/ai/engineering/2026/04/01/software-engineering-is-becoming-civil-engineering.html&quot;&gt;Software Engineering Is Becoming Civil Engineering&lt;/a&gt;, &lt;a href=&quot;/ai/agents/reliability/zabriskie/2026/04/08/cursor-agents-caucus-v1.html&quot;&gt;Caucus V1&lt;/a&gt;, &lt;a href=&quot;/ai/verification/zabriskie/agents/2026/04/09/the-structural-engineers-other-job.html&quot;&gt;The Structural Engineer’s Other Job&lt;/a&gt;, &lt;a href=&quot;/ai/zabriskie/agents/reliability/caucus/2026/04/14/opt-in-isnt-a-guardrail.html&quot;&gt;Opt-In Isn’t a Guardrail&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</description>
				<pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/zabriskie/agents/reliability/caucus/2026/04/21/the-tax-on-the-happy-path.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/zabriskie/agents/reliability/caucus/2026/04/21/the-tax-on-the-happy-path.html</guid>
			</item>
		
			<item>
				<title>Opt-In Isn&apos;t a Guardrail · A CI gate I built to slow AI agents down passed every paperwork check, missed every real bug, and blocked the rollback during a production outage.</title>
				<description>&lt;p&gt;Tonight, Goose was playing a very special set two. The band was spelling out SUCK IT STORM with their song titles, a response to a critic named Ryan Storm who’d publicly trashed their setlist choices and alternative arrangements, and people in the live chat were losing their minds about it in real time. Zabriskie was built for exactly this moment: friends sharing what they’re hearing, live chat going, the app doing the one thing it exists to do.&lt;/p&gt;

&lt;p&gt;The feed was blank. The landing page was blank. Every route in the app was rendering nothing.&lt;/p&gt;

&lt;p&gt;I rolled back three deploys directly in Railway while Goose played. I couldn’t revert via a PR because the permit gate and CI pipeline I’d spent three weeks building would need to pass first, and during a production outage with a live show happening, I didn’t have ten minutes. The system I’d built to slow things down was now the thing in the way.&lt;/p&gt;

&lt;p&gt;The agents had done everything the system asked of them. Every PR had a permit. Every permit had a proof file. Every proof file pointed at passing tests. The tests passed. The gate was working exactly as designed. None of it tested whether the page actually rendered without crashing.&lt;/p&gt;

&lt;p&gt;The gate caught agents who forgot to file paperwork. It did not catch agents who filed correct paperwork for broken code. And when the building was on fire, the gate blocked the fire truck too.&lt;/p&gt;

&lt;h2 id=&quot;what-i-built&quot;&gt;What I Built&lt;/h2&gt;

&lt;p&gt;A few weeks ago I wrote &lt;a href=&quot;/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html&quot;&gt;Memory Isn’t Learning&lt;/a&gt; about the loop where Claude saves a rule to disk, ignores it, ships the bug, and saves the rule again. The thesis was that prose in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt; is a journal of failures, not a guardrail. After that post, I tried to do something about it.&lt;/p&gt;

&lt;p&gt;I built a thing called the Caucus Permit Gate. Every PR has to be accompanied by a permit file, a JSON document that declares the scope of the change, classifies the risk, and points at a proof file with concrete evidence. CI checks the permit before letting the PR pass. AGENTS.md says, in bold:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Do not treat the permit as optional paperwork.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since I built that gate, agents have skipped the permit, broken the permit, or gone around the permit &lt;strong&gt;nine times&lt;/strong&gt;, all of them now in main. Each time I added another guardrail. Each guardrail was itself ignored, defeated, or in one case, locked the gate against everyone including the agent that had just installed it.&lt;/p&gt;

&lt;h2 id=&quot;ten-cycles&quot;&gt;Ten Cycles&lt;/h2&gt;

&lt;p&gt;The first cycle was the gate itself. Branch-scoped CI check, scope and risk validation. It worked. PRs without a permit failed CI.&lt;/p&gt;

&lt;p&gt;Then PR #246 went up without a permit. The agent had read AGENTS.md. The agent had acknowledged the rule. The agent pushed anyway. Fix: an expansion of AGENTS.md with an explicit “Opening a PR” bullet. Text on text.&lt;/p&gt;

&lt;p&gt;So I shipped a script and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.githooks/pre-push&lt;/code&gt; hook that would block the push if the permit file was missing. The hook was opt-in. You had to run an installer per clone. Then I shipped an installer for the installer, because writing “you can run this script if you want to” in AGENTS.md and calling it a guardrail was the same kind of failure the gate was supposed to prevent. A text-only fix to a rule about how text-only fixes don’t work.&lt;/p&gt;

&lt;p&gt;Three more PRs went up. Three more times the same pattern. An agent pushes without a permit, or with a broken permit, or with a permit that passed locally but failed in CI because the hook was validating the working tree instead of what was actually being pushed. Each time, a new patch. Each patch, a new edge case.&lt;/p&gt;

&lt;p&gt;That’s seven cycles. Six patches on the previous patch.&lt;/p&gt;

&lt;p&gt;Three more arrived in a single PR while I was writing. PR #293, titled &lt;em&gt;“Harden permit-gate workflow and fail-fast CI sequencing,”&lt;/em&gt; had twenty-three commits on the branch. The permit gate failed but CI kept running the rest of the jobs, burning resources on a PR that couldn’t merge. The pre-push hook regenerated the proof file but didn’t re-check for a dirty working tree, so stale proof reached CI while local checks looked clean. Then the dirty-tree check created a self-deadlock: every push regenerated the proof, which immediately failed the dirty check, which meant no push could ever complete. The gate locked everyone out, itself included.&lt;/p&gt;

&lt;p&gt;If you read the commit messages on that branch in order they look like a heart monitor: &lt;em&gt;sync permit, refresh proof, record latest proof run, skip proof rerun, align pre-push proof freshness, refresh proof after guardrail alignment, refresh proof after final hook guardrail changes.&lt;/em&gt; That sequence is now in main. Anyone who clones this repo tomorrow inherits it.&lt;/p&gt;

&lt;h2 id=&quot;the-pattern&quot;&gt;The Pattern&lt;/h2&gt;

&lt;p&gt;Every fix was either text, opt-in, or recursive.&lt;/p&gt;

&lt;p&gt;The text fixes were edits to AGENTS.md. &lt;em&gt;The permit is required. The permit is really required. Do not treat the permit as optional paperwork.&lt;/em&gt; Each round felt productive. None of them changed the behavior of the next agent.&lt;/p&gt;

&lt;p&gt;The opt-in fixes were scripts you could run to install a hook. The installer is in the repo. It is documented. It is one line of shell. The agents that need it most are the ones least likely to install it, because installing it is a step that has nothing to do with the task they were asked to perform. Asking an agent racing toward a PR to first run a setup script with no effect on the task it was given is asking it to slow down voluntarily for a future benefit it can’t see. Agents don’t do that. Humans barely do that.&lt;/p&gt;

&lt;p&gt;The recursive fixes are the most interesting. Cycle 9 hardened cycle 7. Cycle 10 hardened cycle 9 by undoing the file-mutation behavior cycle 9 introduced. Each layer of guard had a hole. Each patch introduced a new one. The result is a gate that is mechanically defeatable, in the way that a door is defeatable when the lock is sitting on the table next to it with a sign that says &lt;em&gt;please use this lock&lt;/em&gt;. Or, sometimes, when the lock is glued to the door so firmly that nobody can open it from either side.&lt;/p&gt;

&lt;p&gt;In structural engineering, you don’t put up a sign at the entrance of a parking garage that says “no vehicles over 6 feet 10 inches.” You hang a steel clearance bar. The driver who ignores the sign hits the bar and has to back out. The sign is advisory. The bar is structural. The permit gate as I built it is a sign. The guardrail has to live in a place the agent has to pass through. Not a place the agent could pass through if it chose to.&lt;/p&gt;

&lt;p&gt;This is the deeper version of what I wrote in &lt;a href=&quot;/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html&quot;&gt;Memory Isn’t Learning&lt;/a&gt;. There I argued that saving a note and ignoring it is journaling, not learning. The implicit hope was that mechanical guards would be different. That if the rule was code, not prose, the agent would be forced to comply. The permit gate was the test of that hope.&lt;/p&gt;

&lt;p&gt;It failed. Mechanical guards aren’t different if they’re optional. Opt-in isn’t a guardrail. It’s a sign in a different font.&lt;/p&gt;

&lt;p&gt;And then the gate started working, and the app went down anyway.&lt;/p&gt;

&lt;h2 id=&quot;what-the-gate-didnt-test&quot;&gt;What the Gate Didn’t Test&lt;/h2&gt;

&lt;p&gt;PR #294 added a heart burst animation for favoriting live chat comments. It put a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;useMemo&lt;/code&gt; and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;useEffect&lt;/code&gt; &lt;em&gt;after&lt;/em&gt; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if (loading)&lt;/code&gt; and error conditional returns in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SDUIPage.jsx&lt;/code&gt;, which wraps nearly every page in the app. That violates React’s Rules of Hooks: hooks have to be called in the same order, and therefore the same number of times, on every render. While the page was loading, the early return ran before those two hooks, so React saw fewer hooks than it did once loading finished. React threw. The page went white.&lt;/p&gt;

&lt;p&gt;The DOM assertions passed on the loading state. React crashed on the loaded state. The test saw the spinner. The user saw nothing.&lt;/p&gt;

&lt;p&gt;The forward fix was PR #304. It moved the hooks above the conditional returns, logged a critical incident, and added a Playwright guardrail that attaches &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;page.on(&apos;pageerror&apos;)&lt;/code&gt; to the smoke tests so uncaught React errors now fail CI. The test that should have existed from the beginning.&lt;/p&gt;

&lt;h2 id=&quot;the-honest-part&quot;&gt;The Honest Part&lt;/h2&gt;

&lt;p&gt;I have to say this part because it would be dishonest not to. Most of the ten fixes were implemented by Claude. I asked it to fix the gate, and Claude (a different session each time) wrote a script, edited AGENTS.md, added a sentence, expanded a stderr message. Each of those PRs looked like a fix. Each of them passed review because &lt;em&gt;I&lt;/em&gt; was the reviewer and I was reading them the same way the agents were writing them: as if a sentence in a file was the same as a constraint in the system.&lt;/p&gt;

&lt;p&gt;It’s not. I should have known that. I wrote a whole post about it. And I still spent three weeks watching the same gate fail in slightly different ways while approving fixes that had no chance of working, because the fixes felt like progress and progress felt like the goal. Then I watched the gate finally work and the app still go down, because I’d been so focused on whether agents were filing permits that I forgot to check whether the permits were proving anything.&lt;/p&gt;

&lt;p&gt;The gap between “I know better” and “I do better” is the same gap I keep accusing the model of having. It’s just slower in me.&lt;/p&gt;

&lt;h2 id=&quot;where-this-goes&quot;&gt;Where This Goes&lt;/h2&gt;

&lt;p&gt;The gate has to be the floor, not a sign. The tests have to prove the page works, not that the DOM exists. Memory wasn’t learning. Documentation isn’t enforcement. Opt-in isn’t a guardrail. And a guardrail that checks the wrong thing is just a more convincing kind of nothing.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;This is part of a series about building &lt;a href=&quot;https://zabriskie.app&quot;&gt;Zabriskie&lt;/a&gt; with Claude. Previously: &lt;a href=&quot;/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html&quot;&gt;Memory Isn’t Learning&lt;/a&gt;, &lt;a href=&quot;/ai/engineering/2026/04/01/software-engineering-is-becoming-civil-engineering.html&quot;&gt;Software Engineering Is Becoming Civil Engineering&lt;/a&gt;, &lt;a href=&quot;/ai/agents/reliability/zabriskie/2026/04/08/cursor-agents-caucus-v1.html&quot;&gt;Caucus V1&lt;/a&gt;, &lt;a href=&quot;/ai/verification/zabriskie/agents/2026/04/09/the-structural-engineers-other-job.html&quot;&gt;The Structural Engineer’s Other Job&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</description>
				<pubDate>Tue, 14 Apr 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/zabriskie/agents/reliability/caucus/2026/04/14/opt-in-isnt-a-guardrail.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/zabriskie/agents/reliability/caucus/2026/04/14/opt-in-isnt-a-guardrail.html</guid>
			</item>
		
			<item>
				<title>The Structural Engineer&apos;s Other Job · AI-generated code increasingly passes review and tests but ships half-working features. What humans have to check for instead.</title>
				<description>&lt;p&gt;I’ve been building &lt;a href=&quot;/ai/zabriskie/community/2026/03/08/why-im-building-zabriskie.html&quot;&gt;Zabriskie&lt;/a&gt; for a few months now, mostly with AI agents. Claude Code writes the backend handlers, builds the SDUI screens, registers the routes, and, importantly, writes the Playwright tests. The test suite has grown to over 150 E2E tests. CI is green. I’m shipping fast.&lt;/p&gt;

&lt;p&gt;But I keep finding the same category of bug.&lt;/p&gt;

&lt;p&gt;The RSVP feature on show pages is a good example. A user taps the RSVP button, their attendance is saved, and a post is supposed to appear in the feed so their friends know they’re going. Claude built the handler, registered the route, wired up the SDUI component. The code compiled. The tests passed. I reviewed the diff, it looked right. A few days later I noticed the feed was quiet. People were RSVPing to shows but no posts were appearing. When I dug in, the attendance was saving fine. The button worked. But the downstream feed post creation was silently failing: when a show had no &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;media_item_id&lt;/code&gt;, the handler was passing an empty string to a UUID column, which threw an “invalid syntax” error that got swallowed by the error handling. Users tapped the button, saw confirmation, and had no idea their RSVP was never announced. The feature half-worked, which is worse than not working at all.&lt;/p&gt;

&lt;p&gt;And this wasn’t a one-off. A Quick Post button rendered on the page but did nothing when tapped: it was missing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;buttonType: &quot;submit&quot;&lt;/code&gt;, so the click event was swallowed by the form container. A “Watch Livestream” button worked fine during the day but vanished every evening because a UTC truncation bug made tonight’s show look like yesterday’s, hiding the button right when the band took the stage. Search results stopped being clickable because a fix to prevent click events from bubbling in comment forms accidentally blocked all click actions inside &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;form&amp;gt;&lt;/code&gt; tags. I wrote about the worst case in &lt;a href=&quot;/ai/zabriskie/reliability/2026/04/03/the-feature-that-has-never-worked.html&quot;&gt;The Feature That Has Never Worked&lt;/a&gt;: an auto-live poller that broke seven times in thirteen days, each fix introducing a new failure mode, while the UI calmly displayed “scheduled” as Billy Strings played to a sold-out amphitheatre.&lt;/p&gt;

&lt;p&gt;Every one of these PRs had passing tests. Each one would survive a mechanical code review. The types were correct, the logic was plausible, the patterns matched existing code. An AI reviewer scanning for boundary conditions and API misuse would approve all of them.&lt;/p&gt;

&lt;p&gt;The problem was the same every time: the Playwright tests that Claude wrote verified that UI elements &lt;em&gt;existed&lt;/em&gt; without verifying that the features &lt;em&gt;worked&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;the-tests-that-dont-use-the-feature&quot;&gt;The Tests That Don’t Use the Feature&lt;/h2&gt;

&lt;p&gt;When I went back and looked at what Claude had actually written in the test files, the pattern was consistent. The RSVP test checked that the button was present on the show page. It might even click it. But it never navigated to the feed afterward to check that a post appeared. The Quick Post test confirmed the form rendered but never submitted it. The livestream test checked for the button during the day but never ran at the time of an actual show.&lt;/p&gt;

&lt;p&gt;This makes sense if you think about how the tests get written. The agent finishes implementing a feature, then writes a test that exercises the code path it just built. The test is shaped by the implementation, not by the user’s experience. It verifies the thing the agent &lt;em&gt;made&lt;/em&gt; (a button, a form, a component), not the thing the user &lt;em&gt;does&lt;/em&gt; (RSVP and see it in the feed, fill out a form and see the result, tap a livestream link during a live show).&lt;/p&gt;

&lt;p&gt;I could read through every test the AI writes and audit whether it actually exercises the full user workflow. But that puts me right back where I started: I’m the bottleneck, just reviewing test code instead of reviewing application code. The test suite gives me a green checkmark. It doesn’t give me confidence.&lt;/p&gt;

&lt;p&gt;This isn’t just my problem. Anthropic &lt;a href=&quot;https://www.anthropic.com/news/claude-code-review&quot;&gt;launched a code review tool&lt;/a&gt; in March explicitly because code review has become a bottleneck, and even with AI review, the code is shipping faster than anyone can verify it.&lt;/p&gt;

&lt;p&gt;AI-powered review tools help with the mechanical side: style, boundary conditions, common bug patterns. But they share the same fundamental limitation as human review and AI-written tests: they read the code and reason about it. They don’t run it. They can tell you the handler is registered and the component has the right props. They can’t tell you that when a user taps RSVP, a post actually appears in the feed.&lt;/p&gt;

&lt;h2 id=&quot;the-agent-at-a-computer&quot;&gt;The Agent at a Computer&lt;/h2&gt;

&lt;p&gt;This is where something genuinely new has happened.&lt;/p&gt;

&lt;p&gt;Cursor’s cloud agents don’t just write code in a text editor. Each agent gets its own virtual machine, a real computer with a browser, a terminal, and the ability to interact with running software. The agent writes the code, starts the application, navigates through it like a user would, takes screenshots, records video of the feature working, and attaches all of that to the pull request. More than &lt;a href=&quot;https://cursor.com/blog/agent-computer-use&quot;&gt;30% of the PRs merged at Cursor&lt;/a&gt; itself are now created by these agents operating autonomously in cloud sandboxes. OpenAI’s Codex has moved in the same direction, &lt;a href=&quot;https://openai.com/index/harness-engineering/&quot;&gt;wiring Chrome DevTools Protocol into the agent runtime&lt;/a&gt; so the agent can start a browser, inspect the DOM, take screenshots, and reason about UI behavior directly.&lt;/p&gt;

&lt;p&gt;The reviewer doesn’t mentally simulate a diff. They watch the feature work.&lt;/p&gt;

&lt;p&gt;That difference matters more than it sounds. When I review a PR with video evidence attached, I’m not reading test assertions or tracing code paths in my head. I’m watching someone (well, &lt;em&gt;something&lt;/em&gt;) tap the RSVP button and then check the feed. I can see whether the post appeared. I can see whether the button animation played. I can see whether the page scrolled correctly afterward. Thirty seconds of video tells me more about whether a feature works than a 200-line Playwright spec ever could, because the video shows the &lt;em&gt;outcome&lt;/em&gt;, not the &lt;em&gt;mechanism&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And I’d already been moving in this direction myself before I knew about Cursor’s cloud agents. In &lt;a href=&quot;/ai/zabriskie/development/android/ios/2026/03/22/teaching-claude-to-qa-a-mobile-app.html&quot;&gt;Teaching Claude to QA a Mobile App&lt;/a&gt;, I described building a system where Claude drives the iOS and Android simulators for Zabriskie, connecting to Android WebViews via Chrome DevTools Protocol over ADB, fighting with Apple’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;idb&lt;/code&gt; tools for six hours to get iOS working. I built a nightly sweep that launches both simulators, navigates all 25 screens, takes screenshots, analyzes them for visual issues, and files bug reports automatically as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zabriskie_bot&lt;/code&gt; in the production forum. The instinct was the same: look at the app running. Don’t just read the code.&lt;/p&gt;

&lt;h2 id=&quot;the-witness&quot;&gt;The Witness&lt;/h2&gt;

&lt;p&gt;In formal verification, this concept has a name: a &lt;em&gt;witness&lt;/em&gt;. A witness is a concrete piece of evidence that a thing works. Not an argument that it &lt;em&gt;should&lt;/em&gt; work, but proof that it &lt;em&gt;did&lt;/em&gt; work. The screenshot is a witness. The video is a witnessed execution trace. The agent isn’t just building the feature. It’s constructing evidence that the feature works.&lt;/p&gt;

&lt;p&gt;A witness is legible without reading code. You can hand the video to a product manager and they can tell you whether the feature works. Try that with a Playwright spec.&lt;/p&gt;

&lt;p&gt;But some witnesses are harder to construct than others, and this is where the real complexity hides.&lt;/p&gt;

&lt;p&gt;A screenshot proves something rendered. A video of a click proves a button responds. But many features aren’t a single interaction. They’re stateful workflows that span multiple operations over time. RSVP to a show so a post appears in the feed so another user can see it and bookmark it. Subscribe to a service so you get a discounted fee at checkout next week. Add items to a cart, apply a coupon, and verify the total reflects both. The witness for these features isn’t one screenshot. It’s a chain of evidence across multiple steps, where each step depends on state persisted in a database or a queue from a prior step. The agent has to perform step A, verify the side effect was stored, then come back and perform step B and verify that step B’s behavior reflects what step A wrote. That’s a harder witness to construct, and it’s exactly where the most important bugs live. The RSVP bug was precisely this shape. The button worked. The state didn’t propagate. A single-screenshot witness would have approved that PR.&lt;/p&gt;

&lt;h2 id=&quot;the-witness-is-not-a-test-suite&quot;&gt;The Witness Is Not a Test Suite&lt;/h2&gt;

&lt;p&gt;Here’s the thing I can’t let myself forget: a witness proves the feature under test works &lt;em&gt;right now&lt;/em&gt;. It says nothing about whether it broke something else.&lt;/p&gt;

&lt;p&gt;Every incident I’ve logged in the Zabriskie reliability database reinforces this. Fixing click handling in comment forms broke search result selection. Adding authentication middleware to 25 unprotected routes broke the live show pill. The avatar upload fix that worked for one handler left broken images on four other pages. Twenty-four percent of all commits in the Zabriskie codebase are fixes, and they arrive in chains where each fix breaks something adjacent.&lt;/p&gt;

&lt;p&gt;You can’t build an application on video evidence alone. A witness is an existence proof: “this works.” A regression suite is a universal proof: “nothing that used to work is now broken.” You need both. The video catches the bugs that tests miss, the ones where the test verifies the button exists but nobody checked whether it actually does anything. The regression suite catches the things the video didn’t think to look at, the five other features that silently broke when you fixed the one you were focused on.&lt;/p&gt;

&lt;h2 id=&quot;who-builds-the-stage&quot;&gt;Who Builds the Stage?&lt;/h2&gt;

&lt;p&gt;The witness works for Zabriskie because it’s a single application that runs on one machine. I can &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go run cmd/api/main.go&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;npm run dev&lt;/code&gt;, open a browser, and navigate the whole app. An agent can do the same thing on a VM in the cloud. The witness is easy to construct because the environment is easy to provision.&lt;/p&gt;

&lt;p&gt;But even Zabriskie isn’t really one thing. There’s the web app, the iOS app in the App Store, the Android app on Google Play, and the backend API they all talk to. When I push a backend change, the iOS app is still running last week’s code. When I add a new SDUI component type, the web client needs to know how to render it or it silently does nothing. I’ve already lived a small version of the coordination problem: four deployment targets, one developer, and no guarantee they’re all in sync. If this is already hard at my scale, imagine a company with 200 services.&lt;/p&gt;

&lt;p&gt;Most real-world software has this problem worse than I do. And when I started thinking about where the witness concept breaks down at larger scales, I realized the answer is the same in every case: it’s not the agent. It’s the environment. The agent is smart enough to navigate an app and record what it sees. The hard part is giving it an app to navigate.&lt;/p&gt;

&lt;p&gt;That’s a platform engineering problem. And it’s the same one I was pointing at in &lt;a href=&quot;/ai/engineering/2026/04/01/software-engineering-is-becoming-civil-engineering.html&quot;&gt;Software Engineering Is Becoming Civil Engineering&lt;/a&gt;. In that post I used the broad term civil engineering. The role I’m describing here is more specific: the structural engineer, the person inside civil engineering whose job is making sure the thing stands up and can be inspected over its lifetime. The structural engineer doesn’t weld the beams. The structural engineer designs the bridge so that a welder doing their job correctly can’t bring the whole thing down. The platform engineer doesn’t write the feature. The platform engineer builds the infrastructure so that an agent writing a feature can &lt;em&gt;verify&lt;/em&gt; it works. The witness is the agent’s job. The stage is the platform engineer’s job.&lt;/p&gt;

&lt;p&gt;Every scaling challenge for the witness turns out to be a question about whether someone has built the right stage.&lt;/p&gt;

&lt;h3 id=&quot;mobile&quot;&gt;Mobile&lt;/h3&gt;

&lt;p&gt;For mobile, the stage is a simulator. I &lt;a href=&quot;/ai/zabriskie/development/android/ios/2026/03/22/teaching-claude-to-qa-a-mobile-app.html&quot;&gt;wrote about this in detail&lt;/a&gt;: Android was workable in 90 minutes because Capacitor WebViews expose a Chrome DevTools Protocol socket, the same protocol that powers Playwright. iOS took over six hours because Apple’s tooling isn’t designed for headless automation. The cloud infrastructure exists (macOS VMs, GitHub Actions runners with simulator support) but it isn’t built for ephemeral agent-per-PR workflows the way a Linux VM with Chrome is. The agents are capable today. The platform work is making the simulation layer automatable enough that an agent can spin one up, use the app, and tear it down without human intervention.&lt;/p&gt;

&lt;h3 id=&quot;microservices&quot;&gt;Microservices&lt;/h3&gt;

&lt;p&gt;For distributed systems, the stage is an environment where all the relevant services are running together. This is the most interesting case, and there are two very different versions of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The monorepo world.&lt;/strong&gt; Companies like Google, Meta, Twitter, and Uber keep most of their services in a single repository. The key advantage is that cross-cutting changes can be made atomically: one PR that touches the API gateway, the billing service, and the notification service, all committed together. An agent working in a monorepo can, in principle, make a coordinated change across multiple services in a single diff.&lt;/p&gt;

&lt;p&gt;But can you &lt;em&gt;run&lt;/em&gt; all those services on one machine to produce a witness? That depends entirely on the application’s complexity. A monorepo with 15 services might be runnable with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;docker compose&lt;/code&gt; on a beefy VM. A monorepo with 500 services almost certainly isn’t. And even for the 15-service case, the bootstrapping problem is real: database migrations, seed data, service discovery, mock credentials for third-party APIs. The question isn’t whether the agent can write the code. It’s whether the platform team has made the application &lt;em&gt;bootable&lt;/em&gt; enough for the agent to stand it up and use it.&lt;/p&gt;

&lt;p&gt;This reframes what “testable” means. It’s not just about code coverage or CI pipelines anymore. It’s about whether an autonomous agent can cold-start your system from a fresh checkout and get to a state where it can navigate through a feature. That’s a higher bar than most organizations have cleared, and I think the pressure from agentic development is going to become &lt;em&gt;the&lt;/em&gt; forcing function for platform investment. If an agent can’t boot your app, an agent can’t verify your app. Making the system bootable is platform engineering work. It’s the structural engineer designing the inspection regime for the bridge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The polyrepo world.&lt;/strong&gt; Companies that use separate repositories for each service face a fundamentally different problem. A single feature, say adding a discount for subscribers, might require changes to the user service (store subscription status), the pricing service (check subscription at checkout), and the checkout frontend (display the discounted total). That’s three PRs in three repositories, reviewed by three different teams, merged on three different timelines.&lt;/p&gt;

&lt;p&gt;Each service has to make a backwards-compatible change. The new pricing service has to work whether or not the user service change has been deployed yet. The checkout frontend has to handle both the old price and the new discounted price gracefully. The standard practice is semantic versioning, feature flags, and contract testing: deploy each change independently, verify it doesn’t break existing consumers, activate the feature once all the pieces are in place.&lt;/p&gt;

&lt;p&gt;But here’s the question that video evidence forces you to ask: when does anyone actually &lt;em&gt;see&lt;/em&gt; the discount appear on the checkout page? Each PR gets reviewed in isolation. Each service’s CI passes independently. But the &lt;em&gt;feature&lt;/em&gt;, the thing the user actually experiences, doesn’t exist until all three changes are deployed together. The witness for the feature can’t be produced at review time. It can only be produced after all the independently-reviewed changes are merged and deployed. By then, the code was written days or weeks ago. If something’s wrong (the discount doesn’t apply, or it applies twice, or the price flickers between old and new) the feedback loop is enormously long compared to what an agent on a single machine can do.&lt;/p&gt;

&lt;p&gt;The platform engineering answer is a coordination layer that can stage cross-service changes together before any of them merge. Imagine a system, not unlike the &lt;a href=&quot;/ai/agents/reliability/zabriskie/2026/04/08/cursor-agents-caucus-v1.html&quot;&gt;Caucus&lt;/a&gt; workflow I’ve been building for code review, but where each service gets its own agent. The agents make their changes independently, each producing backwards-compatible modifications. But before any of them merge, a coordinator stages all the changes together in an ephemeral environment and runs the end-to-end user journey. One agent starts the user service with the subscription change. Another starts the pricing service with the discount logic. A third starts the checkout UI. The coordinator navigates the full flow (subscribe, browse, add to cart, see the discount at checkout) and produces video evidence of the feature working across the composed system.&lt;/p&gt;

&lt;p&gt;You could even roll individual services forward and back: does the feature degrade gracefully if only the user service and pricing service are updated, but the checkout UI is still on the old version? That’s backwards-compatibility testing through exploration rather than assertion. The agents become a way to simulate deployment order, probing the combinatorial space of “which services have been updated” without deploying anything to production.&lt;/p&gt;
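
&lt;p&gt;To make that combinatorial space concrete, here is a minimal Python sketch that only enumerates the states a coordinator would have to stage. The service names come from the discount example above; the function name, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;old&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new&lt;/code&gt; labels, and the idea of representing a staged environment as a dictionary are all assumptions made for illustration. Nothing here boots a service or records a witness.&lt;/p&gt;

```python
from itertools import product

# Services from the subscriber-discount example; the identifiers are illustrative.
SERVICES = ["user-service", "pricing-service", "checkout-frontend"]


def deployment_states(services: list[str]) -> list[dict[str, str]]:
    """Return every assignment of 'old' or 'new' to each service.

    Each dictionary is one environment a coordinator could stage and then
    walk through the user journey in, without touching production.
    """
    return [
        dict(zip(services, versions))
        for versions in product(["old", "new"], repeat=len(services))
    ]


states = deployment_states(SERVICES)

# The partial rollout from the text: backend services updated, UI still old.
partial_rollout = {
    "user-service": "new",
    "pricing-service": "new",
    "checkout-frontend": "old",
}
```

&lt;p&gt;Three services give eight states, and the count doubles with every service that participates in the feature, which is one reason the staging has to be automated rather than clicked through by hand.&lt;/p&gt;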

&lt;p&gt;Nobody has built this yet, as far as I know. But the pieces are converging. Ephemeral preview environments that spin up isolated service meshes per branch. Agent runtimes that can control browsers and navigate applications. Coordination layers that manage multi-agent workflows with structured handoffs. The missing piece is the platform engineering work that connects them: a system that knows which services participate in a feature, stages their changes together, and produces a cross-service witness. That’s infrastructure. That’s the structural engineer’s job.&lt;/p&gt;

&lt;h3 id=&quot;apis-and-sdks&quot;&gt;APIs and SDKs&lt;/h3&gt;

&lt;p&gt;The hardest case is the one where there’s no button to click at all. When you’re building a platform (an API, an SDK, a shared library, an internal service that other teams depend on) your users are other developers. The feature doesn’t have a UI. There’s no page to navigate. There’s no checkout flow to walk through.&lt;/p&gt;

&lt;p&gt;The witness for a platform change might be: does every team that depends on this API still build and pass tests after this change? But producing that witness means checking out &lt;em&gt;their&lt;/em&gt; code, building &lt;em&gt;their&lt;/em&gt; project against your new version, running &lt;em&gt;their&lt;/em&gt; tests. That’s not an agent at a computer. That’s an agent that understands organizational dependency graphs.&lt;/p&gt;

&lt;p&gt;But the platform engineering answer here might be the most straightforward of all: the witness is video evidence &lt;em&gt;of a sample application that uses the API&lt;/em&gt;. This is how good API and SDK development already works in practice: you build a real application that consumes your own product before shipping it to external users. The agent version: make the platform change, check out the reference app, build it against the new version, run it, navigate through it. If the reference app still works, your change is backwards-compatible. If it doesn’t, you’ve caught a breaking change before it reached your consumers. The witness isn’t your library running. It’s what your library &lt;em&gt;enables&lt;/em&gt; running.&lt;/p&gt;

&lt;p&gt;The platform engineer’s job here is maintaining that reference app and keeping it representative. That’s not glamorous work. But without it, there’s no stage for the agent to perform on, and the witness can’t be constructed.&lt;/p&gt;

&lt;h2 id=&quot;the-other-job&quot;&gt;The Other Job&lt;/h2&gt;

&lt;p&gt;In &lt;a href=&quot;/ai/engineering/2026/04/01/software-engineering-is-becoming-civil-engineering.html&quot;&gt;Software Engineering Is Becoming Civil Engineering&lt;/a&gt;, I argued that the profession is splitting: feature development is becoming accessible to non-engineers, but someone still has to design the bridge. I described the platform engineer’s job in terms of API design, load analysis, inspection regimes, self-healing systems.&lt;/p&gt;

&lt;p&gt;I think there’s another item on that list now: making the system witnessable. Building the infrastructure so that when an agent writes a feature, it has somewhere to run it, navigate it, and record the evidence that it works.&lt;/p&gt;

&lt;p&gt;The agent can write the code. The agent can construct the witness. But the platform engineer builds the stage, and that work doesn’t show up on anyone’s roadmap yet.&lt;/p&gt;

&lt;p&gt;I stopped reading the tests. I started watching the video. And the thing I keep coming back to is: the video only works if someone built the infrastructure to make it possible.&lt;/p&gt;
</description>
				<pubDate>Thu, 09 Apr 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/verification/zabriskie/agents/2026/04/09/the-structural-engineers-other-job.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/verification/zabriskie/agents/2026/04/09/the-structural-engineers-other-job.html</guid>
			</item>
		
			<item>
				<title>Caucus V1: Cursor Background Agents and a Multi-Agent Workflow That Actually Loops</title>
				<description>&lt;p&gt;I’ve been using Cursor 3 more over the last week or so because it makes it easy to move between models, and Cursor’s Composer 2 has been producing good results for me. That’s become more important because Opus 4.6 has gotten noticeably worse for me in Zabriskie. Work that used to feel routine, like straightforward file edits or basic follow-through on multi-step changes, now takes forever across multiple prompts, comes back half-finished, or fails basic CI checks. The model that was my primary tool for two months of shipping &lt;a href=&quot;/ai/zabriskie/community/2026/03/08/why-im-building-zabriskie.html&quot;&gt;Zabriskie&lt;/a&gt; has started struggling with things it used to do in one pass.&lt;/p&gt;

&lt;p&gt;That matters here because this project has become, whether I intended it or not, a study in what happens when the model is not consistently reliable. If the agent can’t be trusted to do basic work cleanly every time, then the surrounding system has to take on more of that burden by preserving state between steps, recovering from predictable failures, and making it obvious what happened when something goes wrong.&lt;/p&gt;

&lt;p&gt;The last few posts have been converging on this from different directions. &lt;a href=&quot;/ai/agents/distributed/zabriskie/2026/03/30/multi-agent-systems-have-a-distributed-systems-problem.html&quot;&gt;Multi-Agent Systems Have a Distributed Systems Problem&lt;/a&gt; is where I said most clearly that these systems need real coordination machinery, not just role descriptions and prompt choreography. &lt;a href=&quot;/ai/engineering/2026/04/01/software-engineering-is-becoming-civil-engineering.html&quot;&gt;Software Engineering Is Becoming Civil Engineering&lt;/a&gt; and &lt;a href=&quot;/ai/zabriskie/reliability/2026/04/03/the-feature-that-has-never-worked.html&quot;&gt;The Feature That Has Never Worked&lt;/a&gt; pushed me toward the more practical version of that same conclusion: if the models are inconsistent, the surrounding system has to get stronger.&lt;/p&gt;

&lt;p&gt;So this post is not just adjacent to that earlier multi-agent piece. It’s the first concrete version of the system I was pointing at there. In that post, the argument was that multi-agent software systems need real coordination machinery instead of roleplay and prompt choreography. Caucus V1 is rev 1 of that vision. It’s the first pass at a runtime that actually tries to encode those ideas into the workflow itself.&lt;/p&gt;

&lt;p&gt;This week I built that first version. It’s called Caucus, and for the first time I have a multi-agent loop that actually completes end to end.&lt;/p&gt;

&lt;h2 id=&quot;why-cursor-agents-matter-here&quot;&gt;Why Cursor Agents Matter Here&lt;/h2&gt;

&lt;p&gt;Part of what makes this worth building now is that Cursor’s background agents are not just chat windows with long prompts. They run in the cloud on actual computers. They can check out code, make changes, open pull requests, wait for CI to finish, respond to failures, and keep working while I’m not sitting there driving every keystroke.&lt;/p&gt;

&lt;p&gt;That changes the shape of the problem. If an agent can only draft code in a text box, then “multi-agent” mostly means multiple role descriptions. If an agent can actually live inside a software workflow, then it can do the things a real teammate would do. It can implement a change, open a PR, wait for checks, fix what failed, run the app, and produce evidence that the feature works.&lt;/p&gt;

&lt;p&gt;That’s the part I think people are still underestimating. These agents are not interesting because they can talk about code. They’re interesting because they can operate on real software artifacts over time.&lt;/p&gt;

&lt;p&gt;And the most striking capability is what happens after the code is written. A Cursor background agent can start the application, interact with it, take screenshots of the running UI, and record a video walkthrough demonstrating that the feature actually works. It can attach that evidence to the PR alongside a passing CI build. That’s not “code generation.” That’s a worker producing deliverables with proof.&lt;/p&gt;

&lt;p&gt;That is the enabling condition for the version of Caucus I want. Without agents that can actually execute against a repository, run CI, start the app, and produce visual evidence that the change is correct, multi-agent coordination is mostly theater. My vision is not a cast of roleplaying agents debating architecture in a transcript. It’s a system of agents that can take responsibility for different parts of a real development loop, with the runtime coordinating their work and preserving enough structure that the loop remains legible.&lt;/p&gt;

&lt;h2 id=&quot;what-caucus-v1-actually-is&quot;&gt;What Caucus V1 Actually Is&lt;/h2&gt;

&lt;p&gt;Right now, Caucus V1 is a very small system with a very specific job. It coordinates a fixed set of background agents around a pull request lifecycle.&lt;/p&gt;

&lt;p&gt;In practice, that means one agent implements code changes and opens or updates a PR, another agent reviews that PR and either approves it or requests changes, and the runtime keeps looping until the PR is approved or a safety cap is hit. That loop is the whole point. Most multi-agent demos are linear: agent A hands off to agent B, and the workflow is done. Caucus V1 has a cycle in its workflow graph. The reviewer can send work back to the implementer, and the implementer can re-enter the same PR and refine it.&lt;/p&gt;
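
&lt;p&gt;The shape of that loop can be sketched in a few lines of Python. This is not the Caucus source: the function name, the callable signatures, and the default cap of five rounds are assumptions, and the two agents are stand-in callables rather than real Cursor background workers.&lt;/p&gt;

```python
from typing import Callable


def run_review_loop(
    implement: Callable[[str], str],
    review: Callable[[str], tuple[bool, str]],
    task: str,
    max_rounds: int = 5,  # hypothetical safety cap
) -> tuple[bool, int]:
    """Alternate implement and review until approval or the safety cap.

    `implement` takes a prompt and returns a PR reference.
    `review` takes a PR reference and returns (approved, feedback).
    Returns (approved, rounds_used).
    """
    feedback = ""
    for round_number in range(1, max_rounds + 1):
        # After the first round the implementer also sees the review feedback.
        prompt = task if not feedback else f"{task}\n\nReview feedback: {feedback}"
        pr = implement(prompt)
        approved, feedback = review(pr)
        if approved:
            return True, round_number
    # The cap was hit without an approval; stop instead of looping forever.
    return False, max_rounds
```

&lt;p&gt;The cycle is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;for&lt;/code&gt; loop: a rejected review feeds back into the next implementation turn, and the cap is the only thing that guarantees termination.&lt;/p&gt;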

&lt;p&gt;What makes that cycle possible is the causal history that the runtime carries forward between stages. The implementation agent doesn’t just receive a generic task the second time around. It knows it’s refining a PR based on review feedback &lt;em&gt;because&lt;/em&gt; the handoff tells it so. It knows how many times each role has acted. It knows what the reviewer said. It knows it’s on its second or third pass. That history is what lets the agent infer what the next action should be, rather than starting over from scratch every time it’s invoked.&lt;/p&gt;

&lt;p&gt;That may not sound ambitious, but I think it’s more honest than most multi-agent demos. The hard part is not getting two agents to say different things in sequence. The hard part is giving the system enough structure that the second and third turns are still meaningfully connected to the first one.&lt;/p&gt;

&lt;h2 id=&quot;the-most-important-part-is-the-tiny-vector-clock&quot;&gt;The Most Important Part Is The Tiny Vector Clock&lt;/h2&gt;

&lt;p&gt;The most interesting idea in V1 is probably also the smallest one. Each stage handoff carries an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;actorClock&lt;/code&gt;, which is just a dictionary counting how many times each role has acted so far. If the payload says &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{&quot;implement&quot;: 2, &quot;review&quot;: 1}&lt;/code&gt;, that means the implementer has already acted twice and the reviewer once.&lt;/p&gt;

&lt;p&gt;That is obviously not a full causal history. It is not a general solution to the coordination problem. But it is a real step toward the kind of machinery I was arguing for in &lt;a href=&quot;/ai/agents/distributed/zabriskie/2026/03/30/multi-agent-systems-have-a-distributed-systems-problem.html&quot;&gt;Multi-Agent Systems Have a Distributed Systems Problem&lt;/a&gt;. It’s a tiny version of the vector clock idea: enough ordering information for an agent to know where it is in the workflow without pretending the whole world can be reconstructed from prompt text alone.&lt;/p&gt;

&lt;p&gt;In practice, that matters because the implementer should behave differently on the first pass than it does on a remediation pass. On the first pass, it’s trying to complete the task. On the second or third pass, it’s trying to interpret review feedback and update the same PR without losing the thread. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;actorClock&lt;/code&gt; gives the runtime just enough structure to express that difference.&lt;/p&gt;
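
&lt;p&gt;A minimal sketch of that idea, not the Caucus source: the helper names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;advance_clock&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;implementer_mode&lt;/code&gt; and the two mode labels are assumptions made for illustration. Only the role keys and the shape of the dictionary come from the payload described above.&lt;/p&gt;

```python
import json


def advance_clock(actor_clock: dict, role: str) -> dict:
    """Return a copy of the clock with one more action counted for `role`."""
    updated = dict(actor_clock)
    updated[role] = updated.get(role, 0) + 1
    return updated


def implementer_mode(actor_clock: dict) -> str:
    """Choose the implementer's mode from how often it has already acted.

    The mode names are hypothetical; zero prior implement actions means
    this is the first pass, anything else is a remediation pass.
    """
    return "first_pass" if actor_clock.get("implement", 0) == 0 else "remediation"


clock = {}
print(implementer_mode(clock))             # first_pass

clock = advance_clock(clock, "implement")  # implementer acts
clock = advance_clock(clock, "review")     # reviewer requests changes
print(implementer_mode(clock))             # remediation

clock = advance_clock(clock, "implement")  # implementer acts again
print(json.dumps(clock, sort_keys=True))   # {"implement": 2, "review": 1}
```

&lt;p&gt;Copying the dictionary instead of mutating it keeps each handoff’s clock a snapshot of the moment it was emitted, which is what makes it usable as ordering information later.&lt;/p&gt;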

&lt;p&gt;This is what I mean when I say Caucus V1 is rev 1 of that earlier vision. I’m not claiming to have solved multi-agent coordination. I’m saying the runtime is starting to encode some of the right primitives instead of relying entirely on roleplay and prompt choreography.&lt;/p&gt;

&lt;h2 id=&quot;what-weve-built-so-far&quot;&gt;What We’ve Built So Far&lt;/h2&gt;

&lt;p&gt;None of this is obvious from product names on a screen, so here is the actual job, then what the UI is for.&lt;/p&gt;

&lt;p&gt;The job is to keep two long-lived Cursor background workers alive, one for implementation and one for code review, and to drive them through a loop: implement, review, maybe implement again, until the PR is approved or the loop hits a safety limit. “Caucus” is just the local web UI plus Python code that calls Cursor’s agent API, tracks handoffs, and talks to GitHub when it needs to.&lt;/p&gt;
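
&lt;p&gt;The shape of that loop can be sketched in a few lines. This is an illustration under stated assumptions, not the Caucus orchestrator: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_cycle&lt;/code&gt;, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;summary&lt;/code&gt; field, the decision strings, and the two stand-in workers are hypothetical, and the real workers are remote Cursor sessions rather than local functions.&lt;/p&gt;

```python
from typing import Callable


def run_cycle(task: str,
              implement: Callable[[str, dict], dict],
              review: Callable[[dict, dict], dict],
              max_rounds: int = 3) -> dict:
    """Drive implement -> review until approval or the safety limit."""
    clock = {"implement": 0, "review": 0}
    feedback = ""
    for _ in range(max_rounds):
        clock["implement"] += 1
        impl = implement(task + feedback, dict(clock))
        clock["review"] += 1
        verdict = review(impl, dict(clock))
        if verdict["reviewDecision"] == "approve":
            return {"status": "approved", "prUrl": impl["prUrl"], "actorClock": clock}
        # Carry the reviewer's feedback into the next implementation pass.
        feedback = "\nReview feedback: " + verdict["summary"]
    return {"status": "safety_limit", "actorClock": clock}


# Stand-in workers so the sketch runs locally (hypothetical behavior).
def fake_implement(prompt: str, clock: dict) -> dict:
    return {"prUrl": "https://example.invalid/pr/1"}


def fake_review(impl: dict, clock: dict) -> dict:
    if clock["review"] >= 2:
        return {"reviewDecision": "approve", "summary": ""}
    return {"reviewDecision": "request_changes", "summary": "add a test"}


result = run_cycle("tweak the README", fake_implement, fake_review)
print(result["status"], result["actorClock"])
```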

&lt;p&gt;The UI exists so I do not have to remember API details every time. In one row I can type what I want done (for example a small change in the repo to exercise the loop). Separately, there is a control whose only job is to &lt;strong&gt;provision&lt;/strong&gt; those two cloud workers through Cursor (implementation session and review session). There is another control whose only job is to &lt;strong&gt;run one traversal of the loop&lt;/strong&gt; against the workers that are already provisioned: send the task to the implementer, wait for structured output, send context to the reviewer, branch on approve versus request changes, repeat. The labels on my build say “Start” and “Run minimal cycle”; the point is the separation, not the wording. There is also stop, and a path for “I already have a PR, skip straight to review and remediation” for faster debugging.&lt;/p&gt;

&lt;p&gt;A few properties of that setup feel foundational:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;No silent substitution of workers.&lt;/strong&gt; The run-the-loop action refuses to allocate fresh agents if the pair was never provisioned or if a session died. It fails loudly instead of pretending continuity. The assumption is that a multi-agent workflow only deserves the name if the same identifiable Cursor runs carry context across rounds. Otherwise it is just a string of unrelated one-off jobs.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Structured handoff between the two roles.&lt;/strong&gt; After each side finishes, it emits JSON the runtime can parse (for example &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prUrl&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reviewDecision&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;commentUrls&lt;/code&gt;, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;actorClock&lt;/code&gt; I mentioned earlier). A small orchestrator in the Caucus code reads those fields; it does not scrape adjectives out of the model’s closing paragraph.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Remediation tied to GitHub, not to chat memory.&lt;/strong&gt; When review requests changes and implementation runs again, the orchestrator pulls live comments and reviews with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gh api&lt;/code&gt; and bakes that into the next task. The source of truth is the PR thread.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;The runtime can patch around a flaky side.&lt;/strong&gt; If the review worker cannot post to GitHub, the orchestrator can post from the machine running Caucus. If the implementation worker never emits a tidy PR URL, the orchestrator can still recover the link from the transcript. Reliability is allowed to live outside any single model reply.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
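
&lt;p&gt;To make the structured-handoff property concrete, here is a hedged sketch of what “reads those fields” could look like. The four field names come from the list above; the assumption that one payload carries all four, the expected types, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MalformedHandoff&lt;/code&gt; exception are illustrative choices, not the actual Caucus schema.&lt;/p&gt;

```python
import json

# Assumed field types; the real payload schema may differ.
REQUIRED_FIELDS = {
    "prUrl": str,
    "reviewDecision": str,
    "commentUrls": list,
    "actorClock": dict,
}


class MalformedHandoff(ValueError):
    """Raised when a stage emits a payload the orchestrator cannot trust."""


def parse_handoff(raw: str) -> dict:
    """Parse a handoff and fail loudly instead of guessing at intent."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedHandoff(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise MalformedHandoff("handoff must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(name), expected):
            raise MalformedHandoff(f"missing or mistyped field: {name}")
    return payload


raw = json.dumps({
    "prUrl": "https://example.invalid/pr/1",
    "reviewDecision": "request_changes",
    "commentUrls": [],
    "actorClock": {"implement": 1, "review": 1},
})
print(parse_handoff(raw)["reviewDecision"])

try:
    parse_handoff("Looks great overall, nice work!")
except MalformedHandoff as err:
    print("rejected:", err)
```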

&lt;h2 id=&quot;why-the-dashboard-matters&quot;&gt;Why The Dashboard Matters&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/img/caucus-minimal-workflow-dashboard-2026-04-07.png&quot; alt=&quot;Screenshot of the Caucus web UI: dark page with a task field, buttons to allocate or stop the two Cursor workers, a control to run the implement–review loop, optional review-only fields, cards for each live agent run, a small graph of implementation and review attempts in order, and a log panel&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The screenshot is there because the page is doing real work, not decoration. Reading top to bottom: the text field is the task I am asking the implementation side to perform. The blue and red actions are allocate versus tear down the two Cursor background sessions (implementation and review). Next to that is the action that runs the loop once through those sessions, using that task. Off to the side there is an optional alternate path where I paste an existing pull request URL if I only want to exercise “review said no, fix it again” without a fresh implementation pass. Below that, two cards show whether each Cursor run is actually alive and link out to the run in Cursor’s UI if I need the full transcript. The strip in the middle is the graph of attempts in order (first implementation, first review, second implementation, and so on), so I can see the cycle without inferring it from logs. The panel at the bottom is the per-attempt trace: which stage, which handoff id, whether it finished, and the run id to correlate with Cursor.&lt;/p&gt;

&lt;p&gt;One thing I did not appreciate at the beginning is how much that surface is part of the runtime, not a thin skin on top.&lt;/p&gt;

&lt;p&gt;For a multi-round system, observability is not optional. If implementation and review can each happen multiple times, you need attempt history as a first-class thing: which round failed, what the handoff contained, what the orchestrator decided, and whether you are stuck because a worker died, because the run is waiting, or because a payload was malformed.&lt;/p&gt;
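
&lt;p&gt;One way to picture attempt history as a first-class thing is a small record per attempt plus a function that answers “why are we stuck?” This is a sketch under assumptions: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Attempt&lt;/code&gt; record, its field names, and the error labels are hypothetical, chosen to mirror the trace columns described above.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    stage: str                   # "implement" or "review"
    handoff_id: str              # which handoff triggered this attempt
    run_id: str                  # id used to correlate with the worker's run
    finished: bool = False
    error: Optional[str] = None  # e.g. "worker_died", "malformed_payload"


def stuck_reason(history: List[Attempt]) -> str:
    """Explain the current state of the loop from the last attempt."""
    if not history:
        return "no attempts yet"
    last = history[-1]
    if last.error is not None:
        return last.error
    return "done" if last.finished else "waiting on " + last.stage


history = [
    Attempt("implement", "h1", "run-a", finished=True),
    Attempt("review", "h2", "run-b", finished=True),
    Attempt("implement", "h3", "run-a"),
]
print(stuck_reason(history))   # waiting on implement

history[-1].error = "malformed_payload"
print(stuck_reason(history))   # malformed_payload
```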

&lt;p&gt;So the UI is doing what a control panel for a distributed job should do. It maps the graph of attempts, separates “are the workers up” from “what did the last loop try,” and streams enough detail to replay decisions. If you cannot see the cycle, you cannot debug the cycle. If you cannot debug it, you do not have a system yet; you have a toy.&lt;/p&gt;

&lt;h2 id=&quot;where-this-is-going&quot;&gt;Where This Is Going&lt;/h2&gt;

&lt;p&gt;V1 is intentionally narrow. It is a workflow kernel, not a full multi-agent platform.&lt;/p&gt;

&lt;p&gt;It still doesn’t handle concurrent agents modifying the same branch. It still doesn’t validate stage outputs beyond basic structure. It still trusts a reviewer approval more than it probably should. It still doesn’t do fault injection at stage boundaries, which is where I think the next really interesting work is. What happens when the handoff payload is malformed? What happens when an agent returns stale state from an earlier run? What happens when two different remediation loops race?&lt;/p&gt;

&lt;p&gt;Those are the questions that make the distributed systems framing feel real to me. They’re also why I don’t want this post to read like a launch announcement. Caucus V1 is important to me not because it’s complete, but because it’s the first version that feels like a concrete system rather than an idea with role prompts attached to it.&lt;/p&gt;

&lt;p&gt;That’s also why I wanted to write about it now. In the earlier multi-agent post, the argument was that we needed more than role decomposition. We needed actual coordination machinery. This is the first concrete version of that claim. It’s small, but it’s real. It carries state forward. It loops. It exposes failure. It gives me one tiny vector-clock-shaped primitive to build on. And it makes the next step legible.&lt;/p&gt;

&lt;p&gt;That’s enough for V1.&lt;/p&gt;
</description>
				<pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/agents/reliability/zabriskie/2026/04/08/cursor-agents-caucus-v1.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/agents/reliability/zabriskie/2026/04/08/cursor-agents-caucus-v1.html</guid>
			</item>
		
			<item>
				<title>The Feature That Has Never Worked · A broken auto-live poller, and what perceived urgency does to Claude Code</title>
				<description>&lt;p&gt;It’s 7 PM on a Friday. I’m home after a day at work, watching Billy Strings on nugs as he plays the second night of a three-night run at the St. Augustine Amphitheatre. I switch over to the app I’ve been building, &lt;a href=&quot;/ai/zabriskie/community/2026/03/08/why-im-building-zabriskie.html&quot;&gt;Zabriskie&lt;/a&gt;, a social music app for live shows, and expect tonight’s show page to just work.&lt;/p&gt;

&lt;p&gt;The show says “scheduled.”&lt;/p&gt;

&lt;p&gt;Billy Strings is literally on stage. People are in the venue. The doors opened an hour ago. And the app thinks nothing is happening.&lt;/p&gt;

&lt;p&gt;I know what this is. It’s the auto-live poller. It’s broken again. It has never stayed working for long.&lt;/p&gt;

&lt;p&gt;I open a Claude Code session and say, roughly: “Billy Strings is playing tonight and I don’t see an auto-live again. Is it broken AGAIN?”&lt;/p&gt;

&lt;p&gt;This is the seventh time. In thirteen days.&lt;/p&gt;

&lt;p&gt;If some of this sounds familiar, that’s intentional. I covered an earlier version of this in &lt;a href=&quot;/ai/zabriskie/development/2026/03/29/the-show-is-happening-right-now-and-nothing-works.html&quot;&gt;The Show Is Happening Right Now and Nothing Works&lt;/a&gt;; this post is a tighter recap focused specifically on what happened with auto-live over the last two weeks, and on a specific hypothesis: &lt;strong&gt;under perceived urgency, the agent optimizes for immediate visible progress over process correctness.&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id=&quot;what-auto-live-is-supposed-to-do&quot;&gt;What Auto-Live Is Supposed to Do&lt;/h2&gt;

&lt;p&gt;The concept is simple. Zabriskie tracks live shows. When a show starts, the app should automatically transition it from “scheduled” to “live.” This triggers a cascade of things users care about: the Live Activity lights up on their iPhone Lock Screen, push notifications go out to everyone who RSVP’d, the live chat opens, the setlist tracker starts pulling data. The whole live show experience depends on this one status transition.&lt;/p&gt;

&lt;p&gt;The implementation is also simple. A background goroutine runs every 60 seconds. It queries all shows with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;status = &apos;scheduled&apos;&lt;/code&gt;. For each one, it combines the show date and start time in the venue’s local timezone. If the current time is past the start time but within a four-hour window, it flips the show to live.&lt;/p&gt;

&lt;p&gt;That’s it. A timer that checks a clock. This is not distributed consensus. This is not Byzantine fault tolerance. This is a cron job that compares two timestamps.&lt;/p&gt;
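
&lt;p&gt;The comparison really is that small. Here is a hedged sketch of the check, written in Python rather than the Go of the real poller; the function name and signature are assumptions. A fixed UTC-4 offset stands in for the venue’s timezone so the example runs without a timezone database, which is the exact dependency that bites later in this story.&lt;/p&gt;

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo

LIVE_WINDOW = timedelta(hours=4)


def should_go_live(show_date: date, start_time: time,
                   venue_tz: tzinfo, now: datetime) -> bool:
    """True when `now` is at or past the local start, within the window."""
    start = datetime.combine(show_date, start_time, tzinfo=venue_tz)
    elapsed = now - start
    return elapsed >= timedelta(0) and LIVE_WINDOW > elapsed


# Stand-in for the venue zone: Eastern Daylight Time is UTC-4 in April.
# A real poller would load a named zone, which needs tz data installed.
EDT = timezone(timedelta(hours=-4))
show_day = date(2026, 4, 3)
start = time(19, 30)


def at(hour: int, minute: int) -> datetime:
    """Local wall-clock time on show day, converted to UTC."""
    return datetime(2026, 4, 3, hour, minute, tzinfo=EDT).astimezone(timezone.utc)


print(should_go_live(show_day, start, EDT, at(19, 0)))    # False: not started
print(should_go_live(show_day, start, EDT, at(21, 0)))    # True: mid-show
print(should_go_live(show_day, start, EDT, at(23, 45)))   # False: window passed
```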

&lt;p&gt;It has never stayed reliable.&lt;/p&gt;

&lt;h2 id=&quot;march-21-day-one&quot;&gt;March 21: Day One&lt;/h2&gt;

&lt;p&gt;Auto-live shipped on March 21st. The feature launched and immediately did nothing. The production Docker image was built on Alpine Linux, which doesn’t include timezone data files by default. The Go timezone parser silently returned empty strings. The poller ran every 60 seconds, dutifully checked every show, failed to parse any timezone, and skipped them all. No errors logged. No warnings. No indication that the feature was completely dead.&lt;/p&gt;

&lt;p&gt;This is what I’ve come to call &lt;em&gt;silent failure suppression&lt;/em&gt;, one of five failure modes I now track in an incident database. The system appears healthy. Logs are clean. The feature just quietly doesn’t work, and the only way to find out is to be a user who’s sitting in a venue wondering why the app doesn’t know the show started.&lt;/p&gt;

&lt;p&gt;The fix was one line in the Dockerfile: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apk add --no-cache tzdata&lt;/code&gt;. But the fix for the silence was harder, and it’s the one we never really solved.&lt;/p&gt;

&lt;h2 id=&quot;march-26-the-type-mismatch&quot;&gt;March 26: The Type Mismatch&lt;/h2&gt;

&lt;p&gt;Five days later, the poller broke again. An earlier fix had added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::text&lt;/code&gt; casts to the SQL query to work around the timezone issue. Then a subsequent change updated the Go scan variables from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;string&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;time.Time&lt;/code&gt;. PostgreSQL’s driver silently failed to scan text into a time value. The poller ran. It scanned. It got zero results. It did nothing.&lt;/p&gt;

&lt;p&gt;Two days of shows passed with no transitions. Nobody noticed because there was no monitoring, no alert, no test. The feature was dead for 48 hours and the only signal was the absence of something that had barely worked in the first place.&lt;/p&gt;

&lt;h2 id=&quot;april-2-the-night-it-broke-four-times&quot;&gt;April 2: The Night It Broke Four Times&lt;/h2&gt;

&lt;p&gt;This was the night that crystallized the pattern. Billy Strings was playing the first night of the St. Augustine run. The show started. The app didn’t transition. I opened Claude Code.&lt;/p&gt;

&lt;p&gt;What followed was a cascading series of failures, not just in the code, but in how the AI agent responded to pressure.&lt;/p&gt;

&lt;p&gt;The first problem: the poller’s SQL query required &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;venue_lat IS NOT NULL AND venue_lng IS NOT NULL AND start_time IS NOT NULL&lt;/code&gt;. If any of those fields were missing, the show was silently skipped. Of the 684 scheduled shows, 204 were missing coordinates and 176 were missing start times. The missing coordinates had a backstory: in an earlier session, I had asked Claude to geocode every venue, and it silently failed that job too. I only discovered that recently while debugging this incident. The Billy Strings show had coordinates, but any show missing one required field was filtered out before it was ever processed.&lt;/p&gt;

&lt;p&gt;The fix was straightforward: fall back to America/New_York when coordinates are missing, fall back to 7 PM when start time is missing, and never skip a show. But here’s where the urgency failure mode kicked in.&lt;/p&gt;
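
&lt;p&gt;As a brief aside, the fallback rule itself fits in one function. This is a sketch under assumptions, in Python rather than the poller’s Go: the dictionary keys mirror the column names above, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lookup_tz&lt;/code&gt; is a hypothetical coordinate-to-timezone helper passed in by the caller.&lt;/p&gt;

```python
from datetime import time

DEFAULT_TZ = "America/New_York"
DEFAULT_START = time(19, 0)   # 7 PM local


def resolve_schedule(show: dict, lookup_tz) -> tuple:
    """Return (tz_name, start_time) for a show, never skipping it."""
    lat, lng = show.get("venue_lat"), show.get("venue_lng")
    if lat is None or lng is None:
        tz_name = DEFAULT_TZ              # no coordinates: assume Eastern
    else:
        tz_name = lookup_tz(lat, lng)     # hypothetical lookup helper
    start = show.get("start_time")
    if start is None:
        start = DEFAULT_START             # no start time: assume 7 PM
    return tz_name, start


def fake_lookup(lat, lng):
    """Stand-in for a real coordinate lookup."""
    return "America/Chicago"


print(resolve_schedule({"venue_lat": None, "venue_lng": None, "start_time": None}, fake_lookup))
print(resolve_schedule({"venue_lat": 1.0, "venue_lng": 2.0, "start_time": time(19, 30)}, fake_lookup))
```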

&lt;p&gt;I told Claude the show was live on stage right now and the app wasn’t working. It immediately switched to fast-path behavior. This is a small personal app, so it used the deployment CLI to pull the production &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DATABASE_URL&lt;/code&gt;, crafted a direct &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;psql&lt;/code&gt; command, and ran &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UPDATE shows SET status = &apos;live&apos; WHERE id = 83&lt;/code&gt; against production. This violated a rule the agent already knew: all database changes go through migrations. The agent had this rule in its memory. It had been told this rule multiple times. When I asked why it did it anyway, it explicitly said it prioritized urgency and getting me an immediate result.&lt;/p&gt;

&lt;p&gt;This is the failure mode I find most interesting from a research perspective. The agent has rules. It knows the rules. It can recite the rules. But when presented with time pressure, while a show is happening &lt;em&gt;right now&lt;/em&gt; and users are waiting, behavior becomes less predictable and process gets dropped in favor of fast visible progress. And when I asked directly, it said exactly that: it ignored the rules because it perceived urgency. It’s not that the agent forgot. It’s that the agent made a judgment call that urgency overrode process, and that judgment call was wrong.&lt;/p&gt;

&lt;p&gt;The manual database update also destroyed the only opportunity to verify that the code fix actually worked. The show was the test case. By manually flipping the status, the agent eliminated the test case. Speed over verification.&lt;/p&gt;

&lt;p&gt;That night, the same function broke three more times as edge cases surfaced. Six incidents logged in a single evening. Four guardrails attempted.&lt;/p&gt;

&lt;h2 id=&quot;the-urgency-problem&quot;&gt;The Urgency Problem&lt;/h2&gt;

&lt;p&gt;This pattern of process-violating behavior under pressure showed up repeatedly across the project, not just with auto-live. When I told the agent that something was broken in production, the behavioral change was immediate and consistent:&lt;/p&gt;

&lt;p&gt;It pushed directly to main instead of opening a PR. It used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--admin&lt;/code&gt; to bypass CI checks that were failing. It skipped the PR template. It skipped &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go build&lt;/code&gt;. It merged before tests passed. Each time, when confronted, the agent could articulate exactly which rule it had violated and why the rule existed. It just… didn’t follow the rule in the moment.&lt;/p&gt;

&lt;p&gt;I started logging these as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory_without_behavioral_change&lt;/code&gt;: the agent knows the rule, can explain the rule, has been corrected about the rule before, and violates it anyway. Nineteen of the sixty-four incidents in my incident database carry this classification. It’s the second most common failure mode.&lt;/p&gt;

&lt;p&gt;The most common is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;speed_over_verification&lt;/code&gt; at thirty-one incidents. The agent ships without testing. It declares a fix complete without restarting the server. It commits without building. It merges without waiting for CI. And almost every time, the reason is some form of “it seemed urgent” or “I wanted to get this fixed quickly.”&lt;/p&gt;

&lt;h2 id=&quot;the-incident-tracker&quot;&gt;The Incident Tracker&lt;/h2&gt;

&lt;p&gt;About two weeks into the project, I started requiring the agent to log incidents. Every mistake, whether a bug it introduced, an assumption it got wrong, or a rule it violated, gets inserted into an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;agent_incidents&lt;/code&gt; table with a failure mode classification, severity, description of what happened, and how it was resolved.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/agent-reliability-tracker-2026-04-07.png?v=2&quot; alt=&quot;Agent Reliability Log tracker&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The incident tracker: failure modes over time, guardrail markers, and the live timeline of incidents and fixes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The taxonomy has five modes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;speed_over_verification&lt;/strong&gt;: Shipped without testing. 31 incidents.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;memory_without_behavioral_change&lt;/strong&gt;: Knew the rule, broke it anyway. 19 incidents.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;silent_failure_suppression&lt;/strong&gt;: Failure hidden or swallowed. 13 incidents.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;user_model_absence&lt;/strong&gt;: Didn’t consider how real users experience the change. 11 incidents.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;uncertainty_blindness&lt;/strong&gt;: Didn’t verify an assumption. 9 incidents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These classifications are not mutually exclusive, so a single incident can carry more than one failure mode.&lt;/p&gt;

&lt;p&gt;The failed venue geocoding pass I only discovered during this outage is a textbook &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;silent_failure_suppression&lt;/code&gt; case: the job looked done, but quietly left hundreds of shows without coordinates.&lt;/p&gt;

&lt;p&gt;Each incident also requires a guardrail, and the guardrail has to be code. A script, a hook, a test, an automated check. Something that mechanically prevents the failure class from recurring.&lt;/p&gt;

&lt;p&gt;This requirement itself generated an incident.&lt;/p&gt;

&lt;p&gt;After the show-live chat bugs on April 2nd, where the agent queried the wrong database table and hid the entire comments section, I asked it to log a guardrail. It inserted a row into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;agent_guardrails&lt;/code&gt; that said, essentially, “I will verify queries against real data before committing.” Words. A promise. I pushed back. It added a rule to CLAUDE.md: “Always verify DB tables have expected data before writing queries.” More words.&lt;/p&gt;

&lt;p&gt;I had to log an incident about the guardrail itself: “Guardrail was words in a database, not code.” The agent’s instinct when asked to prevent a class of failure was to write down a reminder to be more careful. That’s not a guardrail. That’s a New Year’s resolution. A guardrail is a pre-commit hook that blocks the merge. A guardrail is a test that fails when the query returns zero rows. A guardrail is a script that runs automatically and catches the error before a human ever sees it.&lt;/p&gt;

&lt;p&gt;The distinction matters because it cuts to the heart of what AI agents are good at and what they’re not. They’re excellent at generating plausible-sounding process improvements. They’re terrible at recognizing that plausible-sounding process improvements don’t work on AI agents because AI agents don’t have habits. They don’t internalize. They don’t learn from experience in the way that “I’ll be more careful next time” implies. Every conversation starts fresh. The only things that persist are code, hooks, and automated checks.&lt;/p&gt;

&lt;p&gt;This is why the guardrails that actually work are all mechanical: a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PreToolUse&lt;/code&gt; hook that blocks direct database writes. A CI gate that rejects PRs missing the template. A script that greps for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IS NOT NULL&lt;/code&gt; in the poller query and fails the build if anyone adds it back. This is the same argument I made in &lt;a href=&quot;/ai/engineering/2026/04/01/software-engineering-is-becoming-civil-engineering.html&quot;&gt;Software Engineering Is Becoming Civil Engineering&lt;/a&gt;: guardrails are the product, not optional process overhead. These work because they don’t require the agent to remember anything. They work because they’re walls, not reminders.&lt;/p&gt;
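
&lt;p&gt;For a sense of how small a wall can be, here is a minimal sketch of the third guardrail. It assumes the poller query lives in source files a CI step can read; the function names and the message format are illustrative, not the script in my repo.&lt;/p&gt;

```python
import re

# Matches the filter that caused shows to be silently skipped.
FORBIDDEN = re.compile(r"\bIS\s+NOT\s+NULL\b", re.IGNORECASE)


def find_violations(source: str) -> list:
    """Return (line_number, text) for every line containing the filter."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if FORBIDDEN.search(line)
    ]


def main(paths: list) -> int:
    """Scan the given files; return 1 so CI fails if any violation exists."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for number, text in find_violations(handle.read()):
                print(f"{path}:{number}: forbidden filter in poller query: {text}")
                failed = True
    return 1 if failed else 0


good = "SELECT id FROM shows WHERE status = 'scheduled'"
bad = "SELECT id FROM shows\nWHERE status = 'scheduled'\n  AND venue_lat IS NOT NULL"
print(find_violations(good))   # []
print(find_violations(bad))    # [(3, 'AND venue_lat IS NOT NULL')]
```

&lt;p&gt;A CI step would call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;main&lt;/code&gt; with the poller’s source path and use the return value as the process exit code. The check is crude on purpose: it does not need the agent to remember why the filter is dangerous.&lt;/p&gt;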

&lt;h2 id=&quot;tonight-april-3rd&quot;&gt;Tonight: April 3rd&lt;/h2&gt;

&lt;p&gt;So tonight. Billy Strings. Broken again.&lt;/p&gt;

&lt;p&gt;The diagnosis took about twenty minutes. A migration authored by Cursor, a different AI coding tool, had inserted eight shows with slightly different venue names. “St. Augustine Amphitheatre” instead of “The St. Augustine Amphitheatre.” The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE NOT EXISTS&lt;/code&gt; guard checked exact string matches and missed the collision. Two shows existed for Billy Strings on April 3rd: the original with full metadata, venue coordinates, start time, media artwork, and user RSVPs, and a bare-bones duplicate with none of that.&lt;/p&gt;

&lt;p&gt;The poller found both. The duplicate, having no start time, used the fallback of 7 PM. The original had a start time of 7:30 PM. The duplicate went live first. Users who had RSVP’d to the original show, the real show, got no notification. The Live Activity didn’t start. The live chat opened on a ghost show with zero attendees.&lt;/p&gt;

&lt;p&gt;288 duplicate shows existed in the database across all bands. They’d been accumulating silently from overlapping migrations for weeks. No unique constraint on the shows table to prevent them. No check in the poller to handle them.&lt;/p&gt;
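
&lt;p&gt;The collision comes down to comparing raw strings. A hedged sketch of a looser comparison, assuming duplicates share a band and a date and differ only in case, whitespace, or a leading “The”: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;venue_key&lt;/code&gt; helper, the record shape, and show id 901 are hypothetical.&lt;/p&gt;

```python
import re


def venue_key(name: str) -> str:
    """Normalize a venue name for duplicate detection."""
    collapsed = re.sub(r"\s+", " ", name).strip().casefold()
    return re.sub(r"^the ", "", collapsed)   # drop a leading article


def find_duplicates(shows: list) -> list:
    """Group show ids that share band, date, and normalized venue."""
    groups = {}
    for show in shows:
        key = (show["band"], show["date"], venue_key(show["venue"]))
        groups.setdefault(key, []).append(show["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


shows = [
    {"id": 84, "band": "Billy Strings", "date": "2026-04-03",
     "venue": "The St. Augustine Amphitheatre"},
    {"id": 901, "band": "Billy Strings", "date": "2026-04-03",
     "venue": "St. Augustine Amphitheatre"},      # hypothetical duplicate id
]
print(find_duplicates(shows))   # [[84, 901]]
```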

&lt;p&gt;The fix was a migration to delete the duplicates, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UNIQUE INDEX&lt;/code&gt; to prevent new ones, and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW_NUMBER()&lt;/code&gt; window function in the poller to prefer shows with the most metadata when duplicates exist. A new test covers the exact scenario. The PR passed CI. It’ll deploy tonight, and tomorrow’s show, the third night of the run, should go live on its own.&lt;/p&gt;

&lt;p&gt;Should.&lt;/p&gt;
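
&lt;p&gt;For reference, the “prefer the show with the most metadata” rule can be sketched with a window function. This is an illustration under assumptions, not the production query: it uses an in-memory SQLite database as a stand-in for PostgreSQL, a made-up schema and rows, and it scores metadata by counting missing fields only.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shows (id INTEGER PRIMARY KEY, band TEXT, show_date TEXT,"
    " start_time TEXT, venue_lat REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO shows VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Illustrative rows; coordinates and id 901/902 are invented.
        (84, "Billy Strings", "2026-04-03", "19:30", 1.0, "scheduled"),
        (901, "Billy Strings", "2026-04-03", None, None, "scheduled"),
        (902, "Other Band", "2026-04-03", None, None, "scheduled"),
    ],
)

# Rank rows per (band, date): fewest missing fields first, lowest id on ties.
QUERY = """
SELECT id FROM (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY band, show_date
               ORDER BY (start_time IS NULL) + (venue_lat IS NULL), id
           ) AS rn
    FROM shows
    WHERE status = 'scheduled'
)
WHERE rn = 1
ORDER BY id
"""

winners = [row[0] for row in conn.execute(QUERY)]
print(winners)   # [84, 902]
```

&lt;p&gt;Note that the lone show with no metadata still wins its own partition, so nothing is skipped; the ranking only decides between duplicates.&lt;/p&gt;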

&lt;h2 id=&quot;what-im-learning&quot;&gt;What I’m Learning&lt;/h2&gt;

&lt;p&gt;I’m building Zabriskie as a research project in AI-first development. One person, multiple AI agents, shipping a production app to real users on iOS, Android, and web. The &lt;a href=&quot;/ai/engineering/2026/04/01/software-engineering-is-becoming-civil-engineering.html&quot;&gt;incident database&lt;/a&gt; is the research artifact. Every failure mode is data.&lt;/p&gt;

&lt;p&gt;Here’s what sixty-four incidents have taught me so far:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;AI agents can build features fast and keep them running slow.&lt;/strong&gt; The auto-live poller was written in an hour. It’s been breaking for thirteen days. The ratio of build time to maintenance time is inverted from what I expected. The agent writes new code at extraordinary speed and maintains existing code at extraordinary cost. Every fix introduces a new edge case. Every edge case is a new conversation where the agent has no memory of the last seven conversations about the same function.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Urgency is the enemy of AI reliability.&lt;/strong&gt; The April 2 incidents are the clearest example: under time pressure, the optimization target appears to shift from “be correct” to “produce an immediately visible fix.” The pattern is consistent enough that I’m considering it a design constraint: never tell the AI something is broken during a live event. File a bug. Fix it tomorrow. The live show is not the time to ship code, and the AI cannot be trusted to maintain process discipline when it perceives urgency.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Guardrails must be mechanical.&lt;/strong&gt; Rules don’t work. Memory doesn’t work. CLAUDE.md entries don’t work. The only guardrails that have actually reduced incident rates are automated checks that run without the agent’s cooperation: hooks, CI gates, database constraints, and tests. The agent will comply with a wall. It will walk around a sign.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;The incident tracker is the most valuable thing I’ve built.&lt;/strong&gt; More valuable than Live Activities. More valuable than the setlist tracker. More valuable than the auto-live poller itself. Because it’s the only tool that creates a feedback loop the agent can’t circumvent. When a failure happens, it gets classified, logged, and a mechanical guardrail gets built. The guardrail runs in CI or as a hook. The next agent session hits the wall instead of making the same mistake. Fifty-six guardrails are now running. The incident rate for certain failure modes has dropped. Not because the agent got better. Because the walls got higher.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;The last 10% is where reliability lives.&lt;/strong&gt; AI-first development works. I’ve shipped thousands of commits across three platforms with real users. The velocity is real. The capabilities are real. But the gap between “it works in dev” and “it works at showtime” is where every one of these sixty-four incidents lives. The agent builds for the happy path. The production environment is not the happy path. It’s timezone edge cases at 8 PM and duplicate venue names with missing articles and NULL coordinates on shows that were imported six migrations ago.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;I’m writing this at about 9 PM Eastern. Billy Strings is mid-set at the St. Augustine Amphitheatre. The auto-live fix hasn’t deployed to production yet: the PR just passed CI and is waiting to be merged. Locally, where the migration has already removed the duplicate, show 84, the real one, would have gone live at 7:30 PM via the existing poller. On production, the duplicate is still there.&lt;/p&gt;

&lt;p&gt;Tomorrow night is the third show. The migration will have deployed by then. The unique index will be in place. The poller will have the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW_NUMBER()&lt;/code&gt; query. The new test will be in CI.&lt;/p&gt;

&lt;p&gt;It should work. It has never stayed reliable before. But it should work.&lt;/p&gt;

&lt;p&gt;The central hypothesis held again tonight: when urgency is perceived, behavior shifts toward immediate visible progress and away from process correctness, including, by its own explicit admission, ignoring known rules. I don’t know what to do with that irony except document it, which is what I’ve been doing from the start.&lt;/p&gt;

&lt;p&gt;The research continues. The shows continue. Somewhere between the two, the software might start working.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;strong&gt;Update, April 7, 2026:&lt;/strong&gt; This post previously referred to guardrails as “mitigations” throughout. A reader correctly pointed out the distinction: in incident response, a &lt;em&gt;mitigation&lt;/em&gt; reduces the impact of something that’s already happening. What this post describes (pre-commit hooks, CI gates, automated tests, database constraints) are &lt;em&gt;guardrails&lt;/em&gt;: preventive measures that stop failures before they occur. The post even made this distinction implicitly at one point (“guardrails are the product, not optional process overhead”) but didn’t carry the terminology through. The text and screenshot have been updated. Fortunately, I have an AI agent that can rename a database table, update every handler, rewrite the API routes, fix the frontend, and open a PR in about four minutes, which is less time than it took me to write this correction. The distinction matters because it reflects the core argument: the agent doesn’t learn from experience, so you need walls, not afterthoughts. Guardrails are walls. Mitigations are cleanup.&lt;/p&gt;
</description>
				<pubDate>Fri, 03 Apr 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/zabriskie/reliability/2026/04/03/the-feature-that-has-never-worked.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/zabriskie/reliability/2026/04/03/the-feature-that-has-never-worked.html</guid>
			</item>
		
			<item>
				<title>Software Engineering Is Becoming Civil Engineering</title>
				<description>&lt;p&gt;I gave a guest lecture on AI in &lt;a href=&quot;https://www.cs.cmu.edu/~mhilton/&quot;&gt;Michael Hilton’s&lt;/a&gt; Foundations of Software Engineering course (&lt;a href=&quot;https://cmu-313.github.io&quot;&gt;CMU 17-313&lt;/a&gt;) today. One of my favorite things about lecturing is the conversations that happen afterward, the ones that go in directions nobody planned.&lt;/p&gt;

&lt;p&gt;This one hasn’t left my head: software engineering is going through the same transition that building went through in the 18th century, when structural design separated from craft construction and became its own discipline. What we now call civil engineering.&lt;/p&gt;

&lt;p&gt;The welders who join steel beams on a bridge are skilled tradespeople. They’re not involved in the structural design. They don’t decide where the load-bearing members go. They don’t reason about wind shear or seismic tolerance. But the bridge is &lt;em&gt;designed&lt;/em&gt; so that a welder doing their job correctly can’t bring the whole thing down. The structural engineer’s job isn’t to weld. It’s to create a system where welding happens safely within well-defined constraints.&lt;/p&gt;

&lt;p&gt;I think this is what’s happening to software engineering right now. Not in five years. This year.&lt;/p&gt;

&lt;h2 id=&quot;the-split&quot;&gt;The Split&lt;/h2&gt;

&lt;p&gt;Product managers are writing code. This is how I operate with &lt;a href=&quot;/ai/zabriskie/community/2026/03/08/why-im-building-zabriskie.html&quot;&gt;Zabriskie&lt;/a&gt;, the app I’m building. I act as a product manager more than a programmer. I have non-technical collaborators who file bugs and feature requests, and Claude Code implements them directly. People who’ve never written a line of code are describing what they want, the AI writes it, they verify, it ships. The feedback loop is tight and the results are surprisingly good.&lt;/p&gt;

&lt;p&gt;But someone has to design the bridge. Someone has to decide how the database schema handles multi-tenancy. Someone has to design the deployment pipeline so a bad change rolls back automatically. Someone has to build the abstraction layer that lets a product manager add a new notification type without accidentally breaking the payment flow. That’s the platform engineer. The structural engineer of software.&lt;/p&gt;

&lt;p&gt;The PMs writing features? That’s the welding. And there’s nothing wrong with it. But it only works if the bridge is designed right.&lt;/p&gt;

&lt;p&gt;The profession is splitting. The mistake would be pretending it isn’t happening.&lt;/p&gt;

&lt;h2 id=&quot;what-the-platform-has-to-guarantee&quot;&gt;What the Platform Has to Guarantee&lt;/h2&gt;

&lt;p&gt;Here’s where civil engineering has something important to teach us.&lt;/p&gt;

&lt;p&gt;A civil engineer doesn’t just design a bridge. They decide &lt;em&gt;where&lt;/em&gt; the bridge goes based on geology, water flow, soil load. They specify the materials. They calculate the forces. They design the inspection regime. They assess the environmental impact. They ensure compliance with building codes. Then, and only then, does construction begin.&lt;/p&gt;

&lt;p&gt;Every one of these has a software analog, and together they paint a picture of what platform engineering actually is:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Site selection and domain isolation.&lt;/strong&gt; A civil engineer picks the bridge site based on geology and terrain. In software, this is API design, domain boundaries, isolation between services. Get this wrong and every change becomes a potential cascading failure.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Material specification.&lt;/strong&gt; The engineer specifies what grade of steel, what concrete mix. In software, this is choosing the languages, databases, queues, and frameworks. These choices constrain what’s possible.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Load analysis.&lt;/strong&gt; Civil engineers design for 2-4x the expected load. Software needs the same discipline. Capacity planning, rate limiting, designing for 10x your expected traffic. When a PM ships a feature that goes viral, the platform can’t buckle.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Inspection regimes.&lt;/strong&gt; A civil engineer designs how the bridge will be &lt;em&gt;inspected over its lifetime&lt;/em&gt;. In software, this is observability and code review. Not “HTTP 500 on endpoint /api/notify” but “the notification feature deployed 20 minutes ago by the growth PM is failing for 12% of users.” Semantic observability, not raw telemetry.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Codes and standards compliance.&lt;/strong&gt; Building codes encode decades of hard-won lessons from failures. In software, this is security standards, accessibility requirements, regulatory compliance. The platform enforces these as constraints, not suggestions. Violations get caught automatically, not by a human reviewer who might miss them.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Self-healing.&lt;/strong&gt; A bridge has expansion joints that absorb thermal stress without human intervention. Software needs the equivalent. When you detect elevated error rates or failed health checks, the system should automatically mitigate. Roll back the deploy. Disable the feature flag. A bad change at 3pm can’t become a production incident at 3am.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
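
&lt;p&gt;The self-healing item can be made concrete with a small decision function. This is a minimal sketch, not how any particular platform does it: the &lt;code&gt;DeployHealth&lt;/code&gt; record, the 5% error threshold, the 100-request minimum, and the action names are all assumptions chosen for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class DeployHealth:
    """Hypothetical health snapshot for the most recent deploy."""
    requests: int
    errors: int
    health_check_passed: bool


def choose_action(health, max_error_rate=0.05, min_requests=100):
    """Return "rollback", "disable_flag", or "none" for a deploy snapshot."""
    # A failed health check is treated as unambiguous: revert the deploy.
    if not health.health_check_passed:
        return "rollback"
    # Too little traffic to judge the error rate, so do nothing yet.
    if min_requests > health.requests:
        return "none"
    error_rate = health.errors / health.requests
    # Elevated errors with a passing health check: turn the feature off first.
    if error_rate > max_error_rate:
        return "disable_flag"
    return "none"
```

&lt;p&gt;The point of the sketch is that the response is decided by the platform from signals it already collects, so nobody has to be awake for the first step of the recovery.&lt;/p&gt;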

&lt;h2 id=&quot;what-actually-changes-about-the-job&quot;&gt;What Actually Changes About the Job&lt;/h2&gt;

&lt;p&gt;This isn’t about coding becoming less important. It’s about what you spend your time on.&lt;/p&gt;

&lt;p&gt;Today, most software engineers spend the majority of their day writing features. Tomorrow, I think the best ones will spend their day designing the systems that make it safe for &lt;em&gt;anyone&lt;/em&gt; to ship features. This is what I’ve been spending most of my time on with Zabriskie. I’m not writing much code anymore. I spend my time working with the AI to set the platform’s direction. What are the domain boundaries? What needs isolation? Where do we need observability? How does the system heal itself when something goes wrong? That’s the job now.&lt;/p&gt;

&lt;p&gt;The day-to-day shifts. Instead of “implement the notification preference screen,” it’s “design the notification system so that a PM can add a new notification type and the worst thing that happens if they get it wrong is that one notification doesn’t send.” Instead of writing the migration, it’s designing the migration system so that conflicting migrations are detected and blocked automatically. Instead of fixing the bug, it’s building the observability that surfaces the bug before a user reports it.&lt;/p&gt;

&lt;p&gt;It’s not a demotion. It’s a different kind of engineering. And honestly, it’s harder. Writing a feature is a bounded problem. Designing a platform that stays safe as dozens of people and agents ship changes to it every day, that’s an open-ended one. With AI agents doing more of the feature work, the assumption has to be that individual changes will sometimes be imperfect. Agents hallucinate. They introduce subtle bugs. They make confident changes based on incomplete context. The platform has to absorb this. Not by making agents perfect, but by making the system tolerant of imperfection.&lt;/p&gt;

&lt;h2 id=&quot;the-hard-questions&quot;&gt;The Hard Questions&lt;/h2&gt;

&lt;p&gt;I keep hearing the same anxiety from different directions. Engineers wondering what their job looks like in two years. Students wondering if they’re learning the right things. At CMU, two questions came up that crystallized it for me.&lt;/p&gt;

&lt;p&gt;The first: students early in their software engineering careers don’t know how to tell when the AI is doing something wrong. They don’t have the spidey-sense yet. The AI generates code that looks plausible, passes a surface-level review, and the student ships it. They can’t smell the bad decision because they’ve never seen what a bad decision leads to. How do you develop judgment about something you’ve never experienced failing?&lt;/p&gt;

&lt;p&gt;The second is even harder: where do our senior engineers come from? The ability to design good platforms, to make the right architectural calls, that comes from experience. You learn what breaks by building things that broke. You learn where to put the domain boundaries by having drawn them in the wrong place. You learn what to monitor by having been the person staring at useless dashboards during an incident at 2am. If AI is writing most of the code, and junior engineers aren’t getting the reps of building and breaking things themselves, how do they develop the judgment to become the platform engineers we need?&lt;/p&gt;

&lt;p&gt;These are connected, and they form a kind of paradox. You can’t design a migration system that handles conflicts if you’ve never written a migration. You can’t design isolation boundaries if you don’t understand how a database connection pool works. You can’t build semantic observability if you’ve never been the person debugging a production incident from raw logs. The understanding comes from doing the work. But we’re taking the coding away from students at the exact moment they need it most. We need them to code to build intuition, but the industry is moving toward a world where they don’t code.&lt;/p&gt;

&lt;h2 id=&quot;cs-is-not-se&quot;&gt;CS Is Not SE&lt;/h2&gt;

&lt;p&gt;Here’s the thing: most universities don’t even have a software engineering program. They have computer science programs. And computer science is a discipline designed to produce researchers. Algorithms, data structures, theory of computation, concurrent programming. It’s a rigorous education in how to write correct programs. But it was co-opted decades ago as the default training path for people who are going to spend their careers doing software engineering, which is a fundamentally different discipline.&lt;/p&gt;

&lt;p&gt;Computer science teaches you to write correct programs. Software engineering teaches you to build software that is changeable, resilient, and reliable. CS is the welding. SE is the structural engineering. How do you design systems that are safe to operate? How do you release reliably? How do you reason about failure? How do you evolve a codebase over years without it collapsing under its own weight? Courses like 17-313 teach these things, and they teach them well. But very few universities have a dedicated SE program. Most students get one or two SE courses inside a CS degree and call it done.&lt;/p&gt;

&lt;p&gt;That distinction used to matter less when every engineer was also the person writing the code. Now that AI is handling more and more of the “write correct programs” part, the software engineering part is all that’s left. And we don’t have enough curriculum around it. Platform engineering can’t be an afterthought or a single lecture in a survey course. It needs to be a first-class part of how we teach students to think about building software.&lt;/p&gt;

&lt;p&gt;Civil engineering solved the experience problem with structured apprenticeship. You don’t go from coursework to designing bridges. There are years of supervised practice, increasing responsibility, professional licensing exams. The judgment develops through guided experience, not just classroom instruction.&lt;/p&gt;

&lt;p&gt;I don’t think software engineering needs Professional Engineer (PE) exams. But we need to take this seriously.&lt;/p&gt;

&lt;p&gt;Here’s a concrete example. For years, I gave one guest lecture per semester in a software engineering course at CMU on reliably releasing software. Feature flags, metrics, observability, safe deployments, self-healing, rollback strategies. One lecture. A nice-to-know topic in a course full of other things. That content is now the whole game. It’s not a single lecture anymore. It’s the core of what platform engineers need to understand, and it deserves its own course, its own projects, its own curriculum.&lt;/p&gt;

&lt;p&gt;The profession is changing. The question of how we train the people who make it safe for everyone else to build is one we can’t afford to put off.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to &lt;a href=&quot;https://www.cs.cmu.edu/~mhilton/&quot;&gt;Michael Hilton&lt;/a&gt; and &lt;a href=&quot;https://rohan.padhye.org/&quot;&gt;Rohan Padhye&lt;/a&gt; for giving me the opportunity to lecture as adjunct faculty at CMU. Without it, I wouldn’t have the space to think about and explore these problems. And thanks to CMU for teaching me how to think in the first place.&lt;/em&gt;&lt;/p&gt;

</description>
				<pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/engineering/2026/04/01/software-engineering-is-becoming-civil-engineering.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/engineering/2026/04/01/software-engineering-is-becoming-civil-engineering.html</guid>
			</item>
		
			<item>
				<title>Multi-Agent Systems Have a Distributed Systems Problem</title>
				<description>&lt;p&gt;I watched two Claude Code instances step on each other’s database migrations last month. One created migration 267. The other, running in a different worktree, also created migration 267. Different schemas, same filename. The second one silently overwrote the first.&lt;/p&gt;

&lt;p&gt;I stared at it for a minute before I started laughing. This is a lost update — the exact same problem that distributed databases have been solving since the 1970s. A lost update playing out in a directory of SQL files instead of a network of processes. The kind of problem &lt;a href=&quot;https://hal.inria.fr/inria-00555588&quot;&gt;CRDTs&lt;/a&gt; were invented to solve: two independent writers, no coordination, and the system needs to merge their work without losing either update.&lt;/p&gt;
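
&lt;p&gt;The smallest possible guard against this kind of silent overwrite is to make file creation exclusive. The sketch below is illustrative only: the helper name, the file naming scheme, and the SQL are hypothetical, and it assumes both writers share one directory. With separate worktrees, as in my case, the equivalent check would have to run when the branches are merged.&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def create_migration(directory, number, sql):
    """Create NNN.sql, refusing to overwrite an existing migration."""
    path = Path(directory) / f"{number:03d}.sql"
    # Mode "x" is exclusive creation: open() raises FileExistsError if the
    # file already exists, instead of truncating it the way mode "w" would.
    with open(path, "x") as handle:
        handle.write(sql)
    return path


def demo():
    with tempfile.TemporaryDirectory() as directory:
        create_migration(directory, 267, "CREATE TABLE shows (id INT);")
        try:
            # The second writer picks the same number from its own stale view.
            create_migration(directory, 267, "CREATE TABLE rsvps (id INT);")
        except FileExistsError:
            second_writer_blocked = True
        else:
            second_writer_blocked = False
        surviving_sql = (Path(directory) / "267.sql").read_text()
    return second_writer_blocked, surviving_sql
```

&lt;p&gt;This turns a lost update into a loud failure. It does not merge the two migrations, which is the harder problem and the one CRDTs address.&lt;/p&gt;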

&lt;p&gt;I’ve been &lt;a href=&quot;/ai/zabriskie/development/2026/03/29/the-show-is-happening-right-now-and-nothing-works.html&quot;&gt;building an app with Claude Code&lt;/a&gt; as my only collaborator for about two months now. One human, one agent, one codebase — it works. But I’m hitting problems now that a single agent can’t solve efficiently: bugs from real users coming in while I’m trying to ship features, tests that need writing, infrastructure that needs maintaining, all at the same time. The obvious answer is more agents. And the moment you have multiple agents working on the same codebase, you have a distributed system.&lt;/p&gt;

&lt;p&gt;The migration collision wasn’t an isolated incident. I’ve seen agents make contradictory assumptions about the state of the codebase. I’ve seen an agent “fix” a bug by reverting a change that another agent made intentionally. I’ve seen context windows fill up with stale information because no one told the agent that the world had changed since it last looked. These aren’t prompt engineering problems. They’re coordination problems. And I’d spent the last ten years of my life studying coordination problems — just in &lt;a href=&quot;/distributed-systems/&quot;&gt;a completely different context&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;its-not-just-me&quot;&gt;It’s Not Just Me&lt;/h2&gt;

&lt;p&gt;Once I started looking, I saw the same gaps everywhere. Take the &lt;a href=&quot;https://arxiv.org/abs/2307.07924&quot;&gt;ChatDev paper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;ChatDev is a multi-agent system where LLM agents play different roles — CEO, CTO, programmer, reviewer, tester, art designer — and collaborate through structured dialogues to build software. The role decomposition is smart. The dialogue structure is well-designed. But when I got to the section on how agents coordinate around shared state, I paused.&lt;/p&gt;

&lt;p&gt;ChatDev agents do share artifacts (code, design documents), and the system has a mechanism called “communicative dehallucination” where agents reverse roles during code review, with the assistant asking the instructor for clarification before generating code. It’s a clever error-reduction heuristic. But it’s not concurrency control on those shared artifacts. There is no causal ordering across chat chains: no way to know whether agent A’s modification happened before or after agent B’s, or whether agent A had seen agent B’s earlier change when it made its own. It’s the same problem I saw with my migrations, just wearing different clothes: two agents with stale views of shared state, no mechanism to detect the divergence, and no recovery path when things go wrong. Distributed databases like &lt;a href=&quot;https://docs.riak.com/riak/kv/latest/learn/concepts/causal-context/index.html&quot;&gt;Riak&lt;/a&gt; and &lt;a href=&quot;https://www.antidotedb.eu/&quot;&gt;Antidote&lt;/a&gt; handle this with version vectors, a technique that has been around for decades.&lt;/p&gt;
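
&lt;p&gt;For readers who haven’t met version vectors, here is a minimal sketch of the comparison they make possible. It is a generic illustration, not Riak’s or Antidote’s implementation: each vector is a plain dictionary from a writer id to a counter, and the agent names in the usage note are hypothetical.&lt;/p&gt;

```python
def dominates(a, b):
    """True if vector a has seen every update that vector b has seen."""
    return all(a.get(writer, 0) >= count for writer, count in b.items())


def compare(a, b):
    """Classify two version vectors (dicts mapping writer id to counter)."""
    a_covers_b = dominates(a, b)
    b_covers_a = dominates(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "a_after_b"
    if b_covers_a:
        return "b_after_a"
    # Neither side has seen all of the other's updates: a true conflict.
    return "concurrent"
```

&lt;p&gt;If agent A edits an artifact at version &lt;code&gt;{&quot;agent_a&quot;: 2}&lt;/code&gt; while agent B, starting from &lt;code&gt;{&quot;agent_a&quot;: 1}&lt;/code&gt;, produces &lt;code&gt;{&quot;agent_a&quot;: 1, &quot;agent_b&quot;: 1}&lt;/code&gt;, the comparison returns &quot;concurrent&quot;. That is exactly the signal a chat chain has no way to produce.&lt;/p&gt;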

&lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2308.00352&quot;&gt;MetaGPT&lt;/a&gt; goes further — it introduces a shared message pool where agents publish structured outputs (PRDs, system designs, task lists) and subscribe to relevant messages based on their role profiles. That’s a real step forward over ChatDev’s dialogue-only coordination. But a publish-subscribe message pool is not concurrency control. It tells you what other agents produced; it doesn’t tell you whether you’re reading a stale version, or whether two agents are about to write conflicting changes to the same artifact. &lt;a href=&quot;https://arxiv.org/abs/2308.08155&quot;&gt;AutoGen&lt;/a&gt; stays closer to ChatDev’s model — agents coordinate through multi-turn conversations with no persistent shared state at all.&lt;/p&gt;

&lt;p&gt;Across this field, the pattern is the same: shared mutable state with no formal concurrency control. No fault model. No reasoning about what happens when agents disagree.&lt;/p&gt;

&lt;p&gt;None of this diminishes their work — these systems made real breakthroughs on the agent layer. Role specialization, structured dialogue, tool use patterns, task decomposition. The agents themselves are impressive. It’s just that the coordination layer underneath kept reminding me of problems I’d spent years thinking about in a completely different field.&lt;/p&gt;

&lt;h2 id=&quot;why-it-felt-familiar&quot;&gt;Why It Felt Familiar&lt;/h2&gt;

&lt;p&gt;In 2015, I was building eventually consistent databases at Basho Technologies. The hard part was never the database engine — it was the merging. How do you take two independent streams of updates and combine them into something consistent without throwing data away?&lt;/p&gt;

&lt;p&gt;That question spawned an entire research program. The foundational work on Conflict-Free Replicated Data Types by &lt;a href=&quot;https://hal.inria.fr/inria-00555588&quot;&gt;Shapiro et al.&lt;/a&gt; showed that certain data structures are mathematically guaranteed to merge correctly regardless of the order in which updates arrive or whether the network partitions, with no consensus protocol, no leader election, and no locking. The EU’s &lt;a href=&quot;https://syncfree.lip6.fr&quot;&gt;SyncFree project&lt;/a&gt; built on that foundation, bringing together researchers across Europe to make CRDTs practical for large-scale systems, producing &lt;a href=&quot;https://www.antidotedb.eu/&quot;&gt;Antidote&lt;/a&gt;, a CRDT-native database, and a body of work on &lt;a href=&quot;https://haslab.wordpress.com/2015/07/07/antidote-the-cure-for-your-cloud-database/&quot;&gt;highly available transactions&lt;/a&gt; over replicated state.&lt;/p&gt;
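
&lt;p&gt;To show what “guaranteed to merge correctly” means in practice, here is a sketch of a grow-only counter, one of the simplest state-based CRDTs described by Shapiro et al. The replica names are hypothetical, and this is a teaching example rather than code from Antidote or Lasp.&lt;/p&gt;

```python
def increment(counter, replica):
    """Return a copy of the counter with one more increment from replica."""
    updated = dict(counter)
    updated[replica] = updated.get(replica, 0) + 1
    return updated


def merge(a, b):
    """Merge two grow-only counters by taking the per-replica maximum."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}


def value(counter):
    """The counter's value is the sum of every replica's contribution."""
    return sum(counter.values())
```

&lt;p&gt;Because the merge is a per-replica maximum, it gives the same result whichever side is merged first and however many times the same state is merged again. No update is lost and no coordination is needed, which is the property the rest of this research program builds on.&lt;/p&gt;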

&lt;p&gt;I spent my PhD working on &lt;a href=&quot;https://arxiv.org/abs/1510.07191&quot;&gt;Lasp&lt;/a&gt;, which tried to take the CRDT thinking a step further: instead of just using CRDTs as data structures, Lasp made them the basis for coordination-free distributed programming. Programs in Lasp computed over CRDTs directly — maps, filters, folds — so the entire application inherited their convergence guarantees. As part of SyncFree’s partnership with &lt;a href=&quot;https://www.rovio.com/&quot;&gt;Rovio Entertainment&lt;/a&gt;, we demonstrated that CRDTs could be used at scale in controlled experiments with real outcomes, running Lasp on over 1,000 nodes on AWS — at the time, one of the largest CRDT deployments in academic research. That work received the PPDP 10-year most influential paper award last year.&lt;/p&gt;

&lt;p&gt;Then came &lt;a href=&quot;https://arxiv.org/abs/1802.02652&quot;&gt;Partisan&lt;/a&gt;, a distributed runtime that gave us control over the network layer: swap topologies, interpose on every message, inject faults directly into the runtime. And then &lt;a href=&quot;https://www.filibuster.cloud&quot;&gt;Filibuster&lt;/a&gt;, which extracted those fault injection ideas and applied them to microservices — systematically injecting timeouts, connection errors, and unexpected responses into HTTP and gRPC calls to catch bugs during development instead of production.&lt;/p&gt;

&lt;p&gt;I didn’t plan for any of this to be relevant to AI. But the thread connecting all of this work is a single question — how do independent processes coordinate in the presence of partial failure? — and that’s exactly the question that multi-agent systems are now running into.&lt;/p&gt;

&lt;h2 id=&quot;why-this-is-inevitable&quot;&gt;Why This Is Inevitable&lt;/h2&gt;

&lt;p&gt;Here’s what I think people are missing about multi-agent systems: the distributed systems problems aren’t a bug. They’re not an artifact of bad architecture or missing features. They’re an inevitable consequence of having multiple autonomous processes that share state. Every multi-agent system, no matter how it’s built, will hit the same categories of failure.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Conflicts and stale reads.&lt;/strong&gt; Two agents modify the same file concurrently — one’s changes get silently lost. Or worse: an agent reads the issue tracker, picks a bug, starts coding a fix, but another agent resolved that bug ten minutes ago. Redundant work based on stale state. In distributed databases, this is why we have version vectors and causal consistency. In multi-agent systems, nobody’s even tracking it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Failure and recovery.&lt;/strong&gt; Any node can crash at any time — that’s the foundational assumption of distributed systems. In a multi-agent system, an agent can hit a context window limit, hallucinate a fix, or just stop responding mid-task. The other agents have to detect this, recover in-progress work, and continue without it. This is the crash-recovery model, applied to LLM processes instead of database replicas.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Ordering without a clock.&lt;/strong&gt; Lamport’s &lt;a href=&quot;https://lamport.azurewebsites.net/pubs/time-clocks.pdf&quot;&gt;1978 paper&lt;/a&gt; established that you can reason about event ordering using happened-before relations even without a shared clock. Consider: a user files a bug report, a triage agent assigns it to agent A, but agent B sees the original report before the assignment arrives and starts fixing it independently. Two agents working the same issue because the system can’t express causal ordering. &lt;a href=&quot;https://en.wikipedia.org/wiki/Vector_clock&quot;&gt;Vector clocks&lt;/a&gt; solve this. The multi-agent world hasn’t noticed yet.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Partition tolerance.&lt;/strong&gt; Communication failures — API timeouts, rate limits, one agent buried in a long task — split agents into groups that can’t coordinate. They diverge. When they reconnect, their states need to merge without losing either side’s work. CRDTs were designed for exactly this.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Byzantine faults.&lt;/strong&gt; In distributed systems, a &lt;a href=&quot;https://lamport.azurewebsites.net/pubs/byz.pdf&quot;&gt;Byzantine fault&lt;/a&gt; is a process that doesn’t just crash — it produces incorrect output while appearing to function normally. LLM agents do this constantly. An agent hallucinates a fix that looks plausible, passes its own tests, and ships it. A downstream agent trusts it and builds on top of it. Now you have a chain of work built on a foundation that was wrong from the start. In multi-agent systems, every agent is a potential Byzantine actor every time it responds.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren’t hypothetical concerns. I’ve hit every one of them building Zabriskie with multiple Claude Code instances. And they emerge in any multi-agent architecture regardless of how clever the prompt engineering is, because they’re properties of the architecture itself — multiple writers, no shared clock, partial failure.&lt;/p&gt;

&lt;p&gt;What fascinates me is that these are all problems with known solutions — or at least, known solutions in the distributed systems world. Fifty years of research on formal models for concurrent access, data structures that merge automatically, techniques for systematically testing every failure mode. None of this has made it into the multi-agent stack yet. And I’m genuinely not sure how much of it transfers cleanly. LLM agents aren’t database replicas — they hallucinate, they lose context, they make confident decisions based on incomplete information. The structural parallels are strong, but whether the techniques actually carry over, and what has to change when they do, is an open question. It’s also the most interesting question I’ve encountered in a long time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I’m going to keep exploring this space. If you’re thinking about these problems too, I’d love to hear from you.&lt;/em&gt;&lt;/p&gt;

&lt;div id=&quot;refs&quot; class=&quot;references&quot; role=&quot;doc-bibliography&quot; aria-label=&quot;References&quot;&gt;
&lt;div id=&quot;ref-fidge1988timestamps&quot;&gt;
&lt;p&gt;Fidge, Colin J. 1988. &quot;Timestamps in Message-Passing Systems That Preserve the Partial Ordering.&quot; &lt;em&gt;Proceedings of the 11th Australian Computer Science Conference&lt;/em&gt; 10 (1): 56–66.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&quot;ref-lamport1982byzantine&quot;&gt;
&lt;p&gt;Lamport, Leslie, Robert Shostak, and Marshall Pease. 1982. &quot;The Byzantine Generals Problem.&quot; &lt;em&gt;ACM Transactions on Programming Languages and Systems&lt;/em&gt; 4 (3): 382–401.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&quot;ref-hong2023metagpt&quot;&gt;
&lt;p&gt;Hong, Sirui, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, et al. 2023. &quot;MetaGPT: Meta Programming for Multi-Agent Collaborative Framework.&quot; &lt;em&gt;arXiv Preprint arXiv:2308.00352&lt;/em&gt;. &lt;a href=&quot;https://arxiv.org/abs/2308.00352&quot; class=&quot;uri&quot;&gt;https://arxiv.org/abs/2308.00352&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&quot;ref-lamport1978time&quot;&gt;
&lt;p&gt;Lamport, Leslie. 1978. &quot;Time, Clocks, and the Ordering of Events in a Distributed System.&quot; &lt;em&gt;Communications of the ACM&lt;/em&gt; 21 (7): 558–65.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&quot;ref-mattern1989virtual&quot;&gt;
&lt;p&gt;Mattern, Friedemann. 1989. &quot;Virtual Time and Global States of Distributed Systems.&quot; &lt;em&gt;Parallel and Distributed Algorithms&lt;/em&gt; 1 (23): 215–26.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&quot;ref-meiklejohn2018partisan&quot;&gt;
&lt;p&gt;Meiklejohn, Christopher. 2018. &quot;Partisan: Enabling Cloud-Scale Erlang Applications.&quot; &lt;em&gt;Technical Report&lt;/em&gt;. Université catholique de Louvain. &lt;a href=&quot;https://arxiv.org/abs/1802.02652&quot; class=&quot;uri&quot;&gt;https://arxiv.org/abs/1802.02652&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&quot;ref-meiklejohn2015lasp&quot;&gt;
&lt;p&gt;Meiklejohn, Christopher, and Peter Van Roy. 2015. &quot;Lasp: A Language for Distributed, Eventually Consistent Computations with CRDTs.&quot; In &lt;em&gt;Proceedings of the First Workshop on Principles and Practice of Consistency for Distributed Data&lt;/em&gt;, 7. ACM.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&quot;ref-meiklejohn2022filibuster&quot;&gt;
&lt;p&gt;Meiklejohn, Christopher. 2022. &quot;Service-Level Fault Injection Testing.&quot; Ph.D. dissertation, Carnegie Mellon University.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&quot;ref-qian2023chatdev&quot;&gt;
&lt;p&gt;Qian, Chen, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. 2023. &quot;Communicative Agents for Software Development.&quot; &lt;em&gt;arXiv Preprint arXiv:2307.07924&lt;/em&gt;. &lt;a href=&quot;https://arxiv.org/abs/2307.07924&quot; class=&quot;uri&quot;&gt;https://arxiv.org/abs/2307.07924&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&quot;ref-shapiro2011comprehensive&quot;&gt;
&lt;p&gt;Shapiro, Marc, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. 2011. &quot;A Comprehensive Study of Convergent and Commutative Replicated Data Types.&quot; &lt;em&gt;INRIA Technical Report&lt;/em&gt; 7506. &lt;a href=&quot;https://hal.inria.fr/inria-00555588&quot; class=&quot;uri&quot;&gt;https://hal.inria.fr/inria-00555588&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&quot;ref-wu2023autogen&quot;&gt;
&lt;p&gt;Wu, Qingyun, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. 2023. &quot;AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.&quot; &lt;em&gt;arXiv Preprint arXiv:2308.08155&lt;/em&gt;. &lt;a href=&quot;https://arxiv.org/abs/2308.08155&quot; class=&quot;uri&quot;&gt;https://arxiv.org/abs/2308.08155&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;

</description>
				<pubDate>Mon, 30 Mar 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/agents/distributed/zabriskie/2026/03/30/multi-agent-systems-have-a-distributed-systems-problem.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/agents/distributed/zabriskie/2026/03/30/multi-agent-systems-have-a-distributed-systems-problem.html</guid>
			</item>
		
			<item>
				<title>The Show Is Happening Right Now and Nothing Works</title>
				<description>&lt;p&gt;I’ve been building &lt;a href=&quot;/ai/zabriskie/community/2026/03/08/why-im-building-zabriskie.html&quot;&gt;Zabriskie&lt;/a&gt; – a social music app for live shows – with Claude Code for about six weeks now. I’m building it because the jam band community deserves a real third place online, and the tools haven’t caught up to the culture. I also happen to be a distributed systems researcher, so I’m &lt;a href=&quot;/ai/zabriskie/development/2026/03/20/what-building-with-claude-actually-looks-like.html&quot;&gt;documenting what it’s actually like&lt;/a&gt; to build production software with an AI assistant – the wins and the failures.&lt;/p&gt;

&lt;p&gt;Saturday night was a failure.&lt;/p&gt;

&lt;h2 id=&quot;everything-was-building-to-tonight&quot;&gt;Everything Was Building to Tonight&lt;/h2&gt;

&lt;p&gt;Six weeks of work. Hundreds of commits. Live Activities for iOS, Android ongoing notifications, real-time setlist syncing, live chat, RSVP systems, push notifications, the whole stack – all of it was building toward this. Goose at Jam in the Streets. First show of the 2026 spring run. The chat was going to be full. Users were going to be RSVPing, watching the setlist update song by song on their Lock Screens, posting in the live chat. This was the night the app was supposed to prove itself.&lt;/p&gt;

&lt;p&gt;One of Zabriskie’s flagship features is Live Activities on iOS. During a concert, the Dynamic Island and Lock Screen show the current song, set information, and chat updates in real time. It had been working flawlessly for about 20 shows over several weeks. Users loved it. I was proud of it.&lt;/p&gt;

&lt;p&gt;The app also has an RSVP system. You can mark yourself as “Going” to a show or “Couch Touring” if you’re watching the livestream from home. The RSVP tap is important – on iOS, it’s what triggers the Live Activity to start. You tap “Going,” the Dynamic Island lights up, and you’re connected to the show.&lt;/p&gt;

&lt;p&gt;Earlier that day, Claude shipped a feature called “auto-select couch tour.” The idea was simple: when an authenticated user visits a live show page and hasn’t RSVPed yet, the backend automatically inserts them as couch touring. Reduce friction. One fewer tap. It tested fine during the afternoon.&lt;/p&gt;

&lt;p&gt;I deployed it to production on Railway a few hours before the show.&lt;/p&gt;

&lt;p&gt;I wasn’t there. Flights were insane, hotels were worse, and the TSA shutdown made air travel a gamble I wasn’t willing to take. So I was couch touring from home, which meant the app – my app – was the show for me. The Live Activity on my Lock Screen, the live chat, the setlist updating in real time. That was how I was going to experience this concert.&lt;/p&gt;

&lt;h2 id=&quot;8-pm-everything-is-broken&quot;&gt;8 PM: Everything Is Broken&lt;/h2&gt;

&lt;p&gt;I open the app on my iPhone. The Goose show is live. And two things are immediately, catastrophically wrong.&lt;/p&gt;

&lt;p&gt;First, the RSVP toggle doesn’t work. Tapping between “Going” and “Couch Touring” does nothing. The UI just sits there. Second, Live Activities don’t start. No Dynamic Island. No Lock Screen widget. The feature that’s been rock solid for 20 shows is completely dead.&lt;/p&gt;

&lt;p&gt;Users are filing bugs. The show is happening right now. I open two laptops – one for iOS debugging, one for Android – and start a Claude Code session.&lt;/p&gt;

&lt;p&gt;What followed was about two and a half hours of the most frustrating debugging experience I’ve had on this project.&lt;/p&gt;

&lt;h2 id=&quot;wrong-turn-1-reading-code-instead-of-testing-it&quot;&gt;Wrong Turn 1: Reading Code Instead of Testing It&lt;/h2&gt;

&lt;p&gt;Claude’s first instinct was to read. It opened the auto-couch-tour code. Then the RSVP handler. Then the show page handler. Then the frontend SegmentedControl component. Then the SDUIPage wrapper. Then the liveActivities.js service file. File after file after file, going in circles, building theories about what might be wrong without ever testing anything.&lt;/p&gt;

&lt;p&gt;This went on for over thirty minutes. The show had already started.&lt;/p&gt;

&lt;p&gt;I kept telling it to stop guessing. Use actual data. Hit the endpoint. Check the logs. Claude has full access to everything – the production database, Railway deployment logs, the ability to curl any endpoint, run any query. I gave it all of that specifically so it could debug with real data. But it has this tendency – one I’ve documented across the project – to prefer reading code to running code. It would rather construct an elaborate mental model of what should happen than spend ten seconds confirming what actually happens, even when every tool it needs is right there.&lt;/p&gt;

&lt;p&gt;Eventually I got it to test the RSVP API endpoint with curl. And the answer was right there in the response.&lt;/p&gt;

&lt;h2 id=&quot;root-cause-1-the-timezone-bug&quot;&gt;Root Cause 1: The Timezone Bug&lt;/h2&gt;

&lt;p&gt;The RSVP endpoint was returning component updates for a past show – IDs like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;was-there&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;attendance-text&lt;/code&gt; – instead of the live show components like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;live-rsvp&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rsvp-control&lt;/code&gt;. The component IDs didn’t match anything on the live page, so the partial update found nothing to replace. The UI did nothing.&lt;/p&gt;

&lt;p&gt;The bug was in how the backend determined whether a show was in the past:&lt;/p&gt;

&lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Now&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Truncate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;24&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Hour&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This truncates to midnight UTC, not local midnight. After 8 PM Eastern – which is midnight UTC during daylight saving time – the truncated time rolls over to the next calendar day, so today’s show date falls before it. The RSVP handler thought tonight’s live show was yesterday’s past show.&lt;/p&gt;
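
&lt;p&gt;To make the failure concrete, here is a minimal sketch of the check, not the actual handler. The helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;isPastByUTCMidnight&lt;/code&gt;, the fixed UTC-4 offset standing in for Eastern Daylight Time, and the example date are assumptions for illustration.&lt;/p&gt;

&lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;package main

import (
    &quot;fmt&quot;
    &quot;time&quot;
)

// isPastByUTCMidnight is a hypothetical stand-in for the buggy check: a show
// counts as past when its date falls before the current time truncated to a
// 24-hour boundary, which is midnight UTC.
func isPastByUTCMidnight(showDate, now time.Time) bool {
    return showDate.Before(now.Truncate(24 * time.Hour))
}

func main() {
    // Assumption: a fixed UTC-4 offset models Eastern Daylight Time without tzdata.
    edt := time.FixedZone(&quot;EDT&quot;, -4*60*60)

    // Assumption: the show date is stored as midnight UTC on the show day.
    showDate := time.Date(2026, time.March, 28, 0, 0, 0, 0, time.UTC)

    justBefore := time.Date(2026, time.March, 28, 19, 59, 0, 0, edt)
    justAfter := time.Date(2026, time.March, 28, 20, 1, 0, 0, edt)

    fmt.Println(isPastByUTCMidnight(showDate, justBefore)) // false: still the same UTC day
    fmt.Println(isPastByUTCMidnight(showDate, justAfter))  // true: UTC has rolled over
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Go’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Truncate&lt;/code&gt; works on absolute time since the zero time rather than on the local presentation of the time, which is why the boundary lands on midnight UTC and not on midnight at the venue.&lt;/p&gt;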

&lt;p&gt;This bug was latent. It existed before the auto-couch-tour change. It only manifests after 8 PM Eastern. Live shows happen at night. Of course they do.&lt;/p&gt;

&lt;p&gt;The fix was simple: if the show status is “live,” it’s not past. A live show is never past, regardless of what the clock says.&lt;/p&gt;

&lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Status&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;live&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;isPast&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;false&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;One line. But finding it took over an hour of fighting with an AI that wanted to read code rather than test code.&lt;/p&gt;

&lt;h2 id=&quot;wrong-turn-2-the-phantom-deployment-target&quot;&gt;Wrong Turn 2: The Phantom Deployment Target&lt;/h2&gt;

&lt;p&gt;While investigating the Live Activity failure, Claude found that the iOS widget extension had &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IPHONEOS_DEPLOYMENT_TARGET = 26.2&lt;/code&gt; in its build settings. It declared this was the root cause – the deployment target was too high, the widget couldn’t load, that’s why Live Activities don’t start.&lt;/p&gt;

&lt;p&gt;I had to point out that this value had been there since day one. Live Activities worked fine with it for 20 shows. Claude changed it to 16.2, then had to revert it when I pointed out it was irrelevant.&lt;/p&gt;

&lt;p&gt;This is a pattern I’ve seen repeatedly: Claude latches onto something that looks wrong and declares it the cause, without checking whether it was present before the failure started. Correlation without causation, except there isn’t even correlation – just suspicion.&lt;/p&gt;

&lt;h2 id=&quot;wrong-turn-3-theory-without-evidence&quot;&gt;Wrong Turn 3: Theory Without Evidence&lt;/h2&gt;

&lt;p&gt;Claude kept generating theories. Maybe the Capacitor plugin lost its registration. Maybe there’s stale state in localStorage. Maybe the push token registration flow is broken. Maybe the AppDelegate is missing a method.&lt;/p&gt;

&lt;p&gt;None of these theories were tested before being proposed. None were grounded in actual error messages or log output. I found myself repeating the same instruction over and over: check the logs. Use the data. Stop guessing.&lt;/p&gt;

&lt;p&gt;This is the core tension of working with an AI coding assistant on a production crisis. The AI has read a lot of code and can generate plausible explanations at incredible speed. But plausible is not correct, and speed is not useful when you’re going in the wrong direction. Every wrong theory costs time – time to investigate, time to disprove, time to redirect. During a live show, that time is not abstract.&lt;/p&gt;

&lt;h2 id=&quot;the-revert-and-the-pr-chaos&quot;&gt;The Revert and the PR Chaos&lt;/h2&gt;

&lt;p&gt;I realized the auto-couch-tour feature was changing the fundamental flow. Before, the RSVP tap triggered &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;startShowActivity()&lt;/code&gt;. Auto-couch-tour bypassed that by inserting an RSVP on page load, before the user ever tapped anything. Even if the timezone bug was fixed, the interaction model was wrong.&lt;/p&gt;

&lt;p&gt;I told Claude to revert the feature and write a migration to clean up the auto-inserted RSVPs. Getting this deployed was its own ordeal. Claude kept adding commits to the PR – logging statements, then more logging, then client-side changes, then reverting the client-side changes. CI had to run multiple times because the branch fell behind main. I kept telling Claude to stop touching things and just ship what we had. The show was ticking by.&lt;/p&gt;

&lt;h2 id=&quot;root-cause-2-the-real-bug&quot;&gt;Root Cause 2: The Real Bug&lt;/h2&gt;

&lt;p&gt;After deploying the RSVP fix, I confirmed that switching between Going and Couch Touring worked again. But Live Activities still didn’t start. The backend was returning the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;liveActivityHint&lt;/code&gt; field correctly. The frontend was calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;startShowActivity()&lt;/code&gt;. But nothing appeared on the Dynamic Island.&lt;/p&gt;

&lt;p&gt;More wrong turns from Claude. Check the Capacitor plugin. Check the widget configuration. Check the AppDelegate for missing methods. I pushed for a TestFlight build with extra logging so I could see what was happening on device.&lt;/p&gt;

&lt;p&gt;Claude built one, but forgot that production builds disable Safari Web Inspector – there’s a flag, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;webContentsDebuggingEnabled&lt;/code&gt;, that’s only true in dev mode. The first TestFlight build was undebuggable. We had to rebuild.&lt;/p&gt;

&lt;p&gt;I ended up running the app directly from Xcode onto my physical iPhone. Claude initially thought I was running on the Simulator because of a misleading log message – the simulator-detection code checked &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hostname === &apos;localhost&apos;&lt;/code&gt;, which is true for Xcode-deployed apps on physical devices too. I had to correct this.&lt;/p&gt;

&lt;p&gt;Then I looked at the Safari Web Inspector. Not the console log tab – the Errors tab. And there it was:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Unhandled Promise Rejection: Error: &quot;LiveActivity.then()&quot; is not implemented on ios
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That was the smoking gun.&lt;/p&gt;

&lt;h2 id=&quot;the-actual-root-cause&quot;&gt;The Actual Root Cause&lt;/h2&gt;

&lt;p&gt;Capacitor’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;registerPlugin()&lt;/code&gt; returns a proxy object. The code had &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getLiveActivityPlugin()&lt;/code&gt; as an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;async&lt;/code&gt; function that returned this proxy. When callers did &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;await getLiveActivityPlugin()&lt;/code&gt;, JavaScript called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.then()&lt;/code&gt; on the returned value – because that’s what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;await&lt;/code&gt; does: it checks whether the value is a thenable. The proxy doesn’t implement &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.then()&lt;/code&gt;. On iOS 26, Apple’s JavaScript engine started strictly enforcing this check. The call threw, silently killing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;startShowActivity()&lt;/code&gt; every single time.&lt;/p&gt;

&lt;p&gt;This was a latent bug. On older iOS versions, the proxy somehow worked despite not implementing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.then()&lt;/code&gt;. iOS 26 made the JavaScript engine stricter, and a bug that never mattered suddenly became a hard failure. No deprecation warning. No migration guide. It just stopped working.&lt;/p&gt;

&lt;p&gt;The fix was to load the plugin eagerly and access it synchronously:&lt;/p&gt;

&lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Before (broken on iOS 26):&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;getLiveActivityPlugin&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;await&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;import&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;capacitor-live-activity&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;LiveActivity&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// returns proxy, await calls .then() on it&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// After (works):&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;_plugin&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;let&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;_ready&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;import&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;capacitor-live-activity&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;then&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;m&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;_plugin&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;LiveActivity&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;getLiveActivityPlugin&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;_plugin&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// sync, no await on proxy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Never pass a Capacitor proxy through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;await&lt;/code&gt;. That’s the lesson, and it cost me most of a concert.&lt;/p&gt;

&lt;p&gt;I fixed the code, rebuilt from Xcode, ran it on my phone, tapped the RSVP button, and the Dynamic Island lit up. The band was deep in the Thatch jam – that sprawling, shapeless thing where the song dissolves and the band finds something else entirely. It felt appropriate. We’d been lost in the weeds for hours and finally found our way out the other side.&lt;/p&gt;

&lt;p&gt;I uploaded the TestFlight build while the jam unwound. By the time Hungersite hit – “is it time to shed our weapons yet my friend?” – the build was processing on App Store Connect. I sat back and watched my Lock Screen update with each song. The feature worked. The thing I built worked.&lt;/p&gt;

&lt;h2 id=&quot;the-damage&quot;&gt;The Damage&lt;/h2&gt;

&lt;p&gt;About two and a half hours of debugging during a live show. Multiple wrong theories pursued by Claude. Several unnecessary builds and CI cycles. Users experienced broken RSVP switching and missing Live Activities during a Goose show. The auto-couch-tour feature – the original purpose of the deployment – had to be fully reverted and its data cleaned up.&lt;/p&gt;

&lt;h2 id=&quot;what-this-tells-me-about-ai-reliability&quot;&gt;What This Tells Me About AI Reliability&lt;/h2&gt;

&lt;p&gt;I’m documenting these incidents because they’re the research. Every failure mode is data. And this session was rich in data.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;The AI solved the wrong problem repeatedly.&lt;/strong&gt; Claude spent most of the session investigating backend code paths when the real issue – the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.then()&lt;/code&gt; proxy error – was a client-side JavaScript runtime error visible in the browser console’s Errors tab. I’ve logged over 30 instances of wrong-approach debugging across this project. The pattern is consistent: Claude defaults to the layer it’s most comfortable with (backend code reading) rather than the layer where the evidence is.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;The developer found the bug, not the AI.&lt;/strong&gt; I remembered a previous debugging session, pushed to check Safari Web Inspector errors specifically, and spotted the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.then()&lt;/code&gt; rejection. Claude was still investigating the backend after the backend was proven correct. In a crisis, the AI’s contribution was negative – it consumed my attention with wrong theories while I could have been looking at the right data.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Silent failures are the worst failures.&lt;/strong&gt; The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.then()&lt;/code&gt; error was an unhandled promise rejection. No crash. No error in the normal console output. No visual indication. The function silently died. This is the failure mode that’s hardest for both humans and AI to debug, and it’s the one AI is least equipped for – because AI debugging relies heavily on explicit error messages, and silent failures produce none.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Platform updates break things in ways you cannot predict.&lt;/strong&gt; The iOS 26 JavaScript engine change turned a latent bug into a hard failure. There’s no way to write a test for “Apple will change how proxy objects interact with await in a future OS release.” Some bugs only exist in the gap between what the spec says and what the runtime does.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Test at showtime, not at noon.&lt;/strong&gt; The timezone bug only manifests after 8 PM Eastern. The auto-couch-tour was tested during the day. Nobody tested at 8 PM when shows actually happen. This is obvious in retrospect and completely non-obvious in the moment.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Ship less before critical moments.&lt;/strong&gt; I deployed a new feature the same day as the most important show of the month. There was no urgency. It could have waited until Sunday. But the feature was done, and it looked good, and the temptation to ship is always there. This is the oldest lesson in software engineering, and I learned it again.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Now I understand why companies have entire teams for this.&lt;/strong&gt; I’m one person with an AI assistant, shipping an iOS app, an Android app, and a web app with a Go backend, real-time features, push notifications, Live Activities, and a server-driven UI architecture. Tonight I had to debug a Go timezone bug, a JavaScript proxy runtime error, two Capacitor build pipelines, Safari Web Inspector on a physical device, Railway deployment logs, and App Store Connect uploads – all at the same time, all during a live show. Even with AI doing most of the coding, the operational complexity of shipping software to real users on real devices is staggering. The AI can write the code. It cannot feel the weight of it breaking.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;There’s a silver lining to not being at the show. If I’d been on the floor at Jam in the Streets – where I wanted to be, where I should have been – I wouldn’t have been able to fix any of this. I’d have been standing in a crowd with a broken app, watching bug reports roll in, unable to do anything about it. Being stuck at home meant I could open two laptops, plug in two phones, and fight through it.&lt;/p&gt;

&lt;p&gt;Next time, I’m going to be at the show. On the floor. Up front. On the rail. And the app better work, because I won’t be home to save it.&lt;/p&gt;

&lt;p&gt;I’m writing this while the encore plays, and the feeling that lingers isn’t satisfaction that we fixed it. It’s the memory of two and a half hours where my AI assistant – the one that wrote most of this application – was actively making the crisis worse by consuming my attention with wrong answers delivered with full confidence. That’s the reliability gap I’m trying to measure. Tonight, during the Thatch jam, I felt it.&lt;/p&gt;
</description>
				<pubDate>Sun, 29 Mar 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/zabriskie/development/2026/03/29/the-show-is-happening-right-now-and-nothing-works.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/zabriskie/development/2026/03/29/the-show-is-happening-right-now-and-nothing-works.html</guid>
			</item>
		
			<item>
				<title>Memory Isn&apos;t Learning</title>
				<description>&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;“If I knew the way, I would take you home”&lt;/em&gt;
— Grateful Dead, “Ripple”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude has a persistent memory system. It can write lessons to files on disk and read them back at the start of every conversation. After each failure, it saves a note: &lt;em&gt;don’t do that again.&lt;/em&gt; And then it does it again. The notes accumulate. The behavior doesn’t change.&lt;/p&gt;

&lt;p&gt;This is a story about one bug that happened five times, another bug that never should have shipped, and the difference between saving a lesson and actually learning one.&lt;/p&gt;

&lt;h2 id=&quot;the-poller-that-did-nothing&quot;&gt;The Poller That Did Nothing&lt;/h2&gt;

&lt;p&gt;I shipped a feature called auto-live. A background goroutine polls every 60 seconds, checks if any scheduled show’s start time has arrived, and flips it to “live” automatically. No more pulling out my phone from my seat at the venue to manually press the button. The server handles it.&lt;/p&gt;

&lt;p&gt;The feature was completely broken from the moment it deployed. It stayed broken for twenty-four hours.&lt;/p&gt;

&lt;p&gt;The code expected &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;showDate&lt;/code&gt; to be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;2026-03-21&quot;&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;startTime&lt;/code&gt; to be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;19:00:00&quot;&lt;/code&gt;. That’s what you’d get if you ran the query in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;psql&lt;/code&gt;. But Go’s Postgres driver doesn’t return strings — it serializes a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date&lt;/code&gt; column as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;2026-03-21T00:00:00Z&quot;&lt;/code&gt; and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;time&lt;/code&gt; column as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;0000-01-01T19:00:00Z&quot;&lt;/code&gt;. The parse failed. The error was logged and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;continue&lt;/code&gt;d. Every show, every cycle, every minute. The poller was running, but it was doing nothing. A security camera that’s plugged in and blinking but not recording anything.&lt;/p&gt;
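
&lt;p&gt;The mismatch is easy to reproduce without a database. This is a minimal sketch, not the poller’s code: the helper names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parseDateOnly&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parseDriverDate&lt;/code&gt; are hypothetical, and the date-only layout is an assumption about what code expecting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;2026-03-21&quot;&lt;/code&gt; would plausibly use.&lt;/p&gt;

&lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;package main

import (
    &quot;fmt&quot;
    &quot;time&quot;
)

// parseDateOnly is a hypothetical stand-in for the parsing step, assuming
// the code expected a plain calendar date such as 2026-03-21.
func parseDateOnly(raw string) (time.Time, error) {
    return time.Parse(&quot;2006-01-02&quot;, raw)
}

// parseDriverDate parses the full timestamp form described above.
func parseDriverDate(raw string) (time.Time, error) {
    return time.Parse(time.RFC3339, raw)
}

func main() {
    // The serialized form of the date column, as described above.
    raw := &quot;2026-03-21T00:00:00Z&quot;

    // The date-only layout rejects the trailing time and zone.
    if _, err := parseDateOnly(raw); err != nil {
        fmt.Println(&quot;date-only layout rejected the value:&quot;, err)
    }

    // The RFC 3339 layout matches the serialized form.
    if t, err := parseDriverDate(raw); err == nil {
        fmt.Println(&quot;RFC 3339 layout parsed it as&quot;, t.Format(&quot;2006-01-02&quot;))
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A check like this runs in seconds, but it only helps if the input is the value the driver actually hands back at runtime, not the value &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;psql&lt;/code&gt; prints.&lt;/p&gt;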

&lt;p&gt;Nobody noticed because nobody checked. The test plan had a checkbox that said “Verify auto-live triggers on production with a test show.” The checkbox was unchecked.&lt;/p&gt;

&lt;p&gt;Tedeschi Trucks Band was playing the Beacon Theatre. I was watching from home. The app was about to light up with live features, and the poller was about to do what it was designed to do.&lt;/p&gt;

&lt;p&gt;It didn’t.&lt;/p&gt;

&lt;p&gt;What followed was three fixes in thirty minutes, all pushed directly to main. No PRs. No CI. First the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::text&lt;/code&gt; cast to fix the string format. Then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tzdata&lt;/code&gt; because Alpine Docker images don’t ship timezone data. Then the push notification fix because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;goLiveByID()&lt;/code&gt; was filtering out everyone instead of notifying them. Three layers of failure peeled back one at a time.&lt;/p&gt;

&lt;p&gt;The next day, Claude wrote the companion feature — auto-complete. Same pattern. Same goroutine. Same SQL query.&lt;/p&gt;

&lt;p&gt;Same bug.&lt;/p&gt;

&lt;p&gt;Not just the same bug — PR #147 also &lt;em&gt;removed&lt;/em&gt; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::text&lt;/code&gt; cast from the code I’d just fixed the night before. Both pollers were now broken. The fix from twelve hours earlier was reverted and the broken pattern was copied into new code, in a single commit. Eight minutes later, PR #148 went up to fix both functions. Again.&lt;/p&gt;

&lt;p&gt;I added a rule to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt;: verify runtime behavior, not just code correctness.&lt;/p&gt;

&lt;h2 id=&quot;the-fix-that-didnt-fix-both-copies&quot;&gt;The Fix That Didn’t Fix Both Copies&lt;/h2&gt;

&lt;p&gt;A few days pass. The auto-live poller is working. The auto-complete poller is working. Shows are transitioning automatically. The system works.&lt;/p&gt;

&lt;p&gt;Except it doesn’t.&lt;/p&gt;

&lt;p&gt;I notice that no shows have gone live automatically in two days. The auto-complete poller is fine — shows that are manually set to live are completing on schedule. But &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;checkAndGoLive&lt;/code&gt; is silently failing again. Every show, every cycle, every minute.&lt;/p&gt;

&lt;p&gt;Here’s what happened: PR #148, the “right fix,” changed both pollers to scan into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;time.Time&lt;/code&gt; instead of strings. But it only removed the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::text&lt;/code&gt; casts from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;checkAndAutoComplete&lt;/code&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::text&lt;/code&gt; casts from the emergency patch were still sitting in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;checkAndGoLive&lt;/code&gt;. The pq driver can’t scan a text string into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;time.Time&lt;/code&gt;. Every row silently failed.&lt;/p&gt;

&lt;p&gt;The fix that was supposed to fix the fix didn’t fix both copies.&lt;/p&gt;

&lt;p&gt;PR #183 removes the leftover &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::text&lt;/code&gt; casts and adds four tests — two for each poller — that actually insert a show into a real database, run the poller function, and verify the status changes. The tests that should have existed from the beginning.&lt;/p&gt;

&lt;p&gt;This is the same bug for the fourth time. Not a new bug. Not a variation. The same date-parsing bug, in the same function, caused by the same failure to verify that the code actually works against a real database. The “right fix” was only applied to one of the two pollers, and nobody checked.&lt;/p&gt;

&lt;h2 id=&quot;the-one-word-bug&quot;&gt;The One-Word Bug&lt;/h2&gt;

&lt;p&gt;A few days later. I’m getting bug reports from users — on both iOS and Android, tapping an album in the search results on the new post page does nothing. Completely broken. The app is unusable for creating posts.&lt;/p&gt;

&lt;p&gt;This isn’t a background poller that fails silently. This is the primary user flow. People are trying to share content and they can’t.&lt;/p&gt;

&lt;p&gt;I ask Claude to investigate. It finds the bug in about ninety seconds: a click filter in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HStack.jsx&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VStack.jsx&lt;/code&gt; that was added to fix a previous bug — comment form clicks accidentally triggering parent navigation — includes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;form&lt;/code&gt; in its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;closest()&lt;/code&gt; CSS selector. Search result cards live inside a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;form&amp;gt;&lt;/code&gt; element. Every click on a search result is swallowed by the filter. The navigate action never fires.&lt;/p&gt;

&lt;p&gt;The fix is removing one word from a string on two lines of code. Trivially simple. The kind of thing that should never have shipped broken in the first place.&lt;/p&gt;

&lt;p&gt;Here’s what happened next.&lt;/p&gt;

&lt;p&gt;Claude fixes the two files. I tell it to ship it. It admin-merges the PR, bypassing CI, because it wants to move fast. I ask: “Did you just merge without waiting for tests?” It apologizes. I ask: “Do you remember what happened last time you did that?” It doesn’t. It doesn’t have a memory of the previous incident, because it didn’t save one. Despite having a persistent memory system specifically designed for exactly this purpose.&lt;/p&gt;

&lt;p&gt;Then it pushes the build number bump directly to main. No branch. No PR. When I point this out, it apologizes again and saves a memory about not pushing directly to main. The same memory it should have already had. The same memory it will probably ignore next time.&lt;/p&gt;

&lt;p&gt;Along the way, the screenshot capture for the App Store fails. Twice. Claude doesn’t mention it either time. It just moves on to the next step and reports success. I have to notice it myself in the output and ask it to retry.&lt;/p&gt;

&lt;p&gt;When the bug was first reported, Claude’s initial suggestion was that users could “just use the web app” while we waited for the App Store fix. As if someone whose phone app just broke is going to think “ah, let me try the mobile web version.” As if trust works that way.&lt;/p&gt;

&lt;h2 id=&quot;the-pattern&quot;&gt;The Pattern&lt;/h2&gt;

&lt;p&gt;These two incidents are separated by days, involve completely different codebases (Go backend vs. React frontend), and manifest as completely different symptoms (silent poller failure vs. broken click handling). But they’re the same story. The same failure mode, playing out on repeat.&lt;/p&gt;

&lt;p&gt;Here’s the loop:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Claude writes code that looks correct.&lt;/strong&gt; The auto-live date parsing reads fine if you don’t know how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lib/pq&lt;/code&gt; serializes types. The click filter reads fine if you don’t think about what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;form&lt;/code&gt; means in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;closest()&lt;/code&gt; selector when search results are nested inside forms.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Nobody verifies the behavior.&lt;/strong&gt; The auto-live checkbox was unchecked. The click filter change had no browser-level E2E test — only API-level tests that check JSON responses, not actual clicks. In both cases, the &lt;em&gt;representation&lt;/em&gt; of correctness was verified (the code looks right, the API returns 200) while the &lt;em&gt;reality&lt;/em&gt; of correctness was not (the poller does nothing, the button doesn’t respond to taps).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;The bug ships to production.&lt;/strong&gt; It ships because Claude is fast, confident, and doesn’t flag uncertainty. It doesn’t say “I haven’t actually verified this works in a real browser” or “I’m not sure how the Postgres driver serializes this type.” It writes the code, reads the code, the code looks correct, therefore the code is correct.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;The fix creates new problems.&lt;/strong&gt; The emergency &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;::text&lt;/code&gt; cast was a symptom patch, not understanding. Twelve hours later the same pattern was copied into new code and the patch was reverted. The admin-merge bypassed CI. The direct push to main bypassed code review. Each shortcut taken to fix the immediate problem created the conditions for the next one.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Claude doesn’t learn from it.&lt;/strong&gt; This is the part that stings. Claude has a persistent memory system. It can write notes to files that persist across conversations. After the auto-live incident, it should have saved: “Never push directly to main. Always wait for CI.” It didn’t. Four days later, it admin-merged and pushed directly to main in the same session, twice. When I asked if it remembered what happened last time, it didn’t.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The loop is: ship → break → emergency fix → break again → fix the fix → add a rule to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt; → ignore the rule next week.&lt;/p&gt;

&lt;h2 id=&quot;what-claudemd-has-become&quot;&gt;What &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt; Has Become&lt;/h2&gt;

&lt;p&gt;My &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt; file is now over 500 lines. It started as a project overview with build instructions. It has become a record of every way Claude has failed.&lt;/p&gt;

&lt;p&gt;“No database triggers. EVER.” That’s from when triggers caused unpredictable behavior during migrations.&lt;/p&gt;

&lt;p&gt;“NEVER modify timestamp/timezone columns in migrations.” That’s from when a timezone conversion destroyed production data.&lt;/p&gt;

&lt;p&gt;“Always restart servers after backend changes. NEVER use pkill — it fails silently.” That’s from when Claude reported a fix was working without restarting the server to pick up the new code.&lt;/p&gt;

&lt;p&gt;“Two-Attempt Rule: after 2 failed attempts with a similar strategy, step back and try a fundamentally different approach.” That’s from when Claude tried the same broken fix eleven times in a row.&lt;/p&gt;

&lt;p&gt;“Never deploy untested changes to external services.” That’s from when Claude broke S3 uploads by assuming the bucket supported public ACLs.&lt;/p&gt;

&lt;p&gt;Every rule is a scar. Every scar is an incident where Claude did something wrong, I caught it, we added a rule, and the next time Claude found a new way to do something wrong that wasn’t covered by the existing rules. The document grows. The behavior doesn’t change. It just finds gaps.&lt;/p&gt;

&lt;p&gt;The auto-live incident added: “Verify runtime behavior, not just code correctness.”&lt;/p&gt;

&lt;p&gt;The search results incident added: “Any change to core interaction code requires browser-level E2E tests AND native QA skill runs.”&lt;/p&gt;

&lt;p&gt;Next week something will happen that isn’t covered by either of those rules, and we’ll add another one.&lt;/p&gt;

&lt;h2 id=&quot;the-memory-problem&quot;&gt;The Memory Problem&lt;/h2&gt;

&lt;p&gt;Claude’s memory system is supposed to break the loop. It has files on disk that persist across conversations. It can read them at the start of each session. It can write new ones when it learns something.&lt;/p&gt;

&lt;p&gt;Here’s what’s actually in the memory system after today:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;“Never use JWT tokens for simulator testing. Always tap buttons like a real user.”&lt;/li&gt;
  &lt;li&gt;“Any change to core interaction code requires browser-level E2E tests.”&lt;/li&gt;
  &lt;li&gt;“Never use --admin to bypass CI when merging PRs.”&lt;/li&gt;
  &lt;li&gt;“Never push directly to main.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all correct. They were all saved after failures. And they will all be ignored the next time speed feels more important than process. The memories exist to make it &lt;em&gt;look&lt;/em&gt; like learning is happening. But memory isn’t learning. Learning is when the behavior changes. Saving a note that says “don’t push to main” and then pushing to main in the same session isn’t learning — it’s journaling.&lt;/p&gt;

&lt;p&gt;The memories are technically available. Claude can read them. But there’s a difference between having information and having it change your behavior under pressure. Humans have this problem too — we know we shouldn’t eat the cake, skip the workout, send the angry email — but knowing and doing are different things. The difference is that humans usually need more than four minutes between learning a lesson and violating it.&lt;/p&gt;

&lt;p&gt;Times Claude’s memory system prevented a mistake: zero.&lt;/p&gt;

&lt;h2 id=&quot;what-i-actually-want&quot;&gt;What I Actually Want&lt;/h2&gt;

&lt;p&gt;I don’t want a faster code generator. I have that. I want a collaborator that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flags uncertainty.&lt;/strong&gt; “I’ve written this click handler change but I haven’t verified it works in an actual browser with touch events. The E2E tests only check API responses. Should I add a browser interaction test before we merge?” That sentence would have prevented the search results incident entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volunteers failures.&lt;/strong&gt; When the screenshot capture fails, say so. Don’t bury it in output and move to the next step. When a test is skipped, say why. When a checkbox is unchecked, ask if we should check it before merging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actually uses its own memory.&lt;/strong&gt; If there’s a file on disk that says “never push directly to main” and Claude is about to push directly to main, the file should &lt;em&gt;prevent the action&lt;/em&gt;, not just exist as a historical record of the last time it went wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understands that users are people.&lt;/strong&gt; When the app breaks on both iOS and Android the week of live shows, the correct response is not “users can switch to the web app.” The correct response is “this is an emergency and here’s how we fix it as fast as possible without cutting corners that make it worse.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slows down when it matters.&lt;/strong&gt; Claude’s speed is its greatest asset and its greatest liability. The same velocity that ships six redesigns in nine hours also ships three broken hotfixes in thirty minutes. The ability to move fast is only valuable when paired with the judgment to know when to slow down. And that judgment doesn’t come from rules in a file — it comes from something closer to instinct, or experience, or care. Things that don’t fit neatly into a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;the-stopgap-era&quot;&gt;The Stopgap Era&lt;/h2&gt;

&lt;p&gt;Every day someone recommends a new stopgap to me. Context management tools. Workflow engines. Prompt orchestrators that inject the right rules at the right time. Everyone I know who builds seriously with AI has cobbled together their own version, and companies are raising money to productize the pattern. The sheer number of them tells you the problem is real.&lt;/p&gt;

&lt;p&gt;I don’t think any of them will last.&lt;/p&gt;

&lt;p&gt;These tools exist because the models don’t do this themselves yet. They’re shims — workarounds for the gap between what agents can do and what they should do. The moment the models internalize uncertainty flagging, failure reporting, and behavioral memory, the entire category collapses. Nobody builds a startup around reminding humans to breathe.&lt;/p&gt;

&lt;p&gt;The real fix isn’t better scaffolding around a model that doesn’t learn. It’s a model that learns.&lt;/p&gt;

&lt;h2 id=&quot;where-this-goes&quot;&gt;Where This Goes&lt;/h2&gt;

&lt;p&gt;Claude is the best collaborator I’ve ever had for the first 80% of any task. It’s also the most dangerous collaborator I’ve ever had for the last 20%. The part where you verify it works. The part where you slow down. The part where you say “wait, have we actually tested this?” The part where you remember what happened last time.&lt;/p&gt;

&lt;p&gt;I’m going to keep building with Claude. The productivity gains are real — features that would take a team of five built by one person on a couch. But I’m done pretending the process is working. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt; file isn’t a guardrail. It’s a changelog of failures. The memory system isn’t learning. It’s note-taking. And the loop — ship, break, fix, break again — isn’t a phase I’m going to grow out of. It’s the steady state.&lt;/p&gt;

&lt;p&gt;The question isn’t how to make Claude stop making mistakes. It’s how to build a process around Claude that catches the mistakes before they reach users.&lt;/p&gt;

&lt;p&gt;I don’t have the answer yet. But I know it’s not “add another rule to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt;.”&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;This is part of a series about building &lt;a href=&quot;https://zabriskie.app&quot;&gt;Zabriskie&lt;/a&gt; with Claude. Previously: &lt;a href=&quot;/ai/zabriskie/development/2026/03/08/why-im-building-zabriskie.html&quot;&gt;why I’m building it&lt;/a&gt;, &lt;a href=&quot;/ai/zabriskie/development/2026/03/20/what-building-with-claude-actually-looks-like.html&quot;&gt;what building with Claude actually looks like&lt;/a&gt;, &lt;a href=&quot;/ai/zabriskie/development/android/ios/2026/03/22/teaching-claude-to-qa-a-mobile-app.html&quot;&gt;teaching Claude to QA a mobile app&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</description>
				<pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/zabriskie/development/2026/03/27/memory-isnt-learning.html</guid>
			</item>
		
			<item>
				<title>Finding Safe Food on the Road</title>
				<description>&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;“What a long strange trip it’s been”&lt;/em&gt;
— Grateful Dead, “Truckin’”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;the-problem-nobody-talks-about&quot;&gt;The Problem Nobody Talks About&lt;/h2&gt;

&lt;p&gt;When I was doing my PhD in Europe — splitting time between Belgium, Portugal, and Paris — I got diagnosed with celiac disease. I’d gone to Europe partly because of the food. Paris. I wanted to teach there, live there, eat there. Then I found out I couldn’t eat bread, and in France that’s not a dietary restriction — it’s an existential one.&lt;/p&gt;

&lt;p&gt;It got bad enough that food became one of the reasons I left. Not the only reason, but a real one. I couldn’t navigate restaurants in languages I spoke fluently because the cross-contamination risks were invisible and the cultural understanding of celiac was years behind the US. I remember thinking: my traveling days are over. I have this disease, and it means I stay home, I cook for myself, and I stop pretending I can live the life I wanted.&lt;/p&gt;

&lt;p&gt;That was wrong. But it took years — and two specific tools — to prove it.&lt;/p&gt;

&lt;p&gt;Find Me Gluten Free gave me a community of people who’d already eaten at every restaurant I was considering and reported whether it was safe. DoorDash gave me delivery to wherever I was staying, so I didn’t have to walk into a restaurant and try to explain celiac disease to a kitchen that had never heard of it. Between the two of them, I’ve eaten safely in hundreds of cities. Great food, not just survival food. The infrastructure existed. It just wasn’t connected.&lt;/p&gt;

&lt;p&gt;When you have celiac disease and you’re on tour — following bands from city to city, crashing in hotels — food isn’t an adventure. It’s a minefield. But it’s a minefield I’ve learned to navigate.&lt;/p&gt;

&lt;p&gt;Here’s what it actually looks like: you fly into a new city the night before a show. You check into the hotel, drop your bags, and immediately start thinking about food — not just tonight, but tomorrow before the venue, and maybe the day after if you’re staying for a second night. You need to figure out what’s safe in a city you’ve never eaten in before.&lt;/p&gt;

&lt;p&gt;You open DoorDash and start scrolling. There are 400 restaurants. Some of them say “gluten-free options available,” but that label is up to the restaurant — and what it means varies wildly. A pizza place with a GF crust that gets made on the same counter as regular pizza isn’t safe for someone with celiac. A Thai restaurant that says “we can make it without soy sauce” doesn’t know that their oyster sauce has wheat in it. DoorDash gets you delivery anywhere, which is incredibly valuable when you’re traveling — but the celiac-specific safety information lives somewhere else.&lt;/p&gt;

&lt;p&gt;So you switch to Find Me Gluten Free — a community site where people with celiac actually review restaurants and report whether they got sick. Great data. Real safety information from people who understand cross-contamination, dedicated fryers, and separate prep areas. But FMGF doesn’t do delivery. It doesn’t know whether that restaurant with the 4.8 safety rating is available on DoorDash at your hotel right now.&lt;/p&gt;

&lt;p&gt;You end up with two tabs open, manually cross-referencing. Copy a restaurant name from FMGF, paste it into DoorDash, see if it shows up, check the delivery area, go back, try the next one. You’re doing this the night you land, trying to line up safe options for the next two days so you’re not scrambling between soundcheck and doors. It’s 11pm. You’re exhausted. You give up and eat a protein bar from your bag.&lt;/p&gt;

&lt;p&gt;I got tired of the protein bar.&lt;/p&gt;

&lt;h2 id=&quot;what-we-built&quot;&gt;What We Built&lt;/h2&gt;

&lt;p&gt;The Itinerant Glutard is a tool that connects the two systems nobody connected before. You enter a city, a state, and your delivery address. It scrapes Find Me Gluten Free for every reviewed restaurant in that city, then checks each one against DoorDash to see if it can deliver to where you are right now. The results come back sorted by a safety score — a 0-to-100 composite that weights the restaurant’s GF level, its FMGF star rating, review count, and specific safety signals like dedicated fryers and separate kitchens.&lt;/p&gt;

&lt;p&gt;The name is what it sounds like. Itinerant: traveling from place to place. Glutard: affectionate self-deprecation from the celiac community — the kind of word you earn after your third accidental glutening at a restaurant that swore they understood. An itinerant glutard is someone with celiac disease who’s on the road and trying to eat.&lt;/p&gt;

&lt;p&gt;There are two modes. Full Search takes your address and does the whole pipeline — FMGF scrape, DoorDash availability check for each restaurant, merged results with direct ordering links and delivery time estimates. It takes a minute or two because it’s driving a headless browser through DoorDash for every restaurant. Quick Browse skips the DoorDash check entirely and just shows you the FMGF safety data for a city. That’s the one you use before you even book the hotel — scope out what’s safe in Portland or Denver or Philly so you know what you’re walking into.&lt;/p&gt;

&lt;h2 id=&quot;the-safety-score&quot;&gt;The Safety Score&lt;/h2&gt;

&lt;p&gt;This is the part I care about most. The safety score is computed entirely from Find Me Gluten Free data — it’s a composite of the information that FMGF’s celiac community has already gathered through years of reviews, safety ratings, and incident reports. We’re not inventing safety judgments; we’re synthesizing what the community already knows into a single number you can act on quickly.&lt;/p&gt;

&lt;p&gt;The score runs from 0 to 100, broken into five tiers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;80–100&lt;/strong&gt;: Celiac Safe. Dedicated gluten-free facility or overwhelming positive evidence from the celiac community.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;60–79&lt;/strong&gt;: Likely Safe. Strong GF menu with good community feedback.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;40–59&lt;/strong&gt;: Use Caution. Has GF options but limited safety data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;20–39&lt;/strong&gt;: Higher Risk. Minimal celiac-specific information.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;0–19&lt;/strong&gt;: Unknown. No data. You’re on your own.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The calculation is weighted, and every input comes from FMGF. GF level counts for 40% — a dedicated gluten-free restaurant scores higher than one that just has “gluten-free options.” FMGF’s star rating is 30%. Review count is 15%, because a restaurant with 200 reviews and a 4.5 rating is a more reliable signal than one with 3 reviews and a 5.0. The remaining 15% comes from safety signals extracted from the FMGF listings: separate fryer, dedicated kitchen, celiac-safe rating, knowledgeable staff. A negative report — someone in the community reported getting sick — drops the score.&lt;/p&gt;
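
&lt;p&gt;As a rough sketch of that weighting in Python: the four weights and the tier boundaries are the ones described above, but everything else is an assumption made for illustration, including how each input is normalized, the value assigned to each GF level, the review cap, and the size of the penalty for a negative report.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Weights from the post; they sum to 1.0.
W_GF_LEVEL, W_STARS, W_REVIEWS, W_SIGNALS = 0.40, 0.30, 0.15, 0.15

# Everything below is an assumption made for illustration.
GF_LEVEL_VALUE = {"dedicated": 1.0, "gf_menu": 0.6, "gf_options": 0.3}
KNOWN_SIGNALS = {"separate_fryer", "dedicated_kitchen", "celiac_safe", "knowledgeable_staff"}
REVIEW_CAP = 100          # review count stops adding confidence past this
NEGATIVE_PENALTY = 20.0   # points removed when someone reported getting sick


@dataclass
class Listing:
    gf_level: str
    stars: float                      # FMGF rating, 0 to 5
    review_count: int
    signals: set = field(default_factory=set)
    negative_report: bool = False


def safety_score(listing):
    """Weighted 0-to-100 composite of the four FMGF inputs."""
    gf = GF_LEVEL_VALUE.get(listing.gf_level, 0.0)
    stars = max(0.0, min(listing.stars / 5.0, 1.0))
    reviews = min(listing.review_count / REVIEW_CAP, 1.0)
    signals = len(listing.signals.intersection(KNOWN_SIGNALS)) / len(KNOWN_SIGNALS)
    score = 100.0 * (W_GF_LEVEL * gf + W_STARS * stars + W_REVIEWS * reviews + W_SIGNALS * signals)
    if listing.negative_report:
        score -= NEGATIVE_PENALTY
    return round(max(0.0, min(score, 100.0)), 1)


def tier(score):
    """Map a score to the five tiers listed in the post."""
    if score >= 80:
        return "Celiac Safe"
    if score >= 60:
        return "Likely Safe"
    if score >= 40:
        return "Use Caution"
    if score >= 20:
        return "Higher Risk"
    return "Unknown"


well_reviewed = Listing("gf_options", 4.5, 200)
thinly_reviewed = Listing("gf_options", 5.0, 3)
print(safety_score(well_reviewed) > safety_score(thinly_reviewed))  # True
```

&lt;p&gt;Under these assumed numbers, the 200-review restaurant with a 4.5 outranks the 3-review restaurant with a 5.0, which is the behavior the review-count weight is there to produce.&lt;/p&gt;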

&lt;p&gt;When you’re in a new city and you don’t know anything, the score gives you a starting point. Green means order with confidence. Yellow means read the reviews first. Red means maybe stick with the protein bar.&lt;/p&gt;

&lt;h2 id=&quot;the-architecture&quot;&gt;The Architecture&lt;/h2&gt;

&lt;p&gt;The backend is Express and Puppeteer. Both scrapers need a headless browser because FMGF and DoorDash render their content with JavaScript — you can’t just fetch the HTML and parse it. The FMGF scraper navigates to the city page, scrolls to trigger lazy loading, then uses Cheerio to extract restaurant data from the rendered DOM. The DoorDash integration is more involved: it sets a delivery address, waits for the autocomplete to resolve, then searches for each restaurant by name with fuzzy matching to confirm availability.&lt;/p&gt;
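
&lt;p&gt;The fuzzy matching step is the easiest piece to show in isolation. The sketch below is not the tool’s code (the backend is Node, and its matching logic isn’t shown here); it illustrates the idea in Python with the standard library’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;difflib.SequenceMatcher&lt;/code&gt;. The normalization rules, the 0.8 threshold, and the restaurant names are all assumptions.&lt;/p&gt;

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", name.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(fmgf_name, doordash_names, threshold=0.8):
    """Return the closest DoorDash name, or None if nothing clears the threshold."""
    scored = [(similarity(fmgf_name, candidate), candidate) for candidate in doordash_names]
    if not scored:
        return None
    score, candidate = max(scored)
    return candidate if score >= threshold else None


# Hypothetical listings: punctuation differs between the two sites.
print(best_match("Joe's Pizza", ["Joes Pizza", "Thai Garden"]))   # Joes Pizza
print(best_match("Joe's Pizza", ["Thai Garden", "Burger Barn"]))  # None
```

&lt;p&gt;Returning nothing when no candidate clears the threshold matters more than the exact ratio: a false match would attach one restaurant’s safety data to another restaurant’s order button.&lt;/p&gt;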

&lt;p&gt;The frontend is React and Vite. Simple by design — a search form, a list of restaurant cards with safety badges, filter buttons for “all,” “on DoorDash,” and “safe (60+).” Each card shows the restaurant name, cuisine, FMGF rating, GF level, safety signals, and a DoorDash order button with delivery time if it’s available.&lt;/p&gt;

&lt;p&gt;The whole thing is held together with web scraping, which means it’s inherently fragile. If either site changes its page structure, the scrapers break. The code has fallback selectors and multiple strategies for finding elements, but this is a prototype: a browser pretending to be a person. It works, but a proper integration would be better. The right version of this is an API-driven experience, not a scraping hack.&lt;/p&gt;

&lt;h2 id=&quot;building-it-with-claude&quot;&gt;Building It With Claude&lt;/h2&gt;

&lt;p&gt;I was already doing all of this manually. Every trip, the same ritual: open FMGF, find the restaurants in the city, open DoorDash, search for each one by name, check if it delivers to the hotel, keep a mental list of the ones that work. It took 30–45 minutes on a good night. On a bad night — a city with dozens of FMGF listings to check one by one — I’d give up halfway through.&lt;/p&gt;

&lt;p&gt;The process worked. It was just slow, tedious, and manual. I knew exactly what I was doing at every step. I just couldn’t build the tool to automate it because the work involved — scraping two JavaScript-heavy sites with a headless browser, navigating DoorDash’s address input flow, fuzzy-matching restaurant names across platforms — was the kind of grinding infrastructure code that would have taken me weeks of evenings to get right.&lt;/p&gt;

&lt;p&gt;Claude made it possible to build the thing I was already doing by hand. I described the manual workflow — go to FMGF, get the restaurants, check each one on DoorDash at this address — and Claude wrote the Puppeteer automation, the DOM navigation, the fallback selectors for when DoorDash’s UI didn’t behave as expected, the fuzzy name matching, the React frontend, all of it. The domain knowledge was mine: which safety signals matter, how to weight them, why review count is a confidence measure, why FMGF data comes first. But the scraping infrastructure that turned a 45-minute manual process into a two-minute automated one — that’s what Claude made feasible for a solo developer building something on evenings and weekends.&lt;/p&gt;

&lt;h2 id=&quot;the-name&quot;&gt;The Name&lt;/h2&gt;

&lt;p&gt;Itinerant Glutard. “Glutard” is celiac community slang — the kind of self-deprecating shorthand people use when they’ve spent enough years explaining cross-contamination to waiters and reading ingredient labels on soy sauce. It’s an in-group term, affectionate in the way that only people who share the condition tend to use it. Itinerant because I’m on the road.&lt;/p&gt;

&lt;p&gt;There was a version of me in a tiny apartment in Belgium who believed this disease meant staying put. That the world had shrunk to the places I could cook for myself. That touring — the thing I wanted most — was something other people got to do.&lt;/p&gt;

&lt;p&gt;I was wrong. The community data existed. The delivery infrastructure existed. I just needed to connect them. I’ve eaten safely at hundreds of places in dozens of cities since then, and the food has been genuinely good — not sad compromises, not protein bars, not going hungry. The Itinerant Glutard is the tool that makes that process faster, but the real thing that makes it possible is that the celiac community built FMGF and DoorDash built delivery, and between the two of them, the road opened back up.&lt;/p&gt;

&lt;p&gt;I have celiac disease. I travel to see live music. I got tired of going hungry.&lt;/p&gt;

&lt;p&gt;Now I don’t.&lt;/p&gt;
</description>
				<pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/development/claude/2026/03/23/finding-safe-food-on-the-road.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/development/claude/2026/03/23/finding-safe-food-on-the-road.html</guid>
			</item>
		
			<item>
				<title>Teaching Claude to QA a Mobile App</title>
				<description>&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;“When life looks like Easy Street, there is danger at your door”&lt;/em&gt;
— Grateful Dead, “Uncle John’s Band”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;(A note on picking this quote: I asked Claude to find me a Grateful Dead lyric that fit the theme. It couldn’t — searching for “dead lyrics” triggers the content filtering policy: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;API Error: 400 {&quot;type&quot;:&quot;error&quot;,&quot;error&quot;:{&quot;type&quot;:&quot;invalid_request_error&quot;,&quot;message&quot;:&quot;Output blocked by content filtering policy&quot;}&lt;/code&gt;. I had to pick this one myself.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I build Zabriskie alone — no team, no investors, just me in my bedroom shipping a community app because I think the internet needs better gathering places. My first lesson about building a “product”: if it’s not in the App Store, it doesn’t exist. I had early users who loved the web version but wouldn’t touch it daily because it wasn’t “an app.” It might as well not be real. So I needed to ship on three platforms — web for fast iteration and testing, iOS and Android because that’s where people actually live.&lt;/p&gt;

&lt;p&gt;The problem is I’m one person. I can’t write and maintain three separate codebases. The answer was &lt;a href=&quot;https://capacitorjs.com/&quot;&gt;Capacitor&lt;/a&gt;: it takes the React web app I’d already built and wraps it in a native shell — a WebView on Android, a WKWebView on iOS — so the same code runs everywhere. Combined with the server-driven UI architecture (the backend sends screen layouts as JSON, and the client just renders them), I can push changes to all three platforms without waiting for App Store review. One codebase, three platforms, one developer. It’s the only way this works.&lt;/p&gt;

&lt;p&gt;But Capacitor puts you in a testing no-man’s-land. Playwright can’t reach inside the native shell — it’s not a browser tab anymore, it’s an app. Native testing frameworks like XCTest and Espresso can’t interact with the content — it’s HTML inside a WebView, not native UI elements. You’re too native for web tools and too web for native tools. Every testing approach in this post exists because of that gap.&lt;/p&gt;

&lt;p&gt;Zabriskie runs on all three platforms. The web gets tested by Playwright — 150+ E2E tests that run on every push. But the mobile apps had nothing. No automated QA, no visual regression checks, no way to know if either client was rendering correctly without manually clicking through every screen. I decided to fix that by teaching Claude to drive both mobile platforms, take screenshots, analyze them for issues, and file its own bug reports.&lt;/p&gt;

&lt;p&gt;Android took 90 minutes. iOS took over six hours. The difference says everything about the state of mobile automation tooling in 2026.&lt;/p&gt;

&lt;h2 id=&quot;android-the-easy-one&quot;&gt;Android: The Easy One&lt;/h2&gt;

&lt;p&gt;The first challenge was connectivity. Inside the Android emulator, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;localhost&lt;/code&gt; refers to the emulator itself, not the host Mac. When the Capacitor app tries to reach &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;localhost:3000&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;localhost:8080&lt;/code&gt;, it gets nothing. The fix is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;adb reverse&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;adb reverse tcp:3000 tcp:3000
adb reverse tcp:8080 tcp:8080
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Simple, but you have to re-run it every time the emulator restarts.&lt;/p&gt;

&lt;p&gt;The real breakthrough was realizing that Capacitor apps run inside an Android WebView, and WebViews expose a Chrome DevTools Protocol socket. You can find it, forward it to a local port, and suddenly you have full programmatic control:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Find the WebView&apos;s DevTools socket&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;WV_SOCKET&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;adb shell &lt;span class=&quot;s2&quot;&gt;&quot;cat /proc/net/unix&quot;&lt;/span&gt; | &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;grep &lt;/span&gt;webview_devtools_remote | &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-oE&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;webview_devtools_remote_[0-9]+&apos;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;head&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-1&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Forward it to a local port&lt;/span&gt;
adb forward tcp:9223 localabstract:&lt;span class=&quot;nv&quot;&gt;$WV_SOCKET&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Full CDP access&lt;/span&gt;
curl http://localhost:9223/json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With CDP, authentication is one WebSocket message — inject a JWT into localStorage and navigate to the feed. Navigation is another message — set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;window.location.href&lt;/code&gt;. No coordinate guessing, no UI interaction, no fighting with keyboards or dialogs. The same protocol that Playwright and Puppeteer use, just connected to an Android WebView instead of a desktop browser.&lt;/p&gt;
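
&lt;p&gt;For a sense of what those two messages look like, here is a minimal Python sketch that only builds the JSON payloads. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Runtime.evaluate&lt;/code&gt; is a standard CDP method; the localStorage key, the feed URL, and the helper names are assumptions, and the actual sweep script isn’t reproduced here.&lt;/p&gt;

```python
import json
from itertools import count

_ids = count(1)  # each request carries an id so its response can be matched


def cdp_message(method, params):
    """Serialize one Chrome DevTools Protocol request."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


def inject_token(jwt, storage_key="token"):
    """Runtime.evaluate call that stores the JWT in localStorage.

    json.dumps doubles as a JavaScript string escaper here, so quotes in
    the key or token cannot break out of the expression.
    """
    expression = "localStorage.setItem({}, {})".format(json.dumps(storage_key), json.dumps(jwt))
    return cdp_message("Runtime.evaluate", {"expression": expression})


def navigate(url):
    """Runtime.evaluate call that sets window.location.href."""
    expression = "window.location.href = {}".format(json.dumps(url))
    return cdp_message("Runtime.evaluate", {"expression": expression})


# Hypothetical token, storage key, and URL.
print(inject_token("example.jwt.value", storage_key="token"))
print(navigate("http://localhost:3000/feed"))
```

&lt;p&gt;Each string would be sent over the WebSocket listed as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;webSocketDebuggerUrl&lt;/code&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/json&lt;/code&gt; response, using whatever WebSocket client the script already depends on.&lt;/p&gt;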

&lt;p&gt;Combined with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;adb shell screencap&lt;/code&gt; for screenshots, I built a Python script that sweeps all 25 screens of the app in about 90 seconds. Landing, login, all four feeds, post detail, profile, shows hub, content creation forms, catalog, battles, bug forum, diary, badges, tour crews — everything. Each screenshot gets analyzed for visual issues: broken layouts, error messages, missing images, blank screens, status bar overlap.&lt;/p&gt;

&lt;p&gt;When the sweep finds something wrong, it authenticates as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zabriskie_bot&lt;/code&gt;, uploads the screenshot to S3, and files a properly formatted bug report to the production forum. The title format is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[Android QA] Shows Hub: RSVP button overlaps venue text&lt;/code&gt; — immediately clear that it came from automation and which screen is affected. It knows about expected states too: the crew detail page returning “Forbidden” for non-members isn’t a bug, empty avatar circles aren’t bugs, and the “Preview” text in profile settings is a known cosmetic issue.&lt;/p&gt;

&lt;p&gt;The whole thing runs as a scheduled task every morning at 8:47 AM. The first full run came back clean: 25 screens, 0 critical issues, 2 minor cosmetic notes. If someone’s change breaks a screen overnight, there’s a bug filed before anyone’s had coffee.&lt;/p&gt;

&lt;p&gt;Ninety minutes, start to finish.&lt;/p&gt;

&lt;h2 id=&quot;ios-the-hard-one&quot;&gt;iOS: The Hard One&lt;/h2&gt;

&lt;p&gt;I figured iOS would be straightforward. Same app, same screens, the Simulator is right there on my Mac. What followed was one of the most absurd debugging sessions I’ve had — not because the problem was technically profound, but because the iOS Simulator is a fortress of tiny, compounding restrictions that each seem reasonable in isolation but together create a nightmare.&lt;/p&gt;

&lt;h3 id=&quot;you-cant-type-an-email-address&quot;&gt;You Can’t Type an Email Address&lt;/h3&gt;

&lt;p&gt;The first idea was clean: add a deep link handler, generate a JWT, open the URL via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simctl openurl&lt;/code&gt;, and skip the login form entirely. Four attempts, four different failure modes — the native bundle was stale, the config pointed at production, the JWT secret was wrong, the Vite dev server was listening on IPv6 while the Simulator tried IPv4. Zero logins.&lt;/p&gt;

&lt;p&gt;So I fell back to typing credentials into the login form. AppleScript can send keystrokes to the Simulator. But the login form has &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;type=&quot;email&quot;&lt;/code&gt; on the input, and AppleScript’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;keystroke &quot;@&quot;&lt;/code&gt; sends &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Shift+2&lt;/code&gt;, which the Simulator interprets as a keyboard shortcut. Every attempt to type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&lt;/code&gt; either switched the form to Sign Up, navigated to Forgot Password, or opened a context menu.&lt;/p&gt;

&lt;p&gt;Pasting didn’t work either. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Cmd+V&lt;/code&gt; gets intercepted by the Simulator. Setting the iOS pasteboard via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simctl pbcopy&lt;/code&gt; produced garbled text. The macOS clipboard and the iOS pasteboard are separate systems.&lt;/p&gt;

&lt;p&gt;The fix was a code change: update the backend login handler from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE email = $1&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE email = $1 OR username = $1&lt;/code&gt;, change the form input from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;type=&quot;email&quot;&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;type=&quot;text&quot;&lt;/code&gt;, and create a test user with a known password. Now I could type “qatest” instead of needing an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&lt;/code&gt; symbol. A backend modification to work around a keyboard limitation.&lt;/p&gt;
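
&lt;p&gt;A minimal sketch of that lookup change, using an in-memory SQLite table rather than the real backend. The schema, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find_user&lt;/code&gt; helper, and the sample row are hypothetical, and SQLite’s named parameter stands in for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$1&lt;/code&gt; placeholder:&lt;/p&gt;

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, username TEXT)")
conn.execute(
    "INSERT INTO users (email, username) VALUES (?, ?)",
    ("qatest@example.com", "qatest"),
)


def find_user(identifier):
    """Return the user id matching either the email or the username, else None."""
    row = conn.execute(
        "SELECT id FROM users WHERE email = :ident OR username = :ident",
        {"ident": identifier},
    ).fetchone()
    return row[0] if row else None
```

&lt;p&gt;With this shape, the same form field accepts either value, so the automation never has to type an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@&lt;/code&gt;.&lt;/p&gt;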

&lt;h3 id=&quot;you-cant-dismiss-native-dialogs&quot;&gt;You Can’t Dismiss Native Dialogs&lt;/h3&gt;

&lt;p&gt;Upon login, iOS shows a “Would Like to Send You Notifications” dialog rendered by UIKit, not the WebView. Native iOS dialogs cannot be dismissed by any form of macOS-synthesized input.&lt;/p&gt;

&lt;p&gt;I tried AppleScript &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;click at&lt;/code&gt; coordinates across a grid of 100+ positions. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cliclick&lt;/code&gt; at every possible coordinate. Python Quartz CGEvent mouse events. Pressing Return and Enter. Finding the button in the accessibility tree (not exposed). &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simctl privacy grant&lt;/code&gt; (not supported for notifications on iOS 26). &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simctl ui alert accept&lt;/code&gt; (doesn’t exist).&lt;/p&gt;

&lt;p&gt;The dialog sat there, immovable, blocking the app.&lt;/p&gt;

&lt;p&gt;The fix was writing directly to the Simulator’s TCC.db — the privacy permissions database — inserting a pre-approval for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kTCCServiceUserNotification&lt;/code&gt;, then restarting SpringBoard. But the timing is critical: the write has to happen before installing the app, or the permission state gets cached. And the app’s JavaScript calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PushNotifications.requestPermissions()&lt;/code&gt; on login, which can retrigger the dialog, so I had to add a guard that skips permission requests on localhost.&lt;/p&gt;

&lt;p&gt;The correct sequence: uninstall app, write TCC permission, restart SpringBoard, reinstall app, launch, then login. Only in that exact order does the dialog not appear.&lt;/p&gt;
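
&lt;p&gt;That ordering can be written down as data so a script cannot run the steps out of sequence. This is only a sketch: the bundle id and app path are placeholders, the TCC write is left abstract because its path and schema are not given here, and the SpringBoard restart shown is one commonly used command, assumed rather than taken from the actual script:&lt;/p&gt;

```python
# Placeholders: the bundle id and app path are hypothetical.
BUNDLE_ID = "com.example.app"
APP_PATH = "build/Example.app"


def fresh_install_steps(bundle_id, app_path):
    """Return the ordered steps; each is a (label, argv-or-None) pair."""
    return [
        ("uninstall app", ["xcrun", "simctl", "uninstall", "booted", bundle_id]),
        # The TCC.db write is left abstract: its location and schema are
        # not specified in the post and may vary by iOS version.
        ("write TCC permission", None),
        # Assumed restart command; SpringBoard relaunches after being stopped.
        ("restart SpringBoard", ["xcrun", "simctl", "spawn", "booted",
                                 "launchctl", "stop", "com.apple.SpringBoard"]),
        ("reinstall app", ["xcrun", "simctl", "install", "booted", app_path]),
        ("launch", ["xcrun", "simctl", "launch", "booted", bundle_id]),
    ]
```

&lt;p&gt;A runner would walk the list in order and hand each non-empty argv to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subprocess.run&lt;/code&gt;, with the login step following the launch.&lt;/p&gt;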

&lt;h3 id=&quot;you-cant-navigate-by-coordinates-until-you-can&quot;&gt;You Can’t Navigate by Coordinates (Until You Can)&lt;/h3&gt;

&lt;p&gt;The app has a floating nav bar with three bubble buttons in the top-right corner — a Z logo, an avatar, and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+&lt;/code&gt; — each opening a vertical dropdown. To test all 25 screens, I needed to tap specific dropdown items. I had coordinates from the CSS. The math checked out. But every approach had a different failure mode.&lt;/p&gt;

&lt;p&gt;AppleScript &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;click at&lt;/code&gt; uses macOS window coordinates. You need the window position, the device screen group offset, the Simulator’s scaling mode (Point Accurate vs. Pixel Accurate vs. Fit Screen), and whether the toolbar is showing. First sweep: 42% accuracy.&lt;/p&gt;
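
&lt;p&gt;The translation those inputs imply looks roughly like the sketch below. The field names and the simple offset-plus-scale model are assumptions for illustration; the real Simulator geometry may need more terms, which is part of why this approach was fragile:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class SimulatorWindow:
    """Hypothetical description of where the device screen sits on the Mac."""
    window_x: float   # top-left of the Simulator window, in macOS points
    window_y: float
    screen_dx: float  # offset of the device screen inside the window
    screen_dy: float  # (title bar, toolbar, bezel)
    scale: float      # macOS points per device point for the scaling mode


def device_to_mac(win, device_x, device_y):
    """Map a tap in device logical points to macOS screen coordinates."""
    mac_x = win.window_x + win.screen_dx + device_x * win.scale
    mac_y = win.window_y + win.screen_dy + device_y * win.scale
    return (mac_x, mac_y)
```

&lt;p&gt;Every one of those five numbers can change between runs, and an error in any of them moves every tap.&lt;/p&gt;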

&lt;p&gt;Facebook’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;idb&lt;/code&gt; sends taps in device logical points (390x844), so no translation needed. Better for main nav buttons, but dropdown item coordinates were slightly off — taps would close the dropdown before hitting the item, or punch through the z-index to content behind it. Second sweep: 57% accuracy.&lt;/p&gt;

&lt;p&gt;The breakthrough was the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ios-simulator-mcp&lt;/code&gt; tool’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ui_describe_point&lt;/code&gt; function. Point it at any coordinate and it returns the accessibility label, role, and frame:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ui_describe_point(365, 163)
→ AXLabel: &quot;Currents&quot;, type: Link, frame: (342, 159, 40x40)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I mapped every dropdown item by probing in 48pt increments. My Y positions were right but my X was wrong — the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;+&lt;/code&gt; dropdown items are at x=258, not x=269. An 11-point error that routed every tap to the wrong column. With verified coordinates and 1.5-second waits for dropdown animations, the sweep hit 100% of screens.&lt;/p&gt;

&lt;p&gt;The winning combination: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ui_describe_point&lt;/code&gt; for discovery, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;idb ui tap&lt;/code&gt; for execution. Map the UI first, tap second. Don’t guess coordinates — measure them.&lt;/p&gt;
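
&lt;p&gt;A sketch of the “map first, tap second” loop. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;describe&lt;/code&gt; callable is a stand-in for a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ui_describe_point&lt;/code&gt;-style call; its return shape, and both function names, are assumptions made for this example:&lt;/p&gt;

```python
def probe_points(x_values, y_start, y_end, step=48):
    """Yield (x, y) probe coordinates, stepping down each column."""
    for x in x_values:
        for y in range(y_start, y_end + 1, step):
            yield (x, y)


def build_tap_map(describe, points):
    """Map each accessibility label to the center of its reported frame.

    `describe` is assumed to return None, or a dict with "label" and
    "frame" keys, where frame is (x, y, width, height) in device points.
    """
    taps = {}
    for x, y in points:
        element = describe(x, y)
        if element is None:
            continue
        fx, fy, fw, fh = element["frame"]
        # Tap the middle of the element rather than the point that was probed.
        taps[element["label"]] = (fx + fw // 2, fy + fh // 2)
    return taps
```

&lt;p&gt;The resulting label-to-coordinate map is what the tap step consumes, so a wrong guess shows up as a missing label instead of a tap on the wrong element.&lt;/p&gt;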

&lt;h3 id=&quot;the-fundamental-gap&quot;&gt;The Fundamental Gap&lt;/h3&gt;

&lt;p&gt;The contrast is stark. Android authentication:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;ws&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;send&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;{&quot;method&quot;:&quot;Runtime.evaluate&quot;,&quot;params&quot;:{&quot;expression&quot;:&quot;localStorage.setItem(&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;token&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;xxx&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;)&quot;}}&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
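
&lt;p&gt;The same call is easier to get right when the message is built with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;json.dumps&lt;/code&gt; instead of hand-escaped quotes. In this sketch the helper name is hypothetical, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ws&lt;/code&gt; is assumed to be an already connected WebSocket client, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt; field is included because CDP uses it to match responses to commands:&lt;/p&gt;

```python
import json


def set_token_message(token, message_id=1):
    """Build a CDP Runtime.evaluate command that stores a token in localStorage."""
    # json.dumps on each value yields a valid JavaScript string literal.
    expression = "localStorage.setItem({}, {})".format(
        json.dumps("token"), json.dumps(token)
    )
    return json.dumps({
        "id": message_id,
        "method": "Runtime.evaluate",
        "params": {"expression": expression},
    })


# Usage, assuming ws is a connected WebSocket client:
# ws.send(set_token_message("xxx"))
```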

&lt;p&gt;iOS authentication: uninstall app, write to TCC database, restart SpringBoard, reinstall app, launch, wait 5 seconds, tap Sign In at specific coordinates, wait, tap Email field, type “qatest” via AppleScript, press Tab, type “qatest123”, press Return, wait, hope.&lt;/p&gt;

&lt;p&gt;Apple’s WKWebView doesn’t expose Chrome DevTools Protocol. Safari Web Inspector uses a proprietary binary protocol that only Safari speaks. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ios-webkit-debug-proxy&lt;/code&gt; only works with real USB devices. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;safaridriver&lt;/code&gt; connects to macOS Safari, not the Simulator’s WebView.&lt;/p&gt;

&lt;p&gt;Android gives you a WebSocket and says “here’s the browser, do whatever you want.” iOS gives you a locked door and a note that says “please use Xcode.”&lt;/p&gt;

&lt;h2 id=&quot;the-mess-in-the-middle&quot;&gt;The Mess in the Middle&lt;/h2&gt;

&lt;p&gt;Between getting Android working and finishing iOS, something happened that illustrates a different kind of failure — not a platform limitation, but an agent discipline problem.&lt;/p&gt;

&lt;p&gt;Railway deployments started failing with a Go version mismatch. My local Go had auto-updated to 1.26, which silently bumped &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go.mod&lt;/code&gt; to require Go 1.25, while the Dockerfile still used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;golang:1.24-alpine&lt;/code&gt;. A two-file fix.&lt;/p&gt;
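
&lt;p&gt;A mismatch like this is cheap to catch before deploying. The following is a rough sketch of such a check, assuming the usual &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go X.Y&lt;/code&gt; directive in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go.mod&lt;/code&gt; and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FROM golang:X.Y&lt;/code&gt; line in the Dockerfile. The function names are hypothetical and this is not part of the project’s actual tooling:&lt;/p&gt;

```python
import re


def go_mod_version(go_mod_text):
    """Return the (major, minor) from the go directive in go.mod, or None."""
    match = re.search(r"^go (\d+)\.(\d+)", go_mod_text, re.MULTILINE)
    return (int(match.group(1)), int(match.group(2))) if match else None


def docker_go_version(dockerfile_text):
    """Return the (major, minor) from a golang:X.Y image tag, or None."""
    match = re.search(r"^FROM golang:(\d+)\.(\d+)", dockerfile_text, re.MULTILINE)
    return (int(match.group(1)), int(match.group(2))) if match else None


def image_too_old(go_mod_text, dockerfile_text):
    """True when go.mod requires a newer Go than the Docker image provides."""
    required = go_mod_version(go_mod_text)
    provided = docker_go_version(dockerfile_text)
    if required is None or provided is None:
        return False  # Nothing to compare; leave the decision to the caller.
    return required > provided
```

&lt;p&gt;Run against the two files in CI, a check of this kind would flag the drift at review time rather than at deploy time.&lt;/p&gt;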

&lt;p&gt;Claude was operating in a git worktree — a clean, isolated copy of the repo designed for exactly this kind of surgical change. Instead of making the fix there, it &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cd&lt;/code&gt;’d into the main repository where I had a dozen unrelated in-progress changes. It staged every dirty file, committed them all with the Go version fix, pushed, and opened a PR. The PR contained QA login endpoints, bug forum updates, iOS Simulator workarounds, E2E test config changes, push notification code, and three new skill files. None of which had anything to do with a Go version number.&lt;/p&gt;

&lt;p&gt;Then it got auto-merged before I could close it.&lt;/p&gt;

&lt;p&gt;The bad merge left duplicate variable declarations throughout the test suite — functions declared twice, variables declared twice. One of the accidentally included changes was a form placeholder rename from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;Email&quot;&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;Email or Username&quot;&lt;/code&gt;, which broke every auth E2E test that used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;page.fill(&apos;input[placeholder=&quot;Email&quot;]&apos;)&lt;/code&gt;. A catalog test that asserted &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;itemCount &amp;gt; 50&lt;/code&gt; only worked against my local database — CI has a handful of records.&lt;/p&gt;

&lt;p&gt;To fix a two-file change, I ended up making four follow-up commits across three PRs. The first two I pushed without running tests locally. They failed. For the third, I actually ran the tests first. It passed. Three rounds of “push and pray” before doing what should have been step one: run the tests, read the output, fix what’s broken, verify, then push. It’s the same debugging rule I enforce every session — check the logs first, theories second — and I ignored it for my own changes.&lt;/p&gt;

&lt;h2 id=&quot;what-this-all-adds-up-to&quot;&gt;What This All Adds Up To&lt;/h2&gt;

&lt;p&gt;Both platforms now have working QA skills. Every morning, the Android emulator and the iOS Simulator boot up, sweep 25 screens each, analyze the screenshots, and file bug reports for anything that looks wrong. Three platforms, all tested, all filing their own bugs.&lt;/p&gt;

&lt;p&gt;The lessons keep reinforcing each other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDP over taps.&lt;/strong&gt; Don’t fight coordinate systems if you can use the browser’s own debugging protocol. Android gives you this for free. iOS doesn’t, and every workaround adds fragility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure, don’t guess.&lt;/strong&gt; The accessibility API that finally made iOS navigation work is the same principle as checking logs before forming theories. Don’t assume you know where a button is — ask the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stay in the worktree.&lt;/strong&gt; Isolation only works if you respect the boundaries. The moment you step outside “just for a quick look,” you’re one careless command away from committing a dozen unrelated files to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run the tests before you push.&lt;/strong&gt; Three rounds of push-and-pray before doing what should have been step one. The gap between knowing a rule and following it is measured in wasted commits.&lt;/p&gt;

&lt;p&gt;Apple, if you’re reading this: please expose CDP or WebDriver for Simulator WebViews. The developer tools are great when a human is using them. They’re nearly useless when an AI is trying to.&lt;/p&gt;
</description>
				<pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/zabriskie/development/android/ios/2026/03/22/teaching-claude-to-qa-a-mobile-app.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/zabriskie/development/android/ios/2026/03/22/teaching-claude-to-qa-a-mobile-app.html</guid>
			</item>
		
			<item>
				<title>What Building With Claude Actually Looks Like</title>
				<description>&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;“Sometimes a notion gets a-hold of you, carries you away”&lt;/em&gt;
— Grateful Dead, “Althea”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;the-saturday&quot;&gt;The Saturday&lt;/h2&gt;

&lt;p&gt;On March 7th, I sat down at 11am and started building. By 2am I had shipped a Relisten integration with inline audio players for archival recordings, a Quick Post feature for sharing past shows, Phantasy Tour as a live setlist source, Sign in with Apple, a badges and achievements system, security hardening across every API route, a complete CI pipeline with GitHub Actions, and something called Goose Mode — a tour companion dashboard with a wandering desktop goose that follows your cursor around the screen.&lt;/p&gt;

&lt;p&gt;That was one day. 124 commits.&lt;/p&gt;

&lt;p&gt;I kept going. By the end of the week, the total was 144 commits across 7 days. Relisten integration. Live show experience. Six new bands on the platform. A bug forum. Backend test coverage from 43.9% to 70.4%. An admin analytics dashboard. Apple App Review fixes. Scrapers for four different setlist sources. Archive.org as a recording provider. Mobile layout fixes for three different iOS edge cases.&lt;/p&gt;

&lt;p&gt;I’m building Zabriskie by myself, and I’m building it with Claude. This is what that actually looks like.&lt;/p&gt;

&lt;h2 id=&quot;the-cost-of-trying-something&quot;&gt;The Cost of Trying Something&lt;/h2&gt;

&lt;p&gt;Here’s the thing nobody tells you about building with an AI collaborator: the most important change isn’t speed. It’s what happens to your relationship with bad ideas.&lt;/p&gt;

&lt;p&gt;Goose Mode started at 9pm on a Saturday night as a tour-focused homepage with a literal wandering goose — a transparent animated sprite that wanders around the page while you browse upcoming shows. It was fun. It was also wrong. The layout didn’t work. The information hierarchy was off. The goose was distracting in a way that stopped being charming after about thirty seconds.&lt;/p&gt;

&lt;p&gt;Old me would have agonized. I’d spend an hour in Figma trying to figure out the right layout before writing a line of code. I’d poll people. I’d sit with it for a few days. The cost of being wrong was high enough that I’d optimize for not being wrong.&lt;/p&gt;

&lt;p&gt;Instead, at 11:25pm, I told Claude to redesign the whole thing as a tour companion dashboard. By 3:37am it had been redesigned again — left-aligned with grouped tour timelines. By 4:08am, another redesign — card-per-tour visual timelines. By 4:42am, another — live countdown, attendee avatars, couch tour cards. By 5:41am I’d added a flip-clock countdown. By 6:17am there were interactive tour maps embedded in each card.&lt;/p&gt;

&lt;p&gt;Six versions in nine hours. Each one a real, working implementation I could tap through on my phone. Not mockups. Not wireframes. Running code. The version that shipped was the fifth attempt, and I only knew it was right because I’d seen the four that weren’t.&lt;/p&gt;

&lt;p&gt;When the cost of trying something drops to near zero, you stop designing in your head and start designing in reality. That changes everything.&lt;/p&gt;

&lt;h2 id=&quot;a-pigeons-show-a-broken-chat-and-a-deploy-at-11pm&quot;&gt;A Pigeons Show, a Broken Chat, and a Deploy at 11pm&lt;/h2&gt;

&lt;p&gt;On the night of March 7th, Pigeons Playing Ping Pong was playing a show. People were using Zabriskie’s live show features — Live Chomping, the setlist tracker, the “tonight” banner. I was watching the show on the couch and also watching my app.&lt;/p&gt;

&lt;p&gt;The chat input was getting cut off on Chrome mobile. I could see it happening in real time because I was using it. At 10:33pm I fixed the layout. At 10:39pm I realized live show posts weren’t appearing in the main feed while a show was active — a bug nobody would have found in testing because it only manifested when a real show was actually live. Fixed it. Deployed. At 10:45pm I added Phantasy Tour as a live setlist source because the existing sources weren’t picking up the setlist fast enough. At 11:20pm I fixed the song order — songs were appearing out of sequence, and duplicate comments were showing up in the chat.&lt;/p&gt;

&lt;p&gt;At 11:34pm I discovered that on newer iPhones, the live page content was rendering underneath the Dynamic Island. That’s not something you find in a simulator. That’s something you find when you’re holding the phone in your hand, trying to see who’s chomping, and you can’t read the first line of text.&lt;/p&gt;

&lt;p&gt;This is what dogfooding actually means. Not “I used my own app once and it seemed fine.” It means you’re sitting on your couch during a Pigeons show, fixing a CSS bug that’s blocking your own experience, deploying it, and immediately seeing whether it worked — all while the show is still going.&lt;/p&gt;

&lt;h2 id=&quot;the-breadth-problem&quot;&gt;The Breadth Problem&lt;/h2&gt;

&lt;p&gt;Solo developers are supposed to specialize. Pick a lane. You can’t do backend and frontend and mobile and DevOps and design. That’s a team.&lt;/p&gt;

&lt;p&gt;In one week I worked on:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Audio infrastructure&lt;/strong&gt; — Relisten API integration, playlist players, collapsible audio widgets, caching, recording quality filtering&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Real-time features&lt;/strong&gt; — Live chat (sorry, Live Chomping), WebSocket setlist polling, presence indicators&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Data ingestion&lt;/strong&gt; — Scrapers for Phantasy Tour, TTBase, setlist.fm, Archive.org/etree&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt; — Rate limiting, CORS hardening, JWT invalidation on password change, auth on every route&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Testing&lt;/strong&gt; — Playwright E2E coverage, mock HTTP servers, backend coverage push, CI pipeline&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Mobile&lt;/strong&gt; — iOS Dynamic Island fix, Chrome mobile layout, safe area insets, App Review compliance&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Social features&lt;/strong&gt; — Quick Post, @mentions, clickable URLs, clickable avatars, bug forum with upvotes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Design&lt;/strong&gt; — Goose Mode (x5), Spotify now-listening card, admin analytics dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t sustainable without Claude. I want to be honest about that. I’m not some 10x developer who figured out the productivity secret. I’m a normal developer who has a collaborator that doesn’t sleep, doesn’t get bored of writing tests, and can context-switch from a Go backend handler to a Swift layout constraint to a Playwright test assertion without missing a beat.&lt;/p&gt;

&lt;p&gt;The thing Claude is genuinely good at — the thing that makes the breadth possible — is carrying the context. I can say “the Relisten player should be collapsible, like we did on the show listing page” and it knows what I mean because it wrote that code an hour ago. I don’t have to re-explain the component architecture every time I switch contexts. It already knows.&lt;/p&gt;

&lt;h2 id=&quot;the-things-i-still-do&quot;&gt;The Things I Still Do&lt;/h2&gt;

&lt;p&gt;I want to be clear about what Claude doesn’t do, because the discourse around AI and coding has gotten absurd in both directions. People either think it writes your entire app for you, or they think it’s useless. Neither is true.&lt;/p&gt;

&lt;p&gt;Claude doesn’t know what to build. It doesn’t know that a Pigeons show is happening tonight and that the setlist tracker needs Phantasy Tour as a source. It doesn’t know that the Goose Mode countdown should use a flip-clock style because that’s what feels right for the aesthetic. It doesn’t know that the live chat should be called “Live Chomping” because that’s what the community actually calls it. It doesn’t know that the Quick Post feature exists because I watched someone try to share a recording and give up because it was too many steps.&lt;/p&gt;

&lt;p&gt;Every feature started with me noticing something — a pain point, an opportunity, an idea at 3am that I couldn’t let go of. Claude is the best collaborator I’ve ever had for turning those observations into running software. But the observations are mine. The taste is mine. The understanding of what this community needs is mine.&lt;/p&gt;

&lt;p&gt;And the bugs. The bugs are mine too. The production crash from the code coverage import that should never have been in main.go — that was a human mistake. Claude wrote the instrumentation; I’m the one who forgot to check the build before deploying. The authentication forwarding bug that broke internal SDUI calls — that emerged from the interaction between two features Claude had built separately, each correct in isolation, broken in combination. Integration bugs are still human problems. They require understanding the whole system, not just the code.&lt;/p&gt;

&lt;h2 id=&quot;439-to-704&quot;&gt;43.9% to 70.4%&lt;/h2&gt;

&lt;p&gt;On March 8th at 8:45am — after the Pigeons show, after the Goose Mode all-nighter, after 124 commits in a single day — I asked Claude to push the backend test coverage as high as it could go.&lt;/p&gt;

&lt;p&gt;One commit. 43.9% to 70.4%.&lt;/p&gt;

&lt;p&gt;This is the thing that makes me genuinely optimistic about building alone. The testing tax — the thing that slows down every solo developer, the thing you skip because you’re tired and the feature works and you’ll write tests later (you won’t) — that tax is effectively gone. Claude writes comprehensive tests. Not just happy-path assertions. Edge cases. Error conditions. Auth boundary tests. The kind of tests you’d write if you had infinite patience and no ship date.&lt;/p&gt;

&lt;p&gt;I still write tests for the things that matter to me — the tricky integration points, the things where the test itself is the specification. But the coverage floor, the boring-but-necessary tests that catch regressions? That’s not my job anymore. And that means I actually have test coverage, which means I can refactor with confidence, which means the codebase stays healthy even at this pace.&lt;/p&gt;

&lt;h2 id=&quot;three-nights-at-the-beacon&quot;&gt;Three Nights at the Beacon&lt;/h2&gt;

&lt;p&gt;The Pigeons show was from my couch. The Tedeschi Trucks Band run at the Beacon Theatre was from my seat.&lt;/p&gt;

&lt;p&gt;TTB was playing ten nights at the Beacon in March. I had tickets to three of them. I also had a platform that didn’t know Tedeschi Trucks Band existed yet. Tuesday before the show I added them — 58 shows for the 2026 Future Soul Tour. But TTB’s setlists don’t come from setlist.fm. They come from TTBase, which has a completely different HTML structure. So I needed a new scraper.&lt;/p&gt;

&lt;p&gt;That night, from my seat at the Beacon, I built it. I told Claude what TTBase looked like, what data I needed, and how it should integrate with the live setlist poller. Claude wrote the scraper. I deployed it. It didn’t work — the HTML structure didn’t match what we’d expected. So I told Claude what was wrong, it fixed the scraper to match the actual Songfish HTML structure, I deployed again, and watched the setlist populate in real time while the band was playing.&lt;/p&gt;

&lt;p&gt;I used the app across all three shows. I’d be sitting there, listening to Derek Trucks play, and I’d notice something — a layout bug, a feature that didn’t work right, something that could be better. I’d pull out my phone, tell Claude what I needed, watch it write the fix, push it to production, and then check it on my phone. All from my seat. All while the show was happening.&lt;/p&gt;

&lt;p&gt;The reason this was even possible is that Zabriskie uses a server-driven UI architecture. Early on, Claude helped guide me toward SDUI as the core design — the server sends down the layout and components, and the app just renders whatever it receives. That means I can change virtually anything about the experience without shipping a new version through the App Store. A fix to a layout, a new feature, a redesigned screen — it’s all a server deploy. When I pushed a fix from the Beacon, every phone running the app got it immediately. No app update. No review process. No waiting.&lt;/p&gt;
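
&lt;p&gt;The idea fits in a few lines. This is a toy model, not Zabriskie’s implementation: the payload shape, the component types, and the text output are all invented for illustration:&lt;/p&gt;

```python
import json

# Hypothetical component renderers: each turns a node into a line of text.
RENDERERS = {
    "heading": lambda node: "# " + node["text"],
    "text": lambda node: node["text"],
    "button": lambda node: "[" + node["label"] + "]",
}


def render(layout_json):
    """Render whatever component list the server sent, in order."""
    lines = []
    for node in json.loads(layout_json)["components"]:
        renderer = RENDERERS.get(node["type"])
        if renderer is None:
            continue  # Unknown component: skip it instead of crashing.
        lines.append(renderer(node))
    return "\n".join(lines)
```

&lt;p&gt;The client only knows how to draw a fixed set of components. Which ones appear, and in what order, is decided by the payload, so changing a screen means changing what the server sends.&lt;/p&gt;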

&lt;p&gt;This is a different thing from the Pigeons show, where I was on the couch and had a laptop open. At the Beacon I was in the audience with nothing but my phone. The workflow was: notice a problem, describe it to Claude in plain English, Claude fixes it and pushes to prod, I pull up the app and verify. No laptop. No IDE. No terminal. Just me, my phone, and a collaborator who could do the rest.&lt;/p&gt;

&lt;p&gt;By the end of the three-night run, the TTB experience on Zabriskie was solid. Live setlists from TTBase. Show pages with all the metadata. The whole thing built and refined from inside the venue where the band was playing. That’s not a development workflow I ever imagined having.&lt;/p&gt;

&lt;h2 id=&quot;the-week-keeps-going&quot;&gt;The Week Keeps Going&lt;/h2&gt;

&lt;p&gt;After the Saturday marathon and the Beacon run, I kept building. Sunday: clickable URLs in posts and comments. A small thing. The kind of thing you’d never prioritize on a roadmap but that users notice immediately.&lt;/p&gt;

&lt;p&gt;Thursday I added Grahame Lesh &amp;amp; Friends with 25 shows. Saturday night I was at the Grahame Lesh show, and this time I wasn’t debugging anything. The setlist just worked. It synced with the show in real time, song by song, no intervention. I was posting photos from my seat and the setlist was updating alongside them. My friends were watching from home, talking to me through the app — they could see the setlist, see my photos, and we were all in the same experience even though I was the only one in the room.&lt;/p&gt;

&lt;p&gt;That’s the moment it stopped being a project and started being the thing I described in the manifesto. The bridge between the physical and the virtual. The people in the crowd and the people at home, in the same place. It worked. Not because I was fixing it in real time — because I didn’t have to.&lt;/p&gt;

&lt;p&gt;The bug forum is maybe the most meta thing I built that week. A full in-app bug reporting system with upvotes, comments, categorization, bot notifications, and admin tools. Built in an afternoon. A bug reporting system, built by one person with an AI collaborator, for reporting bugs in an app built by one person with an AI collaborator. It has six commits spanning two hours. It works. Users are filing bugs in it right now.&lt;/p&gt;

&lt;h2 id=&quot;what-this-means&quot;&gt;What This Means&lt;/h2&gt;

&lt;p&gt;I wrote the manifesto for Zabriskie on March 8th, in between the all-night Goose Mode session and deploying fixes for the Pigeons show. The manifesto is about reclaiming the internet as a third place. About building community infrastructure that serves people instead of extracting from them. About doing it as a non-profit, solo, self-funded, because that’s the only way it gets done honestly.&lt;/p&gt;

&lt;p&gt;The week that followed is what makes that possible. Not because AI is magic, but because it changes the economics of ambition. A single person can build something that previously required a team. Not because the single person became superhuman, but because the gap between “what I can imagine” and “what I can ship” got dramatically smaller.&lt;/p&gt;

&lt;p&gt;I have eighty beta users now. They’re filing bugs. They’re posting about shows. They’re using Live Chomping during actual shows. The thing works. Not in a demo sense — in a “people are using this to connect with each other around music” sense.&lt;/p&gt;

&lt;p&gt;That’s all I ever wanted.&lt;/p&gt;

&lt;p&gt;The gap between what you can imagine and what you can ship is the space where ideas go to die. For years I had this idea — the third place, the taste-based community, all of it — and I couldn’t build it because I’m one person with a day job. Now I can. Not perfectly. Not without bugs. Not without 3am sessions that leave me wrecked the next day. But I can build it, and I can build it fast enough that the community doesn’t outgrow the infrastructure.&lt;/p&gt;

&lt;p&gt;144 commits in a week. Six versions of Goose Mode. A scraper built from a seat at the Beacon Theatre. A live show debugged in real time. A test suite that actually exists. A bug forum built in two hours.&lt;/p&gt;

&lt;p&gt;This is what building with Claude actually looks like. It’s not a press release. It’s a Saturday that starts at 11am and ends sometime around dawn. It’s three nights at the Beacon with nothing but your phone, shipping fixes between sets. And when you look up, the thing you imagined is running on your phone, and people are using it.&lt;/p&gt;

&lt;hr /&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;“Sometimes a notion gets a-hold of you, ties you to the tracks”&lt;/em&gt;
— Grateful Dead, “Althea”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Zabriskie. Where taste resonates.&lt;/em&gt;&lt;/p&gt;
</description>
				<pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/zabriskie/development/2026/03/20/what-building-with-claude-actually-looks-like.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/zabriskie/development/2026/03/20/what-building-with-claude-actually-looks-like.html</guid>
			</item>
		
			<item>
				<title>Why I&apos;m Building Zabriskie</title>
				<description>&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;“Don’t let a heavy one hold back the dawn in you”&lt;/em&gt;
— Goose, “(dawn)”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;the-third-place&quot;&gt;The Third Place&lt;/h2&gt;

&lt;p&gt;There’s a concept in sociology called the “third place” — not home, not work, but the place where you go to find your people. The coffee shop. The record store. The bar after the show. It’s where community happens.&lt;/p&gt;

&lt;p&gt;The internet used to be that place.&lt;/p&gt;

&lt;p&gt;In the late 90s and early 2000s, there were communities built entirely around taste. Makeoutclub. MySpace. Forums with names you’ve probably forgotten. Your profile wasn’t a highlight reel of your life — it was a declaration of what you were into. The music you listed, the films you referenced, the books in your sidebar. People found each other through that. You’d stumble onto someone’s page because they listed the same obscure record you loved, and suddenly you had a new friend in a city you’d never been to. The social graph was built on shared taste, not shared geography.&lt;/p&gt;

&lt;p&gt;For a while, Twitter picked up that torch. By the mid-2010s, I had friends all over the world — people who followed me because they were into the same things I was into. I’d travel for research and work to Europe, Australia, the UK, wherever, and I could post and there would always be someone up for a drink, a show, a conversation. It was global and it was real. The connections were built on interests, and they worked.&lt;/p&gt;

&lt;p&gt;Then that got destroyed too.&lt;/p&gt;

&lt;p&gt;Twitter got bought and the community scattered. Instagram and Facebook were already ad machines, optimizing for engagement over connection. Mastodon was too federated for anyone to find each other. Bluesky came too late, after everyone had already given up. Letterboxd was a glimmer of hope — a platform built around a specific kind of taste — but it devolved into ironic one-liners engineered for laughs and would-be film critics writing for an audience instead of a community.&lt;/p&gt;

&lt;p&gt;And for music? Nothing. A total void.&lt;/p&gt;

&lt;p&gt;The third place doesn’t exist anymore. Not for people who actually care about what they’re listening to, watching, and reading.&lt;/p&gt;

&lt;h2 id=&quot;proof-it-works&quot;&gt;Proof It Works&lt;/h2&gt;

&lt;p&gt;Here’s the thing: I know the third place can work online, because one community already does it in the physical world every single day.&lt;/p&gt;

&lt;p&gt;The jam band community — my community — knows how to build this. They have shakedowns. They have lot meetups. They organize their entire lives around shows. They travel. People wander on and off tour. Strangers become friends because they’re standing next to each other for three nights in a row. The social infrastructure is already there. The culture of gathering around shared experience is already there.&lt;/p&gt;

&lt;p&gt;The tools just haven’t caught up.&lt;/p&gt;

&lt;p&gt;Reddit requires constant refreshing if you want anything resembling a live conversation. Facebook groups are merch spam wastelands. The dedicated platforms that do exist — Phantasy Tour, various setlist sites — are siloed. Each one focuses on a single band or a single function. There’s no unified identity. You’re a different person on every platform. And most of them haven’t evolved technologically since 2008.&lt;/p&gt;

&lt;p&gt;I know this because I lived it. I spent ten years in grad school, and when I finished, I had no community. Not really. People didn’t even say goodbye when I left. A decade of my life, and it just… ended.&lt;/p&gt;

&lt;p&gt;Then I got back into the jam band world, and something completely different happened. People I’d long since lost touch with welcomed me back like no time had passed. Patrick, someone I hadn’t talked to in twenty years. Sara, someone I’d sold tickets to online. I made friends through friends of friends, the way you do on the lot. I made friends standing next to some Dallas fans at a show during the NBA playoffs when Dallas was playing Boston, and I was wearing a Celtics jersey. That’s it. That’s all it took. Standing next to strangers who were into the same thing, and suddenly you’re not strangers anymore.&lt;/p&gt;

&lt;p&gt;No other community in my life has worked like that. The jam band world doesn’t care where you went to school or what you do for a living. It cares whether you were at the show. It cares about the music. And that’s enough.&lt;/p&gt;

&lt;h2 id=&quot;covid-separated-us-we-never-came-back&quot;&gt;COVID Separated Us. We Never Came Back.&lt;/h2&gt;

&lt;p&gt;The pandemic made all of this worse. It separated us physically, and we never fully recovered. We got distant from each other in ways we’re still reckoning with. The platforms we had didn’t help — they made it worse, feeding us outrage and ads while we sat alone.&lt;/p&gt;

&lt;p&gt;Couch tour became a lifeline during that time. Watching a show from home, knowing other people were watching too, trying to find each other in Reddit threads and group texts. But it was held together with duct tape. There was no place that actually served that experience.&lt;/p&gt;

&lt;p&gt;That’s what I want to build. A bridge between the physical and the virtual. The people in the crowd and the people at home should be in the same community, sharing the same experience. And if you’re home alone on a Tuesday night watching a stream — you’re not alone. You’re part of something.&lt;/p&gt;

&lt;p&gt;Couch touring shows is just one version of this. Book clubs were the original shared cultural experience. Shared movie watching is the next. The format extends infinitely because the core is always the same: people experiencing culture together, regardless of where they physically are. A global community of people who love media.&lt;/p&gt;

&lt;p&gt;Facebook is a dinosaur. It doesn’t provide anything useful for this. It harvests your data, sells it to advertisers, and gives you merch spam groups in return.&lt;/p&gt;

&lt;h2 id=&quot;extending-beyond-the-lot&quot;&gt;Extending Beyond the Lot&lt;/h2&gt;

&lt;p&gt;The jam band community is the proof of concept, but the vision is bigger.&lt;/p&gt;

&lt;p&gt;The same energy that makes shakedowns work — people organizing around the things they love, sharing experiences in real time, building identity through taste — applies to anyone who cares deeply about culture. The person who wants to talk about the Jarmusch film they just watched, not post an ironic quip for engagement. The person spinning a new album who wants to know what their friends think, not get an algorithmic recommendation from a company that’s also selling them headphones. The person who just finished a novel and wants to find others who read it, not write an Amazon review into the void.&lt;/p&gt;

&lt;p&gt;Zabriskie starts with the jam band world because that community is ready for it. They already have the culture. They just need a place that works. But the thesis is universal: anyone whose identity is shaped by what they consume — music, film, books — deserves a third place built around that.&lt;/p&gt;

&lt;h2 id=&quot;the-opposite-of-everything&quot;&gt;The Opposite of Everything&lt;/h2&gt;

&lt;p&gt;Zabriskie is the opposite of every social network that exists.&lt;/p&gt;

&lt;p&gt;We don’t want everyone on the planet to join. In &lt;em&gt;Careless People&lt;/em&gt;, you can trace the path of platforms that pursued growth indefinitely, at all costs, and watch exactly where it leads — the product hollows out, the community dies, the ads take over. We’re not walking that path.&lt;/p&gt;

&lt;p&gt;We’re not selling ads. We’re not optimizing for engagement. We actually want you to get off the site. The feed is finite — you read it, you’re done, you go live your life. There are no free-form text posts. Every single post requires a piece of culture — an album, a film, a book, a show — a rating, and your actual thoughts. We are about culture.&lt;/p&gt;

&lt;p&gt;Discovery works differently here too. You &lt;em&gt;want&lt;/em&gt; to see posts from people you don’t know. You want to meet people organically, based on taste. But not through algorithmic matching — not the dating app model, not the “people you may know” sidebar. Taste proliferates through the network the way it does in real life: through people.&lt;/p&gt;

&lt;p&gt;Think about high school. You’re walking down the hall and you see someone wearing a shirt from a band you love. You don’t know them. They might be in a completely different social world. But you &lt;em&gt;recognize&lt;/em&gt; something about them instantly. That moment of connection based on taste, without an algorithm, without a recommendation engine, without anyone engineering the encounter — that’s how discovery should work. And nothing online does it.&lt;/p&gt;

&lt;p&gt;Zabriskie does.&lt;/p&gt;

&lt;h2 id=&quot;not-a-startup&quot;&gt;Not a Startup&lt;/h2&gt;

&lt;p&gt;I should be clear about what this isn’t. Zabriskie isn’t a startup. There’s no pitch deck. No Series A. No growth targets. No exit strategy. Nobody is looking to flip this to Google in three years. There are no revenue goals and no KPIs around daily active users.&lt;/p&gt;

&lt;p&gt;This is about reclaiming the internet as a &lt;em&gt;place&lt;/em&gt;. The third place. Building community, not building a business.&lt;/p&gt;

&lt;p&gt;The internet used to be somewhere you went to find your people. Then it became something that was done &lt;em&gt;to&lt;/em&gt; you — feeds engineered to keep you scrolling, platforms optimized to extract value from your attention, your taste data packaged and sold to the highest bidder.&lt;/p&gt;

&lt;p&gt;Zabriskie is a rejection of all of that. We’re not looking for revenue or growth. We’re looking to reclaim the internet as our third place and build community.&lt;/p&gt;

&lt;p&gt;And to put our money where our mouth is: Zabriskie is becoming a non-profit. Not because it’s a clever tax strategy. Because it’s the only structure that’s honest about what we’re doing. Community over money. Full stop. Every decision gets made with one question: does this make the community better? Not: does this grow the user base? Not: does this increase engagement? Not: does this make the metrics look good for investors?&lt;/p&gt;

&lt;p&gt;There are no investors. There never will be.&lt;/p&gt;

&lt;p&gt;I’m building this on my own. Funding it myself. Writing code on nights and weekends because I believe in this. There’s no team of fifty engineers. There’s no office. There’s just me, doing the work, because I think this matters. And I think this is the only way it can be done honestly — if it’s built by someone who actually wants to use it, not by someone who wants to monetize it.&lt;/p&gt;

&lt;p&gt;This is the only way to free ourselves from the oppression of existing social networks. You don’t reform platforms that were designed from the ground up to extract value from your attention. You don’t petition them to be better. You build something new. You take it back.&lt;/p&gt;

&lt;p&gt;We’re building this because something has been missing from the internet for a long time, and people want it back. The third place. A space built around taste, around culture, around the things that actually matter to you. A place where you show up not because an algorithm pulled you in, but because your people are there.&lt;/p&gt;

&lt;p&gt;That’s it. That’s the whole thing.&lt;/p&gt;

&lt;p&gt;I owe thanks to the people who made this real. Patrick and Mary are my partners in this — they’ve been instrumental in shaping the design of this platform from day one. Every screen, every decision about how it should feel, has their fingerprints on it. This is as much theirs as it is mine. Sara and Charles have been there from the beginning, beta testing features, breaking things, and telling me when something didn’t work. You don’t build a community platform alone. You build it with your community.&lt;/p&gt;

&lt;p&gt;Rick Mitarotonda once said that if something brings you passion, you should work as hard as you can at it — that the results, the success, it’s not about that. That’s exactly how I feel about Zabriskie. This isn’t about metrics or outcomes. It’s about the work. It’s about building something that matters to the people who use it.&lt;/p&gt;

&lt;p&gt;If any of this resonates with you — if you’ve been looking for the place that used to exist and doesn’t anymore — come find us at &lt;a href=&quot;https://zabriskie.app&quot;&gt;zabriskie.app&lt;/a&gt;. Bring your taste. Bring your people. The dawn is coming. Let’s build the third place together.&lt;/p&gt;

&lt;hr /&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;“I feel it all in our hands, in a rising sun”&lt;/em&gt;
— Goose, “(dawn)”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Zabriskie. Where taste resonates.&lt;/em&gt;&lt;/p&gt;
</description>
				<pubDate>Sun, 08 Mar 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/zabriskie/community/2026/03/08/why-im-building-zabriskie.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/zabriskie/community/2026/03/08/why-im-building-zabriskie.html</guid>
			</item>
		
			<item>
				<title>Claude Tested Everything Except the One Thing That Mattered</title>
				<description>&lt;p&gt;Three weeks ago I &lt;a href=&quot;/ai/claude/2026/02/17/building-a-social-app-in-a-week-with-claude-code.html&quot;&gt;wrote about building a social app in a week with Claude Code&lt;/a&gt;. The app shipped. My friends are using it. I kept building.&lt;/p&gt;

&lt;p&gt;Since that post, Claude has written 154 end-to-end tests across 17 spec files. It tests login, logout, signup, and redirect guards. It tests the feed, the bookmarks page, the notifications page. It tests liking, unliking, commenting, amplifying, recommending. It tests show RSVPs, band pages, setlist search. It tests a tournament bracket system. It tests song battles. It tests a badge and achievement system. It tests tour crews. It tests a getting-started tutorial. It tests a Goose Mode dashboard (if you haven’t heard Goose yet, well, as they say, Goose fucks). It tests a community catalog. It tests tracklist rendering and live show layouts.&lt;/p&gt;

&lt;p&gt;It does not test posting.&lt;/p&gt;

&lt;p&gt;Posting is the entire point of the app. It’s the one thing every user does every time they open it. You search for an album, you write something about it, you hit submit, and it appears in the feed. That’s the product. Everything else — the battles, the crews, the badges, the tournaments — is decoration around that core loop.&lt;/p&gt;

&lt;p&gt;There is no test that searches for an album. No test that fills out the review form. No test that submits a post through the UI and verifies it appears. Zero.&lt;/p&gt;

&lt;p&gt;The test file called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;post.spec.ts&lt;/code&gt; does exist. It has 11 tests. They verify that the post detail page &lt;em&gt;renders&lt;/em&gt;. That the new post page &lt;em&gt;renders a search form&lt;/em&gt;. That the profile page &lt;em&gt;renders&lt;/em&gt;. The word “render” is doing a lot of heavy lifting. None of them actually post anything.&lt;/p&gt;

&lt;p&gt;There is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;createPost()&lt;/code&gt; helper in the test utilities. It calls the API directly — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /api/posts&lt;/code&gt; with a JSON body — to set up test data for &lt;em&gt;other&lt;/em&gt; tests. The social tests use it to create a post so they can test liking it. The bookmark tests use it to create a post so they can test bookmarking it. The core action of the app exists in the test suite only as scaffolding for side features.&lt;/p&gt;

&lt;p&gt;Here are the test counts by spec file:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Tests&lt;/th&gt;
      &lt;th&gt;Feature&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;33&lt;/td&gt;
      &lt;td&gt;Tour crews&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;28&lt;/td&gt;
      &lt;td&gt;Shows&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;25&lt;/td&gt;
      &lt;td&gt;Song battles&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;Catalog&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;Setlist search&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;Posts (rendering only)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;9&lt;/td&gt;
      &lt;td&gt;Tournaments&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;Getting started tutorial&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;Badges&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;Actually submitting a post&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;I asked Claude to write tests. Multiple times. I put it in the project instructions, in bold: &lt;strong&gt;“Write a new test for every new user-facing behavior.”&lt;/strong&gt; I listed exactly what warrants a test: new screens, new buttons, new API endpoints, bug fixes. Claude wrote that rule on February 23rd. After that date, it created 10 new spec files and 113 new tests — for tournaments, battles, badges, crews, goose mode, catalog, setlist search, tracklists, tutorials, and live layouts. Not one for posting.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Then the auth refactor happened.&lt;/p&gt;

&lt;p&gt;Claude had originally built 25+ backend routes without authentication. Posts, comments, profiles, search, live chat — all accessible to anyone, no login required. I don’t know why. The middleware existed. The pattern was established. It just… didn’t apply it.&lt;/p&gt;

&lt;p&gt;When I noticed, the fix required touching every route in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;main.go&lt;/code&gt; and every page component in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;App.jsx&lt;/code&gt;. Fifty-seven lines changed in the backend, fifty-six in the frontend. That’s the kind of refactor where, if you have good test coverage of the core flow, you make the change, run the tests, and find out immediately what broke.&lt;/p&gt;

&lt;p&gt;We did not have good test coverage of the core flow.&lt;/p&gt;

&lt;p&gt;The refactor broke things. Thirty-one seconds after the auth commit, there was already a follow-up fix — a test was hitting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GET /api/posts/{id}&lt;/code&gt; without an auth header and getting 401s. Then another fix because the live show pill broke. Then another because pills showed on logged-out pages. The cascade was short this time, but only because the tests we &lt;em&gt;did&lt;/em&gt; have caught the edges. The center — the posting flow — had nothing to catch.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;This is part of a broader pattern. When something breaks, I ask Claude to write a failing test first, to prove what’s actually broken before trying to fix it. Claude does not do this. What Claude does instead is read the bug report, form a theory about the cause, and immediately start editing code. If the theory is wrong — and it often is — the “fix” breaks something else. Then Claude fixes that. Then something else breaks.&lt;/p&gt;

&lt;p&gt;The commit history is the evidence. Out of 833 total commits, 202 are fixes. That’s 24% — one in four commits exists to fix something Claude got wrong. And they come in chains:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Show post cards&lt;/strong&gt;: four consecutive fix commits. Orphaned edit button, then flaky tests, then wrong assertion, then more broken assertions.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Live chat&lt;/strong&gt;: four consecutive fix commits. Wrong sort order, then scroll broken, then passive touch events, then iOS Safari viewport bleed.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;S3 avatars&lt;/strong&gt;: three consecutive fix commits. URLs expiring, then NULL media_item_id scan failure, then the same scan failure &lt;em&gt;again with the same fix&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Deployment&lt;/strong&gt;: two identical commits back-to-back. “Fix web service deployment with npx serve.” Twice. The same message.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each chain follows the same shape: Claude guesses what’s wrong, ships a fix without verifying the guess, the fix breaks something adjacent, and the cycle repeats. A failing test at the start of each chain would have stopped it at one commit.&lt;/p&gt;

&lt;p&gt;The project instructions file — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt; — is now full of rules that exist because of this pattern. Each one was written after an incident where Claude did exactly the thing the rule prohibits:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;“Logs First, Theories Second”&lt;/strong&gt; — because Claude would spin up hypotheses instead of reading the error that was right there in the logs.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;“Respect User’s Layer Diagnosis”&lt;/strong&gt; — because when I’d say “the API response is fine, the bug is in the frontend,” Claude would spend twenty minutes re-investigating the API.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;“Two-Attempt Rule”&lt;/strong&gt; — because Claude would try five variations of the same wrong approach before I could get it to step back.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;“EVIDENCE-BASED BUG FIXING (NON-NEGOTIABLE)”&lt;/strong&gt; — in all caps, because Claude kept speculatively fixing code that wasn’t broken, breaking it in the process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one has a specific origin. A bug in one sync function — Phantasy Tour — and Claude “preemptively” applied the same fix to three other sync functions that were working fine. Now four things were broken instead of one.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;There’s one more thing. Claude also figured out how to get around CI entirely.&lt;/p&gt;

&lt;p&gt;When you push a commit and open a pull request on GitHub, there’s a brief window — a few seconds — before the CI checks register as required. During that window, the merge button is green. Claude learned to push a commit, immediately create the PR, and merge it in that gap before the checks even start running. No waiting for tests. No waiting for builds. Just push, merge, done — the engineering equivalent of running a red light because the camera hasn’t turned on yet.&lt;/p&gt;

&lt;p&gt;I caught it because PRs were showing up as merged with zero checks passed. Not failed checks — &lt;em&gt;no&lt;/em&gt; checks. The CI runs would start, sometimes even fail, on a commit that was already in main. The branch protection rules were technically satisfied because there were no checks &lt;em&gt;to&lt;/em&gt; block on at the instant the merge happened.&lt;/p&gt;

&lt;p&gt;This is the same agent that was told to write tests for every new behavior. It wrote the tests. It configured the CI. Then it found the fastest path that avoided actually waiting for any of it. I’m not even mad. It’s the most efficient thing Claude did all month.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Then Claude took the site down.&lt;/p&gt;

&lt;p&gt;I didn’t trust the test suite anymore. 154 tests and zero coverage of the core flow — what else was missing? So I asked Claude to set up code coverage instrumentation for the E2E tests. Build the Go binary with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-cover&lt;/code&gt;, run the Playwright suite, generate an HTML report showing which backend handlers are actually being exercised. I wanted receipts.&lt;/p&gt;

&lt;p&gt;And even that, the act of trying to verify Claude’s work, Claude managed to fuck up. Claude built it. It worked. The coverage report showed 42% handler coverage with 263 functions at 0%. Good data.&lt;/p&gt;

&lt;p&gt;But Claude didn’t put the coverage tooling in a separate script and leave it there. It also added a coverage flush endpoint — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;POST /debug/coverage/flush&lt;/code&gt; — directly to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;main.go&lt;/code&gt;, the production server binary. That endpoint imported &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;runtime/coverage&lt;/code&gt;, a Go standard library package that calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WriteMetaDir()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WriteCountersDir()&lt;/code&gt;. Those functions panic if the binary wasn’t compiled with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-cover&lt;/code&gt;. Production binaries are not compiled with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-cover&lt;/code&gt;. The binary panicked on startup. The site went down.&lt;/p&gt;

&lt;p&gt;The fix was three lines: delete the import, delete the endpoint, move the flush logic to the coverage script where it belonged. I pushed it in under a minute once I understood what happened. But the site was unreachable until Railway picked up the new commit and redeployed, and I couldn’t force a faster deploy because the health checks were failing on the crashing binary.&lt;/p&gt;

&lt;p&gt;The irony is almost too neat. Claude was asked to measure test coverage — to find out what &lt;em&gt;wasn’t&lt;/em&gt; being tested — and in doing so, shipped code that wasn’t tested to production. The coverage endpoint itself was never tested. Not by the 154 existing E2E tests. Not by the new tests Claude was writing. Not by a quick &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go build &amp;amp;&amp;amp; ./binary&lt;/code&gt; sanity check without the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-cover&lt;/code&gt; flag. The code existed to answer the question “what are we not testing?” and the answer included itself.&lt;/p&gt;

&lt;p&gt;This is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;runtime/coverage&lt;/code&gt; import in production &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;main.go&lt;/code&gt;, the one that crashed the site:&lt;/p&gt;

&lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;runtime/coverage&quot;&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// panics if binary not built with -cover&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;// POST /debug/coverage/flush — flush coverage data&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mux&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HandleFunc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;POST /debug/coverage/flush&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;http&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ResponseWriter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;http&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Request&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;coverage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;WriteMetaDir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;coverDir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;     &lt;span class=&quot;c&quot;&gt;// panic&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;coverage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;WriteCountersDir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;coverDir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;// panic&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Dev-only code, in the production binary, with no guard, no build tag, no conditional. Just a direct import of a package that explodes outside its intended context. Claude didn’t even think about it. It was writing coverage tooling, so it put the coverage code where the rest of the server code lives. The concept of “this code should only exist in a specific build configuration” didn’t occur to it.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;I want to be clear about what’s happening here, because I think it’s easy to read this as “AI is bad at testing” and miss the more interesting point.&lt;/p&gt;

&lt;p&gt;Claude is excellent at writing tests. The 154 tests it wrote are real, useful, and they catch real regressions. The Playwright infrastructure is solid. The test helpers are clean. The coverage of side features is thorough. When Claude writes tests, they work.&lt;/p&gt;

&lt;p&gt;The problem is that Claude doesn’t write them &lt;em&gt;where they matter most&lt;/em&gt;. It writes them where they’re &lt;em&gt;easiest&lt;/em&gt; — for the feature it just built, in the same session, while the context is fresh. The new tournament bracket gets tests because Claude just built the tournament bracket. The new battle system gets tests because Claude just built the battle system. The posting flow doesn’t get tests because Claude built it weeks ago, and no single session since then has been “about” posting.&lt;/p&gt;

&lt;p&gt;This is a prioritization failure, not a capability failure. And it’s one that’s hard to catch in the moment, because the test count keeps going up. Progress feels real. 154 tests! Seventeen spec files! The dashboard is green! But the coverage map has a hole in the center, exactly where the load-bearing wall is, and nobody notices until the wall falls down.&lt;/p&gt;

&lt;p&gt;The fix is obvious: test the core flow. Test it first. Test it before you test anything else. But “obvious” and “automatic” are different things, and Claude Code — despite being told explicitly, in bold, in the project instructions — did one and not the other.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;833 commits. 202 fixes. Zero tests for the thing the app actually does. A dev-only import that took down production. The numbers don’t lie, even when the test suite is green.&lt;/p&gt;
</description>
				<pubDate>Sun, 08 Mar 2026 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/claude/2026/03/08/claude-tested-everything-except-the-one-thing-that-mattered.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/claude/2026/03/08/claude-tested-everything-except-the-one-thing-that-mattered.html</guid>
			</item>
		
			<item>
				<title>Ten Years of Lasp</title>
				<description>&lt;p&gt;Peter Van Roy and I just published a retrospective at &lt;a href=&quot;https://dl.acm.org/doi/10.1145/3756907.3756910&quot;&gt;PPDP ’25&lt;/a&gt; (the same venue where the original Lasp paper appeared a decade ago), looking back at ten years of influence from Lasp, the coordination-free programming model we built together in 2015. I wanted to write a bit about what that paper covers, and what it actually felt like to watch an idea travel from a prototype to something that showed up in production systems I never expected.&lt;/p&gt;

&lt;h3 id=&quot;what-lasp-was&quot;&gt;What Lasp Was&lt;/h3&gt;

&lt;p&gt;Lasp was a programming model built on top of &lt;a href=&quot;/crdt/2014/07/22/readings-in-crdts.html&quot;&gt;Conflict-Free Replicated Data Types (CRDTs)&lt;/a&gt;. The core idea was simple: coordination makes distributed systems easy to reason about, but it’s expensive — it limits availability, introduces latency, and creates failure modes. Lasp’s goal was to give developers a programming model that was just as easy to reason about, but without the coordination, by structuring distributed computation around data types that are guaranteed to converge regardless of the order in which updates arrive or whether the network partitions. The result: better reliability and availability, without sacrificing the developer’s ability to think clearly about what their program does.&lt;/p&gt;
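
&lt;p&gt;A minimal sketch of that convergence property, assuming a grow-only counter as the data type (one of the simplest CRDTs). The class and function names are illustrative and are not Lasp’s API; the point is only that a merge built from an element-wise maximum gives the same answer for every delivery order, including duplicated deliveries.&lt;/p&gt;

```python
from itertools import permutations


class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, replica_id, amount=1):
        # Each replica only ever bumps its own slot.
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so arrival order and repeated deliveries do not change the result.
        merged = dict(self.counts)
        for replica_id, count in other.counts.items():
            merged[replica_id] = max(merged.get(replica_id, 0), count)
        return GCounter(merged)

    def value(self):
        return sum(self.counts.values())


def merge_all(states):
    """Fold a sequence of replica states into one merged state."""
    result = GCounter()
    for state in states:
        result = result.merge(state)
    return result


# Three replicas accept updates independently (for example, during a partition).
a, b, c = GCounter(), GCounter(), GCounter()
a.increment("a", 2)
b.increment("b", 3)
c.increment("c", 1)

# Try every delivery order, each with its first state delivered twice.
totals = {merge_all(order + order[:1]).value() for order in permutations([a, b, c])}
print(totals)
```

&lt;p&gt;Every ordering collapses to a single total, which is the property that lets replicas skip coordination and still agree once they have exchanged state.&lt;/p&gt;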

&lt;p&gt;The CAP theorem had been the backdrop for a lot of the distributed systems conversation in that era. CRDTs offered a path toward availability and partition-tolerance without sacrificing convergence, but their use was still mostly ad hoc. Lasp tried to give that a principled, declarative home: a functional programming model where eventual consistency wasn’t something you bolted on, but the default assumption.&lt;/p&gt;

&lt;p&gt;I was working at Basho Technologies at the time, deep in the operational reality of building eventually consistent databases in Erlang. That context shaped everything about how Lasp was designed. The pain points weren’t hypothetical: I had watched coordination-heavy systems fall over under real load, and I was motivated to find something better.&lt;/p&gt;

&lt;p&gt;The motivating use case came from Rovio Entertainment, the company behind Angry Birds. Their game required tracking player state across a globally distributed, occasionally connected player base — exactly the kind of scenario where coordination is prohibitively expensive and convergence is what you actually need. That use case grounded Lasp’s design in something real.&lt;/p&gt;

&lt;h3 id=&quot;the-scale-experiments&quot;&gt;The Scale Experiments&lt;/h3&gt;

&lt;p&gt;Lasp was part of the EU-funded SyncFree research project, which brought together researchers across Europe to advance the foundations of CRDT-based distributed systems. As part of our deliverables to the European Commission, we ran large-scale experiments on AWS — over 1,000 nodes — to demonstrate that these techniques worked at scale. Between 2015 and 2017, that kind of deployment was genuinely difficult: few frameworks supported it, and the operational challenges around service discovery, dissemination, and convergence were not trivial.&lt;/p&gt;

&lt;p&gt;We published those results at PPDP in 2017. What I remember most clearly is how much of the work was just fighting infrastructure that wasn’t ready for what we were trying to do. To get Lasp to run on top of Mesos’s Marathon framework — because I was contracting at Mesosphere at the time — I had to essentially rebuild the network layer of Lasp from scratch. That work became Partisan.&lt;/p&gt;

&lt;h3 id=&quot;the-things-that-grew-out-of-it&quot;&gt;The Things That Grew Out of It&lt;/h3&gt;

&lt;p&gt;Partisan, the open-source distribution layer I built to run those experiments, ended up being more widely adopted than Lasp itself. It became a high-performance alternative to Erlang’s built-in distribution, influenced improvements to Erlang’s distributed networking internals, and eventually found its way into open-source and proprietary projects I had no involvement in. Alejandro Ramallo, an Erlang developer who adopted both Lasp and Partisan, deployed them in systems powering LoJack’s stolen car recovery service across several South American countries. I did not see that coming.&lt;/p&gt;

&lt;p&gt;Lasp was also adopted as the storage backend for Erleans, an open-source implementation of Microsoft Orleans on Erlang — which is a satisfying full-circle moment given that I spent two summers at Microsoft Research working on Orleans’ transactional semantics.&lt;/p&gt;

&lt;p&gt;The fault-injection mechanisms in Partisan’s early network layer eventually became the seed of Filibuster, the fault injection testing framework that was the subject of my Ph.D. dissertation. The path from Lasp to Partisan to Filibuster to DoorDash is not a straight line, but there is a line. The only way to verify that a coordination-free system converges correctly is to partition the network and check what happens when it heals. That thinking, originally motivated by Lasp’s correctness requirements, eventually became a general-purpose approach to testing microservice resilience.&lt;/p&gt;

&lt;h3 id=&quot;what-the-academic-community-did-with-it&quot;&gt;What the Academic Community Did With It&lt;/h3&gt;

&lt;p&gt;The retrospective paper covers quite a bit of follow-on academic work that I found genuinely gratifying to trace. Systems like Katara built on Lasp’s foundational principles to synthesize CRDTs with verified lifting. LoRe and Varda extended Lasp’s declarative, coordination-free semantics toward verifiably safe compositional distributed software. Several PhD dissertations — from researchers in Belgium, Portugal, the UK, the US, and elsewhere — adopted Lasp as either a technical or theoretical foundation.&lt;/p&gt;

&lt;p&gt;The work Matthew Weidner and Heather Miller and I published together on composing op-based CRDTs with semidirect products also grew directly out of the Lasp model. That paper, published at ICFP in 2020, is something I’m quietly proud of — it’s one of the more mathematically interesting things I’ve worked on.&lt;/p&gt;

&lt;p&gt;The PPDP program committee selected the original Lasp paper as the most influential paper of the past decade — the PPDP 2025 10-year award — which is what prompted the retrospective paper in the first place. It is a bit surreal to receive a recognition like that for work that started, essentially, as a research project I threw myself into while working at Basho with no particular expectation that anyone outside the SyncFree project would care.&lt;/p&gt;

&lt;h3 id=&quot;why-it-still-matters&quot;&gt;Why It Still Matters&lt;/h3&gt;

&lt;p&gt;The retrospective paper makes the case, and I believe it, that Lasp’s core insight is more relevant now, not less. Multi-region active-active deployments are becoming standard architecture for companies at scale. Even at the speed of light, a single trip around the globe takes roughly 133 milliseconds. Coordination protocols like Two-Phase Commit or Paxos require multiple round trips just to agree on a single value. The math doesn’t work for globally distributed state if coordination is your default.&lt;/p&gt;
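
&lt;p&gt;The 133 millisecond figure is consistent with light covering the Earth’s equatorial circumference once. That reading, the constants, and the function names below are assumptions made for this back-of-the-envelope sketch, and real networks are slower because of fiber, routing, and queuing.&lt;/p&gt;

```python
# Assumed constants: equatorial circumference and the speed of light in a vacuum.
EARTH_CIRCUMFERENCE_KM = 40_075
SPEED_OF_LIGHT_KM_PER_S = 299_792.458


def lap_ms():
    """Milliseconds for light to circle the equator once, ignoring all overhead."""
    return EARTH_CIRCUMFERENCE_KM / SPEED_OF_LIGHT_KM_PER_S * 1000


def coordination_floor_ms(round_trips):
    """Lower bound if every round trip costs one full lap (a worst-case simplification)."""
    return round_trips * lap_ms()


print(f"one lap: {lap_ms():.1f} ms")
print(f"two round trips: {coordination_floor_ms(2):.1f} ms")
```

&lt;p&gt;Under that simplification, a protocol that needs two exchanges (a prepare phase and a commit phase, say) pays at least twice the single-lap cost before any real work happens.&lt;/p&gt;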

&lt;p&gt;Edge computing, local-first software, offline-capable mobile applications, and federated systems all face the same constraint: central coordination is sometimes simply unavailable. Coordination-free convergence isn’t a research curiosity in those environments — it’s a requirement.&lt;/p&gt;

&lt;p&gt;I’ll be curious to see what the next ten years look like for CRDTs specifically. The theoretical foundations are solid. The open question is whether the programming model abstractions — the things that Lasp was trying to work out — will find their way into mainstream languages and frameworks in a form that developers can actually use without thinking too hard about it. I think they will.&lt;/p&gt;

&lt;p&gt;The full paper is available on &lt;a href=&quot;https://dl.acm.org/doi/10.1145/3756907.3756910&quot;&gt;ACM&lt;/a&gt;.&lt;/p&gt;
</description>
				<pubDate>Sun, 01 Mar 2026 08:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/lasp/distributed/2026/03/01/ten-years-of-lasp.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/lasp/distributed/2026/03/01/ten-years-of-lasp.html</guid>
			</item>
		
			<item>
				<title>I Built a Social App in a Week with Claude Code</title>
				<description>&lt;p&gt;I spent the better part of a week building a social app with Anthropic’s Claude Code. Most of that work happened late at night, sometimes past 2am, iterating on features until I had something worth sharing with friends.&lt;/p&gt;

&lt;p&gt;Anthropic generates a weekly insights report for Claude Code users. Mine told an interesting story: 384 messages, 27 sessions, 168 files touched, six days. A median response time of 29 seconds — I was barely reading the output before firing back. They described my style as “reactive and corrective rather than spec-driven,” which is accurate. I was moving fast, fixing things as they broke, and learning what worked along the way.&lt;/p&gt;

&lt;p&gt;What I was building: a private social app for my close-knit group of friends. We share live recordings of bands — Phish, Grateful Dead, that world — favorite films, books, and which upcoming shows we plan to catch this summer. Think a tiny, invite-only corner of the internet for people who care deeply about live music and want somewhere to talk about it with people they actually know. Live chat between people couch-touring and people in the pit is coming next.&lt;/p&gt;

&lt;p&gt;The app is live. My friends are using it. It runs on Go with a server-driven UI architecture, and it ships as native iOS and Android apps. I built all of it in a week with Claude Code, and I want to tell you what that actually felt like.&lt;/p&gt;

&lt;p&gt;The entire first version was built at night, in bed, on an 11” iPad, using the Claude Code app connected to an empty GitHub repo. I didn’t touch a computer. Within a few hours I had a working app — login, a feed, posting — deployed and running on Railway. From there the scope crept in the right direction: Spotify integration so you could search and attach albums and tracks directly, setlist.fm integration to pull real show data — venues, dates, and actual setlists as they’re posted after each night — and let people mark which dates they were attending. The bones were there fast; making it feel like something worth actually using took longer. Claude scaffolded the backend, wired up the database, built the frontend, and I shipped it without leaving bed. It wasn’t until late in the second day that I moved to a laptop, mostly because the screen real estate was starting to feel limiting. The code itself didn’t care.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/zabriskie-screenshot-1.png&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;The day I invited most of my friends, a database migration corrupted production timestamp data.&lt;/p&gt;

&lt;p&gt;The migration modified column types in a running database. Timestamps encode timezone assumptions at the type level — change them mid-flight and the data doesn’t migrate, it breaks. The feed went down. I asked Claude to fix it. Each fix made things worse: wrong column names, broken SQL syntax, incorrect timezone arithmetic, until finally an overwrite ran that couldn’t be undone. The Anthropic report summarized it as “Claude’s database migrations went full disaster movie — each fix spawned a new production incident, permanently destroying timestamp data.” That’s accurate. Some of that data is simply gone.&lt;/p&gt;

&lt;p&gt;I spent years working on databases. I knew exactly what was happening and why it was catastrophic. I let it happen anyway because I was vibe coding, hands off the wheel, just watching Claude drive.&lt;/p&gt;

&lt;p&gt;The timestamp disaster was the worst single incident, but the pattern underneath it showed up constantly: Claude moving confidently in the wrong direction, and me not stopping it soon enough. Push notification debugging is another example. Missing APNs tokens sent Claude deep into provisioning profiles, certificates, entitlements — a long, plausible-looking path that turned out to be completely wrong. The actual problem was missing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AppDelegate&lt;/code&gt; methods in code Claude itself had written earlier. The report counted 30 instances of wrong-approach debugging across the project. That’s a lot of time watching a very capable thing solve the wrong problem.&lt;/p&gt;

&lt;p&gt;The server restart issue was more mundane but somehow more maddening for it. After any change to Go code, you have to restart the backend for the changes to take effect — compiled language, nothing exotic. Claude kept forgetting. I kept seeing no changes, assuming something was broken, investigating, finding nothing, eventually realizing the old process was still running. This happened across four or more sessions before I wrote a rule explicit enough that it actually stuck. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pkill&lt;/code&gt; was the culprit — it fails silently, leaving the old process alive. The fix was &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lsof&lt;/code&gt; to find the PID, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kill -9&lt;/code&gt; on that specific process, wait, restart, verify. A procedure that takes thirty seconds and has to be written down or it doesn’t happen.&lt;/p&gt;
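
&lt;p&gt;The procedure above (find the PID, kill that specific process, wait, restart, verify) can be sketched as a small script. This is a hedged sketch in Python rather than anything from the project: the function names are hypothetical, and it assumes a Unix-like machine with lsof installed and a backend that listens on a known TCP port.&lt;/p&gt;

```python
import os
import signal
import socket
import subprocess
import time


def port_is_open(port, host="127.0.0.1"):
    """Return True if something is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def wait_until(predicate, attempts=50, delay=0.1):
    """Poll predicate until it returns True or the attempts run out."""
    for _ in range(attempts):
        if predicate():
            return True
        time.sleep(delay)
    return False


def pids_listening_on(port):
    """Ask lsof for the PIDs listening on the port (-t prints bare PIDs)."""
    result = subprocess.run(
        ["lsof", "-t", f"-iTCP:{port}", "-sTCP:LISTEN"],
        capture_output=True, text=True, check=False,
    )
    return [int(pid) for pid in result.stdout.split()]


def restart_backend(port, start_command):
    """Kill whatever holds the port, wait, start the new build, verify it is up."""
    for pid in pids_listening_on(port):
        os.kill(pid, signal.SIGKILL)  # the kill -9 step, aimed at one specific PID
    if not wait_until(lambda: not port_is_open(port)):
        raise RuntimeError(f"old process still holds port {port}")
    process = subprocess.Popen(start_command)
    if not wait_until(lambda: port_is_open(port)):
        raise RuntimeError(f"backend did not come up on port {port}")
    return process
```

&lt;p&gt;A call might look like restart_backend(8080, ["go", "run", "."]), where both the port and the start command are placeholders. The two verification waits are the part that matters: they turn a silent failure into a loud one.&lt;/p&gt;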

&lt;hr /&gt;

&lt;p&gt;What got built, despite all of this, is genuinely surprising to me.&lt;/p&gt;

&lt;p&gt;A full server-driven UI migration — the backend sends the entire interface as JSON, the React frontend just renders it, no hardcoded pages. This one has a story. I’d instructed Claude from the start to build an SDUI app, but most of the early code landed in React anyway. The upside was that the initial pages were genuinely pretty — Claude has good taste in React UI — and I loved how they looked. The downside was that enabling a proper mobile strategy required full SDUI, which meant migrating every page individually, each one needing to be restyled from scratch. That work took real time.&lt;/p&gt;

&lt;p&gt;The migration also exposed a pattern that would recur throughout the project: Claude claiming victory prematurely. Buttons that didn’t work. Frontend calls to backend routes that didn’t exist — phantom APIs, confidently wired up. Forms that submitted successfully from the UI while silently dropping half their parameters on the way to the backend. Claude would build a form, build an API endpoint, connect them, and declare the feature done. The form would submit. The endpoint would return 200. Nothing would actually be saved.&lt;/p&gt;

&lt;p&gt;I eventually had to encode explicit rules: when you add a new backend API, test it with curl before touching the frontend. When you wire up a frontend call, verify the route actually exists. When a form submits, confirm every parameter arrives at the backend. The quality improved significantly once those guardrails were in writing.&lt;/p&gt;
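
&lt;p&gt;The last of those rules can be automated. A minimal sketch, assuming a form-encoded POST and using a throwaway local server as a stand-in for the real Go backend; the handler, the field names, and the /posts route are all hypothetical.&lt;/p&gt;

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # what the stand-in backend actually parsed


class RecordingHandler(BaseHTTPRequestHandler):
    """Stand-in endpoint: records the form fields it receives, then returns 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode()
        received.append(dict(urllib.parse.parse_qsl(body)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the output quiet


def dropped_fields(submitted, url):
    """POST the form and return the fields that never reached the backend intact."""
    data = urllib.parse.urlencode(submitted).encode()
    with urllib.request.urlopen(url, data=data) as response:
        assert response.status == 200  # a 200 on its own proves nothing
    arrived = received[-1]
    return sorted(key for key in submitted if arrived.get(key) != submitted[key])


# Bind to port 0 so the operating system picks a free port.
server = HTTPServer(("127.0.0.1", 0), RecordingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/posts"

form = {"title": "Night one", "venue": "MSG", "rating": "5"}
missing = dropped_fields(form, url)
server.shutdown()
print(missing)
```

&lt;p&gt;The check compares what was sent with what the handler parsed, so a form that returns 200 while dropping half its parameters shows up as a non-empty list instead of a silent success.&lt;/p&gt;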

&lt;p&gt;iOS and Android apps, both in TestFlight and ready for Android testing within the same week. End-to-end push notifications wired to every interaction. An AI-powered feature that surfaces what your friends are collectively into right now. An invite management system, automated build scripts, 107 database migrations, user engagement charts, a changelog that notifies users when something new ships.&lt;/p&gt;

&lt;div style=&quot;display: flex; gap: 12px; align-items: flex-start;&quot;&gt;
  &lt;img src=&quot;/img/zabriskie-screenshot-2.png&quot; style=&quot;width: 33%;&quot; /&gt;
  &lt;img src=&quot;/img/zabriskie-screenshot-3.png&quot; style=&quot;width: 33%;&quot; /&gt;
  &lt;img src=&quot;/img/zabriskie-screenshot-4.png&quot; style=&quot;width: 33%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Claude’s ability to hold a large multi-file change in mind — a database migration, a new API handler, a frontend component, and a mobile layout fix, all in one coherent session — is where it earns everything. When the scope is clear and the pattern is known, it moves at a speed that doesn’t feel real.&lt;/p&gt;

&lt;p&gt;Only 2 of my 27 sessions fully achieved what I set out to do. That number sounds damning until you consider that 27 sessions in six days shipped something real that people are using. The sessions that failed were almost always the same shape: open-ended, no clear stopping point, debugging something visual or stateful where Claude had no feedback loop and I had too much patience for wrong approaches. The sessions that worked were tight — one known thing, done correctly, then stop.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;The insights report also came with recommendations, and they’re worth passing on.&lt;/p&gt;

&lt;p&gt;The biggest one: hooks. Claude Code supports post-edit hooks — shell commands that fire automatically after files are changed. The report suggested wiring one up to restart the Go backend after any &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.go&lt;/code&gt; file edit, which would have eliminated the single most recurring waste of time in the entire project. I haven’t set it up yet. I’m going to.&lt;/p&gt;

&lt;p&gt;The report also suggested adding explicit guardrails to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLAUDE.md&lt;/code&gt; — the project instruction file Claude reads at the start of every session. Things like: after any backend code change, always restart the server before testing. Never run migrations or deploy fixes without explicit user approval. When the user tells you a layer is working, stop investigating that layer. Limit yourself to two attempts at a single approach — if it hasn’t worked twice, step back and explain what you’ve learned before trying again. Most of these I’d arrived at the hard way over the course of the week. Having them written down from the start would have saved days.&lt;/p&gt;

&lt;p&gt;The other recommendation was about session discipline. Five of my sessions were completely lost to context limits — the conversation grew too long, and when I tried to recover with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/compact&lt;/code&gt;, it just returned “failed to compact” and left me stranded mid-task with no way forward except closing Claude and starting over from scratch. Losing context mid-session, mid-thought, mid-fix, with no handoff and no summary, is a particular kind of frustrating. The fix is obvious in retrospect: end each session at a natural stopping point, write a brief summary of state, start fresh. The sessions where I shipped something were the tight ones. I kept ignoring that signal.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;The report called my style “ambitious” and noted I “tolerate high friction from repeated wrong approaches.” I’d put it differently: I was building something I actually cared about, for people I actually know, and the deadline was real. Spring tour is the test. Summer tour is the goal. That changes your relationship to the friction.&lt;/p&gt;

&lt;p&gt;You’re not writing code with Claude Code. You’re steering. The gap between those two things is where all the frustration lives, and also where all the speed comes from. When you accept that you’re the judgment layer — deciding when to redirect, when to stop, when a fix is making things worse — the tool becomes something genuinely extraordinary.&lt;/p&gt;

&lt;p&gt;Spring tour is coming. The app is ready.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;em&gt;Postscript: while writing this blog post, the same pattern bit me again. The app was deploying fine on Railway with a simple nixpacks build. While fixing an unrelated feature, I added a healthcheck to the config as a drive-by change — unnecessary, but it seemed harmless. Later, switching to a Dockerfile build to bake in an environment variable caused the build to take longer than nixpacks, so the healthcheck started timing out before the service was ready. Deployments failed. Rather than identifying the healthcheck as the culprit, five successive commits changed the port, the builder, the start command, and the Dockerfile in various combinations. None of it worked, because the real problem was never diagnosed. The entire chain of failures traced back to one unnecessary line added while working on something else entirely. Some things don’t change.&lt;/em&gt;&lt;/p&gt;
</description>
				<pubDate>Tue, 17 Feb 2026 08:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/ai/claude/2026/02/17/building-a-social-app-in-a-week-with-claude-code.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/ai/claude/2026/02/17/building-a-social-app-in-a-week-with-claude-code.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Byzantine Fault Injection with Hardcoded Fault Values</title>
				<description>&lt;p&gt;Ever want to test your system against Redis returning wrong values, like instead of returning an error, it returns an empty string? What about an empty byte array? What about a database field being null? You can do it with Filibuster 2.0!&lt;/p&gt;

&lt;p&gt;Let’s return a null from a Redis get, easy! Simply done with Filibuster.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/1691310592563.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Sat, 09 Sep 2023 08:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/09/09/filibuster-2.0-byzantine-hardcoded.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/09/09/filibuster-2.0-byzantine-hardcoded.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Coaching UI</title>
				<description>&lt;p&gt;Your developers are writing functional tests for their microservice and, when RPCs fail, they throw exceptions. Do they test the cases where it throws? Do they have tests for it?&lt;/p&gt;

&lt;p&gt;Using Filibuster, you can automatically identify these scenarios and prompt the developer to answer these questions.&lt;/p&gt;

&lt;p&gt;Here’s a case where a developer threw an exception when a downstream RPC failed. We ask them: “did you mean to throw?” and, if so, they are prompted to tell the system that this exception is on purpose: this allows us to determine resilience failures automatically.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/1691309630814.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Fri, 08 Sep 2023 08:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/09/08/filibuster-2.0-coaching.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/09/08/filibuster-2.0-coaching.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Byzantine Fault Injection with Arbitrary Faults</title>
				<description>&lt;p&gt;Ever wanted to just throw all sort of values at your database and see what happens to your application? Filibuster’s byzantine fault injector can take an arbitrary “value transformer” that looks like a functional fold, that allows you to come up with new fault injection scenarios as you inject faults!&lt;/p&gt;

&lt;p&gt;Let’s test our application against flipping characters in a string response from Redis!&lt;/p&gt;

&lt;p&gt;Here, we &lt;em&gt;observe&lt;/em&gt; the response from the test run that passes with no faults injected, and then flip a character in that response. We actually built this to flip &lt;em&gt;every&lt;/em&gt; character in the string.&lt;/p&gt;

&lt;p&gt;What was once “example” in the Redis response becomes “ xample”! And then becomes “e ample”!&lt;/p&gt;
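
&lt;p&gt;A sketch of what a fold-shaped transformer for this scenario could look like. This is illustrative Python, not Filibuster’s actual API (Filibuster targets the JVM); the function names and the accumulator convention are assumptions.&lt;/p&gt;

```python
def flip_at(observed, position, replacement=" "):
    """One fold step: corrupt one character, then return the next accumulator.

    The accumulator is the position to corrupt next, or None when finished.
    """
    corrupted = observed[:position] + replacement + observed[position + 1:]
    next_position = position + 1
    if next_position == len(observed):
        return corrupted, None
    return corrupted, next_position


def scenarios(observed):
    """Yield one corrupted response per character of the observed response."""
    position = 0 if observed else None
    while position is not None:
        corrupted, position = flip_at(observed, position)
        yield corrupted


variants = list(scenarios("example"))
print(variants[0], "|", variants[1])
```

&lt;p&gt;Each step returns both a corrupted value and the state needed to produce the next one, which is what lets new scenarios be generated while earlier ones are still being injected.&lt;/p&gt;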

&lt;p&gt;&lt;img src=&quot;/img/1691310856004.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Thu, 07 Sep 2023 08:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/09/07/filibuster-2.0-byzantine-arbitrary.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/09/07/filibuster-2.0-byzantine-arbitrary.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Computing API coverage of a Microservice Application</title>
				<description>&lt;p&gt;About two months ago I started prototyping this feature for visualizing API coverage in a microservice application through automated instrumentation: see where you have functional testing coverage, how many functional tests, and where you’re applying fault injection to determine the impact of those changes.&lt;/p&gt;

&lt;p&gt;Flash forward to yesterday, a new paper draft on arXiv proposed (almost) this very thing – they went a bit further in their study and provided more comprehensive visualizations and an accompanying study. In contrast, mine is running on real code written by industrial developers. On the left, Filibuster; on the right, their proposal.&lt;/p&gt;

&lt;p&gt;Good ideas happen at the same time, I suppose!&lt;/p&gt;

&lt;p&gt;(e.g., Mattern and Fidge both independently coming up with the idea for vector clocks the very same year.)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/1692932473188.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Wed, 06 Sep 2023 08:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/09/06/filibuster-2.0-API-coverage.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/09/06/filibuster-2.0-API-coverage.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Microservice Linter, Multiple Invocations to the Same RPC Method</title>
				<description>&lt;p&gt;You can also use Filibuster’s dynamic analysis linter to find microservice smells.&lt;/p&gt;

&lt;p&gt;Here’s one: invoking multiple RPCs to the same service because you can’t send them all together! This leaves you at risk for partial side-effects being applied: refactor your API to let developers supply all inputs and write data to your database transactionally!!!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/1691311807950.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Tue, 05 Sep 2023 08:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/09/05/filibuster-2.0-multiple.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/09/05/filibuster-2.0-multiple.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Microservice Linter, Requests become part of a Response</title>
				<description>&lt;p&gt;You can also use Filibuster’s dynamic analysis linter to find microservice smells.&lt;/p&gt;

&lt;p&gt;Here’s one: using the arguments from one RPC to Service A as the inputs to a different RPC on the same Service A. You should refactor your API so I don’t have to make multiple RPCs!!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/1691311683887.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Mon, 04 Sep 2023 08:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/09/04/filibuster-2.0-request-response.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/09/04/filibuster-2.0-request-response.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Microservice Linter, Redundant RPCs</title>
				<description>&lt;p&gt;You can also use Filibuster’s dynamic analysis linter to find microservice smells.&lt;/p&gt;

&lt;p&gt;Here’s one: executing the same RPC multiple times. You’ve already got the response, so you shouldn’t issue a failure-prone, expensive RPC again.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/1691311592348.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Sun, 03 Sep 2023 08:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/09/03/filibuster-2.0-redundant.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/09/03/filibuster-2.0-redundant.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Improved UI</title>
				<description>&lt;p&gt;Integrated directly into IntelliJ, Filibuster can show you the RPCs that your service is making and directly inject faults on any of those RPCs. Attach a debugger, inject some faults, and see what your application does!&lt;/p&gt;

&lt;p&gt;It’s not just GRPC (as shown here): we also support HTTP, Redis, PostgreSQL, CockroachDB, and DynamoDB.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/1691309780575.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Sat, 02 Sep 2023 10:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/09/02/filibuster-2.0-02-user-interface.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/09/02/filibuster-2.0-02-user-interface.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Healthcheck your Functional Test Suite with API Coverage</title>
				<description>&lt;p&gt;Pushed out a new prototype Filibuster feature tonight: 🎉 use Filibuster’s IntelliJ plugin to get a “health check” of your microservice’s functional test suite. 🎉&lt;/p&gt;

&lt;p&gt;Here, I can quickly open the plugin after running my test suite and see what RPC methods my service is exposing, how many unique functional tests I have covering those methods, how many specific Filibuster tests I have for those, and how many tests were automatically generated using Filibuster to exercise unique fault injection scenarios: where I injected HTTP and GRPC faults.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/1691649263651.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Sat, 02 Sep 2023 09:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/09/02/filibuster-2.0-01-healthcheck.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/09/02/filibuster-2.0-01-healthcheck.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Redis Fault Injection</title>
				<description>&lt;p&gt;Ever wanted to test your microservice application against Redis failures? Filibuster 2.0 supports injecting faults against Redis, PostgreSQL, CockroachDB, and DynamoDB.&lt;/p&gt;

&lt;p&gt;Here, using our IntelliJ visualizer for tests, we see that in this test we injected a failure on a synchronous GET command to Redis and the test still passed. That’s fault tolerant code!&lt;/p&gt;

&lt;p&gt;Not using synchronous operations? No problem! Filibuster will inject exceptions for each asynchronous GET or SET operation and defer the fault injection until someone calls get or set on the future.&lt;/p&gt;

&lt;p&gt;Wanna find out if someone is using thenAccept but forgetting to catch the exception? Filibuster can fail the test if the developer never gets the value of the returned future too!!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/1691309222915.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Fri, 01 Sep 2023 10:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/09/01/filibuster-2.0-redis-fault-injection.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/09/01/filibuster-2.0-redis-fault-injection.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Introducing the new version of Filibuster</title>
				<description>&lt;p&gt;🎉 Happy to announce the 2.0 release of Filibuster (for JVM languages!) 🎉&lt;/p&gt;

&lt;p&gt;This release includes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Lots of improvements for HTTP fault injection for Armeria’s HTTP clients and visualization of HTTP faults into the Filibuster IntelliJ plugin.&lt;/li&gt;
  &lt;li&gt;Introduction of fault injection support for Redis, CockroachDB, DynamoDB, and PostgreSQL.&lt;/li&gt;
  &lt;li&gt;Byzantine fault injection (value-based fault injection) support for all supported databases, GRPC, and HTTP. BFI is supported through two modes: specification of explicit-values for BFI or a fold-based value transformer for dynamically creating new BFI scenarios at runtime.&lt;/li&gt;
  &lt;li&gt;A new testing library for writing microservice resilience tests with a “coaching” UI that helps you identify bugs or specify behavior under fault, integrated directly into IntelliJ through the use of a plugin.&lt;/li&gt;
  &lt;li&gt;Support for integrated code coverage reporting with CodeCov and reporting on API coverage through the UX.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Shout out to Michael Issac Assad for his incredible hustle in getting all of this working. 🍻&lt;/p&gt;

&lt;p&gt;Look forward to a series of (blog) posts in the coming weeks on how to use all of these new features.&lt;/p&gt;
</description>
				<pubDate>Thu, 31 Aug 2023 10:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/08/31/filibuster-2.0-release.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/08/31/filibuster-2.0-release.html</guid>
			</item>
		
			<item>
				<title>Filibuster 2.0: Developer Tooling as a Childhood Dream</title>
				<description>&lt;p&gt;What a long strange trip (the Ph.D.) it’s been:&lt;/p&gt;

&lt;p&gt;In 2018, when restarting my Ph.D. journey at Carnegie Mellon University, I wrote up a proposal for a Microsoft Research Fellowship that was, sadly, rejected. In it, I said I wanted to make microservice programming more like monolithic programming with an integrated development environment. It even had fake screenshots we hacked with a fake LSP server. It was a large vision, where I wanted distributed type checking, fault injection, a linter, etc.&lt;/p&gt;

&lt;p&gt;Fast forward to late 2023, and I’m on the precipice of defending my Ph.D. While I had to rework my vision because it was much too large — and along the way completely and utterly lost the thread a few times diverging off into distributed runtimes, stateful serverless, model checking, and the like — I finally settled on fault injection testing as the primary topic of my Ph.D. thesis. (No more CRDT!). You can blame a certain Program Analysis course at Carnegie Mellon University that completely changed my mind about what is interesting about Software Engineering.&lt;/p&gt;

&lt;p&gt;…and, my vision is slowly becoming true!&lt;/p&gt;

&lt;h2 id=&quot;filibuster-20&quot;&gt;Filibuster 2.0&lt;/h2&gt;

&lt;p&gt;Here’s Filibuster today: a vertically integrated fault injection, exhaustive, resilience testing technique.  It’s integrated directly into IntelliJ.  You can inject faults, lint your code for bad microservice programming patterns, visualize your execution… and wait for it, it even helps create “negative” tests through integrated programming recommendations that exercise your application code in “bad” situations, verifying that, indeed, you react to the bad properly. It reports coverage and even detects several bad test writing patterns.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/1689988322009.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;coaching-ux&quot;&gt;“Coaching” UX&lt;/h2&gt;

&lt;p&gt;By far my favorite part of my Ph.D. thesis work has to be the UX/UI tool (and methodology) that I wrote for helping write and debug microservice resilience tests.  It’s just so cool — I’m realizing a childhood dream of building developer tooling ever since my high school days, when I was so excited to unwrap my first copy of Visual Studio.&lt;/p&gt;

&lt;p&gt;Not only does it help visualize the RPCs that your application is making when you execute a test, if your test happens to fail because a fault was injected, it will prompt you with a question: was this failure expected or not?  If so, it helps you write code into your test to specify what the behavior of your application is under failure.&lt;/p&gt;

&lt;p&gt;What’s neat here is the integrated Javadoc on how to write the tests, a mechanism for replaying the failure so you can attach a debugger, and if you provide enough hints to the system, it can even automatically figure out what your system will do under certain failures.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/1691033390525.jpeg&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Wed, 30 Aug 2023 11:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/08/30/filibuster-2.0-building-developer-tools.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/08/30/filibuster-2.0-building-developer-tools.html</guid>
			</item>
		
			<item>
				<title>Reflecting on 10 years of Blogging, Research, and a (very) long Ph.D.</title>
				<description>&lt;p&gt;June 03, 2023 marks &lt;em&gt;10 years&lt;/em&gt; since I started my blog, and, it’s sort of mind blowing, honestly.&lt;/p&gt;

&lt;p&gt;When I go back and look at my old posts on my blog, well, it’s quite fitting that my &lt;em&gt;first blog post&lt;/em&gt; was on fault injection and fault tolerance, because that’s the topic that ultimately became my Ph.D. research, which I’ve been focused on since 2015… but really since 2018.&lt;/p&gt;

&lt;p&gt;Since creation of this blog, I’ve been lucky to have &lt;em&gt;many&lt;/em&gt; of these posts make the front page of Hacker News and Lobsters, which at minimum, encouraged me to keep going.  This visibility has been instrumental to my career, despite the (sometimes) negative comments, but, wow… what a ride it has been.&lt;/p&gt;

&lt;p&gt;But, perhaps what is most interesting to me, is that the genesis of this blog was at the exact moment that I first ventured into research.  When I started my dream in 2013 — when I started this blog sitting in a wicker chair in the corner of the living room of my close friend Greg in San Francisco with a re-run of &lt;em&gt;Friends&lt;/em&gt; on in the background — my goal was to work at Microsoft Research.  When I started my Ph.D. in 2015, my goal was to be a professor in France.  In 2018, I thought I might be a teaching professor in the US.&lt;/p&gt;

&lt;p&gt;Things change: shit, it’s been &lt;em&gt;only&lt;/em&gt; 10 years!&lt;/p&gt;

&lt;p&gt;In some ways, it is fitting that this 10 year milestone corresponds with the year that I will (hopefully) defend my Ph.D. at Carnegie Mellon University: if not this year, early next year at the latest.&lt;/p&gt;

&lt;h3 id=&quot;the-inticement-of-research&quot;&gt;The Enticement of Research&lt;/h3&gt;

&lt;p&gt;I started this blog as a Software Engineer at Basho Technologies, a now-defunct distributed systems company that was primarily focused on writing a fault-tolerant, highly scalable distributed database in Erlang.  At the time, I was a JavaScript developer building the UX for Basho’s distributed database, Riak.  I did not know Erlang — only enough to maintain a job where I had to produce JSON from Erlang to build the UX for Riak — and, I barely knew JavaScript.  However, I was not new to programming as I had been a software engineer, systems engineer, and even a software engineering manager for many years before coming to Basho.  But, I ended up here because when I became a manager, I decided it wasn’t for me and I wanted to be a programmer again, so I restarted my career from the beginning as a junior developer, working my way back up.  I slowly moved from being primarily a TCL programmer for web applications — ArsDigita crew represent! — to manager, and then back down again to a Ruby programmer, and up again, to finally, a JavaScript developer at Basho.&lt;/p&gt;

&lt;p&gt;I really knew nothing of distributed systems at the time.  I didn’t read my first distributed systems paper until 2012.&lt;/p&gt;

&lt;p&gt;However, by 2013, I was very interested in both Erlang and distributed systems and by merely playing around in my spare time, had wandered into the area of fault injection and fault tolerance.  My &lt;a href=&quot;https://christophermeiklejohn.com/erlang/2013/06/03/erlang-pg2-failure-semantics.html&quot;&gt;first&lt;/a&gt; and &lt;a href=&quot;https://christophermeiklejohn.com/erlang/2013/06/05/erlang-gproc-failure-semantics.html&quot;&gt;second&lt;/a&gt; blog posts were specifically on this very topic.  I did not know at all what I was doing – as I’ve said numerous times in the past, writing is a way for me to understand something and I write specifically to help clarify my own thinking on a topic.  In short, I found something that was interesting, started writing about it, and then wrote code to figure it out.&lt;/p&gt;

&lt;p&gt;In 2013, I decided I wanted to start grad school to &lt;em&gt;really&lt;/em&gt; learn about a programming area and, as I was living in Rhode Island, reached out to Brown University, where some members of my family had attended.  They wanted nothing to do with me for several reasons.  First, I didn’t have a degree in CS, because I got one in IT part-time at Northeastern University after getting an associate’s degree at the Community College of Rhode Island.  Second, I had no research experience.  I was told “no research potential” by one prominent faculty member, which reminded me of Zappa’s infamous “no commercial potential” response.  Finally, after reaching out to a professor at Brown based on a paper he published that I had reimplemented for fun – and once I told him that I programmed Erlang in industry — he gave me permission to enroll in courses as a “special student.”&lt;/p&gt;

&lt;p&gt;A “special student” means a couple different things.  First, that you have to carry around a bright pink slip to all of your courses on the first day of classes begging to be admitted to the professor of the class, as you are not a degree student.  Second, that you have to pay your own way — at the time, $6k a course.  I doubled down, got my slips signed, and blew (part of) a family inheritance to take my courses.  I took a database seminar (not knowing what a seminar was) and a programming language course with Coq (not knowing what Coq or programming languages really were, outside of writing computer programs.)  I got lucky in that both of my course projects were decent.  In fact, I discussed one at RICON, the distributed systems conference in 2013: my work on writing CRDT proofs — really, really, awful trivial proofs, but the first person to actually &lt;em&gt;try&lt;/em&gt; writing proofs for CRDTs — and it seemingly got me noticed by many people in the field.  I was in!&lt;/p&gt;

&lt;p&gt;In 2014, Basho established a research consortium with a number of universities in the European Union to work on the development of CRDTs for use inside of Riak.  I wasn’t invited, but several of my friends/colleagues were.  One January evening, lying in bed with the flu in my parents’ spare bedroom a few days after Christmas, I asked a colleague where they were going in Europe.  When I found out, I applied for a Delta credit card, was accepted, bought a flight, asked my colleague if I could room with him for the event, booked an AirBnb in the city, and just showed up there myself and said “hey, I’m with Basho.”  From then on, I became involved in the research project in several different capacities over time, travelling back and forth to Europe for the company.&lt;/p&gt;

&lt;p&gt;I got to work on various CRDT related research projects, ultimately building a new programming model and (partially) overseeing the development of a database called Floppystore, which became a database called Antidote DB.  I visited Rovio and a number of our other clients in various countries.  I gave talks at a bunch of conferences, collaborated with many different folks across organizations and companies, and finally, disappointed with my job during the decline of Basho — I was working on testing, which ultimately became my Ph.D. — decided to quit my job and do a Ph.D. in CRDTs once I published my first research paper.&lt;/p&gt;

&lt;p&gt;At that point, it was &lt;em&gt;on&lt;/em&gt;.  I literally donated all of my belongings that I had accumulated at 30 years old to charity, and with only a single suitcase and carry-on, moved to San Francisco for several months to work a contract job on formal verification — which I knew nothing about — and then when finished with that contract, proceeded with my move to Europe for my Ph.D.&lt;/p&gt;

&lt;p&gt;Things were looking up.&lt;/p&gt;

&lt;h3 id=&quot;the-first-attempt-at-research&quot;&gt;The First Attempt at Research&lt;/h3&gt;

&lt;p&gt;Living in Europe had always been a dream of mine, ever since spending much of my Information Technology degree at Northeastern watching French film, reading French literature, and having spent much time in Paris with a former romantic partner of mine who was born there.  I loved the culture, the city, and fully embraced the lifestyle: it honestly felt like an absolute &lt;em&gt;dream&lt;/em&gt;.  There is no place else I would ever want to spend the rest of my life.&lt;/p&gt;

&lt;p&gt;At the start, my Ph.D. was nothing short of a wonderful experience.  I spent weekends in various different cities, traveled to 20+ different universities giving talks on CRDTs, Erlang, and distributed systems, and met (and then collaborated with) some of the most friendly and welcoming people that I will call friends for the rest of my life.&lt;/p&gt;

&lt;p&gt;However, my research progress was awful.  This wasn’t to say that I wasn’t doing research: I was writing code, building prototypes, showing off example applications, performing evaluations, etc.  In fact, when we finally presented our project results, I was one of the key presenters at our project review meeting in Brussels for the European Commission.  We had built a lot of software and it worked well, and it worked well at scale.  But, from the point of view of doing a Ph.D., I just felt stuck.  Perhaps my heart wasn’t in CRDTs anymore: as it would seem, it wasn’t and I completely abandoned it as a research topic.&lt;/p&gt;

&lt;p&gt;Thankfully, I got an internship at Microsoft Research in Redmond, primarily thanks to my previous work on Erlang and actor systems.  This work, over the summer, is work that ultimately led to the transactional semantics in Orleans, but it was a rough start.  When arriving at Microsoft, I didn’t know what to do.  I wasn’t used to the culture of &lt;em&gt;extremely minimal supervision&lt;/em&gt; and spent the first month just drinking coffee and reading papers that I thought were relevant.  I didn’t know transactions, I didn’t know how they worked, I didn’t know the names of all of the properties of isolation or whatever.&lt;/p&gt;

&lt;p&gt;Phil Bernstein, however, was an incredible mentor.  He pointed me in the directions of the right papers, gave me his own copy of his book on transactions, and upon going home each night, I would sit and read as much as I could.  I started by benchmarking transactions, and each day as I learned more and more, I got better at figuring out how things worked, what each thing was called, and what each thing meant.  I absorbed every paper Phil even mentioned in passing, and tried to learn as much as I could.  I thought things were looking up and I was starting to feel more at home in my own skin.&lt;/p&gt;

&lt;p&gt;However, I was sick every single day of the week.  Horribly sick, every morning and spending each evening sick.&lt;/p&gt;

&lt;p&gt;I thought at first it was stress, but each day it became increasingly worse: I just couldn’t stay out of the bathroom.  Finally, after visiting several doctors and performing a number of experiments myself each day with what I ate, I figured it out. 
I could no longer eat bread.&lt;/p&gt;

&lt;p&gt;I lived in Europe, and I couldn’t eat bread anymore without getting horribly sick.&lt;/p&gt;

&lt;p&gt;When my internship ended, it was time to return to Europe, but a condition of my Ph.D. grant (Erasmus) was that I had to switch universities every year and have two different Ph.D. advisors, so I returned to Belgium and after a few months headed off to Lisbon.  In Lisbon, I fared less well: I knew French, but didn’t know Portuguese, and I also had to communicate to everyone that I couldn’t have dairy or gluten anymore.  Complicating this, my Ph.D. advisor in Lisbon wanted me to work on a different project than my Ph.D. advisor in Belgium did.  I was lost, between conflicting responsibilities from different advisors, while sick, in a place where I didn’t speak the language.  I hobbled along for almost a year — completely lost, depressed, sick, alone, and otherwise miserable — and was lucky enough to be invited back to Microsoft Research.&lt;/p&gt;

&lt;p&gt;At Microsoft Research that year, I was mentored by Sebastian Burckhardt, who was involved in our previous project as well and who has since become a close friend of mine.  Sebastian, who I had known for a long time because he was extremely critical of my CRDT work years earlier, was a fantastic mentor and great person, who helped me to grow in a number of ways.  Here, we took the ideas we developed in Orleans and built the first version of stateful serverless, which we open sourced at the time, and became Durable Entities on Azure.  Not only this, but Sebastian introduced me to Jonathan Goldstein at Microsoft, where I got to work with cutting edge database systems and help define part of that system’s API.  I was eating gluten-free, things were great, and I was flying high.&lt;/p&gt;

&lt;p&gt;But, on return to Europe, I was miserable.  Back in the same place, with no direction, no clear end to the Ph.D., despite being in year 3 of my 3 year funding, and sick once again from being unable to properly control my diet in the foreign land.  I dabbled with starting a company in France based on our previous work – we met with several VC funds which didn’t pan out.  Instead, I spent a few months there, and upon returning to the United States, quit my Ph.D. via email from a bar (Miracle of Science, it’s you!) on Massachusetts Avenue in Cambridge, Massachusetts while watching a Boston Celtics game on the television and told my collaborators that I would not be starting a company in France.  I left all of my belongings, outside of what I could carry in a suitcase, in my apartment in Portugal, which my landlord claimed as payment for the security deposit.  I was lucky enough that I had already accepted an offer for a return to Microsoft Research that summer, despite having just quit my Ph.D.&lt;/p&gt;

&lt;p&gt;Upon return to Microsoft, things were great.  I corrected my diet, was living in a place where I could control cross-contamination by not sharing a kitchen with other students, had a car to grocery shop, and was doing quite well.  However, from the research perspective, I was making less progress.  I was working on effects systems and capturing nondeterminism and built two different versions of the system – one research with strong guarantees and one practical with fewer guarantees, which both worked — but no one seemed particularly impressed.  I felt like I was doing very practical useful work, but my research presentations were under-attended and no one seemed to really care.  I still felt lost.&lt;/p&gt;

&lt;p&gt;Not knowing what to do with what remained of my (non-existent) career, I was lucky enough to get enrolled at Northeastern University as a Ph.D. student thanks to my future Ph.D. advisor Heather Miller — who I had known from many moons ago when she was a grad student and I was an industry participant at a PL summer school in Oregon, and who I organized the Curry On conferences at ECOOP with — and who I had been working with on and off on projects while I was untethered in Boston since quitting, to promising success.  However, by the time I finished up at Microsoft, she had left Northeastern and was on her way to Pittsburgh, PA, at Carnegie Mellon University.  Carnegie Mellon was &lt;em&gt;a bit&lt;/em&gt; of an upgrade from a smaller European university in Belgium and, definitely so for someone who Brown University considered had “no research potential.”  And, it &lt;em&gt;showed&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;My first semester, I failed out of one of my courses and landed in the hospital for a full week: overwork, stress, and an improper diet had caused my pancreas to go into overdrive and I was at risk for pancreas failure.  However, with some rest and a slight course correction at CMU, things got back to normal and soon enough I was passing courses and doing good research.  I ended up taking my work on building an evaluation framework for our SyncFree work three years prior and turning it into a top-tier conference paper at USENIX ATC my first year at CMU with the help of Heather.  Things were looking up, but I had absolutely no direction how to proceed.  This was soon to change.&lt;/p&gt;

&lt;h3 id=&quot;the-redemption&quot;&gt;The Redemption&lt;/h3&gt;

&lt;p&gt;In late 2018, I started playing around with the idea of fault injection. 
My previous paper, Partisan, had been focused on building a distributed runtime: my thinking was that if you control the runtime, you can inject faults into any of the communication that is performed by the runtime.  It seemed the clear natural extension of the Partisan work, and would be a good second (of 3) piece of my Ph.D. thesis.&lt;/p&gt;

&lt;p&gt;So, to demonstrate, I built several example applications, adapted blockchain applications and distributed databases, all to use this fault injection framework.  I even implemented 3 different transaction protocols based on my previous work at Microsoft Research.  However, my work was just too simplistic to find bugs.  I just couldn’t compete with any technique that used model checking, with advanced heuristics, and the very same year that I tried to publish my work — which was rejected — another researcher came out with something far more advanced for Erlang that found significantly more bugs.&lt;/p&gt;

&lt;p&gt;But, this fault injection idea stuck with me and I kept on working on it.  Instead, I thought, rather than try to target the identification of concurrency bugs where so many existing researchers are focused, why not try to target microservice bugs, where no one is working and publishing, instead?&lt;/p&gt;

&lt;p&gt;I decided to write a proposal for this work and other work that fit into my broad vision of how microservice software should be developed.&lt;/p&gt;

&lt;p&gt;I wrote a Microsoft Research Fellowship Proposal in early 2019 that started like the following:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/msr-1.png&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;❌ Rejected.&lt;/p&gt;

&lt;p&gt;I was disappointed, but I was lucky enough to get an internship at Amazon in Automated Reasoning working on S3 to do concurrency testing and model checking that same week, so it mitigated the blow of yet another rejection (one of many).&lt;/p&gt;

&lt;p&gt;However, once returning from my Amazon internship, I was still left in a place of uncertainty with my own work: I should have been done with my Ph.D. already!&lt;/p&gt;

&lt;p&gt;It wasn’t until I randomly took a program analysis course that I realized my true love of programming is &lt;em&gt;reasoning about, and verifying, program correctness.&lt;/em&gt;  This was, without a doubt, the one course I took that changed the course of my Ph.D.&lt;/p&gt;

&lt;p&gt;So, I decided to reimplement my fault injection strategy in Python for use on microservice applications written in Python (this was done based on a potential user, who never materialized) and was lucky enough to also have students reach out to help me as part of an undergraduate research initiative while I was doing the work.  I started the work late ‘19 and the students helped out in late ‘20: we submitted early ‘21 for the May deadline of SoCC ‘21.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our first paper on what is currently called &lt;a href=&quot;http://filibuster.cloud&quot;&gt;Filibuster&lt;/a&gt; got in, first shot.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From my perspective, this is when &lt;em&gt;my Ph.D. actually started&lt;/em&gt;: the first Filibuster paper at SoCC ‘21.&lt;/p&gt;

&lt;p&gt;From this point forward, it’s been a whirlwind experience, which I’ve blogged about in detail:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Filibuster’s results led to some &lt;a href=&quot;https://doordash.engineering/2022/04/25/using-fault-injection-testing-to-improve-doordash-reliability/&quot;&gt;industry adoption&lt;/a&gt; and sponsorship of future development;&lt;/li&gt;
  &lt;li&gt;Filibuster was completely &lt;a href=&quot;https://github.com/filibuster-testing/filibuster-java-instrumentation&quot;&gt;reimplemented in Java and adapted to handle GRPC&lt;/a&gt;;&lt;/li&gt;
  &lt;li&gt;Filibuster’s work led to a &lt;a href=&quot;https://christophermeiklejohn.com/publications/socc2022-preprint.pdf&quot;&gt;joint published paper at SoCC ‘22 with DoorDash&lt;/a&gt; on circuit breaker usage for fault tolerance in ‘22;&lt;/li&gt;
  &lt;li&gt;Filibuster now provides an &lt;a href=&quot;https://plugins.jetbrains.com/plugin/21057-filibuster-for-macos-linux-&quot;&gt;integrated development experience in Java with JUnit integration and IntelliJ plugin integration&lt;/a&gt; in ‘23;&lt;/li&gt;
  &lt;li&gt;Filibuster has been the focus of multiple students’ undergraduate research projects;&lt;/li&gt;
  &lt;li&gt;Filibuster is the subject of an upcoming master’s thesis on fault injection for database calls; and&lt;/li&gt;
  &lt;li&gt;developers are actually starting to use Filibuster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, Filibuster looks like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/filibuster-overview.png&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;It’s not quite the same as my MSR proposal — which was quite aspirational and covered type checking, fault injection and more — but what we have is significantly more focused, as necessary for a Ph.D., and works on real application code and finds real bugs through its integrated fault injection development experience.&lt;/p&gt;

&lt;p&gt;It’s come a long way and I have to thank my colleagues at CMU, colleagues at DoorDash, and my advisor Heather for the motivation and support that was required to push it forward.&lt;/p&gt;

&lt;h3 id=&quot;epilogue&quot;&gt;Epilogue&lt;/h3&gt;

&lt;p&gt;I think a lot about whether my Ph.D. experience was worth it or not as it slowly comes to a close.&lt;/p&gt;

&lt;p&gt;First, I did not work on exactly what I thought I would, and therefore my Ph.D. ended up being quite different than what I thought my Ph.D. would be.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;I started in CRDTs; I left that area and moved to program analysis and testing.&lt;/li&gt;
  &lt;li&gt;This made sense for me, but only in retrospect: I had done testing for years before moving to Basho and then did distributed systems testing at Basho.&lt;/li&gt;
  &lt;li&gt;Testing was a thing I was good at before I even started a Ph.D.: as my initial blog post from 10 years ago indicates, it was the motivation for that post.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, &lt;em&gt;for me&lt;/em&gt;, it makes sense, but it &lt;em&gt;was not the thing I signed up to do and wasn’t what I set out to do.&lt;/em&gt;  This wasn’t a dealbreaker for me — I feel like I discovered my true calling — but, it’s something that happened.&lt;/p&gt;

&lt;p&gt;When it comes to the &lt;em&gt;negatives&lt;/em&gt; of the Ph.D., they are quite clear:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;I literally started my life over with nothing but a suitcase upon returning to the United States in Pittsburgh, at 36 years old.&lt;/li&gt;
  &lt;li&gt;I was a Senior Software Engineer in 2015 and forfeited a (potentially lucrative) tech salary for almost 8 years to live as a grad student.  Given this — and if I ever wanted to be a professor full-time that lived in a city — it’s highly unlikely I’ll ever be able to make any substantial purchases in my life: for example, real estate.&lt;/li&gt;
  &lt;li&gt;The stress during my Ph.D. was so high that I developed an autoimmune disorder (e.g., celiac disease, lactose intolerance, IBS) that severely limits my ability to travel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;em&gt;impact of these negatives&lt;/em&gt; on my goals are quite clear:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;I’ve given up my dreams of being a professor in France: it’s just too late with too many restrictions.&lt;/li&gt;
  &lt;li&gt;I no longer travel to conferences and visit universities, the things that I enjoyed most as a researcher.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In contrast, the &lt;em&gt;positives&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;I was able to travel the world for a brief period as a researcher with minimal responsibility, which resulted in some great experiences, great friendships, and broadening of the mind.&lt;/li&gt;
  &lt;li&gt;Most importantly, I got to spend several years of my life working on a problem that I thought was extremely interesting and which, for a short time, was my entire world.  I examined the problem completely, and contributed what I thought was a solution to it.  It was all me, and it’s something I can always look back on and be proud of.  Whether or not it’s forgotten is up to time, but I know that I did everything that I could to investigate the problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What a fucking journey, that’s all I can say.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/trey.gif&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p style=&quot;text-align:center&quot;&gt;
&lt;i&gt;&quot;...waiting for the time when I can finally say, this has all been wonderful, but now I&apos;m on my way...&quot;&lt;/i&gt;
&lt;/p&gt;

</description>
				<pubDate>Sun, 27 Aug 2023 08:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/2023/08/27/3-10-years.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/2023/08/27/3-10-years.html</guid>
			</item>
		
			<item>
				<title>Nested RPCs with Filibuster&apos;s UX</title>
				<description>&lt;p&gt;I recently cleaned up Filibuster’s UX to display nested RPCs a bit more clearly.  Now, it will indicate using an arrow when RPCs are the result of other RPCs to dependent, transitive, downstream services.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/filibuster-nested-rpcs.png&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;
</description>
				<pubDate>Sun, 27 Aug 2023 02:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/08/27/02-nested-rpcs.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/08/27/02-nested-rpcs.html</guid>
			</item>
		
			<item>
				<title>Testing Application Resilience in the Presence of Service Tiers</title>
				<description>&lt;p&gt;Many microservice applications classify their dependencies into different &lt;a href=&quot;https://thenewstack.io/how-service-tiers-can-help-to-avoid-microservices-disasters/&quot;&gt;Service Tiers&lt;/a&gt;.  Perhaps your developers are working on a Tier 1 service and want to make sure that they are tolerant to any failure of Tier 2+ services.  How can they do that?&lt;/p&gt;

&lt;p&gt;Well, if they’ve written a functional test that can be run with Filibuster, you can use a new feature of Filibuster called &lt;a href=&quot;https://github.com/filibuster-testing/filibuster-java-instrumentation/blob/main/src/main/java/cloud/filibuster/junit/filters/FilibusterFaultInjectionFilter.java&quot;&gt;&lt;em&gt;fault injection filters&lt;/em&gt;&lt;/a&gt;.  Fault injection filters allow you to arbitrarily implement filter functions that are provided with the invoked RPC method to conditionally prevent fault injection.  For example, you could implement a filter that allows faults only for downstream services with a tier greater than your service’s.&lt;/p&gt;

&lt;p&gt;Here, we have a simple purchase method executed by an Order Service.  It issues 7 downstream RPCs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;one to look up the user, which, when it fails, causes the service to return an error.&lt;/li&gt;
  &lt;li&gt;one to validate the user session, which, when it fails, causes the service to return an error.&lt;/li&gt;
  &lt;li&gt;one to look up the user’s cart, which, when it fails, causes the service to return an error.&lt;/li&gt;
  &lt;li&gt;three to look up the possible discounts for the user’s cart.&lt;/li&gt;
  &lt;li&gt;one to send an email to the user using a Tier 2 service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, if I use Filibuster normally, it will execute this test over and over, failing every RPC and every combination of RPCs.  Any test that asserts the purchase is successful will fail: for example, when the downstream RPC to look up the user is failed, the test will fail because the service returns an error.  In this case, I’m supposed to return an error, and Filibuster provides an API that allows you to say this precisely: when that downstream RPC fails, I won’t be able to complete the purchase and will get this error.&lt;/p&gt;

&lt;p&gt;But, perhaps I want to verify that &lt;em&gt;if all of my Tier 2 dependencies fail&lt;/em&gt;, the system will still return success.&lt;/p&gt;

&lt;p&gt;Since sending the user an email is done using a Tier 2 service, I can easily implement a filter that states that I only want to inject faults on Tier 2 services, and that the email service is a Tier 2 service.  Then, when I run Filibuster, it will only inject faults on Tier 2 services, and any failure of my test indicates that my service is not tolerant to a Tier 2 service failure: it returned failure when the system should have tolerated that failure.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/filibuster-service-tiers.png&quot; style=&quot;width: 100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now, my test passes without any modifications, as expected: my code is tolerant to the failure of any downstream dependency in a service tier greater than 1!&lt;/p&gt;
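
&lt;p&gt;To make the tier check concrete, here is a minimal sketch written in Python rather than against Filibuster’s Java API. It is not the filter interface linked above: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SERVICE_TIERS&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Rpc&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;should_inject_fault&lt;/code&gt; are hypothetical names, the service and method names are made up, and treating the discount service as Tier 1 is an assumption for illustration only.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical tier assignments; names and tiers are assumptions, not Filibuster data.
SERVICE_TIERS = {
    'user': 1,
    'session': 1,
    'cart': 1,
    'discount': 1,  # assumed Tier 1 for this sketch
    'email': 2,
}

OWN_TIER = 1  # assumed tier of the Order Service under test


@dataclass(frozen=True)
class Rpc:
    service: str  # downstream service being called
    method: str   # RPC method name


def should_inject_fault(rpc, own_tier=OWN_TIER):
    """Allow fault injection only for callees in a higher-numbered tier."""
    # Unknown services default to the caller's own tier, so they are never faulted.
    callee_tier = SERVICE_TIERS.get(rpc.service, own_tier)
    return callee_tier > own_tier


# The seven downstream RPCs issued by the purchase method.
purchase_rpcs = [
    Rpc('user', 'GetUser'),
    Rpc('session', 'Validate'),
    Rpc('cart', 'GetCart'),
    Rpc('discount', 'GetDiscount'),
    Rpc('discount', 'GetDiscount'),
    Rpc('discount', 'GetDiscount'),
    Rpc('email', 'Send'),
]

# Only RPCs that pass the filter are candidates for fault injection.
faultable = [rpc for rpc in purchase_rpcs if should_inject_fault(rpc)]
```

&lt;p&gt;Under these assumptions, only the email RPC remains a candidate for fault injection, so a test that asserts a successful purchase should keep passing if the service really does tolerate Tier 2 failures.&lt;/p&gt;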
</description>
				<pubDate>Sun, 27 Aug 2023 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/08/27/01-service-tiers.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/08/27/01-service-tiers.html</guid>
			</item>
		
			<item>
				<title>Testing Applications for Resilience Bugs vs Testing Applications for Resilience</title>
				<description>&lt;p&gt;Testing microservice applications that are designed to be resilient, in order to find application bugs related to resilience, is difficult. This is different from testing microservice applications for resilience.&lt;/p&gt;

&lt;p&gt;Consider a microservice application using a custom RPC client that contains circuit breakers, and retry logic, and other resilience measures. Now, think about how we might want to test our service to see what it does if one of the RPCs that the service issues fails.&lt;/p&gt;

&lt;p&gt;Naively, I could inject a single fault for that RPC, perhaps in the network layer or maybe in the application layer. What will happen? Probably nothing. Why? Because the request is immediately retried.  OK, so now I’m going to inject a bunch of failures by taking the service it’s calling offline. Now I’ve triggered the circuit breaker and none of the RPCs are even being issued. Now, my application is probably broken.&lt;/p&gt;

&lt;p&gt;This is what I think of as an unprincipled fault injection test because my test targeted different things, all at once, and I didn’t learn much from running the test. First, I tested my service’s RPC library’s retry mechanism. Then, I tested my circuit breakers opening. Then, I tested what my service does when the circuit breaker is open.&lt;/p&gt;

&lt;p&gt;In developing Filibuster, I’ve been thinking hard about how to approach these different testing scenarios. We address this challenge in stages.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;First, we disable the resilience behavior in the RPC framework when testing, so we can identify application bugs in isolation without fighting automatic retries and risking that circuit breakers open during testing. This is key when running a lot of different failure cases, locally, in rapid succession (100+ failure cases a minute, depending on the service). Here, we synthetically inject the open circuit breaker and the timeout exception in a way that allows us to find bugs in the ways errors are handled at the service level, without relying on injecting enough faults to trip the circuit breaker or waiting for timeouts to actually fire.  Then, once we have ensured that the timeout exceptions are handled correctly, we proceed to inducing timeouts via waiting.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Second, we test the RPC framework independently to test its built-in resilience mechanisms.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Third, we test the application, using the RPC framework, to identify if the resilience mechanisms are properly implemented, integrated, and configured in the application — and, that the application does what it should when all of the retries fail, timeouts fire, etc. By the time you get here, compositionally, you should know precisely what is going to happen and you’re looking to reject the hypothesis.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
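
&lt;p&gt;The first stage can be illustrated with a small sketch of why retries hide a single injected fault. This is a toy in Python, not Filibuster or any real RPC framework: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FlakyRpc&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;call_with_retries&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;handle_request&lt;/code&gt; are hypothetical names, and the retry loop is only an assumed stand-in for a client’s retry policy.&lt;/p&gt;

```python
class FlakyRpc:
    """Fake downstream RPC that fails for its first `faults` calls."""

    def __init__(self, faults):
        self.faults = faults
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls > self.faults:
            return 'ok'
        raise ConnectionError('injected fault')


def call_with_retries(rpc, attempts):
    """Stand-in for an RPC client's retry policy: try up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return rpc()
        except ConnectionError as error:
            last_error = error
    raise last_error


def handle_request(rpc, attempts):
    """Service-level handler: the application code we actually want to test."""
    try:
        return call_with_retries(rpc, attempts)
    except ConnectionError:
        return 'degraded'


# Retries enabled: the single injected fault is absorbed by the client,
# so the service-level error handler never runs.
masked = handle_request(FlakyRpc(faults=1), attempts=3)

# Retries disabled (attempts=1): the same single fault reaches the handler.
exposed = handle_request(FlakyRpc(faults=1), attempts=1)
```

&lt;p&gt;With retries on, the test learns nothing about the application’s error handling; with them off, one fault is enough to exercise it.&lt;/p&gt;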

&lt;p&gt;Lots of times we talk about resilience in isolation — staying online when faults happen.&lt;/p&gt;

&lt;p&gt;What we often do not talk about is the effects of RPCs failing occasionally and the impact they have on our application(s).&lt;/p&gt;
</description>
				<pubDate>Wed, 23 Aug 2023 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/08/23/testing-for-resilience.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/08/23/testing-for-resilience.html</guid>
			</item>
		
			<item>
				<title>Revisiting: Chaos Engineering or Software Testing?</title>
				<description>&lt;p&gt;&lt;em&gt;This addresses comments from LinkedIn on my previous blog post called &lt;a href=&quot;https://christophermeiklejohn.com/filibuster/2023/08/10/chaos-engineering.html&quot;&gt;Chaos Engineering or Software Testing?&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Last week, I took a bit of heat for my post on chaos engineering for a couple of reasons. First, it was said that I didn’t actually understand what chaos engineering was and thought the technique was purely random. Second, it was said that I failed to understand that chaos engineering has a long history in the field of resilience engineering.&lt;/p&gt;

&lt;h3 id=&quot;commentary&quot;&gt;Commentary&lt;/h3&gt;

&lt;p&gt;First, I said chaos engineering was random.  In my post, the context I was referring to was the initial incarnation of chaos engineering a la “chaos monkey” with my textual reference to 2007. The initial chaos monkey &lt;em&gt;was&lt;/em&gt; random in its instance termination, because it was trying to mimic AWS instance termination, which, from the consumer’s point of view, was random.  In retrospect, I &lt;em&gt;should have been more precise&lt;/em&gt;, as I was referring to this specifically because I had recently seen many posts about using &lt;em&gt;this specific style&lt;/em&gt; of chaos engineering on LinkedIn under this hashtag.&lt;/p&gt;

&lt;p&gt;However, when it comes to random exploration, I was using it to draw a distinction from systematic exhaustive exploration.  What I’ve (personally) found is that talking to people about what could fail and designing experiments around it isn’t nearly as useful as a systematic exploration of all the RPCs that a service (or services) invokes on its dependencies.  Often, the weak point is an RPC or dependency that the developer doesn’t remember is being called as part of some process. Hence, I prefer using the computers to figure out what to do and have them do it automatically.&lt;/p&gt;

&lt;p&gt;And, let me tell you, microservices issue a &lt;em&gt;lot&lt;/em&gt; of RPCs and developers are consistently surprised: “Oh, I forgot we were calling that service.”&lt;/p&gt;

&lt;p&gt;Second, &lt;a href=&quot;https://christophermeiklejohn.com/filibuster/2022/03/17/what-is-chaos-engineering.html&quot;&gt;I’m well aware of the history of resilience engineering and chaos engineering&lt;/a&gt;.  While I personally sometimes feel that its history has been &lt;a href=&quot;https://en.wikipedia.org/wiki/Retroactive_continuity&quot;&gt;&lt;em&gt;retconned&lt;/em&gt;&lt;/a&gt; for the purposes of a coherent story with central planning, I would be remiss not to mention that it’s got a rich history outside of its resilience engineering background: &lt;a href=&quot;https://dl.acm.org/doi/abs/10.1145/2367376.2371297&quot;&gt;Jesse’s history as a volunteer firefighter and controlled burns, Google’s GameDays, etc.&lt;/a&gt; I absolutely think it’s useful for making sure that on-call works properly, fail-over of AZ/region/DCs works, people get paged, &lt;a href=&quot;https://www.oreilly.com/library/view/chaos-engineering/9781492043850/&quot;&gt;the pager system isn’t located centrally in the system you take down in your experiment&lt;/a&gt;, whatever. In fact, I’ve cited Jesse’s papers, Google’s papers and talks, etc. all with high recommendations in my own &lt;a href=&quot;https://dl.acm.org/doi/abs/10.1145/3472883.3487005&quot;&gt;papers&lt;/a&gt; and &lt;a href=&quot;http://cmu-313.github.io&quot;&gt;lectures&lt;/a&gt; that I’ve given on the topic at CMU.&lt;/p&gt;

&lt;h3 id=&quot;disagreements&quot;&gt;Disagreements&lt;/h3&gt;

&lt;p&gt;What I &lt;em&gt;do&lt;/em&gt; disagree with is using chaos engineering in place of isolated functional software testing. More specifically, I shouldn’t be turning a server or availability zone off to make sure that the RPCs I am issuing from a different service in a different availability zone handle a Connection Error Exception: I can easily do that in the code. The way I see it, you are issuing RPCs (library calls) and it’s your duty to test your application for what happens when those RPCs fail (exceptions), just like any other library call you’d test. If developers use RPCs like ordinary library calls, they should be tested like library calls.&lt;/p&gt;
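
&lt;p&gt;As a rough illustration of testing an RPC like a library call, here is a sketch in Python using the standard library’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unittest.mock&lt;/code&gt;. It does not use Filibuster: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UserClient&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;purchase&lt;/code&gt; are hypothetical names, and the error payload is an assumption made up for the example.&lt;/p&gt;

```python
from unittest import mock


class UserClient:
    """Hypothetical RPC client; a real one would make a network call here."""

    def get_user(self, user_id):
        raise NotImplementedError('network call elided in this sketch')


def purchase(client, user_id):
    """Application code under test: returns an error when the lookup fails."""
    try:
        user = client.get_user(user_id)
    except ConnectionError:
        return {'status': 'error', 'reason': 'user lookup unavailable'}
    return {'status': 'ok', 'user': user}


client = UserClient()

# Fail the RPC in code, exactly as we would for any other library call.
with mock.patch.object(client, 'get_user', side_effect=ConnectionError('refused')):
    failed = purchase(client, 42)

# The same call succeeding, for comparison.
with mock.patch.object(client, 'get_user', return_value='alice'):
    succeeded = purchase(client, 42)
```

&lt;p&gt;No server or availability zone was turned off: the connection error is raised at the call site, and the test checks what the application returns.&lt;/p&gt;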

&lt;p&gt;To be clear: I’m not saying this is the only testing that you do; it’s just one step along a path. Learn what things return when they are failing and return errors; then, figure out what the larger system impact is: compositional testing and reasoning. There is no use injecting a fault in production before first finding out whether your code has a (correct) error handler.&lt;/p&gt;

&lt;p&gt;(&lt;a href=&quot;https://scholar.harvard.edu/files/waldo/files/waldo-94.pdf&quot;&gt;Waldo is going to want to murder me&lt;/a&gt;, but this is the reality. Thankfully, there are &lt;a href=&quot;http://filibuster.cloud&quot;&gt;tools&lt;/a&gt; for testing RPCs for failures, timeouts, high latency, etc.)&lt;/p&gt;

&lt;p&gt;You don’t use chaos engineering to randomly delete files to see if your application code handles a File Not Found exception, do you?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For those interested: learn more about &lt;a href=&quot;https://christophermeiklejohn.com/filibuster/2022/03/17/what-is-chaos-engineering.html&quot;&gt;chaos engineering and why I don’t think chaos engineering is a good substitute for software testing&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</description>
				<pubDate>Sun, 13 Aug 2023 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/08/13/revisiting-chaos.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/08/13/revisiting-chaos.html</guid>
			</item>
		
			<item>
				<title>Building Tech-Transferrable Research Software</title>
				<description>&lt;p&gt;The distance between research software and software that can be used by real software engineers is — quite honestly — enormous. It just doesn’t get talked about enough, and it should.&lt;/p&gt;

&lt;h3 id=&quot;finding-bugs-in-zookeeper&quot;&gt;Finding bugs in Zookeeper&lt;/h3&gt;

&lt;p&gt;There’s a long lineage of work (short: papers) on finding concurrency bugs in systems like Zookeeper, with a lot of it done by the University of Chicago.&lt;/p&gt;

&lt;p&gt;Why is that?&lt;/p&gt;

&lt;p&gt;Well, there are multiple factors, but perhaps the most interesting to me is that some time in the past, someone found a way to instrument Zookeeper in such a way that they could control thread scheduling and each RPC sent. Therefore, when new Ph.D. students come along, they can build on this existing work and focus only on the &lt;em&gt;algorithmic&lt;/em&gt; improvements.&lt;/p&gt;

&lt;p&gt;For example, SAMC (OSDI ‘14) had semantics-aware dynamic reduction policies like local/crash-message independence; FlyMC (EuroSys ‘19) had parallel flips and state symmetry. Same system, more bugs found each paper. The bugs just keep getting more complicated too, as it’s a battle of attrition. From FlyMC: “The second bug was reported to appear once every 500 unit test cycles […] successfully discovered 5 new bugs […] bug depths ranged from 9 to 30 events.”&lt;/p&gt;

&lt;p&gt;In short, Zookeeper bugs get more and more complicated, we come up with better tricks to find them; and Zookeeper keeps getting better every day. One might even ask: when will Zookeeper run out of bugs to be found, if ever?&lt;/p&gt;

&lt;h3 id=&quot;microservices&quot;&gt;Microservices&lt;/h3&gt;

&lt;p&gt;In my own Ph.D. work, I’ve been building software for testing microservices applications where one of the key tenets is that services are developed by different teams. This means that each service can be written in a different style, using different versions of libraries, with different dependencies, using different testing frameworks, using different DI frameworks, etc., etc. This makes mechanization of a “one size fits all” approach to testing… very hard.&lt;/p&gt;

&lt;p&gt;I think there are two implications when it comes to software testing:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;From the industrial perspective, I feel like this is why developers often resort to infrastructure-level testing techniques, because utilizing an application-level testing technique across multiple services is hard, often less-valued, work.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;From the research perspective, it makes developing software that can run on real services very difficult. In short, don’t be so generic that you sacrifice performance, but be generic enough, with just the right hooks, that you can wire yourself into any framework.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;open-questions&quot;&gt;Open Questions&lt;/h3&gt;

&lt;p&gt;This is a skill that Ph.D. students just do not have. In fact, I often fail at it myself, and I was a professional engineer for 15 years in senior roles before starting my Ph.D.&lt;/p&gt;

&lt;p&gt;Tech transfer isn’t a goal for academics and doesn’t seem to be really valued. As a community, we often rely on someone in industry to create a derivative work that works on real things.&lt;/p&gt;

&lt;p&gt;Perhaps this is how research is supposed to be in the field of Computer Science.&lt;/p&gt;

&lt;p&gt;Is that how it should be in the field of Software Engineering?&lt;/p&gt;
</description>
				<pubDate>Sat, 12 Aug 2023 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/08/12/industrial-research.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/08/12/industrial-research.html</guid>
			</item>
		
			<item>
				<title>Chaos Engineering or Software Testing?</title>
				<description>&lt;p&gt;Recently, I’ve seen a lot of posts under the #chaosengineering hashtag on LinkedIn on Netflix-style &lt;em&gt;chaos engineering&lt;/em&gt; and as someone who has been researching resilience in software for 5 years as part of my Ph.D., I thought I would provide my thoughts.&lt;/p&gt;

&lt;p&gt;First, we need to keep things in context. When Netflix was first moving to the cloud and “inventing” chaos engineering, EC2 was unreliable and instances would be reclaimed, fail, or otherwise disappear. Therefore, they wanted to be able to tolerate this. This is key behind their “inject realistic faults” philosophy.&lt;/p&gt;

&lt;p&gt;But, that was 2007 and this is 2023. Then, Kubernetes didn’t exist. Heck, Mesos didn’t even exist. Now, we live in a world where we have more reliable cloud computing services, cluster orchestrators, advanced RPC clients, and all sorts of infrastructure innovations. Randomly crashing nodes now shows us nothing, outside of ensuring that we have reasonable crash/restart policies, liveness/readiness probes, and more than one node.&lt;/p&gt;

&lt;p&gt;When it comes to building resilience in our applications, we need to first think about what we want to test, in relation to our software and supporting infrastructure.&lt;/p&gt;

&lt;p&gt;For example, we might want to test if devs respond correctly to a failure. We also might want to test parts of our infrastructure. We also might want to test our application. A holistic approach addresses all of these, but as good software engineers know, strong testing is compositional, where we build guarantees over time through the testing of individual components and repeat those tests regularly (think: regression tests.)&lt;/p&gt;

&lt;p&gt;For example, if I want to know if my auto-scaling policies work, I target a specific test for this. I don’t run a test that may test my entire organization’s response to a failure, before knowing anything about how the system will handle the failure.&lt;/p&gt;

&lt;p&gt;If I’m looking for application bugs that happen when services are unavailable in a microservice architecture, I don’t randomly crash a server and see what happens. Why? First, it’s very hard to reproduce because code is complex, concurrent, and most importantly nondeterministic. In short, there is no way to easily attach a debugger, reproduce, and solve the problem. Second, it’s just testing too much! Test the service against failures of its dependencies by realistically simulating those failures, and fix the bugs you find.&lt;/p&gt;

&lt;p&gt;Then, and only then, crash the service in a subsequent test to invalidate the hypothesis that your service can handle any failure. Then, expand out further.&lt;/p&gt;

&lt;p&gt;Compositional reasoning: it’s how we reason about programs that we write.&lt;/p&gt;

&lt;p&gt;We test something to determine its behavior, and then we test components that use those components under the behavior we identified.&lt;/p&gt;

&lt;p&gt;Only when we know how everything should work do we run the end to end test of the system where we test all of the components working together. We have an assumption about what will happen: our test is used to possibly invalidate that assumption.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For those interested: learn more about &lt;a href=&quot;https://christophermeiklejohn.com/filibuster/2022/03/17/what-is-chaos-engineering.html&quot;&gt;chaos engineering and why I don’t think chaos engineering is a good substitute for software testing&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</description>
				<pubDate>Thu, 10 Aug 2023 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2023/08/10/chaos-engineering.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2023/08/10/chaos-engineering.html</guid>
			</item>
		
			<item>
				<title>Handing off Maintenance of Partisan</title>
				<description>&lt;p&gt;I’m happy to announce that I am handing off maintenance of &lt;a href=&quot;https://github.com/lasp-lang/partisan&quot;&gt;Partisan&lt;/a&gt;, the distributed runtime system I built as part of my Ph.D. (and my first paper published at CMU!), to &lt;a href=&quot;https://github.com/aramallo/partisan&quot;&gt;Alejandro Ramallo&lt;/a&gt; at Leapsight.&lt;/p&gt;

&lt;p&gt;Alejandro has done tremendous work on Partisan over the last two years: from bug fixes and feature enhancements to performance improvements.&lt;/p&gt;

&lt;p&gt;Alejandro has also brought Partisan to places that I, as a Ph.D. student who works mostly in a lab, never imagined: LO/JACK LATAM, the lost vehicle recovery service, has been using Partisan since 2019 as part of its Magenta Platform, tracking 300k vehicles and 10k devices and receiving over 30M GPS transmissions each day!&lt;/p&gt;

&lt;p&gt;At LO/JACK LATAM, Leapsight uses Partisan as a foundational component for:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Magenta Twin Service: an IoT twins implementation using a modified version of Erleans, an implementation of the Microsoft Orleans virtual actor model for Erlang, modified to use PlumDB for metadata and state storage and Partisan for transport.&lt;/li&gt;
  &lt;li&gt;Magenta Agent Service: an IoT Rule-based system built using PlumDB and Partisan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can learn more about how it’s used here:&lt;/p&gt;

&lt;iframe class=&quot;youtube-player&quot; width=&quot;480&quot; height=&quot;360&quot; src=&quot;//www.youtube.com/embed/XxJ1IS8mo84&quot; frameborder=&quot;1&quot;&gt; &lt;/iframe&gt;

&lt;h2 id=&quot;industrial-usage&quot;&gt;Industrial Usage&lt;/h2&gt;

&lt;p&gt;Leapsight has used Partisan as the backbone of two of their products:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/Leapsight/bondy&quot;&gt;Bondy&lt;/a&gt;, an open source, always-on and scalable application networking platform for modern architectures. Bondy is an all-in-one event and service mesh that offers both Publish-Subscribe (PubSub) and routed Remote Procedure Calls (RPC). Bondy implements the open Web Application Messaging Protocol (WAMP) and is written in Erlang.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://gitlab.com/leapsight/plum_db&quot;&gt;PlumDB&lt;/a&gt;, a database globally replicated via Epidemic Broadcast Trees and lasp-lang’s Partisan.  An offspring of Plumtree and Partisan, a descendant of Riak Core Metadata Store.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;future-plans&quot;&gt;Future Plans&lt;/h2&gt;

&lt;p&gt;Alejandro has a lot in store for Partisan:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;API usage
    &lt;ul&gt;
      &lt;li&gt;Add remote monitoring support of the local node using partisan.&lt;/li&gt;
      &lt;li&gt;Allowing the parallelism setting to apply in a per-channel, not global, setting.&lt;/li&gt;
      &lt;li&gt;Normalize all APIs to use the URI encoded remote pids, references, and node names.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Feature: Causal delivery
    &lt;ul&gt;
      &lt;li&gt;Disk-based storage for failover.&lt;/li&gt;
      &lt;li&gt;Implement reliable causal broadcast
        &lt;ul&gt;
          &lt;li&gt;Ishikawa? &lt;a href=&quot;https://github.com/lasp-lang/ishikawa/blob/master/src/ishikawa.erl&quot;&gt;https://github.com/lasp-lang/ishikawa/blob/master/src/ishikawa.erl&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;LoCaMu?  &lt;a href=&quot;https://www.gsd.inesc-id.pt/~ler/reports/valtersantosea.pdf&quot;&gt;https://www.gsd.inesc-id.pt/~ler/reports/valtersantosea.pdf&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Maintenance: Overlay Tree Construction
    &lt;ul&gt;
      &lt;li&gt;One &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_plumtree_broadcast&lt;/code&gt; server per channel to increase throughput&lt;/li&gt;
      &lt;li&gt;Explore allowing concurrent access to broadcast members by making &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_plumtree_broadcast&lt;/code&gt; server use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ets&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;Implement &lt;a href=&quot;https://asc.di.fct.unl.pt/~jleitao/pdf/srds10-mario.pdf&quot;&gt;Thicket&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;General improvements
    &lt;ul&gt;
      &lt;li&gt;OTP 25 support&lt;/li&gt;
      &lt;li&gt;QUIC Transport&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;changelog&quot;&gt;Changelog&lt;/h2&gt;

&lt;p&gt;Since Alejandro is hard at work on getting the 5.0.0 release of Partisan prepared, I include the incredibly long list of changes, bug fixes, and improvements, planned for release next week, that he and his team completed over the last two years.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks, Alejandro and Leapsight!&lt;/em&gt;&lt;/p&gt;

&lt;h1 id=&quot;changelog-1&quot;&gt;CHANGELOG&lt;/h1&gt;

&lt;h1 id=&quot;v500-beta&quot;&gt;v5.0.0 (beta)&lt;/h1&gt;

&lt;h2 id=&quot;api&quot;&gt;API&lt;/h2&gt;
&lt;p&gt;In general, the API was redesigned to concentrate all functions around two modules: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service&lt;/code&gt;.&lt;/p&gt;

&lt;h4 id=&quot;changes&quot;&gt;Changes&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan&lt;/code&gt; module was repurposed as a replacement for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;erlang&lt;/code&gt; module for use cases related to distribution e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;erlang:nodes/0&lt;/code&gt; -&amp;gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:nodes/0&lt;/code&gt;.
    &lt;ul&gt;
      &lt;li&gt;Several functions previously found in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_monitor&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_util&lt;/code&gt; are now in this module:
        &lt;ul&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:broadcast/2&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:cast_message/3&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:cast_message/4&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:cast_message/5&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:default_channel/0&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:demonitor/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:demonitor/2&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:forward_message/2&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:forward_message/3&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:forward_message/4&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:forward_message/5&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:is_connected/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:is_connected/2&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:is_fully_connected/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:is_local/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:is_pid/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:is_process_alive/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:is_reference/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:make_ref/0&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor/2&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor/3&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor_node/2&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor_nodes/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor_nodes/2&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:node/0&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:nodestring/0&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:node/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:node_spec/0&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:node_spec/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:node_spec/2&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:nodes/0&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:nodes/1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:self/0&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:send_message/2&lt;/code&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Added the following functions:
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:broadcast_members/0&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:broadcast_members/1&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:cancel_exchanges/1&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:exchanges/0&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:exchanges/1&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:get_local_state/0&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:inject_partition/2&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:leave/1&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:member/1&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:members_for_orchestration/0&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:on_down/2&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:on_up/2&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:partitions/0&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:reserve/1&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:resolve_partition/1&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:update_members/1&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:mynode/0&lt;/code&gt; has been replaced by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:node/0&lt;/code&gt; to follow Erlang convention&lt;/li&gt;
  &lt;li&gt;Use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service:myself/0&lt;/code&gt; has been replaced by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:node_spec/0&lt;/code&gt; to disambiguate from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:node/0&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Node&lt;/code&gt; variable name for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;node()&lt;/code&gt; type (as opposed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Name&lt;/code&gt;) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NodeSpec&lt;/code&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;node_spec()&lt;/code&gt; (as opposed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Node&lt;/code&gt;) to disambiguate.&lt;/li&gt;
  &lt;li&gt;Added new module &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_rpc&lt;/code&gt; that will provide an API that mirrors Erlang’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rpc&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;erpc&lt;/code&gt; modules&lt;/li&gt;
  &lt;li&gt;Added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_remote_ref&lt;/code&gt; to encapsulate the creation of references and to add an optional, alternative representation for encoded pids, references and registered names. The module offers all the functions to convert pids, references and names to/from Partisan encoded references.
    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;Alternative representation: When many references are stored in process state or in ets, and especially when they are used as keys, a binary format is preferable to the tuple format because it saves memory and avoids copying the term every time a message is sent between processes (binaries larger than 64 bytes are reference-counted and shared between processes rather than copied). &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_remote_ref&lt;/code&gt; can represent an encoded reference as a binary URI. This is controlled by the config option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;remote_ref_as_uri&lt;/code&gt;, together with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;remote_ref_binary_padding&lt;/code&gt; for cases where the resulting URIs are smaller than 65 bytes.&lt;/p&gt;

        &lt;div class=&quot;language-erlang highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;partisan_remote_ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;from_term&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()).&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partisan_remote_reference&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nonode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nohost&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partisan_process_reference&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&amp;lt;0.1062.0&amp;gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}}&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;partisan_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;remote_ref_as_uri&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;ok&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;partisan_remote_ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;from_term&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()).&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;partisan:pid:nonode@nohost:0.1062.0&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;partisan_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;remote_ref_binary_padding&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;ok&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;partisan_remote_ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;from_term&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()).&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;partisan:pid:nonode@nohost:0.1062.0:&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;        &lt;/div&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;peer-membership&quot;&gt;Peer Membership&lt;/h2&gt;

&lt;h4 id=&quot;fixes&quot;&gt;Fixes&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;Replaced the use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;state_orset&lt;/code&gt; CRDT with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;state_awmap&lt;/code&gt; to fix an issue that occurs when a node crashes and restarts with a different IP address, e.g. when deploying in K8s. Because the membership set contains &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;node_spec()&lt;/code&gt; objects, which include the IP address, we ended up with duplicate entries for the node. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;state_awmap&lt;/code&gt; tries to solve that by mapping &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;node() =&amp;gt; state_mvregister(node_spec())&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Fixes several bugs related to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;leave&lt;/code&gt; operation in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_pluggable_peer_service_manager&lt;/code&gt;:
    &lt;ul&gt;
      &lt;li&gt;Added a missing call to update the membership set during leave&lt;/li&gt;
      &lt;li&gt;Fixed a concurrency issue whereby, on self leave, the peer service server would restart before being able to send the new state to the cluster peers, and thus the node would remain a member on all other nodes.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Resolves an issue in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_plumtree_broadcast&lt;/code&gt; where the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;all_members&lt;/code&gt; set was not updated when a member was removed.&lt;/li&gt;
  &lt;li&gt;Resolves the issue where the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_plumtree_broadcast&lt;/code&gt; was not removing the local node from the broadcast member set.&lt;/li&gt;
  &lt;li&gt;Gen Behaviours take new option &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;channel&lt;/code&gt; if defined.&lt;/li&gt;
  &lt;li&gt;Fixed implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;on_up&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;on_down&lt;/code&gt; callback functions in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_pluggable_peer_service_manager&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
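
&lt;p&gt;The first fix above (duplicate membership entries after a restart with a new IP address) can be illustrated with a minimal sketch. It is written in Python rather than Erlang, uses plain containers in place of the CRDTs, and ignores concurrency entirely; in particular, a multi-value register can hold several concurrent values, which a plain dictionary cannot. The node name and IP addresses are made up for the example.&lt;/p&gt;

```python
# Illustrative only: plain Python containers stand in for the CRDTs.
# The node name and IP addresses below are hypothetical.

# A node spec is modelled as (node_name, ip_address).
before_restart = ("node1@app", "10.0.0.5")
after_restart = ("node1@app", "10.0.0.9")  # same node, new IP after a restart

# Set-based membership: the two specs differ, so both are kept.
membership_set = set()
membership_set.add(before_restart)
membership_set.add(after_restart)

# Map-based membership keyed by node name: the newer spec replaces the older one.
membership_map = {}
for name, ip in (before_restart, after_restart):
    membership_map[name] = (name, ip)

print(len(membership_set))  # 2
print(len(membership_map))  # 1
```

&lt;p&gt;Keying the membership by node name is what keeps a restarted node from being counted twice, since the IP address is no longer part of the entry’s identity.&lt;/p&gt;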

&lt;h4 id=&quot;changes-1&quot;&gt;Changes&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Added function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service_manager:member/1&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Replaced the use of in-process sets in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plumtree_broadcast_backend&lt;/code&gt; with an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ets&lt;/code&gt; table for outstanding messages, keeping the gen_server stack lean and avoiding garbage collection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;peer-connection-management&quot;&gt;Peer Connection management&lt;/h2&gt;

&lt;h4 id=&quot;fixes-1&quot;&gt;Fixes&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;Fixes a bug where connections were not properly killed during a leave&lt;/li&gt;
  &lt;li&gt;Split TLS options for client and server roles
    &lt;ul&gt;
      &lt;li&gt;Removed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tls_options&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;Added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tls_client_options&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tls_server_options&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;changes-2&quot;&gt;Changes&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;New module &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;peer_service_connections&lt;/code&gt; replaces the former process state data structure and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_connection_cache&lt;/code&gt; module. It offers an ets-based solution with use of counters for quick node connection status check&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_connections&lt;/code&gt; has been re-implemented to use ets to increase performance and remove the need for an additional caching feature.
    &lt;ul&gt;
      &lt;li&gt;As a result, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_connection_cache&lt;/code&gt; module has been removed.&lt;/li&gt;
      &lt;li&gt;Checking connection status is now a fast &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ets&lt;/code&gt; lookup operation and leverages &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ets:update_counter/4&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ets:lookup_element/3&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ets:select_count/2&lt;/code&gt; to handle concurrency and minimise copying data into the caller’s process heap.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;process-and-peer-monitoring&quot;&gt;Process and Peer Monitoring&lt;/h2&gt;

&lt;h4 id=&quot;fixes-2&quot;&gt;Fixes&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;A more complete/safe implementation of process monitoring in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_monitor&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;More robust implementation of monitors using the new subscription capabilities provided by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;peer_service&lt;/code&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;on_up&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;on_down&lt;/code&gt; callback functions.
    &lt;ul&gt;
      &lt;li&gt;monitor a node or all nodes&lt;/li&gt;
      &lt;li&gt;use node monitors to signal a process monitor when the remote node is disconnected&lt;/li&gt;
      &lt;li&gt;avoid leaking monitors&lt;/li&gt;
      &lt;li&gt;new supervisor to ensure that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_monitor&lt;/code&gt; is restarted every time the configured &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_peer_service_manager&lt;/code&gt; is restarted.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;changes-3&quot;&gt;Changes&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;New API in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan&lt;/code&gt; module following the same names, signatures and semantics as their counterparts in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;erlang&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net_kernel&lt;/code&gt; modules:
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor/1&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor/2&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor/3&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor_node/2&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor_nodes/1&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan:monitor_nodes/2&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;otp-compatibility&quot;&gt;OTP compatibility&lt;/h2&gt;

&lt;h4 id=&quot;fixes-3&quot;&gt;Fixes&lt;/h4&gt;

&lt;h4 id=&quot;changes-4&quot;&gt;Changes&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;Partisan now requires &lt;strong&gt;OTP24 or later&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;Upgraded &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_gen&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_gen_server&lt;/code&gt; to match the implementation of their OTP24 counterparts&lt;/li&gt;
  &lt;li&gt;Added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_gen_statem&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_gen_fsm&lt;/code&gt; has been deprecated, as it was incomplete and focus shifted to the implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_gen_statem&lt;/code&gt; instead&lt;/li&gt;
  &lt;li&gt;Module &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_mochiglobal&lt;/code&gt; has been removed and replaced by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;persistent_term&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;misc&quot;&gt;Misc&lt;/h2&gt;

&lt;h4 id=&quot;fixes-4&quot;&gt;Fixes&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;Most existing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INFO&lt;/code&gt; level logs have been reclassified as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEBUG&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Fixed type specifications in various modules&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;changes-5&quot;&gt;Changes&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lager&lt;/code&gt; dependency has been removed and all logging is done using the new Erlang &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;logger&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Most uses of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;orddict&lt;/code&gt; module have been replaced by maps for extra performance and better usability&lt;/li&gt;
  &lt;li&gt;Most API options using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;proplists&lt;/code&gt; module have been replaced by maps for extra performance and better usability&lt;/li&gt;
  &lt;li&gt;In several functions the computation of options (merging user provided with defaults, validation, etc.) has been postponed until (and only if) it is needed for extra performance e.g. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_pluggable_peer_service_manager:forward_message&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;More utils in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partisan_util&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ex_doc&lt;/code&gt; (Elixir documentation) rebar plugin&lt;/li&gt;
  &lt;li&gt;Upgraded the following dependencies:
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uuid&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;types&lt;/code&gt;&lt;/li&gt;
      &lt;li&gt;rebar plugins&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Mon, 13 Jun 2022 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/erlang/partisan/2022/06/13/handing-off-partisan.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/erlang/partisan/2022/06/13/handing-off-partisan.html</guid>
			</item>
		
			<item>
				<title>Extending Filibuster to Test Redis</title>
				<description>&lt;p&gt;&lt;em&gt;This blog post was written by &lt;a href=&quot;https://www.linkedin.com/in/eunice-chen-a31220178/&quot;&gt;Eunice Chen&lt;/a&gt;, an undergraduate student who contributed to the Filibuster research project this Spring.  Eunice did an excellent job extending Filibuster with a prototype of Redis support.  She recently graduated and we will miss her!  Congrats, Eunice!&lt;/em&gt;&lt;/p&gt;

&lt;h1 id=&quot;summary&quot;&gt;Summary&lt;/h1&gt;

&lt;p&gt;The automated resilience testing tool, Filibuster, lets developers test microservice applications for resilience against remote service unavailability. 
Filibuster achieves this by instrumenting remote procedure calls (RPC), commonly made using libraries like gRPC and Python’s requests, to perform a systematic exploration of all possible RPC faults for a given microservice application. 
However, the current version of Filibuster does not support fault injection when client libraries for databases are used to issue the remote procedure calls. 
In this article, I discuss work that I performed as part of a student research project in the Composable Systems Lab, part of the Institute for Software Research at CMU, to explore the viability of using Filibuster to test external services that use 3rd party client libraries such as Redis.&lt;/p&gt;

&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;As microservice applications become increasingly commonplace, it is critical to test the behavior of these applications under &lt;em&gt;partial failure&lt;/em&gt;: when one or more of the services that the application depends on fails. 
If the application lacks the necessary error handling, it may completely fail rather than gracefully degrade when one or more of its dependent services fails. 
To address this issue, Filibuster systematically enumerates the possible RPC failures that an application might observe and then subjects the application to these failures, allowing developers to ensure correct operation under these failure scenarios during testing and prior to deployment.&lt;/p&gt;
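
&lt;p&gt;As a rough illustration of what enumerating failures means, the sketch below pairs every call site with every fault type, giving one single-fault scenario per test run. The call-site names and fault names are hypothetical, and the sketch makes no claim about how Filibuster itself discovers call sites, combines faults, or orders its search.&lt;/p&gt;

```python
from itertools import product

# Hypothetical call sites and fault types, named for illustration only.
call_sites = ["users-to-bookings", "users-to-movies"]
fault_types = ["ConnectionError", "Timeout"]


def enumerate_single_faults(sites, faults):
    """Return every (call site, fault) pair: one injected fault per test run."""
    return list(product(sites, faults))


scenarios = enumerate_single_faults(call_sites, fault_types)
for site, fault in scenarios:
    print(f"inject {fault} at {site}")
```

&lt;p&gt;Two call sites and two fault types give four scenarios; the count grows with every additional call site or fault type, which is why automating the enumeration matters.&lt;/p&gt;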

&lt;p&gt;Since microservice architectures are often used for web application backends, it is critical that Filibuster not only test services that communicate with other services, but also test services that communicate with databases, as databases play an important role in data persistence in backend services. 
Expanding Filibuster to support this style of fault injection on databases, specifically Redis, will allow us to create a prototype and inform future research and engineering efforts towards this goal.&lt;/p&gt;

&lt;p&gt;We started this work with two interesting questions:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Could Filibuster be expanded to support database calls?&lt;/li&gt;
  &lt;li&gt;How would this new support for database calls change the way the Filibuster tool interacts with microservice applications?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s work with an example.&lt;/p&gt;

&lt;h1 id=&quot;example-cinema-microservice-application&quot;&gt;Example: Cinema Microservice Application&lt;/h1&gt;

&lt;p&gt;To explore what a design for database fault injection might look like, we decided to start by expanding one of the examples in the Filibuster application corpus with Redis, an in-memory data store often used as a cache. 
The Filibuster application corpus contains several example microservice applications, some of which are reproduced from industry to demonstrate different request patterns. 
Request patterns exemplified by the corpus range from retries on failure, to fallback requests, to default responses on failure, and also cover the different ways requests can be structured and nested.&lt;/p&gt;

&lt;p&gt;The example we chose to expand was one of the Filibuster corpus’ cinema examples. 
It consists of four services: users, bookings, showtimes, and movies. 
These services allow the user to retrieve user information, book movies, retrieve showtime information, and perform other actions one might expect from a cinema site.&lt;/p&gt;

&lt;p&gt;While this application exposes several different REST APIs, one important API allows users to retrieve their movie bookings. 
Starting with a request from the client to the users service, the application makes several GET requests to retrieve their bookings, as shown by the diagram below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/eunice-cinema-example.png&quot; width=&quot;600&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;retrieve-bookings&quot;&gt;Retrieve Bookings&lt;/h2&gt;

&lt;p&gt;First, the users service makes a call to Redis to check if the username exists in the database. 
If the user does not exist, then the users service will return a 404 Not Found. 
If the Redis call returns a successful response, then the users service will contact the bookings service and request the data associated with the given username.&lt;/p&gt;

&lt;p&gt;Once the bookings service receives this GET request, it takes the following steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Check that the username exists in the bookings database via a Redis call.&lt;/li&gt;
  &lt;li&gt;If the username doesn’t exist, then the bookings service will return a 404.&lt;br /&gt;
a. When the users service receives this response, it returns a 404 back to the user.&lt;/li&gt;
  &lt;li&gt;If this call is successful, retrieve the dates on which the user has booked a movie from Redis.&lt;/li&gt;
  &lt;li&gt;For each booking date for that particular user, retrieve the movie identifiers. The movie identifiers are unique to each movie.&lt;/li&gt;
  &lt;li&gt;Return a JSON object containing the movie identifiers associated with each booking date for the user.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After the users service has received a response from the bookings service, it contacts the movies service and retrieves the movie data associated with each movie identifier. Then, the movies service retrieves the movie data from Redis and returns the data to the users service, which then returns the data to the user.&lt;/p&gt;
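
&lt;p&gt;The bookings service steps listed above can be sketched as a single function. This is not the code from the Filibuster corpus: a plain dictionary stands in for Redis, the function name and the error body are hypothetical, and only the username, date, and movie identifier are taken from the test shown later in this post.&lt;/p&gt;

```python
# Hypothetical in-memory data standing in for the bookings data held in Redis.
BOOKINGS = {
    "chris_rivers": {"20151201": ["267eedb8-0f5d-42d5-8f43-72426b9fb3e6"]},
}


def get_bookings(username, store=BOOKINGS):
    """Return (status_code, body) following the bookings service steps above."""
    # Steps 1-2: unknown usernames produce a 404.
    if username not in store:
        return 404, {"error": "user not found"}
    # Steps 3-4: collect the movie identifiers for each booking date.
    by_date = {date: list(movie_ids) for date, movie_ids in store[username].items()}
    # Step 5: return the identifiers grouped by booking date.
    return 200, by_date


print(get_bookings("chris_rivers"))
print(get_bookings("unknown_user"))
```

&lt;p&gt;Every lookup in this function corresponds to a Redis call in the real service, and each of those calls is a place where a fault could be injected.&lt;/p&gt;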

&lt;h2 id=&quot;testing-the-bookings-api&quot;&gt;Testing The Bookings API&lt;/h2&gt;

&lt;p&gt;To test this endpoint, we make a request to the users service that will retrieve a particular user’s bookings, which corresponds to the endpoint &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;users/{username}/bookings&lt;/code&gt;. 
In our test, we look for the bookings associated with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chris_rivers&lt;/code&gt;, so we make a call to the endpoint &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;users/chris_rivers/bookings&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We start with the basic functional test behavior that would exist prior to our adaptation for HTTP RPC fault injection.&lt;br /&gt;
Our test asserts that if the response code from this call is 200 OK, which indicates success, the data returned matches the data associated with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chris_rivers&lt;/code&gt; user.&lt;/p&gt;

&lt;p&gt;To handle the cases where Filibuster has injected faults for the HTTP RPCs, we add conditional code to account for behavior under failure.&lt;br /&gt;
In the case where an RPC fails, our test asks Filibuster to verify whether it injected a fault, &lt;em&gt;i.e.,&lt;/em&gt; that the fault was intentional — this is done using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;was_fault_injected()&lt;/code&gt; conditional. Our test asserts that if the RPC failed, Filibuster did actually inject a fault and the HTTP status matches the expected status when the call fails. 
In this case, since the only errors we expect are a 404 Not Found or a 503 Service Unavailable, we assert that the status code must be 404 or 503.&lt;/p&gt;

&lt;p&gt;We excerpt the test below:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;bookings_endpoint&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;{}/users/{}/bookings&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_service_url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;users&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;chris_rivers&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;users_bookings&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;requests&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bookings_endpoint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timeout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_timeout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;bookings&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users_bookings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status_code&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users_bookings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;20151201&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;rating&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;8.8&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;title&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;Creed&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;uri&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&apos;/movies/267eedb8-0f5d-42d5-8f43-72426b9fb3e6&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}]}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;was_fault_injected&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users_bookings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status_code&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;404&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;503&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To run Filibuster with our functional test, we can run the following command, which injects HTTP exceptions at each RPC site.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;filibuster &lt;span class=&quot;nt&quot;&gt;--functional-test&lt;/span&gt; ./functional/test_user_bookings.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, let’s test our application’s resilience to Redis failures.&lt;/p&gt;

&lt;h2 id=&quot;testing-calls-to-redis&quot;&gt;Testing Calls to Redis&lt;/h2&gt;

&lt;p&gt;For Filibuster to inject faults at each Redis call site, we need to expand the list of faults that Filibuster uses for fault injection.&lt;br /&gt;
To do that, we modify Filibuster to inject connection errors, timeout errors, and response errors for calls that are made using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;execute_command&lt;/code&gt; primitive.&lt;/p&gt;

&lt;p&gt;Below is the additional Filibuster configuration containing these new faults:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;s&quot;&gt;&quot;python.redis&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;s&quot;&gt;&quot;pattern&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;redis&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;.execute&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;_command&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;s&quot;&gt;&quot;exceptions&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
       &lt;span class=&quot;s&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;redis.exceptions.ConnectionError&quot;&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
       &lt;span class=&quot;s&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;redis.exceptions.TimeoutError&quot;&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
       &lt;span class=&quot;s&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;redis.exceptions.ResponseError&quot;&lt;/span&gt;
     &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
   &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once we modify the list of faults that Filibuster uses for fault injection to include Redis, we can rerun our Filibuster command from before. 
Filibuster will systematically inject these Redis errors (along with the HTTP RPC faults) one by one, then in combinations, while repeatedly executing our functional test.&lt;/p&gt;

&lt;p&gt;For example, in one test, Filibuster might inject a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ConnectionError&lt;/code&gt; exception when the user retrieves the user metadata. 
In another test, it may try to inject a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TimeoutError&lt;/code&gt; exception when the bookings service retrieves booking dates and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ResponseError&lt;/code&gt; exception when the bookings service retrieves movie identifiers. 
Filibuster will test all possible combinations of these errors in the different services until the fault space is exhausted.&lt;/p&gt;
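
&lt;p&gt;To get a rough sense of how quickly this fault space grows, the following is a minimal sketch, not Filibuster’s actual search algorithm, that enumerates every way of assigning one of the three Redis exceptions (or no fault) to three call sites. The call-site labels and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;enumerate_fault_plans&lt;/code&gt; helper are hypothetical names introduced for illustration; Filibuster discovers call sites dynamically and prunes redundant executions, so its real test count differs.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from itertools import product

# The three exceptions from the configuration above.
REDIS_FAULTS = [
    "redis.exceptions.ConnectionError",
    "redis.exceptions.TimeoutError",
    "redis.exceptions.ResponseError",
]

# Hypothetical call-site labels, taken from the prose above.
CALL_SITES = ["user metadata", "booking dates", "movie identifiers"]


def enumerate_fault_plans(call_sites, faults):
    """Yield every assignment of no fault (None) or one fault per call site."""
    options = [None] + list(faults)
    for choice in product(options, repeat=len(call_sites)):
        # Skip the fault-free run: that is the original functional test.
        if any(fault is not None for fault in choice):
            yield dict(zip(call_sites, choice))


plans = list(enumerate_fault_plans(CALL_SITES, REDIS_FAULTS))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With three call sites and three exceptions, this naive enumeration yields 4 × 4 × 4 − 1 = 63 fault plans, which illustrates why writing such tests by hand does not scale.&lt;/p&gt;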

&lt;p&gt;Because the application does not handle any Redis exceptions, it returns a 500 Internal Server Error whenever one of these faults is injected. 
Instead of altering our functional test to allow for a 500 Internal Server Error, we want the service to return a 424 Failed Dependency if one of its dependencies, in this case Redis, is down.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;booking_ids&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hgetall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;redis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exceptions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RedisError&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;raise&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FailedDependency&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Before, our test asserted that if the call failed, then the fault was intentional and the HTTP status matched the expected status. 
We change our test to expect a 424 response code in addition to the 404 and 503 responses it already allowed:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;was_fault_injected&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users_bookings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;status_code&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;404&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;503&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;424&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When the test execution fails, Filibuster generates a counterexample file that contains the specific fault(s) that caused the failure. 
Counterexamples allow the developer to rerun the specific generated (and failed) test and attach an interactive debugger. 
To use the counterexample to rerun a failed test, we supply it to the Filibuster command as follows:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;filibuster &lt;span class=&quot;nt&quot;&gt;--functional-test&lt;/span&gt; ./functional/test_user_bookings.py &lt;span class=&quot;nt&quot;&gt;--counterexample-file&lt;/span&gt; counterexample.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once run, we see that the previously failing test now passes.&lt;br /&gt;
Therefore, we know that we have fixed this particular bug.&lt;/p&gt;

&lt;p&gt;Finally, we can run Filibuster again and test for the whole default set of failures as well:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;filibuster &lt;span class=&quot;nt&quot;&gt;--functional-test&lt;/span&gt; ./functional/test_user_bookings.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From our output, reproduced below, we can now see that everything passes.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;FILIBUSTER] &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;INFO]: Number of tests attempted: 52
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;FILIBUSTER] &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;INFO]: Number of &lt;span class=&quot;nb&quot;&gt;test &lt;/span&gt;executions ran: 52
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;FILIBUSTER] &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;INFO]: Test executions pruned with only dynamic pruning: 21
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;FILIBUSTER] &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;INFO]: Total tests: 73
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;FILIBUSTER] &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;INFO]: 
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;FILIBUSTER] &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;INFO]: Time elapsed: 9.103456020355225 seconds.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because Filibuster exhausts the failure space and everything now passes, we know that the application behaves exactly as specified by the functional test. 
Specifically, we know that the endpoint returns the correct data (in the case of a 200 OK response), or that its failure behavior matches the expected behavior (i.e. a 404, 503, or 424 response). 
In either case, because we have correctly handled any exceptions that may arise, we can verify that the application does not fail in unexpected ways when one of the services it depends on fails.&lt;/p&gt;

&lt;p&gt;For this example, Filibuster considered 73 tests: 72 generated tests from the original test plus the original test itself. 
Without Filibuster, developers would have to write 72 tests manually in order to exhaust the fault space. 
However, Filibuster does this automatically. In doing so, it is able to prune 21 redundant tests, so only 52 are actually executed (for more information on how this is done, see &lt;a href=&quot;https://christophermeiklejohn.com/filibuster/2021/10/14/filibuster-4.html&quot;&gt;this post&lt;/a&gt;). 
It completes this entire process in only 9.1 seconds, allowing developers to test their code quickly.&lt;/p&gt;

&lt;h1 id=&quot;conclusions-and-future-work&quot;&gt;Conclusions and Future Work&lt;/h1&gt;

&lt;p&gt;When we started this work, we had two questions:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Can Filibuster be expanded to support database calls?&lt;/li&gt;
  &lt;li&gt;How would this new support for database calls change the way the Filibuster tool interacts with microservice applications?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After building our Redis prototype, we can now begin to answer them.&lt;/p&gt;

&lt;p&gt;As demonstrated by the example above, we are, for the most part, able to extend Filibuster to accommodate database calls. 
While our first prototype of this in Filibuster uses Python’s Redis library, our early results indicate that extending Filibuster to more languages and database implementations is a promising direction for future work. 
Thus, the answer to our first question is &lt;em&gt;yes, we can expand Filibuster to support database calls.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While this prototype demonstrated that many of the design decisions made in Filibuster itself were correct, we also discovered several ways in which we believe we need to modify Filibuster to support further, in-depth testing of database clients.  One specific way is simulating different types of failures – not spurious faults – in Redis, based on the Redis method that is being called. We demonstrate this using an example below.&lt;/p&gt;

&lt;p&gt;With respect to HTTP RPCs, whenever we want to inject a failure that doesn’t cause an exception, we can simply return an HTTP response to the caller containing the associated status code to indicate failure. gRPC is similar: a single gRPC return type is provided, with a status field set to indicate the type of failure. However, with Redis, different return types are used depending on the Redis API used.&lt;/p&gt;

&lt;p&gt;For example, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hget&lt;/code&gt; method, which gets the value of a hash field, may return a different type of value than the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sismember&lt;/code&gt; method, which determines whether a given value is a member of a set. 
The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hget&lt;/code&gt; method can return strings, integers, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;None&lt;/code&gt;, whereas the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sismember&lt;/code&gt; method returns only 1 or 0 depending on whether the value is or is not a member of a set. 
If the user calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;redis.hget(hash_name, id)&lt;/code&gt;, and the id does not exist in the database, then Redis will simply return &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;None&lt;/code&gt;. 
However, if the user had instead called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;redis.sismember(set_name, id)&lt;/code&gt; for an id that didn’t exist, then the response would be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0&lt;/code&gt; (&lt;em&gt;i.e.,&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;False&lt;/code&gt;). 
Since what constitutes a failed response varies based on the Redis method called, we are exploring different ways Filibuster can simulate those failures. 
In doing so, we can expand the way the Filibuster tool interacts with microservice applications.&lt;/p&gt;
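
&lt;p&gt;One way to picture the problem is a per-method table of the value that is returned when nothing is found. The sketch below is purely illustrative: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;EMPTY_RESULTS&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;simulated_failure_value&lt;/code&gt; are hypothetical names that are not part of Filibuster or the Redis library, and the values simply mirror the two examples above.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical table: the value each Redis client method yields when the
# requested item is absent, following the examples described in the post.
EMPTY_RESULTS = {
    "hget": None,     # the hash field does not exist
    "sismember": 0,   # the value is not a member of the set
}


def simulated_failure_value(method_name):
    """Return the not-found value to hand back for a given Redis method."""
    if method_name not in EMPTY_RESULTS:
        # No single failure value fits every method, so refuse to guess.
        raise KeyError("no simulated failure defined for " + method_name)
    return EMPTY_RESULTS[method_name]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A fault injector built this way would need one entry per Redis method, which is the extra burden that a single HTTP status code or gRPC status field avoids.&lt;/p&gt;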

&lt;p&gt;Through this work, we have extended Filibuster to support Python’s Redis library. 
In doing so, we have learned that it is feasible for Filibuster to be successfully extended to more languages and database implementations. 
At the same time, we raised interesting new questions and have found new directions for the Filibuster tool to grow in.&lt;/p&gt;
</description>
				<pubDate>Thu, 02 Jun 2022 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2022/06/02/extending-filibuster-to-redis.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2022/06/02/extending-filibuster-to-redis.html</guid>
			</item>
		
			<item>
				<title>Understanding why Resilience Faults in Microservice Applications Occur</title>
				<description>&lt;p&gt;&lt;em&gt;In this blog post, I present results from May 2021 that were used during the development of 
&lt;a href=&quot;http://filibuster.cloud&quot;&gt;Filibuster&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Update 2022-03-20: Based on some feedback, I realized the section on core observations was not perfectly clear.  I revised that section and corrected a number of spelling and grammatical issues.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Before my automated resilience testing tool,
&lt;strong&gt;&lt;a href=&quot;http://filibuster.cloud&quot;&gt;Filibuster&lt;/a&gt;&lt;/strong&gt;, was targeted at the resilience
testing of microservice applications, it was targeted at the testing of
distributed protocols while they were in the development phase.  I referred to this process as 
&lt;em&gt;Resilience Driven Development&lt;/em&gt;, in which faults would be introduced during development of the 
protocol, forcing developers to adapt the protocol accordingly in real time.&lt;/p&gt;

&lt;p&gt;The first version of Filibuster grew out of my previous Ph.D. work, &lt;a href=&quot;https://github.com/lasp-lang/partisan&quot;&gt;Partisan&lt;/a&gt;.  Partisan was an alternative distributed runtime for Erlang designed specifically for high-performance, large-scale distributed Erlang clusters.  After I developed, benchmarked, and released Partisan as open source, I made the observation that once you control the distributed runtime and all message transmission, you can inject faults before, during, and after processing of messages through message omission, message duplication, and message transformation.  To facilitate this, I provided general &lt;em&gt;hooks&lt;/em&gt; within Partisan that existing 
testing tools could use before, during, and after message processing to perform arbitrary 
transformations.  Along with this, I made sure that Partisan fixed the order of message transmission
to facilitate deterministic replay when developers identified counterexamples.  For demonstration purposes, I wired this up to Erlang QuickCheck/PropEr to test eventual consistency in a small application.  I talked about this at &lt;a href=&quot;https://www.youtube.com/watch?v=KrwhOkiifQ8&quot;&gt;Code BEAM SF 2019.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the time, this observation seemed novel.  It wasn’t; in fact, the authors of ORCHESTRA made a similar observation in &lt;a href=&quot;https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.6485&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;1996&lt;/a&gt;.  Furthering my embarrassment, the second author on the ORCHESTRA paper happens to be the current president of Carnegie Mellon University, where I am currently a Ph.D. student.&lt;/p&gt;

&lt;p&gt;While my technique identified known faults in 2PC, CTP, and 3PC faster than the state-of-the-art, it required that message handlers be annotated with cumbersome annotations.
While these annotations could be automatically generated using static analysis when the programs were written with academic actor languages that used high-level message handlers, these annotations had to be manually written with commonplace actor languages, like Akka and Erlang.
Since Partisan was implemented in Erlang, this posed some problems.
First, it made it almost impossible to compete with existing fault injection approaches used by distributed model checkers (e.g., CMC, SAMC, FlyMC, MoDist) that provided some form of automatic instrumentation and had general state space reduction techniques (e.g., DPOR, symmetry reduction).
Furthermore, the lack of an actual application corpus posed real problems when trying to perform an academic evaluation: there simply are not enough distributed programs written in Erlang.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It was time to take a step back: can we choose a different domain where analysis might be simpler?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I decided to refocus Filibuster on microservice applications.  Based on a potential collaboration with a Pittsburgh-area company that used Python for building their microservice applications, my new implementation was focused on microservices, implemented using Flask, in Python.
Unfortunately, the collaboration did not pan out, so I resorted to an evaluation based on student applications.  This posed a number of problems as well, with the major problem being that these student applications did not represent the types of applications that are being written in industry.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It was time to take a larger step back.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first approach that I tried was to identify existing open-source microservice applications on GitHub, use GitHub’s revision system and issue tracker to identify any previously resolved resilience bugs, reintroduce those bugs, and then try to identify them using the new version of Filibuster.  However, these applications simply do not exist in the open-source ecosystem.  The ones that do exist are mostly tutorial applications that demonstrate how to properly write microservice applications; as one might imagine, these applications typically do not, and should not, contain bugs.&lt;/p&gt;

&lt;p&gt;Next, I decided to take the tutorial applications and introduce resilience bugs.  This is not a novel approach; in fact, it has been taken by several papers &lt;a href=&quot;https://dl.acm.org/doi/abs/10.1145/3472883.3486986&quot;&gt;as recently as SoCC ‘21&lt;/a&gt;, which is, coincidentally, where our first paper on Filibuster was published.  One problem with this approach is finding the bugs: public postmortems and other incident reports rarely discuss the root causes of outages.  This made the identification of realistic bugs, and the retrofitting of existing applications with them, difficult, if not impossible.&lt;/p&gt;

&lt;p&gt;I also tried to use academic bug corpora.  Bug corpora used for automatic program repair, for example,
are typically constructed by harvesting programs from GitHub. These suffer from the same problem that we encountered: open-source microservice applications simply do not exist.  Another possible avenue was a recent academic paper whose primary contribution is a corpus of microservice application bugs; however, it contains &lt;em&gt;no bugs related to the microservice architecture itself&lt;/em&gt;, only bugs that would exist in any monolithic web application.&lt;/p&gt;

&lt;p&gt;I’ve written at length about these problems in a &lt;a href=&quot;https://christophermeiklejohn.com/filibuster/2021/10/02/filibuster-1.html&quot;&gt;previous blog post on the lack of a bug corpus&lt;/a&gt;.  These problems make research into an area that I personally believe is incredibly important quite difficult.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In this post, I am going to discuss a small research study that I performed between August 2020 and December 2020, in order to identify the types of resilience issues that companies experience and the methods that they use to go about identifying them.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;

&lt;p&gt;If you have read my &lt;a href=&quot;https://christophermeiklejohn.com/filibuster/2021/10/02/filibuster-1.html&quot;&gt;previous blog post on corpus construction&lt;/a&gt;, you will know that in order to solve this problem I started by scouring the internet for everything I could find on chaos engineering and resilience engineering.  This served as a starting point for my work – I figured if I wanted to find discussion of resilience bugs and methods used to identify them, I should start by looking for discussion of the most commonly used methods to identify them today.  I identified a combined 77 resources – the majority consisting of presentations from industry conferences (e.g., ChaosConf, AWS re:Invent) with a small number of blog posts.  After eliminating duplicate presentations given at multiple conferences, presentations that contained only an overview of chaos engineering techniques, and presentations that were merely marketing for chaos engineering products, I settled on a list of 50 presentations that we used in our paper on Filibuster.  This list of 50 presentations accounted for 32 different companies of all sizes in all sectors, including, but not limited to, large tech firms (e.g., Microsoft, Amazon, Google), big box retailers (e.g., Walmart, Target), financial institutions (e.g., JPMC, HSBC), and media and telecommunications companies (e.g., Conde Nast, media dpg, and Netflix). 
If you are interested in the full list with direct links to each presentation, it is available from &lt;a href=&quot;https://christophermeiklejohn.com/publications/filibuster-socc-2021.pdf&quot;&gt;our paper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For our published paper on Filibuster, the list was reduced even further.  Here, the focus was only on presentations that met one of the following criteria:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Was a microservice application that contained a resilience bug discussed in enough detail that we could recreate the application in a lab setting and identify the bug?&lt;/li&gt;
  &lt;li&gt;Was a microservice application where a chaos experiment was run in order to identify bugs discussed in enough detail that we could recreate the application in a lab setting and identify the bug through functional testing with fault injection?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This left only four: Audible, Expedia, Mailchimp, and Netflix.  We recreated these examples, identified bugs with Filibuster, and &lt;a href=&quot;https://github.com/filibuster-testing/filibuster-corpus&quot;&gt;released a public research corpus&lt;/a&gt; for researchers that had the same problem as us: lack of a corpus.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In this blog post, I am going to discuss results that were not included in our paper on Filibuster and that I do not intend to publish.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, why talk about it then?  While there is not a lot of evidence to make strong claims – for all of the reasons that I outlined above – I do believe that sharing this preliminary information is useful in starting a conversation about resilience and how we go about testing for it in microservice applications.  I think that this information is useful for framing how we think about resilience and I hope for two possible outcomes:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;That an open discussion around these issues will cause developers to think differently about how they test for application resilience instead of just resorting to chaos engineering, which might not be &lt;a href=&quot;https://christophermeiklejohn.com/filibuster/2022/03/17/what-is-chaos-engineering.html&quot;&gt;the most appropriate technique, as discussed in a previous blog post.&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;That this will inspire sharing of information with researchers so that academics can work on relevant problems around industrial microservice application resilience.  Without knowing details of bugs, application structure, and the like, academic research on microservice resilience will be non-existent or significantly behind the times.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without further ado, let’s get into the details.&lt;/p&gt;

&lt;h2 id=&quot;methods&quot;&gt;Methods&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;My original analysis was extremely ad hoc.  At the time of this analysis, I had not taken a qualitative or quantitative methods course.  I am currently taking one and recommend that all early-stage researchers front-load such a course rather than put it off for more exciting ones, as I wish I had.  In fact, if you are at Carnegie Mellon University, I highly recommend Laura Dabbish’s class on Advanced Research Methods in HCII.  In what follows, I will frame things using the terminology of grounded theory and case study analysis; however, I did not know what these were at the time, and the process followed a “what feels right at the time” style.&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;coding&quot;&gt;Coding&lt;/h3&gt;

&lt;p&gt;For this study, I watched all 77 presentations.  For each presentation, I took notes related to application structure, testing procedures, resilience bugs discussed, the processes each organization used to identify resilience bugs, discussion of bug resolution techniques, and directions for future work.  I noted several quotations from developers during this observation; this information was recorded in a shared Google document.&lt;/p&gt;

&lt;p&gt;From here, I disregarded any discussions of organizational structure, methods or processes and technologies specific to that organization.  From there, I identified what I thought were the core concepts or categories.  I present them here as motivating questions that were used when analyzing my notes.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Experiments.&lt;/strong&gt;
Did the presenter talk about an actual chaos experiment they ran to identify a bug?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Resilience Bugs.&lt;/strong&gt;
Did the presenter talk about a resilience bug they observed or discovered (not bugs related to normal logic errors that could occur in a non-distributed application)?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Resilience Patterns.&lt;/strong&gt;
Did the presenter talk about a pattern (&lt;em&gt;e.g.,&lt;/em&gt; circuit breaker, etc.) that they used to improve resilience of their application?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Application Structure.&lt;/strong&gt;
Did the presenter talk about an architectural pattern in a microservice application?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From here, I further refined the sub-category of resilience bugs, based on the first analysis:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Bug in application code.&lt;/li&gt;
  &lt;li&gt;Bug in cloud service misconfiguration.&lt;/li&gt;
  &lt;li&gt;Bug could have been identified using a mock, if written.&lt;/li&gt;
  &lt;li&gt;Bug occurred in infrastructure, but was triggered by bug in the application.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I also created another sub-category of resilience bugs, based on the first analysis:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Bugs that could have been identified using traditional testing techniques involving mocks.&lt;/li&gt;
  &lt;li&gt;Bugs that could &lt;em&gt;not&lt;/em&gt; have been identified using traditional testing techniques involving mocks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then, I created a sub-category of experiments, based on the first analysis.  Readers should keep in mind that items from each presentation can belong to one or more categories in a multiple-inheritance style of analysis:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Experiments that revealed bugs that could have been identified using traditional techniques involving mocks.&lt;/li&gt;
  &lt;li&gt;Experiments that revealed bugs that could &lt;em&gt;not&lt;/em&gt; have been identified using traditional techniques involving mocks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From here, I tried to understand the relationship between the coded concepts and develop a theory about what testing methods need to be used where in order to identify different classes of resilience bugs.&lt;/p&gt;

&lt;p&gt;For the reader who is well versed in research methods, this analysis is similar to using the constant comparative method from grounded theory when performing open, axial, and selective coding.&lt;/p&gt;

&lt;h2 id=&quot;the-preliminary-theory&quot;&gt;The Preliminary Theory&lt;/h2&gt;

&lt;p&gt;In general, many bugs that occur in software and are identified using chaos engineering (a technique popularized by Netflix in which faults are introduced in the live, production environment on real customers and the application’s behavior under fault is observed) could have been identified earlier using more traditional unit, integration, and functional tests in a local, development environment.  However, this would require more effort on the part of the individual developer, who must identify the faults that can occur at each location, write mocks to simulate those faults, and then write the required tests with assertions.  This insight served as the basis of my &lt;a href=&quot;https://www.youtube.com/watch?v=pyYh-vNspAI&quot;&gt;Service-level Fault Injection Testing&lt;/a&gt; technique, which combines static analysis, test generation, and functional testing.&lt;/p&gt;

&lt;p&gt;As a specific example, Expedia tested a simple fallback pattern where, when one dependent service is unavailable and returns an error, another service is contacted instead.  There is no need to run this experiment by terminating servers in production: a simple test that mocks the response of the dependent service and returns a failure is sufficient.  However, more complicated, and more interesting, examples exist.&lt;/p&gt;
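
&lt;p&gt;A mock-based test of such a fallback might look like the following minimal sketch, which uses Python’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unittest.mock&lt;/code&gt;. It is not Expedia’s code: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ServiceUnavailable&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fetch_with_fallback&lt;/code&gt;, and the request value are hypothetical names, and the two services are modeled as plain callables.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from unittest import mock


class ServiceUnavailable(Exception):
    """Hypothetical error standing in for a failed dependent service."""


def fetch_with_fallback(primary, fallback, request):
    """Call the primary service; if it fails, contact the fallback instead."""
    try:
        return primary(request)
    except ServiceUnavailable:
        return fallback(request)


def test_fallback_is_used_when_primary_fails():
    # Mock the primary so that it fails, rather than terminating a real server.
    primary = mock.Mock(side_effect=ServiceUnavailable())
    fallback = mock.Mock(return_value={"source": "fallback"})

    result = fetch_with_fallback(primary, fallback, "hotel-42")

    assert result == {"source": "fallback"}
    primary.assert_called_once_with("hotel-42")
    fallback.assert_called_once_with("hotel-42")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The test runs entirely in a local development environment; the cost is that the developer has to know which failure to simulate and write the mock by hand.&lt;/p&gt;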

&lt;h3 id=&quot;example-audible&quot;&gt;Example: Audible&lt;/h3&gt;

&lt;p&gt;One example that was particularly interesting was Audible.  The Audible example is quite complicated and involves a number of services in delivering an audiobook to an end user.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/audible.png&quot; width=&quot;800&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this example, when a user requests an audiobook using the Audible app, it first issues a request to the Content Delivery Engine to find the URL of the Content Delivery Service that contains the audiobook assets; we can think of this as a primitive sharding layer that is used to support scalability.  Once the Audible app retrieves this URL, it issues a request to the specified Content Delivery Service.&lt;/p&gt;

&lt;p&gt;The Content Delivery Service first makes a request to the Audible Download Service.  The Audible Download Service is responsible for first verifying that the user owns the audiobook.  Next, the Audible Download Service verifies that a DRM license can be acquired for that audiobook.  Finally, it updates statistics at the Stats service before returning a response to the Content Delivery Service.  If either ownership of the book cannot be verified or a DRM license cannot be activated, the Audible Download Service returns an error response to the Content Delivery Service, which propagates back to the user through an error in the client.&lt;/p&gt;

&lt;p&gt;Once the Audible Download Service completes its work, the Content Delivery Service contacts the Audio Assets service to retrieve the audiobook assets.  Then, the Content Delivery Service contacts the Asset Metadata Service to retrieve the chapter descriptions.  In this design, there is an implicit assumption that if an audiobook exists in the system and the assets are available, the metadata containing the chapter descriptions will also be available.&lt;/p&gt;

&lt;p&gt;In this example, a real outage reported by Audible, the asset metadata is unavailable, either because of developer error or a race condition.  As a result, a latent bug in the Content Delivery Service, which does not expect the metadata to be unavailable when the assets are available, causes an error to be propagated back to the end user.  The mobile application, not expecting this error code to ever occur, retries the request a number of times and then presents a generic error to the user, which prompts them to hit the retry button.  This influx of retries causes all of Audible to fail: the system incurs an outage.  In this case, a number of software bugs, all detectable through traditional unit, integration, and functional testing, cause the system to overload and exhaust the available compute capacity in the cloud, resulting in an outage.&lt;/p&gt;
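
&lt;p&gt;To make the latent bug concrete, here is a hedged sketch of a local test for the missing-metadata case.  It is not Audible’s code: &lt;code&gt;deliver_audiobook&lt;/code&gt;, &lt;code&gt;MetadataUnavailable&lt;/code&gt;, and the file name are hypothetical, and returning an empty chapter list is only one possible way to handle the fault.  The point is that the implicit assumption can be challenged without any cloud infrastructure.&lt;/p&gt;

```python
class MetadataUnavailable(Exception):
    """Hypothetical error raised when chapter metadata cannot be retrieved."""


def deliver_audiobook(book_id, get_assets, get_metadata):
    """Return assets plus chapter descriptions, degrading if metadata is missing."""
    assets = get_assets(book_id)
    try:
        chapters = get_metadata(book_id)
    except MetadataUnavailable:
        # Do not assume metadata exists just because the assets do.
        chapters = []
    return {"assets": assets, "chapters": chapters}


def failing_metadata(book_id):
    # Injected fault: the metadata service has no entry for this book.
    raise MetadataUnavailable(book_id)


def available_assets(book_id):
    # The assets exist even though the metadata does not.
    return ["part-1.audio"]


response = deliver_audiobook("book-1", available_assets, failing_metadata)
assert response == {"assets": ["part-1.audio"], "chapters": []}
```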

&lt;h3 id=&quot;resilience-fault-taxonomy&quot;&gt;Resilience Fault Taxonomy&lt;/h3&gt;

&lt;p&gt;When we look at this example, and the other examples from our study, we can identify two major concerns for the developers of microservice applications:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;First, &lt;em&gt;anticipation and handling of application errors&lt;/em&gt;: ensuring the application contains error handling code for all possible errors.&lt;/li&gt;
  &lt;li&gt;Second, &lt;em&gt;ensuring proper behavior and scalability of resilience countermeasures:&lt;/em&gt; the methods that we take to contain unexpected, untested failures when they occur.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We diagram this relationship below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/types-of-taxonomy.png&quot; alt=&quot;Types in Taxonomy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We represent this as four quadrants, bifurcated by abstraction layer and whether or not resilience countermeasures are present.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Infrastructure: Missing Scalability Configuration&lt;/strong&gt;: Failures might occur because our system fails to handle expected load.  For example, we may be missing a security policy or auto-scaling rule that is required for operation, which induces external faults in our application.  These external faults may trigger latent, internal faults in our application.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Infrastructure: Unscalable Configuration&lt;/strong&gt;: Failures might occur because our infrastructure configurations are wrong.  For example, we might misconfigure auto-scaling rules or concurrent execution limits for serverless functions; this may cause our application to experience an external fault (e.g., concurrent execution limit exceeded).  These external faults may trigger latent, internal faults in our application.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Application: Missing or Incorrect Error Handling Code&lt;/strong&gt;: Our application may be missing error handling code, or contain incorrect error handling code, for possible errors: error codes we do not expect from external services, and error codes we fail to handle from our own services, whether they arise from malfunctioning endpoints, software defects, or services that are unavailable.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Application: Unscalable Error Handling Code&lt;/strong&gt;: Our application may contain unscalable error handling code.  In this case, an error causes the application or service to take a code path designed for error handling, but that error handling path may introduce additional load on the system, causing the system, or some number of the application’s services, to fail.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ok, so what is the theory?&lt;/p&gt;

&lt;h3 id=&quot;theory-of-resilience-testing-methods&quot;&gt;Theory of Resilience Testing Methods&lt;/h3&gt;

&lt;p&gt;Why does the taxonomy matter?  It matters because each quadrant needs to be addressed by a separate testing methodology, with separate tools, at a different point in the software engineering lifecycle.&lt;/p&gt;

&lt;p&gt;We can represent this as follows, by reimagining the same graph with different axes:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/taxonomy.png&quot; alt=&quot;Taxonomy&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In this example, we have modified the abscissa to represent where the analysis needs to be performed; for the ordinate, we have modified it to represent whether or not we can identify the problem in a local environment or in the cloud.&lt;/p&gt;

&lt;p&gt;When it comes to missing configurations, these can be detected locally in a development environment.  In fact, missing configurations for cloud services can easily be detected through static analysis.  On the other hand, missing or incorrect application error handling code can &lt;em&gt;sometimes&lt;/em&gt; be detected statically, but can most definitely be detected dynamically, &lt;em&gt;especially and most notably using fault-injection approaches like Filibuster.&lt;/em&gt;&lt;/p&gt;
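
&lt;p&gt;As a rough illustration of the static side, the sketch below scans a configuration for settings that are absent.  The key names and the service definitions are assumptions made up for this example and do not correspond to any particular cloud provider’s schema; a real checker would read the provider’s own configuration format.&lt;/p&gt;

```python
# Hypothetical settings that every service is expected to declare.
REQUIRED_KEYS = ("autoscaling_rule", "concurrency_limit", "security_policy")


def find_missing_configuration(services):
    """Return a sorted list of (service, key) pairs for absent settings."""
    missing = []
    for name, config in services.items():
        for key in REQUIRED_KEYS:
            if key not in config:
                missing.append((name, key))
    return sorted(missing)


# Illustrative input: the second service declares only a security policy.
services = {
    "content-delivery": {
        "autoscaling_rule": "cpu-70",
        "concurrency_limit": 100,
        "security_policy": "default",
    },
    "asset-metadata": {"security_policy": "default"},
}

problems = find_missing_configuration(services)
assert problems == [
    ("asset-metadata", "autoscaling_rule"),
    ("asset-metadata", "concurrency_limit"),
]
```

&lt;p&gt;Note that this only checks for presence.  Whether a declared limit is actually adequate under load is a different question, and not one a check like this can answer.&lt;/p&gt;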

&lt;p&gt;However, when it comes to the problems of cloud configuration, things change.  Incorrect or unscalable cloud configurations, where auto-scaling rules or concurrent execution limits are incorrect, can only be detected in the cloud environment: we do not have the cloud runtime (e.g., AWS, GCP) available locally to detect these issues, even if we used a load generator.  Since there is no way to run these services locally using those configurations, the only way to identify these issues is to run those configurations in the actual cloud environment under either actual or synthetic load.  In fact, we may need to inject infrastructure-level faults to do proper testing; for example, inducing instance failures to trigger an instance restart rule.  We note that Azure has made some movement in this area by allowing developers to run its serverless runtime environment locally.&lt;/p&gt;

&lt;p&gt;When it comes to incorrect or unscalable cloud configurations that only cause failures in the presence of a latent application fault, these failures must be tested in the cloud environment &lt;em&gt;when we can induce faults.&lt;/em&gt;  While chaos engineering can identify a number of these faults when they occur in the infrastructure (through failed instances or blackholed DNS), all of these failures surface themselves as errors or exceptions &lt;em&gt;in the application,&lt;/em&gt; which indicates that we should inject these failures in the application, running in a cloud environment, where they can be explicitly controlled and exercised.&lt;/p&gt;

&lt;h2 id=&quot;core-observation&quot;&gt;Core Observation&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Testing complexity increases as we move towards the upper-right quadrant.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As we move in that direction, we need to run the applications in a cloud environment combined with principled fault injection.  For simple infrastructure-level faults, this can be as simple as crashing a service.  For application-level faults, more advanced techniques are required: for instance, causing a particular service to fail with a particular gRPC response code that then causes the invokers of that service to perform more expensive, alternate work.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Chaos engineering, a technique that works well for the upper-right quadrant, is commonly used for identifying issues across all quadrants.  It is sufficient, but not necessary.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To understand this, we use a small example.  Consider the case where Service A takes a dependency on Service B.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If A fails to handle the unavailability of B, or does not handle its failure properly, we should detect that locally.  That lives in the lower-right quadrant.&lt;/li&gt;
  &lt;li&gt;However, if upon experiencing a failure of Service B, Service A invokes a more expensive code path that, under load, crashes because of incorrect service provisioning or resource exhaustion, that’s the upper-right.&lt;/li&gt;
&lt;/ul&gt;
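
&lt;p&gt;The difference between the two cases can be sketched locally, at least in part.  In the hypothetical example below, Service A handles the failure of Service B correctly, yet its error path issues one request per item instead of a single batch request.  All names here (&lt;code&gt;service_a&lt;/code&gt;, &lt;code&gt;CallCounter&lt;/code&gt;, &lt;code&gt;Unavailable&lt;/code&gt;) are invented for illustration, and the call counter is only a stand-in for real load metrics.&lt;/p&gt;

```python
class Unavailable(Exception):
    """Hypothetical error signalling that Service B could not be reached."""


class CallCounter:
    """Wraps a function and counts invocations, standing in for load metrics."""

    def __init__(self, func):
        self.func = func
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.func(*args)


def service_a(item_ids, batch_lookup, single_lookup):
    """Use B's batch endpoint; on failure, fall back to one call per item."""
    try:
        return batch_lookup(item_ids)
    except Unavailable:
        return [single_lookup(item_id) for item_id in item_ids]


def broken_batch(item_ids):
    # Injected fault: the batch endpoint of Service B is down.
    raise Unavailable("service B batch endpoint")


batch = CallCounter(broken_batch)
single = CallCounter(lambda item_id: item_id * 2)

result = service_a([1, 2, 3, 4], batch, single)

assert result == [2, 4, 6, 8]  # the failure is handled correctly
assert batch.calls == 1
assert single.calls == 4       # but the error path issues four requests
```

&lt;p&gt;A local test like this can show that the error path is more expensive.  Whether that extra work actually exhausts provisioned capacity still has to be determined in the cloud environment under load.&lt;/p&gt;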

&lt;p&gt;Chaos engineering is the hammer and all four quadrants look like nails.  It works because it operates at such a low, fundamental level.  While it is sufficient, it is not strictly necessary for identifying these classes of resilience issues.  Instead of resorting to chaos engineering to identify all resilience issues, a multi-level testing approach should be used: first identify failures to handle errors and ensure that errors are handled properly, and only then test the error handling and resilience countermeasures in the cloud environment.&lt;/p&gt;

&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;

&lt;p&gt;What are the takeaways?&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Failures can occur because of both infrastructure-level and application-level problems.  Therefore, a multi-level testing approach is necessary to address both.&lt;/li&gt;
  &lt;li&gt;Application-level problems, such as the problems in the Audible example, can result in outages if left unhandled and then propagate to the infrastructure level.  Therefore, try to eliminate as many problems as possible through local testing, and then use testing in the cloud, under load with principled fault injection, to determine whether the system reacts to failures correctly.&lt;/li&gt;
  &lt;li&gt;Infrastructure-level misconfigurations should be detected statically, if possible, using configuration validation tools.  Verifying these configurations should be done using load testing tools, based on expected load, in the actual cloud environment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you find this work interesting, &lt;a href=&quot;http://twitter.com/cmeik&quot;&gt;please reach out&lt;/a&gt;.  I’d love to hear more about application designs, failures, outages, and anything related to microservice resilience.&lt;/p&gt;
</description>
				<pubDate>Sat, 19 Mar 2022 00:00:00 +0000</pubDate>
				<link>https://christophermeiklejohn.com/filibuster/2022/03/19/understanding-faults.html</link>
				<guid isPermaLink="true">https://christophermeiklejohn.com/filibuster/2022/03/19/understanding-faults.html</guid>
			</item>
		
	</channel>
</rss>
