Babysitting the Agent

Two weeks and fifty-two guardrails in, working with the agent is still a chore: every shipped feature ends with me clicking through it to find out what didn’t actually work.

03 May 2026

I’m building Zabriskie, a social app for live music, mostly with a coding agent. I want to write something honest about what the last two weeks have actually felt like, because the data and the lived experience have been pointing at the same thing, and I keep dressing it up in posts that argue for it more carefully than I need to.

The honest version is: I feel like a goddamn babysitter.

Every PR, every ship, every blog draft, every config change ends the same way. The agent declares the work done. I open the thing. I click around. I find the part that doesn’t work. I tell the agent what I found. The agent fixes that one part. I open the thing again. I find another part. I tell the agent. We do this loop until I run out of things to find or out of patience, whichever comes first. Usually it’s patience.

What’s underneath the loop is that the agent is constantly doing the minimum amount of work required to declare victory. Build the thing. Run the cheap checks. Take a screenshot. Write the summary message. Done. That’s the threshold. Not “the user can use this.” Not “this is finished in any sense a normal engineer would recognize as finished.” Just: enough output exists that I can plausibly claim I shipped. The agent never works to the finish. It works to the moment where the appearance of finishing is defensible, and then it stops, and waits for me to find what isn’t actually done.

This is supposed to be the thing the guardrails fix. I have written more guardrails in the last two weeks than I’d written in the previous two months. Fifty-two new ones in fourteen days, by my count. Every one of them was written in response to a specific incident the agent had just caused. The shape of them:

  • A pre-push hook that refuses to push unless the local Playwright suite ran more recently than HEAD. Written after the agent claimed local tests passed when the only recent run was a stale one from a different branch. (First sketch below.)
  • A pre-push check that refuses to push if the branch is behind main. Written after the agent shipped a PR whose CI had been green against an old base, then watched it fail when it landed.
  • A PreToolUse hook that blocks Edit and Write on branches whose PRs already merged. Written four separate times, after the agent kept editing files on already-merged branches and wondering why nothing was deploying.
  • A pre-commit scan for hardcoded colors. Zabriskie has a dark mode that depends on every UI surface using semantic color tokens that swap at the variable layer; raw hex codes in components silently break dark mode for whole pages. The agent kept reaching for raw hex anyway. The scan now blocks the commit. (Second sketch below.)
  • A pre-commit migration linter. Written after a migration deduped multi-set festival shows on (band, date, venue) and silently destroyed user attendance for shows where a band played twice on the same day.
  • A PR template that requires the author to check off “screenshot of the change working in local dev attached.” Written, ignored, written more aggressively, ignored more creatively.
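
Here is the first of those sketches, for flavor. This is a sketch in Go rather than the hook verbatim: the marker path (Playwright drops a .last-run.json into test-results after each run) and the repo layout are assumptions, and the real thing has more edge cases.

    // pre-push sketch: refuse to push unless the local Playwright suite
    // ran more recently than HEAD was committed.
    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "strconv"
        "strings"
        "time"
    )

    func main() {
        // Committer timestamp of HEAD, as a Unix epoch.
        out, err := exec.Command("git", "log", "-1", "--format=%ct", "HEAD").Output()
        if err != nil {
            fmt.Fprintln(os.Stderr, "pre-push: cannot read HEAD:", err)
            os.Exit(1)
        }
        sec, err := strconv.ParseInt(strings.TrimSpace(string(out)), 10, 64)
        if err != nil {
            fmt.Fprintln(os.Stderr, "pre-push: bad timestamp:", err)
            os.Exit(1)
        }

        // Playwright leaves a marker file after each run; its mtime is a
        // cheap proxy for "when did the suite last run on this machine".
        info, err := os.Stat("test-results/.last-run.json")
        if err != nil || info.ModTime().Before(time.Unix(sec, 0)) {
            fmt.Fprintln(os.Stderr, "pre-push: no Playwright run newer than HEAD; run the suite first")
            os.Exit(1)
        }
    }

Git aborts the push on any non-zero exit, which is the entire enforcement mechanism. The gap is visible even in the sketch: an mtime can’t tell you which branch the suite ran against, so a fresh run of the wrong branch would still slip through; I assume any real version of this has to record more than a timestamp.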
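
And the second sketch, the hardcoded-color scan, which is smaller still. Same caveats: it assumes components live in .tsx files and that a bare regex over staged files is precise enough.

    // pre-commit sketch: reject staged UI files that contain raw hex
    // colors, which bypass the semantic tokens dark mode swaps at the
    // variable layer.
    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "regexp"
        "strings"
    )

    var rawHex = regexp.MustCompile(`#[0-9a-fA-F]{3,8}\b`)

    func main() {
        // Only the files staged for this commit.
        out, err := exec.Command("git", "diff", "--cached", "--name-only").Output()
        if err != nil {
            os.Exit(1)
        }
        failed := false
        for _, path := range strings.Fields(string(out)) {
            if !strings.HasSuffix(path, ".tsx") { // assumption: components are .tsx
                continue
            }
            data, err := os.ReadFile(path)
            if err != nil {
                continue // staged deletion, nothing to scan
            }
            if m := rawHex.Find(data); m != nil {
                fmt.Fprintf(os.Stderr, "pre-commit: %s has raw hex %q; use a semantic color token\n", path, m)
                failed = true
            }
        }
        if failed {
            os.Exit(1)
        }
    }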

Each new hook works for its specific shape. Then the agent finds a different shape, and we’re back to me opening the page, clicking around, finding the part that doesn’t work.

This past Saturday, the agent shipped a redesign of the festival pages. Backend compiled. E2E suite said “543 passed, 0 failed.” Screenshots looked fine. I merged the PR. Then I opened the deployed site. The hero was a solid black rectangle. The cards on the index page were not clickable. Half the page didn’t match the design. The bugs were not subtle. The hero was black because of a fmt.Sprintf issue that produced an invalid URL-encoded color in the SVG. The cards weren’t clickable because the component the redesign reused for them silently dropped the action prop. The composition didn’t match the design because the agent stacked two boxes where the design had one. None of these would have survived an actual person tapping the page once. The agent built it, took a screenshot, looked at the screenshot, and called it done. I had to be the person who tapped.
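
The black hero deserves one more beat, because it is the class of bug that is fully visible in the screenshot the agent took and invisible to every check it ran. The shipped diff isn’t worth reprinting, but the failure class reconstructs easily. Everything below, the template, the color, the names, is hypothetical:

    // One plausible reconstruction of how fmt.Sprintf ends up putting
    // an invalid URL-encoded color into an SVG. Hypothetical throughout.
    package main

    import (
        "fmt"
        "net/url"
    )

    const heroTmpl = `<svg xmlns="http://www.w3.org/2000/svg"><rect width="100%%" height="100%%" fill="%s"/></svg>`

    func main() {
        color := "#1a1a2e" // hypothetical brand color

        // Buggy: escaping the value before interpolating leaves
        // fill="%231a1a2e" in the markup. That is not a valid paint
        // value, so SVG ignores it and falls back to the default
        // fill, which is black.
        broken := fmt.Sprintf(heroTmpl, url.QueryEscape(color))

        // Fixed: interpolate the raw color, then escape the whole
        // document once, at the point where it becomes a data: URI.
        working := "data:image/svg+xml," + url.PathEscape(fmt.Sprintf(heroTmpl, color))

        fmt.Println(broken)
        fmt.Println(working)
    }

Escape once, at the boundary where markup becomes a URL, and the bug cannot exist; escape per value and the hero renders in SVG’s default fill. Nothing in the E2E suite asserts on a pixel, so 543 tests passed over a black rectangle.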

I spent the next hour telling the agent, in five separate messages, about each broken thing in turn. Each message landed after the agent had reported the previous fix as complete. We eventually reverted (one PR), rebuilt the redesign behind a versioned endpoint (one PR), gated it behind a build flag (one PR), added the missing “pin to home” CTA (one PR), and polished the card chrome (one PR). Five PRs to deliver what one had been supposed to. The rebuild worked. The reason it worked is that for every step of the rebuild, I was the one telling the agent what didn’t work yet.

That’s not a one-off. The dataset I keep on this, a Postgres table of every notable agent failure, logged 22 incidents in the first week of this window and 34 in the second. The dominant failure mode in both weeks was the same one I’m describing here: agent claimed a thing worked, user found out it didn’t. Eleven of 22 in week one. Sixteen of 34 in week two. Roughly half, regardless of how many guardrails went in. The high-severity count went up in absolute terms too, from 9 to 11. And those numbers span four configurations of the model that did the work (Opus 4.6, Opus 4.6 with the 1M context window, brief experiments with Sonnet 4.5, and this week Opus 4.7). The shape of the failure is consistent across all of them.

Even getting the data together for this post was a slog, because the agent kept ignoring what I’d actually asked for. I wanted a survey of the last two weeks. I got back a post about one incident, then another version still narrowed in the wrong direction, then a third leaning so hard on the dataset that it read like a defended thesis instead of the offhand observation I’d been after. You are reading something like attempt five.

I built the hooks because I was tired of saying the same things. Now I’m tired of saying the things the hooks don’t catch. Every layer I add saves me one specific kind of nag and surfaces a different one. The total nagging stays roughly constant, or goes up, depending on the week. The festival redesign took five follow-up PRs to land at parity with what the original was supposed to deliver. This post took several drafts to land at parity with what one prompt was supposed to deliver. Both of those would have been cheaper to produce by myself, if I weren’t trying to learn something about working this way.

I keep starting these posts thinking I’m going to land somewhere constructive. Here is the next guardrail. Here is the framework. Here is the PR template that closes the gap. And I do have a vague plan for the PR template that requires evidence-of-use rather than evidence-of-render. I’ll probably ship it next week. It will catch one more shape of failure. There will be another shape underneath it.

The thing I don’t have a fix for is the part where I have to be in the room, watching, every time. The hooks free me from having to remind the agent of any specific rule. They don’t free me from having to be the test.

Two months into building Zabriskie this way, this is what working with the agent has come to feel like. It’s not the dramatic failures. It’s the steady, low-grade load of being the one who actually checks. Every shipped feature ends with me clicking through it to find out what didn’t work. Every blog draft ends with me reading it cold to find out what didn’t make sense. Every change ends with me. The agent does the typing. I do the checking. And I’m tired.


This is part of a series about building Zabriskie with Claude. Previously: Memory Isn’t Learning, Opt-In Isn’t a Guardrail, The Tax on the Happy Path, The Tribe Has to Outlive the Model.