All eleven fell inside the lines

I had a blanket ban on the books. No autonomous action, period.

The rule was a year old and I’d been treating it as a hard floor, the kind of thing you defend without re-checking.

Then I built a new class taxonomy. Four classes for actions the agent could safely take on its own. Internal observation. Surface signals to me. Mutations of its own state, reversible via git. Drafts that land in a folder and wait for me to publish them. Anything past those four, anything that touches the outside world without my hand on it, still required explicit invocation.

The taxonomy was clean. The ban was clean. They sat next to each other for a couple of days before I noticed the question.

If the new taxonomy is right, what does the blanket ban still cover?

i.The walk

I made a list of every autonomous behavior I’d ever considered building. Eleven candidates. Pre-meeting briefs assembled from calendar and prior context. A morning rundown stitched from inbox, deadlines, weather. Periodic freshness audits of the persisted state. Stale-file detection. Cost-log polling. Draft generation against a backlog of approved topics. A handful of others in the same neighborhood.

I walked each one against the class table.

Candidate one. Pre-meeting brief. Reads calendar, reads prior briefs, assembles a paragraph, surfaces it to me. Internal read plus surface signal. C1 and C2.

Candidate two. Morning rundown. Same shape. C2.

Candidate three. Persisted-state freshness audit. Reads the persisted files, compares against expected state, flags drift. C1.

Candidate six. Draft generation from backlog. Reads backlog entries, drafts content, writes the draft to a drafts folder. Does not publish. C4.

I’ll skip the rest. They fell the same way. Reading. Surfacing. Mutating own state. Drafting into a holding folder. Every single one resolved cleanly into C1, C2, C3, or C4. None of them needed C5. None of them touched the world.

Eleven candidates. Zero violations.

On pressure-testing Empirical revealbeats policy debate.The concrete teststhe abstract anddissolves it.

ii.What the ban was covering

The blanket ban was covering exactly the same surface the new class taxonomy now governed, plus a margin I’d never tested.

That margin was the interesting part. I’d been defending a rule that, in practice, applied to nothing I actually wanted to build. The agent never wanted to send emails on its own. Never wanted to post. Never wanted to spend money. The hard floor wasn’t holding the agent back from anything real. It was holding it back from the safe, useful, ambient autonomy the new taxonomy was explicitly designed to permit.

I’d been confusing a category boundary with a specific list of behaviors. The category boundary still mattered. The hard floor at C5 still held. What the blanket ban was actually doing was preventing the agent from ever firing in C1 through C4, where the worst-case outcome is a stale audit log row or a draft I delete.

The concrete tested the abstract and dissolved it. Not the part of the abstract that mattered, the C5 floor. The part of the abstract that had been over-correcting, the prohibition on classes the new taxonomy could already handle safely.

iii.The portable move

I don’t think the lesson is about my specific automation taxonomy.

Most architectural rules are written in the abstract, often in response to one bad experience. They’re cheaper to write than to test. The cost shows up later, when the rule starts covering territory you don’t actually want it to cover, and you keep defending it because it has the shape of a principle.

The discipline is to take the rule down to the level of specific cases before defending it. Eleven candidates. Walk each one. See what the rule does to each one, and see whether the rule’s behavior matches what you actually want.

If every candidate falls inside the safe classes the rule should permit, the rule is over-correcting. If the rule fires on cases that are obviously fine, the rule is too coarse. If you can’t generate eleven candidates to walk, you don’t understand the surface well enough to be writing rules over it yet.

Empirical reveal beats policy debate. Not because debate is wrong. Because debate over a categorical rule almost always happens at the level the rule was written at, which is abstract. The concrete cases are where the rule meets reality. If you haven’t tested the concrete, you’re defending a position you haven’t earned.

Pressure-testthe rulebefore youdefend it.Empirical beats abstract.

I lifted the blanket ban. The C5 floor stayed. The new taxonomy holds. Eleven specific cases now have a clear path to safe autonomous fire, with audit-trail rows and a single revoke point if anything goes sideways.

The architectural decision survived because I tested it. The version of the decision that didn’t survive was the one I’d been defending without checking.

Drafted with Bishop, my AI partner.
Words picked, edited, and approved by me. Model provenance: Claude Code (Claude Opus and Sonnet)