The Ambiguity Tax
Last week my agent committed a feature in eight minutes. I spent forty-five reading the diff before I realized it had solved the wrong problem.
If you've been running Claude Code, Codex, Cursor, or any autonomous loop on a real codebase for more than a month, you've lived this. The agent is fast. You are not. The asymmetry that was supposed to be a win — the agent does the work, you review it — has quietly inverted. Output is cheap; verification is the bottleneck. And verification is harder than the work itself, because you're not writing code, you're reverse-engineering thirty implicit decisions the agent made while you were getting coffee.
This post is about a three-line skill that fixes this — and the deeper principle behind why it works, which I think most people who installed it have not yet noticed.
The skill
Matt Pocock published /grill-me about a month ago. It went viral for good reason. Here is the entire thing:
```markdown
---
name: grill-me
description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
---

Interview me relentlessly about every aspect of this plan until
we reach a shared understanding. Walk down each branch of the design
tree resolving dependencies between decisions one by one.
If a question can be answered by exploring the codebase, explore
the codebase instead.
For each question, provide your recommended answer.
```
That is the skill. Three sentences in the body. You drop it into .claude/skills/grill-me/SKILL.md, type /grill-me before describing your feature, and the agent shifts modes — instead of drafting code, it interrogates you. Fifteen, thirty, sometimes fifty questions deep. By the end, you and the agent share a model of what's being built. Only then do you let it write anything.
Matt's framing is workflow-focused: stress-test your plan, save the round trip. That's true and useful. It's also the surface read. The deeper thing this skill does — the reason it's the most important skill I run — is that it shifts where alignment work happens in your stack.
A close reading of three lines
The body is doing more work than it appears to. Each clause is load-bearing.
"Interview me relentlessly about every aspect of this plan." Most clarifying-question prompts give up after two or three questions because the model defaults to politeness — it doesn't want to seem demanding. "Relentlessly" overrides that. The agent will keep going until it actually has the information, not until it feels like the conversation has been polite enough.
"Walk down each branch of the design tree resolving dependencies between decisions one by one." This is the line that separates /grill-me from "ask me clarifying questions." It forces a hierarchical decomposition: resolve the upstream choice before asking about its downstream consequences. Without this, you get a flat checklist of disconnected questions and the agent has to guess at dependency order. With it, the agent figures out the shape of the unknowns before drilling in. This is what makes long sessions converge instead of meander.
"If a question can be answered by exploring the codebase, explore the codebase instead." This pre-empts the most annoying failure mode of clarification-heavy agents: being asked things the agent could trivially look up. It also subtly rebalances who's doing the work. The agent isn't a passive interviewer; it's an investigator. You only get asked things only you can answer.
"For each question, provide your recommended answer." This is a latency optimization disguised as a courtesy. It turns each round from "answer this open question from scratch" into "review and correct this draft." Reviewing a draft is roughly an order of magnitude faster than authoring an answer. Without this line, the skill is unusable; with it, you can clear thirty questions in twenty minutes.
The compression is remarkable. Three lines. No examples, no role-play, no fenced templates. Most of the skills I see people writing are 200 lines and do less work than this one.
The asymmetry that makes this work
Here is the model that, once you see it, you can't unsee.
Every agent task has two costs hiding inside it: specification cost (writing down what you want) and verification cost (checking that what came back is what you wanted). For most of us, most of the time, we underspecify and over-verify. We type a one-paragraph prompt, the agent generates 400 lines of code, and we then perform a 400-line code review under the assumption that the diff is the source of truth about what was built.
Specification cost scales with the number of decisions a task contains. Verification cost scales with the output of that task — usually decisions × elaboration-per-decision. For any non-trivial feature, output is much larger than the decision set. So:
- A grill-me session of thirty questions takes ~30–45 minutes of focused conversation.
- Verifying the same thirty decisions by reading a diff takes longer, is error-prone, and silently fails when a decision was wrong but locally plausible.
This isn't a marginal speedup. It's a strict change in cost basis. You're moving the same alignment work from the high-variance, high-cost side of the equation (post-hoc diff archaeology) to the low-variance, low-cost side (a structured Q&A where each item is binary-resolvable).
Anyone who has tried to "review more carefully" as a strategy for dealing with agent output volume already knows this curve. Diff review is O(output). Pre-action elicitation is O(decisions). For autonomous workloads, the former is unbounded; the latter is small and finite.
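The curve is easy to make concrete with a toy back-of-the-envelope model. Every constant below is an illustrative assumption, not a measurement; the point is the shape, not the numbers:

```python
# Toy cost model for specify-vs-verify. All constants are
# illustrative assumptions, not measurements.
decisions = 30                # implicit choices hiding in one feature
lines_per_decision = 15       # elaboration: code each decision expands into
review_sec_per_line = 20      # careful, adversarial diff reading
answer_sec_per_question = 60  # reviewing and correcting a recommended answer

output_lines = decisions * lines_per_decision             # a 450-line diff
verify_min = output_lines * review_sec_per_line / 60      # scales with output
specify_min = decisions * answer_sec_per_question / 60    # scales with decisions

print(f"post-hoc diff review: ~{verify_min:.0f} min")     # ~150 min
print(f"pre-action grilling:  ~{specify_min:.0f} min")    # ~30 min
```

Double `lines_per_decision` and the review cost doubles while the grilling cost does not move. That is the O(output) versus O(decisions) gap in one knob.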
The Ambiguity Crystallization Problem
The deeper reason /grill-me works is a thing I've started calling ambiguity crystallization.
When you give an agent an ambiguous instruction, the ambiguity does not disappear. It cannot. The agent has to act, which means each ambiguity gets resolved — silently, inside the policy. Should this state live in Redux or component-local? Is null distinct from undefined here? Do we treat a missing benefit as "ineligible" or "unknown"? You didn't say. The agent picked. You will find out which way it picked when you read the diff, or when production breaks, or never.
The work /grill-me does is to convert implicit decisions into explicit ones before the agent acts. Each question in the grilling session is an ambiguity being lifted out of the policy and surfaced into the conversation, where it can be argued with, recorded, and reversed.
This matters because implicit decisions and explicit decisions have radically different verification costs:
- An explicit decision in a grill-me transcript is auditable in O(1). You see the question, you see the answer, you agree or disagree.
- An implicit decision baked into generated code is auditable in O(however much code you have to read to find it), and only if you happen to know the right thing to look for. Most don't get found.
Reviewing the diff is therefore not a substitute for grilling. The diff shows you the consequences of decisions, not the decisions themselves. Some decisions don't even leave a fingerprint in the diff — they show up as the absence of code paths, edge cases, error handling. You cannot review what isn't there.
This is why "I'll just be more careful with code review" loses as a strategy as agent output volume grows. You're trying to recover signal from a representation that has already lost it. Grill-me works because it intervenes at the layer where the signal still exists.
The Alignment Ladder
Zoom out. /grill-me is one instance of a broader principle: you can intervene at multiple layers to control agent behavior, and most users intervene at only the worst one.
Here is the ladder, ordered from earliest to latest:
1. Training-time alignment — constitutional AI, RLHF, post-training. You don't own this layer. Anthropic does.
2. System-prompt persona — coarse, brittle, easy to override. Useful but blunt.
3. Pre-action elicitation — what `/grill-me` does. Force the agent (and yourself) to specify intent before any action is taken. This is the underpriced layer.
4. In-action constraints — tool permission lists, plan mode, hooks that block dangerous commands, scoped sandboxes. Useful for capability, not for intent.
5. Post-action verification — code review, tests, CI, the diff. What everyone defaults to.
6. Rollback / disaster recovery — git revert, db restore, the apology email. Last resort.
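To make layer 4 concrete: in Claude Code, in-action constraints live in permission rules in `.claude/settings.json`. A minimal sketch — verify the exact pattern syntax against the Claude Code permissions docs before relying on it:

```json
{
  "permissions": {
    "deny": [
      "Bash(rm -rf:*)",
      "Bash(git push --force:*)"
    ]
  }
}
```

Note what this buys and what it doesn't: it caps capability, but it says nothing about whether the agent understood the task. That gap is exactly what layer 3 covers.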
Most senior agent users I talk to are running on layers 4, 5, and 6. They have plan mode, they have CI, they have rollback. They have almost nothing at layer 3. When something goes wrong, they assume the answer is more constraints (4) or stricter review (5) — both downstream of where the failure originated.
The /grill-me insight is that layer 3 is where the leverage is. A pre-action interview is much cheaper than a post-action review and much more accurate, because it operates on intent rather than artifact. Every hour you spend strengthening layer 3 takes hours off layers 5 and 6.
You can build a small library of layer-3 interventions for yourself. /grill-me is the general one. I run variants of it for specific contexts:
- `/clarify-the-spec` — for writing tasks (PRDs, RFCs, blog posts). Same engine, different framing.
- `/grill-me-as-architect` — biased toward system design questions: data flow, failure modes, consistency boundaries.
- `/grill-me-as-pm` — biased toward product questions: user, problem, success metric, what's out of scope.
- `/red-team-this-plan` — adversarial flavor; instead of resolving ambiguity, surface failure modes the plan doesn't address.
These aren't long. None of them is over fifteen lines. The leverage is in when they run, not how elaborate they are.
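For concreteness, here is one way such a variant could look. The wording is mine, not from Matt's repo — treat it as a starting point, not a reference:

```markdown
---
name: red-team-this-plan
description: Adversarially probe a plan for failure modes. Use when the
  user wants their plan attacked rather than clarified, or mentions
  "red team".
---

Attack this plan. For each component, name the most likely way it
fails in production and the assumption that failure would expose.
Do not resolve ambiguities; surface risks the plan does not address.
If a risk can be confirmed by exploring the codebase, explore the
codebase first. Rank findings by expected damage, worst first.
```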
What to do this week
If you build agents into your daily work, here is what I'd do, in priority order:
1. Install `/grill-me` in every active project. It takes a minute. The skill is on Matt's GitHub. There is no cost-benefit analysis to do; just install it.
2. Make `/grill-me` the entry point for any task longer than ~50 lines of expected output. This is the threshold where ambiguity-crystallization cost starts to dominate. Below it, the diff is a viable artifact. Above it, you're already losing.
3. Persist the transcripts. A grill-me session is the most valuable spec document you'll produce on a feature. It contains the actual decisions, in your words, before the rationalizations of the resulting code obscure them. Commit it. Drop it in your PRD doc, your repo's `decisions/` folder, your team's wiki. Months from now, when someone asks "why did we do it this way?", the grill transcript answers more honestly than any retroactively written ADR.
4. Gate autonomous loops on grill-me. If you're running a Ralph loop, a `check.sh` fleet, a devcontainer with `--dangerously-skip-permissions`, anything that runs agents without a human in the inner loop — your loop quality is bounded by the alignment of its entry point. Have the human grill before the loop starts. Then let it run.
5. Keep the irreversible-action hard stops, regardless. Alignment is not authorization. `rm -rf`, `git push --force`, production migrations, paid API calls, public posts — these still need explicit per-action consent or a sandbox. Pre-action elicitation reduces the rate of bad decisions; it does not make any single bad decision recoverable.
6. Stop strengthening review as your primary lever. If you've been adding more reviewers, more CI checks, more linters — pause. Most of those costs are downstream of unspecified intent. Spend the same hour on layer 3 instead.
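The gating step can be enforced mechanically at the top of a loop driver. A minimal sketch — the `decisions/<branch>.grill.md` naming convention is my invention for illustration, not a standard:

```python
from pathlib import Path


def grill_gate(branch: str, decisions_dir: str = "decisions") -> Path:
    """Refuse to start an autonomous loop until a grill transcript exists.

    Convention (assumed, not standard): the /grill-me transcript for a
    feature branch is committed at decisions/<branch>.grill.md.
    """
    transcript = Path(decisions_dir) / f"{branch}.grill.md"
    if not transcript.is_file():
        # Fail closed: no transcript, no loop.
        raise SystemExit(
            f"no grill transcript at {transcript}; run /grill-me first"
        )
    return transcript
```

Call it before the loop's first iteration and the loop simply never starts on an ungrilled branch; the hard stops from item 5 still apply inside the loop.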
The methodology shift
Here is the part that should change how you think about agents in 2026, not just what you install this afternoon.
Stop evaluating agents on output quality. Start evaluating them on clarification quality.
The metric most teams use today is something like "did it produce working code?" That metric was correct for code completion, where the agent's job was to extend a clearly specified context by a few tokens. It's the wrong metric for agentic work, where the agent's job is to deliver an outcome from an underspecified request.
For agentic work, the correct metric is something closer to "did it surface the ambiguities I would otherwise have shipped?" The best agents are not the ones that produce the most output per minute. They're the ones that catch the most unstated assumptions before any output is produced. Output is cheap. Catching assumptions is hard.
This implies a few non-obvious things:
- Skills matter more than prompts. A skill is a stable, named, reusable conversational stance. A prompt is a one-off. As your agent surface area grows, prompt-level work doesn't compound; skill-level work does.
- Conversation design becomes the new prompt engineering. The interesting design question is no longer "how do I phrase this so it produces what I want?" It's "what conversation does the agent need to have with me to produce what I want?" These have very different answers.
- Agents are interlocutors, not generators. Treating them as generators is what gives you the eight-minute commit and the forty-five-minute review. Treating them as interlocutors gives you the thirty-minute conversation and the diff you barely have to look at.
Once you make this shift, a lot of the agent-engineering problems you've been working on simplify or disappear. Memory matters more, because shared context is what makes the conversation efficient on the second turn. Tool affordances matter less, because the agent isn't trying to do everything itself anymore. Permissions get easier, because you've already established intent. Even the "should I review this diff?" question gets cheaper, because the diff is a near-deterministic consequence of a conversation you already had.
When not to use it
/grill-me is not free, and it is not always correct.
- Don't use it for cheap, reversible tasks. A typo fix, a one-line CSS tweak, a quick rename. The grill-me cost dominates the work cost.
- Don't use it when you don't yet know what you want. Sometimes the right move is to let the agent draft something so you can react to it. Reaction is a valid mode of specification, and grill-me is bad at it because it expects you to have intent. If you're exploring, switch to a generation-first mode and treat the output as a probe.
- Don't use it as theater. A grill-me session you don't actually engage with — answering "yes" to every recommendation without thinking — is worse than no session, because it produces a transcript that looks like alignment but isn't. The skill works because the user is forced to think. If you skip the thinking, you skip the skill.
Closing
The thing that the /grill-me skill quietly says, and that I think is the most important sentence in any of the recent agent-engineering literature, is this:
The alignment problem at the user-agent boundary is mostly a specification problem. We've been treating it as a verification problem because verification is easier to automate.
Once you see this, the move is obvious. Stop pouring effort into the layer where the signal is gone. Spend it where the signal still exists. That layer is the conversation, before any artifact has been produced. Get it right there, and the rest of your stack — review, tests, CI, rollback — has dramatically less work to do.
Three lines in a SKILL.md file is enough to start. The methodology shift is what compounds.
Thanks to Matt Pocock for the original skill and for normalizing the genre of "tiny skills that change everything." If you build agents and you haven't read his skills directory, do that next.