98% of the Work Has Nothing to Do With AI

Producers have been doing 98% of this work since before AI was in the conversation.

May 13, 2026

Researchers who tore apart Claude Code’s source code found it’s 98.4% deterministic infrastructure and 1.6% AI decision logic.
In March 2026, Claude Code degraded for six weeks. The model didn’t change. Three pieces of infrastructure did.
The producer’s job has always been this ratio: build the conditions for good decisions rather than making every decision yourself.
The producers building the surrounding system are running a fundamentally different operation to those still prompting.

Researchers at Mohamed bin Zayed University of AI just published a 46-page teardown of Claude Code’s TypeScript source. They counted what’s actually in there. The model, the AI reasoning layer, the part that actually “thinks”? That’s 1.6% of the codebase. The remaining 98.4% is deterministic infrastructure: permission gates, context pipelines, compaction layers, recovery logic, tool routing.

That number lands differently if you’re a producer.

What surrounds the model

The 98.4% has a name. In agent system design, it’s called the harness. It’s everything that surrounds the model: the structures that determine what information the AI sees, what actions it’s permitted to take, what happens when something goes wrong, and how the system recovers without interrupting the human.

In Claude Code, the model never directly touches the filesystem. It never runs a shell command on its own authority. Every action it proposes passes through layers of deterministic infrastructure first. Permission rules. Safety classifiers. Sandboxing. Context management pipelines that run five separate compression strategies before every model call, ordered from cheapest to most expensive, escalating only when the previous layer wasn’t enough.
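To make the escalation idea concrete, here is a minimal sketch of a cheapest-first compaction loop. It is not Claude Code’s implementation: the three strategies, the character-count stand-in for tokens, and every name below are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

Messages = List[str]
Strategy = Callable[[Messages], Messages]


def size(messages: Messages) -> int:
    # Crude stand-in for a token count: total characters.
    return sum(len(m) for m in messages)


def drop_empty(messages: Messages) -> Messages:
    # Cheapest step: discard messages that carry no text.
    return [m for m in messages if m.strip()]


def truncate_long(messages: Messages, limit: int = 200) -> Messages:
    # Mid-cost step: clip oversized messages such as long tool output.
    return [m if len(m) <= limit else m[:limit] + "..." for m in messages]


def keep_recent(messages: Messages, n: int = 3) -> Messages:
    # Most destructive step: keep only the last few messages.
    return messages[-n:]


def compact(
    messages: Messages, budget: int, strategies: List[Strategy]
) -> Tuple[Messages, List[str]]:
    """Apply strategies in order, stopping as soon as the budget is met.

    If every strategy runs and the history is still too large, the result
    is returned as-is; a real system would need a fallback here.
    """
    applied: List[str] = []
    for strategy in strategies:
        if size(messages) <= budget:
            break
        messages = strategy(messages)
        applied.append(strategy.__name__)
    return messages, applied


history = ["", "a" * 500, "b" * 50, "c" * 50, "d" * 50]
compacted, applied = compact(history, 400, [drop_empty, truncate_long, keep_recent])
print(applied, size(compacted))
```

The point of the ordering is that the destructive step only runs when the gentle ones have already failed, and all of that is decided by plain code before the model sees anything.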

The model reasons. The surrounding system enforces, routes, recovers, and manages constraints. The researchers described the design philosophy plainly: the harness creates conditions under which the model can decide well, rather than constraining its choices.

That sentence is worth sitting with.

Because the inverse is also true. A badly designed surrounding system creates conditions under which even a capable model decides poorly. The model is not the variable. The infrastructure is.

When the surrounding system breaks

In April 2026, Anthropic published a postmortem explaining why Claude Code had felt noticeably worse for the previous six weeks. Users across GitHub and Reddit described what they called “AI shrinkflation.” The model felt less intelligent, more repetitive, prone to choosing the simplest fix rather than the correct one.

The model hadn’t changed. Three infrastructure changes had.

On 4 March, Anthropic lowered Claude Code’s default reasoning effort from high to medium to reduce UI latency. On 26 March, a caching bug caused the model’s reasoning history to be wiped on every subsequent turn instead of just once after an hour of inactivity. On 16 April, a system prompt addition capping responses at 25 words between tool calls caused a measurable drop in coding quality. Three separate infrastructure decisions. None of them touched the model. All of them changed what users experienced as the model’s intelligence.

Because each change affected a different slice of traffic on a different schedule, the aggregate effect looked like broad, inconsistent degradation. Users blamed themselves. Wrong prompts, wrong approach. One AMD engineer published an exhaustive audit of 6,852 Claude Code session files and 234,000 tool calls, documenting that reasoning depth had collapsed. Third-party benchmarks measured Opus 4.6’s accuracy dropping from 83.3% to 68.3%. Anthropic’s ranking among production coding models fell from second to tenth. The raw API, hitting the model directly without the Claude Code layer, was unaffected throughout.

The model was fine. What surrounded it wasn’t.

This is the clearest demonstration I’ve seen of something the research paper makes explicit but most producers haven’t internalised. The surrounding system is not a wrapper around capability. It is a significant portion of the capability itself. Change it and you change the outcome, whether or not a single model weight has shifted.

The ratio producers already know

Here’s what experienced producers will recognise in that 98.4%.

The best producers I’ve worked with don’t spend most of their time making calls. They spend it building the conditions under which good calls get made reliably, by the right people, with the right information, at the right point in the process. Milestone structures. Escalation paths. The discipline of deciding what information reaches the team, and when, and in what form. Permission models that specify who can approve what without asking upward. Recovery procedures for when things go sideways, so the project doesn’t stall waiting for a human to unblock something that could have been handled automatically.

That’s the 98.4%. It just doesn’t have that name in game production.

The producers who struggle with AI workflows are often the ones who have the ratio backwards. They build decision scaffolding. They prompt-engineer every interaction. They sit in the loop for each action, approving, redirecting, correcting. They’re doing the 1.6% work while leaving the 98.4% unbuilt. The result is an AI that feels unreliable, because it is. The infrastructure isn’t there to make it reliable.

The model will do what models do. Everything else is yours to build.

A production team works the same way. The producer who has to approve every asset, attend every standup, and unblock every dependency hasn’t built a production system. They’ve built a bottleneck with good intentions. The work still gets done. It just goes through them every time, which means it only moves as fast as they can clear the queue.

The fix, in both cases, is the same. Build the structure that lets the right output happen at the right level without constant intervention. Then get out of the way.

The thing that changed for me

The lightbulb moment, for me, was not about capability. It was about something more specific.

For most of my career, getting work done well has involved other people. That’s mostly a feature. Cross-discipline experience improves the work. Different perspectives catch things you’d miss. But working through other people also means negotiating assumptions. Someone arrives with ideas about how it should be done, formed at a previous studio. Someone else defaults to the format they’re comfortable with rather than the one you specified. You ask for X, you get something adjacent to X, shaped by whoever made it and what they thought you probably meant.

With a well-built surrounding system, that stops happening.

The first time I ran a spec authoring workflow where the output matched exactly what I’d specified, with no interpretive drift and no unasked-for opinions about how they used to do it somewhere else, I found it disorienting. The deviations that did exist were mine to fix, not someone else’s to defend. There was no negotiation. No “I thought you meant...” No prior studio habits quietly baked into the output.

I value other people’s input. Most of the time it makes the work better. But sometimes you want what you want, without it going through someone else’s prior experience first. A well-built surrounding system gives you that option.

What made it possible was not the model. It was what I’d put around it: context documents that specified the format precisely, scope limits that kept the agent inside the brief, output standards that left nothing to interpretation. The agent wasn’t guessing at what I wanted. It was executing against a spec. The spec was mine, unmediated.

That’s the operational infrastructure. The model just ran inside it.

What to actually build

The research paper identifies several layers in Claude Code’s surrounding system: context assembly, permission rules, memory hierarchy, tool routing, and a five-stage compaction pipeline that manages what the model sees at any given moment. For producers building their own agent workflows, the equivalent work looks like this.

Context architecture. What does the agent know, and in what order does it learn it? Your GDD, naming conventions, milestone structure, output format standards, all of it needs deliberate structure. The agent needs to know which document is authoritative and which is reference material. This is information architecture work, and producers already do it. The difference is the audience.
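One way to picture context architecture is as a small, deterministic assembly step that runs before the agent is called. The sketch below is an assumption-laden illustration, not a prescribed format: the `ContextDoc` type, the labels, and the example documents are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextDoc:
    name: str
    text: str
    authoritative: bool  # True: this document wins any conflict


def assemble_context(docs: List[ContextDoc]) -> str:
    """Put authoritative documents first and label every document's role."""
    # sorted() is stable, so documents keep their given order within each group.
    ordered = sorted(docs, key=lambda d: not d.authoritative)
    sections = []
    for doc in ordered:
        label = "AUTHORITATIVE" if doc.authoritative else "REFERENCE"
        sections.append(f"[{label}] {doc.name}\n{doc.text}")
    return "\n\n".join(sections)


docs = [
    ContextDoc("Old pitch deck", "Early ideas, partly superseded.", False),
    ContextDoc("GDD", "Current design decisions.", True),
    ContextDoc("Naming conventions", "Asset and file naming rules.", True),
]
print(assemble_context(docs))
```

Nothing here is clever. The value is that the order and the authority of each document are decided once, in one place, instead of being re-explained in every prompt.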

Scope and permissions. What can the agent touch, and what is it not permitted to touch? This is a quality question as much as a safety one. An agent without clear scope fills gaps with assumptions. Clear boundaries keep the output inside the territory you’ve defined. Scope that’s too broad produces work that’s technically correct and contextually wrong.
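A scope rule can be as simple as a check that runs before any proposed write is carried out. This is a minimal sketch assuming the agent proposes relative file paths; the folder names and the `in_scope` helper are hypothetical, and a real setup would likely need more than path checks.

```python
from pathlib import PurePosixPath
from typing import List

# Hypothetical folders the agent is allowed to write to.
ALLOWED_ROOTS = [PurePosixPath("docs/specs"), PurePosixPath("docs/drafts")]


def in_scope(path: str, roots: List[PurePosixPath] = ALLOWED_ROOTS) -> bool:
    """Return True only if the path sits inside one of the allowed folders."""
    target = PurePosixPath(path)
    # Reject absolute paths and any attempt to climb out with "..".
    if target.is_absolute() or ".." in target.parts:
        return False
    return any(root == target or root in target.parents for root in roots)


for proposed in ["docs/specs/combat.md", "build/pipeline.yml", "docs/specs/../../notes.txt"]:
    verdict = "allow" if in_scope(proposed) else "deny"
    print(f"{verdict}: {proposed}")
```

The agent never gets to argue with this check. It proposes, the rule decides, and anything outside the brief is refused before it becomes work you have to undo.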

Output standards. What does good look like? Not in general terms. For this deliverable, in this workflow, at this stage. Agents don’t have taste. They have context. Give them enough of it and they’ll execute reliably. Give them too little and they’ll interpolate from training data that may not match what you actually want.

Recovery procedures. What happens when the output is wrong? The question most producers skip is not how to correct it manually but how the system catches it without your manual intervention. This is the difference between a workflow and a process you have to babysit.
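Output standards and recovery fit together naturally: a check defines what good looks like, and a loop decides what happens when the check fails. The sketch below assumes a spec must contain three named sections; the section list, `run_with_recovery`, and the stand-in `fake_agent` are all hypothetical, and a real check would be stricter than a substring match.

```python
from typing import Callable, Dict, List

# Hypothetical output standard: sections every spec must contain.
REQUIRED_SECTIONS = ["Summary", "Scope", "Acceptance criteria"]


def find_problems(output: str) -> List[str]:
    # Deterministic check, no model involved.
    return [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in output]


def run_with_recovery(generate: Callable[[List[str]], str], max_attempts: int = 3) -> Dict:
    """Retry with the check's feedback; escalate to a human only when retries run out."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)
        feedback = find_problems(output)
        if not feedback:
            return {"status": "ok", "attempts": attempt, "output": output}
    return {"status": "escalate", "attempts": max_attempts, "problems": feedback}


def fake_agent(feedback: List[str]) -> str:
    # Stand-in for a model call: forgets a section until told about it.
    draft = "Summary\nScope\n"
    if feedback:
        draft += "Acceptance criteria\n"
    return draft


result = run_with_recovery(fake_agent)
print(result["status"], result["attempts"])
```

The human only appears in the `escalate` branch. Everything before that is the system catching and correcting its own output.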

None of this is prompting. It’s system design. The prompting is the 1.6%.
