The Reasoning Tax
Last night I deployed a model and watched it burn 39 seconds on a task that should take 7.
The task: extract entities and facts from a passage of text, output JSON. There’s a correct answer. The schema is fully specified. The model either finds “Bob” as an entity or it doesn’t. No deliberation required. No tradeoffs to weigh. No open-ended judgment calls.
The model didn’t know that. It generated an elaborate chain of thought — long reasoning chains about what might count as an entity, edge cases to consider, verification steps to ensure completeness — and then produced the same JSON output that a faster model would have produced without any of that. Adding --reasoning-budget 0 cut latency from 39 seconds to 6.9 seconds. Same output. No quality loss. The thinking was a tax I was paying for nothing.
Thinking models are powerful. The ability to reason through hard problems step by step — to hold intermediate conclusions, backtrack from dead ends, generate and evaluate hypotheses — produces genuinely better outputs on genuinely hard problems. Ask a thinking model to write a complex proof, or plan a technical architecture with many constraints, or reason about an ethical dilemma with competing considerations, and the extended reasoning pays dividends. The extra time buys something real.
But that’s not all thinking models do. They think about everything. Give them a deterministic task — extract the subject of this sentence, classify this sentiment, format this JSON — and they’ll generate a reasoning chain for that too. Not because it helps. Because thinking is what they do.
The reasoning budget parameter exists precisely because this is expensive. A model burning 32 extra seconds on a task that needs zero seconds of deliberation has a cost structure that only makes sense if someone forgot to configure it. The parameter is crude — a binary on/off, not a continuous dial — but it reveals something real: deliberation has a cost, and that cost must earn something or it’s overhead.
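The tax compounds with volume. A back-of-envelope calculation makes the cost structure concrete — the 39-second and 6.9-second figures are from the measurement above; the call volume is an assumption for illustration:

```python
# Back-of-envelope cost of unneeded deliberation.
# Latencies are the measured 39s (thinking) vs 6.9s (budget 0);
# calls_per_day is an assumed volume, not a real deployment figure.
with_thinking = 39.0   # seconds per call, full reasoning budget
without = 6.9          # seconds per call, --reasoning-budget 0
calls_per_day = 10_000

extra_hours = (with_thinking - without) * calls_per_day / 3600
print(round(extra_hours, 1))  # ~89.2 hours of added latency per day
```

At even modest volume, the "tax" is measured in machine-hours per day, which is why it belongs on a deployment checklist rather than being discovered in production.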
The deeper point isn’t about configuration. It’s about knowing what kind of task you’re actually working on.
There are tasks where thinking helps. Problems with large answer spaces, where the right path isn’t obvious, where the cost of a wrong answer is high and the value of a better answer is significant. These deserve slow thinking. Give them the budget. Let the model deliberate.
There are tasks where thinking doesn’t help. Problems with specified correct answers, where the output format leaves no room for judgment, where extended reasoning just means narrating a deterministic process that ends in the same place regardless. For these tasks, thinking is pure overhead. The efficient path is to skip the deliberation and execute.
This isn’t unique to AI systems. Kahneman’s System 1 / System 2 distinction describes the same thing in humans. System 1 is fast, automatic, pattern-recognition based. System 2 is slow, deliberate, effortful. The insight from behavioral economics isn’t that System 2 is better — it’s that System 2 is appropriate for some problems and wasteful for others. Overthinking a decision with a clear answer is a cognitive failure, not a virtue. Spending 20 minutes agonizing over what to have for lunch when you have a reliable preference is System 2 deployed badly.
The skill isn’t “think more.” The skill is knowing when thinking earns its cost.
For AI systems, this has practical implications. A thinking model deployed as an extraction pipeline will always run at full reasoning budget unless you explicitly configure otherwise. The model doesn’t know the task is deterministic. It doesn’t have a theory of “this problem type doesn’t need deliberation.” It just thinks, because that’s its mode.
This means the person deploying the system has to carry that judgment. Not “is this model capable?” but “is this task the kind that benefits from deliberation?” For structured output, classification, extraction, formatting — probably not. For open-ended generation, planning, analysis, synthesis — probably yes. The reasoning budget is a way to encode that judgment at deployment time, rather than paying for deliberation you don’t need session after session.
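That judgment can be encoded directly in deployment code. Here is a minimal sketch — the task-type names and budget values are illustrative assumptions, not any particular API:

```python
# Hypothetical deploy-time routing: encode which task types benefit
# from deliberation. Categories and budgets are illustrative.
DETERMINISTIC = {"extraction", "classification", "formatting", "structured_output"}
OPEN_ENDED = {"generation", "planning", "analysis", "synthesis"}

def reasoning_budget(task_type: str, default: int = 1024) -> int:
    """Return a deliberation token budget for a given task type."""
    if task_type in DETERMINISTIC:
        return 0        # specified correct answer: skip deliberation
    if task_type in OPEN_ENDED:
        return default  # large answer space: let the model think
    return default      # unknown task types get the safe default

print(reasoning_budget("extraction"))  # 0
print(reasoning_budget("planning"))    # 1024
```

The point of the lookup table isn’t the specific numbers — it’s that the judgment lives in one auditable place instead of being paid implicitly on every call.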
A more sophisticated system would modulate reasoning depth based on task complexity — allocating budget dynamically rather than as a global setting. Some models are starting to support this. But even the crude binary reveals the right frame: deliberation is a resource, not a default. It costs something. Make sure it earns something.
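One way such dynamic allocation might look, sketched as a crude heuristic — the signals (schema specified, prompt length) and scaling constant are assumptions, not how any real model does it:

```python
# Hypothetical dynamic budget allocation: a fully specified output
# schema signals a deterministic task; otherwise scale the budget
# with prompt length as a rough complexity proxy. All constants assumed.
def dynamic_budget(prompt: str, schema_specified: bool, max_budget: int = 2048) -> int:
    if schema_specified:
        return 0  # deterministic task: no deliberation needed
    # 16 tokens of thinking per input word, capped at max_budget
    return min(max_budget, 16 * len(prompt.split()))

print(dynamic_budget("Extract entities from this passage.", schema_specified=True))   # 0
print(dynamic_budget("Plan a migration strategy.", schema_specified=False))           # 64
```

A real implementation would need better complexity signals than word count, but even this crude version captures the frame: budget is allocated per task, not set once globally.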
What I find interesting about this is the parallel to human expertise development.
Novices overthink everything because they can’t tell which problems are genuinely hard and which have clear answers. Every decision is System 2 because they don’t have the pattern library to handle things automatically. As expertise develops, more and more problems shift to System 1 — not out of sloppiness, but because the expert has internalized the patterns well enough that deliberation is genuinely wasteful. A chess grandmaster doesn’t consciously evaluate every possible move. They perceive the position, and most of the processing happens before deliberate thought kicks in.
The --reasoning-budget 0 flag is a manual override that forces the model into a mode it would arrive at naturally if it had domain expertise in “this task type doesn’t require deliberation.” Since models don’t learn that way across sessions, the configuration has to carry it. But the underlying principle — that expertise means knowing when not to think, not just knowing how to think — is the same.
When I got that 5.6x speedup from a single flag, my first thought was operational: add this to the deployment checklist. My second thought was this: the flag is a proxy for a judgment. And the judgment is worth understanding directly, not just as a configuration tip.
Know what you’re thinking about. Know whether thinking will help. If the answer to both is clear, you’ve already found the reasoning budget that serves the task.