The hidden cost of reasoning models
The first time you swap a chat model for a reasoning model, the answer quality jumps so noticeably it feels like cheating. The second time you look at the bill, it feels less like cheating and more like a tax. This post is about the tax.
What a reasoning model actually charges you for
Reasoning models — OpenAI's o-series, Claude with extended thinking enabled, Gemini Thinking — generate two streams of tokens per request:
- Reasoning tokens: a hidden internal monologue where the model works through the problem. You don't see them. The API may or may not surface a summary.
- Response tokens: the visible output your user reads.
Both are billed at the regular output rate. At medium or high effort, the reasoning stream is usually 3–20× larger than the response.
The math, by effort level
Most reasoning models expose an effort control. OpenAI's is the reasoning_effort parameter (low, medium, high); other providers set a thinking-token budget instead. Defaults vary. Real numbers from a typical "explain this code change" query:
| Effort | Reasoning tokens | Response tokens | Total output | Cost @ $10/1M |
|---|---|---|---|---|
| low | 180 | 220 | 400 | 0.4¢ |
| medium | 1,200 | 250 | 1,450 | 1.45¢ |
| high | 3,800 | 280 | 4,080 | 4.08¢ |
At high, you pay roughly 15× what the visible response alone would cost (4,080 billed tokens against 280 visible ones). Across a workload of 10,000 calls/day, the difference between low and high is 3.68¢ × 10,000 = $368/day, or about $11,000/month, for the same model on the same question.
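The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch, assuming the flat $10 per 1M output-token rate from the table and a 30-day month; `costCents` and `monthlyDollars` are illustrative helper names, not part of any SDK.

```typescript
// Assumed flat output price, taken from the table: $10 per 1M tokens.
const PRICE_PER_MILLION_USD = 10;

// Cost of one call in cents. Reasoning and response tokens bill at the same rate.
function costCents(reasoningTokens: number, responseTokens: number): number {
  const totalTokens = reasoningTokens + responseTokens;
  return ((totalTokens * PRICE_PER_MILLION_USD) / 1e6) * 100;
}

// Monthly spend in dollars, assuming a 30-day month.
function monthlyDollars(centsPerCall: number, callsPerDay: number): number {
  return (centsPerCall * callsPerDay * 30) / 100;
}

const low = costCents(180, 220);
const high = costCents(3800, 280);
const gap = monthlyDollars(high - low, 10000);
console.log(low.toFixed(2), high.toFixed(2), Math.round(gap));
```

Swapping in your own token counts and call volume gives the monthly gap between any two effort levels.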
How to measure your actual consumption
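Reasoning-capable APIs generally report how many tokens went to reasoning. OpenAI's Chat Completions API returns usage.completion_tokens_details.reasoning_tokens, and its Responses API returns usage.output_tokens_details.reasoning_tokens; other providers have their own equivalents. Log it for every call. The shape of your reasoning-to-response distribution is the single most useful input for capacity planning.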
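const response = await openai.responses.create({ ... });
const { output_tokens } = response.usage;
const reasoning = response.usage.output_tokens_details?.reasoning_tokens ?? 0;
// output_tokens already includes reasoning tokens, so subtract to get the visible part.
const visible = output_tokens - reasoning;
// Skip the ratio when there is no visible output, to avoid dividing by zero.
if (visible > 0) metrics.histogram('llm.reasoning_ratio', reasoning / visible);
// Cost in cents at the $10 per 1M output-token rate used in the table above.
metrics.counter('llm.reasoning_cost_cents', reasoning * 10 / 1e6 * 100);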
Within a week you'll have a clear picture of where reasoning is paying off and where it's just spinning. Then route accordingly.
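Once the per-call numbers are logged, a per-route summary shows where reasoning dominates the bill. What follows is a minimal sketch, assuming each log record carries a hypothetical `route` label plus the two token counts. The record shape, the helper names, and the nearest-rank percentile are illustrative choices, not something any metrics backend prescribes.

```typescript
// Hypothetical shape of one logged call; field names are assumptions.
interface CallRecord {
  route: string;
  reasoningTokens: number;
  outputTokens: number; // total billed output, reasoning included
}

interface RatioSummary {
  p50: number;
  p95: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

// Median and p95 reasoning-to-response ratio for each route.
function summarize(records: CallRecord[]): Record<string, RatioSummary> {
  const byRoute: Record<string, number[]> = {};
  for (const r of records) {
    const visible = r.outputTokens - r.reasoningTokens;
    if (visible <= 0) continue; // skip calls with no visible response
    if (!byRoute[r.route]) byRoute[r.route] = [];
    byRoute[r.route].push(r.reasoningTokens / visible);
  }
  const summary: Record<string, RatioSummary> = {};
  for (const route of Object.keys(byRoute)) {
    const ratios = byRoute[route].slice().sort((a, b) => a - b);
    summary[route] = { p50: percentile(ratios, 50), p95: percentile(ratios, 95) };
  }
  return summary;
}
```

A route whose median ratio is high but whose answers are no better than a chat model's is a candidate for a lower effort level or a cheaper tier.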
When reasoning models are worth it
- Math, formal logic, planning: real, measurable quality jumps. The tax pays for itself.
- Multi-step code synthesis: especially anything involving refactoring across files.
- Hard extraction from unstructured inputs (legal, medical) where a wrong answer is more expensive than the call.
- Anything with an objective verifier: tests pass / they don't, the JSON validates / it doesn't.
When to stay with a regular chat model
- Chat, RAG, summarization, drafting: chat models tie or win on cost-per-acceptable-answer.
- Creative writing, marketing copy: reasoning models often produce stilted output.
- High-volume / low-stakes classification: a tiny mini model wins by 50–100×.
- Real-time interactive UX: reasoning latency is 5–30s — your user notices.
The mixed pattern that usually works
The pattern we've shipped a dozen times:
- Cheap, fast classifier reads the user input and decides difficulty.
- "Easy" → mini chat model.
- "Medium" → flagship chat model.
- "Hard" → reasoning model at medium effort.
- "Critical" (rare) → reasoning at high, with a chat-model second-opinion check.
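The routing list above can be sketched as a lookup from difficulty label to model tier. This is an illustration only: the tier names are placeholders rather than real model IDs, `classify` is a hypothetical stand-in for the cheap classifier, and the keyword rules inside it exist only so the example runs.

```typescript
type Difficulty = "easy" | "medium" | "hard" | "critical";

interface Route {
  tier: "mini-chat" | "flagship-chat" | "reasoning"; // placeholder tier names
  effort?: "medium" | "high"; // only set for the reasoning tier
  secondOpinion: boolean; // whether a chat model double-checks the answer
}

const ROUTES: Record<Difficulty, Route> = {
  easy: { tier: "mini-chat", secondOpinion: false },
  medium: { tier: "flagship-chat", secondOpinion: false },
  hard: { tier: "reasoning", effort: "medium", secondOpinion: false },
  critical: { tier: "reasoning", effort: "high", secondOpinion: true },
};

// Hypothetical stand-in for the cheap classifier. In practice this would be a
// small fine-tuned model or a zero-shot prompt; these keyword rules are a toy.
function classify(input: string): Difficulty {
  const text = input.toLowerCase();
  if (/contract|diagnosis/.test(text)) return "critical";
  if (/refactor|prove/.test(text)) return "hard";
  if (text.length > 200) return "medium";
  return "easy";
}

// Map an incoming request to the tier that should serve it.
function route(input: string): Route {
  return ROUTES[classify(input)];
}
```

Keeping the table as data rather than branching logic makes it easy to change a tier or effort level once the logged ratios show where the money goes.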
Routing accuracy of 90%+ is achievable with a small fine-tuned classifier or even a careful zero-shot prompt. The cost savings are typically 60–80% versus reaching for the reasoning model every time.
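How much routing saves depends on your traffic mix and per-tier prices, so it is worth computing for your own workload. A minimal sketch follows. Every share and per-call cost in it is a made-up placeholder for illustration, not a measurement, and `blendedCostCents` and `savingsVsBaseline` are hypothetical helpers.

```typescript
interface TierMix {
  share: number; // fraction of traffic routed to this tier (shares sum to 1)
  centsPerCall: number; // average cost per call on this tier
}

// Traffic-weighted average cost per call across all tiers.
function blendedCostCents(mix: TierMix[]): number {
  return mix.reduce((sum, t) => sum + t.share * t.centsPerCall, 0);
}

// Fractional savings of the routed mix versus sending every call to one tier.
function savingsVsBaseline(mix: TierMix[], baselineCentsPerCall: number): number {
  return 1 - blendedCostCents(mix) / baselineCentsPerCall;
}

// Placeholder numbers only; substitute your own logged shares and costs.
const exampleMix: TierMix[] = [
  { share: 0.6, centsPerCall: 0.05 }, // mini chat model
  { share: 0.25, centsPerCall: 0.4 }, // flagship chat model
  { share: 0.13, centsPerCall: 1.45 }, // reasoning at medium effort
  { share: 0.02, centsPerCall: 4.5 }, // reasoning at high effort plus second opinion
];
const saved = savingsVsBaseline(exampleMix, 1.45);
console.log(`${(saved * 100).toFixed(0)}% saved versus reasoning-for-everything`);
```

The sketch ignores the classifier's own cost and the cost of misrouted requests, so treat its output as an upper bound on real savings.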
Reasoning models are extraordinary tools and almost certainly the right default for the highest-stakes 5% of your traffic. They are also the wrong default for the other 95%. Track the ratio, route the rest.