The hidden cost of reasoning models

The first time you swap a chat model for a reasoning model, the answer quality jumps so noticeably it feels like cheating. The second time you look at the bill, it feels less like cheating and more like a tax. This post is about the tax.

What a reasoning model actually charges you for

Reasoning models — OpenAI's o-series, Claude with extended thinking enabled, Gemini Thinking — generate two streams of tokens per request:

  • Reasoning tokens: a hidden internal monologue where the model works through the problem. You don't see them. The API may or may not surface a summary.
  • Response tokens: the visible output your user reads.

Both are billed at the regular output rate. The reasoning stream is usually 3–20× larger than the response.

The real-world ratio. Across ~500 production calls we instrumented, the median reasoning-to-response ratio was 8.5×. P95 was 23×. Models that look cheapest on a per-token price list can carry the largest hidden multiplier, so compare them on total output tokens per call, not on list price alone.

The math, by effort level

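Most reasoning models expose an effort control. On OpenAI's o-series it is the reasoning_effort knob (low, medium, high); Claude and Gemini take a thinking-token budget instead. Defaults vary. Real numbers from a typical "explain this code change" query: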

Effort    Reasoning tokens    Response tokens    Total output    Cost @ $10/1M output tokens
low       180                 220                400             0.4¢
medium    1,200               250                1,450           1.45¢
high      3,800               280                4,080           4.08¢

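At high, you are billed for about 14.6× the tokens the visible response contains (4,080 vs. 280), and the call costs roughly 10× what it does at low. Across a workload of 10,000 calls/day, the difference between low and high is about $11,000/month (3.68¢ per call × 10,000 calls × 30 days), for the same model on the same question.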
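The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch using the token counts from the table and the same illustrative rate of $10 per 1M output tokens; the function names and the 30-day month are assumptions made for the example.

```typescript
// Illustrative rate from the table above: $10 per 1M output tokens.
const USD_PER_MILLION_OUTPUT = 10;

interface EffortSample {
  reasoningTokens: number;
  responseTokens: number;
}

// Token counts copied from the table above.
const samples: Record<"low" | "medium" | "high", EffortSample> = {
  low: { reasoningTokens: 180, responseTokens: 220 },
  medium: { reasoningTokens: 1200, responseTokens: 250 },
  high: { reasoningTokens: 3800, responseTokens: 280 },
};

// Reasoning and response tokens are both billed as output tokens.
function costPerCallUsd(s: EffortSample, usdPerMillion: number = USD_PER_MILLION_OUTPUT): number {
  return ((s.reasoningTokens + s.responseTokens) * usdPerMillion) / 1e6;
}

// Monthly cost gap between two effort levels (assumes a 30-day month by default).
function monthlyGapUsd(a: EffortSample, b: EffortSample, callsPerDay: number, days: number = 30): number {
  return (costPerCallUsd(b) - costPerCallUsd(a)) * callsPerDay * days;
}
```

Calling `monthlyGapUsd(samples.low, samples.high, 10000)` gives the monthly gap between low and high for a 10,000 calls/day workload; swap in your own token counts and rate to see how it moves.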

How to measure your actual consumption

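OpenAI's Chat Completions API returns usage.completion_tokens_details.reasoning_tokens; the Responses API returns the same count as usage.output_tokens_details.reasoning_tokens. Other providers generally expose an equivalent. Log it for every call. The shape of your reasoning-to-response distribution is the single most useful input for capacity planning.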

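const response = await openai.responses.create({ ... });
const { output_tokens } = response.usage;
const reasoning = response.usage.output_tokens_details?.reasoning_tokens ?? 0;
const visible = output_tokens - reasoning; // output_tokens includes reasoning tokens

// Skip the ratio when there is no visible output, to avoid dividing by zero.
if (visible > 0) {
  metrics.histogram('llm.reasoning_ratio', reasoning / visible);
}
// Reasoning cost in cents, at an illustrative $10 per 1M output tokens.
metrics.counter('llm.reasoning_cost_cents', reasoning * 10 / 1e6 * 100);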

Within a week you'll have a clear picture of where reasoning is paying off and where it's just spinning. Then route accordingly.
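Once the ratios are logged, the median and P95 are the first two numbers worth pulling out. A minimal sketch, assuming the ratios have been exported as a plain array; the `percentile` and `summarizeRatios` helpers are hypothetical names, and the nearest-rank method used here is only one of several common percentile definitions.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = values.slice().sort((a, b) => a - b);
  // Multiply before dividing so whole-number ranks are not nudged up by float error.
  const rank = Math.ceil((p * sorted.length) / 100);
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("rank out of range");
  return value;
}

// Median and P95 of the logged reasoning-to-response ratios, ignoring non-finite entries.
function summarizeRatios(ratios: number[]): { median: number; p95: number } {
  const finite = ratios.filter((r) => isFinite(r));
  return { median: percentile(finite, 50), p95: percentile(finite, 95) };
}
```

A wide gap between the median and P95 suggests a small set of queries is doing most of the spending, which is where routing tends to help most.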

When reasoning models are worth it

  • Math, formal logic, planning: real, measurable quality jumps. The tax pays for itself.
  • Multi-step code synthesis: especially anything involving refactoring across files.
  • Hard extraction from unstructured inputs (legal, medical) where a wrong answer is more expensive than the call.
  • Anything with an objective verifier: tests pass / they don't, the JSON validates / it doesn't.

When to stay with a regular chat model

  • Chat, RAG, summarization, drafting: chat models tie or win on cost-per-acceptable-answer.
  • Creative writing, marketing copy: reasoning models often produce stilted output.
  • High-volume / low-stakes classification: a mini model does the job at 50–100× lower cost.
  • Real-time interactive UX: reasoning latency is 5–30s — your user notices.
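Cost-per-acceptable-answer is worth making explicit, since the first bullet leans on it. The post does not give a formula, so the definition below is an assumption: cost per call divided by the fraction of answers that pass review, which treats rejected answers as spend you still pay for.

```typescript
// Expected spend to obtain one acceptable answer, assuming every call costs
// the same and rejected answers are discarded or retried.
function costPerAcceptableAnswer(callCostUsd: number, acceptanceRate: number): number {
  if (acceptanceRate <= 0 || acceptanceRate > 1) {
    throw new Error("acceptanceRate must be in (0, 1]");
  }
  return callCostUsd / acceptanceRate;
}
```

With hypothetical inputs, a chat model at $0.002 per call and 80% acceptance comes to $0.0025 per acceptable answer, while a reasoning model at $0.0145 per call and 95% acceptance comes to about $0.0153. The acceptance rates have to come from your own evaluation set.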

The mixed pattern that usually works

The pattern we've shipped a dozen times:

  1. Cheap, fast classifier reads the user input and decides difficulty.
  2. "Easy" → mini chat model.
  3. "Medium" → flagship chat model.
  4. "Hard" → reasoning model at medium effort.
  5. "Critical" (rare) → reasoning at high, with a chat-model second-opinion check.
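The steps above can be sketched as a small lookup table. Everything here is illustrative: the tiers mirror the list, but the model identifiers are placeholders rather than real model names, and the classifier is passed in as a function because the post does not prescribe how it is built. A production version would most likely be async.

```typescript
type Difficulty = "easy" | "medium" | "hard" | "critical";

interface Route {
  model: string; // placeholder identifier, not a real model name
  reasoningEffort?: "medium" | "high"; // set only for reasoning tiers
  secondOpinion?: boolean; // follow up with a chat-model check of the answer
}

// One entry per tier in the list above.
const ROUTES: Record<Difficulty, Route> = {
  easy: { model: "mini-chat-model" },
  medium: { model: "flagship-chat-model" },
  hard: { model: "reasoning-model", reasoningEffort: "medium" },
  critical: { model: "reasoning-model", reasoningEffort: "high", secondOpinion: true },
};

// `classify` is the cheap classifier from step 1, injected so it can be a
// fine-tuned model, a zero-shot prompt, or a stub in tests.
function routeRequest(input: string, classify: (input: string) => Difficulty): Route {
  return ROUTES[classify(input)];
}
```

Keeping the table as data rather than branching logic makes it easy to log which tier each request landed in, and to compare that against the reasoning ratios you are already tracking.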

Routing accuracy of 90%+ is achievable with a small fine-tuned classifier or even a careful zero-shot prompt. The cost savings are typically 60–80% versus reaching for the reasoning model every time.
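The savings figure depends on the traffic mix and per-tier prices, so it is worth estimating for your own workload. A minimal sketch: the per-call costs for the two reasoning tiers are the medium and high rows of the effort table earlier in the post, while the chat-model costs and all traffic shares are hypothetical placeholders.

```typescript
interface Tier {
  share: number; // fraction of traffic routed to this tier
  costPerCallUsd: number;
}

// Average cost per call for a routed mix of tiers.
function blendedCostUsd(tiers: Tier[]): number {
  let totalShare = 0;
  let cost = 0;
  for (const t of tiers) {
    totalShare += t.share;
    cost += t.share * t.costPerCallUsd;
  }
  if (Math.abs(totalShare - 1) > 1e-9) throw new Error("shares must sum to 1");
  return cost;
}

// Fractional saving of the routed mix versus sending every call to one baseline tier.
function savingsVsBaseline(tiers: Tier[], baselineCostUsd: number): number {
  return 1 - blendedCostUsd(tiers) / baselineCostUsd;
}

const exampleMix: Tier[] = [
  { share: 0.6, costPerCallUsd: 0.0005 }, // hypothetical mini chat model
  { share: 0.3, costPerCallUsd: 0.004 }, // hypothetical flagship chat model
  { share: 0.09, costPerCallUsd: 0.0145 }, // reasoning at medium (from the effort table)
  { share: 0.01, costPerCallUsd: 0.0408 }, // reasoning at high (from the effort table)
];
```

With these placeholder inputs, `savingsVsBaseline(exampleMix, 0.0145)` comes to roughly 78% against an all-reasoning-at-medium baseline. The result is only as good as the shares and prices you feed in, so rerun it with the distribution your classifier actually produces.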

Reasoning models are extraordinary tools and almost certainly the right default for the highest-stakes 5% of your traffic. They are also the wrong default for the other 95%. Track the ratio, route the rest.

Live reasoning-model prices on the pricing list; sample cost math in the calculator.
