GPT-5 vs Claude Sonnet 4.6 vs Gemini 2.5: real-world cost comparison

Frontier-tier pricing in 2026 has converged in a strange way: the three big labs charge per-token rates within a narrow band of each other, but the token shape of each model is wildly different. A model that ties for the cheapest list price can still cost more in production, because it talks more, thinks more, or needs longer context to do the same job.

So instead of comparing $/1M tokens directly, we ran three real production tasks across all three frontier models and measured total cost per request. Same prompts, same eval harness, real traffic shape.

The contenders

Pricing from the live AI Fees data, per 1M tokens:

Model	Input ($/1M)	Output ($/1M)	Context (tokens)
OpenAI GPT-5	$1.25	$10	400K
Anthropic Claude Sonnet 4.6	$3	$15	200K
Google Gemini 2.5 Pro	$1.25	$10	2M

On paper, GPT-5 and Gemini 2.5 Pro tie on per-token rate, with Claude Sonnet 2.4× more expensive on input and 1.5× more expensive on output. That's not the whole story.

Task 1: Customer-support summarization

2,500 real support tickets, summarized to a 3-sentence resolution note for the CRM. Single-turn, structured output.

Model	Avg input (tokens)	Avg output (tokens)	Cost/call	Quality
GPT-5	1,200	95	0.245¢	4.6 / 5
Claude Sonnet 4.6	1,200	110	0.525¢	4.7 / 5
Gemini 2.5 Pro	1,200	140	0.290¢	4.4 / 5

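GPT-5 is the most concise and the cheapest. Claude is nearly as concise (110 output tokens against GPT-5's 95) but charges the highest rate, so it costs a little over twice as much per call. Gemini talks the most: it generated 47% more output tokens than GPT-5 for the same task, which erodes its per-token price parity.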
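The cost column is plain arithmetic on the pricing table and the average token counts, so it is easy to reproduce for your own traffic. A minimal sketch in Python follows; the function and variable names are ours for illustration, and the only inputs are the numbers from the two tables above.

```python
# Prices in USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "GPT-5": {"input": 1.25, "output": 10.00},
    "Claude Sonnet 4.6": {"input": 3.00, "output": 15.00},
    "Gemini 2.5 Pro": {"input": 1.25, "output": 10.00},
}


def cost_per_call_cents(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost of one request in US cents."""
    price = PRICES[model]
    usd = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return usd * 100


# Average (input, output) token counts from the Task 1 table.
TASK1 = {
    "GPT-5": (1_200, 95),
    "Claude Sonnet 4.6": (1_200, 110),
    "Gemini 2.5 Pro": (1_200, 140),
}

for model, (inp, out) in TASK1.items():
    print(f"{model}: {cost_per_call_cents(model, inp, out):.3f} cents per call")
```

Swapping in your own average token counts is usually enough to see whether output length or input length dominates your bill.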

Winner: GPT-5 — cheapest with near-tied quality. The mini variants would beat all three on cost; check the cost-cutting playbook first.

Task 2: Codebase Q&A over a 50K-token repo

200 questions over a real TypeScript project. Full repo loaded as context, single-turn Q&A.

Model	Avg input (tokens)	Avg output (tokens)	Cost/call	Quality
GPT-5	50,000	350	6.6¢	4.3 / 5
Claude Sonnet 4.6	50,000	320	15.5¢	4.8 / 5
Gemini 2.5 Pro	50,000	420	6.7¢	4.4 / 5

This is where Claude earns its premium. On code understanding, it scores 0.4 to 0.5 points higher than the other two on a 5-point scale, which compounds across iterations. If you're paying engineers $100/hour, the extra 9¢ per query (15.5¢ against 6.6¢) is a rounding error.
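To put the premium in engineer-time terms, the sketch below converts the per-query price gap into seconds of salary at the $100/hour figure used above. The helper name is ours, and it counts only direct salary cost, which is a simplifying assumption.

```python
ENGINEER_RATE_USD_PER_HOUR = 100.0  # figure used in the text
CLAUDE_CENTS_PER_CALL = 15.5        # Task 2 table
GPT5_CENTS_PER_CALL = 6.6           # Task 2 table


def breakeven_seconds(extra_cents: float, hourly_rate_usd: float) -> float:
    """Seconds of engineer time that cost the same as the per-query premium."""
    usd_per_second = hourly_rate_usd / 3600
    return (extra_cents / 100) / usd_per_second


premium_cents = CLAUDE_CENTS_PER_CALL - GPT5_CENTS_PER_CALL
seconds = breakeven_seconds(premium_cents, ENGINEER_RATE_USD_PER_HOUR)
print(f"premium: {premium_cents:.1f} cents per query")
print(f"break-even: {seconds:.1f} seconds of engineer time per query")
```

Under these assumptions, the more expensive model pays for itself if a better answer saves an engineer a few seconds per query.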

Winner: Claude Sonnet 4.6 for quality-critical code tasks. For volume-grade code search, GPT-5 wins on cost.

Don't sleep on caching here. With prompt caching enabled, the 50K-token repo prefix is cached. The cost-per-call drops 70–80% for all three models, which doesn't change the ranking but does change the absolute math.
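How large the drop is depends on each provider's cached-input discount, which the pricing table above does not list. The sketch below treats that discount as a parameter; the 80% value is a hypothetical placeholder rather than any provider's published rate, and the model ignores cache-write surcharges and the first uncached call.

```python
def call_cost_cents(
    cached_tokens: int,
    fresh_tokens: int,
    output_tokens: int,
    input_price: float,      # USD per 1M input tokens
    output_price: float,     # USD per 1M output tokens
    cached_discount: float,  # fraction taken off input_price for cached tokens (assumption)
) -> float:
    """Cost of one call in cents when part of the prompt is served from cache."""
    cached_price = input_price * (1 - cached_discount)
    usd = (
        cached_tokens * cached_price
        + fresh_tokens * input_price
        + output_tokens * output_price
    ) / 1_000_000
    return usd * 100


# GPT-5 on Task 2: 50,000 input tokens and 350 output tokens per call.
# For simplicity the whole input is treated as the cacheable repo prefix.
uncached = call_cost_cents(0, 50_000, 350, 1.25, 10.0, cached_discount=0.0)
cached = call_cost_cents(50_000, 0, 350, 1.25, 10.0, cached_discount=0.80)  # hypothetical discount
drop = 1 - cached / uncached
print(f"uncached: {uncached:.2f} cents, cached: {cached:.2f} cents, drop: {drop:.0%}")
```

Because output tokens are never cached, the saving shrinks as answers get longer relative to the prefix.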

Task 3: Long-document analysis (1M tokens)

Synthesizing across a 1M-token corpus (legal contracts). This isolates the context-window advantage.

Model	Approach	Cost/call	Quality
GPT-5	3-chunk pipeline (up to 400K each)	$1.51	4.1 / 5
Claude Sonnet 4.6	5-chunk pipeline (200K each)	$2.99	4.0 / 5
Gemini 2.5 Pro	Single 1M-token call	$1.25	4.4 / 5

Gemini's 2M context window is the killer feature here. Single-call analysis avoids the inter-chunk synthesis errors that drag the other two down.

Winner: Gemini 2.5 Pro for any task above ~300K tokens of context.
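The chunk counts in the table follow from dividing the corpus size by each model's context window and rounding up. A minimal sketch, assuming chunks may fill the entire window; a real pipeline would reserve headroom for instructions and output, which could raise the count.

```python
import math

# Context windows in tokens, from the pricing table.
CONTEXT_WINDOWS = {
    "GPT-5": 400_000,
    "Claude Sonnet 4.6": 200_000,
    "Gemini 2.5 Pro": 2_000_000,
}


def chunks_needed(corpus_tokens: int, window_tokens: int) -> int:
    """Minimum number of calls needed to cover the corpus once."""
    return math.ceil(corpus_tokens / window_tokens)


CORPUS_TOKENS = 1_000_000
for model, window in CONTEXT_WINDOWS.items():
    print(f"{model}: {chunks_needed(CORPUS_TOKENS, window)} call(s)")
```

Every count above 1 also implies an extra synthesis step to merge the per-chunk results, which is where the quality loss described above comes from.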

Putting it together

The framing "which frontier model is cheapest" has the wrong unit. The right unit is cost per acceptable answer for your task. Across our three workloads:

Short structured tasks: GPT-5 wins on cost, with all three within 0.3 points on quality.
Code / reasoning tasks: Claude wins on quality, often worth the premium when humans are downstream.
Long-context tasks: Gemini wins by avoiding chunking entirely.
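One way to make "cost per acceptable answer" concrete is to divide cost per call by the fraction of answers that pass your own acceptance bar. The sketch below assumes rejected answers are simply retried or discarded. We did not measure acceptance rates in these tests, so the rates shown are hypothetical placeholders to replace with your own eval results.

```python
def cost_per_acceptable_answer(cost_per_call_cents: float, acceptance_rate: float) -> float:
    """Expected spend in cents to obtain one answer that passes review."""
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    return cost_per_call_cents / acceptance_rate


# Cost per call comes from the Task 2 table.
# The acceptance rates are HYPOTHETICAL and only illustrate the calculation.
candidates = {
    "GPT-5": (6.6, 0.80),
    "Claude Sonnet 4.6": (15.5, 0.95),
}

for model, (cents, rate) in candidates.items():
    effective = cost_per_acceptable_answer(cents, rate)
    print(f"{model}: {effective:.2f} cents per acceptable answer")
```

This version counts only API spend. Adding the human cost of catching and fixing each rejected answer is what typically moves the comparison toward the higher-quality model.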

None of this matters if you're not splitting traffic. The biggest cost win on most LLM products is not picking the right frontier model — it's recognizing which queries don't need a frontier model at all. See the cost-cutting playbook for the routing pattern.
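As a rough illustration of traffic splitting, the sketch below routes each request using the three findings above. It is not the playbook's routing pattern: the `Request` fields, the `needs_frontier` flag (assumed to come from some upstream difficulty check), and the `"small-model"` placeholder are all hypothetical.

```python
from dataclasses import dataclass

LONG_CONTEXT_THRESHOLD = 300_000  # the ~300K guideline from Task 3


@dataclass
class Request:
    kind: str             # e.g. "summarize", "code", "analysis"
    context_tokens: int
    needs_frontier: bool  # hypothetical output of an upstream difficulty check


def pick_model(req: Request) -> str:
    """Choose a model tier for one request."""
    # Very long inputs go to the model that can take them in a single call.
    if req.context_tokens > LONG_CONTEXT_THRESHOLD:
        return "Gemini 2.5 Pro"
    # Easy requests skip the frontier tier entirely.
    if not req.needs_frontier:
        return "small-model"  # placeholder for whichever cheaper tier you use
    # Quality-critical code work goes to the model that scored highest on Task 2.
    if req.kind == "code":
        return "Claude Sonnet 4.6"
    # Everything else defaults to the cheapest frontier option from Task 1.
    return "GPT-5"


print(pick_model(Request(kind="summarize", context_tokens=1_200, needs_frontier=False)))
print(pick_model(Request(kind="code", context_tokens=50_000, needs_frontier=True)))
print(pick_model(Request(kind="analysis", context_tokens=1_000_000, needs_frontier=True)))
```

The hard part in practice is the `needs_frontier` decision, which this sketch deliberately leaves to the caller.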

Current live prices are on the pricing list; tools to compare your own numbers are in the calculator.
