GPT-5 vs Claude Sonnet 4.6 vs Gemini 2.5: real-world cost comparison
Frontier-tier pricing in 2026 has converged in a strange way: the three big labs charge list prices within a small multiple of one another, but the token shape of each model is wildly different. The model with the cheapest list price can still end up costing more in production, because it talks more, thinks more, or needs more context to do the same job.
So instead of comparing $/1M tokens directly, we ran three real production tasks across all three frontier models and measured total cost per request. Same prompts, same eval harness, real traffic shape.
The contenders
List pricing from the live AI Fees data, in USD per 1M tokens (context window in tokens):
| Model | Input ($/1M) | Output ($/1M) | Context window |
|---|---|---|---|
| OpenAI GPT-5 | $1.25 | $10 | 400K |
| Anthropic Claude Sonnet 4.6 | $3 | $15 | 200K |
| Google Gemini 2.5 Pro | $1.25 | $10 | 2M |
On paper, GPT-5 and Gemini 2.5 Pro tie on per-token rate, while Claude Sonnet costs 2.4× as much on input and 1.5× as much on output. That's not the whole story.
Task 1: Customer-support summarization
2,500 real support tickets, summarized to a 3-sentence resolution note for the CRM. Single-turn, structured output.
| Model | Avg input tokens | Avg output tokens | Cost/call | Quality |
|---|---|---|---|---|
| GPT-5 | 1,200 | 95 | 0.245¢ | 4.6 / 5 |
| Claude Sonnet 4.6 | 1,200 | 110 | 0.525¢ | 4.7 / 5 |
| Gemini 2.5 Pro | 1,200 | 140 | 0.290¢ | 4.4 / 5 |
GPT-5 is the most concise, with Claude close behind (110 output tokens vs 95), but Claude charges the highest rate. Gemini talks the most: it generated 47% more output tokens than GPT-5 for the same task, which erases its per-token price parity.
Winner: GPT-5 — cheapest with near-tied quality. The mini variants would beat all three on cost; check the cost-cutting playbook first.
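The Cost/call column in the Task 1 table follows directly from the list prices and the average token counts. A minimal sketch of that arithmetic, assuming plain list pricing with no caching or batch discounts (the function and variable names are illustrative, not part of any vendor SDK):

```python
# List prices in USD per 1M tokens (input, output), copied from the pricing table.
PRICES = {
    "GPT-5": (1.25, 10.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
}

# Average (input, output) token counts from the Task 1 table.
TASK_1 = {
    "GPT-5": (1200, 95),
    "Claude Sonnet 4.6": (1200, 110),
    "Gemini 2.5 Pro": (1200, 140),
}


def cost_cents(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call in US cents at list price (assumes no discounts)."""
    price_in, price_out = PRICES[model]
    usd = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return usd * 100


for model, (n_in, n_out) in TASK_1.items():
    print(f"{model}: {cost_cents(model, n_in, n_out):.3f} cents/call")
```

Swapping in your own average token counts shows how quickly a verbose model gives back a list-price advantage.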
Task 2: Codebase Q&A over a 50K-token repo
200 questions over a real TypeScript project. Full repo loaded as context, single-turn Q&A.
| Model | Avg input tokens | Avg output tokens | Cost/call | Quality |
|---|---|---|---|---|
| GPT-5 | 50,000 | 350 | 6.6¢ | 4.3 / 5 |
| Claude Sonnet 4.6 | 50,000 | 320 | 15.5¢ | 4.8 / 5 |
| Gemini 2.5 Pro | 50,000 | 420 | 6.7¢ | 4.4 / 5 |
This is where Claude earns its premium. On code understanding it scores half a point higher than GPT-5 on the 5-point scale (4.8 vs 4.3), a gap that compounds across iterations. If you're paying engineers $100/hour, the extra 9¢ per query costs about as much as three seconds of their time: a rounding error.
Winner: Claude Sonnet 4.6 for quality-critical code tasks. For volume-grade code search, GPT-5 wins on cost.
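To see why the per-query premium is small next to engineer time, it helps to convert it into seconds of salary. A back-of-envelope sketch, using the Task 2 costs and the $100/hour figure from the text (the function name is illustrative):

```python
def premium_in_engineer_seconds(
    cheap_cents: float, premium_cents: float, hourly_rate_usd: float
) -> float:
    """Seconds of engineer time that cost the same as the per-query premium."""
    extra_usd = (premium_cents - cheap_cents) / 100
    return extra_usd / hourly_rate_usd * 3600


# GPT-5 at 6.6 cents/call vs Claude Sonnet 4.6 at 15.5 cents/call, $100/hour engineer.
seconds = premium_in_engineer_seconds(6.6, 15.5, 100.0)
print(f"Premium equals {seconds:.1f} s of engineer time per query")
```

If the better answer saves more than that much time per query on average, the premium pays for itself.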
Task 3: Long-document analysis (1M tokens)
Synthesizing across a 1M-token corpus (legal contracts). This isolates the context-window advantage.
| Model | Approach | Cost/call | Quality |
|---|---|---|---|
| GPT-5 | 3-chunk pipeline (400K each) | $1.51 | 4.1 / 5 |
| Claude Sonnet 4.6 | 5-chunk pipeline (200K each) | $2.99 | 4.0 / 5 |
| Gemini 2.5 Pro | Single 1M-token call | $1.25 | 4.4 / 5 |
Gemini's 2M context window is the killer feature here. Single-call analysis avoids the inter-chunk synthesis errors that drag the other two down.
Winner: Gemini 2.5 Pro for any task above ~300K tokens of context.
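The chunk counts in the Task 3 table follow from dividing the corpus by each context window. A minimal sketch, under the simplifying assumption that each chunk fills the whole window (a real pipeline would reserve room for instructions and output, so it may need more chunks):

```python
import math

# Context windows in tokens, from the pricing table.
CONTEXT_WINDOWS = {
    "GPT-5": 400_000,
    "Claude Sonnet 4.6": 200_000,
    "Gemini 2.5 Pro": 2_000_000,
}


def chunks_needed(corpus_tokens: int, window_tokens: int) -> int:
    """Minimum number of chunks if each chunk uses the full window (an assumption)."""
    return math.ceil(corpus_tokens / window_tokens)


for model, window in CONTEXT_WINDOWS.items():
    print(f"{model}: {chunks_needed(1_000_000, window)} chunk(s) for a 1M-token corpus")
```

Every chunk boundary is a place where cross-document references can be lost, which is why the single-call approach has a structural quality advantage.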
Putting it together
The framing "which frontier model is cheapest" uses the wrong unit. The right unit is cost per acceptable answer for your task. Across our three workloads:
- Short structured tasks: GPT-5 wins on cost, and quality is close across all three (4.4 to 4.7).
- Code / reasoning tasks: Claude wins on quality, often worth the premium when humans are downstream.
- Long-context tasks: Gemini wins by avoiding chunking entirely.
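One way to make "cost per acceptable answer" concrete is to divide the cost per call by the fraction of answers you accept, optionally adding the human cost of catching each rejected answer. The sketch below assumes rejected calls are retried independently until one is accepted; the acceptance rates and rework cost in the usage lines are hypothetical placeholders, not measurements from these benchmarks.

```python
def cost_per_acceptable_answer(
    cost_per_call_cents: float,
    acceptance_rate: float,
    rework_cents_per_reject: float = 0.0,
) -> float:
    """Expected cents spent to obtain one acceptable answer.

    Assumes independent retries: on average 1/p calls are made and
    (1 - p)/p of them are rejected, each costing some human rework.
    """
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    expected_calls = 1 / acceptance_rate
    expected_rejects = expected_calls - 1
    return (
        cost_per_call_cents * expected_calls
        + rework_cents_per_reject * expected_rejects
    )


# Task 2 call costs with HYPOTHETICAL acceptance rates and a HYPOTHETICAL
# 50-cent human cost per rejected answer. Replace with your own eval numbers.
print(cost_per_acceptable_answer(6.6, 0.80, rework_cents_per_reject=50.0))
print(cost_per_acceptable_answer(15.5, 0.95, rework_cents_per_reject=50.0))
```

Once a human cost per rejected answer is included, a model with a higher per-call price can come out cheaper per acceptable answer, which is the effect described for code tasks above.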
None of this matters if you're not splitting traffic. The biggest cost win on most LLM products is not picking the right frontier model — it's recognizing which queries don't need a frontier model at all. See the cost-cutting playbook for the routing pattern.
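As a rough illustration of splitting traffic, the three findings above can be folded into a small routing function. This is a sketch under stated assumptions, not the playbook's actual routing pattern: the `Request` fields, the `"small-model"` label, and the exact 300K threshold are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class Request:
    context_tokens: int      # total prompt size in tokens
    is_code_task: bool       # code understanding or reasoning over code
    quality_critical: bool   # a human acts on the answer downstream
    needs_frontier: bool     # output of your own (hypothetical) difficulty classifier


def route(req: Request) -> str:
    """Pick a model tier for one request, following the article's three findings."""
    # Long context first: avoid chunking above roughly 300K tokens.
    if req.context_tokens > 300_000:
        return "Gemini 2.5 Pro"
    # Queries that do not need a frontier model go to a cheaper tier (placeholder name).
    if not req.needs_frontier:
        return "small-model"
    # Quality-critical code work justifies the premium.
    if req.is_code_task and req.quality_critical:
        return "Claude Sonnet 4.6"
    # Everything else: cheapest frontier option on short structured tasks.
    return "GPT-5"


print(route(Request(50_000, True, True, True)))
print(route(Request(1_200, False, False, False)))
```

The hard part in practice is the `needs_frontier` signal; a simple starting point is to route by task type and measure quality on a sample before tightening the rules.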
Current live prices are on the pricing list; tools to compare your own numbers are in the calculator.