How to cut your LLM bill by 80% in 2026
Most teams ship their first LLM feature on the default model (whatever the demo used) and discover the bill three months later, in a stand-up where nobody wants to be the one to say it. The good news: in 2026 there is more room than ever to bring that bill back down without changing what the product does. Pricing has stratified, providers ship caching, and there's a discount tier for almost every workload.
This post walks through the seven moves we've seen consistently produce 50–90% savings in production, in order of effort. Skip the ones you've already done and start where you haven't.
1. Right-size to mini / nano variants
Every major lab now ships three tiers of the same family: flagship, mini, nano. The mini is typically 10–20% of the flagship's price; the nano is 2–5%. For classification, routing, extraction, summarization, fan-out tasks, structured rewriting and 80% of "agent step" work, the mini variant is often indistinguishable from the flagship in production.
The practical move: take your top three LLM call sites, swap the model, run them on a labeled eval set of 50–100 examples. If the mini wins or ties, ship it. You will almost always see a tie.
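A minimal sketch of that comparison, assuming exact-match scoring and that you have already collected each model's outputs for the same labeled examples. The names `accuracy` and `shouldShipMini` and the 1-point tolerance are illustrative assumptions, not a recommended threshold.

```typescript
// Fraction of outputs that exactly match their labels (whitespace-trimmed).
function accuracy(outputs: string[], labels: string[]): number {
  if (labels.length === 0 || outputs.length !== labels.length) {
    throw new Error("outputs and labels must be non-empty and the same length");
  }
  let correct = 0;
  labels.forEach((label, i) => {
    if (outputs[i].trim() === label.trim()) correct++;
  });
  return correct / labels.length;
}

// "Wins or ties": the mini scores at least as well as the flagship,
// within a tolerance (assumed here: 0.01, i.e. one accuracy point).
function shouldShipMini(
  flagshipOutputs: string[],
  miniOutputs: string[],
  labels: string[],
  tolerance = 0.01,
): boolean {
  return accuracy(miniOutputs, labels) >= accuracy(flagshipOutputs, labels) - tolerance;
}
```

Exact match suits classification and extraction; for free-text tasks you would likely swap in a task-specific scorer.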
Example math from a real customer-support bot:
- Flagship at $2/$8 per 1M tokens, 5M input + 1M output per day → $18/day
- Mini at $0.15/$0.60 per 1M, same traffic → $1.35/day
- Savings: 92.5%, eval scores within 1.5 points
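The arithmetic above can be reproduced with a small helper, sketched here so you can plug in your own prices and volumes. The `Pricing` type and function names are hypothetical; the numbers are the ones from the example.

```typescript
// Prices in USD per 1M tokens; volumes in millions of tokens per day.
type Pricing = { inputPerM: number; outputPerM: number };

function dailyCost(p: Pricing, inputMTok: number, outputMTok: number): number {
  return p.inputPerM * inputMTok + p.outputPerM * outputMTok;
}

function savingsPct(before: number, after: number): number {
  return (1 - after / before) * 100;
}

const flagship: Pricing = { inputPerM: 2, outputPerM: 8 };
const mini: Pricing = { inputPerM: 0.15, outputPerM: 0.6 };

const flagshipDaily = dailyCost(flagship, 5, 1); // 5 * 2 + 1 * 8
const miniDaily = dailyCost(mini, 5, 1);         // 5 * 0.15 + 1 * 0.60

console.log(
  flagshipDaily.toFixed(2),
  miniDaily.toFixed(2),
  savingsPct(flagshipDaily, miniDaily).toFixed(1),
); // prints: 18.00 1.35 92.5
```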
2. Turn on prompt caching
Anthropic, OpenAI and Google all offer cached-input pricing in 2026. The cache hit rate on a typical RAG or system-prompted assistant is 70–95%, and cached reads are billed at 10–25% of the regular input price.
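To see what those ranges mean for the input side of the bill, here is a rough sketch. It assumes "hit rate" means the fraction of input tokens served from cache and ignores any cache-write surcharge a provider may apply; `effectiveInputMultiplier` is a hypothetical helper.

```typescript
// Blended input price as a fraction of the uncached price.
// hitRate: share of input tokens read from cache (0..1).
// cachedReadRatio: cached-read price divided by regular input price (0..1).
function effectiveInputMultiplier(hitRate: number, cachedReadRatio: number): number {
  if (hitRate < 0 || hitRate > 1 || cachedReadRatio < 0 || cachedReadRatio > 1) {
    throw new Error("both arguments must be between 0 and 1");
  }
  return (1 - hitRate) + hitRate * cachedReadRatio;
}

// Pessimistic end of the quoted ranges: 70% hits, reads at 25% of list price.
const worstCase = effectiveInputMultiplier(0.7, 0.25);  // 0.3 + 0.175 = 0.475
// Optimistic end: 95% hits, reads at 10% of list price.
const bestCase = effectiveInputMultiplier(0.95, 0.1);   // 0.05 + 0.095 = 0.145
```

Under these assumptions, input spend falls by roughly half at the pessimistic end and by about 85% at the optimistic end.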
The catch: you have to structure your prompts so the cacheable prefix is at the front, because caches match on an exact leading prefix. Put static system prompt + tool definitions + retrieved context first; put user input last. If you string-format the user message into the middle of your prompt, nothing after it caches.
// Bad: user input sits before the context, so the context never caches
const badPrompt = `System: ${SYSTEM}\nUser ${userId}: ${userInput}\n${CONTEXT}`;
// Good: stable prefix, user input at the end
const goodPrompt = `System: ${SYSTEM}\nContext:\n${CONTEXT}\n\nUser: ${userInput}`;
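One way to sanity-check a prompt layout is to build two requests with different user input and measure how much of the text they share from the start. The sketch below does that at the character level, which is only a stand-in for the token-level prefix a provider would actually match; `buildPrompt` and `commonPrefixLength` are hypothetical helpers and the sample strings are made up.

```typescript
// Static parts first, per-request parts last.
function buildPrompt(system: string, context: string, userInput: string): string {
  return `System: ${system}\nContext:\n${context}\n\nUser: ${userInput}`;
}

// Length of the leading substring that two prompts have in common.
function commonPrefixLength(a: string, b: string): number {
  const n = Math.min(a.length, b.length);
  let i = 0;
  while (i < n && a[i] === b[i]) i++;
  return i;
}

const systemText = "You are a support assistant.";
const contextText = "Refunds are processed within 5 business days.";

const p1 = buildPrompt(systemText, contextText, "Where is my refund?");
const p2 = buildPrompt(systemText, contextText, "How do I cancel?");
const stablePrefix = buildPrompt(systemText, contextText, "");

// Everything up to the user input is shared between the two requests.
console.log(commonPrefixLength(p1, p2) === stablePrefix.length); // prints: true
```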
3. Cap output tokens explicitly
Always pass max_tokens (or max_output_tokens). Where the parameter is optional, the default is effectively "whatever the model wants", and the model will occasionally decide it wants 6,000 tokens to answer "what time is it?".
Output is the expensive direction. A flagship at $10/1M output that's allowed 8K tokens can produce a single 8¢ response when a 200-token reply (0.2¢) would have been fine. If even 1% of 100K daily requests run away like that, it's about $80/day wasted, or ~$30K/year per call site.
Heuristic: estimate the longest answer your product needs, double it, cap there.
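As a rough sketch of that heuristic, the helpers below derive a cap and the worst-case cost of one response at that cap. The function names, the model name `example-flagship`, and the request shape are assumptions for illustration; the actual field is max_tokens or max_output_tokens depending on the API you call.

```typescript
// Heuristic from the text: cap at twice the longest answer the product needs.
function outputCap(longestExpectedTokens: number): number {
  return Math.ceil(longestExpectedTokens * 2);
}

// Worst-case output cost of a single response, in USD, at a given cap.
function maxOutputCostUsd(capTokens: number, outputPricePerM: number): number {
  return (capTokens / 1e6) * outputPricePerM;
}

const cap = outputCap(200);                    // 400 tokens
const cappedCost = maxOutputCostUsd(cap, 10);  // $0.004 at $10 per 1M output
const runawayCost = maxOutputCostUsd(8000, 10); // $0.08 if 8K tokens are allowed

// Hypothetical request body with the cap applied.
const cappedRequest = {
  model: "example-flagship",
  max_tokens: cap,
  messages: [{ role: "user", content: "What time is it?" }],
};
```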
4. Move async work to the Batch API
OpenAI, Anthropic and Google all run a batch tier at ~50% off, with turnaround within 24 hours. If your work isn't real-time — enrichment, classification, embeddings, overnight summaries, eval runs, dataset rewriting, document chunking — it belongs on batch.
The interface is simple: upload a JSONL of requests, poll for completion, download the results. Treat it like an SQS queue with a 50% discount.
Quick rule:
- Sync API when the user is waiting on the response (chat, generation, autocomplete)
- Batch API when a job or scheduled run is consuming it (ETL, analytics, training data prep)
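The "upload a JSONL of requests" step might look like the sketch below. The line envelope (custom_id, method, url, body) follows the shape OpenAI documents for its Batch API as far as I know; other providers use different envelopes, so treat the field names, the model name `example-mini-model`, and the `toBatchJsonl` helper as assumptions to check against your provider's reference.

```typescript
type BatchLine = {
  custom_id: string; // your own ID, used to match results back to inputs
  method: "POST";
  url: string;
  body: {
    model: string;
    max_tokens: number;
    messages: { role: string; content: string }[];
  };
};

// Turn a list of documents into one JSON request per line.
function toBatchJsonl(docs: { id: string; text: string }[], model: string): string {
  const lines = docs.map((d): BatchLine => ({
    custom_id: d.id,
    method: "POST",
    url: "/v1/chat/completions",
    body: {
      model,
      max_tokens: 200,
      messages: [{ role: "user", content: `Summarize in one sentence:\n${d.text}` }],
    },
  }));
  // JSON.stringify escapes newlines inside strings, so each request stays on one line.
  return lines.map((l) => JSON.stringify(l)).join("\n");
}
```

The resulting string is what you would write to a file and upload; polling and downloading results are provider-specific and left out here.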
5. RAG over context-stuffing
Long-context models (1M tokens) seduce you into "just paste the whole wiki." Don't. A 1M-token prompt at $1/1M input is $1 per call. At 10K calls/day, that's $10K/day, or roughly $3.65M/year, to pre-load context you mostly don't use.
Retrieval-augmented generation chunks your corpus, embeds it, and fetches the top-K relevant chunks per query. A typical RAG call uses 2–5K tokens of context, which is 0.2–0.5% of the input cost of a 1M-token prompt, usually with comparable answer quality.
Embeddings are cheap ($0.02–0.10 per 1M tokens) and a one-time cost per document. Use a small embedding model (text-embedding-3-small or equivalent) unless you've measured a real recall gap.
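The retrieval step itself is small. The sketch below ranks pre-embedded chunks by cosine similarity and keeps the top K; it assumes the embeddings were already computed by whatever embedding model you use, and the `Chunk` type and function names are hypothetical. A production system would typically use a vector index instead of a linear scan.

```typescript
type Chunk = { id: string; text: string; embedding: number[] };

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Return the k chunks most similar to the query embedding, best first.
function topK(queryEmbedding: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

Only the text of the returned chunks goes into the prompt, which is where the 2–5K-token context size comes from.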
6. Route by task complexity
Once you have eval data, build a router. A cheap classifier (mini variant, ~0.1¢ per call) reads the user's request and picks a model: "trivia / chitchat / small edit" → nano; "code or reasoning" → flagship.
Most production traffic is easy. A naive router that sends 80% of requests to nano and 20% to flagship saves 60–80%, typically with no measurable quality regression, because the 80% never needed flagship in the first place.
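A minimal sketch of such a router, plus the blended-cost arithmetic behind the savings range. The keyword check in `classify` is only a stand-in so the example runs offline; in practice that step would be the cheap classifier call. All names and the keyword list are assumptions.

```typescript
type Tier = "nano" | "flagship";

// Stand-in for the cheap classifier: flags requests that look like code or reasoning.
function classify(request: string): "easy" | "hard" {
  return /\b(code|debug|prove|plan|refactor)\b/i.test(request) ? "hard" : "easy";
}

function route(request: string): Tier {
  return classify(request) === "hard" ? "flagship" : "nano";
}

// Cost relative to sending everything to flagship (1.0 = no savings).
// easyShare: fraction of traffic routed to nano.
// nanoRatio: nano price divided by flagship price.
// routerRatio: per-request router cost as a fraction of a flagship call (assumed 0 by default).
function blendedCost(easyShare: number, nanoRatio: number, routerRatio = 0): number {
  return easyShare * nanoRatio + (1 - easyShare) + routerRatio;
}

// 80% of traffic on a nano priced at 5% of flagship: 0.04 + 0.20 = 0.24, i.e. 76% saved.
const relativeCost = blendedCost(0.8, 0.05);
```

The flagship share dominates the blended cost, so the savings depend far more on how much traffic is truly hard than on the exact nano price.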
7. Mind the reasoning-token tax
Reasoning models (o-series, Claude with thinking, Gemini Thinking) generate hidden "thinking" tokens before responding. Those count as output, billed at the regular output rate. A 200-token visible answer can hide 3,000 tokens of reasoning behind it, so you pay for 3,200 output tokens: 16× the cost of the visible answer alone.
Reasoning models are great when correctness matters more than cost (math, code, planning). For chat, classification, and creative writing, a regular chat model usually beats them on cost-per-acceptable-answer. Measure both before defaulting.
If you do use reasoning models, configure reasoning_effort (where exposed) to low or medium for most queries and reserve high for the hard ones.
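To make the tax concrete, the sketch below prices a response with and without hidden reasoning tokens and shows one possible effort policy. The $10/1M output price is an illustrative assumption, and `pickEffort` is a hypothetical helper; the accepted values and name of the effort setting vary by provider.

```typescript
// Output cost in USD when hidden reasoning tokens are billed as output.
function outputCostUsd(
  visibleTokens: number,
  reasoningTokens: number,
  outputPricePerM: number,
): number {
  return ((visibleTokens + reasoningTokens) / 1e6) * outputPricePerM;
}

const plainCost = outputCostUsd(200, 0, 10);         // 200 tokens billed
const reasoningCost = outputCostUsd(200, 3000, 10);  // 3,200 tokens billed
const multiplier = reasoningCost / plainCost;        // 3,200 / 200 = 16

type Effort = "low" | "medium" | "high";

// Hypothetical policy: keep effort low by default, escalate only for queries
// your router or heuristics have flagged as hard.
function pickEffort(isHard: boolean): Effort {
  return isHard ? "high" : "low";
}
```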
Where to start
If you do nothing else this quarter:
- Find your top 3 call sites by spend (your provider dashboard shows this).
- Swap each one to the mini variant and run a quick eval.
- Turn on prompt caching on the largest one.
- Cap max_tokens everywhere.
This alone is usually 60–80% off. Compounding the rest gets you to 90%+. We've seen six-figure annual bills drop to five figures without any product changes.
Cross-check your numbers against the live pricing list — it updates every 6 hours — and use the calculator to plug in your real traffic shape.