How to cut your LLM bill by 80% in 2026

Most teams ship their first LLM feature on the default model (whatever the demo used) and discover the bill three months later, on a stand-up where nobody wants to be the one to say it. The good news: in 2026 there is more room than ever to bring that bill back down without changing what the product does. Pricing has stratified, providers ship caching, and there's a discount tier for almost every workload.

This post walks through the seven moves we've seen consistently produce 50–90% savings in production, in order of effort. Skip the ones you've already done and start where you haven't.

TL;DR. The single biggest lever is moving non-frontier work to mini/nano variants. The second is prompt caching for any system prompt longer than a few hundred tokens. Everything else compounds on those two.

1. Right-size to mini / nano variants

Every major lab now ships three tiers of the same family: flagship, mini, and nano. The mini is typically 10–20% of the flagship's price (sometimes less); the nano is 2–5%. For classification, routing, extraction, summarization, fan-out tasks, structured rewriting and 80% of "agent step" work, the mini variant is usually indistinguishable from the flagship in production.

The practical move: take your top three LLM call sites, swap the model, run them on a labeled eval set of 50–100 examples. If the mini wins or ties, ship it. You will almost always see a tie.
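One way to make "wins or ties" concrete is to score both models on the same labeled examples and compare accuracy. The sketch below is a minimal illustration, not a full eval harness: `compareOnEvalSet`, its boolean-array inputs, and the 2-point tie tolerance are all assumptions made for this example. On a 50-example set a single example is worth 2 points, so a much tighter tolerance would mostly measure noise.

```javascript
// Hypothetical helper: compare two models scored on the same labeled eval set.
// Each array holds one boolean per example (true = the model's answer was correct).
function compareOnEvalSet(flagshipCorrect, miniCorrect, tolerancePoints = 2) {
  if (flagshipCorrect.length !== miniCorrect.length || flagshipCorrect.length === 0) {
    throw new Error("Both models must be scored on the same non-empty eval set");
  }
  // Accuracy in percentage points.
  const accuracy = (results) => (100 * results.filter(Boolean).length) / results.length;
  const flagship = accuracy(flagshipCorrect);
  const mini = accuracy(miniCorrect);
  const gap = flagship - mini; // positive means the flagship scored higher
  // Treat anything within the tolerance as a tie (assumed threshold).
  return { flagship, mini, gap, shipMini: gap <= tolerancePoints };
}

const verdict = compareOnEvalSet(
  [true, true, true, false],
  [true, true, false, true]
);
console.log(verdict.shipMini); // both models got 3 of 4 right, so this is a tie
```

Grading each output as correct or not is the part this sketch leaves out; that depends on your task.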

Example math from a real customer-support bot:

Flagship at $2/$8 per 1M tokens, 5M input + 1M output per day → $18/day
Mini at $0.15/$0.60 per 1M, same traffic → $1.35/day
Savings: 92.5%, eval scores within 1.5 points
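The same arithmetic as a small function, so you can rerun it with your own prices and traffic. The function and field names are made up for this sketch; the numbers are the ones from the example above.

```javascript
// Daily cost in USD, given per-1M-token prices and daily traffic in millions of tokens.
function dailyCostUsd(price, traffic) {
  return (
    price.inputPerMTok * traffic.inputMTokPerDay +
    price.outputPerMTok * traffic.outputMTokPerDay
  );
}

const supportBotTraffic = { inputMTokPerDay: 5, outputMTokPerDay: 1 };
const flagshipDaily = dailyCostUsd({ inputPerMTok: 2, outputPerMTok: 8 }, supportBotTraffic);
const miniDaily = dailyCostUsd({ inputPerMTok: 0.15, outputPerMTok: 0.6 }, supportBotTraffic);
const savingsPct = 100 * (1 - miniDaily / flagshipDaily);

console.log(flagshipDaily.toFixed(2)); // "18.00"
console.log(miniDaily.toFixed(2));     // "1.35"
console.log(savingsPct.toFixed(1));    // "92.5"
```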

2. Turn on prompt caching

Anthropic, OpenAI and Google all offer cached-input pricing in 2026. The cache hit rate on a typical RAG or system-prompted assistant is 70–95%, and a cached read is billed at 10–25% of the regular input price.

The catch: caches match on the prompt prefix, so the cacheable part has to come first. Put the static system prompt and tool definitions at the front, then retrieved context, and the user input last. If you string-format the user message into the middle of your prompt, nothing after it caches.

// Bad: per-request values (userId, userInput) sit ahead of CONTEXT, so the cacheable prefix ends right after SYSTEM
const uncacheablePrompt = `System: ${SYSTEM}\nUser ${userId}: ${userInput}\n${CONTEXT}`;

// Good: stable prefix first, user input at the end
const cacheablePrompt = `System: ${SYSTEM}\nContext:\n${CONTEXT}\n\nUser: ${userInput}`;
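To see what a given hit rate and cached-read price do to the bill, you can blend the two rates. This is a rough sketch under stated assumptions: `effectiveInputPrice` is a hypothetical helper, `hitRate` is treated as the share of input tokens served from cache, and any surcharge a provider applies to cache writes is ignored.

```javascript
// Blended input price per 1M tokens when part of the input is read from cache.
//   basePerMTok      regular input price per 1M tokens
//   hitRate          fraction of input tokens served from cache (0..1)
//   cachedReadRatio  cached price as a fraction of the regular price (e.g. 0.1)
function effectiveInputPrice(basePerMTok, hitRate, cachedReadRatio) {
  if (hitRate < 0 || hitRate > 1) {
    throw new RangeError("hitRate must be between 0 and 1");
  }
  const cachedPart = hitRate * basePerMTok * cachedReadRatio;
  const uncachedPart = (1 - hitRate) * basePerMTok;
  return cachedPart + uncachedPart;
}

// Example values (assumed): $2 per 1M input, 80% hit rate, cached reads at 10%.
const blendedPrice = effectiveInputPrice(2, 0.8, 0.1);
console.log(blendedPrice.toFixed(2)); // "0.56", i.e. 72% off the input side
```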

3. Cap output tokens explicitly

Always pass max_tokens (or max_output_tokens). The default is "whatever the model wants" — and the model will occasionally decide it wants 6,000 tokens to answer "what time is it?".

Output is the expensive direction. A flagship at $10/1M output that's allowed 8K tokens can produce a single 8¢ response when a 200-token reply (0.2¢) would have been fine. If just 1% of 100K daily requests run away like that, that's about $80/day wasted, or ~$30K/year per call site.

Heuristic: estimate the longest answer your product needs, double it, cap there.
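A minimal sketch of that heuristic, plus the worst-case cost of a single response at a given cap. The helper names, the model name, and the request shape are placeholders for illustration; the cap field is called `max_tokens` or `max_output_tokens` depending on the API you call.

```javascript
// Heuristic from the text: longest answer the product needs, doubled.
function outputCap(longestNeededTokens) {
  return Math.ceil(longestNeededTokens * 2);
}

// Worst-case cost of one response that runs all the way to the cap.
function maxResponseCostUsd(capTokens, outputPerMTok) {
  return (capTokens / 1e6) * outputPerMTok;
}

const cap = outputCap(200); // 400 tokens

// Hypothetical request body; field names vary by provider.
const requestBody = {
  model: "your-model-name",
  max_tokens: cap,
  messages: [{ role: "user", content: "What time is it?" }],
};

console.log(requestBody.max_tokens);                  // 400
console.log(maxResponseCostUsd(8000, 10).toFixed(3)); // "0.080" (8 cents, uncapped case)
console.log(maxResponseCostUsd(cap, 10).toFixed(3));  // "0.004" (0.4 cents at the cap)
```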

4. Move async work to the Batch API

OpenAI, Anthropic and Google all run a batch tier at ~50% off, with turnaround within 24 hours. If your work isn't real-time — enrichment, classification, embeddings, overnight summaries, eval runs, dataset rewriting, document chunking — it belongs on batch.

The interface is simple: upload a JSONL of requests, poll for completion, download the results. Treat it like an SQS queue with a 50% discount.
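The JSONL part is just one JSON request per line. The sketch below builds that file content in memory. The line shape (`custom_id`, `method`, `url`, `body`) is modeled on the format OpenAI documents for its batch input; other providers use different shapes, so treat the field names, the endpoint path, and the 200-token cap as assumptions to check against your provider's reference. Uploading and polling are left out.

```javascript
// Turn a list of { id, prompt } items into JSONL, one request per line.
function toBatchJsonl(items, model) {
  return items
    .map((item) =>
      JSON.stringify({
        custom_id: item.id, // lets you match results back to inputs
        method: "POST",
        url: "/v1/chat/completions",
        body: {
          model,
          max_tokens: 200, // cap output on batch jobs too
          messages: [{ role: "user", content: item.prompt }],
        },
      })
    )
    .join("\n");
}

const jsonl = toBatchJsonl(
  [
    { id: "ticket-1", prompt: "Classify this ticket: printer is on fire" },
    { id: "ticket-2", prompt: "Classify this ticket: please reset my password" },
  ],
  "your-model-name"
);
console.log(jsonl.split("\n").length); // 2
```

Because `JSON.stringify` escapes newlines inside strings, a multi-line prompt still ends up on a single JSONL line.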

Quick rule:

Sync API when the user is waiting on the response (chat, generation, autocomplete)
Batch API when a job or scheduled run is consuming it (ETL, analytics, training data prep)

5. RAG over context-stuffing

Long-context models (1M tokens) seduce you into "just paste the whole wiki." Don't. A 1M-token prompt at $1/1M input is $1 per call. At 10K calls/day, that's $10K/day, roughly $300K/month or $3.65M/year, to pre-load context you mostly don't use.
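The arithmetic is easy to rerun with your own prompt size and traffic. A small sketch; the function name and the 30-day month are conventions chosen for this example.

```javascript
// Input-side cost of sending the same large prompt on every call.
function contextStuffingCostUsd(tokensPerCall, inputPricePerMTok, callsPerDay) {
  const perCall = (tokensPerCall / 1e6) * inputPricePerMTok;
  const perDay = perCall * callsPerDay;
  return { perCall, perDay, per30Days: perDay * 30, perYear: perDay * 365 };
}

// 1M-token prompt, $1 per 1M input tokens, 10K calls a day.
const stuffed = contextStuffingCostUsd(1e6, 1, 10000);
console.log(stuffed.perCall);   // 1
console.log(stuffed.perDay);    // 10000
console.log(stuffed.per30Days); // 300000
console.log(stuffed.perYear);   // 3650000
```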

Retrieval-augmented generation chunks your corpus, embeds it, and fetches the top-K relevant chunks per query. A typical RAG call uses 2–5K tokens of context, which gets you comparable answer quality at 0.2–0.5% of the input cost.
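The "fetch the top-K" step is small enough to sketch. The version below ranks chunks by cosine similarity in memory, using toy 2-dimensional vectors in place of real embeddings. It assumes the embeddings already exist and are non-zero; `topKChunks` and the chunk shape are hypothetical, and a production system would normally hand this job to a vector index.

```javascript
// Cosine similarity between two equal-length, non-zero vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query and keep the k most similar.
function topKChunks(queryEmbedding, chunks, k) {
  return chunks
    .map((chunk) => ({
      ...chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy corpus: real embeddings have hundreds or thousands of dimensions.
const corpus = [
  { id: "refunds", embedding: [1, 0] },
  { id: "shipping", embedding: [0, 1] },
  { id: "returns", embedding: [0.9, 0.1] },
];
const hits = topKChunks([1, 0], corpus, 2);
console.log(hits.map((chunk) => chunk.id)); // [ 'refunds', 'returns' ]
```

Only the text of the selected chunks goes into the prompt, which is where the token savings come from.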

Embeddings are cheap ($0.02–0.10 per 1M tokens) and a one-time cost per document. Use a small embedding model (text-embedding-3-small or equivalent) unless you've measured a real recall gap.

6. Route by task complexity

Once you have eval data, build a router. A cheap classifier (mini variant, ~0.1¢ per call) reads the user's request and picks a model: "trivia / chitchat / small edit" → nano; "code or reasoning" → flagship.
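A router can be little more than a lookup table behind a classifier. In this sketch the classifier is passed in as a function, standing in for the cheap mini-model call; the labels, the tier names, and the fallback to flagship for unknown labels are assumptions made for illustration.

```javascript
// Hypothetical label-to-tier table. Labels come from the cheap classifier.
const MODEL_BY_LABEL = new Map([
  ["trivia", "nano"],
  ["chitchat", "nano"],
  ["small_edit", "nano"],
  ["code", "flagship"],
  ["reasoning", "flagship"],
]);

// Unknown or unexpected labels fall back to the flagship, so a confused
// classifier costs money rather than quality.
function pickModel(label) {
  return MODEL_BY_LABEL.get(label) ?? "flagship";
}

// `classify` is any async function that maps request text to a label.
async function routeRequest(userRequest, classify) {
  const label = await classify(userRequest);
  return pickModel(label);
}

// Stub classifier for the example; in production this would be the mini-model call.
const stubClassify = async (text) => (text.includes("bug") ? "code" : "chitchat");
routeRequest("hello there", stubClassify).then((model) => console.log(model)); // nano
```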

Production traffic is overwhelmingly easy. A naive router that sends 80% of requests to nano and 20% to flagship saves 60–80% with little or no quality regression, because the 80% never needed flagship in the first place.

Routing isn't free. The router itself costs tokens, and a wrong route on a hard query is expensive (you retry on the better model). Aim for >95% router accuracy on a held-out set before turning it on. Measure savings net of retries.
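"Net of retries" can be written down directly. The sketch below expresses every cost as a fraction of one flagship call and assumes that a misrouted request pays for the nano attempt and then a full flagship retry. The function name and all four example values are assumptions; plug in your own measurements.

```javascript
// Net savings from routing, relative to sending everything to the flagship.
//   nanoShare        fraction of requests routed to nano
//   nanoPriceRatio   nano cost as a fraction of a flagship call
//   routerCostRatio  router cost per request, as a fraction of a flagship call
//   retryRate        fraction of nano-routed requests that must be retried on flagship
function netRoutingSavings({ nanoShare, nanoPriceRatio, routerCostRatio, retryRate }) {
  const routerCost = routerCostRatio;          // paid on every request
  const nanoCost = nanoShare * nanoPriceRatio; // first attempt on nano
  const retryCost = nanoShare * retryRate;     // misses re-run on flagship at full price
  const flagshipCost = 1 - nanoShare;          // requests routed straight to flagship
  return 1 - (routerCost + nanoCost + retryCost + flagshipCost);
}

const netSavings = netRoutingSavings({
  nanoShare: 0.8,
  nanoPriceRatio: 0.03,
  routerCostRatio: 0.01,
  retryRate: 0.05,
});
console.log((100 * netSavings).toFixed(1)); // "72.6"
```

With these inputs the retries cost more than the nano calls themselves, which is why router accuracy matters more than the nano price.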

7. Mind the reasoning-token tax

Reasoning models (o-series, Claude with thinking, Gemini Thinking) generate hidden "thinking" tokens before responding. Those count as output and are billed at the regular output rate. A 200-token visible answer can hide 3,000 tokens of reasoning behind it, which makes the billed output 16× what the visible answer alone would cost.
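A quick way to see the tax is to compute the billed output with and without the hidden tokens. The helper name and the $10 per 1M output price are assumptions for this sketch; the token counts are the ones from the paragraph above.

```javascript
// Output cost when hidden reasoning tokens are billed at the output rate.
function billedOutputCostUsd(visibleTokens, reasoningTokens, outputPerMTok) {
  return ((visibleTokens + reasoningTokens) / 1e6) * outputPerMTok;
}

const withReasoning = billedOutputCostUsd(200, 3000, 10); // 3,200 billed tokens
const visibleOnly = billedOutputCostUsd(200, 0, 10);      // 200 billed tokens
const costMultiple = withReasoning / visibleOnly;

// The hidden tokens are 15x the visible ones, so the total bill is 16x.
console.log(withReasoning.toFixed(3)); // "0.032"
console.log(visibleOnly.toFixed(3));   // "0.002"
console.log(costMultiple.toFixed(0));  // "16"
```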

Reasoning models are great when correctness matters more than cost (math, code, planning). For chat, classification, and creative writing, a regular chat model usually beats them on cost-per-acceptable-answer. Measure both before defaulting.

If you do use reasoning models, configure reasoning_effort (where exposed) to low or medium for most queries and reserve high for the hard ones.

Where to start

If you do nothing else this quarter:

Find your top 3 call sites by spend (your provider dashboard shows this).
Swap each one to the mini variant and run a quick eval.
Turn on prompt caching on the largest one.
Cap max_tokens everywhere.

These four steps alone usually take 60–80% off. Compounding the rest gets you to 90%+. We've seen six-figure annual bills drop to five figures without any product changes.
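Savings from separate moves multiply on what's left of the bill; they don't add. A minimal sketch of that compounding follows. The three example fractions are invented, and the model assumes each saving applies to the whole remaining bill, which overstates things when a move only touches part of your traffic.

```javascript
// Combined savings when each step removes a fraction of the remaining bill.
function compoundSavings(fractions) {
  const remaining = fractions.reduce((left, saving) => left * (1 - saving), 1);
  return 1 - remaining;
}

// Hypothetical stack: 70% from right-sizing, then 50% and 40% from later moves.
const combined = compoundSavings([0.7, 0.5, 0.4]);
console.log((100 * combined).toFixed(0)); // "91", not 160
```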

Cross-check your numbers against the live pricing list — it refreshes daily — and use the calculator to plug in your real traffic shape.
