When the Batch API discount actually pays off
The pitch is irresistible: 50% off, same models, same quality. The catch is the SLA — your batch completes "within 24 hours," which in practice means anywhere from 20 minutes to 18 hours. For some workloads that's nothing; for others it's a deal-breaker. This post helps you tell which is which.
What the Batch API actually is
All three big labs offer a batch tier:
- OpenAI Batch API — 50% off, 24h window
- Anthropic Message Batches — 50% off, 24h window
- Google Gemini Batch — 50% off, 24h window
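The mechanics are broadly the same: submit a set of requests (a JSONL file on OpenAI, a list of requests in the API call on Anthropic), get a job ID, poll for completion, and download a JSONL of responses. There is no streaming and no tool-call loop mid-batch: each request gets exactly one response.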
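To make the request file concrete, here is a minimal sketch that turns rows into JSONL lines in the shape OpenAI's Batch API expects (`custom_id`, `method`, `url`, `body`). The `toBatchJsonl` helper, the row shape, and the model name are assumptions for illustration; other providers use different field names.

```javascript
// Hypothetical helper: build the JSONL body for an OpenAI-style batch file.
// Each line is one self-contained request; custom_id lets you match the
// response back to the source row later.
function toBatchJsonl(rows, model) {
  return rows
    .map((row) =>
      JSON.stringify({
        custom_id: String(row.id),
        method: 'POST',
        url: '/v1/chat/completions',
        body: {
          model,
          messages: [{ role: 'user', content: row.prompt }],
        },
      })
    )
    .join('\n');
}

// Example with placeholder data and a placeholder model name.
const jsonl = toBatchJsonl(
  [
    { id: 1, prompt: 'Summarize ticket 1' },
    { id: 2, prompt: 'Summarize ticket 2' },
  ],
  'example-model'
);
console.log(jsonl.split('\n').length); // 2
```

Using the row's primary key as `custom_id` is a design choice: it is what lets you join results back to your own data, since responses are not guaranteed to come back in request order.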
The simple decision rule
Ask: Is anyone — user or process — actively waiting on this response?
- Yes (chat, autocomplete, generation in front of a person) → sync API.
- No (overnight ETL, eval runs, dataset prep, enrichment, scheduled re-classification) → Batch API.
For the in-between cases ("we'd like it within an hour, but could survive a delay of a few hours"), the answer is usually still sync. The tail of the batch SLA will burn you. Batch is for jobs where you wouldn't notice a 12-hour delay.
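The rule fits in a few lines of code. This is only a sketch of the heuristic described here; `chooseTier` and its two inputs are hypothetical names, and the 24-hour constant is the completion window quoted above.

```javascript
// Hypothetical helper encoding the rule of thumb from this post.
// someoneWaiting: is a user or process blocked on the response?
// maxDelayHours: the longest delay the job can absorb without anyone noticing.
function chooseTier({ someoneWaiting, maxDelayHours }) {
  const BATCH_WINDOW_HOURS = 24; // the completion window the providers quote
  if (someoneWaiting) return 'sync';
  // In-between cases fall back to sync: the batch tail can eat the whole window.
  if (maxDelayHours < BATCH_WINDOW_HOURS) return 'sync';
  return 'batch';
}

console.log(chooseTier({ someoneWaiting: true, maxDelayHours: 0 }));   // sync
console.log(chooseTier({ someoneWaiting: false, maxDelayHours: 2 }));  // sync
console.log(chooseTier({ someoneWaiting: false, maxDelayHours: 48 })); // batch
```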
Workloads that scream Batch
- Embeddings backfill. Embedding your entire content corpus is the textbook batch use case. Saves thousands of dollars on a large library.
- Eval / regression runs. Running your eval set against 5 model variants nightly. Zero user impact from latency.
- Dataset cleaning & rewriting. Taking 100K customer-support tickets and generating clean summaries for analytics.
- Periodic re-classification. Reprocessing tagged items because you updated the taxonomy.
- Translation backlogs. Localizing a knowledge base.
Workloads where Batch hurts you
- Anything interactive (chat, autocomplete, generation UIs).
- Tool-using agents — batch is single-turn only.
- Anything streamed.
- Anything with a hard SLA < 24h.
- Anything where downstream code blocks on the response — you'd hold a job for a day.
The math
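Take a representative workload: 1M requests, each ~2K input tokens + 500 output tokens, on a frontier model priced at $1.25 per 1M input tokens and $10 per 1M output tokens. One sync call costs 2,000 × $1.25/1M + 500 × $10/1M = $0.0075, or 0.75¢.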
| Tier | Per-call cost | 1M calls | Savings |
|---|---|---|---|
| Sync | 0.75¢ | $7,500 | — |
| Batch | 0.375¢ | $3,750 | $3,750 (50%) |
$3,750 saved per million calls, for jobs where the user wasn't waiting anyway. For most production workloads this is one of the highest-leverage cost levers available, and it's nearly free to adopt.
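To rerun this arithmetic for your own workload shape, a small sketch like the following is enough. `workloadCost` is a hypothetical helper; prices are in dollars per 1M tokens, and the flat 50% discount is the figure quoted in this post.

```javascript
// Prices are dollars per 1M tokens. batchDiscount = 0.5 models the 50% batch discount.
function workloadCost({ calls, inputTokens, outputTokens, inputPrice, outputPrice, batchDiscount = 0.5 }) {
  // Dollars per call, scaled by 1M to keep the intermediate math in whole numbers.
  const scaledPerCall = inputTokens * inputPrice + outputTokens * outputPrice;
  const sync = (calls * scaledPerCall) / 1e6;
  const batch = sync * (1 - batchDiscount);
  return { sync, batch, savings: sync - batch };
}

// The workload from the table: 1M calls, 2K input + 500 output, $1.25/$10 pricing.
const cost = workloadCost({
  calls: 1e6,
  inputTokens: 2000,
  outputTokens: 500,
  inputPrice: 1.25,
  outputPrice: 10,
});
console.log(cost); // { sync: 7500, batch: 3750, savings: 3750 }
```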
What to watch out for
- Rate limits are different. Batch has its own token-per-day quota — request increases ahead of large jobs.
- Partial failure. A request inside a batch can fail individually. Plan for retries on the failed subset (also via batch).
- Compatibility. Not every model is available on batch in every region; check before architecting.
- Idempotency. Batch jobs occasionally need re-runs. Tag each request with a stable ID you can dedup on.
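The partial-failure and idempotency points can be handled together when you read the results back. The sketch below assumes OpenAI-style output lines (`custom_id`, `response.status_code`, `response.body`, `error`); `splitResults` is a hypothetical helper, and other providers shape their result records differently.

```javascript
// Hypothetical helper: split a batch output file into successes and retries.
// Assumes OpenAI-style lines: { custom_id, response: { status_code, body }, error }.
function splitResults(jsonlText) {
  const succeeded = new Map(); // custom_id -> response body (deduped by ID)
  const failed = new Set();    // custom_ids to resubmit in a follow-up batch

  for (const line of jsonlText.split('\n')) {
    if (line.trim() === '') continue;
    const result = JSON.parse(line);
    const ok = !result.error && result.response && result.response.status_code === 200;
    if (ok) {
      succeeded.set(result.custom_id, result.response.body);
      failed.delete(result.custom_id); // a re-run succeeded, so no retry needed
    } else if (!succeeded.has(result.custom_id)) {
      failed.add(result.custom_id);
    }
  }
  return { succeeded, retryIds: [...failed] };
}

// Placeholder data: 'a' appears twice (a re-run), 'b' failed.
const sample = [
  JSON.stringify({ custom_id: 'a', response: { status_code: 200, body: { text: 'ok' } }, error: null }),
  JSON.stringify({ custom_id: 'b', response: { status_code: 500, body: {} }, error: null }),
  JSON.stringify({ custom_id: 'a', response: { status_code: 200, body: { text: 'ok' } }, error: null }),
].join('\n');

const { succeeded, retryIds } = splitResults(sample);
console.log(succeeded.size, retryIds); // 1 [ 'b' ]
```

The `retryIds` list is what you would feed into the next batch, so a failed subset never forces a full re-run.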
A simple architectural pattern
For data pipelines, queue requests into a "batch buffer" (S3 bucket, Postgres table, whatever). A nightly cron flushes the buffer into a single batch job. The results land back in storage, and downstream code reads them.
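```javascript
// Nightly flush: submit everything pending as one batch job.
const requests = await db.query(`
  SELECT id, prompt FROM enrichment_queue WHERE status = 'pending'
`);

// uploadBatch is your own helper (not part of the SDK): it is assumed to write
// one JSONL line per request and upload the file for batch use.
const batchFile = await uploadBatch(requests);

const job = await openai.batches.create({
  input_file_id: batchFile.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h'
});

// Record the batch ID so a later run can poll for it and fetch the results.
await db.update('jobs').set({ batch_id: job.id, status: 'submitted' });
```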
That's typically 20 lines of code for half off the bill.
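One practical wrinkle with flushing into "a single batch job": providers cap how many requests one batch may contain, so a large buffer may need to be split across several jobs. A minimal sketch follows; the `chunk` helper is hypothetical, and the 50,000 figure is an assumption you should check against your provider's current limits.

```javascript
// Hypothetical helper: split the pending buffer into batch-sized groups.
function chunk(items, maxPerBatch) {
  if (!Number.isInteger(maxPerBatch) || maxPerBatch < 1) {
    throw new RangeError('maxPerBatch must be a positive integer');
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += maxPerBatch) {
    chunks.push(items.slice(i, i + maxPerBatch));
  }
  return chunks;
}

// Placeholder buffer of 120,000 rows, split with an assumed 50,000-request cap.
const pending = Array.from({ length: 120000 }, (_, i) => ({ id: i }));
const batches = chunk(pending, 50000);
console.log(batches.map((b) => b.length)); // [ 50000, 50000, 20000 ]
```

Each group would then go through the same upload-and-create steps, with one `batch_id` recorded per job.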
Compare batch-eligible models on the live pricing list; plug in your workload shape on the calculator.