When the Batch API discount actually pays off

The pitch is irresistible: 50% off, same models, same quality. The catch is the SLA — your batch completes "within 24 hours," which in practice means anywhere from 20 minutes to 18 hours. For some workloads that's nothing; for others it's a deal-breaker. This post helps you tell which is which.

What the Batch API actually is

All three big labs offer a batch tier:

OpenAI Batch API — 50% off, 24h window
Anthropic Message Batches — 50% off, 24h window
Google Gemini Batch — 50% off, 24h window

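The mechanics are broadly the same: submit a set of requests (a JSONL file for OpenAI; Anthropic takes the request list inline in the create call), get a job ID, poll for completion, and download a JSONL of responses. There is no streaming and no tool-call loop mid-batch: each request is a single, independent call.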
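
To make the request format concrete, here is a minimal sketch of building the JSONL for an OpenAI-style batch against the chat completions endpoint. The function names, the row shape (`id`, `prompt`), and the model name are assumptions for illustration; other providers use a different request envelope.

```javascript
// Sketch: turn queued rows into OpenAI-style batch JSONL.
// `toBatchLine` and `toBatchJsonl` are hypothetical helper names, and
// 'your-model-name' is a placeholder for a batch-eligible model.
function toBatchLine(row, model) {
  return JSON.stringify({
    custom_id: String(row.id),      // stable ID used to match results to requests
    method: 'POST',
    url: '/v1/chat/completions',    // must match the endpoint given at batch creation
    body: {
      model,
      messages: [{ role: 'user', content: row.prompt }],
    },
  });
}

function toBatchJsonl(rows, model) {
  // One JSON object per line, each line newline-terminated.
  return rows.map((row) => toBatchLine(row, model) + '\n').join('');
}

const sampleJsonl = toBatchJsonl(
  [
    { id: 1, prompt: 'Summarize ticket #1' },
    { id: 2, prompt: 'Summarize ticket #2' },
  ],
  'your-model-name'
);
console.log(sampleJsonl);
```

Each line carries its own `custom_id`, which is what lets you match responses back to requests when the output file arrives in a different order.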

Important quirk: "24 hours" is a maximum, not a typical turnaround. In our usage, ~70% of batches finish in under an hour, ~20% within 4 hours, and ~10% take 8-24h. Plan for the tail.

The simple decision rule

Ask: Is anyone — user or process — actively waiting on this response?

Yes (chat, autocomplete, generation in front of a person) → sync API.
No (overnight ETL, eval runs, dataset prep, enrichment, scheduled re-classification) → Batch API.

For the in-between cases ("we'd like it within an hour but can survive some delay"), the answer is usually still sync. The batch SLA tail will burn you. Batch is for jobs where you don't notice a 12-hour delay.
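
The rule is small enough to write down as a function. This is only a sketch: the field names are hypothetical, and the 24-hour threshold simply mirrors the batch completion window.

```javascript
// Sketch of the decision rule. Field names are illustrative assumptions.
const BATCH_WINDOW_HOURS = 24;

function chooseTier({ someoneWaiting, hardDeadlineHours = Infinity }) {
  // A person or process is blocked on the answer: always sync.
  if (someoneWaiting) return 'sync';
  // A deadline tighter than the batch window can be missed by the tail.
  if (hardDeadlineHours < BATCH_WINDOW_HOURS) return 'sync';
  return 'batch';
}

console.log(chooseTier({ someoneWaiting: true }));                        // sync
console.log(chooseTier({ someoneWaiting: false, hardDeadlineHours: 1 })); // sync
console.log(chooseTier({ someoneWaiting: false }));                       // batch
```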

Workloads that scream Batch

Embeddings backfill. Embedding your entire content corpus is the textbook batch use case. Saves thousands of dollars on a large library.
Eval / regression runs. Running your eval set against 5 model variants nightly. Zero user impact from latency.
Dataset cleaning & rewriting. Taking 100K customer-support tickets and generating clean summaries for analytics.
Periodic re-classification. Reprocessing tagged items because you updated the taxonomy.
Translation backlogs. Localizing a knowledge base.

Workloads where Batch hurts you

Anything interactive (chat, autocomplete, generation UIs).
Tool-using agents — batch is single-turn only.
Anything streamed.
Anything with a hard SLA < 24h.
Anything where downstream code blocks on the response — you'd hold a job for a day.

The math

Take a representative workload: 1M requests, each ~2K input tokens + 500 output tokens, on a frontier model priced at $1.25 per million input tokens and $10 per million output tokens.

Tier	Per-call cost	1M calls	Savings
Sync	0.75¢	$7,500	—
Batch	0.375¢	$3,750	$3,750 (50%)

$3,750 saved per million calls, on jobs where nobody was waiting anyway. For batch-eligible work, this is one of the highest-leverage cost levers available to most production workloads, and it's nearly free to adopt.
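
The table's arithmetic can be reproduced in a few lines, which makes it easy to swap in your own token counts and prices. The function and parameter names are illustrative; prices are in USD per million tokens, matching the example workload.

```javascript
// Sketch: per-call and total cost for the example workload.
// Prices are USD per million tokens; `discount` is the batch discount (0.5 = 50%).
function costPerCallUsd({ inputTokens, outputTokens, inputPricePerM, outputPricePerM, discount = 0 }) {
  const fullPrice = (inputTokens * inputPricePerM + outputTokens * outputPricePerM) / 1e6;
  return fullPrice * (1 - discount);
}

const exampleWorkload = { inputTokens: 2000, outputTokens: 500, inputPricePerM: 1.25, outputPricePerM: 10 };
const exampleCalls = 1e6;

const syncTotal = costPerCallUsd(exampleWorkload) * exampleCalls;
const batchTotal = costPerCallUsd({ ...exampleWorkload, discount: 0.5 }) * exampleCalls;

console.log(syncTotal.toFixed(2), batchTotal.toFixed(2), (syncTotal - batchTotal).toFixed(2));
// 7500.00 3750.00 3750.00
```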

What to watch out for

Rate limits are different. Batch has its own token-per-day quota; request an increase ahead of large jobs.
Partial failure. A request inside a batch can fail individually. Plan for retries on the failed subset (also via batch).
Compatibility. Not every model is available on batch in every region; check before architecting.
Idempotency. Batch jobs occasionally need re-runs. Tag each request with a stable ID you can dedup on.
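
The partial-failure and idempotency points fit together in one step: parse the results file, key everything on the stable request ID, and collect the IDs that need another pass. The sketch below assumes OpenAI-style output lines (`custom_id`, `response`, `error`); other providers shape their results differently, and `splitResults` is a hypothetical helper name.

```javascript
// Sketch: split a downloaded results JSONL into successes and a retry list.
// Assumes each line looks like { custom_id, response: { status_code, body }, error }.
function splitResults(jsonlText) {
  const succeeded = new Map(); // custom_id -> response body; a Map dedups re-runs
  const failedIds = new Set();

  for (const line of jsonlText.split('\n')) {
    if (!line.trim()) continue; // skip blank lines
    const result = JSON.parse(line);
    const ok = !result.error && result.response && result.response.status_code === 200;
    if (ok) {
      succeeded.set(result.custom_id, result.response.body);
      failedIds.delete(result.custom_id); // a later success supersedes an earlier failure
    } else if (!succeeded.has(result.custom_id)) {
      failedIds.add(result.custom_id);
    }
  }
  return { succeeded, failedIds: [...failedIds] };
}
```

The `failedIds` list is what you would feed back into the next batch. Because results are keyed on `custom_id`, re-running a job overwrites rather than duplicates.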

A simple architectural pattern

For data pipelines, queue requests into a "batch buffer" (S3 bucket, Postgres table, whatever). A nightly cron flushes the buffer into a single batch job. The results land back in storage, and downstream code reads them.

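// nightly job (sketch): `db` and `uploadBatch` are app-specific helpers, not SDK calls
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// collect everything queued since the last run
const requests = await db.query(`
  SELECT id, prompt FROM enrichment_queue WHERE status = 'pending'
`);
// uploadBatch is expected to write one JSONL line per row and upload the file for batch use
const batchFile = await uploadBatch(requests);
const job = await openai.batches.create({
  input_file_id: batchFile.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h'
});
// record the batch ID so a later poller can fetch the results
await db.update('jobs').set({ batch_id: job.id, status: 'submitted' });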

That's typically 20 lines of code for half off the bill.
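
The other half of the pattern is noticing when the job finishes. Here is a minimal polling sketch, with the status lookup injected so the loop stays independent of any one SDK. With the OpenAI Node SDK, `fetchStatus` could be something like `() => openai.batches.retrieve(batchId).then((b) => b.status)`. The helper name, the default interval, and the local `'timed_out'` label are assumptions, not part of any API.

```javascript
// Sketch: poll a batch until it reaches a terminal status.
// The terminal statuses below follow OpenAI's batch object; adjust for other providers.
const TERMINAL_STATUSES = new Set(['completed', 'failed', 'expired', 'cancelled']);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function pollUntilDone(fetchStatus, { intervalMs = 60000, maxAttempts = 24 * 60 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (TERMINAL_STATUSES.has(status)) return { status, attempts: attempt };
    if (attempt < maxAttempts) await sleep(intervalMs); // no need to wait after the last try
  }
  // 'timed_out' is a local label for "we gave up polling", not a provider status.
  return { status: 'timed_out', attempts: maxAttempts };
}
```

In a cron-driven pipeline you may not need a long-lived loop at all: a job that runs every few minutes, checks each submitted batch once, and exits does the same work with less to keep alive.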

Compare batch-eligible models on the live pricing list; plug in your workload shape on the calculator.
