
When the Batch API discount actually pays off

The pitch is irresistible: 50% off, same models, same quality. The catch is the SLA — your batch completes "within 24 hours," which in practice means anywhere from 20 minutes to 18 hours. For some workloads that's nothing; for others it's a deal-breaker. This post helps you tell which is which.

What the Batch API actually is

All three big labs offer a batch tier:

  • OpenAI Batch API — 50% off, 24h window
  • Anthropic Message Batches — 50% off, 24h window
  • Google Gemini Batch — 50% off, 24h window

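The mechanics are broadly the same: submit a set of requests (a JSONL file on OpenAI; Anthropic takes the request list in the API call itself), get a job ID, poll for completion, and download a JSONL of responses. There is no streaming and no tool-call loop mid-batch: each request gets exactly one response.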

Important quirk. "24 hours" is a maximum, not a typical turnaround. In our usage, ~70% of batches finish in under an hour, ~20% within 4 hours, and the remaining ~10% take 8-24h. Plan for the tail.
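The submit-then-poll loop, with a deadline that accounts for the tail, might be sketched as follows. This is a minimal sketch, not a definitive implementation: `client` is assumed to be an OpenAI SDK instance, the status names are the ones OpenAI documents for batch objects (check them against the current reference), and `intervalMs` and `deadlineMs` are hypothetical knobs rather than API parameters.

```javascript
// Batch statuses, as an assumption based on OpenAI's batch object docs.
const TERMINAL_OK = new Set(["completed"]);
const TERMINAL_BAD = new Set(["failed", "expired", "cancelled"]);

// Classify a batch status as "done", "dead", or "wait".
function classifyStatus(status) {
  if (TERMINAL_OK.has(status)) return "done";
  if (TERMINAL_BAD.has(status)) return "dead";
  return "wait"; // validating, in_progress, finalizing, cancelling, ...
}

// Poll until the batch reaches a terminal state or the deadline passes.
// intervalMs and deadlineMs are illustrative defaults, not API settings.
async function waitForBatch(
  client,
  batchId,
  { intervalMs = 60_000, deadlineMs = 24 * 3600 * 1000 } = {}
) {
  const start = Date.now();
  while (Date.now() - start < deadlineMs) {
    const batch = await client.batches.retrieve(batchId);
    const verdict = classifyStatus(batch.status);
    if (verdict === "done") return batch; // results are in batch.output_file_id
    if (verdict === "dead") {
      throw new Error(`Batch ${batchId} ended as ${batch.status}`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Batch ${batchId} still running after ${deadlineMs} ms`);
}
```

Polling once a minute is plenty: most batches take somewhere between minutes and hours, so a tighter loop only adds API calls.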

The simple decision rule

Ask: Is anyone — user or process — actively waiting on this response?

  • Yes (chat, autocomplete, generation in front of a person) → sync API.
  • No (overnight ETL, eval runs, dataset prep, enrichment, scheduled re-classification) → Batch API.

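For the in-between cases ("we'd like it within an hour, but could survive a delay of another hour or so") the answer is usually still sync. The batch SLA tail will burn you: a batch that usually comes back within the hour can occasionally take most of a day. Batch is for jobs where you wouldn't notice a 12-hour delay.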
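If you want the rule enforced in code rather than in people's heads, it can be written down as a small function. This is a sketch under assumptions: `someoneWaiting` and `maxDelayHours` are hypothetical fields you would attach to each job type, and the 24-hour threshold simply mirrors the batch completion window.

```javascript
// The batch window is the only completion guarantee you get.
const BATCH_WINDOW_HOURS = 24;

// someoneWaiting: is a user or process blocked on the response?
// maxDelayHours: the longest delay the job can absorb without anyone noticing.
function chooseTier({ someoneWaiting, maxDelayHours }) {
  if (someoneWaiting) return "sync";
  // In-between cases fall to sync: the tail can use the whole window.
  if (maxDelayHours < BATCH_WINDOW_HOURS) return "sync";
  return "batch";
}

console.log(chooseTier({ someoneWaiting: true, maxDelayHours: 0 }));   // "sync"
console.log(chooseTier({ someoneWaiting: false, maxDelayHours: 2 }));  // "sync"
console.log(chooseTier({ someoneWaiting: false, maxDelayHours: 48 })); // "batch"
```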

Workloads that scream Batch

  • Embeddings backfill. Embedding your entire content corpus is the textbook batch use case. Saves thousands of dollars on a large library.
  • Eval / regression runs. Running your eval set against 5 model variants nightly. Zero user impact from latency.
  • Dataset cleaning & rewriting. Taking 100K customer-support tickets and generating clean summaries for analytics.
  • Periodic re-classification. Reprocessing tagged items because you updated the taxonomy.
  • Translation backlogs. Localizing a knowledge base.

Workloads where Batch hurts you

  • Anything interactive (chat, autocomplete, generation UIs).
  • Tool-using agents — batch is single-turn only.
  • Anything streamed.
  • Anything with a hard SLA < 24h.
  • Anything where downstream code blocks on the response — you'd hold a job for a day.

The math

Take a representative workload: 1M requests, each ~2K input tokens + 500 output tokens, on a frontier model priced at $1.25 per million input tokens and $10 per million output tokens.

Tier    Per-call cost   1M calls   Savings
Sync    0.75¢           $7,500     -
Batch   0.375¢          $3,750     $3,750 (50%)

$3,750 saved per million calls, for jobs where the user wasn't waiting anyway. For workloads that qualify, this is one of the highest-leverage cost levers available, and it's nearly free to adopt.
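To rerun the arithmetic with your own numbers, a small helper is enough. The function and field names below are illustrative; prices are in dollars per million tokens, and the 50% discount is passed in as a parameter rather than hard-coded.

```javascript
const TOKENS_PER_MILLION = 1_000_000;

// Cost in dollars for `calls` requests. Prices are dollars per million tokens.
function workloadCost({ calls, inputTokens, outputTokens, inputPrice, outputPrice, discount = 0 }) {
  // Per-call cost scaled by one million, so the sums stay in whole numbers here.
  const scaledPerCall = inputTokens * inputPrice + outputTokens * outputPrice;
  return (calls * scaledPerCall * (1 - discount)) / TOKENS_PER_MILLION;
}

const shape = {
  calls: 1_000_000,
  inputTokens: 2000,
  outputTokens: 500,
  inputPrice: 1.25,
  outputPrice: 10,
};

const sync = workloadCost(shape);                        // 7500
const batch = workloadCost({ ...shape, discount: 0.5 }); // 3750
console.log({ sync, batch, saved: sync - batch });       // saved: 3750
```

Note that output tokens dominate this workload ($5,000 of the $7,500), so trimming response length stacks with the batch discount.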

What to watch out for

  • Rate limits are different. Batch quotas are tracked separately from your sync rate limits (OpenAI, for example, caps the number of enqueued tokens per model). Request increases ahead of large jobs.
  • Partial failure. A request inside a batch can fail individually. Plan for retries on the failed subset (also via batch).
  • Compatibility. Not every model is available on batch in every region; check before architecting.
  • Idempotency. Batch jobs occasionally need re-runs. Tag each request with a stable ID you can dedup on.
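The partial-failure and idempotency points can be handled in one pass over the results file. The sketch below assumes each output line is a JSON object with `custom_id`, `response` (with `status_code` and `body`), and `error`, which matches OpenAI's documented batch output at the time of writing; other providers use different field names, so treat the shape as an assumption. `splitResults` is a hypothetical helper name.

```javascript
// Split batch output into successes and IDs that need a retry.
// Assumed line shape:
//   { custom_id, response: { status_code, body } | null, error: {...} | null }
function splitResults(jsonl) {
  const succeeded = new Map(); // custom_id -> response body
  const failed = new Set();    // custom_ids to resubmit in the next batch
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue; // tolerate blank lines
    const row = JSON.parse(line);
    const ok = !row.error && row.response && row.response.status_code === 200;
    if (ok) {
      // Keyed by custom_id, so a re-run overwrites instead of duplicating.
      succeeded.set(row.custom_id, row.response.body);
      failed.delete(row.custom_id);
    } else if (!succeeded.has(row.custom_id)) {
      failed.add(row.custom_id);
    }
  }
  return { succeeded, failed };
}
```

Feed it the concatenated output of the original run and any re-runs: anything left in `failed` goes into the next batch, and `succeeded` holds exactly one result per stable ID.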

A simple architectural pattern

For data pipelines, queue requests into a "batch buffer" (S3 bucket, Postgres table, whatever). A nightly cron flushes the buffer into a single batch job. The results land back in storage, and downstream code reads them.

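// Nightly cron: flush the pending queue into a single batch job.
// Assumes `db` is your own data-access helper and `uploadBatch` is a helper
// that writes the rows to JSONL and uploads the file with purpose 'batch'.
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const requests = await db.query(`
  SELECT id, prompt FROM enrichment_queue WHERE status = 'pending'
`);
const batchFile = await uploadBatch(requests);
const job = await openai.batches.create({
  input_file_id: batchFile.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h'
});
// Record the batch ID so a later poller can collect the results.
await db.update('jobs').set({ batch_id: job.id, status: 'submitted' });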

That's typically 20 lines of code for half off the bill.
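The snippet above leaves `uploadBatch` undefined. One way it could look is sketched below, assuming OpenAI's batch input format (one JSON object per line with `custom_id`, `method`, `url`, and `body`). The model name is a placeholder, `toBatchJsonl` is a hypothetical helper, and this version takes the SDK client as an explicit first argument, so adapt the call site accordingly. It also relies on the global `File` class available in Node 20+.

```javascript
// Turn queue rows ({ id, prompt }) into the JSONL body of a batch input file.
function toBatchJsonl(rows, model = "gpt-4o-mini") {
  return rows
    .map((row) =>
      JSON.stringify({
        custom_id: String(row.id), // stable ID for matching results and dedup
        method: "POST",
        url: "/v1/chat/completions",
        body: { model, messages: [{ role: "user", content: row.prompt }] },
      })
    )
    .join("\n");
}

// Hypothetical helper: upload the JSONL so it can back a batch job.
// `client` is an OpenAI SDK instance.
async function uploadBatch(client, rows) {
  const file = new File([toBatchJsonl(rows)], "requests.jsonl");
  return client.files.create({ file, purpose: "batch" });
}
```

Using the queue row's primary key as `custom_id` is what makes the results joinable back to `enrichment_queue` and safe to re-run.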

Compare batch-eligible models on the live pricing list; plug in your workload shape on the calculator.
