OpenAI Batch API
Last reviewed
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,142 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,142 words
Add missing citations, update stale details, or suggest a clearer explanation.
The OpenAI Batch API is an asynchronous service from OpenAI that lets developers submit large groups of API requests in a single file for processing within a target window of 24 hours, in exchange for a 50% discount on token costs relative to the equivalent synchronous (real-time) endpoints. It became available on April 15, 2024, and is designed for high-volume, non-time-sensitive workloads such as evaluations, bulk classification, embeddings generation, and summarization.[1][2]
The Batch API addresses a common pattern in production use of large language models: jobs that involve many thousands of requests but do not require an immediate, low-latency response. Rather than sending each request individually to a synchronous endpoint, a developer collects the requests into one file, uploads it, and creates a single batch job that OpenAI processes in the background.[3]
Compared with calling the standard endpoints directly, the Batch API offers three documented advantages. First, it applies a 50% cost discount on both input and output tokens relative to the synchronous price of the same model.[2][4] Second, it provides a separate pool of substantially higher rate limits, so batch work does not draw down the standard per-minute token and request limits used by real-time traffic; at launch OpenAI cited a ceiling of 250 million input tokens enqueued for GPT-4 Turbo.[2] Third, each batch is processed within a 24-hour completion window, and OpenAI states that results often return more quickly than the full window.[3]
The trade-off is latency. Because work is queued and processed asynchronously, the Batch API is not suitable for interactive applications that need a response in seconds. If a batch cannot finish inside the 24-hour window, the unfinished requests are marked as expired and any completed results are still returned.[3]
The Batch API uses a small set of endpoints to build a request file, start a job, monitor it, and collect results. Each input line is an independent request that mirrors the body a developer would otherwise send to a synchronous endpoint, wrapped with a unique identifier so results can be matched back to inputs. The typical workflow is as follows.[3]
| Step | Action | Endpoint / mechanism | Notes |
|---|---|---|---|
| 1 | Prepare a .jsonl file | Local file creation | One JSON object per line; each line needs a unique custom_id, plus method, url, and the request body. A single file may target only one model. |
| 2 | Upload the input file | Files API, POST /v1/files with purpose="batch" | Returns an input_file_id. |
| 3 | Create the batch | POST /v1/batches | References the input_file_id, the target endpoint, and a completion_window (set to "24h"). |
| 4 | Poll the batch status | GET /v1/batches/{batch_id} | Status moves through validating, in_progress, finalizing, then completed (or failed, expired, cancelling, cancelled). |
| 5 | Retrieve the results | GET /v1/files/{output_file_id}/content | When complete, the batch object exposes an output_file_id (successful results) and an error_file_id (failed requests), each in JSONL form keyed by custom_id. |
The custom_id field is required and must be unique within the file, because results in the output are not guaranteed to be in the same order as the inputs.[3] Batches can also be cancelled before completion, and a list endpoint allows developers to enumerate their batch jobs.[5]
Batch jobs are billed at 50% of the standard synchronous token price for the same model, applied to both input and output tokens.[2][4] Billing follows the model used, so the absolute per-token rate varies by model (for example, a GPT-4o batch is billed at half of the GPT-4o synchronous rate). The discount is the defining commercial feature of the service.
The API enforces explicit size limits on each batch and its input file, summarized below.[3]
| Limit | Value |
|---|---|
| Maximum requests per batch | 50,000 |
| Maximum input file size | 200 MB |
| Model per input file | Exactly one |
| Completion window | 24 hours |
| Input/output file retention | Files expire after 30 days |
| Cost vs. synchronous endpoints | 50% discount on input and output tokens |
For embeddings, batches are additionally restricted to a maximum of 50,000 embedding inputs across all requests in the batch.[3] Rate limits for batch work are tracked in a dedicated pool, separate from the synchronous rate limits, and are expressed in part as a cap on the number of input tokens that can be enqueued at once per model.[2][3]
At launch on April 15, 2024, the Batch API supported only the Chat Completions endpoint (/v1/chat/completions).[2][6] On April 29, 2024, OpenAI published a dedicated Batch API guide and added support for embeddings models via /v1/embeddings, allowing bulk generation with models such as text-embedding-3.[7] When GPT-4o launched in the API on May 13, 2024, it was available through the Batch API as a text and vision model, extending batch processing to image inputs handled by chat completions.[8]
OpenAI has continued to broaden coverage over time. As documented, the Batch API supports the following endpoints:[3]
/v1/responses (the Responses API)/v1/chat/completions/v1/embeddings/v1/completions/v1/moderations/v1/images/generations/v1/images/edits/v1/videosEach input file targets a single endpoint, specified when the batch is created.[3]
The Batch API is intended for workloads where throughput and cost matter more than immediate latency. Common applications include:[1][3]
Because the service decouples submission from completion, it is well suited to scheduled or overnight pipelines, and to organizations that want to process large jobs without exhausting the rate limits reserved for their real-time, customer-facing traffic.[3]