GPT API
Last reviewed
May 8, 2026
Sources
40 citations
Review status
Source-backed
Revision
v3 ยท 8,651 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 8, 2026
Sources
40 citations
Review status
Source-backed
Revision
v3 ยท 8,651 words
Add missing citations, update stale details, or suggest a clearer explanation.
The GPT API is the public HTTP interface that [[OpenAI]] exposes for programmatic access to its hosted language, vision, audio, image, and video models. The phrase has been used loosely since the GPT-3 beta in 2020 and remains the way most developers refer to the surface today, even though OpenAI's own marketing has shifted between "the API," "OpenAI API," and product-specific names like the [[Responses API|Responses]] and [[Chat Completions API|Chat Completions]] endpoints. In practice, "GPT API" covers everything served from https://api.openai.com/v1/, including text generation, embeddings, image generation with gpt-image-1, speech synthesis, transcription with [[Whisper]], real-time voice over [[WebRTC]], moderation, file storage, fine-tuning, batch jobs, and webhooks.
This article is the canonical reference for the API surface itself. "GPT API" and "[[OpenAI API]]" are widely used interchangeably; the OpenAI API article covers the same ground from a slightly broader angle (history, business model, ecosystem) while this article focuses on the endpoints, authentication, SDKs, pricing, and operational details. Specific models such as GPT-5, GPT-4o, [[o3]], and gpt-image-1 have their own articles; the [[ChatGPT]] consumer product is documented separately. For Microsoft's parallel offering see Azure OpenAI Service.
The API is what turned OpenAI from a research lab into a platform business. By 2024 it was the largest piece of the company's revenue mix, by 2025 it was the substrate underneath most enterprise [[generative AI]] integrations, and by 2026 the shape of /v1/chat/completions had become a de facto industry standard that almost every other model provider tried to imitate. The history below traces how that happened, what each generation of the API looked like, what the current endpoints do, what they cost, what they can be replaced with, and where they have been quietly broken in ways that catch new developers off guard.
OpenAI announced the API in private beta on June 11, 2020, framing it as the company's first commercial product and the home of [[GPT-3]]. The launch post described a single "text-in, text-out" interface that any developer could call to perform virtually any English language task, with early access limited to companies that had been piloting the technology, including Algolia, Koko, MessageBird, Sapling, Replika, Casetext, Quizlet, and Reddit. The waitlist stayed long for most of 2020 and 2021. OpenAI removed the waitlist in November 2021, opening access to anyone in supported countries with a credit card on file.
The early API only exposed a single endpoint, POST /v1/completions, which accepted a prompt string and returned a continuation. Models were addressed by short names like davinci, curie, babbage, and ada, each available in base and instruction-tuned variants. The Embeddings endpoint shipped in January 2022, fine-tuning followed for the base GPT-3 series, and the moderation classifier joined the lineup in August 2022. None of those primitives looked like an "API platform" yet; they looked like a single research model behind a credit card form.
The biggest structural shift came on March 1, 2023, when OpenAI launched the Chat Completions API at POST /v1/chat/completions alongside gpt-3.5-turbo. The new endpoint took an array of role-tagged messages instead of a single prompt, and it became the dominant surface within months. Pricing was deliberately aggressive: gpt-3.5-turbo undercut the older text-davinci-003 by about 10x, which converted a generation of hobbyist projects into paid traffic almost overnight. Function calling arrived in the same endpoint on June 13, 2023, exposing structured tool descriptions through functions and function_call parameters. OpenAI later renamed those parameters to tools and tool_choice in November 2023 to match the multi-tool model that GPT-4 Turbo supported. Vision input followed on November 6, 2023, when GPT-4 Turbo with Vision became available, letting clients pass image URLs or base64 data inside chat messages.
The rest of 2023 added Whisper transcription, [[DALL-E]] 3 image generation, and the original Assistants API, which introduced server-side conversation threads, runs, and file-search retrieval. None of those Assistants primitives stuck. OpenAI announced the Responses API on March 11, 2025, called Assistants "v1 beta," and started a one-year sunset clock that ends August 26, 2026. The Assistants story is one of the few times OpenAI shipped a new abstraction that did not graduate; the developer feedback was that threads, runs, and steps were heavier than what most use cases needed, and the framework's lock-in to OpenAI-hosted state was uncomfortable for teams that wanted to keep conversation history in their own database.
2024 was an infrastructure year. Project-scoped API keys arrived in April 2024, the [[Batch API]] launched the same month with a 50% discount, the [[Realtime API]] entered public beta on October 1, 2024, prompt caching turned on automatically across the latest model snapshots on the same day, and Structured Outputs guaranteed JSON schema conformance starting August 6, 2024. The official .NET library went stable in October 2024, the Go SDK shipped in July 2024, and a Java SDK followed shortly after. The Admin and Audit Log APIs landed on August 1, 2024, which is when OpenAI's offering started to feel like it could pass an enterprise security review without a wrapper layer. The omni-moderation model replaced the older text-only classifier on September 26, 2024 and became free across the board.
In 2025, OpenAI pushed the API toward agent workflows. The Responses API, the Conversations API, the Agents SDK for Python (March 2025) and TypeScript (June 2025), the computer-use tool, and the GA Realtime release on August 28, 2025 with the new gpt-realtime model all landed in that window. Reinforcement fine-tuning became generally available, webhooks went live, the Codex CLI shipped as an open-source Rust binary in April 2025, and a year of model releases (o3, o4-mini, gpt-4.1, gpt-5, gpt-5.1, gpt-5.2) kept the model picker churning. The current state, as of mid-2026, is a multi-endpoint API that is still backward compatible with most 2023 client code while quietly rotating new traffic toward Responses and Conversations.
A handful of milestones worth pinning down because they show up in many third-party guides without dates:
| Date | Event |
|---|---|
| June 11, 2020 | API beta launches with [[GPT-3]] and /v1/completions |
| November 18, 2021 | Public availability, waitlist removed |
| January 25, 2022 | First Embeddings models shipped |
| March 1, 2023 | Chat Completions and gpt-3.5-turbo launch |
| June 13, 2023 | Function calling lands in Chat Completions |
| November 6, 2023 | GPT-4 Turbo with Vision goes GA in the API |
| January 25, 2024 | text-embedding-3-small and text-embedding-3-large ship |
| April 15, 2024 | Batch API launches with a 50% discount |
| April 2024 | Project-scoped API keys roll out |
| August 1, 2024 | Admin and Audit Log APIs released |
| August 6, 2024 | Structured Outputs and gpt-4o-2024-08-06 launch |
| September 26, 2024 | omni-moderation-latest replaces text-only moderation |
| October 1, 2024 | Realtime API public beta, prompt caching turns on, .NET SDK GA |
| March 11, 2025 | Responses API and Agents SDK launch |
| April 23, 2025 | gpt-image-1 opens image generation in the API |
| May 23, 2025 | Reinforcement fine-tuning GA on o4-mini |
| August 20, 2025 | Conversations API launches |
| August 26, 2025 | Assistants API deprecation announced |
| August 28, 2025 | Realtime API GA with gpt-realtime |
| March 24, 2026 | Sora discontinuation announced |
| May 7, 2026 | Realtime beta interface retires |
| May 12, 2026 | DALL-E 2 and DALL-E 3 retire |
| August 26, 2026 | Assistants API sunset |
| September 24, 2026 | Sora 2 video API sunset |
All endpoints share the base URL https://api.openai.com/v1/ and use JSON request and response bodies, with the exception of audio transcription (multipart form upload) and file uploads. Every successful response returns a 200 status; errors use the standard 4xx and 5xx codes documented in the [[OpenAI error codes]] reference. Most endpoints accept a small set of optional headers in addition to Authorization: OpenAI-Organization, OpenAI-Project, OpenAI-Beta (used historically for Assistants and the original Realtime), and Idempotency-Key for safe retries on POST calls.
| Endpoint | Path | Status | Purpose |
|---|---|---|---|
| Responses | POST /v1/responses | GA | Stateful, agent-oriented endpoint that combines chat, tools, web search, file search, code interpreter, and computer use in one call |
| Chat Completions | POST /v1/chat/completions | GA | Stateless message-array endpoint, the de facto standard since 2023 |
| Completions | POST /v1/completions | Legacy | Single-prompt interface from the 2020 era, supported only by older base models |
| Embeddings | POST /v1/embeddings | GA | Vector embeddings via text-embedding-3-small and text-embedding-3-large |
| Conversations | POST /v1/conversations | GA | Container API for long-running conversations used with Responses |
| Images | POST /v1/images/generations, /edits, /variations | GA | Image generation with gpt-image-1; DALL-E 2 and 3 retire May 12, 2026 |
| Audio | POST /v1/audio/speech, /transcriptions, /translations | GA | Text-to-speech, [[Whisper]] transcription, and audio translation |
| Realtime | wss://api.openai.com/v1/realtime, WebRTC, SIP | GA | Low-latency speech-to-speech with gpt-realtime |
| Moderations | POST /v1/moderations | GA | Free safety classifier, currently omni-moderation-latest |
| Files | POST /v1/files, GET /v1/files/{id} | GA | Upload up to 512 MB per file, 2.5 TB per project |
| Uploads | POST /v1/uploads | GA | Multipart uploads for files larger than 512 MB, up to 8 GB |
| Fine-tuning | POST /v1/fine_tuning/jobs | GA | Supervised, DPO, and reinforcement fine-tuning |
| Batch | POST /v1/batches | GA | Asynchronous bulk processing at a 50% discount |
| Vector Stores | POST /v1/vector_stores | GA | Managed embedding indexes used by file search |
| Webhooks | POST /v1/webhooks | GA | Subscribe to batch, fine-tuning, response, and realtime events |
| Models | GET /v1/models | GA | List models the caller can access |
| Usage and billing | GET /v1/organization/usage, /v1/organization/costs | GA | Programmatic usage and cost reporting |
| Audit logs | GET /v1/organization/audit_logs | GA | Admin API surface for compliance |
| Admin keys | POST /v1/organization/admin_api_keys | GA | Create keys for org-level automation |
| Project API keys | POST /v1/organization/projects/{id}/api_keys | GA | Create project-scoped keys |
| Assistants | POST /v1/assistants | Deprecated | Sunset August 26, 2026; migrate to Responses + Conversations |
| Threads, Runs, Run Steps | POST /v1/threads, /v1/threads/{id}/runs | Deprecated | Same Assistants sunset window |
| Videos (Sora) | POST /v1/videos | Sunsetting | Sora 2 API shuts down September 24, 2026 |
| Edits | POST /v1/edits | Removed | Folded into chat completions |
| Search, Classifications, Answers, Engines | POST /v1/{search,classifications,answers,engines} | Removed | Shut down December 3, 2022 |
The Responses endpoint is the one OpenAI now points new projects at. It is stateful by default, can carry tool state across turns when used with a Conversation object, and natively supports the four built-in tools: web_search, file_search, code_interpreter, and computer_use. It also accepts remote MCP servers as tools, which is how OpenAI's documentation suggests integrating third-party data sources without writing custom function-calling glue. Responses support background mode (background: true), which returns immediately with a job id and notifies the caller via webhook when the run is complete; that pattern is the right way to handle deep research jobs and long-running computer-use sessions.
Chat Completions is not deprecated and OpenAI has been explicit that it will keep working. In practice, most existing client code, the [[LangChain]] integration, the [[LlamaIndex]] integration, and almost every OpenAI-compatible third-party endpoint still target /v1/chat/completions, so the endpoint will likely outlive several model generations. The product team has said that new features will land in Responses first and may eventually appear in Chat Completions, but there is no committed sunset date. The migration cost is real for any application that uses tools, since the tool call format shifts from tool_calls arrays inside an assistant message to typed output items, but for plain text completion the migration is essentially renaming messages to input.
The legacy Completions endpoint is a different story. It only works with older base models like gpt-3.5-turbo-instruct and babbage-002, and OpenAI has flagged it as a candidate for retirement once the underlying models are deprecated. Modern chat-tuned models including GPT-4o, GPT-5, and the [[o-series]] reasoning models reject /v1/completions requests outright with a 400 error. There is one ongoing use case for the old endpoint: certain logprob-style evaluations and zero-shot classification recipes still rely on its logprobs parameter, which the chat endpoint exposes only in a more limited form.
The original /v1/engines, /v1/search, /v1/classifications, and /v1/answers endpoints from the GPT-3 beta were shut down on December 3, 2022, with migration guides pointing developers to the Embeddings and Completions endpoints. The original /v1/fine-tunes endpoint shut down on January 4, 2024 in favor of /v1/fine_tuning/jobs. The Edit endpoint, /v1/edits, was removed in 2023 and its use cases moved into chat completions. Each of these removals followed OpenAI's standard pattern: at least six months of advance notice in the Deprecations page, a migration guide with side-by-side examples, and a hard cutoff after which requests return 404.
The Assistants API is the largest deprecation in flight. OpenAI announced on August 26, 2025 that Assistants would be removed exactly one year later, on August 26, 2026. Migration paths point to Responses (for the model interaction itself) and Conversations (for the thread-like state container that Assistants exposed). The migration guide is one of the few times OpenAI has shipped detailed side-by-side examples instead of a one-line deprecation notice, which suggests the team understands the cost of the change for production deployments. The Realtime beta interface was also retired May 7, 2026 in favor of the GA Realtime contract, which is similar but not byte-compatible.
DALL-E 2 and DALL-E 3 retire on May 12, 2026. Sora 2 video generation, which had a brief life from late 2025 through early 2026, is scheduled to shut down on September 24, 2026 after OpenAI announced it was discontinuing the product on March 24, 2026. The text moderation models (text-moderation-latest, text-moderation-stable) retired October 27, 2025, with omni-moderation as the replacement. Several model snapshots also have hard sunset dates: legacy GPT-3.5 Turbo and GPT-4 variants are scheduled for October 23, 2026, with gpt-4.1-mini and gpt-4.1 as the recommended replacements; the original o1 series is going away the same day, with o3 as the migration target.
Every API call carries an Authorization: Bearer <API_KEY> header. There is no OAuth flow for first-party applications, no JWT exchange, and no signed URL scheme. The bearer token is the only thing that sits between a client and the model, which keeps the surface simple and makes any leak immediately catastrophic. OpenAI scans GitHub and a few other public surfaces for leaked keys and revokes them automatically when they appear, but the gap between the leak and the revocation is long enough for someone to drain a credit balance.
OpenAI introduced project-scoped API keys in April 2024 and has been steering developers off the older organization-wide "user keys" since. A project key is bound to a single project inside an organization, and the project itself carries usage limits, member lists, and rate-limit settings. Compromising a project key cannot reach data or billing in another project, which makes the model better suited to multi-tenant SaaS deployments than the old shared keys. Most enterprises now create one project per environment (development, staging, production) and a separate project per major product line, so a key leak is contained both blast-radius-wise and audit-wise.
A typical request looks like:
curl https://api.openai.com/v1/responses \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-5", "input": "hello"}'
Clients that belong to several organizations can disambiguate with two optional headers: OpenAI-Organization for the organization id and OpenAI-Project for the project id. Both default to whatever the API key is bound to. Older user-level keys still work but are scheduled for eventual deprecation; OpenAI has not committed to a specific date but has been moving the dashboard's defaults toward project keys for over a year.
Service accounts let an organization or project own a key that is not tied to an individual user account. Service accounts work like normal project keys but survive employee turnover, which is the kind of detail that only matters until the day a developer leaves and a key has to be rotated under pressure. Admin API keys are a separate class managed at the organization level. They cannot call inference endpoints; instead, they unlock the audit logs, project management, and user provisioning APIs that OpenAI shipped alongside the project key system in August 2024. Admin keys are how SCIM-style provisioning, automated key rotation, and compliance reporting get built on top of the platform.
OpenAI also supports SSO with SAML, JIT user provisioning, and SCIM for organizations on the enterprise plan. Those mechanisms govern who can log in to the dashboard and create keys; they do not change how the API itself authenticates a request, which remains the bearer token model.
OpenAI maintains first-party libraries for most major languages, all generated from a shared OpenAPI specification. Generating from a single spec is what keeps feature support roughly synchronized across SDKs, although the Python and Node libraries usually pick up new endpoints first.
| SDK | Repository | First release | Notes |
|---|---|---|---|
| Python | openai/openai-python | 2020 | Reference implementation, ships with async, streaming, retry, and Pydantic-typed responses |
| Node.js / TypeScript | openai/openai-node | 2020 | Edge-runtime compatible, used by most JavaScript and Bun projects |
| .NET | openai/openai-dotnet | June 2024 (beta), October 2024 (GA) | Built in collaboration with Microsoft, full Assistants v2 and chat support |
| Go | openai/openai-go | July 2024 | Requires Go 1.22+, official replacement for the popular community sashabaranov/go-openai |
| Java | openai/openai-java | 2024 (beta), 2025 (GA) | Maven coordinates com.openai:openai-java, current major version 4.x |
| Agents SDK (Python) | openai/openai-agents-python | March 2025 | Higher-level orchestration on top of the Responses API |
| Agents SDK (TypeScript) | openai/openai-agents-js | June 2025 | Same surface for Node/Deno/Bun |
The community ecosystem is enormous. sashabaranov/go-openai predated the official Go SDK by more than a year and still has more downloads. OkGoDoIt/OpenAI-API-dotnet was the most popular .NET option before Microsoft and OpenAI shipped the official package. Spring AI, LangChain4j, simple-openai, and the various "OpenAI for Rust" crates remain widely used. There is no official Rust SDK, although the Codex CLI itself is written in Rust and contains a usable client. PHP, Ruby, Swift, Kotlin, and Elixir all have well-maintained community libraries; OpenAI links to a curated list on the SDKs and CLI page.
Most SDKs default to reading the API key from the OPENAI_API_KEY environment variable, expose synchronous and asynchronous variants, and stream chunks back as iterators or async iterables. They all wrap the same HTTP API, so downgrading to a raw curl or [[fetch]] call is straightforward when something does not work. The official libraries also share a few quality-of-life features: automatic retries with exponential backoff on retryable errors, configurable timeouts, structured error types, and request-level idempotency keys. The Python and Node SDKs additionally expose typed event streams for the Responses and Realtime APIs, which is what the Agents SDK builds on.
The Agents SDK deserves a separate mention. It is not strictly necessary; everything it does can be coded directly against Responses and Conversations. What it provides is opinion: a Runner class that executes the agent loop, a Handoff primitive for delegating to other agents, a Tracing integration that logs every step to the dashboard, and a Guardrail system for input and output validation. For teams that are starting from scratch on a multi-agent workflow, it removes a few hundred lines of glue code. For teams that already have an agent harness, the Responses API is usable directly without it.
Pricing is per-token for text and per-second or per-megabyte for audio, image, and video. The published rates change often, and the pricing page is the authoritative source. The structure has been more stable than the numbers, which makes it worth understanding the categories rather than memorizing the rates.
The standard rate is what most developers pay. Each model has a per-million input rate and a per-million output rate, with output charged at a 4x to 8x multiple of input on the flagship models. As of mid-2026, GPT-5 is $1.25 per million input tokens and $10 per million output tokens; GPT-5.2 sits at roughly $1.75 input and $14 output; the smaller gpt-4.1-mini, gpt-4o-mini, and o4-mini price an order of magnitude lower. Reasoning tokens generated by o-series models are billed as output tokens even when they are not visible in the response, which means a request that prints fifty tokens of visible text can still bill for fifty thousand tokens of reasoning.
Cached input tokens are billed at 10% of the standard input rate when prompt caching applies. Caching is automatic for prompts of at least 1,024 tokens, kicks in across calls within a few minutes, and was rolled out on October 1, 2024. The 90% discount can be the difference between a profitable RAG system and an unprofitable one, so most production stacks now structure prompts so that the static portion (system prompt, tool schemas, retrieved documents) sits at the front of the message array where it can be cached. The cache is per-organization, lives on a single inference cluster for several minutes, and refreshes lazily on hits. There is no API to manage cache entries; they are populated and evicted automatically.
The Batch API trades latency for a flat 50% discount on every model. A batch is a JSONL file uploaded to the Files API and submitted to /v1/batches; OpenAI guarantees results within 24 hours and frequently returns them in minutes. Batch traffic does not count against the synchronous rate limits, which makes it the standard way to run nightly evaluations or large embedding jobs without throttling production traffic. Webhooks fire on completion. The discount applies to input, output, and even cached tokens, and stacks with prompt caching when the batch contains repeated prefixes.
Priority processing is a pay-as-you-go premium tier introduced in 2025. The headline rate is roughly 1.5x to 2x the standard input and output prices, and in exchange OpenAI promises lower latency, fewer 503s during peak hours, and a separate rate-limit pool. It is enabled per request via a service_tier parameter and is positioned as the right choice for latency-sensitive consumer features. Priority is still pay-as-you-go and does not require a contract, which sets it apart from Scale Tier.
Scale Tier is the enterprise version. Customers buy "token units" (a fixed number of input and output tokens per minute) for a single model snapshot, with a 30-day minimum commitment, dedicated capacity, and a 99.9% uptime SLA. Pricing is custom and contractual rather than self-serve. The same model is available to Azure customers as Provisioned Throughput Units, although the unit math differs. Scale Tier customers also get earlier access to new model snapshots and to capacity guarantees during launches, which is a non-trivial advantage when a flagship model launches and the standard tier rate-limits everyone for a week.
Region-specific endpoints for data residency add a 10% surcharge for models released after March 5, 2026. The list of supported regions has grown gradually, and includes EU, UK, Japan, Korea, Canada, India, and Australia at the time of writing. Regional traffic is processed inside the region; usage and billing remain global.
The free moderation tier is the only place where calls are not metered. The omni-moderation model is free to use through /v1/moderations, with rate limits that scale with the caller's usage tier. This is a deliberate policy choice on OpenAI's side: free moderation removes the financial reason for developers to skip safety filtering on user input.
A simple worked example may help calibrate expectations. A chatbot that processes 1,000 requests per day with an average of 2,000 input tokens and 500 output tokens on GPT-5 costs about $2.50 + $5.00 = $7.50 per day at standard rates, before caching. Most of that input is a static system prompt and is cached after the first call, dropping the input portion to $0.25 per day, for a total around $5.25 daily. Running the same workload on the Batch API would cut another 50% off, but only matters if 24-hour latency is acceptable. Add Priority processing during business hours and the math gets more complicated; most teams end up running mixed-tier strategies where production user traffic is on Priority and offline analytics are on Batch.
Rate limits are dimensional, per-model, and per-organization. The dimensions are RPM (requests per minute), TPM (tokens per minute), RPD (requests per day), TPD (tokens per day) for some models, IPM (images per minute) for image endpoints, and audio minutes per minute for streaming audio. Hitting any single dimension returns a 429.
There are six usage tiers: Free, Tier 1, Tier 2, Tier 3, Tier 4, and Tier 5. Tier promotions are automatic and happen on a combination of cumulative paid spend and account age:
| Tier | Qualification | Indicative scale |
|---|---|---|
| Free | New accounts, limited models | A few requests per minute |
| Tier 1 | Any payment method on file | 500 RPM and 30,000 TPM on GPT-4o; ~500k TPM on GPT-5 |
| Tier 2 | $50 cumulative paid + 7 days | About 5x Tier 1 |
| Tier 3 | $100 cumulative paid + 7 days | About 10x Tier 1 |
| Tier 4 | $250 cumulative paid + 14 days | Several million TPM on GPT-5 |
| Tier 5 | $1,000 cumulative paid + 30 days | The published ceiling for self-serve |
For accounts that need more than Tier 5 provides, the path forward is Scale Tier, Priority processing, or a direct conversation with OpenAI's sales team. Limits are visible in the dashboard under Settings, then Limits, which lists the cap for every dimension on every model that the account can use. Limits at the project level are set independently and act as ceilings underneath the organization-wide caps. That nesting is useful for protecting a flagship model's quota from being burned by a runaway development project; setting a low project cap on a development project means the production project always has its full share.
Access to specific models is also tier-gated. The computer-use tool was originally restricted to Tier 3 and above, the o-series reasoning models had similar gating during their preview windows, and brand-new model snapshots typically start with reduced limits while OpenAI watches for abuse patterns. Reinforcement fine-tuning was originally Tier 4 and up. The pattern is consistent enough that "wait a week and try again" is a reasonable workaround when a feature is gated above the current tier.
The Batch API has its own pool. Each batch request counts against the daily batch quota in tokens (typically several billion per day on the high-volume tiers), but does not consume synchronous TPM or RPM. That separation is the main reason large embedding pipelines and offline evaluations move to batch even when the 24-hour window is not strictly necessary.
The 429 response carries useful headers. x-ratelimit-limit-tokens, x-ratelimit-remaining-tokens, x-ratelimit-limit-requests, and x-ratelimit-remaining-requests show the current quota and what is left. retry-after (or its sibling retry-after-ms) indicates the wait time in seconds. Honoring the header is faster and friendlier than guessing, and it is the only way to avoid a tight retry loop that just keeps blowing through the rate limit and consuming budget on failed requests.
Function calling, which is what OpenAI now calls "tool use," is the mechanism that lets a model decide to call an external function and pass it structured arguments. It launched on June 13, 2023 inside Chat Completions, originally as functions and function_call parameters. The November 2023 update generalized those parameters into tools and tool_choice, allowing several tool types in a single call.
A tool definition is a JSON Schema describing the function name, description, and parameters. The model returns a tool_calls array containing the chosen function and a JSON arguments string; the client executes the function locally, sends the result back as a tool role message, and the model continues. Multiple tool calls per turn are supported, including parallel calls when the model judges them independent. The parallel_tool_calls parameter (default true) lets the developer turn that off when ordering matters, which is common in workflows where tool A's output feeds tool B's input.
In the Responses API, tools are first-class items in the input array and tool calls are emitted as discrete output items rather than wrapped inside a chat message. The Responses surface also supports OpenAI's built-in tools that run on OpenAI infrastructure rather than the client:
web_search performs live web searches and grounds the response in citations.file_search retrieves from a vector store the developer has populated with uploaded files.code_interpreter runs Python in a sandboxed container.computer_use drives a virtual browser or desktop, paired with a computer-use-preview model.Tools are charged for what they cost OpenAI to run. Web search calls are billed per query (typically a few cents per call), code interpreter sessions per minute, and computer-use turns at a higher rate than ordinary tokens because they include both reasoning and execution. Built-in tools also count against per-tool rate limits; web search has its own QPS cap that is independent of the model's TPM.
The choice between custom function calling and built-in tools is a build-or-buy question. Custom functions give complete control over what the tool does, where it runs, and how it logs. Built-in tools save the integration work but lock the application to OpenAI's implementation. Most production agents end up with a mix: built-in web_search and code_interpreter because they are hard to replicate cleanly, and a long tail of custom tools that hit the application's own database, internal services, and proprietary APIs.
Structured Outputs guarantees that the model's text output exactly matches a developer-supplied JSON Schema. It launched on August 6, 2024 alongside gpt-4o-2024-08-06, which was the first model trained to handle complex schemas, and OpenAI also added a constrained decoding path so the guarantee is engineering-backed rather than just a model behavior. Structured Outputs is enabled with response_format: { type: "json_schema", json_schema: ... } on Chat Completions, or text.format on Responses. It also works on tool definitions, which is how most production agents now describe their tools because it removes the entire class of "the model returned almost-valid JSON" bugs.
The supported schema features are a subset of full JSON Schema. Required: string, number, boolean, integer, array, object, enum, anyOf. Not supported: oneOf, allOf, conditional schemas, recursive references with $ref to anywhere outside the document. There is also a hard cap on schema depth (5 levels of nesting) and total property count (100). Schemas that exceed those limits return a 400 at request time, not silently at response time, which makes the failure mode obvious during development.
The older JSON Mode (response_format: { type: "json_object" }) still works and is supported on more models, but it only guarantees that the response parses as JSON; the schema constraint is on the developer to enforce. JSON Mode predates Structured Outputs by about ten months and remains useful when the application needs free-form JSON whose shape changes per call. For everything else, Structured Outputs is the better default.
Most endpoints support streaming. For Chat Completions, setting stream: true makes the response a stream of [[Server-Sent Events]] (SSE), each carrying a delta object with the next chunk of content, tool call arguments, or finish reason. The stream terminates with a data: [DONE] line.
Responses uses a richer event model. The response is still SSE, but each event has a typed name like response.created, response.output_text.delta, response.output_item.added, response.completed, or error. That extra structure is what lets agent frameworks render reasoning steps, tool calls, and final text separately without parsing inline JSON. The Conversations API uses the same event types when responses are streamed back through it.
Streaming reduces time-to-first-token from several seconds to under one second on most models, which is critical for chat-style interfaces. It does not reduce total cost, since billing is by token regardless of how the tokens are delivered. Most SDKs hide the SSE plumbing behind an async iterator: in Python, for chunk in client.chat.completions.create(..., stream=True); in Node, for await (const chunk of stream).
There are a few common pitfalls. Streaming hides errors that occur mid-response: a 500 error returned after some tokens have been emitted will appear as a truncated stream rather than as a clear failure, so production code needs to distinguish "stream ended with [DONE]" from "stream ended without a finish reason." Backpressure is another concern; consumers that cannot keep up with the stream can cause the connection to back up and eventually time out. Buffering chunks into larger updates before passing them to a UI usually solves both problems.
For Responses, the include parameter controls which event types the server emits. By default, all events are sent. Setting include: ["response.output_text.delta", "response.completed"] keeps the bandwidth down for clients that only need the final text and does not need to render intermediate states.
The API has supported image input on chat models since November 2023. The image is passed as either a public URL or a base64-encoded data URI inside a chat message; clients control fidelity with a detail parameter (low, high, or auto). Image tokens are billed alongside text tokens and the cost scales with resolution and detail level. The supported formats are PNG, JPEG, GIF (first frame only), and WebP. Maximum image size is 20 MB per image at the API level; the model's effective resolution cap depends on the model, with most flagship models accepting images up to about 2048 by 2048 pixels at high detail.
Audio input arrived with gpt-4o-audio-preview in 2024 and went GA across the GPT-5 family in 2025. Models accept a base64 WAV or MP3 inside a message and can return audio output the same way. The Realtime API uses the same models with a streaming transport. For non-realtime audio chat, the latency is comparable to text chat plus the time to upload the audio file, which is usually a few hundred milliseconds for clips under a minute.
The Audio endpoints are separate. POST /v1/audio/speech does text-to-speech with tts-1, tts-1-hd, and the newer gpt-4o-mini-tts family, supporting voices alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse, marin, and cedar, in mp3, opus, aac, flac, wav, and pcm. The newer voices support style instructions ("speak slowly," "sound enthusiastic") embedded in the input text, which the older tts-1 voices ignore. POST /v1/audio/transcriptions does speech-to-text with whisper-1 and the newer gpt-4o-transcribe family, accepting mp3, mp4, mpeg, mpga, m4a, wav, and webm files up to 25 MB. POST /v1/audio/translations does the same but always returns English.
Image generation moved off DALL-E and onto gpt-image-1 after April 23, 2025, and onto gpt-image-1.5 later that year. The endpoint is POST /v1/images/generations, with /edits and /variations for image-to-image work. Output is returned as base64 by default or as a URL with a short expiry. The images endpoint accepts a size parameter (1024x1024, 1024x1536, 1536x1024, and a few smaller variants), a quality parameter (low, medium, high), and a style parameter for some variants. Pricing is per generated image and varies with size and quality. The same model is also accessible from inside Chat Completions and Responses via the image_generation tool, which lets a single conversational call mix text reasoning with image creation.
The Files API handles uploads of documents, images, and audio for use across endpoints. Individual files can be up to 512 MB, and each project can store up to 2.5 TB of files in total. Uploads to this endpoint are rate-limited to 1,000 requests per minute per authenticated user. The Uploads API handles files up to 8 GB by accepting them in multiple parts and assembling a final File object on completion.
The Realtime API solves a problem that the standard request-response surface cannot: low-latency, full-duplex voice. It launched in public beta on October 1, 2024 with WebSocket as the only transport, added WebRTC support in December 2024, and went generally available on August 28, 2025 alongside the new gpt-realtime model. SIP is supported as a third transport, which is what enables direct phone integration without a media server in the middle.
A Realtime session is an open connection that streams audio frames in both directions plus a control channel of typed JSON events. The model can interrupt itself when it detects the user starting to speak, emit transcripts of both sides, call tools mid-sentence, and switch voices on the fly. Image input is supported during a session, which is how voice agents can answer questions about something the user is showing on camera. Remote MCP servers can be wired in as tools, the same as in Responses.
The transport choice matters. WebSockets are the simplest to implement from a server-side application but introduce 200 to 500 ms of round-trip latency on top of the model latency. WebRTC adds a media stack and STUN/TURN configuration but cuts the network latency to under 100 ms, which is what lets a voice agent feel conversational rather than walkie-talkie-like. SIP is the choice when integrating with an existing phone system; the API accepts an inbound call directly and can dial out as well.
Pricing is per audio token rather than per second, with a separate input and output rate. The GA gpt-realtime is roughly $32 per million audio input tokens and $64 per million audio output tokens, about 20% cheaper than the preview version. There is also a less expensive gpt-realtime-mini variant for use cases where the full model is overkill. The original beta interface was retired May 7, 2026 in favor of the GA contract.
Server-side controls let an application supervise an in-progress session: redact transcripts before they reach the model, cancel a response mid-generation, override the voice or speaking style, and inject system messages without restarting the connection. Webhooks deliver realtime.call.incoming events for inbound SIP calls and realtime.call.completed for session summaries, which is how production deployments wire up call-center workflows.
The two endpoints look superficially similar but model the world differently. Chat Completions is stateless: every request carries the entire message array, the server processes it, and nothing is retained. Responses is stateful by default: the server stores the conversation history and tool state, the next request only carries the new turn, and reasoning context can persist across calls.
The practical differences:
| Aspect | Chat Completions | Responses |
|---|---|---|
| State | Stateless, client owns history | Stateful, optional store: false for stateless mode |
| Input shape | messages: [...] | input: [...] plus previous_response_id |
| Tools | tools: [...] with custom JSON schemas | Same plus built-in web_search, file_search, code_interpreter, computer_use, MCP |
| Reasoning | o-series reasoning tokens are billed but not surfaced cleanly | Reasoning items are first-class output items |
| Streaming | Generic delta events | Typed events (response.output_text.delta, etc.) |
| Conversation containers | None, client-side only | Conversations API holds threads |
| Background jobs | Not supported | background: true, webhook on completion |
| Migration story | None needed | Responses can call the same models |
OpenAI has been clear that new projects should start on Responses, but it has been equally clear that Chat Completions is not deprecated. The two endpoints will likely coexist for years, in the way that the old /v1/completions endpoint coexisted with /v1/chat/completions even after the latter became the obvious choice. For most teams, the migration question comes down to whether the application benefits from server-side state and the built-in tools. Pure RAG over a developer-managed index? Chat Completions is fine. Multi-step agent that uses web search, runs code, and remembers what it did three turns ago? Responses pays back the migration cost.
The Conversations API sits underneath Responses as the thread-like container. A Conversation object is a server-side bundle of message history that several Responses calls can attach to, and it is the closest replacement for the Assistants Thread object. Conversations are not strictly required; an application can pass previous_response_id to chain calls without ever creating a Conversation. They become useful when several agents collaborate on the same conversation, when a conversation needs to span sessions or devices, or when the application wants OpenAI to handle context compaction automatically.
Webhooks let OpenAI push events to a URL the developer controls instead of forcing the client to poll. The webhook system shipped in 2025 and uses the [[Standard Webhooks]] specification, which means HMAC-SHA256 signatures sent in a webhook-signature header in the format v1,base64_encoded_signature. Developers configure a webhook with a name, a public HTTPS endpoint, and a list of subscribed event types. OpenAI generates a signing secret on creation that is shown exactly once.
Supported event categories include:
response.completed, response.cancelled, response.failedbatch.completed, batch.cancelled, batch.expired, batch.failedfine_tuning.job.succeeded, fine_tuning.job.failed, fine_tuning.job.cancelledrealtime.call.incoming, realtime.call.completedFor batch and fine-tuning workloads, webhooks are the difference between a polling loop that may run for hours and a single push event that arrives the moment the job is done. For Realtime calls coming in over SIP, the realtime.call.incoming webhook is what lets a backend route the call to a session it has prepared, the same pattern that traditional telephony platforms use.
The signature verification model is straightforward: concatenate the message id, timestamp, and body, HMAC with the secret, base64 encode, and compare with the signature header. OpenAI's documentation includes copy-paste implementations for Python, Node, and Go. Most teams reuse a Standard Webhooks library rather than rolling their own.
The shape of /v1/chat/completions has become a de facto industry standard. [[Anthropic]], [[Google Gemini]], [[Mistral]], [[Cohere]], [[Together AI]], [[Fireworks AI]], [[Groq]], [[OpenRouter]], [[vLLM]], [[Ollama]], [[LM Studio]], and most other LLM serving platforms expose either an OpenAI-compatible endpoint or a near-compatible one. Most SDKs let clients swap in a different base_url and an alternate API key, then talk to a non-OpenAI backend with the same code. The compatibility is rarely complete: tool calling formats drift, streaming event shapes differ in the details, and provider-specific parameters (cache control, thinking budgets, safety modes) get added or removed at the edges. For migrations, the gap is small enough that most code works without changes for the chat endpoint and large enough that anything using Responses, Realtime, or built-in tools needs a rewrite.
Azure OpenAI Service is the closest thing to OpenAI's own API but is not byte-compatible. Azure uses an api-key header instead of Authorization: Bearer, requires a deployment name in the path instead of exposing models directly, and lags on new endpoints by weeks or months. The Responses API arrived on Azure several months after OpenAI's launch; Sora and computer-use lagged similarly. For developers who need to dual-target both, the official Azure SDKs handle the differences, and the OpenAI SDKs can usually be pointed at Azure with a custom base URL plus a small adapter. Azure also supports its own concept of capacity reservation (Provisioned Throughput Units) that does not match OpenAI's Scale Tier exactly, so cross-cloud comparisons require care.
Local-first runtimes like Ollama and LM Studio also implement the chat completions surface. The trade-off is the usual one: local models are private and free at the marginal call but lag the closed models on capability, and they almost never implement Responses, Realtime, or the built-in tools. For development, prototyping, and offline use cases, the local OpenAI-compatible servers are excellent. For production, most teams use them as a fallback or for non-sensitive batch workloads rather than as the primary path.
A handful of router services (LiteLLM, OpenRouter, Helicone, Portkey) sit in front of the API and offer a single OpenAI-shaped endpoint that fans out to multiple providers. Those routers usually add their own observability, retry, caching, and key management, and they are the standard answer for organizations that want to A/B test model providers without rewriting their application code.
The API uses the standard HTTP status codes. The ones that show up most often:
| Status | Meaning | Recommended action |
|---|---|---|
| 200 | Success | Process the response |
| 400 | Bad request, schema or parameter error | Fix the client; do not retry |
| 401 | Invalid or missing API key | Check the key, do not retry |
| 403 | Region not supported, or access not granted | Confirm region and account status |
| 404 | Model or resource not found | Verify the model name, especially after a deprecation |
| 408 | Request timed out | Retry with backoff, consider streaming |
| 409 | Conflict, common on idempotency-key reuse | Use a different idempotency key |
| 413 | Payload too large | Trim the request or use the Uploads API |
| 422 | Unprocessable entity | Fix the input |
| 429 | Rate limit exceeded | Honor the retry-after header, exponential backoff with jitter |
| 500 | Internal server error | Retry with backoff |
| 503 | Service unavailable, model overloaded | Retry with backoff, consider a fallback model |
| 529 | Overloaded (rare on OpenAI; common on competitor compatible endpoints) | Retry with backoff |
Beyond the status code, every error response carries a JSON body with error.type, error.code, error.param, and error.message. The code field is the machine-readable identifier and is what production code should switch on; common values include invalid_api_key, insufficient_quota, model_not_found, context_length_exceeded, tokens_exceeded, and rate_limit_exceeded. The message field is human-readable and changes wording occasionally, so matching on it is brittle.
The best-practice patterns OpenAI recommends in its cookbook have been stable for years:
retry-after header on a 429 is authoritative; respecting it is usually faster than guessing.max_completion_tokens and timeouts. Both protect against runaway reasoning and silently expensive calls.gpt-5 resolve to whatever snapshot is current; pinning to gpt-5-2026-04-15 (or whatever snapshot the application was tested against) keeps behavior reproducible across deploys.A few additional items show up often enough in incident postmortems to be worth listing:
context_length_exceeded is a 400 not a 429, so it does not retry; it surfaces immediately to the user. Token-counting before sending is cheap insurance.OpenAI publishes a status page at status.openai.com that lists incidents per service. The API generally has 99.9% availability across a quarter, but individual models have noticeably worse incident rates during launch weeks. Building a fallback path to a different model (or a different provider through one of the OpenAI-compatible routers) is the single most effective way to keep an application up during those windows.