OpenAI API
Last reviewed
May 8, 2026
Sources
42 citations
Review status
Source-backed
Revision
v5 ยท 7,994 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 8, 2026
Sources
42 citations
Review status
Source-backed
Revision
v5 ยท 7,994 words
Add missing citations, update stale details, or suggest a clearer explanation.
The OpenAI API is a REST-based application programming interface that gives developers programmatic access to OpenAI's family of artificial intelligence models, including the GPT series of large language models, the o-series reasoning models, Codex coding models, DALL-E and gpt-image-1 image generators, Whisper and gpt-4o-transcribe speech recognition, and the gpt-realtime audio agent stack. First launched in June 2020 as a private beta alongside GPT-3, the API has grown into one of the most widely used AI developer platforms in the world, serving over one million organizations as of December 2025 [1].
The API follows a pay-per-use pricing model based on tokens (roughly four characters of English text), with separate rates for input and output tokens, plus deep discounts for cached input and asynchronous batch jobs. Developers authenticate using API keys scoped to projects within organizations, and they can access models through HTTP endpoints or official SDKs for Python, Node.js, .NET, Java, and Go [2]. As of May 2026, the primary endpoint is the Responses API, an agent-friendly successor to Chat Completions that bundles built-in tools (web search, file search, code interpreter, computer use, image generation, and remote Model Context Protocol servers) into a single stateful interface.
OpenAI opened its API to developers as a private beta in June 2020, coinciding with the release of GPT-3. This marked the company's first commercial product, positioning GPT-3's 175 billion parameters as a general-purpose text engine available over HTTP [3]. The initial API offered text completions through a single endpoint, with pricing based on four model tiers (Ada, Babbage, Curie, and Davinci) that traded off capability for cost. Davinci, the most capable tier, was priced at $0.06 per 1,000 tokens, or roughly $60 per million tokens.
The beta attracted thousands of developers and spawned early AI-powered applications in copywriting, customer support, and code generation. Microsoft secured an exclusive license to GPT-3's underlying technology in September 2020, though the API itself remained accessible to all approved developers [4]. By the end of 2021, OpenAI had also released Codex through the API, which would go on to power the original GitHub Copilot.
On March 1, 2023, OpenAI released the gpt-3.5-turbo model through the API, introducing the Chat Completions endpoint that would become the standard interface for conversational AI [5]. This model was priced at $0.002 per 1,000 tokens, a 10x reduction from GPT-3 Davinci, while delivering superior instruction-following and dialogue quality. The launch coincided with an explosion of developer interest following ChatGPT's viral success in late 2022.
The Chat Completions format introduced the now-standard message-based structure with system, user, and assistant roles, replacing the older text-in/text-out completions paradigm. This format has since been adopted across the industry by Anthropic, Google, and others.
GPT-4 arrived in the API on March 14, 2023, bringing multimodal capabilities (text and image input) and significantly improved reasoning [6]. Throughout 2023 and 2024, OpenAI rapidly expanded the API's feature set:
| Date | Feature |
|---|---|
| June 2023 | Function calling, enabling models to generate structured JSON for tool use |
| November 2023 | GPT-4 Turbo with 128K context, JSON mode, and the Assistants API beta |
| April 2024 | Batch API at 50% off list price with a 24-hour SLA |
| May 2024 | GPT-4o with native multimodal processing at half the cost of GPT-4 Turbo |
| July 2024 | GPT-4o mini at $0.15 per million input tokens |
| August 2024 | Structured Outputs guaranteeing schema-compliant JSON |
| September 2024 | o1 reasoning models with chain-of-thought capabilities |
| October 2024 | Realtime API beta over WebSocket for voice agents |
| December 2024 | o1 GA, function calling, structured outputs, and developer messages |
In March 2025, OpenAI introduced the Responses API as the successor to both the Chat Completions API and the Assistants API [7]. The Responses API functions as a superset of Chat Completions, adding built-in tools (web search, file search, code interpreter, computer use, image generation), server-side conversation state, and a more flexible input format for multimodal data. The same announcement introduced the Agents SDK, an open source orchestration framework for multi-agent workflows. The Assistants API was deprecated in August 2025 with a sunset date of August 26, 2026 [8]. WebRTC support arrived in the Realtime API in early 2025, followed by the production gpt-realtime model in late 2025. Reinforcement Fine-Tuning (RFT) became generally available in May 2025 for o-series reasoning models, and DPO (Direct Preference Optimization) followed for the GPT-4.1 family.
Major model releases continued at a rapid cadence: GPT-4.1 in April 2025, GPT-5 in August 2025, GPT-5.1 in November 2025, GPT-5.2 in December 2025, GPT-5.4 in March 2026, and GPT-5.5 in April 2026. Built-in image generation via gpt-image-1 launched in April 2025, and the computer use tool, originally exposed through the Operator consumer product, became available as a Responses API tool for tier 3-5 customers later that year.
The OpenAI API exposes several endpoint families. Some are recommended for new work, others remain available primarily for backward compatibility [2][9]. The table below summarizes the current state.
| Endpoint | Path | Status (May 2026) | Primary use |
|---|---|---|---|
| Responses | /v1/responses | Recommended for all new work | Stateful, agentic, multi-tool generation |
| Conversations | /v1/conversations | GA, paired with Responses | Persistent multi-turn state |
| Chat Completions | /v1/chat/completions | Maintained, no new agent features | Wire-compatible legacy interface |
| Completions | /v1/completions | Legacy, text-only | Older base models, fine-tuned bases |
| Realtime | wss://api.openai.com/v1/realtime | GA over WebSocket, WebRTC, and SIP | Speech-in/speech-out agents |
| Assistants v2 | /v1/assistants, /v1/threads | Deprecated, sunsets Aug 26 2026 | Legacy assistant projects |
| Embeddings | /v1/embeddings | GA | Vector representations for retrieval |
| Images | /v1/images/generations, /edits, /variations | GA | DALL-E 3 and gpt-image-1 |
| Audio Speech | /v1/audio/speech | GA | TTS-1, TTS-1-HD, gpt-4o-mini-tts |
| Audio Transcriptions | /v1/audio/transcriptions | GA | Whisper-1, gpt-4o-transcribe family |
| Audio Translations | /v1/audio/translations | GA | Translate audio to English text |
| Files | /v1/files | GA | Upload data for fine-tuning, batch, file search |
| Vector stores | /v1/vector_stores | GA | Hosted vector database for file search |
| Fine-tuning | /v1/fine_tuning/jobs | GA, supports SFT, DPO, RFT | Custom model training |
| Batch | /v1/batches | GA, 50% discount, 24h SLA | High-throughput async jobs |
| Moderations | /v1/moderations | GA, free | Safety classification |
| Models | /v1/models | GA | List and inspect available models |
The Responses API at /v1/responses is the most important addition to the platform since Chat Completions itself. It combines text generation, tool use, and built-in services like web search and code interpreter into a single request, and it stores reasoning state server-side so subsequent turns can reuse it [10]. OpenAI reports that GPT-5 served through the Responses API scores about 3% higher on SWE-bench than the same model served through Chat Completions, primarily because the Responses API can keep the model's hidden reasoning chain across tool calls instead of throwing it away every turn. Cache utilization improves 40-80% relative to Chat Completions on long sessions for the same reason.
A minimal request looks like a Chat Completions call with messages replaced by an input field that accepts strings, message arrays, image parts, or file parts, plus a tools array mixing function definitions with hosted tools. The response includes an output array of typed items (message, tool_call, reasoning, mcp_list_tools), which is easier to parse than the older choices/messages structure.
State management follows two patterns. With store: true (the default), OpenAI persists the response object for 30 days, and a follow-up call can pass previous_response_id to chain turns without resending the full history. For organizations on Zero Data Retention, or anyone who prefers stateless deployments, the API returns encrypted reasoning items that the client passes back on the next turn to preserve thinking without storing it on OpenAI's servers [10]. The Responses API also exposes a phase field on streamed messages (introduced in early 2026) that labels assistant text as either commentary or final_answer. A WebSocket variant arrived in February 2026 for long-running streams.
Chat Completions remains supported and continues to receive new model snapshots, but it does not get new agent features. Built-in tools, the computer use tool, image generation as a tool, and MCP connectors are Responses-only. Function calling, vision, structured outputs, streaming, and reasoning effort all work in both endpoints, so existing Chat Completions code keeps running.
OpenAI's official guidance is that new projects should use Responses, and existing Chat Completions projects can migrate when they need an agent feature. The migration path is deliberately cheap: most Chat Completions code converts to Responses by changing the URL and renaming messages to input [11].
The original /v1/completions endpoint still works for the handful of base models that predate the chat format and for fine-tuned models built on those bases. New flagship models do not ship with a base completion endpoint. Most developers will never use this path.
The Realtime API is the audio-native interface, designed for low-latency voice agents. It speaks WebSocket, WebRTC, or SIP (for telephony), and it carries audio in both directions plus text, transcripts, and tool calls on the same channel [12]. The dedicated Realtime API section below covers transports, voices, function calling, and pricing in detail.
The Assistants API, introduced at OpenAI's DevDay in November 2023, provided a higher-level abstraction for building AI assistants with persistent threads, automatic context management, and built-in tools [13]. The v2 release in April 2024 replaced the original retrieval tool with the file search tool backed by hosted vector stores. In August 2025, OpenAI deprecated the Assistants API in favor of the Responses API. The Assistants API will be removed on August 26, 2026 [8]. The Assistants API v2 and migration section below covers the migration path.
The /v1/embeddings endpoint converts text into numerical vector representations useful for semantic search, clustering, and retrieval-augmented generation (RAG). The current models, text-embedding-3-small and text-embedding-3-large, are trained with Matryoshka Representation Learning, so a developer can request fewer dimensions (down to 256 or 512) and still get most of the retrieval quality at a fraction of the storage cost [14]. On MTEB, text-embedding-3-large truncated to 256 dimensions beats text-embedding-ada-002 at its full 1,536. text-embedding-3-large defaults to 3,072; text-embedding-3-small defaults to 1,536. The legacy ada-002 still works but is not recommended.
The Images API hosts two model families. DALL-E 3 generates images at fixed sizes (1024x1024, 1792x1024, 1024x1792) and quality settings (standard or HD). The newer gpt-image-1, released in April 2025, is a natively multimodal model that accepts both text and image inputs and is much better at rendering text inside images, following style instructions, and producing consistent characters across edits [15]. gpt-image-1 supports image generation, editing, and inline use as a Responses tool. A successor, gpt-image-1.5, rolled out alongside GPT Image 2 in early 2026. Images are priced per image with size and quality multipliers.
Three audio endpoints handle the major speech tasks. /v1/audio/speech generates spoken audio using TTS-1, TTS-1-HD, or gpt-4o-mini-tts (which supports prompt-based voice direction). /v1/audio/transcriptions converts speech to text using whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize, the last of which produces speaker-aware transcripts via the diarized_json response format [16]. /v1/audio/translations translates non-English audio into English text. The gpt-4o-transcribe family has lower word error rates than Whisper and supports streaming transcription. Whisper-1 remains popular for batch transcription because it is cheap and well-understood.
The Files API stores artifacts (training data, batch payloads, RAG documents) for use across the platform. Vector stores are managed retrieval indexes that back the file search tool: a developer uploads files, attaches them to a vector store, and then either calls file search inside a Response or reads the index directly through /v1/vector_stores. Vector stores handle chunking, embedding, and ranking automatically, so most projects avoid running their own vector database [17]. The Moderations endpoint classifies text (and with omni-moderation, images) across hate, harassment, sexual, self-harm, and violence categories; it is free. /v1/models lists every model the calling key has access to, with metadata like ownership and creation date. Fine-tuning and Batch are covered in dedicated sections below.
As of May 2026, the OpenAI API offers a wide range of models across several families. The following tables list the primary models with current pricing per million tokens [20][21]. Pricing changes regularly; the platform pricing page is the source of truth.
| Model | Release | Context | Input | Cached input | Output | Notes |
|---|---|---|---|---|---|---|
| GPT-5.5 Pro | April 2026 | 1M+ | $30.00 | varies | $180.00 | Maximum capability variant |
| GPT-5.5 | April 2026 | 1M+ (922K in / 128K out) | $5.00 | $0.50 | $30.00 | New frontier; 2x/1.5x for >272K input |
| GPT-5.4 Pro | March 2026 | 1.05M | premium | premium | premium | Computer-use native, top tier |
| GPT-5.4 | March 2026 | 1.05M | $2.50 | $0.25 | $15.00 | Most token-efficient reasoning |
| GPT-5.4 mini | March 2026 | 400K | $0.75 | $0.075 | $4.50 | Lower-latency, lower-cost |
| GPT-5.4 nano | March 2026 | 400K | $0.20 | $0.02 | $1.25 | Budget tier |
| GPT-5.2 | December 2025 | 400K | $1.75 | $0.175 | $14.00 | Professional knowledge work |
| GPT-5.1 | November 2025 | 400K | $1.25 | $0.125 | $10.00 | Faster routing variant |
| GPT-5 | August 2025 | 400K | $1.25 | $0.125 | $10.00 | Unified base + thinking architecture |
| GPT-5 mini | August 2025 | 400K | $0.25 | $0.025 | $2.00 | High-volume budget |
| GPT-5 nano | August 2025 | 400K | $0.05 | $0.005 | $0.40 | Smallest GPT-5 variant |
| GPT-4.5 | February 2025 | 128K | $75.00 | $37.50 | $150.00 | Research preview, deprecated July 2025 |
| GPT-4.1 | April 2025 | 1M | $2.00 | $0.50 | $8.00 | Coding-optimized |
| GPT-4.1 mini | April 2025 | 1M | $0.40 | $0.10 | $1.60 | Fast, affordable |
| GPT-4.1 nano | April 2025 | 1M | $0.10 | $0.025 | $0.40 | Lowest-cost GPT-4.1 |
| GPT-4o | May 2024 | 128K | $2.50 | $1.25 | $10.00 | Multimodal flagship of its era |
| GPT-4o mini | July 2024 | 128K | $0.15 | $0.075 | $0.60 | High-volume budget option |
The o-series models are trained with reinforcement learning to perform internal chain-of-thought reasoning before generating responses. They produce reasoning tokens that count against the output-token bill but are not visible in the response by default [22]. With the GPT-5 family rolling reasoning into the unified router, the o-series is now positioned as the specialist track for hard math, scientific reasoning, and reinforcement fine-tuning targets.
| Model | Release | Context | Input | Cached input | Output | Notes |
|---|---|---|---|---|---|---|
| o3 | April 2025 | 200K | $10.00 | $2.50 | $40.00 | General reasoning flagship |
| o3-mini | January 2025 | 200K | $1.10 | $0.55 | $4.40 | Fast STEM reasoning |
| o4-mini | April 2025 | 200K | $1.10 | $0.275 | $4.40 | "Thinks with images," RFT target |
| o1 | December 2024 | 200K | $15.00 | $7.50 | $60.00 | First production reasoning model |
| o1-mini | September 2024 | 128K | $1.10 | $0.55 | $4.40 | Deprecated for o3-mini |
The Codex line, revived in 2025 as an agentic coding family, is a separate snapshot lineage from the main GPT-5 family. The current production model is gpt-5.3-codex; gpt-5.2-codex and gpt-5.1-codex are still available as snapshot pins [23]. Codex models are tuned for long-running agent loops and tool use, and they support a xhigh reasoning effort setting alongside the standard low/medium/high.
| Model | Release | Status | Notes |
|---|---|---|---|
| gpt-5.3-codex | March 2026 | Recommended | Most capable agentic coding model |
| gpt-5.2-codex | December 2025 | Available | Long-running coding agents |
| gpt-5.1-codex | November 2025 | Available | Earlier snapshot |
| gpt-5-codex | August 2025 | Available | Original GPT-5-based variant |
| codex-1 | May 2025 | Legacy | Original 2025 Codex; o3-derived |
| Model | Purpose | Notes |
|---|---|---|
| gpt-image-1.5 | Image generation | Successor to gpt-image-1, faster |
| gpt-image-1 | Image generation, editing | Native multimodal, strong text rendering |
| DALL-E 3 | Image generation | Prompt-faithful, fixed sizes |
| Whisper-1 | Speech-to-text | Cheap batch transcription |
| gpt-4o-transcribe | Speech-to-text | Lower WER than Whisper, streaming |
| gpt-4o-mini-transcribe | Speech-to-text | Cost-efficient streaming transcription |
| gpt-4o-transcribe-diarize | Speech-to-text + diarization | Speaker-aware transcripts |
| TTS-1 / TTS-1-HD | Text-to-speech | Lower-cost / higher-quality voices |
| gpt-4o-mini-tts | Text-to-speech | Prompt-controllable voice direction |
| gpt-realtime | Realtime audio | Production speech-in/speech-out |
| gpt-realtime-mini | Realtime audio | Cost-sensitive variant |
| Model | Purpose | Input price (per 1M tokens) | Default dimensions |
|---|---|---|---|
| text-embedding-3-large | High-quality embeddings | $0.13 | 3,072 (configurable) |
| text-embedding-3-small | Cost-efficient embeddings | $0.02 | 1,536 (configurable) |
| text-embedding-ada-002 | Legacy embeddings | $0.10 | 1,536 |
| omni-moderation-latest | Text + image moderation | Free | n/a |
| text-moderation-latest | Text-only moderation | Free | n/a |
In 2025 OpenAI released gpt-oss-120b and gpt-oss-20b under Apache 2.0. These are not hosted on the platform, but they appear on the model index for parity and can be served on third party infrastructure. Developers who want OpenAI quality with full self-hosting now have an option that did not exist for most of the API's history.
The Responses API ships with hosted tools that the model can call without any developer-side wiring. Each one runs server-side, returns its output back into the model's context, and is billed per use [10][24].
File search is a managed retrieval-augmented generation tool over uploaded vector stores. The developer creates a vector store, attaches files (PDFs, Markdown, code, JSON, images with OCR), and passes the vector store ID into the tools array. The model decides when retrieval would help, calls the tool with a query, and gets back ranked chunks plus citations. File search handles chunking, embedding, hybrid retrieval, and reranking, removing most of the boilerplate early RAG systems had to write [25].
The web search tool lets the model fetch live results from the public internet and ground its answer in cited sources. It is automatically invoked when the model judges that current information is needed, and the response includes inline source URLs the developer can render as citations. Web search supports geographic targeting and freshness filters.
The code interpreter tool runs Python in a sandboxed container. The container has filesystem access scoped to the response, can install most common scientific Python packages, and can read or generate files (CSVs, plots, PDFs). This is the same engine ChatGPT exposes for data analysis, available as a tool any Responses API call can invoke [26]. Code interpreter is the easiest path to numerical reasoning that is actually correct, because the model offloads arithmetic and table manipulation to the Python runtime instead of trying to do it in tokens.
The computer use tool, originally exposed through the Operator consumer product, lets the model drive a virtual desktop or browser by emitting screenshots-and-mouse-clicks instructions [27]. The tool runs on a Computer-Using Agent (CUA) model that combines GPT-4o vision with reinforcement-learned UI understanding, and it is available as a research preview in the Responses API for usage tier 3-5 customers. Pricing is $3 per million input tokens and $12 per million output tokens. The developer is responsible for the virtual machine; OpenAI ships a reference sample app that uses Docker, Browserbase, or Anchor.
gpt-image-1 can be invoked as a tool from inside a Responses call, which means an agent can produce text, search the web, run Python, and generate images in a single conversation without the developer wiring up four separate APIs. The image generation tool returns image IDs and URLs that flow back into the conversation history.
The Responses API supports Model Context Protocol servers as remote tools. The developer registers an MCP server URL with optional authentication, the API discovers its tool schema on first use (cached as an mcp_list_tools item), and the model calls those tools transparently. This is how OpenAI is connecting the API to the broader MCP ecosystem that Anthropic introduced in late 2024 and that has since been adopted by most major AI vendors. MCP turns every tool author into a third-party any model can discover, which is more or less the long-promised plugin ecosystem the original ChatGPT plugins tried and failed to deliver.
Function calling, introduced in June 2023, lets a developer define functions with names, descriptions, and JSON Schema parameter schemas. The model decides when to call them, generates a JSON object that conforms to the schema, and the developer's code executes the function and returns the result on the next turn [28]. This is the foundation of nearly every modern AI agent, because it gives the model a structured way to request actions in the outside world.
With strict: true set on a function definition, the API guarantees that the model's arguments will be valid against the schema. Strict mode uses constrained decoding to force the model to only sample tokens that produce schema-valid JSON. The trade-off is a small latency hit on the first request with a new schema (because OpenAI compiles the schema into a finite-state machine and caches it) and a few unsupported schema features (like arbitrary regex on strings).
By default, models can emit multiple tool calls in a single turn. The API returns them as an array, and the developer's code can run them in parallel before sending all results back together. This is materially faster than serial tool use.
The tool_choice parameter accepts auto (default), none, required (the model must call some tool), or a specific function name. Combined with strict: true, this is how developers reliably extract structured data: pass a single tool definition with tool_choice set to require it, and the model returns schema-valid arguments rather than free text. Common patterns include tools as routers (each tool corresponds to a different downstream pipeline) and tools as confirmations (the model emits a tool call describing what it wants to do, and a human or downstream service approves before execution).
Structured Outputs, launched in August 2024, guarantees that model responses conform exactly to a provided JSON Schema [29]. This goes beyond JSON mode (which only ensured syntactically valid JSON) by enforcing strict schema adherence. The developer enables it by setting strict: true in a function definition or by passing response_format: { type: "json_schema", strict: true, schema: {...} } for non-tool responses.
On complex schema-following evals, gpt-4o-2024-08-06 with Structured Outputs scored a perfect 100% versus less than 40% with prompt-only schema instructions. The implementation uses a context-free grammar derived from the schema, which constrains decoding so only schema-valid tokens can be sampled.
Both Python and Node SDKs accept native typed objects: a Pydantic model in Python or a Zod schema in TypeScript becomes the JSON schema with no extra serialization step. The SDK parses the output back into a typed object, so the developer never touches raw JSON [29].
Limitations: schemas must be a subset of JSON Schema (no $ref to external URLs, no regex patterns, limited combinator support), additionalProperties: false is required on every object, the first call with a new schema pays a one-time compilation cost, and Structured Outputs interacts with tool_choice: required in subtle ways worth checking the cookbook on.
The Realtime API is the audio-native interface for low-latency voice agents, originally launched in October 2024 over WebSocket and expanded with WebRTC in early 2025 and SIP in mid-2025 [12]. As of May 2026, the production model is gpt-realtime, with gpt-realtime-mini for cost and gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper shipping alongside.
The transport choice is essentially the architecture of the application:
Picking the wrong transport forces a lot of latency and reliability work later, so it is worth thinking about early.
Function calling and most Responses-style tools work in Realtime sessions, including web search and file search. The session protocol exposes events for response.audio.delta, response.text.delta, tool_call.created, and several others.
Barge-in (where the user starts speaking while the assistant is still talking) is built into the protocol. The client cancels the in-flight assistant audio with an input_audio_buffer.speech_started event and the server stops generating. The model can also detect end-of-turn from VAD signals if the client wants the server to manage turn-taking.
Voices include alloy, echo, fable, onyx, nova, and shimmer at launch, with several added since. Languages cover most major spoken languages with quality biased toward English. November 2025 added DTMF key-press support so voice agents can navigate IVR menus on outbound calls.
Audio is billed in tokens: roughly 100 tokens per second of input, 200 tokens per second of output. At gpt-realtime list price ($32 per million input audio tokens, $64 per million output, $0.40 per million cached input), a typical voice call runs about $0.30 per minute. Cached input is the lever that matters most: a system prompt repeated on every turn can drop from $32 to $0.40 per million tokens, an 80x discount.
The Assistants API is the predecessor to Responses [13]. It introduced higher-level concepts (assistants, threads, runs, messages) that many developers found friendlier than raw Chat Completions, and it included built-in code interpreter and file search tools. In August 2025, OpenAI announced that the Assistants API would be fully removed on August 26, 2026. The migration path to the Responses API is documented in the official guide [11]:
| Assistants concept | Responses replacement |
|---|---|
| Assistant object (model + instructions + tools) | Server-side prompt object created in dashboard, or inline instructions + tools in each Responses call |
| Thread (server-side message store) | Conversation object via /v1/conversations, or previous_response_id chain |
| Run (model invocation against thread) | Responses request |
| Message (item in thread) | input items, output items |
| Tool (built-in code interpreter, file search) | Same tools, available natively in Responses |
Most migrations are mechanical. The bigger change is conceptual: assistants were stateful objects with versioning, while Responses encourages a more functional, request-by-request style with state encoded in conversation IDs or encrypted reasoning. Teams that wrote a lot of assistant configuration code tend to rewrite that as either dashboard-managed prompts or static configuration in their own code.
The Agents SDK, released alongside the Responses API in March 2025, is OpenAI's open source orchestration framework for multi-agent workflows [30]. It exists in Python and TypeScript flavors and provides a small number of primitives (Agent, Tool, Handoff, Guardrail, Trace) that compose into systems ranging from a single tool-using agent to a full sub-agent organization.
The core abstraction is the Agent, which is essentially a configured Responses call with instructions and tools. Two orchestration patterns dominate:
The SDK includes built-in tracing, guardrails (input and output validation that runs before and after the model call), and structured outputs integration. Integration with Temporal and other workflow engines exists for production deployments where durability matters.
A related framework, the AgentKit toolkit, ships a visual Agent Builder and an embeddable ChatKit interface. Together with the Agents SDK they form the platform's answer to agent frameworks like LangGraph and CrewAI that emerged outside OpenAI in 2023-2024.
The fine-tuning endpoints let developers customize models on their own data using three different techniques [18]. All three are exposed through the same /v1/fine_tuning/jobs endpoint with different method values.
The original technique. The developer uploads JSONL data where each row is a Chat Completions-shaped conversation, and the model is trained to imitate the assistant turns. SFT is supported on GPT-4o, GPT-4o mini, GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, and several earlier base models. SFT is the right choice when the developer has a clear gold standard and several hundred to several thousand examples.
DPO came to the API in 2024 and is available across the GPT-4.1 family (gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14) [31]. Where SFT trains on a single correct response, DPO trains on pairwise comparisons (a preferred response and a rejected one) and learns to favor the patterns in the preferred examples. This is the right technique when the developer can rank outputs more easily than write a single ideal one, which fits most subjective tasks (tone, style, structure).
RFT, generally available since May 2025, is the first fine-tuning method that updates a reasoning model. It uses a developer-supplied grader (a programmable function that scores candidate responses) and runs reinforcement learning over chain-of-thought rollouts to push the model toward higher-scoring outputs [32]. RFT is available on o4-mini and other o-series snapshots. Costs run around $100 per hour on o4-mini, which makes RFT 100-700x more expensive than SFT, but it is the only technique that can teach a reasoning model new domain-specific evaluation criteria. Classic use cases: turning instructions into working code, pulling facts into a clean format, and applying complex rule sets correctly. The grader can be Python code, an OpenAI-hosted reference model, or a hybrid.
| Technique | Best when | Data format | Cost | Supported models |
|---|---|---|---|---|
| SFT | Single gold response per input | Chat conversations (JSONL) | Low | GPT-4o, GPT-4o mini, GPT-4.1 family, others |
| DPO | Easier to rank than to write | Pairs of preferred/rejected responses | Low-medium | GPT-4.1 family |
| RFT | Complex reasoning with measurable rubric | Inputs + grader function | High (~$100/hr) | o4-mini, other o-series |
The Batch API at /v1/batches lets developers submit large volumes of requests for asynchronous processing at a 50% discount on both input and output tokens [19][33]. Batches are guaranteed to complete within 24 hours, though most finish in 1 to 6 hours depending on size and current load.
The workflow is simple: upload a JSONL file of requests through the Files API, create a batch referencing that file ID, poll the batch status, and download the JSONL output file. Each request carries its own custom ID, so the response file can be joined back to the input on the developer side.
Most endpoints are batch-eligible: Chat Completions, Responses, embeddings, and images. Vision inputs work. Streaming does not, since the whole point of batching is async processing. The Realtime API is not batchable. Common patterns include classifying or labeling large datasets, generating embeddings for search indexes, bulk content generation or summarization, evaluation pipelines, and data enrichment for analytics. A 1 million request job that would cost $1,000 synchronously runs $500 in a batch.
The Batch API runs on a separate rate limit pool from synchronous calls, so a batch backfill does not eat into the rate limits of the production traffic. Some models also accept larger contexts in batch mode than they do synchronously, which makes the Batch API the only way to process documents at the upper bound of certain context windows.
OpenAI exposes several service tiers that change the cost-latency-reliability trade-off [34][35].
| Tier | Cost vs standard | Latency | Best for |
|---|---|---|---|
| Priority | ~2.5x | Lowest, most consistent | User-facing apps where latency matters |
| Standard | Baseline | Standard | General production |
| Flex | 50% off | Slower, occasional 429s | Evaluations, data enrichment, async jobs |
| Batch | 50% off | Up to 24h | Bulk async workloads |
| Scale | Negotiated | Reserved capacity | Enterprise with predictable demand |
Priority and Flex are selected per request via the service_tier parameter. Scale Tier is a contractual commitment for very large customers where capacity is reserved at fixed daily rates and throughput is guaranteed regardless of broader platform load.
Authentication uses API keys passed in the Authorization header as Bearer tokens [2]. Modern keys use the sk-proj- prefix, scoping them to a specific project within an organization, which is more secure than the older organization-wide keys. Project keys can only access models, files, and resources tied to their project; organization keys can access anything in the organization. New projects almost always want project keys, because the blast radius of a leaked key is much smaller.
The organization and project structure also drives billing and rate limits. Usage rolls up to the organization, but each project can have its own spending limits, model allowlists, and member roles. Larger teams typically run one project per environment (dev, staging, prod) plus separate projects for shared services.
Best practices for key management:
OPENAI_API_KEY automatically; use .env files or a secrets manager.Users who belong to multiple organizations can specify which one to bill by passing an OpenAI-Organization header [36]. The OpenAI-Project header overrides the project implied by the key.
OpenAI applies rate limits at the organization level based on usage tiers. As cumulative spending increases, organizations automatically graduate to higher tiers with expanded limits [36][37]. Rate limits are measured across RPM (requests per minute), RPD (per day), TPM (tokens per minute), TPD (per day), IPM (images per minute), and AMM (audio minutes per minute, for streaming audio).
| Tier | Qualification | Indicative TPM (GPT-5 family) |
|---|---|---|
| Free | Default for new accounts | Limited access |
| Tier 1 | $5+ paid | 500,000 TPM |
| Tier 2 | $50+ paid, 7+ days since first payment | 1,000,000 TPM |
| Tier 3 | $100+ paid, 7+ days since first payment | 2,000,000 TPM |
| Tier 4 | $250+ paid, 14+ days since first payment | 4,000,000 TPM |
| Tier 5 | $1,000+ paid, 30+ days since first payment | Up to 40,000,000 TPM |
Rate limit information returns in HTTP response headers (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, x-ratelimit-reset-requests), which lets applications throttle proactively rather than retrying blindly. The Retry-After header on 429 responses tells clients how long to wait. Some features unlock at higher tiers: the computer use tool is currently restricted to tiers 3-5, and Reinforcement Fine-Tuning requires manual approval.
OpenAI maintains official SDKs for several languages [38], with community libraries covering most of what is missing.
| Language | Package | Status | Notes |
|---|---|---|---|
| Python | openai (pip) | Stable, official | Sync/async, streaming, Pydantic, auto-pagination |
| Node / TypeScript | openai (npm) | Stable, official | TypeScript-first, streaming, Zod integration |
| .NET | OpenAI (NuGet) | Stable, official (with Microsoft) | .NET Standard 2.0, IAsyncEnumerable streaming |
| Java | openai-java (Maven) | Beta, official | Requires Java 8+, current 4.34.0 |
| Go | openai-go (module) | Beta, official | Idiomatic Go interfaces |
| Ruby | ruby-openai (gem) | Community | Widely used despite no official version |
| PHP | openai-php/client | Community | Maintained by community |
| Rust | async-openai | Community | Tokio-based |
All official SDKs auto-detect OPENAI_API_KEY, provide typed request and response objects, retry with exponential backoff on transient failures, and support streaming through language-appropriate patterns [38].
The Agents SDK is a separate package on top of the base SDK, available in Python (openai-agents-python) and TypeScript (openai-agents-js). New harness features (configurable memory, sandbox-aware orchestration, filesystem tools) tend to land in Python first and follow in TypeScript.
The OpenAI API uses a per-token pricing model with separate rates for input and output tokens [20]. A token is a chunk of text processed by the model's tokenizer, roughly four characters or 0.75 words in English. Input tokens include the system prompt, user messages, conversation history, and tool definitions. Output tokens include the model's response text and, for reasoning models, internal reasoning tokens.
Output tokens are generally priced 2x to 8x higher than input tokens, reflecting the greater compute cost of generation versus comprehension. For long-context models, prompts above a threshold (272K tokens for GPT-5.5) are billed at a higher per-token rate.
Prompt caching automatically reduces costs for repeated input prefixes. When the API detects that the beginning of a prompt matches a recent request, it reads from cache instead of reprocessing those tokens [20]. Cache discounts vary by model family:
| Model family | Cache discount |
|---|---|
| GPT-5 series, GPT-5.5 | 90% off cached input tokens |
| GPT-4.1 series | 75% off cached input tokens |
| GPT-4o / o-series | 50% off cached input tokens |
| gpt-realtime | ~98% off cached input tokens |
Caching is automatic but only triggers for prefixes that match within a recent window (extended in 2026 to up to 24 hours via extended prompt caching). Keep the prefix stable: put the system prompt, tool definitions, and persistent context at the front, and per-request data at the end.
| Strategy | Savings | Trade-off |
|---|---|---|
| Batch API | 50% on all tokens | Up to 24-hour latency |
| Flex processing | 50% on all tokens | Slower, occasional 429s |
| Prompt caching | 50-90% on repeated inputs | Requires consistent prefixes |
| Smaller models (mini/nano) | 80-95% vs flagship | Lower capability on hard tasks |
| Fine-tuning | Reduced prompt length | Upfront training cost |
| Structured Outputs | Fewer retries | Slightly constrained format |
| Predicted Outputs | Reduced output latency | Only useful for partial-edit tasks |
OpenAI's compliance posture matters for any team handling regulated data. The platform meets several enterprise standards [39][40].
Qualifying organizations can request a Zero Data Retention (ZDR) configuration, where the API processes requests without storing any content. Once the response is returned, OpenAI permanently deletes all request data from its systems. ZDR is required for HIPAA compliance and is common for financial services and other regulated industries. Some features (Responses API store=true, Assistants API threads, fine-tuning training data persistence) are not available under ZDR.
Eligible enterprise customers can store sensitive customer content at rest in the United States, European Union, United Kingdom, Japan, Canada, South Korea, Singapore, Australia, India, and the United Arab Emirates [40]. This applies to ChatGPT Enterprise, ChatGPT Edu, ChatGPT for Healthcare, and the API platform.
API traffic is not used to train OpenAI's models by default. The original Chat Completions defaults changed in 2023 to opt-out from training, and the same applies to Responses, Assistants, Realtime, and every other modern endpoint. Free ChatGPT and consumer products have different defaults, but the API is unambiguously opt-out unless the customer explicitly enrolls in a data sharing program.
In the default configuration, OpenAI retains API request and response data for 30 days for abuse monitoring before automatic deletion. ZDR turns this off entirely. The API also publishes audit logs for security-relevant events through the Admin API, which enterprises plug into their SIEM.
Microsoft hosts OpenAI's models on Azure under a separate brand and API surface called Azure OpenAI Service [41]. The endpoint shape is similar but not identical: Azure scopes deployments per region, requires per-deployment names rather than per-model names, and ships features on a slightly delayed timeline.
What is the same: the model weights (GPT-5 on Azure is GPT-5 from OpenAI), most wire formats (the Python SDK has an Azure mode that handles the differences with a small configuration change), and most core features (function calling, structured outputs, vision, embeddings, fine-tuning).
What is different: region availability is per-model and per-feature (Sora launched in Sweden Central and East US 2 first; gpt-realtime launched on Azure Foundry Direct Models in early 2026), and some features arrive on Azure later (the Responses API took several months; computer use was OpenAI-only at first). Azure offers data zone deployments (US-only or EU-only) and global deployments (Microsoft routes to whichever region has capacity). Quotas are per-region, per-subscription, per-model. Pricing is generally similar; Azure offers Provisioned Throughput Units (PTUs) for reserved capacity that are not available on the OpenAI platform directly.
Azure makes sense when the organization is already on Azure, needs PTUs for reserved capacity, or has data residency requirements that Azure satisfies more cleanly. The OpenAI platform makes sense when the organization wants the latest features as soon as they ship or runs across multiple clouds.
As of May 2026, the OpenAI API is the most widely deployed commercial AI API, serving over one million organizations worldwide [1]. The April 2026 release of GPT-5.5 brought the first model where the long-context regime (above 272K tokens) is priced as a deliberate tier rather than an exception, which suggests OpenAI now sees million-token contexts as a routine workload rather than a curiosity. The March 2026 release of GPT-5.4 introduced native computer-use as a first-class capability and pushed the context window past one million tokens for most variants.
Key trends shaping the API in early 2026:
The API evolves at roughly quarterly cadence on flagship models, with smaller updates (snapshot pins, new tools, pricing tweaks) shipping nearly every month. Whether the agent paradigm is the long-term mental model or just the current one is an open question; the platform's design as of mid-2026 is clearly betting it is.