OpenAI API

The OpenAI API is a REST-based application programming interface that gives developers programmatic access to OpenAI's family of artificial intelligence models, including the GPT series of large language models, the o-series reasoning models, Codex coding models, DALL-E and gpt-image-1 image generators, Whisper and gpt-4o-transcribe speech recognition, and the gpt-realtime audio agent stack. First launched in June 2020 as a private beta alongside GPT-3, the API has grown into one of the most widely used AI developer platforms in the world, serving over one million organizations as of December 2025 ^[1].

The API follows a pay-per-use pricing model based on tokens (roughly four characters of English text), with separate rates for input and output tokens, plus deep discounts for cached input and asynchronous batch jobs. Developers authenticate using API keys scoped to projects within organizations, and they can access models through HTTP endpoints or official SDKs for Python, Node.js, .NET, Java, and Go ^[2]. As of May 2026, the primary endpoint is the Responses API, an agent-friendly successor to Chat Completions that bundles built-in tools (web search, file search, code interpreter, computer use, image generation, and remote Model Context Protocol servers) into a single stateful interface.

History

GPT-3 API beta (June 2020)

OpenAI opened its API to developers as a private beta in June 2020, coinciding with the release of GPT-3. This marked the company's first commercial product, positioning GPT-3's 175 billion parameters as a general-purpose text engine available over HTTP ^[3]. The initial API offered text completions through a single endpoint, with pricing based on four model tiers (Ada, Babbage, Curie, and Davinci) that traded off capability for cost. Davinci, the most capable tier, was priced at $0.06 per 1,000 tokens, or roughly $60 per million tokens.

The beta attracted thousands of developers and spawned early AI-powered applications in copywriting, customer support, and code generation. Microsoft secured an exclusive license to GPT-3's underlying technology in September 2020, though the API itself remained accessible to all approved developers ^[4]. By the end of 2021, OpenAI had also released Codex through the API, which would go on to power the original GitHub Copilot.

ChatGPT API and GPT-3.5 Turbo (March 2023)

On March 1, 2023, OpenAI released the gpt-3.5-turbo model through the API, introducing the Chat Completions endpoint that would become the standard interface for conversational AI ^[5]. This model was priced at $0.002 per 1,000 tokens, a 10x reduction from GPT-3 Davinci, while delivering superior instruction-following and dialogue quality. The launch coincided with an explosion of developer interest following ChatGPT's viral success in late 2022.

The Chat Completions format introduced the now-standard message-based structure with system, user, and assistant roles, replacing the older text-in/text-out completions paradigm. This format has since been adopted across the industry by Anthropic, Google, and others.

GPT-4 and expanding capabilities (2023-2024)

GPT-4 arrived in the API on March 14, 2023, bringing multimodal capabilities (text and image input) and significantly improved reasoning ^[6]. Throughout 2023 and 2024, OpenAI rapidly expanded the API's feature set:

Date	Feature
June 2023	Function calling, enabling models to generate structured JSON for tool use
November 2023	GPT-4 Turbo with 128K context, JSON mode, and the Assistants API beta
April 2024	Batch API at 50% off list price with a 24-hour SLA
May 2024	GPT-4o with native multimodal processing at half the cost of GPT-4 Turbo
July 2024	GPT-4o mini at $0.15 per million input tokens
August 2024	Structured Outputs guaranteeing schema-compliant JSON
September 2024	o1 reasoning models with chain-of-thought capabilities
October 2024	Realtime API beta over WebSocket for voice agents
December 2024	o1 GA, function calling, structured outputs, and developer messages

Responses API and the agent era (2025-2026)

In March 2025, OpenAI introduced the Responses API as the successor to both the Chat Completions API and the Assistants API ^[7]. The Responses API functions as a superset of Chat Completions, adding built-in tools (web search, file search, code interpreter, computer use, image generation), server-side conversation state, and a more flexible input format for multimodal data. The same announcement introduced the Agents SDK, an open source orchestration framework for multi-agent workflows. The Assistants API was deprecated in August 2025 with a sunset date of August 26, 2026 ^[8]. WebRTC support arrived in the Realtime API in early 2025, followed by the production gpt-realtime model in late 2025. Reinforcement Fine-Tuning (RFT) became generally available in May 2025 for o-series reasoning models, and DPO (Direct Preference Optimization) followed for the GPT-4.1 family.

Major model releases continued at a rapid cadence: GPT-4.1 in April 2025, GPT-5 in August 2025, GPT-5.1 in November 2025, GPT-5.2 in December 2025, GPT-5.4 in March 2026, and GPT-5.5 in April 2026. Built-in image generation via gpt-image-1 launched in April 2025, and the computer use tool, originally exposed through the Operator consumer product, became available as a Responses API tool for tier 3-5 customers later that year.

Endpoints

The OpenAI API exposes several endpoint families. Some are recommended for new work, others remain available primarily for backward compatibility ^[2]^[9]. The table below summarizes the current state.

Endpoint	Path	Status (May 2026)	Primary use
Responses	`/v1/responses`	Recommended for all new work	Stateful, agentic, multi-tool generation
Conversations	`/v1/conversations`	GA, paired with Responses	Persistent multi-turn state
Chat Completions	`/v1/chat/completions`	Maintained, no new agent features	Wire-compatible legacy interface
Completions	`/v1/completions`	Legacy, text-only	Older base models, fine-tuned bases
Realtime	`wss://api.openai.com/v1/realtime`	GA over WebSocket, WebRTC, and SIP	Speech-in/speech-out agents
Assistants v2	`/v1/assistants`, `/v1/threads`	Deprecated, sunsets Aug 26 2026	Legacy assistant projects
Embeddings	`/v1/embeddings`	GA	Vector representations for retrieval
Images	`/v1/images/generations`, `/edits`, `/variations`	GA	DALL-E 3 and gpt-image-1
Audio Speech	`/v1/audio/speech`	GA	TTS-1, TTS-1-HD, gpt-4o-mini-tts
Audio Transcriptions	`/v1/audio/transcriptions`	GA	Whisper-1, gpt-4o-transcribe family
Audio Translations	`/v1/audio/translations`	GA	Translate audio to English text
Files	`/v1/files`	GA	Upload data for fine-tuning, batch, file search
Vector stores	`/v1/vector_stores`	GA	Hosted vector database for file search
Fine-tuning	`/v1/fine_tuning/jobs`	GA, supports SFT, DPO, RFT	Custom model training
Batch	`/v1/batches`	GA, 50% discount, 24h SLA	High-throughput async jobs
Moderations	`/v1/moderations`	GA, free	Safety classification
Models	`/v1/models`	GA	List and inspect available models

Responses API

The Responses API at /v1/responses is the most important addition to the platform since Chat Completions itself. It combines text generation, tool use, and built-in services like web search and code interpreter into a single request, and it stores reasoning state server-side so subsequent turns can reuse it ^[10]. OpenAI reports that GPT-5 served through the Responses API scores about 3% higher on SWE-bench than the same model served through Chat Completions, primarily because the Responses API can keep the model's hidden reasoning chain across tool calls instead of throwing it away every turn. Cache utilization improves 40-80% relative to Chat Completions on long sessions for the same reason.

A minimal request looks like a Chat Completions call with messages replaced by an input field that accepts strings, message arrays, image parts, or file parts, plus a tools array mixing function definitions with hosted tools. The response includes an output array of typed items (message, tool_call, reasoning, mcp_list_tools), which is easier to parse than the older choices/messages structure.

State management follows two patterns. With store: true (the default), OpenAI persists the response object for 30 days, and a follow-up call can pass previous_response_id to chain turns without resending the full history. For organizations on Zero Data Retention, or anyone who prefers stateless deployments, the API returns encrypted reasoning items that the client passes back on the next turn to preserve thinking without storing it on OpenAI's servers ^[10]. The Responses API also exposes a phase field on streamed messages (introduced in early 2026) that labels assistant text as either commentary or final_answer. A WebSocket variant arrived in February 2026 for long-running streams.

Chat Completions

Chat Completions remains supported and continues to receive new model snapshots, but it does not get new agent features. Built-in tools, the computer use tool, image generation as a tool, and MCP connectors are Responses-only. Function calling, vision, structured outputs, streaming, and reasoning effort all work in both endpoints, so existing Chat Completions code keeps running.

OpenAI's official guidance is that new projects should use Responses, and existing Chat Completions projects can migrate when they need an agent feature. The migration path is deliberately cheap: most Chat Completions code converts to Responses by changing the URL and renaming messages to input ^[11].

Completions (legacy)

The original /v1/completions endpoint still works for the handful of base models that predate the chat format and for fine-tuned models built on those bases. New flagship models do not ship with a base completion endpoint. Most developers will never use this path.

Realtime API

The Realtime API is the audio-native interface, designed for low-latency voice agents. It speaks WebSocket, WebRTC, or SIP (for telephony), and it carries audio in both directions plus text, transcripts, and tool calls on the same channel ^[12]. The dedicated Realtime API section below covers transports, voices, function calling, and pricing in detail.

Assistants API v2 (deprecated)

The Assistants API, introduced at OpenAI's DevDay in November 2023, provided a higher-level abstraction for building AI assistants with persistent threads, automatic context management, and built-in tools ^[13]. The v2 release in April 2024 replaced the original retrieval tool with the file search tool backed by hosted vector stores. In August 2025, OpenAI deprecated the Assistants API in favor of the Responses API. The Assistants API will be removed on August 26, 2026 ^[8]. The Assistants API v2 and migration section below covers the migration path.

Embeddings

The /v1/embeddings endpoint converts text into numerical vector representations useful for semantic search, clustering, and retrieval-augmented generation (RAG). The current models, text-embedding-3-small and text-embedding-3-large, are trained with Matryoshka Representation Learning, so a developer can request fewer dimensions (down to 256 or 512) and still get most of the retrieval quality at a fraction of the storage cost ^[14]. On MTEB, text-embedding-3-large truncated to 256 dimensions beats text-embedding-ada-002 at its full 1,536. text-embedding-3-large defaults to 3,072; text-embedding-3-small defaults to 1,536. The legacy ada-002 still works but is not recommended.

Images

The Images API hosts two model families. DALL-E 3 generates images at fixed sizes (1024x1024, 1792x1024, 1024x1792) and quality settings (standard or HD). The newer gpt-image-1, released in April 2025, is a natively multimodal model that accepts both text and image inputs and is much better at rendering text inside images, following style instructions, and producing consistent characters across edits ^[15]. gpt-image-1 supports image generation, editing, and inline use as a Responses tool. A successor, gpt-image-1.5, rolled out alongside GPT Image 2 in early 2026. Images are priced per image with size and quality multipliers.

Audio: speech, transcription, translation

Three audio endpoints handle the major speech tasks. /v1/audio/speech generates spoken audio using TTS-1, TTS-1-HD, or gpt-4o-mini-tts (which supports prompt-based voice direction). /v1/audio/transcriptions converts speech to text using whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize, the last of which produces speaker-aware transcripts via the diarized_json response format ^[16]. /v1/audio/translations translates non-English audio into English text. The gpt-4o-transcribe family has lower word error rates than Whisper and supports streaming transcription. Whisper-1 remains popular for batch transcription because it is cheap and well-understood.

Files, vector stores, moderations, and models

The Files API stores artifacts (training data, batch payloads, RAG documents) for use across the platform. Vector stores are managed retrieval indexes that back the file search tool: a developer uploads files, attaches them to a vector store, and then either calls file search inside a Response or reads the index directly through /v1/vector_stores. Vector stores handle chunking, embedding, and ranking automatically, so most projects avoid running their own vector database ^[17]. The Moderations endpoint classifies text (and with omni-moderation, images) across hate, harassment, sexual, self-harm, and violence categories; it is free. /v1/models lists every model the calling key has access to, with metadata like ownership and creation date. Fine-tuning and Batch are covered in dedicated sections below.

Models

As of May 2026, the OpenAI API offers a wide range of models across several families. The following tables list the primary models with current pricing per million tokens ^[20]^[21]. Pricing changes regularly; the platform pricing page is the source of truth.

Flagship and reasoning models

Model	Release	Context	Input	Cached input	Output	Notes
GPT-5.5 Pro	April 2026	1M+	$30.00	varies	$180.00	Maximum capability variant
GPT-5.5	April 2026	1M+ (922K in / 128K out)	$5.00	$0.50	$30.00	New frontier; 2x/1.5x for >272K input
GPT-5.4 Pro	March 2026	1.05M	premium	premium	premium	Computer-use native, top tier
GPT-5.4	March 2026	1.05M	$2.50	$0.25	$15.00	Most token-efficient reasoning
GPT-5.4 mini	March 2026	400K	$0.75	$0.075	$4.50	Lower-latency, lower-cost
GPT-5.4 nano	March 2026	400K	$0.20	$0.02	$1.25	Budget tier
GPT-5.2	December 2025	400K	$1.75	$0.175	$14.00	Professional knowledge work
GPT-5.1	November 2025	400K	$1.25	$0.125	$10.00	Faster routing variant
GPT-5	August 2025	400K	$1.25	$0.125	$10.00	Unified base + thinking architecture
GPT-5 mini	August 2025	400K	$0.25	$0.025	$2.00	High-volume budget
GPT-5 nano	August 2025	400K	$0.05	$0.005	$0.40	Smallest GPT-5 variant
GPT-4.5	February 2025	128K	$75.00	$37.50	$150.00	Research preview, deprecated July 2025
GPT-4.1	April 2025	1M	$2.00	$0.50	$8.00	Coding-optimized
GPT-4.1 mini	April 2025	1M	$0.40	$0.10	$1.60	Fast, affordable
GPT-4.1 nano	April 2025	1M	$0.10	$0.025	$0.40	Lowest-cost GPT-4.1
GPT-4o	May 2024	128K	$2.50	$1.25	$10.00	Multimodal flagship of its era
GPT-4o mini	July 2024	128K	$0.15	$0.075	$0.60	High-volume budget option

O-series reasoning models

The o-series models are trained with reinforcement learning to perform internal chain-of-thought reasoning before generating responses. They produce reasoning tokens that count against the output-token bill but are not visible in the response by default ^[22]. With the GPT-5 family rolling reasoning into the unified router, the o-series is now positioned as the specialist track for hard math, scientific reasoning, and reinforcement fine-tuning targets.

Model	Release	Context	Input	Cached input	Output	Notes
o3	April 2025	200K	$10.00	$2.50	$40.00	General reasoning flagship
o3-mini	January 2025	200K	$1.10	$0.55	$4.40	Fast STEM reasoning
o4-mini	April 2025	200K	$1.10	$0.275	$4.40	"Thinks with images," RFT target
o1	December 2024	200K	$15.00	$7.50	$60.00	First production reasoning model
o1-mini	September 2024	128K	$1.10	$0.55	$4.40	Deprecated for o3-mini

Codex models

The Codex line, revived in 2025 as an agentic coding family, is a separate snapshot lineage from the main GPT-5 family. The current production model is gpt-5.3-codex; gpt-5.2-codex and gpt-5.1-codex are still available as snapshot pins ^[23]. Codex models are tuned for long-running agent loops and tool use, and they support a xhigh reasoning effort setting alongside the standard low/medium/high.

Model	Release	Status	Notes
gpt-5.3-codex	March 2026	Recommended	Most capable agentic coding model
gpt-5.2-codex	December 2025	Available	Long-running coding agents
gpt-5.1-codex	November 2025	Available	Earlier snapshot
gpt-5-codex	August 2025	Available	Original GPT-5-based variant
codex-1	May 2025	Legacy	Original 2025 Codex; o3-derived

Image and audio models

Model	Purpose	Notes
gpt-image-1.5	Image generation	Successor to gpt-image-1, faster
gpt-image-1	Image generation, editing	Native multimodal, strong text rendering
DALL-E 3	Image generation	Prompt-faithful, fixed sizes
Whisper-1	Speech-to-text	Cheap batch transcription
gpt-4o-transcribe	Speech-to-text	Lower WER than Whisper, streaming
gpt-4o-mini-transcribe	Speech-to-text	Cost-efficient streaming transcription
gpt-4o-transcribe-diarize	Speech-to-text + diarization	Speaker-aware transcripts
TTS-1 / TTS-1-HD	Text-to-speech	Lower-cost / higher-quality voices
gpt-4o-mini-tts	Text-to-speech	Prompt-controllable voice direction
gpt-realtime	Realtime audio	Production speech-in/speech-out
gpt-realtime-mini	Realtime audio	Cost-sensitive variant

Embedding and moderation models

Model	Purpose	Input price (per 1M tokens)	Default dimensions
text-embedding-3-large	High-quality embeddings	$0.13	3,072 (configurable)
text-embedding-3-small	Cost-efficient embeddings	$0.02	1,536 (configurable)
text-embedding-ada-002	Legacy embeddings	$0.10	1,536
omni-moderation-latest	Text + image moderation	Free	n/a
text-moderation-latest	Text-only moderation	Free	n/a

Open-weight models

In 2025 OpenAI released gpt-oss-120b and gpt-oss-20b under Apache 2.0. These are not hosted on the platform, but they appear on the model index for parity and can be served on third party infrastructure. Developers who want OpenAI quality with full self-hosting now have an option that did not exist for most of the API's history.

Built-in tools

The Responses API ships with hosted tools that the model can call without any developer-side wiring. Each one runs server-side, returns its output back into the model's context, and is billed per use ^[10]^[24].

File search

File search is a managed retrieval-augmented generation tool over uploaded vector stores. The developer creates a vector store, attaches files (PDFs, Markdown, code, JSON, images with OCR), and passes the vector store ID into the tools array. The model decides when retrieval would help, calls the tool with a query, and gets back ranked chunks plus citations. File search handles chunking, embedding, hybrid retrieval, and reranking, removing most of the boilerplate early RAG systems had to write ^[25].

Web search

The web search tool lets the model fetch live results from the public internet and ground its answer in cited sources. It is automatically invoked when the model judges that current information is needed, and the response includes inline source URLs the developer can render as citations. Web search supports geographic targeting and freshness filters.

Code interpreter

The code interpreter tool runs Python in a sandboxed container. The container has filesystem access scoped to the response, can install most common scientific Python packages, and can read or generate files (CSVs, plots, PDFs). This is the same engine ChatGPT exposes for data analysis, available as a tool any Responses API call can invoke ^[26]. Code interpreter is the easiest path to numerical reasoning that is actually correct, because the model offloads arithmetic and table manipulation to the Python runtime instead of trying to do it in tokens.

Computer use

The computer use tool, originally exposed through the Operator consumer product, lets the model drive a virtual desktop or browser by emitting screenshots-and-mouse-clicks instructions ^[27]. The tool runs on a Computer-Using Agent (CUA) model that combines GPT-4o vision with reinforcement-learned UI understanding, and it is available as a research preview in the Responses API for usage tier 3-5 customers. Pricing is $3 per million input tokens and $12 per million output tokens. The developer is responsible for the virtual machine; OpenAI ships a reference sample app that uses Docker, Browserbase, or Anchor.

Image generation as a tool

gpt-image-1 can be invoked as a tool from inside a Responses call, which means an agent can produce text, search the web, run Python, and generate images in a single conversation without the developer wiring up four separate APIs. The image generation tool returns image IDs and URLs that flow back into the conversation history.

Remote MCP servers

The Responses API supports Model Context Protocol servers as remote tools. The developer registers an MCP server URL with optional authentication, the API discovers its tool schema on first use (cached as an mcp_list_tools item), and the model calls those tools transparently. This is how OpenAI is connecting the API to the broader MCP ecosystem that Anthropic introduced in late 2024 and that has since been adopted by most major AI vendors. MCP turns every tool author into a third-party any model can discover, which is more or less the long-promised plugin ecosystem the original ChatGPT plugins tried and failed to deliver.

Function calling and tool use

Function calling, introduced in June 2023, lets a developer define functions with names, descriptions, and JSON Schema parameter schemas. The model decides when to call them, generates a JSON object that conforms to the schema, and the developer's code executes the function and returns the result on the next turn ^[28]. This is the foundation of nearly every modern AI agent, because it gives the model a structured way to request actions in the outside world.

With strict: true set on a function definition, the API guarantees that the model's arguments will be valid against the schema. Strict mode uses constrained decoding to force the model to only sample tokens that produce schema-valid JSON. The trade-off is a small latency hit on the first request with a new schema (because OpenAI compiles the schema into a finite-state machine and caches it) and a few unsupported schema features (like arbitrary regex on strings).

By default, models can emit multiple tool calls in a single turn. The API returns them as an array, and the developer's code can run them in parallel before sending all results back together. This is materially faster than serial tool use.

The tool_choice parameter accepts auto (default), none, required (the model must call some tool), or a specific function name. Combined with strict: true, this is how developers reliably extract structured data: pass a single tool definition with tool_choice set to require it, and the model returns schema-valid arguments rather than free text. Common patterns include tools as routers (each tool corresponds to a different downstream pipeline) and tools as confirmations (the model emits a tool call describing what it wants to do, and a human or downstream service approves before execution).

Structured Outputs

Structured Outputs, launched in August 2024, guarantees that model responses conform exactly to a provided JSON Schema ^[29]. This goes beyond JSON mode (which only ensured syntactically valid JSON) by enforcing strict schema adherence. The developer enables it by setting strict: true in a function definition or by passing response_format: { type: "json_schema", strict: true, schema: {...} } for non-tool responses.

On complex schema-following evals, gpt-4o-2024-08-06 with Structured Outputs scored a perfect 100% versus less than 40% with prompt-only schema instructions. The implementation uses a context-free grammar derived from the schema, which constrains decoding so only schema-valid tokens can be sampled.

Both Python and Node SDKs accept native typed objects: a Pydantic model in Python or a Zod schema in TypeScript becomes the JSON schema with no extra serialization step. The SDK parses the output back into a typed object, so the developer never touches raw JSON ^[29].

Limitations: schemas must be a subset of JSON Schema (no $ref to external URLs, no regex patterns, limited combinator support), additionalProperties: false is required on every object, the first call with a new schema pays a one-time compilation cost, and Structured Outputs interacts with tool_choice: required in subtle ways worth checking the cookbook on.

Realtime API

The Realtime API is the audio-native interface for low-latency voice agents, originally launched in October 2024 over WebSocket and expanded with WebRTC in early 2025 and SIP in mid-2025 ^[12]. As of May 2026, the production model is gpt-realtime, with gpt-realtime-mini for cost and gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper shipping alongside.

Transports

The transport choice is essentially the architecture of the application:

WebRTC is the right answer for browser and mobile clients. The browser stack handles jitter, packet loss, and codec negotiation. Audio quality holds up over consumer Wi-Fi and cellular in a way that hand-rolled WebSocket audio streaming does not.
WebSocket is the right answer for server-side orchestration. Compliance, logging, persistence, and tool routing live on the server, and audio frames travel between the server and OpenAI in a separate channel.
SIP is the right answer for actual phone calls. Twilio, Vonage, and other CPaaS providers can dial OpenAI directly.

Picking the wrong transport forces a lot of latency and reliability work later, so it is worth thinking about early.

Function calling, tools, and barge-in

Function calling and most Responses-style tools work in Realtime sessions, including web search and file search. The session protocol exposes events for response.audio.delta, response.text.delta, tool_call.created, and several others.

Barge-in (where the user starts speaking while the assistant is still talking) is built into the protocol. The client cancels the in-flight assistant audio with an input_audio_buffer.speech_started event and the server stops generating. The model can also detect end-of-turn from VAD signals if the client wants the server to manage turn-taking.

Voices, languages, and pricing

Voices include alloy, echo, fable, onyx, nova, and shimmer at launch, with several added since. Languages cover most major spoken languages with quality biased toward English. November 2025 added DTMF key-press support so voice agents can navigate IVR menus on outbound calls.

Audio is billed in tokens: roughly 100 tokens per second of input, 200 tokens per second of output. At gpt-realtime list price ($32 per million input audio tokens, $64 per million output, $0.40 per million cached input), a typical voice call runs about $0.30 per minute. Cached input is the lever that matters most: a system prompt repeated on every turn can drop from $32 to $0.40 per million tokens, an 80x discount.

Assistants API v2 and migration

The Assistants API is the predecessor to Responses ^[13]. It introduced higher-level concepts (assistants, threads, runs, messages) that many developers found friendlier than raw Chat Completions, and it included built-in code interpreter and file search tools. In August 2025, OpenAI announced that the Assistants API would be fully removed on August 26, 2026. The migration path to the Responses API is documented in the official guide ^[11]:

Assistants concept	Responses replacement
Assistant object (model + instructions + tools)	Server-side prompt object created in dashboard, or inline `instructions` + `tools` in each Responses call
Thread (server-side message store)	Conversation object via `/v1/conversations`, or `previous_response_id` chain
Run (model invocation against thread)	Responses request
Message (item in thread)	`input` items, `output` items
Tool (built-in code interpreter, file search)	Same tools, available natively in Responses

Most migrations are mechanical. The bigger change is conceptual: assistants were stateful objects with versioning, while Responses encourages a more functional, request-by-request style with state encoded in conversation IDs or encrypted reasoning. Teams that wrote a lot of assistant configuration code tend to rewrite that as either dashboard-managed prompts or static configuration in their own code.

Agents SDK

The Agents SDK, released alongside the Responses API in March 2025, is OpenAI's open source orchestration framework for multi-agent workflows ^[30]. It exists in Python and TypeScript flavors and provides a small number of primitives (Agent, Tool, Handoff, Guardrail, Trace) that compose into systems ranging from a single tool-using agent to a full sub-agent organization.

The core abstraction is the Agent, which is essentially a configured Responses call with instructions and tools. Two orchestration patterns dominate:

Agents as tools: a specialist agent is exposed to a primary agent as a function. The primary agent calls into the specialist for bounded subtasks but stays in charge of the conversation. This fits when one agent is the user-facing interface and others are backend specialists.
Handoffs: routing itself is the workflow. The first agent inspects the task and hands off to a specialist, which then owns the rest of the interaction.

The SDK includes built-in tracing, guardrails (input and output validation that runs before and after the model call), and structured outputs integration. Integration with Temporal and other workflow engines exists for production deployments where durability matters.

A related framework, the AgentKit toolkit, ships a visual Agent Builder and an embeddable ChatKit interface. Together with the Agents SDK they form the platform's answer to agent frameworks like LangGraph and CrewAI that emerged outside OpenAI in 2023-2024.

Fine-tuning

The fine-tuning endpoints let developers customize models on their own data using three different techniques ^[18]. All three are exposed through the same /v1/fine_tuning/jobs endpoint with different method values.

Supervised fine-tuning (SFT)

The original technique. The developer uploads JSONL data where each row is a Chat Completions-shaped conversation, and the model is trained to imitate the assistant turns. SFT is supported on GPT-4o, GPT-4o mini, GPT-4.1, GPT-4.1 mini, GPT-4.1 nano, and several earlier base models. SFT is the right choice when the developer has a clear gold standard and several hundred to several thousand examples.

Direct Preference Optimization (DPO)

DPO came to the API in 2024 and is available across the GPT-4.1 family (gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, gpt-4.1-nano-2025-04-14) ^[31]. Where SFT trains on a single correct response, DPO trains on pairwise comparisons (a preferred response and a rejected one) and learns to favor the patterns in the preferred examples. This is the right technique when the developer can rank outputs more easily than write a single ideal one, which fits most subjective tasks (tone, style, structure).

Reinforcement Fine-Tuning (RFT)

RFT, generally available since May 2025, is the first fine-tuning method that updates a reasoning model. It uses a developer-supplied grader (a programmable function that scores candidate responses) and runs reinforcement learning over chain-of-thought rollouts to push the model toward higher-scoring outputs ^[32]. RFT is available on o4-mini and other o-series snapshots. Costs run around $100 per hour on o4-mini, which makes RFT 100-700x more expensive than SFT, but it is the only technique that can teach a reasoning model new domain-specific evaluation criteria. Classic use cases: turning instructions into working code, pulling facts into a clean format, and applying complex rule sets correctly. The grader can be Python code, an OpenAI-hosted reference model, or a hybrid.

Comparing the techniques

Technique	Best when	Data format	Cost	Supported models
SFT	Single gold response per input	Chat conversations (JSONL)	Low	GPT-4o, GPT-4o mini, GPT-4.1 family, others
DPO	Easier to rank than to write	Pairs of preferred/rejected responses	Low-medium	GPT-4.1 family
RFT	Complex reasoning with measurable rubric	Inputs + grader function	High (~$100/hr)	o4-mini, other o-series

Batch API

The Batch API at /v1/batches lets developers submit large volumes of requests for asynchronous processing at a 50% discount on both input and output tokens ^[19]^[33]. Batches are guaranteed to complete within 24 hours, though most finish in 1 to 6 hours depending on size and current load.

The workflow is simple: upload a JSONL file of requests through the Files API, create a batch referencing that file ID, poll the batch status, and download the JSONL output file. Each request carries its own custom ID, so the response file can be joined back to the input on the developer side.

Most endpoints are batch-eligible: Chat Completions, Responses, embeddings, and images. Vision inputs work. Streaming does not, since the whole point of batching is async processing. The Realtime API is not batchable. Common patterns include classifying or labeling large datasets, generating embeddings for search indexes, bulk content generation or summarization, evaluation pipelines, and data enrichment for analytics. A 1 million request job that would cost $1,000 synchronously runs $500 in a batch.

The Batch API runs on a separate rate limit pool from synchronous calls, so a batch backfill does not eat into the rate limits of the production traffic. Some models also accept larger contexts in batch mode than they do synchronously, which makes the Batch API the only way to process documents at the upper bound of certain context windows.

Service tiers

OpenAI exposes several service tiers that change the cost-latency-reliability trade-off ^[34]^[35].

Tier	Cost vs standard	Latency	Best for
Priority	~2.5x	Lowest, most consistent	User-facing apps where latency matters
Standard	Baseline	Standard	General production
Flex	50% off	Slower, occasional 429s	Evaluations, data enrichment, async jobs
Batch	50% off	Up to 24h	Bulk async workloads
Scale	Negotiated	Reserved capacity	Enterprise with predictable demand

Priority and Flex are selected per request via the service_tier parameter. Scale Tier is a contractual commitment for very large customers where capacity is reserved at fixed daily rates and throughput is guaranteed regardless of broader platform load.

Authentication and organizations

Authentication uses API keys passed in the Authorization header as Bearer tokens ^[2]. Modern keys use the sk-proj- prefix, scoping them to a specific project within an organization, which is more secure than the older organization-wide keys. Project keys can only access models, files, and resources tied to their project; organization keys can access anything in the organization. New projects almost always want project keys, because the blast radius of a leaked key is much smaller.

The organization and project structure also drives billing and rate limits. Usage rolls up to the organization, but each project can have its own spending limits, model allowlists, and member roles. Larger teams typically run one project per environment (dev, staging, prod) plus separate projects for shared services.

Best practices for key management:

Never expose keys in client-side code or version control. The SDKs detect OPENAI_API_KEY automatically; use .env files or a secrets manager.
Rotate keys regularly through the platform dashboard. The Admin API supports programmatic rotation.
Use project-scoped keys instead of organization keys.
Set per-key spending limits to cap blast radius if a key is compromised.
Use IP allowlisting (available on enterprise plans) to restrict where keys can be used from.

Users who belong to multiple organizations can specify which one to bill by passing an OpenAI-Organization header ^[36]. The OpenAI-Project header overrides the project implied by the key.

Rate limits and usage tiers

OpenAI applies rate limits at the organization level based on usage tiers. As cumulative spending increases, organizations automatically graduate to higher tiers with expanded limits ^[36]^[37]. Rate limits are measured across RPM (requests per minute), RPD (per day), TPM (tokens per minute), TPD (per day), IPM (images per minute), and AMM (audio minutes per minute, for streaming audio).

Tier	Qualification	Indicative TPM (GPT-5 family)
Free	Default for new accounts	Limited access
Tier 1	$5+ paid	500,000 TPM
Tier 2	$50+ paid, 7+ days since first payment	1,000,000 TPM
Tier 3	$100+ paid, 7+ days since first payment	2,000,000 TPM
Tier 4	$250+ paid, 14+ days since first payment	4,000,000 TPM
Tier 5	$1,000+ paid, 30+ days since first payment	Up to 40,000,000 TPM

Rate limit information returns in HTTP response headers (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, x-ratelimit-reset-requests), which lets applications throttle proactively rather than retrying blindly. The Retry-After header on 429 responses tells clients how long to wait. Some features unlock at higher tiers: the computer use tool is currently restricted to tiers 3-5, and Reinforcement Fine-Tuning requires manual approval.

SDKs and libraries

OpenAI maintains official SDKs for several languages ^[38], with community libraries covering most of what is missing.

Language	Package	Status	Notes
Python	`openai` (pip)	Stable, official	Sync/async, streaming, Pydantic, auto-pagination
Node / TypeScript	`openai` (npm)	Stable, official	TypeScript-first, streaming, Zod integration
.NET	`OpenAI` (NuGet)	Stable, official (with Microsoft)	.NET Standard 2.0, IAsyncEnumerable streaming
Java	`openai-java` (Maven)	Beta, official	Requires Java 8+, current 4.34.0
Go	`openai-go` (module)	Beta, official	Idiomatic Go interfaces
Ruby	`ruby-openai` (gem)	Community	Widely used despite no official version
PHP	`openai-php/client`	Community	Maintained by community
Rust	`async-openai`	Community	Tokio-based

All official SDKs auto-detect OPENAI_API_KEY, provide typed request and response objects, retry with exponential backoff on transient failures, and support streaming through language-appropriate patterns ^[38].

The Agents SDK is a separate package on top of the base SDK, available in Python (openai-agents-python) and TypeScript (openai-agents-js). New harness features (configurable memory, sandbox-aware orchestration, filesystem tools) tend to land in Python first and follow in TypeScript.

Pricing model

Token-based pricing

The OpenAI API uses a per-token pricing model with separate rates for input and output tokens ^[20]. A token is a chunk of text processed by the model's tokenizer, roughly four characters or 0.75 words in English. Input tokens include the system prompt, user messages, conversation history, and tool definitions. Output tokens include the model's response text and, for reasoning models, internal reasoning tokens.

Output tokens are generally priced 2x to 8x higher than input tokens, reflecting the greater compute cost of generation versus comprehension. For long-context models, prompts above a threshold (272K tokens for GPT-5.5) are billed at a higher per-token rate.

Prompt caching

Prompt caching automatically reduces costs for repeated input prefixes. When the API detects that the beginning of a prompt matches a recent request, it reads from cache instead of reprocessing those tokens ^[20]. Cache discounts vary by model family:

Model family	Cache discount
GPT-5 series, GPT-5.5	90% off cached input tokens
GPT-4.1 series	75% off cached input tokens
GPT-4o / o-series	50% off cached input tokens
gpt-realtime	~98% off cached input tokens

Caching is automatic but only triggers for prefixes that match within a recent window (extended in 2026 to up to 24 hours via extended prompt caching). Keep the prefix stable: put the system prompt, tool definitions, and persistent context at the front, and per-request data at the end.

Cost optimization summary

Strategy	Savings	Trade-off
Batch API	50% on all tokens	Up to 24-hour latency
Flex processing	50% on all tokens	Slower, occasional 429s
Prompt caching	50-90% on repeated inputs	Requires consistent prefixes
Smaller models (mini/nano)	80-95% vs flagship	Lower capability on hard tasks
Fine-tuning	Reduced prompt length	Upfront training cost
Structured Outputs	Fewer retries	Slightly constrained format
Predicted Outputs	Reduced output latency	Only useful for partial-edit tasks

Compliance and data handling

OpenAI's compliance posture matters for any team handling regulated data. The platform meets several enterprise standards ^[39]^[40].

Certifications

SOC 2 Type 2: independent examination of Security, Availability, Confidentiality, and Privacy controls for the API and ChatGPT business products.
ISO/IEC 27001:2022 and ISO/IEC 27701:2019: information security and privacy management systems covering the API and enterprise products.
HIPAA Business Associate Agreement (BAA): available for ChatGPT for Healthcare and API healthcare customers. A BAA plus zero data retention is the path to HIPAA compliance.
GDPR: OpenAI signs Data Processing Addenda and supports EU data residency.

Zero Data Retention

Qualifying organizations can request a Zero Data Retention (ZDR) configuration, where the API processes requests without storing any content. Once the response is returned, OpenAI permanently deletes all request data from its systems. ZDR is required for HIPAA compliance and is common for financial services and other regulated industries. Some features (Responses API store=true, Assistants API threads, fine-tuning training data persistence) are not available under ZDR.

Data residency, training opt-out, and logging

Eligible enterprise customers can store sensitive customer content at rest in the United States, European Union, United Kingdom, Japan, Canada, South Korea, Singapore, Australia, India, and the United Arab Emirates ^[40]. This applies to ChatGPT Enterprise, ChatGPT Edu, ChatGPT for Healthcare, and the API platform.

API traffic is not used to train OpenAI's models by default. The original Chat Completions defaults changed in 2023 to opt-out from training, and the same applies to Responses, Assistants, Realtime, and every other modern endpoint. Free ChatGPT and consumer products have different defaults, but the API is unambiguously opt-out unless the customer explicitly enrolls in a data sharing program.

In the default configuration, OpenAI retains API request and response data for 30 days for abuse monitoring before automatic deletion. ZDR turns this off entirely. The API also publishes audit logs for security-relevant events through the Admin API, which enterprises plug into their SIEM.

Azure OpenAI parity

Microsoft hosts OpenAI's models on Azure under a separate brand and API surface called Azure OpenAI Service ^[41]. The endpoint shape is similar but not identical: Azure scopes deployments per region, requires per-deployment names rather than per-model names, and ships features on a slightly delayed timeline.

What is the same: the model weights (GPT-5 on Azure is GPT-5 from OpenAI), most wire formats (the Python SDK has an Azure mode that handles the differences with a small configuration change), and most core features (function calling, structured outputs, vision, embeddings, fine-tuning).

What is different: region availability is per-model and per-feature (Sora launched in Sweden Central and East US 2 first; gpt-realtime launched on Azure Foundry Direct Models in early 2026), and some features arrive on Azure later (the Responses API took several months; computer use was OpenAI-only at first). Azure offers data zone deployments (US-only or EU-only) and global deployments (Microsoft routes to whichever region has capacity). Quotas are per-region, per-subscription, per-model. Pricing is generally similar; Azure offers Provisioned Throughput Units (PTUs) for reserved capacity that are not available on the OpenAI platform directly.

Azure makes sense when the organization is already on Azure, needs PTUs for reserved capacity, or has data residency requirements that Azure satisfies more cleanly. The OpenAI platform makes sense when the organization wants the latest features as soon as they ship or runs across multiple clouds.

Current state (May 2026)

As of May 2026, the OpenAI API is the most widely deployed commercial AI API, serving over one million organizations worldwide ^[1]. The April 2026 release of GPT-5.5 brought the first model where the long-context regime (above 272K tokens) is priced as a deliberate tier rather than an exception, which suggests OpenAI now sees million-token contexts as a routine workload rather than a curiosity. The March 2026 release of GPT-5.4 introduced native computer-use as a first-class capability and pushed the context window past one million tokens for most variants.

Key trends shaping the API in early 2026:

Agents are the default mental model. The Responses API, the Agents SDK, AgentKit, the computer use tool, and gpt-realtime collectively assume the developer is building agents, not chatbots.
Price deflation continues. What cost $60 per million tokens with GPT-3 Davinci in 2020 now costs $0.05 per million with GPT-5 nano, a roughly 1,200x reduction. The deflation is steeper on cached input.
Context window expansion. The jump from 128K (GPT-4o) to 1M+ (GPT-5.5, GPT-5.4) opens new use cases in document analysis, codebase understanding, and long-running agent workflows.
MCP is the integration layer. The Responses API supports remote MCP servers natively, and most major tool authors now ship MCP servers alongside their REST APIs.
Competition is heavier. The API faces growing competition from Anthropic's Claude API, Google's Gemini API, and inference platforms hosting open-weight models. OpenAI maintains its lead through rapid model iteration, a mature developer ecosystem, and deep Azure integration ^[42].

The API evolves at roughly quarterly cadence on flagship models, with smaller updates (snapshot pins, new tools, pricing tweaks) shipping nearly every month. Whether the agent paradigm is the long-term mental model or just the current one is an open question; the platform's design as of mid-2026 is clearly betting it is.

References

OpenAI. "The state of enterprise AI 2025 report." OpenAI, 2025. https://openai.com/index/the-state-of-enterprise-ai-2025-report/
OpenAI. "API Reference." OpenAI Developer Platform. https://platform.openai.com/docs/api-reference
Brown, Tom et al. "Language Models are Few-Shot Learners." NeurIPS 2020. https://arxiv.org/abs/2005.14165
Microsoft. "Microsoft teams up with OpenAI to exclusively license GPT-3." September 22, 2020. https://blogs.microsoft.com/blog/2020/09/22/microsoft-teams-up-with-openai-to-exclusively-license-gpt-3-language-model/
OpenAI. "Introducing ChatGPT and Whisper APIs." March 1, 2023. https://openai.com/blog/introducing-chatgpt-and-whisper-apis
OpenAI. "GPT-4 Technical Report." March 14, 2023. https://openai.com/research/gpt-4
OpenAI. "New tools for building agents." March 2025. https://openai.com/index/new-tools-for-building-agents/
OpenAI Developer Community. "Assistants API beta deprecation, August 26, 2026 sunset." August 2025. https://community.openai.com/t/assistants-api-beta-deprecation-august-26-2026-sunset/1354666
OpenAI. "Deprecations." OpenAI Developer Platform. https://developers.openai.com/api/docs/deprecations
OpenAI. "Why we built the Responses API." OpenAI Developers Blog. https://developers.openai.com/blog/responses-api
OpenAI. "Migrate to the Responses API." OpenAI Developer Platform. https://developers.openai.com/api/docs/guides/migrate-to-responses
OpenAI. "Realtime and audio." OpenAI Developer Platform. https://platform.openai.com/docs/guides/realtime
OpenAI. "New models and developer products announced at DevDay." November 6, 2023. https://openai.com/blog/new-models-and-developer-products-announced-at-devday
Weaviate. "OpenAI's Matryoshka Embeddings in Weaviate." 2024. https://weaviate.io/blog/openais-matryoshka-embeddings-in-weaviate
OpenAI. "Introducing our latest image generation model in the API." April 23, 2025. https://openai.com/index/image-generation-api/
OpenAI. "Introducing next-generation audio models in the API." 2025. https://openai.com/index/introducing-our-next-generation-audio-models/
OpenAI. "File search." OpenAI Developer Platform. https://platform.openai.com/docs/guides/tools-file-search
OpenAI. "Fine-tuning." OpenAI Developer Platform. https://platform.openai.com/docs/guides/fine-tuning
OpenAI. "Batch API." OpenAI Developer Platform. https://developers.openai.com/api/docs/guides/batch
OpenAI. "Pricing." OpenAI API. https://openai.com/api/pricing/
DevTk.AI. "OpenAI API Pricing 2026: GPT-5.5, GPT-5.4, Codex & GPT-5 Cost per 1M Tokens." 2026. https://devtk.ai/en/blog/openai-api-pricing-guide-2026/
OpenAI. "Learning to reason with LLMs." September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/
OpenAI. "Codex models." OpenAI Developer Platform. https://developers.openai.com/codex/models
OpenAI. "New tools and features in the Responses API." 2025. https://openai.com/index/new-tools-and-features-in-the-responses-api/
OpenAI. "Tools, Web Search and States with Responses API." Cookbook. https://developers.openai.com/cookbook/examples/responses_api/responses_example
OpenAI. "Code Interpreter." OpenAI Developer Platform. https://developers.openai.com/api/docs/guides/tools-code-interpreter
OpenAI. "Computer use." OpenAI Developer Platform. https://developers.openai.com/api/docs/guides/tools-computer-use
OpenAI. "Function calling." OpenAI Developer Platform. https://platform.openai.com/docs/guides/function-calling
OpenAI. "Introducing Structured Outputs in the API." August 2024. https://openai.com/index/introducing-structured-outputs-in-the-api/
OpenAI. "Agents SDK." OpenAI Developer Platform. https://developers.openai.com/api/docs/guides/agents
OpenAI. "Fine-Tuning Techniques: Choosing Between SFT, DPO, and RFT." Cookbook. https://cookbook.openai.com/examples/fine_tuning_direct_preference_optimization_guide
OpenAI. "Reinforcement fine-tuning." OpenAI Developer Platform. https://platform.openai.com/docs/guides/reinforcement-fine-tuning
TokenMix. "OpenAI Batch API 2026: 50% Off Every Model, 24-Hour Guide." 2026. https://tokenmix.ai/blog/openai-batch-api-pricing
OpenAI. "Flex processing." OpenAI Developer Platform. https://developers.openai.com/api/docs/guides/flex-processing
OpenAI. "Priority Processing for API Customers." https://openai.com/api-priority-processing/
OpenAI. "Rate limits." OpenAI Developer Platform. https://developers.openai.com/api/docs/guides/rate-limits
Inference.net. "OpenAI Rate Limits: Complete Guide to TPM, RPM & Tier Limits (2026)." 2026. https://inference.net/content/openai-rate-limits-guide/
OpenAI. "SDKs and CLI." OpenAI Developer Platform. https://developers.openai.com/api/docs/libraries
OpenAI. "Enterprise privacy at OpenAI." https://openai.com/enterprise-privacy/
OpenAI. "Business data privacy, security, and compliance." https://openai.com/business-data/
Microsoft. "Use the Azure OpenAI Responses API." Microsoft Foundry, 2026. https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/responses
IntuitionLabs. "AI API Pricing Comparison (2026)." https://intuitionlabs.ai/articles/ai-api-pricing-comparison-grok-gemini-openai-claude

History

GPT-3 API beta (June 2020)

ChatGPT API and GPT-3.5 Turbo (March 2023)

GPT-4 and expanding capabilities (2023-2024)

Responses API and the agent era (2025-2026)

Endpoints

Responses API

Chat Completions

Completions (legacy)

Realtime API

Assistants API v2 (deprecated)

Embeddings

Images

Audio: speech, transcription, translation

Files, vector stores, moderations, and models

Models

Flagship and reasoning models

O-series reasoning models

Codex models

Image and audio models

Embedding and moderation models

Open-weight models

Built-in tools

File search

Web search

Code interpreter

Computer use

Image generation as a tool

Remote MCP servers

Function calling and tool use

Structured Outputs

Realtime API

Transports

Function calling, tools, and barge-in

Voices, languages, and pricing

Assistants API v2 and migration

Agents SDK

Fine-tuning

Supervised fine-tuning (SFT)

Direct Preference Optimization (DPO)

Reinforcement Fine-Tuning (RFT)

Comparing the techniques

Batch API

Service tiers

Authentication and organizations

Rate limits and usage tiers

SDKs and libraries

Pricing model

Token-based pricing

Prompt caching

Cost optimization summary

Compliance and data handling

Certifications

Zero Data Retention

Data residency, training opt-out, and logging

Azure OpenAI parity

Current state (May 2026)

See also

References

Improve this article

Related Articles

DeepSeek 3.0

GPT-5 Codex

Access PDF

Dev tools

Aider

Model Context Protocol

History

GPT-3 API beta (June 2020)

ChatGPT API and GPT-3.5 Turbo (March 2023)

GPT-4 and expanding capabilities (2023-2024)

Responses API and the agent era (2025-2026)

Endpoints

Responses API

Chat Completions

Completions (legacy)

Realtime API

Assistants API v2 (deprecated)

Embeddings

Images