The OpenAI API is a REST-based application programming interface that provides developers with programmatic access to OpenAI's family of artificial intelligence models, including the GPT series of large language models, the o-series reasoning models, DALL-E image generators, and Whisper speech recognition. First launched in June 2020 as a private beta alongside GPT-3, the API has grown into one of the most widely used AI developer platforms in the world, serving over one million organizations as of December 2025 [1].
The API follows a pay-per-use pricing model based on tokens (roughly four characters of English text), with separate rates for input and output tokens. Developers authenticate using API keys scoped to projects within organizations, and can access models through HTTP endpoints or official SDKs for Python, Node.js, and .NET [2].
OpenAI opened its API to developers as a private beta in June 2020, coinciding with the release of GPT-3. This marked the company's first commercial product, positioning the 175-billion-parameter GPT-3 as a general-purpose text engine available over HTTP [3]. The initial API offered text completions through a single endpoint, with pricing based on four model tiers (Ada, Babbage, Curie, and Davinci) that traded off capability for cost. Davinci, the most capable tier, was priced at $0.06 per 1,000 tokens, equivalent to $60 per million tokens.
The beta attracted thousands of developers and spawned early AI-powered applications in copywriting, customer support, and code generation. Microsoft secured an exclusive license to GPT-3's underlying technology in September 2020, though the API itself remained accessible to all approved developers [4].
On March 1, 2023, OpenAI released the gpt-3.5-turbo model through the API, introducing the Chat Completions endpoint that would become the standard interface for conversational AI [5]. This model was priced at $0.002 per 1,000 tokens, one-tenth the price of the then-current text-davinci-003, while delivering superior instruction-following and dialogue quality. The launch coincided with an explosion of developer interest following ChatGPT's viral success in late 2022.
The Chat Completions format introduced the now-standard message-based structure with system, user, and assistant roles, replacing the older text-in/text-out completions paradigm. This format has since been adopted across the industry by Anthropic, Google, and others.
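The message-based structure can be sketched as a plain request body. This is a minimal illustration of the format; the model name and message content are examples, not canonical values.

```python
# A minimal Chat Completions request body using the message-based format.
# Model name and content are illustrative.
payload = {
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain what a token is in one sentence."},
    ],
}

# A multi-turn conversation extends the same structure: the client appends
# each assistant reply and the next user message to the list.
payload["messages"].append(
    {"role": "assistant", "content": "A token is a small chunk of text."}
)
```

Because the client resends the full message list on each turn, conversation history counts toward input tokens on every request.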
GPT-4 arrived in the API on March 14, 2023, bringing multimodal capabilities (text and image input) and significantly improved reasoning [6]. Throughout 2023 and 2024, OpenAI rapidly expanded the API's feature set, adding function calling, Structured Outputs, the Assistants API, and the Batch API.
In March 2025, OpenAI introduced the Responses API as the successor to both the Chat Completions API and the Assistants API [7]. The Responses API functions as a superset of Chat Completions, adding built-in tools like web search, file search, and code execution, along with server-side memory and more flexible input formats for multimodal data. The Assistants API was officially deprecated in August 2025, with removal scheduled for August 2026.
Major model releases continued at a rapid pace through 2025 and into 2026, with GPT-4.1, GPT-5, GPT-5.2, and GPT-5.4 each bringing improvements in capability, context length, and cost efficiency.
As of March 2026, the OpenAI API offers a wide range of models across several families. The following table lists the primary models available through the API with their pricing [8][9].
| Model | Release Date | Context Window | Input Price (per 1M tokens) | Output Price (per 1M tokens) | Key Features |
|---|---|---|---|---|---|
| GPT-5.4 | March 2026 | 1,050,000 | $2.50 | $15.00 | Native computer-use, most token-efficient reasoning |
| GPT-5.4 Pro | March 2026 | 1,050,000 | Higher tier | Higher tier | Maximum capability variant |
| GPT-5.4 Mini | March 2026 | 400,000 | $0.75 | $4.50 | Lower-latency, lower-cost |
| GPT-5.4 Nano | March 2026 | 400,000 | $0.20 | $1.25 | Budget tier, API-only |
| GPT-5.2 | December 2025 | 400,000 | $1.75 | $14.00 | Professional knowledge work |
| GPT-5 | August 2025 | 400,000 | $1.25 | $10.00 | Unified base + thinking architecture |
| GPT-4.1 | April 2025 | 1,000,000 | $2.00 | $8.00 | Coding-optimized, 1M context |
| GPT-4.1 Mini | April 2025 | 1,000,000 | $0.40 | $1.60 | Fast, affordable |
| GPT-4.1 Nano | April 2025 | 1,000,000 | $0.10 | $0.40 | Lowest-cost GPT-4.1 |
| GPT-4o | May 2024 | 128,000 | $2.50 | $10.00 | Multimodal (text, vision, audio) |
| GPT-4o Mini | July 2024 | 128,000 | $0.15 | $0.60 | High-volume budget option |
The o-series models are trained with reinforcement learning to perform internal chain-of-thought reasoning before generating responses. They use "reasoning tokens" for thinking steps that are billed as output tokens but not visible in API responses [10].
| Model | Release Date | Context Window | Input Price (per 1M tokens) | Output Price (per 1M tokens) | Specialization |
|---|---|---|---|---|---|
| o3 | April 2025 | 200,000 | $0.40 | $1.60 | General reasoning |
| o4-mini | April 2025 | 200,000 | $1.10 | $4.40 | Cost-efficient reasoning, "thinks with images" |
| o3-mini | January 2025 | 200,000 | $1.10 | $4.40 | Fast STEM reasoning |
| Model | Purpose | Input Price (per 1M tokens) | Output Price (per 1M tokens) |
|---|---|---|---|
| text-embedding-3-large | Embeddings (3072 dims) | $0.13 | N/A |
| text-embedding-3-small | Embeddings (1536 dims) | $0.02 | N/A |
| DALL-E 3 | Image generation | $0.040-$0.120 per image | N/A |
| Whisper (large-v3) | Speech-to-text | $0.006 per minute | N/A |
| TTS-1 / TTS-1-HD | Text-to-speech | $15.00 / $30.00 per 1M characters | N/A |
| GPT-4o Transcribe | Audio transcription | $2.50 per hour | N/A |
The OpenAI API exposes several endpoint categories, each serving different use cases [2][11].
The /v1/chat/completions endpoint is the most widely used interface. It accepts an array of messages (with roles: system, user, assistant, tool) and returns a model-generated response. This endpoint supports streaming, function calling, Structured Outputs, and image inputs.
The /v1/responses endpoint, introduced in March 2025, extends the Chat Completions paradigm with built-in server-side tools [7]. It supports web search, file search, code execution in a sandboxed environment, and optional persistent memory across conversations. OpenAI recommends this endpoint for all new projects.
The /v1/embeddings endpoint converts text into numerical vector representations useful for semantic search, clustering, and retrieval-augmented generation (RAG). The current models (text-embedding-3-small and text-embedding-3-large) support configurable dimensions and Matryoshka Representation Learning for flexible precision-cost tradeoffs.
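A typical downstream use of embeddings is ranking documents by cosine similarity to a query vector. The sketch below uses toy three-dimensional vectors in place of real API-returned embeddings (which have 1536 or 3072 dimensions), purely to show the arithmetic.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for API-returned embeddings.
doc_vec = [0.1, 0.3, 0.5]
query_vec = [0.1, 0.3, 0.5]    # identical direction: similarity 1.0
other_vec = [0.5, -0.3, 0.1]   # mostly unrelated direction

best = cosine_similarity(doc_vec, query_vec)
worse = cosine_similarity(doc_vec, other_vec)
```

In a RAG pipeline, the same comparison runs between a query embedding and a store of precomputed document embeddings, with the top-scoring documents passed to the model as context.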
The /v1/images/generations endpoint generates images from text prompts using DALL-E 3. It supports multiple sizes (1024x1024, 1792x1024, 1024x1792), quality settings (standard or HD), and returns either URLs or base64-encoded images.
The API provides several audio endpoints:
- /v1/audio/transcriptions: converts speech to text using Whisper or GPT-4o models
- /v1/audio/translations: translates audio into English text
- /v1/audio/speech: generates spoken audio from text input using TTS models

The /v1/moderations endpoint classifies text across categories like hate speech, self-harm, sexual content, and violence. This free endpoint is commonly used to filter user inputs before passing them to language models.
The fine-tuning endpoints allow developers to customize models on their own datasets. Supported base models include GPT-4o, GPT-4o mini, and GPT-4.1 mini. The process involves uploading training data in JSONL format, creating a fine-tuning job, and then using the resulting custom model through the standard Chat Completions endpoint [12].
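Each line of the training file is one JSON object containing a chat-formatted conversation. The sketch below builds and validates such a file in memory; the conversation content is illustrative.

```python
import json

# Fine-tuning data is JSONL: one chat-formatted example per line.
# The example content here is illustrative.
examples = [
    {"messages": [
        {"role": "system", "content": "You answer in formal English."},
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "Good day. How may I assist you?"},
    ]},
    {"messages": [
        {"role": "user", "content": "thx"},
        {"role": "assistant", "content": "You are most welcome."},
    ]},
]

jsonl = "\n".join(json.dumps(ex) for ex in examples)

# Sanity check: every line must parse back to an object with "messages".
for line in jsonl.splitlines():
    record = json.loads(line)
    assert "messages" in record
```

The resulting text would be uploaded via the files endpoint before creating the fine-tuning job.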
Function calling allows the model to generate structured JSON arguments for developer-defined functions, enabling integration with external systems such as databases, APIs, and tools [13]. Introduced in June 2023, this feature has become foundational for building AI agents. The developer defines available functions with names, descriptions, and parameter schemas; the model decides when to call them and generates the appropriate arguments.
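A tool definition pairs a name and description with a JSON Schema for its parameters. The sketch below uses a hypothetical `get_order_status` function; the model would respond with JSON arguments that the client parses and dispatches.

```python
import json

# A hypothetical tool definition in the function-calling format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the status of a customer order by ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "Order identifier",
                    },
                },
                "required": ["order_id"],
            },
        },
    }
]

# The model returns a tool call carrying JSON arguments as a string;
# the client parses them and invokes the real function.
model_arguments = '{"order_id": "A-1001"}'  # illustrative tool-call payload
args = json.loads(model_arguments)
```

The model never executes anything itself: the client code is responsible for calling the real function and, typically, sending the result back as a tool-role message.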
Structured Outputs, launched in August 2024, guarantee that model responses conform exactly to a provided JSON Schema [14]. This goes beyond the earlier JSON mode (which only ensured syntactically valid JSON) by enforcing strict schema adherence. Developers enable this by setting strict: true in function definitions or using the response_format parameter with a JSON schema. The Python and Node.js SDKs support Pydantic and Zod objects, respectively, for schema definition.
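A Structured Outputs request attaches a schema via the response_format parameter. The schema below is an illustrative example, not a canonical one; the key elements are strict: true and a closed object (additionalProperties: false).

```python
# An illustrative response_format payload for Structured Outputs.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket",
        "strict": True,  # enforce exact schema adherence, not just valid JSON
        "schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},
                "priority": {"type": "integer"},
            },
            "required": ["category", "priority"],
            "additionalProperties": False,
        },
    },
}
```

With this in place, the response is guaranteed to parse into an object with exactly these two fields, removing the retry-and-validate loops that JSON mode still required.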
Starting with GPT-4 Turbo, the API accepts image inputs alongside text in the Chat Completions endpoint. Models can analyze photographs, screenshots, charts, documents, and other visual content. Images can be passed as URLs or base64-encoded data, with configurable detail levels (low or high) that affect token consumption and cost.
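An image input is one content part within an otherwise ordinary user message. The sketch below mixes text and an image URL (the URL is a placeholder); the detail setting governs the accuracy/token-cost trade-off described above.

```python
# A user message combining text and an image. The URL is a placeholder.
vision_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What does this chart show?"},
        {
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/chart.png",
                "detail": "low",  # "low" caps token cost; "high" sees more
            },
        },
    ],
}
```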
All text generation endpoints support streaming via server-sent events (SSE), delivering tokens incrementally as the model generates them rather than waiting for the complete response. This is essential for interactive applications where perceived latency matters. The SDKs provide convenient streaming helpers in both synchronous and asynchronous modes.
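On the wire, a stream is a sequence of `data:` lines, each carrying a JSON chunk with a content delta, terminated by a `[DONE]` sentinel. The sketch below parses simulated wire data of that shape; real chunks carry additional fields that are omitted here for brevity.

```python
import json

def parse_sse(lines):
    """Yield content deltas from a stream of SSE 'data:' lines."""
    for line in lines:
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":  # sentinel marking the end of the stream
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Simulated wire data in the shape streamed chunks take (simplified).
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(parse_sse(stream))  # accumulates to "Hello"
```

In practice the SDK streaming helpers handle this parsing, so application code simply iterates over typed chunk objects.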
Developers can create custom versions of OpenAI models trained on domain-specific data [12]. Fine-tuning is available for GPT-4o, GPT-4o mini, and GPT-4.1 mini, with pricing that covers both training tokens and inference on the resulting model. Fine-tuning typically improves consistency on specialized tasks while reducing the need for lengthy system prompts.
The Assistants API, introduced at OpenAI's DevDay in November 2023, provided a higher-level abstraction for building AI assistants with persistent threads, automatic context management, built-in tools (code interpreter, file search, function calling), and file handling [15]. It managed conversation state server-side, freeing developers from manually tracking message history.
In August 2025, OpenAI deprecated the Assistants API in favor of the Responses API, which offers the same capabilities with a simpler interface and better performance [7]. The Assistants API is scheduled for removal in August 2026.
The Batch API allows developers to submit large volumes of requests for asynchronous processing at a 50% discount on both input and output tokens [16]. Batches are processed within a 24-hour window, making this feature ideal for non-time-sensitive workloads such as large-scale classification, offline content generation, and model evaluations.
The Batch API supports the same models and parameters as synchronous endpoints, including vision inputs. However, streaming is not available for batch requests. The 50% cost reduction makes the Batch API one of the most effective ways to reduce API spending for workloads that can tolerate higher latency.
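A batch input file is JSONL where each line wraps one ordinary request with a custom_id and a target endpoint. The sketch below builds one such line; the model and content are illustrative.

```python
import json

# One Batch API input line: a custom_id, the target endpoint, and a
# normal request body. Model and content are illustrative.
batch_line = json.dumps({
    "custom_id": "req-1",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Summarize this review."}],
    },
})

record = json.loads(batch_line)
```

Results come back in a corresponding output file, with each completed response matched to its request by custom_id.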
Authentication uses API keys passed in the Authorization header as Bearer tokens [2]. Modern keys use the sk-proj- prefix, scoping them to a specific project within an organization. This allows fine-grained access control: different teams or applications can have separate keys with distinct permissions and usage tracking.
Best practices for key management include storing keys in environment variables (such as OPENAI_API_KEY) rather than hardcoding them in source code, scoping keys to individual projects, and rotating or revoking any key that may have been exposed.

The API uses a hierarchical structure: organizations contain projects, and projects contain API keys. Users who belong to multiple organizations can specify which one to bill by passing an OpenAI-Organization header. Usage, billing, and rate limits are tracked at the organization level [17].
OpenAI applies rate limits at the organization level based on usage tiers. As spending increases, organizations automatically graduate to higher tiers with expanded limits [17].
Rate limits are measured across five dimensions:
| Metric | Description |
|---|---|
| RPM | Requests per minute |
| RPD | Requests per day |
| TPM | Tokens per minute |
| TPD | Tokens per day |
| IPM | Images per minute |
Usage tiers are determined by cumulative payment history and account age [17]:

| Tier | Qualification | Example TPM (GPT-5 family) |
|---|---|---|
| Free | Default for new accounts | Limited access |
| Tier 1 | $5+ paid | 500,000 TPM |
| Tier 2 | $50+ paid, 7+ days since first payment | 1,000,000 TPM |
| Tier 3 | $100+ paid, 7+ days since first payment | 2,000,000 TPM |
| Tier 4 | $250+ paid, 14+ days since first payment | 4,000,000 TPM |
| Tier 5 | $1,000+ paid, 30+ days since first payment | Higher limits |
Rate limit information is returned in HTTP response headers (x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, etc.), allowing applications to implement proactive throttling.
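Proactive throttling can be sketched as a check on those headers before the next request. The header names follow the x-ratelimit-* convention noted above; the threshold values are illustrative choices, not API-mandated ones.

```python
# Illustrative proactive throttling from rate-limit response headers.
headers = {
    "x-ratelimit-remaining-requests": "9",
    "x-ratelimit-remaining-tokens": "1200",
}

def should_throttle(headers, min_requests=10, min_tokens=5000):
    """Back off before hitting the hard limit rather than after a 429."""
    remaining_req = int(headers.get("x-ratelimit-remaining-requests", "0"))
    remaining_tok = int(headers.get("x-ratelimit-remaining-tokens", "0"))
    return remaining_req < min_requests or remaining_tok < min_tokens

throttle = should_throttle(headers)  # True: both values are below thresholds
```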
For organizations needing even higher throughput, OpenAI offers a Scale Tier with reserved capacity and guaranteed availability, where customers purchase token units at fixed daily rates [18].
OpenAI provides official SDKs for three programming languages [2]:
| SDK | Language | Package Manager | Key Features |
|---|---|---|---|
| openai-python | Python | pip (openai) | Sync/async, streaming, Pydantic support, auto-pagination |
| openai-node | Node.js/TypeScript | npm (openai) | TypeScript-first, streaming, Zod integration |
| openai-dotnet | C#/.NET | NuGet (OpenAI) | .NET Standard 2.0, collaboration with Microsoft |
All three SDKs share a consistent design philosophy: they auto-detect the OPENAI_API_KEY environment variable, provide typed request and response objects, handle retries with exponential backoff, and support streaming through language-appropriate patterns (async generators in Python, async iterables in Node.js, IAsyncEnumerable in .NET) [19].
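The retry behavior can be sketched as exponential backoff with jitter: each retry doubles the wait, up to a cap, with randomization to avoid synchronized retries across clients. The base, cap, and retry count below are illustrative defaults, not the SDKs' exact values.

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=8.0):
    """Exponential backoff with jitter, of the kind SDKs apply to
    retryable errors (429s, 5xx). Parameters are illustrative."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))  # 0.5, 1, 2, 4, 8 (capped)
        delays.append(delay * (0.5 + random.random() / 2))  # jitter in [0.5x, 1x]
    return delays

delays = backoff_delays()
```

Jitter matters at scale: without it, many clients that failed together retry together, re-creating the overload that caused the failures.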
The community has also produced unofficial libraries for Go, Rust, Java, Ruby, PHP, and other languages, many listed in OpenAI's official documentation.
The OpenAI API uses a per-token pricing model with separate rates for input and output tokens [8]. A token is a chunk of text processed by the model's tokenizer, roughly corresponding to four characters or 0.75 words in English. Input tokens include the system prompt, user messages, conversation history, and any tool definitions. Output tokens include the model's response text and, for reasoning models, internal thinking tokens.
Output tokens are generally priced 2x to 8x higher than input tokens, reflecting the greater computational cost of generation versus comprehension.
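The cost of a single request follows directly from the two rates. The sketch below uses the GPT-5 rates from the table above ($1.25 input, $10.00 output per million tokens).

```python
# Per-request cost from token counts, using the GPT-5 rates listed above.
INPUT_RATE = 1.25    # USD per 1M input tokens
OUTPUT_RATE = 10.00  # USD per 1M output tokens

def request_cost(input_tokens, output_tokens):
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000

# 10k input tokens + 2k output tokens: 0.0125 + 0.02 = 0.0325 USD
cost = request_cost(10_000, 2_000)
```

Note the asymmetry: the 2,000 output tokens here cost more than the 10,000 input tokens, which is why limiting response length is often the quickest cost lever.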
Prompt caching automatically reduces costs for repeated input prefixes. When the API detects that the beginning of a prompt matches a recent request, it reads from cache instead of reprocessing those tokens [8]. Cache discounts vary by model family:
| Model Family | Cache Discount |
|---|---|
| GPT-5 series | 90% off cached input tokens |
| GPT-4.1 series | 75% off cached input tokens |
| GPT-4o / o-series | 50% off cached input tokens |
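The effect of caching on input cost can be sketched as billing cached prefix tokens at the discounted rate. The example below uses the GPT-5-series figures from the tables above (a 90% discount on a $1.25-per-million input rate).

```python
# Input cost with prompt caching: cached prefix tokens are billed at a
# discount. Uses the GPT-5-series 90% discount from the table above.
def input_cost(total_tokens, cached_tokens, rate_per_m, discount=0.90):
    uncached = total_tokens - cached_tokens
    cached_part = cached_tokens * rate_per_m * (1 - discount)
    return (uncached * rate_per_m + cached_part) / 1_000_000

full = input_cost(50_000, 0, 1.25)         # no cache hit: 0.0625 USD
cached = input_cost(50_000, 40_000, 1.25)  # 40k-token prefix cached: 0.0175 USD
```

This is why the feature rewards consistent prompt prefixes: placing the stable system prompt and tool definitions first, and the variable user content last, maximizes the cacheable prefix.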
Several strategies can substantially reduce API spending:

| Strategy | Savings | Trade-off |
|---|---|---|
| Batch API | 50% on all tokens | Up to 24-hour latency |
| Prompt caching | 50-90% on repeated inputs | Requires consistent prompt prefixes |
| Smaller models (Mini/Nano) | 80-95% vs flagship | Lower capability on complex tasks |
| Fine-tuning | Reduced prompt length | Upfront training cost |
| Structured Outputs | Fewer retries | Slightly constrained output format |
As of March 2026, the OpenAI API is the most widely deployed commercial AI API, serving over one million organizations worldwide [1]. The March 2026 release of GPT-5.4 introduced OpenAI's first model with native computer-use capabilities and a context window exceeding one million tokens, enabling agents to operate computers and carry out complex workflows across applications [20].
Key trends shaping the API in early 2026 include longer context windows, falling per-token prices, and a growing focus on autonomous agents.
The API continues to evolve rapidly, with OpenAI releasing major model updates on a roughly quarterly cadence and expanding the platform's capabilities toward fully autonomous AI agents.