# GPT API

> Source: https://aiwiki.ai/wiki/gpt_api
> Updated: 2026-07-16
> Categories: Developer Tools, OpenAI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

The **GPT API** is the public HTTP interface that [OpenAI](/wiki/openai) exposes for programmatic access to its hosted language, vision, audio, image, and video models. The phrase has been used loosely since the GPT-3 beta in 2020 and remains the way most developers refer to the surface today, even though OpenAI's own marketing has shifted between "the API," "OpenAI API," and product-specific names like the [Responses](/wiki/responses_api) and [Chat Completions](/wiki/chat_completions_api) endpoints. In practice, "GPT API" covers everything served from `https://api.openai.com/v1/`, including text generation, embeddings, image generation with `gpt-image-1`, speech synthesis, transcription with [Whisper](/wiki/whisper), real-time voice over [WebRTC](/wiki/webrtc), moderation, file storage, fine-tuning, batch jobs, and webhooks.[3]

This article is the canonical reference for the API surface itself. "GPT API" and "[OpenAI API](/wiki/openai_api)" are widely used interchangeably; the OpenAI API article covers the same ground from a slightly broader angle (history, business model, ecosystem) while this article focuses on the endpoints, authentication, SDKs, pricing, and operational details. Specific models such as GPT-5, GPT-4o, [o3](/wiki/o3), and `gpt-image-1` have their own articles; the [ChatGPT](/wiki/chatgpt) consumer product is documented separately. For Microsoft's parallel offering see Azure OpenAI Service.

The API is what turned OpenAI from a research lab into a platform business. By 2024 it was the largest piece of the company's revenue mix, by 2025 it was the substrate underneath most enterprise [generative AI](/wiki/generative_ai) integrations, and by 2026 the shape of `/v1/chat/completions` had become a de facto industry standard that almost every other model provider tried to imitate. The history below traces how that happened, what each generation of the API looked like, what the current endpoints do, what they cost, what they can be replaced with, and where they have been quietly broken in ways that catch new developers off guard.

## History

OpenAI announced the API in private beta on June 11, 2020, framing it as the company's first commercial product and the home of [GPT-3](/wiki/gpt-3).[1][2][39] The launch post described a single "text-in, text-out" interface that any developer could call to perform virtually any English language task, with early access limited to companies that had been piloting the technology, including Algolia, Koko, MessageBird, Sapling, Replika, Casetext, Quizlet, and Reddit.[1] The waitlist stayed long for most of 2020 and 2021. OpenAI removed the waitlist in November 2021, opening access to anyone in supported countries with a credit card on file.

The early API only exposed a single endpoint, `POST /v1/completions`, which accepted a prompt string and returned a continuation.[1][3] Models were addressed by short names like `davinci`, `curie`, `babbage`, and `ada`, each available in base and instruction-tuned variants. The Embeddings endpoint shipped in January 2022, fine-tuning followed for the base GPT-3 series, and the moderation classifier joined the lineup in August 2022.[4] None of those primitives looked like an "API platform" yet; they looked like a single research model behind a credit card form.

The biggest structural shift came on March 1, 2023, when OpenAI launched the Chat Completions API at `POST /v1/chat/completions` alongside `gpt-3.5-turbo`.[4] The new endpoint took an array of role-tagged messages instead of a single prompt, and it became the dominant surface within months.[3] Pricing was deliberately aggressive: `gpt-3.5-turbo` undercut the older `text-davinci-003` by about 10x, which converted a generation of hobbyist projects into paid traffic almost overnight. Function calling arrived in the same endpoint on June 13, 2023, exposing structured tool descriptions through `functions` and `function_call` parameters.[10] OpenAI later renamed those parameters to `tools` and `tool_choice` in November 2023 to match the multi-tool model that GPT-4 Turbo supported.[4] Vision input followed on November 6, 2023, when GPT-4 Turbo with Vision became available, letting clients pass image URLs or base64 data inside chat messages.[4]

The rest of 2023 added Whisper transcription, [DALL-E](/wiki/dall-e) 3 image generation, and the original Assistants API, which introduced server-side conversation threads, runs, and file-search retrieval. None of those Assistants primitives stuck. OpenAI announced the Responses API on March 11, 2025, called Assistants "v1 beta," and started a one-year sunset clock that ends August 26, 2026.[7][26] The Assistants story is one of the few times OpenAI shipped a new abstraction that did not graduate; the developer feedback was that threads, runs, and steps were heavier than what most use cases needed, and the framework's lock-in to OpenAI-hosted state was uncomfortable for teams that wanted to keep conversation history in their own database.

2024 was an infrastructure year. Project-scoped API keys arrived in April 2024,[33] the [Batch API](/wiki/batch_api) launched the same month with a 50% discount,[13] the [Realtime API](/wiki/realtime_api) entered public beta on October 1, 2024,[8] prompt caching turned on automatically across the latest model snapshots on the same day,[12] and Structured Outputs guaranteed JSON schema conformance starting August 6, 2024.[11] The official .NET library went stable in October 2024,[30][31] the Go SDK shipped in July 2024,[28] and a Java SDK followed shortly after.[29] The Admin and Audit Log APIs landed on August 1, 2024, which is when OpenAI's offering started to feel like it could pass an enterprise security review without a wrapper layer.[4] The omni-moderation model replaced the older text-only classifier on September 26, 2024 and became free across the board.[25]

In 2025, OpenAI pushed the API toward agent workflows. The Responses API, the Conversations API, the Agents SDK for Python (March 2025) and TypeScript (June 2025), the computer-use tool, and the GA Realtime release on August 28, 2025 with the new `gpt-realtime` model all landed in that window.[7][9][23][35] Reinforcement fine-tuning became generally available,[34] webhooks went live,[18] the Codex CLI shipped as an open-source Rust binary in April 2025,[37] and a year of model releases (`o3`, `o4-mini`, `gpt-4.1`, `gpt-5`, `gpt-5.1`, `gpt-5.2`) kept the model picker churning.[32] The current state, as of mid-2026, is a multi-endpoint API that is still backward compatible with most 2023 client code while quietly rotating new traffic toward Responses and Conversations.

A handful of milestones worth pinning down because they show up in many third-party guides without dates:

| Date | Event |
|---|---|
| June 11, 2020 | API beta launches with [GPT-3](/wiki/gpt-3) and `/v1/completions`[1] |
| November 18, 2021 | Public availability, waitlist removed |
| January 25, 2022 | First Embeddings models shipped |
| March 1, 2023 | Chat Completions and `gpt-3.5-turbo` launch[4] |
| June 13, 2023 | Function calling lands in Chat Completions[10] |
| November 6, 2023 | GPT-4 Turbo with Vision goes GA in the API[4] |
| January 25, 2024 | `text-embedding-3-small` and `text-embedding-3-large` ship[36] |
| April 15, 2024 | Batch API launches with a 50% discount[13] |
| April 2024 | Project-scoped API keys roll out[33] |
| August 1, 2024 | Admin and Audit Log APIs released[4] |
| August 6, 2024 | Structured Outputs and `gpt-4o-2024-08-06` launch[11] |
| September 26, 2024 | `omni-moderation-latest` replaces text-only moderation[25] |
| October 1, 2024 | Realtime API public beta, prompt caching turns on, .NET SDK GA[8][12][31] |
| March 11, 2025 | Responses API and Agents SDK launch[7] |
| April 23, 2025 | `gpt-image-1` opens image generation in the API[24] |
| May 23, 2025 | Reinforcement fine-tuning GA on `o4-mini`[34] |
| August 20, 2025 | Conversations API launches[35] |
| August 26, 2025 | Assistants API deprecation announced[26] |
| August 28, 2025 | Realtime API GA with `gpt-realtime`[9] |
| March 24, 2026 | Sora discontinuation announced[27] |
| May 7, 2026 | Realtime beta interface retires[5] |
| May 12, 2026 | DALL-E 2 and DALL-E 3 retire[5] |
| August 26, 2026 | Assistants API sunset[26] |
| September 24, 2026 | Sora 2 video API sunset[27] |

## API surface and endpoints

All endpoints share the base URL `https://api.openai.com/v1/` and use JSON request and response bodies, with the exception of audio transcription (multipart form upload) and file uploads.[3] Every successful response returns a `200` status; errors use the standard 4xx and 5xx codes documented in the [OpenAI error codes](/wiki/openai_error_codes) reference.[19] Most endpoints accept a small set of optional headers in addition to `Authorization`: `OpenAI-Organization`, `OpenAI-Project`, `OpenAI-Beta` (used historically for Assistants and the original Realtime), and `Idempotency-Key` for safe retries on `POST` calls.[3]

### Current endpoints

| Endpoint | Path | Status | Purpose |
|---|---|---|---|
| Responses | `POST /v1/responses` | GA | Stateful, agent-oriented endpoint that combines chat, tools, web search, file search, code interpreter, and computer use in one call[7] |
| Chat Completions | `POST /v1/chat/completions` | GA | Stateless message-array endpoint, the de facto standard since 2023 |
| Completions | `POST /v1/completions` | Legacy | Single-prompt interface from the 2020 era, supported only by older base models |
| Embeddings | `POST /v1/embeddings` | GA | Vector embeddings via `text-embedding-3-small` and `text-embedding-3-large`[21][36] |
| Conversations | `POST /v1/conversations` | GA | Container API for long-running conversations used with Responses[35] |
| Images | `POST /v1/images/generations`, `/edits`, `/variations` | GA | Image generation with `gpt-image-1`; DALL-E 2 and 3 retire May 12, 2026[24][5] |
| Audio | `POST /v1/audio/speech`, `/transcriptions`, `/translations` | GA | Text-to-speech, [Whisper](/wiki/whisper) transcription, and audio translation[22] |
| Realtime | `wss://api.openai.com/v1/realtime`, WebRTC, SIP | GA | Low-latency speech-to-speech with `gpt-realtime`[9] |
| Moderations | `POST /v1/moderations` | GA | Free safety classifier, currently `omni-moderation-latest`[25] |
| Files | `POST /v1/files`, `GET /v1/files/{id}` | GA | Upload up to 512 MB per file, 2.5 TB per project[40] |
| Uploads | `POST /v1/uploads` | GA | Multipart uploads for files larger than 512 MB, up to 8 GB[40] |
| Fine-tuning | `POST /v1/fine_tuning/jobs` | GA | Supervised, DPO, and reinforcement fine-tuning[34] |
| Batch | `POST /v1/batches` | GA | Asynchronous bulk processing at a 50% discount[13] |
| Vector Stores | `POST /v1/vector_stores` | GA | Managed embedding indexes used by file search |
| Webhooks | `POST /v1/webhooks` | GA | Subscribe to batch, fine-tuning, response, and realtime events[18] |
| Models | `GET /v1/models` | GA | List models the caller can access |
| Usage and billing | `GET /v1/organization/usage`, `/v1/organization/costs` | GA | Programmatic usage and cost reporting |
| Audit logs | `GET /v1/organization/audit_logs` | GA | Admin API surface for compliance |
| Admin keys | `POST /v1/organization/admin_api_keys` | GA | Create keys for org-level automation[33] |
| Project API keys | `POST /v1/organization/projects/{id}/api_keys` | GA | Create project-scoped keys[33] |
| Assistants | `POST /v1/assistants` | Deprecated | Sunset August 26, 2026; migrate to Responses + Conversations[26][6] |
| Threads, Runs, Run Steps | `POST /v1/threads`, `/v1/threads/{id}/runs` | Deprecated | Same Assistants sunset window[26] |
| Videos (Sora) | `POST /v1/videos` | Sunsetting | Sora 2 API shuts down September 24, 2026[27] |
| Edits | `POST /v1/edits` | Removed | Folded into chat completions[5] |
| Search, Classifications, Answers, Engines | `POST /v1/{search,classifications,answers,engines}` | Removed | Shut down December 3, 2022[5] |

The Responses endpoint is the one OpenAI now points new projects at.[6][7] It is stateful by default, can carry tool state across turns when used with a Conversation object, and natively supports the four built-in tools: `web_search`, `file_search`, `code_interpreter`, and `computer_use`.[7] It also accepts remote MCP servers as tools, which is how OpenAI's documentation suggests integrating third-party data sources without writing custom function-calling glue.[3] Responses support background mode (`background: true`), which returns immediately with a job id and notifies the caller via webhook when the run is complete;[18] that pattern is the right way to handle deep research jobs and long-running computer-use sessions.

Chat Completions is not deprecated and OpenAI has been explicit that it will keep working.[6] In practice, most existing client code, the [LangChain](/wiki/langchain) integration, the [LlamaIndex](/wiki/llamaindex) integration, and almost every OpenAI-compatible third-party endpoint still target `/v1/chat/completions`, so the endpoint will likely outlive several model generations. The product team has said that new features will land in Responses first and may eventually appear in Chat Completions, but there is no committed sunset date.[6] The migration cost is real for any application that uses tools, since the tool call format shifts from `tool_calls` arrays inside an assistant message to typed output items, but for plain text completion the migration is essentially renaming `messages` to `input`.[6]

The legacy Completions endpoint is a different story. It only works with older base models like `gpt-3.5-turbo-instruct` and `babbage-002`, and OpenAI has flagged it as a candidate for retirement once the underlying models are deprecated.[5] Modern chat-tuned models including GPT-4o, GPT-5, and the [o-series](/wiki/o-series) reasoning models reject `/v1/completions` requests outright with a 400 error.[3] There is one ongoing use case for the old endpoint: certain logprob-style evaluations and zero-shot classification recipes still rely on its `logprobs` parameter, which the chat endpoint exposes only in a more limited form.[3]

### Removed and deprecated endpoints

The original `/v1/engines`, `/v1/search`, `/v1/classifications`, and `/v1/answers` endpoints from the GPT-3 beta were shut down on December 3, 2022, with migration guides pointing developers to the Embeddings and Completions endpoints.[5] The original `/v1/fine-tunes` endpoint shut down on January 4, 2024 in favor of `/v1/fine_tuning/jobs`.[5] The Edit endpoint, `/v1/edits`, was removed in 2023 and its use cases moved into chat completions. Each of these removals followed OpenAI's standard pattern: at least six months of advance notice in the [Deprecations](https://developers.openai.com/api/docs/deprecations) page, a migration guide with side-by-side examples, and a hard cutoff after which requests return 404.[5]

The Assistants API is the largest deprecation in flight. OpenAI announced on August 26, 2025 that Assistants would be removed exactly one year later, on August 26, 2026.[26] Migration paths point to Responses (for the model interaction itself) and Conversations (for the thread-like state container that Assistants exposed).[6][35] The migration guide is one of the few times OpenAI has shipped detailed side-by-side examples instead of a one-line deprecation notice, which suggests the team understands the cost of the change for production deployments.[6] The Realtime beta interface was also retired May 7, 2026 in favor of the GA Realtime contract, which is similar but not byte-compatible.[5][9]

DALL-E 2 and DALL-E 3 retire on May 12, 2026.[5] Sora 2 video generation, which had a brief life from late 2025 through early 2026, is scheduled to shut down on September 24, 2026 after OpenAI announced it was discontinuing the product on March 24, 2026.[27] The text moderation models (`text-moderation-latest`, `text-moderation-stable`) retired October 27, 2025, with omni-moderation as the replacement.[5][25] Several model snapshots also have hard sunset dates: legacy GPT-3.5 Turbo and GPT-4 variants are scheduled for October 23, 2026, with `gpt-4.1-mini` and `gpt-4.1` as the recommended replacements; the original `o1` series is going away the same day, with `o3` as the migration target.[5]

## Authentication and key management

Every API call carries an `Authorization: Bearer <API_KEY>` header.[3] There is no OAuth flow for first-party applications, no JWT exchange, and no signed URL scheme. The bearer token is the only thing that sits between a client and the model, which keeps the surface simple and makes any leak immediately catastrophic. OpenAI scans GitHub and a few other public surfaces for leaked keys and revokes them automatically when they appear, but the gap between the leak and the revocation is long enough for someone to drain a credit balance.

OpenAI introduced project-scoped API keys in April 2024 and has been steering developers off the older organization-wide "user keys" since.[33] A project key is bound to a single project inside an organization, and the project itself carries usage limits, member lists, and rate-limit settings.[33] Compromising a project key cannot reach data or billing in another project, which makes the model better suited to multi-tenant SaaS deployments than the old shared keys. Most enterprises now create one project per environment (development, staging, production) and a separate project per major product line, so a key leak is contained both blast-radius-wise and audit-wise.

A typical request looks like:

```
curl https://api.openai.com/v1/responses \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5", "input": "hello"}'
```

Clients that belong to several organizations can disambiguate with two optional headers: `OpenAI-Organization` for the organization id and `OpenAI-Project` for the project id.[3] Both default to whatever the API key is bound to. Older user-level keys still work but are scheduled for eventual deprecation; OpenAI has not committed to a specific date but has been moving the dashboard's defaults toward project keys for over a year.[33]

Service accounts let an organization or project own a key that is not tied to an individual user account.[33] Service accounts work like normal project keys but survive employee turnover, which is the kind of detail that only matters until the day a developer leaves and a key has to be rotated under pressure. Admin API keys are a separate class managed at the organization level. They cannot call inference endpoints; instead, they unlock the audit logs, project management, and user provisioning APIs that OpenAI shipped alongside the project key system in August 2024.[33] Admin keys are how SCIM-style provisioning, automated key rotation, and compliance reporting get built on top of the platform.

OpenAI also supports SSO with SAML, JIT user provisioning, and SCIM for organizations on the enterprise plan. Those mechanisms govern who can log in to the dashboard and create keys; they do not change how the API itself authenticates a request, which remains the bearer token model.

## Official SDKs and libraries

OpenAI maintains first-party libraries for most major languages, all generated from a shared OpenAPI specification. Generating from a single spec is what keeps feature support roughly synchronized across SDKs, although the Python and Node libraries usually pick up new endpoints first.

| SDK | Repository | First release | Notes |
|---|---|---|---|
| Python | `openai/openai-python` | 2020 | Reference implementation, ships with async, streaming, retry, and Pydantic-typed responses |
| Node.js / TypeScript | `openai/openai-node` | 2020 | Edge-runtime compatible, used by most JavaScript and Bun projects |
| .NET | `openai/openai-dotnet` | June 2024 (beta), October 2024 (GA)[30][31] | Built in collaboration with Microsoft, full Assistants v2 and chat support |
| Go | `openai/openai-go` | July 2024[28] | Requires Go 1.22+, official replacement for the popular community `sashabaranov/go-openai` |
| Java | `openai/openai-java` | 2024 (beta), 2025 (GA)[29] | Maven coordinates `com.openai:openai-java`, current major version 4.x |
| Agents SDK (Python) | `openai/openai-agents-python` | March 2025[7] | Higher-level orchestration on top of the Responses API |
| Agents SDK (TypeScript) | `openai/openai-agents-js` | June 2025 | Same surface for Node/Deno/Bun |

The community ecosystem is enormous. `sashabaranov/go-openai` predated the official Go SDK by more than a year and still has more downloads. `OkGoDoIt/OpenAI-API-dotnet` was the most popular .NET option before Microsoft and OpenAI shipped the official package. Spring AI, LangChain4j, simple-openai, and the various "OpenAI for Rust" crates remain widely used. There is no official Rust SDK, although the Codex CLI itself is written in Rust and contains a usable client. PHP, Ruby, Swift, Kotlin, and Elixir all have well-maintained community libraries; OpenAI links to a curated list on the [SDKs and CLI](https://developers.openai.com/api/docs/libraries) page.

Most SDKs default to reading the API key from the `OPENAI_API_KEY` environment variable, expose synchronous and asynchronous variants, and stream chunks back as iterators or async iterables. They all wrap the same HTTP API, so downgrading to a raw `curl` or [fetch](/wiki/fetch) call is straightforward when something does not work. The official libraries also share a few quality-of-life features: automatic retries with exponential backoff on retryable errors, configurable timeouts, structured error types, and request-level idempotency keys. The Python and Node SDKs additionally expose typed event streams for the Responses and Realtime APIs, which is what the Agents SDK builds on.[38]

The Agents SDK deserves a separate mention. It is not strictly necessary; everything it does can be coded directly against Responses and Conversations. What it provides is opinion: a `Runner` class that executes the agent loop, a `Handoff` primitive for delegating to other agents, a `Tracing` integration that logs every step to the dashboard, and a `Guardrail` system for input and output validation.[7] For teams that are starting from scratch on a multi-agent workflow, it removes a few hundred lines of glue code. For teams that already have an agent harness, the Responses API is usable directly without it.

## Pricing models

Pricing is per-token for text and per-second or per-megabyte for audio, image, and video. The published rates change often, and the [pricing page](https://openai.com/api/pricing/) is the authoritative source.[15] The structure has been more stable than the numbers, which makes it worth understanding the categories rather than memorizing the rates.

The **standard** rate is what most developers pay. Each model has a per-million input rate and a per-million output rate, with output charged at a 4x to 8x multiple of input on the flagship models.[15] As of mid-2026, GPT-5 is $1.25 per million input tokens and $10 per million output tokens; GPT-5.2 sits at roughly $1.75 input and $14 output; the smaller `gpt-4.1-mini`, `gpt-4o-mini`, and `o4-mini` price an order of magnitude lower.[15] Reasoning tokens generated by o-series models are billed as output tokens even when they are not visible in the response, which means a request that prints fifty tokens of visible text can still bill for fifty thousand tokens of reasoning.[15]

**Cached input tokens** are billed at 10% of the standard input rate when prompt caching applies.[12][15] Caching is automatic for prompts of at least 1,024 tokens, kicks in across calls within a few minutes, and was rolled out on October 1, 2024.[12] The 90% discount can be the difference between a profitable RAG system and an unprofitable one, so most production stacks now structure prompts so that the static portion (system prompt, tool schemas, retrieved documents) sits at the front of the message array where it can be cached. The cache is per-organization, lives on a single inference cluster for several minutes, and refreshes lazily on hits.[12] There is no API to manage cache entries; they are populated and evicted automatically.[12]

The **Batch API** trades latency for a flat 50% discount on every model.[13] A batch is a JSONL file uploaded to the Files API and submitted to `/v1/batches`; OpenAI guarantees results within 24 hours and frequently returns them in minutes.[13] Batch traffic does not count against the synchronous rate limits, which makes it the standard way to run nightly evaluations or large embedding jobs without throttling production traffic.[13][14] Webhooks fire on completion.[18] The discount applies to input, output, and even cached tokens, and stacks with prompt caching when the batch contains repeated prefixes.[13]

**Priority processing** is a pay-as-you-go premium tier introduced in 2025.[17] The headline rate is roughly 1.5x to 2x the standard input and output prices, and in exchange OpenAI promises lower latency, fewer 503s during peak hours, and a separate rate-limit pool.[17] It is enabled per request via a `service_tier` parameter and is positioned as the right choice for latency-sensitive consumer features.[17] Priority is still pay-as-you-go and does not require a contract, which sets it apart from Scale Tier.[16][17]

**Scale Tier** is the enterprise version. Customers buy "token units" (a fixed number of input and output tokens per minute) for a single model snapshot, with a 30-day minimum commitment, dedicated capacity, and a 99.9% uptime SLA.[16] Pricing is custom and contractual rather than self-serve.[16] The same model is available to Azure customers as Provisioned Throughput Units, although the unit math differs. Scale Tier customers also get earlier access to new model snapshots and to capacity guarantees during launches, which is a non-trivial advantage when a flagship model launches and the standard tier rate-limits everyone for a week.[16]

**Region-specific endpoints** for data residency add a 10% surcharge for models released after March 5, 2026.[15] The list of supported regions has grown gradually, and includes EU, UK, Japan, Korea, Canada, India, and Australia at the time of writing. Regional traffic is processed inside the region; usage and billing remain global.

The **free moderation tier** is the only place where calls are not metered. The omni-moderation model is free to use through `/v1/moderations`, with rate limits that scale with the caller's usage tier.[25] This is a deliberate policy choice on OpenAI's side: free moderation removes the financial reason for developers to skip safety filtering on user input.[25]

A simple worked example may help calibrate expectations. A chatbot that processes 1,000 requests per day with an average of 2,000 input tokens and 500 output tokens on GPT-5 costs about $2.50 + $5.00 = $7.50 per day at standard rates, before caching.[15] Most of that input is a static system prompt and is cached after the first call, dropping the input portion to $0.25 per day, for a total around $5.25 daily.[12] Running the same workload on the Batch API would cut another 50% off, but only matters if 24-hour latency is acceptable.[13] Add Priority processing during business hours and the math gets more complicated; most teams end up running mixed-tier strategies where production user traffic is on Priority and offline analytics are on Batch.

## Rate limits and tiers

Rate limits are dimensional, per-model, and per-organization. The dimensions are RPM (requests per minute), TPM (tokens per minute), RPD (requests per day), TPD (tokens per day) for some models, IPM (images per minute) for image endpoints, and audio minutes per minute for streaming audio.[14] Hitting any single dimension returns a 429.[14]

There are six **usage tiers**: Free, Tier 1, Tier 2, Tier 3, Tier 4, and Tier 5. Tier promotions are automatic and happen on a combination of cumulative paid spend and account age:[14]

| Tier | Qualification | Indicative scale |
|---|---|---|
| Free | New accounts, limited models | A few requests per minute |
| Tier 1 | Any payment method on file | 500 RPM and 30,000 TPM on GPT-4o; ~500k TPM on GPT-5 |
| Tier 2 | $50 cumulative paid + 7 days | About 5x Tier 1 |
| Tier 3 | $100 cumulative paid + 7 days | About 10x Tier 1 |
| Tier 4 | $250 cumulative paid + 14 days | Several million TPM on GPT-5 |
| Tier 5 | $1,000 cumulative paid + 30 days | The published ceiling for self-serve |

For accounts that need more than Tier 5 provides, the path forward is Scale Tier, Priority processing, or a direct conversation with OpenAI's sales team.[16][17] Limits are visible in the dashboard under Settings, then Limits, which lists the cap for every dimension on every model that the account can use.[14] Limits at the project level are set independently and act as ceilings underneath the organization-wide caps.[14][33] That nesting is useful for protecting a flagship model's quota from being burned by a runaway development project; setting a low project cap on a development project means the production project always has its full share.

Access to specific models is also tier-gated. The computer-use tool was originally restricted to Tier 3 and above,[23] the o-series reasoning models had similar gating during their preview windows, and brand-new model snapshots typically start with reduced limits while OpenAI watches for abuse patterns. Reinforcement fine-tuning was originally Tier 4 and up.[34] The pattern is consistent enough that "wait a week and try again" is a reasonable workaround when a feature is gated above the current tier.

The Batch API has its own pool.[13] Each batch request counts against the daily batch quota in tokens (typically several billion per day on the high-volume tiers), but does not consume synchronous TPM or RPM.[13][14] That separation is the main reason large embedding pipelines and offline evaluations move to batch even when the 24-hour window is not strictly necessary.

The 429 response carries useful headers. `x-ratelimit-limit-tokens`, `x-ratelimit-remaining-tokens`, `x-ratelimit-limit-requests`, and `x-ratelimit-remaining-requests` show the current quota and what is left.[14] `retry-after` (or its sibling `retry-after-ms`) indicates the wait time in seconds.[14] Honoring the header is faster and friendlier than guessing, and it is the only way to avoid a tight retry loop that just keeps blowing through the rate limit and consuming budget on failed requests.[20]

## Tool use and function calling

Function calling, which is what OpenAI now calls "tool use," is the mechanism that lets a model decide to call an external function and pass it structured arguments. It launched on June 13, 2023 inside Chat Completions, originally as `functions` and `function_call` parameters.[10] The November 2023 update generalized those parameters into `tools` and `tool_choice`, allowing several tool types in a single call.[3][4]

A tool definition is a JSON Schema describing the function name, description, and parameters.[3] The model returns a `tool_calls` array containing the chosen function and a JSON arguments string; the client executes the function locally, sends the result back as a `tool` role message, and the model continues.[3][10] Multiple tool calls per turn are supported, including parallel calls when the model judges them independent. The `parallel_tool_calls` parameter (default `true`) lets the developer turn that off when ordering matters, which is common in workflows where tool A's output feeds tool B's input.[3]

In the Responses API, tools are first-class items in the input array and tool calls are emitted as discrete output items rather than wrapped inside a chat message.[6] The Responses surface also supports OpenAI's **built-in tools** that run on OpenAI infrastructure rather than the client:[7]

- `web_search` performs live web searches and grounds the response in citations.
- `file_search` retrieves from a vector store the developer has populated with uploaded files.
- `code_interpreter` runs Python in a sandboxed container.
- `computer_use` drives a virtual browser or desktop, paired with a `computer-use-preview` model.[23]
- Remote MCP servers expose any [Model Context Protocol](/wiki/model_context_protocol) server as a tool the model can call.

Tools are charged for what they cost OpenAI to run. Web search calls are billed per query (typically a few cents per call), code interpreter sessions per minute, and computer-use turns at a higher rate than ordinary tokens because they include both reasoning and execution.[15] Built-in tools also count against per-tool rate limits; web search has its own QPS cap that is independent of the model's TPM.[14]

The choice between custom function calling and built-in tools is a build-or-buy question. Custom functions give complete control over what the tool does, where it runs, and how it logs. Built-in tools save the integration work but lock the application to OpenAI's implementation. Most production agents end up with a mix: built-in `web_search` and `code_interpreter` because they are hard to replicate cleanly, and a long tail of custom tools that hit the application's own database, internal services, and proprietary APIs.

### Structured Outputs

Structured Outputs guarantees that the model's text output exactly matches a developer-supplied JSON Schema.[11] It launched on August 6, 2024 alongside `gpt-4o-2024-08-06`, which was the first model trained to handle complex schemas, and OpenAI also added a constrained decoding path so the guarantee is engineering-backed rather than just a model behavior.[11] Structured Outputs is enabled with `response_format: { type: "json_schema", json_schema: ... }` on Chat Completions, or `text.format` on Responses.[3][11] It also works on tool definitions, which is how most production agents now describe their tools because it removes the entire class of "the model returned almost-valid JSON" bugs.[11]

The supported schema features are a subset of full JSON Schema. Required: `string`, `number`, `boolean`, `integer`, `array`, `object`, `enum`, `anyOf`. Not supported: `oneOf`, `allOf`, conditional schemas, recursive references with `$ref` to anywhere outside the document. There is also a hard cap on schema depth (5 levels of nesting) and total property count (100).[11] Schemas that exceed those limits return a 400 at request time, not silently at response time, which makes the failure mode obvious during development.[11][19]

The older **JSON Mode** (`response_format: { type: "json_object" }`) still works and is supported on more models, but it only guarantees that the response parses as JSON; the schema constraint is on the developer to enforce.[11] JSON Mode predates Structured Outputs by about ten months and remains useful when the application needs free-form JSON whose shape changes per call. For everything else, Structured Outputs is the better default.

## Streaming and Server-Sent Events

Most endpoints support streaming.[38] For Chat Completions, setting `stream: true` makes the response a stream of [Server-Sent Events](/wiki/server-sent_events) (SSE), each carrying a `delta` object with the next chunk of content, tool call arguments, or finish reason.[38] The stream terminates with a `data: [DONE]` line.[38]

Responses uses a richer event model. The response is still SSE, but each event has a typed name like `response.created`, `response.output_text.delta`, `response.output_item.added`, `response.completed`, or `error`.[38] That extra structure is what lets agent frameworks render reasoning steps, tool calls, and final text separately without parsing inline JSON. The Conversations API uses the same event types when responses are streamed back through it.[35]

Streaming reduces time-to-first-token from several seconds to under one second on most models, which is critical for chat-style interfaces. It does not reduce total cost, since billing is by token regardless of how the tokens are delivered.[15] Most SDKs hide the SSE plumbing behind an async iterator: in Python, `for chunk in client.chat.completions.create(..., stream=True)`; in Node, `for await (const chunk of stream)`.[38]

There are a few common pitfalls. Streaming hides errors that occur mid-response: a 500 error returned after some tokens have been emitted will appear as a truncated stream rather than as a clear failure, so production code needs to distinguish "stream ended with `[DONE]`" from "stream ended without a finish reason." Backpressure is another concern; consumers that cannot keep up with the stream can cause the connection to back up and eventually time out. Buffering chunks into larger updates before passing them to a UI usually solves both problems.

For Responses, the `include` parameter controls which event types the server emits.[38] By default, all events are sent. Setting `include: ["response.output_text.delta", "response.completed"]` keeps the bandwidth down for clients that only need the final text and does not need to render intermediate states.

## Vision, audio, and multimodal input

The API has supported image input on chat models since November 2023.[4] The image is passed as either a public URL or a base64-encoded data URI inside a chat message; clients control fidelity with a `detail` parameter (`low`, `high`, or `auto`).[3] Image tokens are billed alongside text tokens and the cost scales with resolution and detail level.[15] The supported formats are PNG, JPEG, GIF (first frame only), and WebP.[3] Maximum image size is 20 MB per image at the API level; the model's effective resolution cap depends on the model, with most flagship models accepting images up to about 2048 by 2048 pixels at high detail.[3]

Audio input arrived with `gpt-4o-audio-preview` in 2024 and went GA across the GPT-5 family in 2025.[22] Models accept a base64 WAV or MP3 inside a message and can return audio output the same way.[22] The Realtime API uses the same models with a streaming transport.[8] For non-realtime audio chat, the latency is comparable to text chat plus the time to upload the audio file, which is usually a few hundred milliseconds for clips under a minute.

The Audio endpoints are separate. `POST /v1/audio/speech` does text-to-speech with `tts-1`, `tts-1-hd`, and the newer `gpt-4o-mini-tts` family, supporting voices alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse, marin, and cedar, in mp3, opus, aac, flac, wav, and pcm.[22] The newer voices support style instructions ("speak slowly," "sound enthusiastic") embedded in the input text, which the older `tts-1` voices ignore.[22] `POST /v1/audio/transcriptions` does speech-to-text with `whisper-1` and the newer `gpt-4o-transcribe` family, accepting mp3, mp4, mpeg, mpga, m4a, wav, and webm files up to 25 MB.[22] `POST /v1/audio/translations` does the same but always returns English.[22]

Image generation moved off DALL-E and onto `gpt-image-1` after April 23, 2025, and onto `gpt-image-1.5` later that year.[24] The endpoint is `POST /v1/images/generations`, with `/edits` and `/variations` for image-to-image work.[3] Output is returned as base64 by default or as a URL with a short expiry.[3] The images endpoint accepts a `size` parameter (1024x1024, 1024x1536, 1536x1024, and a few smaller variants), a `quality` parameter (`low`, `medium`, `high`), and a `style` parameter for some variants.[3] Pricing is per generated image and varies with size and quality.[15] The same model is also accessible from inside Chat Completions and Responses via the `image_generation` tool, which lets a single conversational call mix text reasoning with image creation.[24]

The Files API handles uploads of documents, images, and audio for use across endpoints.[40] Individual files can be up to 512 MB, and each project can store up to 2.5 TB of files in total.[40] Uploads to this endpoint are rate-limited to 1,000 requests per minute per authenticated user.[40] The Uploads API handles files up to 8 GB by accepting them in multiple parts and assembling a final File object on completion.[40]

## Realtime API

The Realtime API solves a problem that the standard request-response surface cannot: low-latency, full-duplex voice. It launched in public beta on October 1, 2024 with WebSocket as the only transport, added WebRTC support in December 2024, and went generally available on August 28, 2025 alongside the new `gpt-realtime` model.[8][9] SIP is supported as a third transport, which is what enables direct phone integration without a media server in the middle.[9]

A Realtime session is an open connection that streams audio frames in both directions plus a control channel of typed JSON events.[8] The model can interrupt itself when it detects the user starting to speak, emit transcripts of both sides, call tools mid-sentence, and switch voices on the fly.[9] Image input is supported during a session, which is how voice agents can answer questions about something the user is showing on camera.[9] Remote MCP servers can be wired in as tools, the same as in Responses.[9]

The transport choice matters. WebSockets are the simplest to implement from a server-side application but introduce 200 to 500 ms of round-trip latency on top of the model latency. WebRTC adds a media stack and STUN/TURN configuration but cuts the network latency to under 100 ms, which is what lets a voice agent feel conversational rather than walkie-talkie-like. SIP is the choice when integrating with an existing phone system; the API accepts an inbound call directly and can dial out as well.[9]

Pricing is per audio token rather than per second, with a separate input and output rate.[15] The GA `gpt-realtime` is roughly $32 per million audio input tokens and $64 per million audio output tokens, about 20% cheaper than the preview version.[9][15] There is also a less expensive `gpt-realtime-mini` variant for use cases where the full model is overkill.[15] The original beta interface was retired May 7, 2026 in favor of the GA contract.[5]

Server-side controls let an application supervise an in-progress session: redact transcripts before they reach the model, cancel a response mid-generation, override the voice or speaking style, and inject system messages without restarting the connection.[9] Webhooks deliver `realtime.call.incoming` events for inbound SIP calls and `realtime.call.completed` for session summaries, which is how production deployments wire up call-center workflows.[9][18]

## Responses API versus Chat Completions

The two endpoints look superficially similar but model the world differently. Chat Completions is **stateless**: every request carries the entire message array, the server processes it, and nothing is retained.[6] Responses is **stateful by default**: the server stores the conversation history and tool state, the next request only carries the new turn, and reasoning context can persist across calls.[6][35]

The practical differences:

| Aspect | Chat Completions | Responses |
|---|---|---|
| State | Stateless, client owns history | Stateful, optional `store: false` for stateless mode |
| Input shape | `messages: [...]` | `input: [...]` plus `previous_response_id` |
| Tools | `tools: [...]` with custom JSON schemas | Same plus built-in `web_search`, `file_search`, `code_interpreter`, `computer_use`, MCP[7] |
| Reasoning | o-series reasoning tokens are billed but not surfaced cleanly | Reasoning items are first-class output items |
| Streaming | Generic delta events | Typed events (`response.output_text.delta`, etc.)[38] |
| Conversation containers | None, client-side only | Conversations API holds threads[35] |
| Background jobs | Not supported | `background: true`, webhook on completion[18] |
| Migration story | None needed | Responses can call the same models[6] |

OpenAI has been clear that new projects should start on Responses, but it has been equally clear that Chat Completions is not deprecated.[6] The two endpoints will likely coexist for years, in the way that the old `/v1/completions` endpoint coexisted with `/v1/chat/completions` even after the latter became the obvious choice. For most teams, the migration question comes down to whether the application benefits from server-side state and the built-in tools. Pure RAG over a developer-managed index? Chat Completions is fine. Multi-step agent that uses web search, runs code, and remembers what it did three turns ago? Responses pays back the migration cost.

The Conversations API sits underneath Responses as the thread-like container.[35] A `Conversation` object is a server-side bundle of message history that several Responses calls can attach to, and it is the closest replacement for the Assistants `Thread` object.[35] Conversations are not strictly required; an application can pass `previous_response_id` to chain calls without ever creating a Conversation.[35] They become useful when several agents collaborate on the same conversation, when a conversation needs to span sessions or devices, or when the application wants OpenAI to handle context compaction automatically.[35]

## Webhooks

Webhooks let OpenAI push events to a URL the developer controls instead of forcing the client to poll. The webhook system shipped in 2025 and uses the [Standard Webhooks](/wiki/standard_webhooks) specification, which means HMAC-SHA256 signatures sent in a `webhook-signature` header in the format `v1,base64_encoded_signature`.[18] Developers configure a webhook with a name, a public HTTPS endpoint, and a list of subscribed event types.[18] OpenAI generates a signing secret on creation that is shown exactly once.[18]

Supported event categories include:[18]

- Background responses: `response.completed`, `response.cancelled`, `response.failed`
- Batch jobs: `batch.completed`, `batch.cancelled`, `batch.expired`, `batch.failed`
- Fine-tuning: `fine_tuning.job.succeeded`, `fine_tuning.job.failed`, `fine_tuning.job.cancelled`
- Realtime call events: `realtime.call.incoming`, `realtime.call.completed`
- Deep research jobs and similar long-running operations

For batch and fine-tuning workloads, webhooks are the difference between a polling loop that may run for hours and a single push event that arrives the moment the job is done.[13][18] For Realtime calls coming in over SIP, the `realtime.call.incoming` webhook is what lets a backend route the call to a session it has prepared, the same pattern that traditional telephony platforms use.[9][18]

The signature verification model is straightforward: concatenate the message id, timestamp, and body, HMAC with the secret, base64 encode, and compare with the signature header.[18] OpenAI's documentation includes copy-paste implementations for Python, Node, and Go.[18] Most teams reuse a Standard Webhooks library rather than rolling their own.

## Compatibility and OpenAI-compatible endpoints

The shape of `/v1/chat/completions` has become a de facto industry standard. [Anthropic](/wiki/anthropic), [Google Gemini](/wiki/google_gemini), [Mistral](/wiki/mistral), [Cohere](/wiki/cohere), [Together AI](/wiki/together_ai), [Fireworks AI](/wiki/fireworks_ai), [Groq](/wiki/groq), [OpenRouter](/wiki/openrouter), [vLLM](/wiki/vllm), [Ollama](/wiki/ollama), [LM Studio](/wiki/lm_studio), and most other LLM serving platforms expose either an OpenAI-compatible endpoint or a near-compatible one. Most SDKs let clients swap in a different `base_url` and an alternate API key, then talk to a non-OpenAI backend with the same code. The compatibility is rarely complete: tool calling formats drift, streaming event shapes differ in the details, and provider-specific parameters (cache control, thinking budgets, safety modes) get added or removed at the edges. For migrations, the gap is small enough that most code works without changes for the chat endpoint and large enough that anything using Responses, Realtime, or built-in tools needs a rewrite.

Azure OpenAI Service is the closest thing to OpenAI's own API but is not byte-compatible. Azure uses an `api-key` header instead of `Authorization: Bearer`, requires a deployment name in the path instead of exposing models directly, and lags on new endpoints by weeks or months. The Responses API arrived on Azure several months after OpenAI's launch; Sora and computer-use lagged similarly. For developers who need to dual-target both, the official Azure SDKs handle the differences, and the OpenAI SDKs can usually be pointed at Azure with a custom base URL plus a small adapter. Azure also supports its own concept of capacity reservation (Provisioned Throughput Units) that does not match OpenAI's Scale Tier exactly, so cross-cloud comparisons require care.

Local-first runtimes like Ollama and LM Studio also implement the chat completions surface. The trade-off is the usual one: local models are private and free at the marginal call but lag the closed models on capability, and they almost never implement Responses, Realtime, or the built-in tools. For development, prototyping, and offline use cases, the local OpenAI-compatible servers are excellent. For production, most teams use them as a fallback or for non-sensitive batch workloads rather than as the primary path.

A handful of router services (LiteLLM, OpenRouter, Helicone, Portkey) sit in front of the API and offer a single OpenAI-shaped endpoint that fans out to multiple providers. Those routers usually add their own observability, retry, caching, and key management, and they are the standard answer for organizations that want to A/B test model providers without rewriting their application code.

## Errors and best practices

The API uses the standard HTTP status codes.[19] The ones that show up most often:

| Status | Meaning | Recommended action |
|---|---|---|
| 200 | Success | Process the response |
| 400 | Bad request, schema or parameter error | Fix the client; do not retry |
| 401 | Invalid or missing API key | Check the key, do not retry |
| 403 | Region not supported, or access not granted | Confirm region and account status |
| 404 | Model or resource not found | Verify the model name, especially after a deprecation |
| 408 | Request timed out | Retry with backoff, consider streaming |
| 409 | Conflict, common on idempotency-key reuse | Use a different idempotency key |
| 413 | Payload too large | Trim the request or use the Uploads API |
| 422 | Unprocessable entity | Fix the input |
| 429 | Rate limit exceeded | Honor the `retry-after` header, exponential backoff with jitter[20] |
| 500 | Internal server error | Retry with backoff |
| 503 | Service unavailable, model overloaded | Retry with backoff, consider a fallback model |
| 529 | Overloaded (rare on OpenAI; common on competitor compatible endpoints) | Retry with backoff |

Beyond the status code, every error response carries a JSON body with `error.type`, `error.code`, `error.param`, and `error.message`.[19] The `code` field is the machine-readable identifier and is what production code should switch on; common values include `invalid_api_key`, `insufficient_quota`, `model_not_found`, `context_length_exceeded`, `tokens_exceeded`, and `rate_limit_exceeded`.[19] The `message` field is human-readable and changes wording occasionally, so matching on it is brittle.

The best-practice patterns OpenAI recommends in its cookbook have been stable for years:[20]

1. Use **exponential backoff with jitter** on 429, 500, and 503 responses. The `retry-after` header on a 429 is authoritative; respecting it is usually faster than guessing.[20]
2. **Stream long responses.** A streamed response delivers tokens as they are generated, which makes user-facing latency feel like the time-to-first-token rather than the time-to-completion.[38]
3. **Set explicit `max_completion_tokens` and timeouts.** Both protect against runaway reasoning and silently expensive calls.
4. **Keep the static portion of the prompt at the front** so it can be cached. Reordering a large system prompt or tool schema to the start of the message array can drop a per-call cost by 50% or more once cache hits warm up.[12]
5. **Use the Batch API for offline jobs.** A nightly evaluation that does not need real-time results is half the price and runs against its own quota.[13]
6. **Pin model snapshots in production.** Aliases like `gpt-5` resolve to whatever snapshot is current; pinning to `gpt-5-2026-04-15` (or whatever snapshot the application was tested against) keeps behavior reproducible across deploys.
7. **Validate Structured Outputs against the schema you sent.** The guarantee is strong but defensive validation catches the rare miss and any client-side schema drift.[11]
8. **Rotate keys quickly.** Project keys make this easier than the old org keys did, but a forgotten key in a public repository is still the most common security incident.[33]
9. **Watch the 24-hour batch window.** A batch submitted at 23:59 UTC will not start counting against the next day's quota until it completes.[13]
10. **Subscribe to the Changelog.** OpenAI ships changes weekly, deprecations are usually announced 6 to 12 months in advance, and the [Changelog](https://developers.openai.com/api/docs/changelog) and [Deprecations](https://developers.openai.com/api/docs/deprecations) pages are the source of truth.[4][5]

A few additional items show up often enough in incident postmortems to be worth listing:

- **Always set a request timeout.** SDK defaults are often longer than what an end-user-facing application can tolerate. A 60-second client-side timeout with a graceful fallback message is better than a hung connection.
- **Never log full request or response bodies in production** unless they are also redacted. Prompts and outputs contain user data that has its own privacy and retention rules.
- **Watch the per-model context window.** Hitting `context_length_exceeded` is a 400 not a 429, so it does not retry; it surfaces immediately to the user.[19] Token-counting before sending is cheap insurance.
- **Treat the model as eventually consistent.** A new fine-tuned model or a new vector store may not be fully available across all regions for a few minutes after creation. Production code that creates resources and immediately uses them should retry on 404 for a few seconds.
- **Use idempotency keys on POSTs that have side effects** (file uploads, batch creation, fine-tuning jobs). The retries on transient errors are worth more when the server can deduplicate.[3]

OpenAI publishes a status page at `status.openai.com` that lists incidents per service. The API generally has 99.9% availability across a quarter, but individual models have noticeably worse incident rates during launch weeks. Building a fallback path to a different model (or a different provider through one of the OpenAI-compatible routers) is the single most effective way to keep an application up during those windows.

## See also

- [OpenAI](/wiki/openai)
- [OpenAI API](/wiki/openai_api)
- [Chat Completions API](/wiki/chat_completions_api)
- [Responses API](/wiki/responses_api)
- [Assistants API](/wiki/assistants_api)
- [Realtime API](/wiki/realtime_api)
- [Batch API](/wiki/batch_api)
- [Embeddings](/wiki/embeddings)
- [Function calling](/wiki/function_calling)
- [Structured Outputs](/wiki/structured_outputs)
- [Prompt caching](/wiki/prompt_caching)
- [Fine-tuning](/wiki/fine-tuning)
- [GPT-5](/wiki/gpt-5)
- [GPT-4o](/wiki/gpt-4o)
- [Whisper](/wiki/whisper)
- [GPT-image-1](/wiki/gpt-image-1)
- [Sora](/wiki/sora)
- [Codex](/wiki/codex)
- [Azure OpenAI Service](/wiki/azure_openai_service)
- [Model Context Protocol](/wiki/model_context_protocol)

## References

1. "OpenAI API." OpenAI, June 11, 2020. https://openai.com/index/openai-api/
2. "OpenAI launches an API to commercialize its research." VentureBeat, June 11, 2020. https://venturebeat.com/technology/openai-launches-an-api-to-commercialize-its-research/
3. "API Reference." OpenAI Platform. https://platform.openai.com/docs/api-reference
4. "Changelog." OpenAI Developers. https://developers.openai.com/api/docs/changelog
5. "Deprecations." OpenAI Developers. https://developers.openai.com/api/docs/deprecations
6. "Migrate to the Responses API." OpenAI Developers. https://developers.openai.com/api/docs/guides/migrate-to-responses
7. "New tools for building agents." OpenAI, March 11, 2025. https://openai.com/index/new-tools-for-building-agents/
8. "Introducing the Realtime API." OpenAI, October 1, 2024. https://openai.com/index/introducing-the-realtime-api/
9. "Introducing gpt-realtime and Realtime API updates for production voice agents." OpenAI, August 28, 2025. https://openai.com/index/introducing-gpt-realtime/
10. "Function calling and other API updates." OpenAI, June 13, 2023. https://openai.com/index/function-calling-and-other-api-updates/
11. "Introducing Structured Outputs in the API." OpenAI, August 6, 2024. https://openai.com/index/introducing-structured-outputs-in-the-api/
12. "Prompt Caching in the API." OpenAI, October 1, 2024. https://openai.com/index/api-prompt-caching/
13. "Batch API." OpenAI Developers. https://developers.openai.com/api/docs/guides/batch
14. "Rate limits." OpenAI Developers. https://developers.openai.com/api/docs/guides/rate-limits
15. "Pricing." OpenAI. https://openai.com/api/pricing/
16. "Scale Tier for API Customers." OpenAI. https://openai.com/api-scale-tier/
17. "Priority Processing for API Customers." OpenAI. https://openai.com/api-priority-processing/
18. "Webhooks." OpenAI Developers. https://developers.openai.com/api/docs/guides/webhooks
19. "Error codes." OpenAI Developers. https://developers.openai.com/api/docs/guides/error-codes
20. "How to handle rate limits." OpenAI Cookbook. https://developers.openai.com/cookbook/examples/how_to_handle_rate_limits
21. "Vector embeddings." OpenAI Developers. https://developers.openai.com/api/docs/guides/embeddings
22. "Audio and speech." OpenAI Developers. https://developers.openai.com/api/docs/guides/audio
23. "Computer use." OpenAI Developers. https://developers.openai.com/api/docs/guides/tools-computer-use
24. "Introducing our latest image generation model in the API." OpenAI, April 23, 2025. https://openai.com/index/image-generation-api/
25. "Upgrading the Moderation API with our new multimodal moderation model." OpenAI, September 26, 2024. https://openai.com/index/upgrading-the-moderation-api-with-our-new-multimodal-moderation-model/
26. "Assistants API beta deprecation, August 26, 2026 sunset." OpenAI Developer Community, August 26, 2025. https://community.openai.com/t/assistants-api-beta-deprecation-august-26-2026-sunset/1354666
27. "What to know about the Sora discontinuation." OpenAI Help Center, March 24, 2026. https://help.openai.com/en/articles/20001152-what-to-know-about-the-sora-discontinuation
28. "openai/openai-go." GitHub. https://github.com/openai/openai-go
29. "openai/openai-java." GitHub. https://github.com/openai/openai-java
30. "openai/openai-dotnet." GitHub. https://github.com/openai/openai-dotnet
31. "Announcing the stable release of the official OpenAI library for .NET." Microsoft .NET Blog, October 2024. https://devblogs.microsoft.com/dotnet/announcing-the-stable-release-of-the-official-open-ai-library-for-dotnet/
32. "OpenAI for Developers in 2025." OpenAI Developers. https://developers.openai.com/blog/openai-for-developers-2025
33. "Managing your work in the API platform with projects." OpenAI Help Center. https://help.openai.com/en/articles/9186755-managing-your-work-in-the-api-platform-with-projects
34. "Reinforcement fine-tuning." OpenAI Developers. https://developers.openai.com/api/docs/guides/reinforcement-fine-tuning
35. "Conversation state." OpenAI Developers. https://developers.openai.com/api/docs/guides/conversation-state
36. "New embedding models and API updates." OpenAI, January 25, 2024. https://openai.com/index/new-embedding-models-and-api-updates/
37. "Introducing Codex." OpenAI, April 2025. https://openai.com/index/introducing-codex/
38. "Streaming API responses." OpenAI Developers. https://developers.openai.com/api/docs/guides/streaming-responses
39. "OpenAI API." Hacker News discussion of the June 11, 2020 launch. https://news.ycombinator.com/item?id=23489653
40. "Files." OpenAI API Reference. https://platform.openai.com/docs/api-reference/files