OpenAI Assistants API
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,904 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,904 words
Add missing citations, update stale details, or suggest a clearer explanation.
The OpenAI Assistants API is a stateful, server-managed application programming interface introduced by OpenAI on November 6, 2023, at the company's first DevDay developer conference. It was OpenAI's earliest "agentic" interface, exposing four primary object types (Assistants, Threads, Messages, and Runs) and three first-party tools (Code Interpreter, Retrieval, and Function calling) that together let developers build chatbots and tool-using assistants without having to manage conversation history, vector databases, or sandboxed code execution themselves.[^1][^2] The API remained in beta for its entire lifetime. OpenAI announced its replacement, the Responses API, on March 11, 2025, and on August 26, 2025 published a formal deprecation notice declaring that the Assistants API beta would sunset on August 26, 2026.[^3][^4][^5]
| Field | Value |
|---|---|
| Type | Stateful HTTP API for building AI assistants and AI agents |
| Creator | OpenAI |
| Initial release | November 6, 2023 (v1 beta) |
| Major revision | April 17, 2024 (v2 beta) |
| Deprecation announced | August 26, 2025 |
| Scheduled sunset | August 26, 2026 |
| Successor | OpenAI Responses API plus OpenAI Agents SDK |
| Core abstractions | Assistant, Thread, Message, Run, Run Step, Vector Store (v2) |
| Hosted tools | Code Interpreter, Retrieval / File Search, Function calling |
| Status as of 2026-05-20 | Beta, deprecated, three months from sunset |
OpenAI announced the Assistants API on November 6, 2023, during its first DevDay conference in San Francisco, alongside GPT-4 Turbo and the consumer-facing custom GPTs product.[^1][^2] The release was framed by chief executive Sam Altman and developer experience lead Romain Huet as a step toward "agent-like experiences" inside third-party applications: the new API would let developers build assistants with specific instructions, persistent memory, and access to first-party tools without having to wire up conversation storage, sandboxed Python execution, or document indexing themselves.[^1][^6]
The November 6 release shipped four resource types and three tools. The four resource types were the Assistant (a server-stored configuration of model, instructions, and tools), the Thread (a server-stored conversation), the Message (a unit of user or assistant content inside a Thread), and the Run (an execution of an Assistant against a Thread).[^2][^6] The three first-party tools were Code Interpreter, which executed Python in an OpenAI-hosted sandbox; Retrieval, which performed embedding-based document search; and Function calling, which let the model emit structured JSON calls to developer-defined functions.[^1][^2] The API entered beta open to all developers on the day of the announcement.[^1]
The launch positioned the Assistants API as a structural counterpart to the customer-facing GPTs product. Both shared the same underlying abstractions of instructions, model, and tools, but where GPTs lived inside the ChatGPT consumer surface and were monetised through the OpenAI-operated GPT Store, the Assistants API exposed the same primitives to developers building their own applications.[^19] OpenAI's developer-relations posts argued that this split followed a deliberate product strategy: GPTs would democratise assistant authoring for non-technical users, while the Assistants API would let businesses embed similar functionality into their own clients with their own branding, authentication, and data handling.[^19][^2]
Coverage of the launch emphasised that the new API was not merely a wrapper over Chat Completions. Earlier in 2023, OpenAI had released function calling on the Chat Completions endpoint, and the developer community had built a substantial ecosystem of agent libraries on top of that primitive. The Assistants API was the first time the model provider itself shipped a server-side runtime for the agent loop, persistent conversation storage, and hosted retrieval as a packaged offering, which several commentators identified as a competitive response to client-side frameworks such as LangChain and LlamaIndex that had grown popular over the preceding year.[^1][^19][^23]
The first version of Retrieval was capped at twenty files per Assistant, with a per-file size limit of 512 megabytes and a per-Assistant storage charge.[^7] Several developer forum threads in November and December 2023 criticised the twenty-file ceiling as too small for production retrieval-augmented generation workloads, with users asking whether the limit would be lifted.[^7] Early reviews also noted that the API's asynchronous Run model required client-side polling: after creating a Run, callers had to repeatedly fetch the Run object until its status changed from queued or in_progress to a terminal state such as completed, failed, cancelled, expired, or requires_action.[^8]
Cost surprises were a recurring theme in early reviews. The v1 Retrieval tool was billed at 0.20 US dollars per gigabyte per Assistant per day, meaning that two Assistants sharing the same source corpus would each incur full storage charges; v2 later corrected this by introducing the shared Vector Store object.[^7][^10] Code Interpreter's hourly session billing was another point of confusion: developers who assumed a single Code Interpreter invocation cost 0.03 US dollars sometimes discovered that opening parallel Threads multiplied the charge, because each Thread maintained its own sandbox session.[^17][^18] Several early adopters concluded that the Assistants API was best suited to prototypes and internal tools where these unpredictable per-feature charges were tolerable, while large consumer-scale deployments could be cheaper to operate on the stateless Chat Completions endpoint with developer-managed retrieval and execution.[^22]
OpenAI released Assistants API v2 on April 17, 2024.[^9][^10] The v2 revision was a backwards-incompatible upgrade that introduced a new top-level object, the Vector Store, and renamed the Retrieval tool to file_search.[^9][^10] A Vector Store handled automatic parsing, chunking, and embedding for uploaded files and could be attached to either an Assistant or a Thread; the same Vector Store could be shared across multiple Assistants.[^10] The per-Assistant file ceiling rose from twenty to ten thousand files, a five-hundred-fold increase, while the per-file size limit remained at 512 megabytes and a new per-file token cap of five million tokens was imposed.[^9][^10]
The v2 update also added stream events for Runs, parallel function calls (multiple tool_calls returned in a single required_action step), a tool_choice parameter, token-usage fields on completed Runs, and standard sampling parameters (temperature, top_p, plus per-Run token limits).[^10][^11] File search results returned annotated citations identifying the source file and chunk for each retrieved span.[^10] OpenAI announced that access to the v1 endpoints would end on December 18, 2024, after which only v2 would be available.[^11]
Several smaller v2 enhancements addressed long-running developer complaints. The introduction of tool_choice let callers force a specific tool invocation on a per-Run basis, a feature already present on Chat Completions. Vector Stores accepted optional expiration policies that automatically deleted files after a configurable interval, reducing the risk of perpetual storage charges accruing on forgotten Assistants. The v2 endpoints also added support for fine-tuned gpt-3.5-turbo-0125 variants, and later expanded to fine-tuned GPT-4o derivatives, allowing developers to combine a customised base model with the Assistants runtime.[^10][^11] In aggregate, v2 closed many of the smaller feature gaps between the Assistants API and the simpler Chat Completions endpoint, but it preserved the asynchronous Run model and the perpetual beta label that had been criticised since launch.[^22][^23]
On March 11, 2025, OpenAI introduced the Responses API together with the open-source Agents SDK as part of a launch titled "New tools for building agents".[^3][^12] The Responses API was positioned as a unification of the strengths of the older Chat Completions endpoint and the Assistants API: it supports server-side conversation state, hosted tools, and an item-based event stream, but with a flatter object model.[^3][^4][^13] The March 11 announcement stated that OpenAI intended to achieve feature parity in the Responses API and then formally deprecate the Assistants API, with a sunset target in mid-2026 and a twelve-month migration window after the formal deprecation notice.[^3][^14]
That formal notice arrived on August 26, 2025. OpenAI told developers on its forum: "We're winding down the Assistants API beta. It will sunset one year from now, August 26, 2026."[^4] The same day the company published an "Assistants migration guide" describing how to translate Assistants (which become Prompts, a dashboard-only configuration object), Threads (which become Conversations), Runs (which become Responses), and Run Steps (which become Items) into the Responses API and the companion Conversations API.[^4][^15] As of May 2026 the Assistants API remains operational but is in legacy support, with three months remaining before the scheduled shutdown.[^5]
The Assistants API exposed a small set of REST endpoints under /v1/assistants, /v1/threads, and (in v2) /v1/vector_stores. The objects formed a directed hierarchy:
| Object | Lifetime | Purpose |
|---|---|---|
| Assistant | Persistent, account-scoped | Stores a model selection, system instructions, attached tools, and (in v2) attached Vector Stores. Reusable across many Threads and users.[^2][^6] |
| Thread | Persistent, account-scoped | Holds an ordered list of Messages for one conversation. Independent of any Assistant; the same Thread can be run against multiple Assistants.[^2][^6] |
| Message | Persistent, child of Thread | A user or assistant content payload (text, file attachments, images). Messages are append-only within a Thread.[^6][^11] |
| Run | Persistent, child of Thread | One execution of an Assistant on a Thread. Carries a status, token usage, and a list of Run Steps.[^8][^11] |
| Run Step | Persistent, child of Run | A single step of the agent loop: a message creation, a tool call, or a tool output.[^11] |
| Vector Store (v2) | Persistent, account-scoped | An auto-chunked, auto-embedded collection of files that the file_search tool can query.[^9][^10] |
The Run object's state machine drove the agent loop. After a client called POST /v1/threads/{thread_id}/runs, the Run started in queued, advanced to in_progress, and ended in one of completed, failed, cancelled, expired, or requires_action.[^8] The requires_action state was used to hand control back to the client for function calling: when the model emitted one or more tool_calls for developer-defined functions, the Run paused and waited for the client to submit tool outputs via POST /v1/threads/{thread_id}/runs/{run_id}/submit_tool_outputs.[^8][^16]
Because Threads were server-managed, the Assistants API was not idempotent in the way a stateless Chat Completions call was: while a Run was in a non-terminal state the Thread was locked, and no new Messages could be appended and no new Runs could be created against it.[^8] Clients either polled the Run endpoint until a terminal state was reached or, after the v2 update added streaming, subscribed to a server-sent event stream that emitted thread.run.created, thread.message.delta, thread.run.requires_action, and similar events.[^10][^11]
Three tool types were available across the API's lifetime:
tool_calls in a single requires_action event.[^10][^16]The three tool types could be mixed within a single Assistant, and the v2 tool_choice parameter let callers force the use of a specific tool on a per-Run basis.[^10]
The Assistants API used a layered pricing model. Token usage on the underlying model (for example GPT-4 Turbo or, later, GPT-4o) was billed at the standard per-token rate.[^1][^17] On top of that, the hosted tools carried infrastructure surcharges. Code Interpreter cost 0.03 US dollars per session, with a session defined as up to one hour of activity on a single Thread; concurrent sessions on different Threads were billed independently.[^17][^18] File Search in v2 was billed at 0.10 US dollars per gigabyte of Vector Store storage per day, with the first gigabyte free and a default project storage cap of 100 gigabytes; v1 Retrieval had been charged at 0.20 US dollars per gigabyte per Assistant per day.[^7][^17] Function calling itself carried no separate fee, only the model token cost.
Streaming was not available at the November 2023 launch. OpenAI added stream events in the v2 update, emitting incremental deltas for Message content (thread.message.delta), Run status changes (thread.run.in_progress, thread.run.completed), and tool invocations (thread.run.step.created, thread.run.requires_action).[^11][^10] Token-usage statistics on a Run were populated only after the Run reached a terminal state, which complicated billing instrumentation for long-running streamed Runs.[^11]
Streaming did not eliminate the underlying state-machine complexity of the Assistants API. A streamed Run still progressed through the same set of Run Step events as a polled Run; the stream simply delivered them as server-sent events rather than requiring repeated GET calls. Tool execution still required the client to detect a thread.run.requires_action event, locally execute the listed tool_calls, and reopen a connection with submit_tool_outputs to resume the Run.[^11][^16] This made the streaming developer experience richer than polling but did not simplify the orchestration logic that any agent-style application had to implement.
OpenAI shipped first-party Python and Node.js SDK helpers that wrapped the polling and streaming patterns. The Python helper client.beta.threads.runs.create_and_poll ran a Run to completion and returned the final Run object, while client.beta.threads.runs.stream returned an event-handler interface that fired callbacks for each Run Step.[^8] Similar helpers existed for Vector Store creation, including a method that uploaded a directory of files and polled until indexing was complete.[^10][^11] These helpers reduced boilerplate but did not change the underlying HTTP surface; clients in other languages still had to implement the polling or streaming loops themselves.
The Assistants API saw broad if shallow adoption. Coverage of DevDay 2023 reported that OpenAI made the beta available to "all developers" on the day of announcement, and use cases highlighted in the launch materials ranged from a natural-language data-analysis app to a coding assistant to an AI vacation planner.[^1][^2] Subsequent third-party guides documented production deployments at customer-support shops, financial-services firms, and internal enterprise assistants, often in hybrid configurations alongside the consumer-facing Custom GPTs product.[^19][^20] Through the Azure OpenAI Service, Microsoft offered a managed mirror of the Assistants API on Azure, extending its reach into regulated enterprise environments.[^21]
Independent reviewers consistently noted that the API was attractive for prototypes because it absorbed several normally tedious pieces of infrastructure (conversation storage, embedding indexes, sandboxed code execution) but became awkward at scale. A widely cited 2024 review summarised the trade-off bluntly: "the good, bad, and expensive."[^22] Reported pain points included perceived latency of four to eight seconds per turn versus one to two seconds for Chat Completions, unpredictable file-search billing on large Vector Stores, the inability to control chunking or embedding choices, and the persistent beta label.[^22][^23]
By the time of the August 2025 deprecation announcement, OpenAI's developer-relations posts described the Responses API as having "already overtaken Chat Completions in token activity" and characterised it as the recommended path for new agent applications.[^4] The same posts encouraged Assistants API users to begin migrating, citing internally measured improvements of forty to eighty per cent in cache-hit rate under the Responses API compared with Chat Completions.[^4][^13]
A notable adoption pattern, documented across multiple third-party integration guides, was the hybrid stack in which a single organisation ran the Assistants API for its internal-facing assistants and a fleet of consumer-facing Custom GPTs for end-user productivity. In this configuration the Assistants API typically powered chatbots inside proprietary applications that needed strict data residency, audit logging, or integration with existing identity systems, while Custom GPTs were used by employees for ad-hoc tasks in the ChatGPT product surface.[^19][^20] As of the August 2025 deprecation, OpenAI's migration guidance treated this hybrid use as the modal case and provided distinct migration paths for each, with Assistants API users moving to the Responses API and Custom GPT authors continuing to use the existing GPT Builder interface.[^4][^15]
The classic OpenAI Chat Completions endpoint was stateless: every call sent the full message list and received a single completion. The Assistants API was stateful: a Thread persisted on OpenAI servers, the model could autonomously decide to call multiple tools across multiple Run Steps, and the conversation history did not need to be retransmitted on each turn.[^2][^6] Chat Completions returned a single response synchronously; an Assistants Run was asynchronous and required either polling or stream subscription.[^8] Chat Completions never hosted Code Interpreter or vector retrieval; those tools were introduced first on the Assistants API and later ported to the Responses API.[^3][^13]
The Anthropic Messages API takes the opposite philosophical stance from the Assistants API on conversation state. Anthropic's Messages endpoint is fully stateless: each call must include the entire conversation history, and the client is responsible for storing, truncating, and resubmitting it.[^24] Tool use under the Messages API follows a stop-resume pattern in which the model returns stop_reason: "tool_use" with one or more tool_use content blocks, the client executes the call locally, and the client sends back a tool_result block on the next turn; there is no server-managed Run object.[^24] Server-side tools introduced later by Anthropic (web search, code execution, computer use, and others) run on Anthropic infrastructure but still flow through the same stateless Messages envelope.[^24] In 2025, Anthropic introduced a Managed Agents offering for stateful agent execution, more directly analogous to the Assistants and Responses APIs, while continuing to ship the Messages API itself as a stateless primitive.[^25]
Before and during the lifetime of the Assistants API, several open-source frameworks offered comparable abstractions for conversation memory, tool orchestration, and RAG pipelines, but as client-side libraries rather than server-side endpoints. LangChain provided chains, agents, memory classes, and document loaders that could be assembled into agent loops over any model provider, while LlamaIndex specialised in retrieval and indexing primitives for RAG applications. Both libraries could call the underlying OpenAI Chat Completions endpoint or the Assistants API, but they kept state in the developer's own process rather than on OpenAI's servers.[^23] Compared with the Assistants API, these frameworks offered finer control over chunking, embedding choice, and prompt construction at the cost of more developer code; the Assistants API offered hosted infrastructure at the cost of opacity and lock-in.[^22][^23]
The Model Context Protocol (MCP), introduced by Anthropic in late 2024, is not an agent API but a transport protocol that lets language models discover and call tools exposed by external servers. MCP is complementary to, rather than competitive with, the Assistants API: where the Assistants API combined a particular runtime (the Run loop) with a particular tool set (Code Interpreter, file search, functions), MCP standardises only the tool-discovery and tool-calling surface, leaving the agent runtime to the model provider.[^26] The Responses API and the OpenAI Agents SDK subsequently added MCP support as a built-in tool type, which the Assistants API never did.[^4]
| API | Provider | State | Hosted tools | Status (May 2026) |
|---|---|---|---|---|
| Assistants API | OpenAI | Server-managed Threads | Code Interpreter, File Search, Functions | Beta, sunset 2026-08-26[^4] |
| Responses API | OpenAI | Optional server state via Conversations | Web search, File Search, Code Interpreter, Computer Use, MCP, Image | GA[^3][^13] |
| Chat Completions | OpenAI | Stateless | None | GA[^3] |
| Messages API | Anthropic | Stateless | Web Search, Code Execution, Computer Use (server-side blocks) | GA[^24] |
| Managed Agents | Anthropic | Server-managed sessions | Mirrors Messages tool set | GA[^25] |
The Assistants API drew several recurrent criticisms over its lifetime, many of which OpenAI itself cited as motivation for the Responses API.
Polling and locking. Without streaming, the only way to know that a Run had finished was to call GET /v1/threads/{thread_id}/runs/{run_id} in a loop. Even with the v2 streaming events, the Thread was locked while a Run was active, which prevented appending new user messages or starting a parallel Run on the same Thread.[^8][^23] Multi-user front-ends therefore had to map each end-user to a distinct Thread and queue requests carefully.
Opaque retrieval. File Search did not expose its chunking strategy, embedding model, or top-k retrieval parameters. Developers who needed control over how documents were split or how results were ranked routinely fell back to LangChain, LlamaIndex, or a self-hosted vector database with the standard Chat Completions endpoint.[^22][^23]
Unpredictable costs. The combination of token billing, per-session Code Interpreter charges, per-gigabyte daily Vector Store charges, and the model's own discretion over whether to call tools made budgeting difficult. Reviewers reported cases where a single user question caused the same source PDF to be reprocessed in multiple Runs, accumulating tokens disproportionate to the conversational length.[^22][^23]
Perpetual beta. The Assistants API never reached general availability. Developer forum posts from late 2024 and early 2025 asked repeatedly whether it would ever leave beta; OpenAI's eventual answer was the Responses API, not a graduation of the existing surface.[^14][^23]
Limited tool surface. Web search, computer use, image generation, and MCP support were not added to the Assistants API. Each of these capabilities was introduced first on the Responses API in 2025, leaving Assistants users on an effectively frozen tool set.[^3][^12][^4]
In its August 2025 deprecation announcement and in subsequent developer blog posts, OpenAI framed the move from the Assistants API to the Responses API as a deliberate reset informed by a year and a half of production experience. The company described server-managed Threads as a useful prototype affordance that proved limiting in production: developers needed finer control over which context entered each model call, lower-latency turn execution, and stateless operation modes for organisations with zero-data-retention compliance requirements.[^4][^13]
The Responses API addresses these constraints by making server state optional (turns can be chained with a previous_response_id parameter or stored explicitly via the Conversations API, but they can also be sent fully stateless), by collapsing the Assistant configuration into a Prompt object that lives only in the dashboard, by replacing the asynchronous Run with a synchronous Response that streams items as they are produced, and by extending the hosted tool set to include web search, computer use, image generation, and Model Context Protocol clients.[^3][^4][^13] OpenAI also reported that the Responses API allows for substantially better prompt caching, citing internal benchmarks showing forty to eighty per cent cache-hit improvements over Chat Completions on equivalent workloads.[^4][^13]
The deprecation post stopped short of calling the Assistants API a failure. It was the first agentic API any major model provider had shipped, and many of its abstractions (server-managed conversation state, hosted code execution, hosted retrieval, structured tool calls) are now baseline expectations across the industry. The article's principal lesson, as OpenAI articulated it, was that "API design has always been guided by how the models themselves work," and that the rapid evolution of the underlying models (from GPT-4 Turbo through GPT-4o to GPT-5 and reasoning-focused successors) made the original Assistants object model too rigid to absorb new capabilities cleanly.[^13]
A second articulated lesson concerned the separation between configuration and execution. The Assistants object conflated three concerns into a single persistent resource: the model and decoding parameters, the system instructions and prompt, and the attached tools and vector stores. When any one of these needed to change (for example, swapping the underlying model to a newer reasoning variant), developers had to either mutate the Assistant in place or create a new Assistant and migrate clients to its identifier. The Responses API resolves this by treating Prompts as dashboard-managed templates that are referenced at call time, and by allowing tools, models, and instructions to be specified per-request when desired, which decouples runtime evolution from persistent identity.[^4][^13][^15]
A third lesson concerned the agent loop itself. Several Responses API design choices, including the unified item stream, the use of previous_response_id for turn chaining, and the move from polled Runs to streamed Responses, were framed in OpenAI's blog posts as direct responses to friction points reported by Assistants API users. The migration guide also clarifies that some Assistants API affordances, including dashboard-only prompt creation and certain conversation-truncation features, did not translate one-to-one into the Responses API; OpenAI's documented position is that these gaps reflect intentional simplifications rather than regressions.[^4][^14][^15]
The Assistants API marked the moment when "agent" stopped being purely a research and open-source-framework concept and became a first-party product surface offered by a frontier model lab. By bundling a conversation store, a code sandbox, a vector retrieval system, and a function-calling interface into a single hosted offering, it lowered the activation energy for building chatbots that could browse documents, run calculations, and call external services. Many of the abstractions it introduced (persistent Threads, Runs as agent-loop executions, tool calls as discrete protocol events) recur in nearly every successor system, from the Responses API and Agents SDK to Anthropic's Managed Agents, to assorted client-side frameworks.[^3][^13][^25]
Even after its sunset, the Assistants API is likely to be remembered as the experiment that established server-managed agent state as a viable product category, identified its main failure modes (latency, opacity, billing complexity, beta drift), and informed the design of the cleaner APIs that replaced it.