Llama Stack
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,625 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,625 words
Add missing citations, update stale details, or suggest a clearer explanation.
Llama Stack is an open standardized framework created by Meta for building generative AI applications. It defines a set of standardized APIs (building blocks) that cover the application development lifecycle, including inference, safety, agents, retrieval, evaluation, and telemetry, and it ships implementations of those APIs (called providers) that can be bundled into ready-to-run packages (called distributions). The goal is to let a developer write an application once against a common API surface and then run it across different backends, whether on a local machine, on self-hosted infrastructure, or through a cloud service, without rewriting the application code.[1][2][3]
The project began as a request for comment published alongside the Llama 3.1 release in July 2024, was formalized as a set of API providers and distributions at Meta Connect in September 2024 (with the Llama 3.2 launch), and reached its first stable release in January 2025. It is developed in the open at the GitHub repository meta-llama/llama-stack.[1][4][5][6] In April 2026 the project was renamed OGX after it grew to support many non-Llama models.[7]
Meta frames Llama Stack as a response to fragmentation in the tooling around large language models. By mid-2024 there were many separate libraries and services for inference, fine-tuning, safety filtering, retrieval, and agent orchestration, and they did not share common interfaces. In the Llama 3.1 launch post on July 23, 2024, Meta released "a request for comment on the Llama Stack API, a standard interface we hope will make it easier for third-party projects to leverage Llama models." The company described the aim as "a set of standardized and opinionated interfaces for how to build canonical toolchain components (fine-tuning, synthetic data generation) and agentic applications," and said it hoped the interfaces would "become adopted across the ecosystem, which should help with easier interoperability."[4][8]
The intent is that an application written against the Llama Stack APIs is portable. Because the interface stays the same, a developer can prototype locally with one inference engine and later move the same code to a production cluster or a managed cloud endpoint by swapping the underlying provider rather than rewriting the application.[2][3][6]
Llama Stack organizes its functionality into a set of REST APIs, each covering one part of the development lifecycle. The exact set has evolved across releases. The version that shipped with the first stable release (v0.1.0, January 2025) is summarized below.[5][6]
| API | Purpose |
|---|---|
| Inference | Run text generation and chat completion against a Llama (or other) model. |
| Safety | Apply content filtering and safety policies, for example using Llama Guard shields. |
| Agents | Build multi-step agentic workflows that combine the other APIs. |
| Tools | Register and call external tools that agents can invoke. |
| RAG / Memory | Store and retrieve knowledge for retrieval-augmented generation. |
| Evaluation | Test and score model and agent quality using scoring functions. |
| Telemetry | Collect traces, metrics, and logs to observe and debug agents. |
| Post-training | Fine-tune models (listed as forthcoming at the time of the stable release). |
In its first stable form the platform let developers "build RAG applications and Agents using tools and safety shields, monitor those agents with telemetry, and evaluate agents with scoring functions."[5] When Meta first described the API set in 2024, the planned surfaces also included synthetic data generation and reward scoring, reflecting the original toolchain framing around fine-tuning and data generation.[1][9] Later releases added batch inference and expanded the agent and RAG capabilities.[5][6]
Two concepts are central to how Llama Stack achieves portability: providers and distributions.
A provider is a concrete implementation of one or more of the APIs against a specific backend. For example, the Inference API can be implemented by a local engine such as Ollama, by a self-hosted server such as vLLM, or by a managed cloud service; the Safety API can be implemented by a Llama Guard deployment; and the memory or vector-store API can be backed by FAISS, sqlite-vec, Weaviate, or a hosted vector database. Because every provider conforms to the same API contract, the application code does not change when the provider does. Meta described this as a "plugin architecture to support the rich ecosystem of different API implementations in various environments," and partners including NVIDIA, Fireworks, and Ollama contributed provider implementations across inference, memory, and safety.[2][3][5][6]
A distribution (or "distro") is a pre-configured bundle of providers, one for each API, packaged so they work together and expose a single endpoint. Meta introduced distributions at Connect in September 2024, describing a distribution as "a way to package multiple API Providers that work well together to provide a single endpoint for developers." The aim is to let developers work with Llama models across "on-prem, cloud, single-node, and on-device" environments using a consistent setup. The distributions announced at launch covered several deployment targets:[1][3][9]
| Distribution type | Backends named at launch |
|---|---|
| Single-node | A Meta-internal implementation and Ollama |
| Cloud | AWS, Databricks, Fireworks, and Together AI |
| On-device | iOS, via PyTorch ExecuTorch |
| On-premises | Dell Technologies |
Llama Stack ships a command-line interface (the llama CLI) for downloading models, configuring distributions, and running the server, along with Docker containers for the distribution server and the agents provider.[1][3] To make the APIs accessible from applications, Meta released client SDKs in several languages. At the September 2024 launch these were "client code in multiple languages, including python, node, kotlin, and swift," giving server-side, web, Android, and iOS developers a typed way to call the same endpoints.[3][5] The Swift and Kotlin clients in particular target on-device use through the iOS and Android distributions.[3]
Llama Stack moved from a proposal to a versioned software project over roughly six months, then continued to evolve through 2025 and into 2026.
| Date | Milestone |
|---|---|
| July 23, 2024 | Request for comment on the Llama Stack API, published with the Llama 3.1 release.[4][8] |
| September 25, 2024 | Llama Stack distributions, providers, CLI, containers, and multi-language SDKs introduced at Meta Connect alongside Llama 3.2.[1][3][9] |
| January 24-25, 2025 | First stable release (v0.1.0): a "stable release (V1) of the Llama Stack APIs and the corresponding llama-stack server and client packages," with backward-compatible upgrades and automated provider verification.[5][6] |
| April 5, 2025 | Release v0.2.0, around the time of the Llama 4 launch, with expanded agent and RAG support.[10] |
| April 28, 2026 | Project renamed from Llama Stack to OGX.[7] |
The January 2025 stable release was positioned as the point at which the APIs were considered settled enough to build production applications on. Meta highlighted "backward-compatible upgrades," meaning developers could "integrate future API versions without modifying their existing implementations," and "automated provider verification," which ran compatibility checks when onboarding a new provider so that swapping backends was less error-prone.[5][6]
By 2026 the project had grown well beyond Llama-specific use. On April 28, 2026, the maintainers renamed it from Llama Stack to OGX (described as the "Open GenAI Stack"). The stated reasons were that "the Llama association was limiting" because the server by then supported many inference providers (the announcement cited 23) and could run models from OpenAI, Anthropic, Google, Mistral, and others, and that the name "Stack" implied a framework when the project was, in their words, "an HTTP server with a pluggable provider architecture." The rename moved the source from the llama_stack package to ogx, changed the CLI from llama to ogx, renamed environment variables from LLAMA_STACK_* to OGX_*, and moved the GitHub organization to ogx-ai, a change the maintainers said touched 1,696 files.[7]
Under the OGX name the project repositioned around being API-compatible with the major frontier labs: it implements OpenAI-style chat completion and Responses endpoints, the Anthropic Messages API, and Google's GenAI interface, so an application written for one vendor's SDK can run against the same server regardless of which underlying model it uses.[7]
Llama Stack sits above the LLaMA model weights and integrates several existing Meta safety and tooling efforts. Its Safety API is built around the Purple Llama project, in particular the Llama Guard input/output classifier used to filter prompts and responses. Llama Stack is distinct from the Llama API, a separate Meta-hosted service for calling Llama models, although a Llama-hosted endpoint can serve as one of the providers a distribution wires together. The framework is also broader than any single model generation: it was designed to work across the Llama 3.x and Llama 4 families and, after the OGX rename, across non-Llama models as well.[3][5][7]