Llama Stack

8 min read

Updated Jul 23, 2026

Llama Stack is an open standardized framework created by Meta for building generative AI applications. It defines a set of standardized APIs (building blocks) that cover the application development lifecycle, including inference, safety, agents, retrieval, evaluation, and telemetry, and it ships implementations of those APIs (called providers) that can be bundled into ready-to-run packages (called distributions). The goal is to let a developer write an application once against a common API surface and then run it across different backends, whether on a local machine, on self-hosted infrastructure, or through a cloud service, without rewriting the application code.^[1]^[2]^[3]

The project began as a request for comment published alongside the Llama 3.1 release in July 2024, was formalized as a set of API providers and distributions at Meta Connect in September 2024 (with the Llama 3.2 launch), and reached its first stable release in January 2025. It is developed in the open at the GitHub repository meta-llama/llama-stack.^[1]^[4]^[5]^[6] In April 2026 the project was renamed OGX after it grew to support many non-Llama models.^[7]

Background and goals

Meta frames Llama Stack as a response to fragmentation in the tooling around large language models. By mid-2024 there were many separate libraries and services for inference, fine-tuning, safety filtering, retrieval, and agent orchestration, and they did not share common interfaces. In the Llama 3.1 launch post on July 23, 2024, Meta released "a request for comment on the Llama Stack API, a standard interface we hope will make it easier for third-party projects to leverage Llama models." The company described the aim as "a set of standardized and opinionated interfaces for how to build canonical toolchain components (fine-tuning, synthetic data generation) and agentic applications," and said it hoped the interfaces would "become adopted across the ecosystem, which should help with easier interoperability."^[4]^[8]

The intent is that an application written against the Llama Stack APIs is portable. Because the interface stays the same, a developer can prototype locally with one inference engine and later move the same code to a production cluster or a managed cloud endpoint by swapping the underlying provider rather than rewriting the application.^[2]^[3]^[6]

API surfaces (building blocks)

Llama Stack organizes its functionality into a set of REST APIs, each covering one part of the development lifecycle. The exact set has evolved across releases. The version that shipped with the first stable release (v0.1.0, January 2025) is summarized below.^[5]^[6]

API	Purpose
Inference	Run text generation and chat completion against a Llama (or other) model.
Safety	Apply content filtering and safety policies, for example using Llama Guard shields.
Agents	Build multi-step agentic workflows that combine the other APIs.
Tools	Register and call external tools that agents can invoke.
RAG / Memory	Store and retrieve knowledge for retrieval-augmented generation.
Evaluation	Test and score model and agent quality using scoring functions.
Telemetry	Collect traces, metrics, and logs to observe and debug agents.
Post-training	Fine-tune models (listed as forthcoming at the time of the stable release).

In its first stable form the platform let developers "build RAG applications and Agents using tools and safety shields, monitor those agents with telemetry, and evaluate agents with scoring functions."^[5] When Meta first described the API set in 2024, the planned surfaces also included synthetic data generation and reward scoring, reflecting the original toolchain framing around fine-tuning and data generation.^[1]^[9] Later releases added batch inference and expanded the agent and RAG capabilities.^[5]^[6]

Providers and distributions

Two concepts are central to how Llama Stack achieves portability: providers and distributions.

A provider is a concrete implementation of one or more of the APIs against a specific backend. For example, the Inference API can be implemented by a local engine such as Ollama, by a self-hosted server such as vLLM, or by a managed cloud service; the Safety API can be implemented by a Llama Guard deployment; and the memory or vector-store API can be backed by FAISS, sqlite-vec, Weaviate, or a hosted vector database. Because every provider conforms to the same API contract, the application code does not change when the provider does. Meta described this as a "plugin architecture to support the rich ecosystem of different API implementations in various environments," and partners including NVIDIA, Fireworks, and Ollama contributed provider implementations across inference, memory, and safety.^[2]^[3]^[5]^[6]

A distribution (or "distro") is a pre-configured bundle of providers, one for each API, packaged so they work together and expose a single endpoint. Meta introduced distributions at Connect in September 2024, describing a distribution as "a way to package multiple API Providers that work well together to provide a single endpoint for developers." The aim is to let developers work with Llama models across "on-prem, cloud, single-node, and on-device" environments using a consistent setup. The distributions announced at launch covered several deployment targets:^[1]^[3]^[9]

Distribution type	Backends named at launch
Single-node	A Meta-internal implementation and Ollama
Cloud	AWS, Databricks, Fireworks, and Together AI
On-device	iOS, via PyTorch ExecuTorch
On-premises	Dell Technologies

Tooling and client SDKs

Llama Stack ships a command-line interface (the llama CLI) for downloading models, configuring distributions, and running the server, along with Docker containers for the distribution server and the agents provider.^[1]^[3] To make the APIs accessible from applications, Meta released client SDKs in several languages. At the September 2024 launch these were "client code in multiple languages, including python, node, kotlin, and swift," giving server-side, web, Android, and iOS developers a typed way to call the same endpoints.^[3]^[5] The Swift and Kotlin clients in particular target on-device use through the iOS and Android distributions.^[3]

Release timeline

Llama Stack moved from a proposal to a versioned software project over roughly six months, then continued to evolve through 2025 and into 2026.

Date	Milestone
July 23, 2024	Request for comment on the Llama Stack API, published with the Llama 3.1 release.^[4]^[8]
September 25, 2024	Llama Stack distributions, providers, CLI, containers, and multi-language SDKs introduced at Meta Connect alongside Llama 3.2.^[1]^[3]^[9]
January 24-25, 2025	First stable release (v0.1.0): a "stable release (V1) of the Llama Stack APIs and the corresponding llama-stack server and client packages," with backward-compatible upgrades and automated provider verification.^[5]^[6]
April 5, 2025	Release v0.2.0, around the time of the Llama 4 launch, with expanded agent and RAG support.^[10]
April 28, 2026	Project renamed from Llama Stack to OGX.^[7]

The January 2025 stable release was positioned as the point at which the APIs were considered settled enough to build production applications on. Meta highlighted "backward-compatible upgrades," meaning developers could "integrate future API versions without modifying their existing implementations," and "automated provider verification," which ran compatibility checks when onboarding a new provider so that swapping backends was less error-prone.^[5]^[6]

Rebranding to OGX

By 2026 the project had grown well beyond Llama-specific use. On April 28, 2026, the maintainers renamed it from Llama Stack to OGX (described as the "Open GenAI Stack"). The stated reasons were that "the Llama association was limiting" because the server by then supported many inference providers (the announcement cited 23) and could run models from OpenAI, Anthropic, Google, Mistral, and others, and that the name "Stack" implied a framework when the project was, in their words, "an HTTP server with a pluggable provider architecture." The rename moved the source from the llama_stack package to ogx, changed the CLI from llama to ogx, renamed environment variables from LLAMA_STACK_* to OGX_*, and moved the GitHub organization to ogx-ai, a change the maintainers said touched 1,696 files.^[7]

Under the OGX name the project repositioned around being API-compatible with the major frontier labs: it implements OpenAI-style chat completion and Responses endpoints, the Anthropic Messages API, and Google's GenAI interface, so an application written for one vendor's SDK can run against the same server regardless of which underlying model it uses.^[7]

Relationship to other Llama components

Llama Stack sits above the LLaMA model weights and integrates several existing Meta safety and tooling efforts. Its Safety API is built around the Purple Llama project, in particular the Llama Guard input/output classifier used to filter prompts and responses. Llama Stack is distinct from the Llama API, a separate Meta-hosted service for calling Llama models, although a Llama-hosted endpoint can serve as one of the providers a distribution wires together. The framework is also broader than any single model generation: it was designed to work across the Llama 3.x and Llama 4 families and, after the OGX rename, across non-Llama models as well.^[3]^[5]^[7]

References

^Sharon Machlis, "Meta introduces Llama Stack distributions for building LLM apps," InfoWorld, September 26, 2024. infoworld.com/...stributions-for-building-llm-apps
^llamastack/llama-stack, "Composable building blocks to build LLM Apps," GitHub repository README. github.com/...llama-stack
^Meta AI, "Llama 3.2: Revolutionizing edge AI and vision with open, customizable models," ai.meta.com, September 25, 2024. ai.meta.com/...ect-2024-vision-edge-mobile-devices
^Meta AI, "Introducing Llama 3.1: Our most capable models to date," ai.meta.com, July 23, 2024. ai.meta.com/...meta-llama-3-1
^Asif Razzaq, "Meta AI Releases the First Stable Version of Llama Stack," MarkTechPost, January 25, 2025. marktechpost.com/...s-multi-environment-deployment
^meta-llama/llama-stack, "Release v0.1.0," GitHub releases, January 24, 2025. github.com/...v0.1.0
^OGX project, "From Llama Stack to OGX: A New Name, A Sharper Mission," ogx-ai.github.io, April 28, 2026. ogx-ai.github.io/...from-llama-stack-to-ogx
^IBM, "Meta releases new Llama 3.1 models, including highly anticipated 405B parameter variant," IBM Think, July 2024. ibm.com/...llama-3-1-models-405b-parameter-variant
^"Meta introduces Llama Stack distributions for building LLM apps," Azalio. azalio.io/...k-distributions-for-building-llm-apps
^meta-llama/llama-stack, "Releases," GitHub (v0.2.0, April 5, 2025). github.com/...releases

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributors · v3 · 1,622 words · full history

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Suggest edit

What links here

Llama 3.2 Llama API

Background and goals

API surfaces (building blocks)

Providers and distributions

Tooling and client SDKs

Release timeline

Rebranding to OGX

Relationship to other Llama components

References

Improve this article

Related Articles

LLaMA/Model Card

Llama API

Meta Model API

LLaMA

Purple Llama

Llama 3

What links here

Related Articles

LLaMA/Model Card

Llama API

Meta Model API

LLaMA

Purple Llama

Llama 3

What links here