Llama API
Last reviewed
Jun 3, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 1,327 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 1,327 words
Add missing citations, update stale details, or suggest a clearer explanation.
The Llama API is Meta's first-party hosted cloud service for running Llama models. Announced on April 29, 2025 at Meta's inaugural LlamaCon developer conference in Menlo Park, California, it gives developers programmatic access to Llama models for inference and fine-tuning through Meta's own infrastructure rather than only through third-party clouds.[1][2] At launch it was offered as a limited free preview behind a waitlist, with Meta saying it would expand access "in the coming weeks and months."[1][3]
The service marked a strategic shift. Until then, developers who wanted hosted Llama inference relied on outside providers such as Amazon Web Services, Microsoft Azure, Google Cloud, Together AI, and Fireworks AI, which packaged Meta's open-weight models as paid endpoints. With the Llama API, Meta began offering that access directly, positioning itself as an alternative front door to its own models.[2][4]
Meta unveiled the Llama API alongside other developer tooling at LlamaCon, the first conference dedicated to its Llama stack.[1] The company described the API as "available as a limited free preview," with one-click API key creation and interactive playgrounds for trying different models.[1] Initial access went to a set of selected developers through a waitlist sign-up form; broader availability was promised on a "weeks and months" timeline.[2][3]
Meta did not disclose pricing at announcement. TechCrunch reported that the company declined to share pricing details, and Meta's own terms described the service as being provided "free of charge" during the preview.[2][5] In the weeks after launch, Meta posted that it had "opened up slots" to more developers, moving the program toward a wider public preview, though Meta did not announce a firm general-availability date or published price list in 2025.[6] Coverage through mid-2025 continued to describe the API as a free preview.[7]
Privacy and portability were part of the pitch. Meta stated that it does not use prompts or model responses sent to the Llama API to train its own models, and that custom models built through the service remain the developer's to host anywhere, rather than being locked to Meta's servers.[1][2]
At launch the Llama API exposed Meta's then-newest Llama 4 models, the mixture-of-experts releases Scout and Maverick, plus an 8B-parameter variant of Llama 3.3 that Meta described as "previously unreleased" and made available specifically for customization.[1][8] Note that the openly released Llama 3.3 was a 70B model, so this smaller 8B variant was a new offering surfaced through the API rather than a download Meta had published earlier.[8][9]
| Model | Type | Role in the API |
|---|---|---|
| Llama 4 Scout | Mixture-of-experts, multimodal | Inference; fast-inference partner options |
| Llama 4 Maverick | Mixture-of-experts, multimodal | Inference; fast-inference partner options |
| Llama 3.3 8B (previously unreleased) | Dense text model | Fine-tuning and evaluation |
The Llama API shipped with lightweight software development kits in Python and TypeScript.[1] To lower switching costs, the API was made compatible with the OpenAI SDK, so applications already written against OpenAI's interface could be pointed at Llama models with minimal code changes.[1][2] Alongside the SDKs, Meta provided one-click API key generation and a browser playground for experimenting with the available models.[1][3]
Beyond inference, the preview included tools for building custom models. Developers could fine-tune the Llama 3.3 8B model: generating training data, training on it, and then checking results with an evaluation suite built into the service.[1][2] Meta's documentation framed this as a managed fine-tuning flow in which the developer mainly supplies a dataset and the platform handles the underlying training and evaluation steps.[10] Because Meta committed to not training on user data and to letting developers export their models, a model tuned through the API could be moved to other hosting once built.[1]
For developers working with Llama 4, Meta announced fast-inference options served on specialized hardware from two partners, Cerebras and Groq.[11][12] These options were offered separately from Meta's standard GPU-backed serving and were described as early, experimental access available by request; a developer could select a Cerebras or Groq model name within the API and have usage tracked in one place.[1][11]
Cerebras, which runs models on its wafer-scale chips, said its setup served Llama 4 Scout at about 2,648 tokens per second. Meta and Cerebras framed this as roughly 18 times faster than typical GPU-based serving from comparable closed-model APIs, citing benchmark figures from the analysis firm Artificial Analysis that placed SambaNova around 747 tokens per second and conventional GPU services near 100 to 130 tokens per second.[11] Cerebras's James Wang argued that high throughput matters most for reasoning and agent workloads rather than basic chat: "100 tokens per second is okay for chat, but it's very slow for reasoning. It's very slow for agents."[11]
Groq, which uses its custom language processing units (LPUs), said its integration delivered up to roughly 625 tokens per second on Llama 4 and required only a few lines of code to adopt, with no separate tuning, cold starts, or GPU configuration.[12] Groq chief executive Jonathan Ross said that "teaming up with Meta for the official Llama API raises the bar for model performance."[12]
| Inference path | Hardware | Reported throughput (Llama 4) | Availability at launch |
|---|---|---|---|
| Meta standard serving | GPU (Meta infrastructure) | Not disclosed | Preview |
| Cerebras | Wafer-scale engine | ~2,648 tokens/sec (Scout) | Experimental, by request |
| Groq | Language processing unit (LPU) | Up to ~625 tokens/sec | Experimental, by request |
The throughput figures above come from the partners and the benchmarks they cited; Meta did not publish independent latency numbers for its own GPU serving tier.[11][12]
Commentators read the Llama API as Meta becoming, in effect, a cloud vendor for its own models. The Next Platform described the move as Meta "finally becom[ing] a cloud," noting that revenue from hosted Llama inference had previously flowed to outside providers and would now be capturable by Meta directly.[4] The company leaned on the scale of Llama adoption to justify the effort, citing more than one billion Llama downloads (about 1.2 billion as of the announcement).[2][4]
The launch also fit a competitive moment. Reporting tied the API to Meta's effort to keep developers on Llama amid pressure from open-weight rivals such as China's DeepSeek and Alibaba's Qwen, and from closed APIs offered by OpenAI, Anthropic, and Google.[2][7] By pairing first-party hosting with optional high-speed serving from Cerebras and Groq, Meta offered both a default path and performance tiers without requiring developers to leave its ecosystem.[1][4]