Llama API

AI Inference Developer Tools Meta AI

7 min read

Updated Jul 16, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 16, 2026

Fact-checked

In review queue

Sources

12 citations

Revision

v2 · 1,324 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

The Llama API is Meta's first-party hosted cloud service for running Llama models. Announced on April 29, 2025 at Meta's inaugural LlamaCon developer conference in Menlo Park, California, it gives developers programmatic access to Llama models for inference and fine-tuning through Meta's own infrastructure rather than only through third-party clouds.^[1]^[2] At launch it was offered as a limited free preview behind a waitlist, with Meta saying it would expand access "in the coming weeks and months."^[1]^[3]

The service marked a strategic shift. Until then, developers who wanted hosted Llama inference relied on outside providers such as Amazon Web Services, Microsoft Azure, Google Cloud, Together AI, and Fireworks AI, which packaged Meta's open-weight models as paid endpoints. With the Llama API, Meta began offering that access directly, positioning itself as an alternative front door to its own models.^[2]^[4]

Announcement and preview status

Meta unveiled the Llama API alongside other developer tooling at LlamaCon, the first conference dedicated to its Llama stack.^[1] The company described the API as "available as a limited free preview," with one-click API key creation and interactive playgrounds for trying different models.^[1] Initial access went to a set of selected developers through a waitlist sign-up form; broader availability was promised on a "weeks and months" timeline.^[2]^[3]

Meta did not disclose pricing at announcement. TechCrunch reported that the company declined to share pricing details, and Meta's own terms described the service as being provided "free of charge" during the preview.^[2]^[5] In the weeks after launch, Meta posted that it had "opened up slots" to more developers, moving the program toward a wider public preview, though Meta did not announce a firm general-availability date or published price list in 2025.^[6] Coverage through mid-2025 continued to describe the API as a free preview.^[7]

Privacy and portability were part of the pitch. Meta stated that it does not use prompts or model responses sent to the Llama API to train its own models, and that custom models built through the service remain the developer's to host anywhere, rather than being locked to Meta's servers.^[1]^[2]

Models offered

At launch the Llama API exposed Meta's then-newest Llama 4 models, the mixture-of-experts releases Scout and Maverick, plus an 8B-parameter variant of Llama 3.3 that Meta described as "previously unreleased" and made available specifically for customization.^[1]^[8] Note that the openly released Llama 3.3 was a 70B model, so this smaller 8B variant was a new offering surfaced through the API rather than a download Meta had published earlier.^[8]^[9]

Model	Type	Role in the API
Llama 4 Scout	Mixture-of-experts, multimodal	Inference; fast-inference partner options
Llama 4 Maverick	Mixture-of-experts, multimodal	Inference; fast-inference partner options
Llama 3.3 8B (previously unreleased)	Dense text model	Fine-tuning and evaluation

Developer tooling

The Llama API shipped with lightweight software development kits in Python and TypeScript.^[1] To lower switching costs, the API was made compatible with the OpenAI SDK, so applications already written against OpenAI's interface could be pointed at Llama models with minimal code changes.^[1]^[2] Alongside the SDKs, Meta provided one-click API key generation and a browser playground for experimenting with the available models.^[1]^[3]

Fine-tuning and evaluation

Beyond inference, the preview included tools for building custom models. Developers could fine-tune the Llama 3.3 8B model: generating training data, training on it, and then checking results with an evaluation suite built into the service.^[1]^[2] Meta's documentation framed this as a managed fine-tuning flow in which the developer mainly supplies a dataset and the platform handles the underlying training and evaluation steps.^[10] Because Meta committed to not training on user data and to letting developers export their models, a model tuned through the API could be moved to other hosting once built.^[1]

Fast inference partnerships: Cerebras and Groq

For developers working with Llama 4, Meta announced fast-inference options served on specialized hardware from two partners, Cerebras and Groq.^[11]^[12] These options were offered separately from Meta's standard GPU-backed serving and were described as early, experimental access available by request; a developer could select a Cerebras or Groq model name within the API and have usage tracked in one place.^[1]^[11]

Cerebras, which runs models on its wafer-scale chips, said its setup served Llama 4 Scout at about 2,648 tokens per second. Meta and Cerebras framed this as roughly 18 times faster than typical GPU-based serving from comparable closed-model APIs, citing benchmark figures from the analysis firm Artificial Analysis that placed SambaNova around 747 tokens per second and conventional GPU services near 100 to 130 tokens per second.^[11] Cerebras's James Wang argued that high throughput matters most for reasoning and agent workloads rather than basic chat: "100 tokens per second is okay for chat, but it's very slow for reasoning. It's very slow for agents."^[11]

Groq, which uses its custom language processing units (LPUs), said its integration delivered up to roughly 625 tokens per second on Llama 4 and required only a few lines of code to adopt, with no separate tuning, cold starts, or GPU configuration.^[12] Groq chief executive Jonathan Ross said that "teaming up with Meta for the official Llama API raises the bar for model performance."^[12]

Inference path	Hardware	Reported throughput (Llama 4)	Availability at launch
Meta standard serving	GPU (Meta infrastructure)	Not disclosed	Preview
Cerebras	Wafer-scale engine	~2,648 tokens/sec (Scout)	Experimental, by request
Groq	Language processing unit (LPU)	Up to ~625 tokens/sec	Experimental, by request

The throughput figures above come from the partners and the benchmarks they cited; Meta did not publish independent latency numbers for its own GPU serving tier.^[11]^[12]

Strategic context

Commentators read the Llama API as Meta becoming, in effect, a cloud vendor for its own models. The Next Platform described the move as Meta "finally becom[ing] a cloud," noting that revenue from hosted Llama inference had previously flowed to outside providers and would now be capturable by Meta directly.^[4] The company leaned on the scale of Llama adoption to justify the effort, citing more than one billion Llama downloads (about 1.2 billion as of the announcement).^[2]^[4]

The launch also fit a competitive moment. Reporting tied the API to Meta's effort to keep developers on Llama amid pressure from open-weight rivals such as China's DeepSeek and Alibaba's Qwen, and from closed APIs offered by OpenAI, Anthropic, and Google.^[2]^[7] By pairing first-party hosting with optional high-speed serving from Cerebras and Groq, Meta offered both a default path and performance tiers without requiring developers to leave its ecosystem.^[1]^[4]

References

Meta AI, "Everything we announced at our first-ever LlamaCon," April 29, 2025. https://ai.meta.com/blog/llamacon-llama-news/ ↩
Kyle Wiggers, "Meta previews an API for its Llama AI models," TechCrunch, April 29, 2025. https://techcrunch.com/2025/04/29/meta-previews-an-api-for-its-llama-ai-models/ ↩
Meta, "Llama API: Convenient access to Llama models" (official product/waitlist page). https://llama.developer.meta.com/ ↩
Timothy Prickett Morgan, "With Its Llama API Service, Meta Platforms Finally Becomes A Cloud," The Next Platform, April 30, 2025. https://www.nextplatform.com/2025/04/30/with-its-llama-api-service-meta-platforms-finally-becomes-a-cloud/ ↩
Meta, "Llama API Terms of Service," April 29, 2025. https://llama.developer.meta.com/ ↩
Meta for Developers, "Llama API Public Preview" (announcement that additional slots were opened), 2025. https://www.facebook.com/MetaforDevelopers/videos/llama-api-public-preview/1850684655474128/ ↩
Daniel Dominguez, "Meta Announces API and Protection Tools at First LlamaCon Event," InfoQ, May 13, 2025. https://www.infoq.com/news/2025/05/meta-llamacon-announcements/ ↩
Meta, "Build the Future | Llama API" (product page listing Llama 4 Maverick, Scout, and previously unreleased Llama 3.3 8B). https://www.llama.com/products/llama-api/ ↩
Meta, "Llama-3.3-70B-Instruct" model card, Hugging Face. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct ↩
Meta, "Fine-tuning & evaluation," Llama API documentation. https://llama.developer.meta.com/docs/features/fine-tuning/ ↩
Michael Nuñez, "Meta unleashes Llama API running 18x faster than OpenAI: Cerebras partnership delivers 2,600 tokens per second," VentureBeat, April 29, 2025. https://venturebeat.com/ai/meta-unleashes-llama-api-running-18x-faster-than-openai-cerebras-partnership-delivers-2600-tokens-per-second ↩
Fiona Jackson, "Meta's Llama API, Accelerated by Groq, 'Raises Bar for Model Performance,'" TechRepublic, April 30, 2025. https://www.techrepublic.com/article/news-meta-llama-api-groq/ ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

Llama Stack

Announcement and preview status

Models offered

Developer tooling

Fine-tuning and evaluation

Fast inference partnerships: Cerebras and Groq

Strategic context

See also

References

Improve this article

Related Articles

NVIDIA Triton Inference Server

TensorFlow Serving

Fireworks AI

NVIDIA NIM

NVIDIA Dynamo

ExLlamaV2 (EXL2)