LLaMA (Large Language Model Meta AI), stylized as Llama from version 2 onward, is a family of large language models developed by Meta AI (formerly Facebook AI Research, or FAIR). First released in February 2023, the Llama series has grown into one of the most widely adopted open-weight model families in the history of artificial intelligence. The series spans multiple generations, from the original LLaMA with up to 65 billion parameters to Llama 4's mixture-of-experts models with nearly 2 trillion total parameters. Llama models have been downloaded over 1.2 billion times as of 2025 and have spawned tens of thousands of derivative models on platforms like Hugging Face.
The Llama family represents Meta's commitment to open-weight AI research. Unlike proprietary models from OpenAI or Google, Meta has made Llama weights freely available for research and (from Llama 2 onward) commercial use. This decision has had a transformative effect on the AI ecosystem, enabling researchers, startups, and independent developers to build on top of state-of-the-art language models without the cost of training from scratch.
Each generation of Llama has introduced significant improvements in model size, training data scale, context length, and architectural innovation. The series progressed from a text-only, dense transformer architecture in LLaMA 1 to natively multimodal mixture-of-experts models in Llama 4 that can process text, images, and video in a single unified framework.
Meta AI announced LLaMA on February 24, 2023, alongside a research paper titled "LLaMA: Open and Efficient Foundation Language Models" (arXiv:2302.13971). The project was led by the FAIR (Fundamental AI Research) team at Meta. The stated goal was to demonstrate that smaller models trained on more data could match or exceed the performance of much larger models, challenging the prevailing assumption that raw parameter count was the primary driver of capability.
LLaMA was initially released under a non-commercial research license. Access was granted on a case-by-case basis to academic researchers, government-affiliated organizations, civil society groups, and industry research laboratories.
LLaMA 1 consisted of four model sizes:
| Model | Parameters | Dimension | Attention Heads | Layers | Learning Rate | Batch Size | Training Tokens |
|---|---|---|---|---|---|---|---|
| LLaMA 7B | 7 billion | 4,096 | 32 | 32 | 3.0e-4 | 4M | 1T |
| LLaMA 13B | 13 billion | 5,120 | 40 | 40 | 3.0e-4 | 4M | 1T |
| LLaMA 33B | 33 billion | 6,656 | 52 | 60 | 1.5e-4 | 4M | 1.4T |
| LLaMA 65B | 65 billion | 8,192 | 64 | 80 | 1.5e-4 | 4M | 1.4T |
All models used a context window of 2,048 tokens. The training dataset comprised 1.4 trillion tokens drawn from publicly available sources:
| Source | Proportion |
|---|---|
| CCNet (Common Crawl) | 67% |
| C4 | 15% |
| GitHub | 4.5% |
| Wikipedia | 4.5% |
| Books | 4.5% |
| ArXiv | 2.5% |
| Stack Exchange | 2% |
The Wikipedia and Books data included text in 20 languages: Bulgarian, Catalan, Czech, Danish, German, English, Spanish, French, Croatian, Hungarian, Italian, Dutch, Polish, Portuguese, Romanian, Russian, Slovenian, Serbian, Swedish, and Ukrainian.
LLaMA 1 used a decoder-only transformer architecture with several modifications compared to the original transformer design:
- Pre-normalization with RMSNorm: the input of each transformer sub-layer is normalized (rather than the output), using RMSNorm in place of LayerNorm, to improve training stability.
- SwiGLU activation: the feed-forward networks use the SwiGLU activation function rather than ReLU.
- Rotary position embeddings (RoPE): absolute positional embeddings are replaced by rotary embeddings applied to the query and key vectors at each layer.
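To make the overall structure concrete, the following is a minimal sketch of one pre-norm decoder block of this kind in PyTorch. The attention and feed-forward modules are passed in as placeholders (the names and dimensions are illustrative, not Meta's code), and `torch.nn.RMSNorm` requires PyTorch 2.4 or later.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal pre-norm decoder block in the LLaMA style (sketch only)."""
    def __init__(self, dim: int, attention: nn.Module, feed_forward: nn.Module):
        super().__init__()
        # Pre-normalization: inputs to each sub-layer are normalized with RMSNorm.
        self.attention_norm = nn.RMSNorm(dim)
        self.ffn_norm = nn.RMSNorm(dim)
        self.attention = attention        # placeholder for causal self-attention
        self.feed_forward = feed_forward  # placeholder for the SwiGLU feed-forward net

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connections wrap each normalized sub-layer.
        x = x + self.attention(self.attention_norm(x))
        x = x + self.feed_forward(self.ffn_norm(x))
        return x

# Toy usage with identity sub-layers, just to exercise the wiring.
block = DecoderBlock(dim=64, attention=nn.Identity(), feed_forward=nn.Identity())
out = block(torch.randn(2, 16, 64))
```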
LLaMA demonstrated that smaller, well-trained models could compete with much larger ones. LLaMA-13B outperformed GPT-3 (175B parameters) on most benchmarks despite being more than 10 times smaller. LLaMA-65B was competitive with Chinchilla-70B and PaLM-540B on standard evaluation tasks.
| Model | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA |
|---|---|---|---|---|---|---|---|---|
| LLaMA 7B | 76.5 | 79.8 | 48.9 | 76.1 | 70.1 | 76.7 | 47.6 | 57.2 |
| LLaMA 13B | 78.1 | 80.1 | 50.4 | 79.2 | 73.0 | 78.1 | 52.7 | 56.4 |
| LLaMA 33B | 83.1 | 82.3 | 50.4 | 82.8 | 76.0 | 81.4 | 57.8 | 58.6 |
| LLaMA 65B | 85.3 | 82.8 | 52.3 | 84.2 | 77.0 | 81.5 | 56.0 | 60.2 |
Although Meta intended LLaMA 1 for controlled distribution to vetted researchers, the model weights were leaked to the public on March 3, 2023. A torrent containing the weights was uploaded and shared on the 4chan imageboard, then spread rapidly through online AI communities. Within days, the full model was available to anyone via BitTorrent.
Meta responded by filing takedown requests with Hugging Face and a DMCA takedown request with GitHub on March 20, 2023. Both platforms complied. However, the leak had already spread widely, and copies of the weights remained accessible through various channels.
The incident drew attention from U.S. lawmakers. Senators Richard Blumenthal and Josh Hawley wrote to Meta CEO Mark Zuckerberg expressing concern over the leak. They argued that Meta appeared to have "failed to conduct any meaningful risk assessment in advance of release" and that the company's approach was "unrestrained and permissive." The letter cited potential misuse for spam, fraud, malware, privacy violations, and harassment.
Paradoxically, the leak accelerated the open-source AI movement. Developers and researchers who gained access to the weights quickly began experimenting, producing fine-tuned variants and adaptations that demonstrated the potential of open-weight models. This groundswell of community activity is widely credited with influencing Meta's decision to release subsequent Llama versions under more permissive terms.
On July 18, 2023, Meta released Llama 2 in partnership with Microsoft. In a significant shift from LLaMA 1's restricted license, Llama 2 was made freely available for both research and commercial use. The license allowed most commercial applications but included restrictions for organizations with more than 700 million monthly active users, effectively requiring the largest technology companies to negotiate separate agreements.
This release represented Meta's strategic bet that an open ecosystem around Llama would benefit the company more than a closed approach. The partnership with Microsoft meant Llama 2 was available from day one in the Azure AI model catalog, as well as through Amazon Web Services, Hugging Face, and other cloud providers.
Llama 2 was available in three primary sizes: 7B, 13B, and 70B parameters. Meta also trained a 34B-parameter variant that was tested internally but not publicly released with the initial batch. Each model was trained on 2 trillion tokens of publicly available data, a 40 percent increase over LLaMA 1's training corpus. The context length was doubled from 2,048 to 4,096 tokens.
| Model | Parameters | Training Tokens | Context Length |
|---|---|---|---|
| Llama 2 7B | 7 billion | 2T | 4,096 |
| Llama 2 13B | 13 billion | 2T | 4,096 |
| Llama 2 70B | 70 billion | 2T | 4,096 |
Alongside the base pretrained models, Meta released Llama 2-Chat, a set of models fine-tuned specifically for dialogue applications. Llama 2-Chat was trained through a multi-stage process:
- Supervised fine-tuning (SFT) on high-quality instruction-following demonstrations.
- Reinforcement learning from human feedback (RLHF), using separate reward models for helpfulness and safety together with iterative rounds of rejection sampling and proximal policy optimization (PPO).
- Ghost Attention (GAtt), a fine-tuning technique that helps the model keep following system-level instructions across multi-turn dialogues.
Llama 2-Chat models were available in 7B, 13B, and 70B sizes. The RLHF process improved the model's ability to follow instructions, produce helpful responses, and refuse harmful or inappropriate requests.
Llama 2 retained most of the architectural choices from LLaMA 1 (RMSNorm, SwiGLU, RoPE) but introduced Grouped-Query Attention (GQA) in the 70B model. GQA is a compromise between standard Multi-Head Attention (MHA) and Multi-Query Attention (MQA). It allows multiple query heads to share the same set of key and value heads, reducing the memory footprint and computational overhead of the KV cache during inference. This improvement made the 70B model substantially more efficient to deploy.
On August 24, 2023, Meta released Code Llama, a specialized variant of Llama 2 fine-tuned for code generation and understanding. Code Llama supported many popular programming languages including Python, C++, Java, PHP, TypeScript, C#, and Bash.
Code Llama was released in three sizes (7B, 13B, and 34B parameters), each trained on an additional 500 billion tokens of code and code-related data. Meta also provided two specialized variants:
- Code Llama - Python, further specialized on an additional 100 billion tokens of Python code.
- Code Llama - Instruct, fine-tuned to follow natural-language instructions about code.
The 7B and 13B models additionally supported fill-in-the-middle (FIM) capability, allowing them to insert code into existing code blocks for tasks like code completion. Code Llama was released under the same permissive license as Llama 2.
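As a rough illustration of how fill-in-the-middle prompting is structured, the sketch below assembles an infilling prompt from a prefix and a suffix. The sentinel token names are illustrative placeholders rather than Code Llama's exact special tokens, which should be taken from the official tokenizer.

```python
# Sketch of a fill-in-the-middle (FIM) prompt. The sentinel strings below are
# illustrative placeholders, not necessarily Code Llama's exact special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    # The model is asked to generate the span that belongs between `prefix`
    # and `suffix`, stopping at an end-of-infill token.
    return f"{PRE} {prefix} {SUF}{suffix} {MID}"

prompt = build_fim_prompt(
    prefix="def fibonacci(n):\n    ",
    suffix="\n    return a",
)
print(prompt)
```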
Meta released Llama 3 on April 18, 2024, with pretrained and instruction-tuned models in two sizes: 8B and 70B parameters. Meta described Llama 3 as "the most capable openly available LLM to date" at the time of its release.
Llama 3 represented a major leap in training scale. The models were pretrained on over 15 trillion tokens of publicly available data, seven times more than Llama 2. Compared to its predecessor, Llama 3 was three times more efficient to train, and the training data contained four times more code.
| Model | Parameters | Training Tokens | Context Length | Vocabulary Size |
|---|---|---|---|---|
| Llama 3 8B | 8 billion | 15T+ | 8,192 | 128K |
| Llama 3 70B | 70 billion | 15T+ | 8,192 | 128K |
One of the most significant changes in Llama 3 was a new tokenizer with a vocabulary of 128,000 tokens, four times larger than Llama 2's 32,000-token vocabulary. This larger vocabulary allowed the tokenizer to encode text much more efficiently, producing up to 15 percent fewer tokens for the same input text. Fewer tokens per input means faster inference and the ability to fit more content within the context window.
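As a hedged illustration of this efficiency gain, the snippet below counts tokens for the same sentence under both tokenizers using the Hugging Face transformers library. The repository names are the commonly used gated Llama checkpoints and assume the licenses have been accepted and an access token configured.

```python
# Counting tokens for the same text under the Llama 2 and Llama 3 tokenizers.
# Both repositories are gated: accepting Meta's license and logging in with a
# Hugging Face token is assumed.
from transformers import AutoTokenizer

text = "Large language models compress text into tokens before processing it."

tok_llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tok_llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print("Llama 2 tokens:", len(tok_llama2.encode(text)))
print("Llama 3 tokens:", len(tok_llama3.encode(text)))
```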
Llama 3 retained the decoder-only transformer architecture with RMSNorm, SwiGLU, and RoPE. A notable change was the adoption of Grouped-Query Attention (GQA) across both the 8B and 70B model sizes, whereas in Llama 2, GQA was used only in the 70B model. This improved inference efficiency across the entire model family.
The fine-tuning process for the instruction-tuned models incorporated publicly available instruction datasets as well as over 10 million human-annotated examples, a substantial increase over Llama 2's fine-tuning data.
On July 23, 2024, Meta released Llama 3.1 with updated versions of the 8B and 70B models and a new flagship: the 405B-parameter model. This was the largest openly available language model at the time and the first open model that Meta claimed could rival leading proprietary models like GPT-4, GPT-4o, and Claude 3.5 Sonnet.
Training the 405B model required over 16,000 NVIDIA H100 GPUs and over 15 trillion tokens of training data. Meta deliberately chose a dense transformer architecture rather than a mixture-of-experts design to maximize training stability at this unprecedented scale. For production deployment, the model was quantized from 16-bit (BF16) to 8-bit (FP8) precision to reduce resource requirements.
All Llama 3.1 models (8B, 70B, and 405B) supported a 128K-token context length, a 16-fold increase over Llama 3's 8,192-token context window. This extended context enabled use cases like long-form document summarization, codebase analysis, and multi-turn conversational agents that need to maintain context across many exchanges.
Llama 3.1 added official multilingual support for eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
| Model | Parameters | Context Length | Training Tokens | Languages |
|---|---|---|---|---|
| Llama 3.1 8B | 8 billion | 128K | 15T+ | 8 |
| Llama 3.1 70B | 70 billion | 128K | 15T+ | 8 |
| Llama 3.1 405B | 405 billion | 128K | 15T+ | 8 |
Meta evaluated Llama 3.1 on over 150 benchmark datasets. The 405B model demonstrated strong performance in general knowledge, long-form text generation, multilingual translation, coding, mathematics, tool use, and advanced reasoning. It was the first openly available model to be broadly competitive with frontier proprietary models across these categories.
At Meta Connect 2024 in September, Meta released Llama 3.2, which split the Llama family in two new directions: multimodal vision models and lightweight edge models.
The Llama 3.2 11B and 90B vision language models (VLMs) were Meta's first multimodal Llama releases. These models could process both text and images, enabling tasks like image captioning, visual question answering, and document understanding. They were trained on a dataset of 6 billion image-text pairs.
The vision models were designed as drop-in replacements for their text-only counterparts, meaning existing applications using Llama 3.1 could upgrade to gain image understanding capabilities with minimal code changes. Meta reported that the 11B and 90B vision models exceeded Claude 3 Haiku on image understanding tasks.
The Llama 3.2 1B and 3B models were designed for on-device deployment on edge and mobile hardware. Despite their small size, they supported the full 128K-token context length and were trained on 9 trillion tokens. These models were optimized from day one for Qualcomm and MediaTek hardware and for Arm processors.
The 3B model outperformed Gemma 2 2.6B and Phi 3.5-mini on instruction following, summarization, prompt rewriting, and tool use benchmarks.
| Model | Parameters | Type | Context Length | Key Capability |
|---|---|---|---|---|
| Llama 3.2 1B | 1 billion | Text-only | 128K | Edge/mobile deployment |
| Llama 3.2 3B | 3 billion | Text-only | 128K | Edge/mobile deployment |
| Llama 3.2 11B | 11 billion | Vision + Text | 128K | Image understanding |
| Llama 3.2 90B | 90 billion | Vision + Text | 128K | Image understanding |
On December 6, 2024, Meta released Llama 3.3, a text-only instruction-tuned model with 70 billion parameters. Llama 3.3 70B delivered performance comparable to the much larger Llama 3.1 405B while requiring only a fraction of the computational resources.
The model showed substantial improvements in reasoning, mathematical understanding, coding, tool calling, and multilingual text support compared to Llama 3.1 70B. It was pretrained on approximately 15 trillion tokens and fine-tuned with over 25 million synthetically generated examples in addition to publicly available instruction datasets. Training utilized a cumulative 39.3 million GPU hours on H100-80GB hardware.
Llama 3.3 supported the same eight languages as Llama 3.1: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Meta released Llama 4 on April 5, 2025, marking the most significant architectural shift in the series. Llama 4 introduced two major changes: a mixture-of-experts (MoE) architecture and native multimodality through early fusion.
Llama 4 was Meta's first model family to use a mixture-of-experts architecture. In an MoE model, each input token is routed to only a subset of the model's total parameters (the "active" parameters), while the remaining parameters (organized as specialized "expert" sub-networks) stay dormant for that token. This design allows the model to have a very large total parameter count for knowledge capacity while keeping per-token computation costs manageable.
Each token in a Llama 4 model is processed by a shared expert plus one routed expert selected from the available expert pool. The architecture also uses alternating dense layers alongside the MoE layers.
Unlike Llama 3.2's vision models (which added multimodal capabilities on top of a text-only foundation), Llama 4 was natively multimodal from the start of pretraining. Meta used an "early fusion" approach in which text, image, and video tokens are combined into a single unified representation during pretraining itself. This means the model does not freeze text parameters or use separate multimodal parameters when training with images and videos. Instead, all modalities share the same representational space from the beginning.
The vision encoder in Llama 4 is based on MetaCLIP but was trained separately in conjunction with a frozen Llama model to better adapt the encoder to the LLM's internal representations.
Llama 4 Scout has 109 billion total parameters organized into 16 experts, with 17 billion active parameters per token. Its most notable feature is an industry-leading context window of 10 million tokens, achieved through a new architecture called iRoPE, which interleaves attention layers that use rotary position embeddings with attention layers that use no positional embeddings. The model was pretrained with a 256K-token context and then extended.
Despite its large context window and total parameter count, Scout fits on a single NVIDIA H100 GPU thanks to its MoE architecture (only 17B parameters are active per token). Meta reported that Scout outperformed Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a broad range of benchmarks.
Llama 4 Maverick scales up the expert count to 128 routed experts (plus a shared expert), giving it 400 billion total parameters while maintaining the same 17 billion active parameters per token as Scout. Maverick fits on a single NVIDIA H100 DGX host.
Meta described Maverick as the best multimodal model in its class, reporting that it beat GPT-4o and Gemini 2.0 Flash across a broad range of benchmarks. An experimental chat-optimized version of Maverick achieved an ELO score of 1,417 on LMArena. Meta also noted that Maverick achieved comparable results to DeepSeek v3 on reasoning and coding tasks.
Llama 4 Behemoth is the largest model in the family, with 288 billion active parameters, 16 experts, and nearly 2 trillion total parameters. As of mid-2025, Behemoth was still in training and had not been publicly released. Meta disclosed that Behemoth serves as a teacher model for distilling knowledge into the smaller Scout and Maverick models.
Even in its unfinished state, Meta reported that Behemoth outperformed GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks such as MATH-500 and GPQA Diamond.
Pre-training Llama 4 Behemoth with FP8 precision across 32,000 GPUs achieved a throughput of 390 TFLOPs per GPU.
All Llama 4 models were trained on over 30 trillion tokens, more than double the Llama 3 pretraining mixture. The training data included diverse text, image, and video datasets with coverage of over 200 languages, with 100 or more languages having at least 1 billion tokens each.
The post-training pipeline for Llama 4 consisted of three stages: lightweight supervised fine-tuning (SFT), online reinforcement learning (RL), and lightweight Direct Preference Optimization (DPO).
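The DPO objective itself is published independently of Meta's pipeline; the following is a minimal sketch of the pairwise DPO loss over chosen and rejected responses, not Meta's training code, with an illustrative beta value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss. Inputs are summed log-probabilities of the chosen and
    rejected responses under the trainable policy and the frozen reference model."""
    # Implicit reward: scaled log-ratio of policy vs. reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of four preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```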
| Model | Active Parameters | Total Parameters | Experts | Context Length | Status |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B | 109B | 16 | 10M | Released (April 2025) |
| Llama 4 Maverick | 17B | 400B | 128 (+1 shared) | Not specified | Released (April 2025) |
| Llama 4 Behemoth | 288B | ~2T | 16 | Not specified | Training (as of mid-2025) |
The following table summarizes all major Llama releases:
| Version | Release Date | Model Sizes | Max Parameters | Context Length | Training Tokens | Architecture | License |
|---|---|---|---|---|---|---|---|
| LLaMA 1 | February 2023 | 7B, 13B, 33B, 65B | 65B | 2,048 | 1.4T | Dense transformer | Non-commercial research |
| Llama 2 | July 2023 | 7B, 13B, 70B | 70B | 4,096 | 2T | Dense transformer + GQA (70B) | Commercial (with restrictions) |
| Code Llama | August 2023 | 7B, 13B, 34B | 34B | 16K (stable up to ~100K) | 500B additional | Dense transformer | Commercial (with restrictions) |
| Llama 3 | April 2024 | 8B, 70B | 70B | 8,192 | 15T+ | Dense transformer + GQA (all sizes) | Commercial (Llama 3 license) |
| Llama 3.1 | July 2024 | 8B, 70B, 405B | 405B | 128K | 15T+ | Dense transformer + GQA | Commercial (Llama 3.1 license) |
| Llama 3.2 | September 2024 | 1B, 3B, 11B, 90B | 90B | 128K | Up to 9T (small models) | Dense transformer; vision adapters | Commercial (Llama 3.2 license) |
| Llama 3.3 | December 2024 | 70B | 70B | 128K | ~15T | Dense transformer + GQA | Commercial (Llama 3.3 license) |
| Llama 4 | April 2025 | 109B, 400B, ~2T (total) | ~2T total (288B active) | Up to 10M | 30T+ | MoE + early fusion multimodal | Llama 4 license |
The Llama series has undergone steady architectural refinement across its generations. The core building blocks established in LLaMA 1 have persisted, but each generation introduced targeted improvements.
RMSNorm (Root Mean Square Normalization): All Llama models use pre-normalization with RMSNorm rather than the standard LayerNorm used in the original transformer. RMSNorm omits the mean-centering step, reducing computation by 5 to 15 percent per normalization layer while maintaining training stability.
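A minimal RMSNorm implementation matching this description follows; the dimension in the usage line is illustrative.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square normalization: no mean-centering and no bias term."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector by the reciprocal of its root mean square.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

y = RMSNorm(dim=4096)(torch.randn(2, 8, 4096))
```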
SwiGLU Activation: The feed-forward network in every Llama transformer block uses the SwiGLU activation function, which combines a gating mechanism with the Swish activation. SwiGLU provides better expressiveness than ReLU and avoids the dead neuron problem, at the cost of requiring three weight projections instead of two (offset by reducing the intermediate dimension).
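A sketch of a SwiGLU feed-forward block in this style is shown below: a gate projection passed through SiLU (Swish) is multiplied elementwise with an up projection, then mapped back down. The hidden size used here is illustrative rather than the value of any particular released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU: gate, up, and down projections."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (Swish) on the gate path, multiplied elementwise with the up path.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

y = SwiGLUFeedForward(dim=4096, hidden_dim=11008)(torch.randn(2, 8, 4096))
```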
Rotary Position Embeddings (RoPE): All Llama models encode positional information through RoPE, which applies rotation matrices to query and key vectors based on their positions. RoPE naturally encodes relative distances between tokens without additional learned parameters.
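The sketch below illustrates the rotation idea: pairs of feature channels are rotated by a position-dependent angle, so query-key dot products depend only on relative offsets. It is a simplified reference implementation (half-split channel pairing, base frequency 10,000), not the optimized kernel or the exact convention of any official release.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to `x` of shape
    [batch, seq, heads, head_dim]. Minimal sketch, not an optimized kernel."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per channel pair.
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    # Rotation angle = position index * frequency.
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied to each (x1, x2) channel pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q_rotated = apply_rope(torch.randn(1, 16, 32, 128))
```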
Introduced in Llama 2 (70B only) and expanded to all sizes in Llama 3, Grouped-Query Attention (GQA) groups multiple query heads to share a single set of key-value heads. This reduces the memory required for the KV cache during inference, improving throughput and enabling longer sequences without proportional memory increases.
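A minimal sketch of grouped-query attention follows: key and value heads are repeated so that each group of query heads shares one KV head before a standard attention call. Head counts and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: [batch, n_q_heads, seq, head_dim]; k, v: [batch, n_kv_heads, seq, head_dim].
    Each group of query heads shares a single key/value head."""
    group_size = q.shape[1] // k.shape[1]
    # Repeat each KV head so it lines up with its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

out = grouped_query_attention(
    torch.randn(1, 32, 16, 128),  # 32 query heads
    torch.randn(1, 8, 16, 128),   # 8 shared key/value heads
    torch.randn(1, 8, 16, 128),
)
```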
Llama 4 introduced MoE layers where each token is routed to a shared expert plus one selected routed expert. This allows Llama 4 models to have very large total parameter counts (for storing broad knowledge) while keeping active computation per token at just 17 billion parameters. The architecture alternates MoE layers with standard dense layers.
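The sketch below illustrates this routing pattern: every token passes through a shared expert, and a learned router adds the output of a single routed expert on top. Expert sizes, gating details, and load-balancing terms are simplified away; this is not Meta's implementation.

```python
import torch
import torch.nn as nn

class SimpleMoELayer(nn.Module):
    """Shared expert plus top-1 routed expert per token (sketch only)."""
    def __init__(self, dim: int, hidden_dim: int, n_routed_experts: int):
        super().__init__()
        def make_expert() -> nn.Module:
            return nn.Sequential(
                nn.Linear(dim, hidden_dim, bias=False),
                nn.SiLU(),
                nn.Linear(hidden_dim, dim, bias=False),
            )
        self.shared_expert = make_expert()
        self.routed_experts = nn.ModuleList(make_expert() for _ in range(n_routed_experts))
        self.router = nn.Linear(dim, n_routed_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, dim]. The router scores every routed expert per token.
        gate = self.router(x).softmax(dim=-1)
        weight, expert_idx = gate.max(dim=-1)  # top-1 routing
        routed_out = torch.zeros_like(x)
        for i, expert in enumerate(self.routed_experts):
            mask = expert_idx == i
            if mask.any():
                # Only tokens routed to expert i are processed by it.
                routed_out[mask] = weight[mask, None] * expert(x[mask])
        # Every token also passes through the always-on shared expert.
        return self.shared_expert(x) + routed_out

y = SimpleMoELayer(dim=512, hidden_dim=2048, n_routed_experts=4)(torch.randn(10, 512))
```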
Llama 4 Scout introduced iRoPE (interleaved Rotary Position Embeddings), a variant of RoPE that uses interleaved attention layers with and without rotary position embeddings. This technique enabled the 10-million-token context window, a massive jump from the 128K context in Llama 3.1.
Prior multimodal Llama models (Llama 3.2 vision) added image understanding on top of a pretrained text model. Llama 4 instead uses early fusion, integrating text, image, and video tokens into a shared representation during pretraining. The vision encoder is based on MetaCLIP and was co-trained with the language model, producing better cross-modal understanding.
The release (and leak) of LLaMA 1 ignited an explosion of community-built derivative models. This ecosystem has grown with each successive Llama release, making the Llama family one of the most forked and adapted model families in AI history.
One of the earliest and most influential derivatives, Stanford Alpaca was created by Stanford University researchers in March 2023. The team fine-tuned the LLaMA 7B model on 52,000 instruction-following demonstrations generated using OpenAI's text-davinci-003 API. Alpaca demonstrated that a relatively small, inexpensive fine-tuning process could produce a model with instruction-following capabilities comparable to much larger systems. The total fine-tuning cost was reported at under $600.
Vicuna-13B was developed by researchers at UC Berkeley, CMU, Stanford, and UCSD. It was created by fine-tuning LLaMA-13B on approximately 70,000 user-shared conversations collected from ShareGPT. The researchers reported that Vicuna achieved more than 90 percent of the quality of ChatGPT responses, as evaluated by GPT-4. The training cost was approximately $300.
The Llama ecosystem has produced numerous other important models:
- Koala, a UC Berkeley chatbot built by fine-tuning LLaMA-13B on dialogue data gathered from the web.
- Guanaco, a family of chat models introduced alongside the QLoRA paper, fine-tuned from LLaMA with 4-bit quantized low-rank adapters.
- WizardLM, fine-tuned on instructions automatically expanded with the Evol-Instruct method.
- Llama Guard, Meta's own family of safety-classifier models built on Llama and used to filter prompts and responses.
The Llama architecture and training techniques influenced several independent model families that, while not direct derivatives, drew significant inspiration from Meta's work:
- Mistral 7B, whose decoder-only design closely follows the Llama recipe (RMSNorm, SwiGLU, RoPE, grouped-query attention) while adding sliding-window attention.
- OpenLLaMA, an open reproduction of the LLaMA architecture trained on RedPajama, a public re-creation of the LLaMA 1 training mixture.
- Open-weight families such as Qwen (Alibaba) and Yi (01.AI), which adopted similar decoder-only architectures and permissive weight releases.
By 2025, the Llama ecosystem had reached remarkable scale. Meta reported over 1.2 billion cumulative downloads across all Llama models. On Hugging Face alone, tens of thousands of Llama derivative models were published, with monthly downloads of community-created variants reaching into the hundreds of thousands. The usage of Llama models doubled between May and July 2024 alone, following the release of Llama 3.1.
The open availability of Llama weights has enabled a rich ecosystem of fine-tuning tools and deployment options.
Several frameworks and techniques have become standard for adapting Llama models:
- LoRA (Low-Rank Adaptation) and QLoRA, which train small low-rank adapter matrices, optionally on top of a 4-bit quantized base model, instead of updating all weights (see the configuration sketch after this list).
- Hugging Face libraries such as transformers, PEFT, and TRL, which provide reference implementations of supervised fine-tuning, LoRA, and preference optimization for Llama checkpoints.
- Purpose-built fine-tuning tools such as Axolotl, Unsloth, and Meta's own torchtune and llama-cookbook (formerly llama-recipes).
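The following is a minimal sketch of attaching LoRA adapters to a Llama checkpoint with Hugging Face PEFT. The model identifier is the gated Llama 3.1 8B repository (license acceptance and an access token are assumed), and the rank, scaling, and target-module choices shown are common illustrative defaults rather than official recommendations.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Loading the full model needs roughly 16 GB of memory in bf16 and gated access.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```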
Llama models can be deployed through multiple channels:
- Managed cloud services, including Amazon Bedrock, Microsoft Azure AI, and Google Cloud Vertex AI, alongside various inference-as-a-service providers.
- Hugging Face, which hosts the official checkpoints and offers hosted inference endpoints.
- Self-hosted GPU inference servers such as vLLM and TensorRT-LLM (see the sketch after this list).
- Local runtimes such as llama.cpp and Ollama, which run quantized models on consumer CPUs and GPUs.
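As a hedged example of the self-hosted route, the snippet below performs offline batch generation with vLLM. The model identifier is the gated Llama 3.1 8B Instruct repository; license acceptance and sufficient GPU memory are assumed.

```python
from vllm import LLM, SamplingParams

# Loads the gated instruct checkpoint onto the local GPU(s).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain grouped-query attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```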
To make large Llama models practical for deployment on consumer and edge hardware, several quantization approaches are commonly used:
- Post-training weight quantization methods such as GPTQ and AWQ, which compress weights to roughly 4-bit precision with modest accuracy loss.
- The GGUF format used by llama.cpp, which offers a range of integer quantization levels for CPU and consumer-GPU inference.
- bitsandbytes 8-bit and 4-bit (NF4) quantization, frequently combined with QLoRA fine-tuning (see the sketch after this list).
- FP8 quantization, used by Meta itself for production serving of the Llama 3.1 405B model.
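A minimal sketch of 4-bit NF4 loading with bitsandbytes via transformers follows. The model identifier is the gated Llama 3.1 8B repository, and the quantization settings shown are common defaults rather than an official recommendation.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto",  # requires the accelerate package
)
```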
The Llama series has had a profound impact on the broader AI field. Before LLaMA 1, state-of-the-art language models were almost exclusively controlled by a handful of well-funded labs (OpenAI, Google, Anthropic). The release of competitive open-weight models changed the dynamics of the field in several ways.
By making high-quality model weights freely available, Meta enabled researchers at universities and smaller organizations to conduct experiments that previously required millions of dollars in compute budgets. This led to a surge in published research on topics like fine-tuning efficiency, alignment techniques, model merging, and quantization.
The permissive licensing of Llama 2 and subsequent versions allowed startups and enterprises to build commercial products on top of Llama without paying per-token API fees. Companies could run Llama models on their own infrastructure, maintaining data privacy and reducing costs compared to proprietary API-based approaches.
Open-weight models enabled independent safety researchers to study model behavior, test for biases, and develop alignment techniques without relying on API access that could be revoked. This transparency has been both praised (for enabling scrutiny) and criticized (for making it easier to remove safety guardrails).
The availability of strong open-weight models put competitive pressure on proprietary model providers, contributing to price reductions and more generous free tiers across the industry. The open-weight movement also prompted other organizations (Mistral AI, 01.AI, Alibaba, and others) to release their own model weights.
As with all large language models, the Llama family carries risks related to misuse and harm.
Llama models are trained on data from the web and therefore reflect biases present in their training data. Meta has evaluated Llama models for biases related to gender, religion, race, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Meta applied data filtering during training (using Kneser-Ney language models and fastText classifiers to filter based on proximity to Wikipedia-quality text) and RLHF during fine-tuning to reduce harmful outputs.
The open availability of Llama weights means that safety guardrails applied during fine-tuning can potentially be removed through additional fine-tuning. This has raised concerns from policymakers and safety researchers about the potential for misuse in generating misinformation, malware, or other harmful content. Meta has argued that the benefits of open access (including enabling independent safety research) outweigh these risks.
Meta publishes responsible use guides alongside each Llama release, providing guidance on safe deployment practices, content filtering, and risk mitigation. The Llama license includes an acceptable use policy that prohibits specific harmful applications.