LLaMA 2
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 7,098 words
LLaMA 2 (Large Language Model Meta AI 2) is a family of large language models developed and released by Meta in July 2023. As the second generation of Meta's LLaMA series, it represented a turning point in the open-weight AI movement by making powerful foundation models freely available for both research and commercial use. The release included pretrained base models and fine-tuned chat variants at three sizes (7 billion, 13 billion, and 70 billion parameters), all trained on 2 trillion tokens of publicly available data. LLaMA 2 was accompanied by an unusually detailed research paper [1] and a new community license that allowed commercial deployment subject to certain conditions. Together with a high-profile partnership with Microsoft announced on the same day, the release positioned Meta as one of the leading advocates for open AI development [2].
LLaMA 2 quickly became the default foundation model for thousands of academic papers, startup products, and community fine-tunes throughout late 2023 and early 2024, before being superseded by Llama 3 in April 2024 [3]. Even after newer Meta models arrived, LLaMA 2 has remained a common teaching example because its architecture is well documented, its weights are freely downloadable, and its size range covers everything from a laptop-friendly 7B to a research-cluster 70B.
Meta released the original LLaMA (LLaMA 1) on February 24, 2023, initially restricting access to researchers through an application process. The model weights were leaked online within a week of the announcement, and the resulting community activity demonstrated demand for openly available large language models that closed APIs could not satisfy. Fine-tuned derivatives such as Stanford's Alpaca and LMSYS's Vicuna appeared within weeks, showing that even relatively small open models could be adapted for a wide range of tasks at low cost.
Building on this experience, Meta took a different approach with LLaMA 2. Rather than limiting distribution to researchers, the company released the model weights openly on July 18, 2023, alongside a permissive community license that explicitly permitted commercial use. The announcement was made jointly with Microsoft at Microsoft's annual Inspire partner conference, and Microsoft chief executive Satya Nadella publicly endorsed the partnership on stage [2]. By this point Meta said it had received more than 100,000 access requests for LLaMA 1, an indicator of latent demand that Meta cited as one motivation for an open release [2].
The timing of the release was significant. In mid-2023, OpenAI's GPT-4 and ChatGPT dominated public attention, and the prevailing narrative in the industry favored closed, proprietary models. By releasing LLaMA 2 with commercial permissions, Meta challenged that narrative directly and gave developers, startups, and enterprises a competitive open alternative. Mark Zuckerberg framed the strategy in subsequent earnings calls and interviews as a way to commoditize the input layer that Meta's competitors profited from selling.
The accompanying paper, "Llama 2: Open Foundation and Fine-Tuned Chat Models" (arXiv:2307.09288), was published on July 18, 2023, the same day as the model release. The lead authors were Hugo Touvron, Louis Martin, Kevin Stone, and Peter Albert, with more than 50 additional contributors from Meta's GenAI team. Touvron had also been the lead author of the original LLaMA paper, providing continuity between the two model families [1]. The paper itself runs to 76 pages and reads more like a tech report than a conventional academic submission, with extensive appendices on data filtering, annotation guidelines, safety taxonomy, and example outputs. Several Meta researchers reported on social media that the level of disclosure was deliberately calibrated to be reproducible by other research groups working at smaller scale, even if the data could not be shared.
Meta uses the spelling "Llama 2" with title-case in its own documentation, but the original paper title and the academic literature continue to use "LLaMA 2" as a backronym for "Large Language Model Meta AI 2." Both spellings appear in primary sources and are interchangeable in practice. Starting with Llama 3, Meta dropped the all-caps form entirely, and "Llama" is now the canonical product name.
LLaMA 2 was released in three parameter sizes, each available as both a pretrained base model and a fine-tuned chat model optimized for dialogue. A 34B parameter variant was also trained but withheld from public release because Meta felt it had not been sufficiently red-teamed for safety [1]. The 34B size was eventually released in modified form as part of Code Llama the following month.
| Model | Parameters | Layers | Heads | KV Heads | Hidden Dim | Context | Attention | Public Release |
|---|---|---|---|---|---|---|---|---|
| Llama 2 7B | 7B | 32 | 32 | 32 | 4,096 | 4,096 | MHA | Base + Chat |
| Llama 2 13B | 13B | 40 | 40 | 40 | 5,120 | 4,096 | MHA | Base + Chat |
| Llama 2 34B | 34B | 48 | 56 | 8 | 7,168 | 4,096 | GQA | Withheld |
| Llama 2 70B | 70B | 80 | 64 | 8 | 8,192 | 4,096 | GQA | Base + Chat |
The base models were designed for general-purpose text generation and could be fine-tuned for specific downstream tasks. The chat variants (Llama 2-Chat) were tuned for multi-turn dialogue through supervised fine-tuning and reinforcement learning from human feedback (RLHF).
All three publicly released sizes shared the same transformer decoder-only backbone, with the principal difference that the 70B (and unreleased 34B) used grouped-query attention rather than standard multi-head attention. Grouped-query attention shares each set of key and value projections across multiple query heads, reducing memory bandwidth requirements during inference and improving throughput at the largest scale [1].
LLaMA 2 retained the core architectural choices of LLaMA 1 while making targeted improvements aimed at long-context performance and inference efficiency. The model is a standard decoder-only transformer with several specific design choices that have since become common in open-weight LLMs.
Following LLaMA 1, the model applies RMSNorm (Root Mean Square Layer Normalization, Zhang and Sennrich 2019) before each transformer sub-layer rather than after, the pre-normalization arrangement popularized by GPT-3 (which used standard LayerNorm). Pre-normalization improves training stability, particularly at scale, by keeping the residual stream's variance bounded as gradients flow back through deep networks. RMSNorm itself is a simplified variant of LayerNorm that drops the mean-centering step and the learnable bias, reducing parameter count and compute per layer while empirically matching full LayerNorm's quality.
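The normalization step described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than Meta's implementation: the function names `rms_norm` and `pre_norm_sublayer` and the `eps` default are assumptions chosen for the example.

```python
import math

def rms_norm(x, gain, eps=1e-5):
    """Scale x by the reciprocal of its root mean square, then by a learned gain.

    Unlike LayerNorm there is no mean subtraction and no bias term.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for g, v in zip(gain, x)]

def pre_norm_sublayer(x, sublayer, gain):
    """Pre-normalization: normalize the sub-layer input, then add the residual."""
    y = sublayer(rms_norm(x, gain))
    return [a + b for a, b in zip(x, y)]

hidden = [3.0, -4.0, 0.0, 0.0]
normed = rms_norm(hidden, gain=[1.0] * 4)
```

After normalization with unit gain, the mean of the squared entries is approximately 1, whatever the scale of the input.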
The feed-forward network uses the SwiGLU activation introduced by Noam Shazeer in 2020. SwiGLU combines a gated linear unit with the Swish (also known as SiLU) activation, multiplying two linear projections of the input where one is passed through Swish. Empirically this provides better performance than standard ReLU or GELU activations at the same parameter budget, with a small constant cost from the extra projection. To keep the parameter count of the FFN comparable to a standard GELU MLP, the LLaMA 2 design uses an intermediate dimension of approximately 8/3 times the hidden dimension instead of the conventional 4x.
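A minimal sketch of a SwiGLU feed-forward block, together with the parameter-count arithmetic behind the 8/3 ratio, might look as follows. The helper names (`silu`, `matvec`, `swiglu_ffn`, `ffn_weight_count`) are illustrative, and the 11,008 intermediate size is the 7B value from the configuration listing in this article.

```python
import math

def silu(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(w, x):
    """Multiply a matrix, stored as a list of rows, by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Compute down(silu(gate(x)) * up(x)) with no bias terms."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def ffn_weight_count(hidden, intermediate, gated):
    """Weights in the FFN: three matrices if gated, two otherwise."""
    return (3 if gated else 2) * hidden * intermediate

hidden = 4096
gated_params = ffn_weight_count(hidden, 11008, gated=True)        # SwiGLU, ~8/3 ratio
plain_params = ffn_weight_count(hidden, 4 * hidden, gated=False)  # conventional 4x MLP
```

With these numbers the gated block has within about one percent of the weights of a conventional 4x MLP, which is the point of shrinking the intermediate dimension.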
Instead of absolute or learned positional encodings, LLaMA 2 uses rotary position embeddings (RoPE), introduced by Su et al. (2021). RoPE encodes position by rotating the query and key vectors at each layer through angles that depend on token position, so that the dot product between two rotated vectors naturally encodes their relative position. RoPE generalizes more gracefully to sequence lengths not seen during training and forms the basis for later context-extension techniques like NTK-aware and YaRN scaling.
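The relative-position property can be checked numerically with a small sketch. It assumes the common convention of rotating adjacent pairs of dimensions with base `theta = 10000`; the names `rope` and `dot` are illustrative, and production implementations may pair dimensions differently.

```python
import math

def rope(vec, position, theta=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * theta**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = position * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
score_a = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2
score_b = dot(rope(q, 105), rope(k, 102))  # same offset, shifted by 100
```

The two scores agree up to floating-point error, because the attention logit depends only on the offset between the query and key positions, not on their absolute values.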
The 70B model uses grouped-query attention (GQA, Ainslie et al. 2023) with 8 key-value heads shared across 64 query heads, an 8x reduction in the size of the key-value cache compared to full multi-head attention. The smaller cache makes long-sequence inference dramatically more memory-efficient, which becomes critical when serving the 70B model at production scale. GQA sits between standard multi-head attention (one KV pair per query head) and multi-query attention (a single KV pair shared across all query heads), trading a small quality loss for a large efficiency gain. The 7B and 13B models, which fit more easily in GPU memory, retain conventional multi-head attention.
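The 8x figure follows directly from the cache-size arithmetic. The sketch below assumes fp16 storage (2 bytes per value) and a single 4,096-token sequence; `kv_cache_bytes` is an illustrative helper, not a library function.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (factor 2 = keys + values)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

head_dim = 8192 // 64                       # 128 for the 70B model
gqa = kv_cache_bytes(80, 8, head_dim, 4096)   # 8 KV heads, as released
mha = kv_cache_bytes(80, 64, head_dim, 4096)  # hypothetical full multi-head variant
print(f"GQA: {gqa / 2**30:.2f} GiB, MHA: {mha / 2**30:.2f} GiB, ratio {mha // gqa}x")
```

Under these assumptions the cache for one full-length sequence is 1.25 GiB with GQA versus 10 GiB with full multi-head attention, a difference that multiplies with batch size when serving many requests at once.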
All models support a context window of 4,096 tokens, double the 2,048-token context of LLaMA 1. The longer window allows the model to process roughly 3,000 words of input plus output and was a key enabler for the chat use case, where multi-turn conversations and long system prompts can quickly exhaust shorter contexts.
The architecture does not use bias terms in the linear layers, a choice inherited from LLaMA 1 that slightly reduces parameter count and has been shown not to harm performance. The vocabulary is 32,000 tokens trained with byte-pair encoding (BPE) using the SentencePiece implementation, identical to LLaMA 1. Numbers are split into individual digits and unknown UTF-8 characters fall back to byte-level decomposition, which improves robustness on technical and multilingual text.
```python
# Approximate hidden / feed-forward dimensions per variant
# (intermediate_size ~= 8/3 * hidden_size for 7B and 13B, rounded for hardware
# alignment; the GQA variants use a larger multiplier)
LLAMA2_CONFIG = {
    '7B': {'layers': 32, 'heads': 32, 'kv_heads': 32, 'hidden': 4096, 'ffn': 11008},
    '13B': {'layers': 40, 'heads': 40, 'kv_heads': 40, 'hidden': 5120, 'ffn': 13824},
    '34B': {'layers': 48, 'heads': 56, 'kv_heads': 8, 'hidden': 7168, 'ffn': 22016},
    '70B': {'layers': 80, 'heads': 64, 'kv_heads': 8, 'hidden': 8192, 'ffn': 28672},
}
```
LLaMA 2 was pretrained on 2 trillion tokens drawn from publicly available sources, a 40% increase over the 1.4 trillion tokens used for LLaMA 1. Meta did not disclose the exact composition of the training data but stated that it included "a new mix of publicly available online data" and that data was filtered to remove sites known to contain high volumes of personal information [1]. The corpus excludes data from any of Meta's own products and services.
The paper's Table 10 reports the language distribution of the pretraining corpus: roughly 89.7% English and an 8.4% bucket of unknown or programming-language tokens, with the remaining 2% or so spread across 27 other languages including German, French, Swedish, Chinese, Spanish, Russian, and Dutch. The strong English bias is reflected in downstream performance, where non-English benchmarks lag substantially behind English ones.
The pretraining knowledge cutoff is September 2022, although some of the fine-tuning data extends to July 2023 [4]. The data was processed with the same SentencePiece BPE tokenizer used in LLaMA 1, with a 32,000-token vocabulary.
All models were trained with a standard autoregressive language-modeling objective using the AdamW optimizer with beta1 = 0.9, beta2 = 0.95, and weight decay of 0.1. The learning rate followed a cosine schedule with 2,000 warmup steps, decaying to 10% of the peak value. Peak learning rates were 3.0 x 10^-4 for the 7B and 13B variants and 1.5 x 10^-4 for the 70B. Gradient clipping was set to 1.0, and the global batch size was 4 million tokens for all variants [1][4].
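The schedule described above can be sketched as a single function. The linear shape of the warmup is an assumption (the text only gives its length), and `learning_rate` is an illustrative name.

```python
import math

def learning_rate(step, total_steps, peak_lr=3.0e-4, warmup_steps=2000, final_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)

# 2T tokens at 4M tokens per step is roughly 500,000 optimizer steps.
total_steps = 2_000_000_000_000 // 4_000_000
```

Evaluating `learning_rate(2000, total_steps)` gives the peak value, and `learning_rate(total_steps, total_steps)` gives one tenth of it, matching the 10% floor.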
Training was conducted on Meta's Research Super Cluster (RSC) and on internal production clusters, using NVIDIA A100-80GB GPUs. The total compute and carbon impact were disclosed at unusual granularity in the model card [4]:
| Model | GPU Hours (A100-80GB) | Power per GPU | CO2 Emissions |
|---|---|---|---|
| Llama 2 7B | 184,320 | 400 W | 31.22 tCO2eq |
| Llama 2 13B | 368,640 | 400 W | 62.44 tCO2eq |
| Llama 2 70B | 1,720,320 | 400 W | 291.42 tCO2eq |
| All variants combined (including the unreleased 34B) | 3,311,616 | 400 W | 539.00 tCO2eq |
Meta reported that 100% of these emissions were offset through its sustainability program. At a wholesale rate of roughly $1 per A100 GPU-hour in mid-2023, the 70B run alone implied a compute cost north of $1.7 million in commodity terms, although the actual cost to Meta on owned hardware would have been substantially lower.
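The figures in the table can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above; the $1 per GPU-hour rate is the rough estimate from the text, not a disclosed price.

```python
GPU_HOURS = {"7B": 184_320, "13B": 368_640, "70B": 1_720_320}
REPORTED_TOTAL = 3_311_616   # combined row of the table
USD_PER_GPU_HOUR = 1.0       # rough mid-2023 wholesale rate quoted in the text

listed = sum(GPU_HOURS.values())
unlisted = REPORTED_TOTAL - listed               # hours not covered by the three listed rows
cost_70b = GPU_HOURS["70B"] * USD_PER_GPU_HOUR   # commodity-price estimate in USD
energy_70b_mwh = GPU_HOURS["70B"] * 400 / 1e6    # 400 W per GPU, converted to MWh
```

The three listed rows sum to 2,273,280 GPU-hours, so about 1.04 million hours of the reported total fall outside them, which is consistent with the total also covering the unreleased 34B run. The 70B run comes to roughly $1.72 million at the quoted rate and about 688 MWh of GPU energy at 400 W.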
| Aspect | LLaMA 1 (Feb 2023) | LLaMA 2 (Jul 2023) |
|---|---|---|
| Training tokens | 1.4 trillion | 2.0 trillion |
| Context length | 2,048 | 4,096 |
| Largest released model | 65B | 70B |
| Attention (largest) | MHA | GQA (8 KV heads) |
| Tokenizer vocab | 32,000 (BPE) | 32,000 (BPE, identical) |
| Commercial license | Research only | Yes, with MAU clause |
| RLHF alignment | None | Yes (SFT + 5 RLHF rounds) |
| Disclosed compute | A100 hours not detailed | 3.31M A100-hours total |
The 40% increase in training data was one of the most impactful changes. Scaling laws research has consistently shown that training on more tokens improves model quality at a fixed parameter count, and the jump from 1.4T to 2T tokens produced measurable gains across benchmarks even at the same architecture and parameter count.
The Llama 2-Chat models underwent an extensive alignment process that combined supervised fine-tuning (SFT) with reinforcement learning from human feedback. The pipeline was described in unusual detail in the paper, making it one of the most transparent published accounts of an industrial-scale RLHF run [1]. Researchers and engineers at other labs treated the 76-page paper as a de facto recipe for replicating large-model alignment.
The first stage involved supervised fine-tuning on 27,540 high-quality prompt-response pairs written by human annotators. Meta found that a relatively small number of carefully curated examples was more effective than larger sets of lower-quality data, and summarized the finding under the heading "Quality Is All You Need" [1]. The team reported that they began curating their own SFT data after observing that some publicly available instruction-tuning datasets contained noisy or generic responses.
SFT was run for two epochs with a cosine learning rate schedule peaking at 2 x 10^-5, weight decay 0.1, batch size 64, and a 4,096-token sequence length. Prompts and answers were concatenated with a special token in between, and loss was computed only over the answer tokens.
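Computing the loss only over answer tokens amounts to masking the per-token negative log-likelihood. The sketch below uses made-up token probabilities purely for illustration; `masked_nll` is a hypothetical helper name.

```python
import math

def masked_nll(token_logprobs, is_answer_token):
    """Average negative log-likelihood over answer tokens only."""
    kept = [-lp for lp, keep in zip(token_logprobs, is_answer_token) if keep]
    return sum(kept) / len(kept)

# Hypothetical log-probabilities for: prompt (3 tokens), separator (1), answer (2).
logprobs = [math.log(0.1), math.log(0.2), math.log(0.3),
            math.log(0.9), math.log(0.5), math.log(0.25)]
mask = [False, False, False, False, True, True]
loss = masked_nll(logprobs, mask)
```

Prompt and separator tokens still pass through the model as context, but they contribute no gradient, so the model is not trained to reproduce user prompts.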
Meta trained two separate reward models on top of the SFT checkpoint, one optimized for helpfulness and one for safety.
Using two reward models rather than a single combined one allowed Meta to manage the well-known tension between safety and helpfulness, where overly cautious models refuse legitimate requests and overly helpful models generate unsafe content. The two scores were combined at PPO time using a piecewise function that weighted the safety score more heavily when responses fell below a safety threshold [1].
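One simple form such a piecewise rule could take is shown below. This is a sketch under stated assumptions, not the exact function used in training: the name `combined_reward`, the `prompt_is_safety_sensitive` flag, and the default threshold value are illustrative.

```python
def combined_reward(helpfulness, safety, prompt_is_safety_sensitive, safety_threshold=0.15):
    """Use the safety score when safety is at stake, otherwise the helpfulness score.

    The threshold default is illustrative; both scores are assumed to lie in [0, 1].
    """
    if prompt_is_safety_sensitive or safety < safety_threshold:
        return safety
    return helpfulness
```

For example, a response scored 0.9 for helpfulness but 0.05 for safety receives a combined reward of 0.05, so the policy cannot earn a high reward through an unsafe but helpful answer.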
The reward models were trained on human preference data collected through a process in which annotators compared pairs of model responses and selected the one they preferred. The annotation effort was substantial: by the time of the paper, Meta had collected 1,418,091 internal binary comparisons, supplemented by seven publicly available preference datasets bringing the total above 2.9 million comparisons [1]. Average dialog depths varied between 1.0 and 3.9 turns depending on source.
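Binary comparisons of this kind are commonly turned into a training signal with a pairwise ranking loss. The sketch below shows the standard form; the optional `margin` argument is an assumption added for illustration and defaults to zero.

```python
import math

def pairwise_ranking_loss(reward_chosen, reward_rejected, margin=0.0):
    """Return -log(sigmoid(r_chosen - r_rejected - margin)), computed stably."""
    diff = reward_chosen - reward_rejected - margin
    # softplus(-diff) equals -log(sigmoid(diff)) and avoids overflow for large |diff|.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

The loss is small when the reward model scores the annotator-preferred response well above the rejected one, and grows roughly linearly when the ordering is reversed.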
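The RLHF process was iterative, spanning five successive versions (RLHF-V1 through RLHF-V5). Each iteration refined the model's behavior based on updated reward models and new preference data collected from the latest checkpoint. Meta employed two complementary techniques: rejection sampling fine-tuning, in which several responses are sampled per prompt and the one scored highest by the reward model is used as a new fine-tuning target, and Proximal Policy Optimization (PPO), a policy-gradient method that updates the model directly against the reward signal.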
Meta applied rejection sampling fine-tuning for the first four rounds, then followed with PPO in the fifth round. This sequential combination allowed the model to benefit from both the broad quality improvements of rejection sampling and the targeted optimization of PPO [1]. Notably, only the 70B model used rejection sampling at full strength; smaller models inherited responses from the 70B teacher in a form of sequence-level distillation, which the paper credits with closing much of the capability gap between the 7B/13B chat models and the 70B chat model.
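Rejection sampling fine-tuning can be sketched as a best-of-k selection loop. The `generate` and `reward` callables below are toy stand-ins for the policy and the reward model, and all names are hypothetical.

```python
import random

def rejection_sample(prompt, generate, reward, k=4):
    """Draw k candidate responses and return the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda response: reward(prompt, response))

def build_finetuning_set(prompts, generate, reward, k=4):
    """Pair each prompt with its best-of-k response for the next fine-tuning round."""
    return [(p, rejection_sample(p, generate, reward, k)) for p in prompts]

# Toy stand-ins: the "policy" emits a random-length reply, the "reward model" prefers longer ones.
rng = random.Random(0)
toy_generate = lambda prompt: "ok" * rng.randint(1, 5)
toy_reward = lambda prompt, response: len(response)
dataset = build_finetuning_set(["hello", "bye"], toy_generate, toy_reward, k=8)
```

The resulting prompt-response pairs are then used as ordinary supervised fine-tuning data. Fine-tuning a smaller model on pairs produced by a larger one is what gives the distillation effect described above.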
Meta introduced a technique called Ghost Attention (GAtt) to help the model follow system-level instructions consistently throughout a multi-turn conversation. Without GAtt, chat models tend to drift away from the system prompt as the conversation grows longer, since the system message becomes a smaller fraction of the visible context. GAtt works by synthetically inserting the system message at multiple turns during training while masking out its tokens in the loss, teaching the model to attend to the original instruction even when many turns separate it from the current generation [1]. The paper reports that GAtt produced near-perfect adherence to constraints (such as "always reply in haiku" or "never mention apples") for at least 20 turns, compared to roughly 4 turns without GAtt.
The default system prompt used for Llama 2-Chat establishes the model's intended behavior:
"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature." [5]
The prompt continues with instructions for the model to acknowledge when it does not know an answer and to avoid sharing false information. Users and developers deploying Llama 2-Chat could replace this default with custom instructions, a flexibility that proved important for commercial adoption.
The verbose default prompt drew criticism in the months after release for producing models that refused benign requests (for example, refusing to give a recipe for killing a process in Linux because it pattern-matched on the word "kill"). Meta acknowledged this behavior in the paper as the safety/helpfulness trade-off, and community fine-tunes such as those by NousResearch and Eric Hartford became popular partly because they used less restrictive system prompts. Llama 2-Chat also follows a specific INST/SYS token format defined by Meta:
```text
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>

What is the capital of France? [/INST] The capital of France is Paris. </s>
```
Mixing up this format (for example, omitting the SYS tags or running them on the wrong turn) silently degraded quality, and was a common source of bug reports in the weeks after release. Hugging Face's tokenizer.apply_chat_template and llama.cpp's --chat-template flag eventually standardized formatting, but raw inference scripts that hand-rolled the prompt frequently produced subtly worse responses than the reported benchmarks suggested.
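A hand-rolled prompt builder that follows this layout might look like the sketch below. It writes `<s>` and `</s>` as literal text to mirror the example above; in real inference code those markers are normally added as special tokens by the tokenizer, so treat this as an illustration of the layout rather than a drop-in replacement for a chat template. The function name `build_llama2_prompt` is hypothetical.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_llama2_prompt(system, turns):
    """Format a conversation; turns is a list of (user, assistant) pairs.

    The assistant entry of the final pair may be None when a reply is still to be generated.
    """
    parts = []
    for i, (user, assistant) in enumerate(turns):
        if i == 0 and system:
            # The system message is folded into the first user turn only.
            user = f"{B_SYS}{system}{E_SYS}{user}"
        parts.append(f"<s>{B_INST} {user.strip()} {E_INST}")
        if assistant is not None:
            parts.append(f" {assistant.strip()} </s>")
    return "".join(parts)

prompt = build_llama2_prompt("Be brief.", [("What is the capital of France?", None)])
```

Keeping the system block inside the first `[INST]` segment, and nowhere else, avoids the misplaced-tag problems described above.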
Serving the 70B chat model in production typically required at least two NVIDIA A100-80GB GPUs in fp16, or one A100 in 4-bit quantization. Together AI, Anyscale, and Replicate published price points of roughly $0.65 to $1.00 per million output tokens for Llama 2-Chat 70B in the months after release, which was about 30% of GPT-3.5-turbo's then-current rate and made Llama 2 attractive for high-volume use cases. The 7B chat variant ran on a single consumer GPU (an RTX 3090 or 4090) at acceptable latency, and the GGUF-quantized version of 7B chat ran on Apple silicon laptops at 8 to 30 tokens per second, depending on quantization level.
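The hardware requirements follow from the size of the weights alone. The sketch below ignores the KV cache, activations, and quantization overhead, so real deployments need some headroom beyond these figures; `weight_memory_gb` is an illustrative helper.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for the weights alone in GB (10^9 bytes); excludes KV cache and activations."""
    return params_billion * bits_per_weight / 8

fp16_70b = weight_memory_gb(70, 16)  # 140.0 GB: exceeds a single 80 GB GPU
int4_70b = weight_memory_gb(70, 4)   # 35.0 GB: fits on a single 80 GB GPU
fp16_7b = weight_memory_gb(7, 16)    # 14.0 GB: within a 24 GB consumer GPU
```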
LLaMA 2 demonstrated strong performance across academic benchmarks, consistently outperforming other open-source models available at the time of release. The paper reports both standard pretrained-model benchmarks and human evaluations of the chat variants.
| Benchmark | Llama 2 7B | Llama 2 13B | Llama 2 70B | Llama 1 65B | GPT-3.5 | GPT-4 |
|---|---|---|---|---|---|---|
| MMLU (5-shot) | 45.3 | 54.8 | 68.9 | 63.4 | 70.0 | 86.4 |
| GSM8K (8-shot) | 14.6 | 28.7 | 56.8 | 50.9 | 57.1 | 92.0 |
| HumanEval (pass@1) | 12.8 | 18.3 | 29.9 | 23.7 | 48.1 | 67.0 |
| TruthfulQA (% true & info) | 33.3 | 41.9 | 50.2 | 43.4 | 47.0 | n/a |
| BIG-Bench Hard (3-shot) | 32.6 | 39.4 | 51.2 | 44.5 | n/a | n/a |
| AGIEval | 21.8 | 28.5 | 40.0 | 31.6 | n/a | n/a |
| TriviaQA (1-shot) | 72.1 | 79.6 | 85.0 | 84.6 | n/a | n/a |
| NaturalQuestions (1-shot) | 25.7 | 31.8 | 33.0 | 32.5 | n/a | n/a |
The 70B model achieved 68.9 on MMLU, approaching GPT-3.5's 70.0 and improving over LLaMA 1 65B by 5.5 points. On mathematical reasoning the gain was slightly larger: GSM8K rose from 50.9 (Llama 1 65B) to 56.8 (Llama 2 70B), a 5.9-point improvement. Code generation (HumanEval) remained a relative weakness at 29.9, well below GPT-3.5's 48.1 and further still from GPT-4's 67.0. Code Llama was released a month later partly to close this gap [1][6][7].
For the chat variants, Meta conducted human evaluations on roughly 4,000 prompts spanning helpfulness and safety, with three annotators per prompt. The headline result was that Llama 2-Chat 70B was statistically tied with ChatGPT (gpt-3.5-turbo, March 2023 snapshot) on helpfulness, with a 36% win rate, 31.5% tie rate, and 32.5% loss rate against ChatGPT [1].
On the unreleased Llama 2-Chat 34B, the helpfulness win rate exceeded 75% against the open-source baselines Falcon-40B-instruct and Vicuna-33B. Meta also ran a parallel GPT-4-as-judge evaluation that broadly confirmed the human results, although with somewhat higher variance.
Independent third-party evaluations followed quickly. LMSYS's Chatbot Arena, an Elo-style human preference ranking, placed Llama 2-Chat 70B in the top tier among open-weight models for the rest of 2023, with an Arena Elo around 1100 by late September. MT-Bench, a multi-turn evaluation harness developed at LMSYS, reported a score of 6.86 for Llama 2-Chat 70B compared to GPT-4's 8.99 and ChatGPT's 7.94. AlpacaEval, an automated win-rate benchmark, scored the same model at 92.7% relative to text-davinci-003. These third-party numbers were broadly consistent with Meta's internal reports and helped legitimize the comparison with closed models.
The alignment process produced large improvements on safety benchmarks:
The paper devoted nearly as much space to safety methodology as to capability training, including extensive red-teaming with both internal and external annotators. More than 350 people participated in adversarial red-teaming exercises, generating around 2,000 adversarial prompts that were used to evaluate and refine the safety reward model [1].
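The Llama 2 Community License was one of the most consequential aspects of the release. Unlike LLaMA 1 (which was research-only), LLaMA 2 was released under a license that explicitly permitted commercial use. The key terms included a grant of commercial use rights, compliance with an Acceptable Use Policy, and a requirement that companies with more than 700 million monthly active users (MAU) obtain a separate license from Meta.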
The license was not technically open source by the Open Source Initiative's definition, because it imposed both use restrictions (the Acceptable Use Policy) and the 700M MAU threshold. The OSI publicly stated in 2023 that the Llama 2 license did not meet the Open Source Definition. Critics, including linguist Mark Dingemanse and the policy think tank Open Future, argued that calling LLaMA 2 "open source" was misleading because Meta did not release training data and provided only partial transparency about the data pipeline [10].
Meta and others in the industry countered that the license was far more permissive than anything previously offered at this model quality, and that "open weights" was the more accurate descriptor. The terminological debate sharpened in 2024 when the OSI published a formal Open Source AI Definition (OSAID) that required, among other things, sufficient information to recreate the model from scratch, a bar that no commercial frontier model met as of 2026.
Regardless of the debate, the practical effect was transformative. Thousands of developers and companies began building on LLaMA 2 within weeks of release, and the license became a template that influenced subsequent open-weight releases from other organizations.
Meta designated Microsoft as the "preferred partner" for LLaMA 2, and the two companies announced their expanded AI partnership on the same day as the model release. The partnership had several components:
The Microsoft partnership was unusual because Microsoft was simultaneously the primary backer and largest investor in OpenAI, which operated a closed-model strategy. By partnering with Meta on open models, Microsoft hedged its position, ensuring that Azure customers could access both closed (OpenAI) and open (Meta) ecosystems through the same control plane [12].
On August 24, 2023, roughly five weeks after the LLaMA 2 release, Meta introduced Code Llama, a family of code-specialized language models built on the LLaMA 2 foundation. Code Llama was created by further training the base LLaMA 2 models on 500 billion tokens of code and code-related data, with additional long-context fine-tuning [13].
| Model | Parameters | Specialization | FIM | Context | Release |
|---|---|---|---|---|---|
| Code Llama 7B | 7B | General code | Yes | 16,384 | Aug 2023 |
| Code Llama 13B | 13B | General code | Yes | 16,384 | Aug 2023 |
| Code Llama 34B | 34B | General code | No | 16,384 | Aug 2023 |
| Code Llama 70B | 70B | General code | No | 16,384 | Jan 2024 |
| Code Llama - Python 7B/13B/34B/70B | 7-70B | Python-specific | Mixed | 16,384 | Aug 2023 / Jan 2024 |
| Code Llama - Instruct 7B/13B/34B/70B | 7-70B | Instruction-following | Mixed | 16,384 | Aug 2023 / Jan 2024 |
Three variant types were released for each size: the general-purpose Code Llama base model, Code Llama - Python, and Code Llama - Instruct.
The 7B and 13B variants supported fill-in-the-middle (FIM) capability, allowing them to insert code into existing files given surrounding context, which made them suitable for IDE autocomplete. All Code Llama models supported a 16,384-token context window, four times that of base LLaMA 2, achieved through additional long-context fine-tuning with adjusted RoPE frequencies (theta = 1,000,000 instead of 10,000) [13].
Code Llama - Python 34B scored 53.7 on HumanEval (pass@1), almost double the base LLaMA 2 70B's 29.9, demonstrating the value of domain-specific continued pretraining. The Code Llama - Instruct 70B variant added in January 2024 reached 67.8 pass@1, briefly making it the highest-scoring open-weight code model.
Code Llama also seeded a wave of community fine-tunes. Phind-CodeLlama-34B (from the Phind search startup) reportedly matched GPT-4 on HumanEval at 73.8 pass@1; WizardCoder-34B from Microsoft Research used Evol-Instruct to push the same base above 70 pass@1; and DeepSeek's first code model began life as a Code Llama derivative before its team retrained from scratch. The released-but-not-flagship 34B parameter point in the Code Llama lineup partially compensated for Meta's decision not to release a 34B base text model.
LLaMA 2 became the most fine-tuned foundation model of 2023. By the time Llama 3 was announced in April 2024, more than 60,000 derivative models based on the LLaMA family had been uploaded to Hugging Face [3]. Below is a non-exhaustive sample of notable derivatives.
| Derivative | Creator | Base | Notes |
|---|---|---|---|
| Vicuna v1.5 | LMSYS | Llama 2 7B/13B | ShareGPT-style dialogues, 4K and 16K context variants |
| WizardLM-2 (early) | WizardLM team | Llama 2 7B/13B/70B | Evol-Instruct synthetic data, strong MT-Bench scores |
| Nous-Hermes-2 | NousResearch | Llama 2 13B/70B | GPT-4 distilled instructions |
| OpenChat 3.5 | OpenChat team | Llama 2 7B | C-RLFT fine-tuning, MT-Bench above 7.0 |
| Tulu 2 | AI2 | Llama 2 7B/13B/70B | Diverse instruction mix with DPO alignment |
| Llama-2-7B-32K | Together AI | Llama 2 7B | RoPE rescaling for 32K context |
| Code Llama variants | Meta | Llama 2 7B/13B/34B/70B | Official code specializations |
| MedAlpaca, Meditron | Stanford / EPFL | Llama 2 7B/13B/70B | Medical question answering |
| OpenLLaMA | Berkeley AI Research | n/a (re-pretrain) | Apache 2.0 reproduction trained from scratch |
| LLaVA-1.5 | UW-Madison / Microsoft | Llama 2 7B/13B | Visual instruction tuning, image inputs |
| Llama Guard | Meta | Llama 2 7B | Input/output safety classifier shipped with Purple Llama |
Many of these derivatives took the top spots on the Hugging Face Open LLM Leaderboard during late 2023 and early 2024, and they collectively defined what "open-weight chatbot" meant in this period. The community also developed runtime tools that made LLaMA 2 unusually portable, including llama.cpp (Georgi Gerganov's C++ port that runs the 7B on CPUs and Apple silicon), GGML/GGUF quantization formats, ExLlama, vLLM, and Text Generation Inference.
In December 2023, Meta launched Purple Llama, an umbrella initiative for open trust-and-safety tools built on top of Llama 2. The initial release included Llama Guard, a 7B classifier fine-tuned to detect unsafe content in both inputs and outputs against a six-category taxonomy (violence, sexual content, criminal planning, weapons, regulated substances, and self-harm). Purple Llama also shipped CyberSecEval, a benchmark suite for testing whether code-generating LLMs produce insecure code or assist in cyber-offense tasks. The Purple Llama project was significant because it gave deployers a way to compose a model with an open safety filter rather than relying solely on a proprietary moderation API.
In September 2023, Meta researchers published a follow-up paper on "Effective Long-Context Scaling of Foundation Models" [14] describing Llama 2 Long, a continuation pretraining run that extended the context window from 4K to 32K tokens. Llama 2 Long modified the RoPE base frequency and trained on an additional 400 billion tokens with longer sequences. Although the weights were not publicly released, the paper became influential as a recipe for context extension and informed both Code Llama's long context and later Llama 3 work.
LLaMA 2's release had an outsized impact on the open AI ecosystem. Several factors contributed:
Within the first ten days of release (July 18 to 28, 2023), early adopters demonstrated successful implementations spanning model deployment, chatbot development, multilingual fine-tuning, domain-specific adaptation (including medical applications), and runtime optimization for resource-constrained environments [15]. The pace of adoption reflected both the quality of the models and the pent-up demand for commercially usable open weights.
The commercial license established a precedent that other model developers followed. Mistral AI's decision to release Mistral 7B and Mixtral under the Apache 2.0 license, and the broader trend toward open-weight releases from companies like Alibaba (Qwen), 01.AI (Yi), and DeepSeek, were all influenced by LLaMA 2's demonstration that open distribution could be commercially viable. By 2025 a majority of widely-used non-frontier LLMs were distributed under either Apache 2.0 or a Llama-style community license.
Meta reported that Llama usage (across all versions) grew 10x from January to July 2024, with token volume among major cloud providers more than doubling between May and July 2024 [3]. Cumulative downloads of Llama models passed 400 million by July 2024 and exceeded 600 million by early 2025. Meta also issued more than $2 million in Llama Impact Grants and Awards to support community projects.
LLaMA 2's release prompted immediate policy discussion. In the United States, the model became a reference case in debates over how the Biden administration's October 2023 Executive Order on AI should treat dual-use foundation models with widely available weights. The National Telecommunications and Information Administration (NTIA) ran a public comment period in early 2024 specifically asking whether "open foundation model weights" should be subject to additional reporting or restriction. In the United Kingdom, the AI Safety Institute used Llama 2-Chat as one of its early reference models for evaluation methodology. Within the European Union, LLaMA 2 was cited during AI Act trilogue negotiations as an example of a model whose distribution would not fit neatly into either the "general-purpose AI model" or the "high-risk AI system" categories then being drafted. Meta's policy team argued in submissions to all three jurisdictions that openness should be treated as a feature for safety, not a risk factor, citing the security benefits of independent red-teaming.
| Feature | LLaMA 1 (Feb 2023) | LLaMA 2 (Jul 2023) |
|---|---|---|
| Model sizes | 7B, 13B, 33B, 65B | 7B, 13B, 70B (34B trained, not released) |
| Training data | 1.4T tokens | 2.0T tokens (40% more) |
| Context window | 2,048 tokens | 4,096 tokens |
| License | Research only (gated) | Llama 2 Community License (commercial) |
| Chat variants | None (community-created) | Official Llama 2-Chat (SFT + RLHF) |
| RLHF alignment | None | 5 iterations (RS x4, then PPO) |
| GQA support | No | Yes (70B model) |
| MMLU (largest) | 63.4 (65B) | 68.9 (70B) |
| HumanEval (largest) | 23.7 (65B) | 29.9 (70B) |
| GSM8K (largest) | 50.9 (65B) | 56.8 (70B) |
| Code variants | None | Code Llama (Aug 2023) |
| Disclosed compute | partial | full per-variant breakdown |
| Distribution | Gated research access (then leaked) | Open download + cloud catalogs |
The most impactful differences were the commercial license and the RLHF-aligned chat variants. LLaMA 1 required the community to create its own chat-tuned versions (Alpaca, Vicuna, etc.), which varied widely in quality and safety. Llama 2-Chat provided an official baseline that developers could use directly or further customize.
| Feature | Llama 2 70B (Jul 2023) | Llama 3 70B (Apr 2024) | Llama 4 Maverick (Apr 2025) |
|---|---|---|---|
| Tokenizer vocab | 32,000 | 128,000 | 128,000 (extended) |
| Pretraining tokens | 2T | 15T | >30T |
| Native context | 4K | 8K (128K with 3.1) | up to 10M |
| Architecture | Dense decoder | Dense decoder | Mixture-of-Experts |
| MMLU | 68.9 | 82 | 89+ |
| HumanEval | 29.9 | 81.7 | 90+ |
| Multimodal | Text only | Text only (3.2 added vision) | Native multimodal |
| License | Llama 2 Community | Llama 3 Community | Llama 4 Community |
Despite its strengths, LLaMA 2 had several notable limitations:
LLaMA 2 served as the foundation for Meta's continued investment in open AI. Its successors built directly on its codebase and conventions:
As of early 2026, LLaMA 2 weights remain freely downloadable and are still used in some production systems, particularly where regulatory or supply-chain audits favor a well-understood older model over a frontier one. The 7B variant in particular continues to appear in research papers as a standard baseline, and llama.cpp's GGUF distribution of Llama 2 7B chat is one of the most-downloaded GGUF files on Hugging Face. For new deployments, however, the Llama 3 and Llama 4 families offer substantially better performance across all benchmarks, and most active users have migrated.
LLaMA 2's most lasting contribution is not the models themselves but the precedent they set. By demonstrating that a major technology company could release high-quality models under a permissive license and still benefit strategically, Meta shifted the industry's expectations about openness. The model proved that open and commercial were not opposing goals, and that proof has continued to shape how AI models are developed, distributed, and regulated.