LLaMA 2

LLaMA 2 (Large Language Model Meta AI 2) is a family of large language models developed and released by Meta in July 2023. As the second generation of Meta's LLaMA series, it represented a turning point in the open-weight AI movement by making powerful foundation models freely available for both research and commercial use. The release included pretrained base models and fine-tuned chat variants at three sizes (7 billion, 13 billion, and 70 billion parameters), all trained on 2 trillion tokens of publicly available data. LLaMA 2 was accompanied by an unusually detailed research paper [1] and a new community license that allowed commercial deployment subject to certain conditions. Together with a high-profile partnership with Microsoft announced on the same day, the release positioned Meta as one of the leading advocates for open AI development [2].

LLaMA 2 quickly became the default foundation model for thousands of academic papers, startup products, and community fine-tunes throughout late 2023 and early 2024, before being superseded by Llama 3 in April 2024 [3]. Even after newer Meta models arrived, LLaMA 2 has remained a common teaching example because its architecture is well documented, its weights are freely downloadable, and its size range covers everything from a laptop-friendly 7B to a research-cluster 70B.

background and release

Meta released the original LLaMA (LLaMA 1) on February 24, 2023, initially restricting access to researchers through an application process. The model weights were leaked online within a week of the announcement, and the resulting community activity demonstrated demand for openly available large language models that closed APIs could not satisfy. Fine-tuned derivatives such as Stanford's Alpaca and LMSYS's Vicuna appeared within weeks, showing that even relatively small open models could be adapted for a wide range of tasks at low cost.

Building on this experience, Meta took a different approach with LLaMA 2. Rather than limiting distribution to researchers, the company released the model weights openly on July 18, 2023, alongside a permissive community license that explicitly permitted commercial use. The announcement was made jointly with Microsoft at Microsoft's annual Inspire partner conference, and Microsoft chief executive Satya Nadella publicly endorsed the partnership on stage [2]. By this point Meta said it had received more than 100,000 access requests for LLaMA 1, an indicator of latent demand that Meta cited as one motivation for an open release [2].

The timing of the release was significant. In mid-2023, OpenAI's GPT-4 and ChatGPT dominated public attention, and the prevailing narrative in the industry favored closed, proprietary models. By releasing LLaMA 2 with commercial permissions, Meta challenged that narrative directly and gave developers, startups, and enterprises a competitive open alternative. Mark Zuckerberg framed the strategy in subsequent earnings calls and interviews as a way to commoditize the input layer that Meta's competitors profited from selling.

authorship

The accompanying paper, "Llama 2: Open Foundation and Fine-Tuned Chat Models" (arXiv:2307.09288), was published on July 18, 2023, the same day as the model release. The lead authors were Hugo Touvron, Louis Martin, Kevin Stone, and Peter Albert, with more than 50 additional contributors from Meta's GenAI team. Touvron had also been the lead author of the original LLaMA paper, providing continuity between the two model families [1]. The paper itself runs to 76 pages and reads more like a tech report than a conventional academic submission, with extensive appendices on data filtering, annotation guidelines, safety taxonomy, and example outputs. Several Meta researchers reported on social media that the level of disclosure was deliberately calibrated to be reproducible by other research groups working at smaller scale, even if the data could not be shared.

naming and capitalization

Meta uses the spelling "Llama 2" in title case in its own documentation and in the title of the accompanying paper, but much of the academic literature continues to use "LLaMA 2", carrying over the backronym "Large Language Model Meta AI" from the original LLaMA paper. Both spellings are in common use and are interchangeable in practice. Starting with Llama 3, Meta dropped the all-caps form entirely, and "Llama" is now the canonical product name.

model variants

LLaMA 2 was released in three parameter sizes, each available as both a pretrained base model and a fine-tuned chat model optimized for dialogue. A 34B parameter variant was also trained but withheld from public release because Meta felt it had not been sufficiently red-teamed for safety [1]. The 34B size was eventually released in modified form as part of Code Llama the following month.

Model	Parameters	Layers	Heads	KV Heads	Hidden Dim	Context	Attention	Public Release
Llama 2 7B	7B	32	32	32	4,096	4,096	MHA	Base + Chat
Llama 2 13B	13B	40	40	40	5,120	4,096	MHA	Base + Chat
Llama 2 34B	34B	48	64	8	8,192	4,096	GQA	Withheld
Llama 2 70B	70B	80	64	8	8,192	4,096	GQA	Base + Chat

The base models were designed for general-purpose text generation and could be fine-tuned for specific downstream tasks. The chat variants (Llama 2-Chat) were tuned for multi-turn dialogue through supervised fine-tuning and reinforcement learning from human feedback (RLHF).

All three publicly released sizes shared the same transformer decoder-only backbone, with the principal difference that the 70B (and unreleased 34B) used grouped-query attention rather than standard multi-head attention. Grouped-query attention shares each set of key and value projections across multiple query heads, reducing memory bandwidth requirements during inference and improving throughput at the largest scale [1].

architecture

LLaMA 2 retained the core architectural choices of LLaMA 1 while making targeted improvements aimed at long-context performance and inference efficiency. The model is a standard decoder-only transformer with several specific design choices that have since become common in open-weight LLMs.

pre-normalization with RMSNorm

Like LLaMA 1, the model normalizes the input of each transformer sub-layer rather than its output, a pre-normalization scheme popularized by GPT-3, and uses RMSNorm (Root Mean Square Layer Normalization, Zhang and Sennrich 2019) as the normalizing function. Pre-normalization improves training stability, particularly at scale, by keeping the residual stream's variance bounded as gradients flow back through deep networks. RMSNorm itself is a simplified variant of LayerNorm that drops the mean-centering step and the learnable bias, reducing parameter count and compute per layer while empirically matching full LayerNorm's quality.
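
The normalization described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than Meta's implementation: the function name `rms_norm`, the toy input, and the epsilon value are assumptions chosen for the example.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Divide x by its root mean square, then apply a learned per-channel gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    eps is an assumed small constant for numerical stability.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

# Toy example: a 4-dimensional activation with unit gains.
hidden = [1.0, -2.0, 3.0, -4.0]
gain = [1.0] * len(hidden)
normed = rms_norm(hidden, gain)
```

After normalization the vector keeps its direction and signs but has a root mean square close to 1, which is what keeps the scale of each sub-layer's input stable.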

SwiGLU activation

The feed-forward network uses the SwiGLU activation introduced by Noam Shazeer in 2020. SwiGLU combines a gated linear unit with the Swish (also known as SiLU) activation, multiplying two linear projections of the input where one is passed through Swish. Empirically this provides better performance than standard ReLU or GELU activations at the same parameter budget, with a small constant cost from the extra projection. To keep the parameter count of the FFN comparable to a standard GELU MLP, the LLaMA 2 design uses an intermediate dimension of approximately 8/3 times the hidden dimension instead of the conventional 4x.
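
A minimal sketch of the gated feed-forward computation follows, written with plain Python lists so that it runs without dependencies. The helper names (`silu`, `matvec`, `swiglu_ffn`) and the tiny weight matrices are illustrative assumptions, not values from the model.

```python
import math

def silu(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: down( silu(gate(x)) * up(x) )."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)

# Toy sizes: hidden dimension 2, intermediate dimension 3.
x = [1.0, 2.0]
w_gate = [[0.5, -0.1], [0.2, 0.3], [-0.4, 0.1]]
w_up = [[0.1, 0.2], [-0.3, 0.4], [0.6, -0.5]]
w_down = [[0.2, -0.1, 0.3], [0.4, 0.5, -0.2]]
out = swiglu_ffn(x, w_gate, w_up, w_down)
```

The block uses three weight matrices where a conventional MLP uses two, which is why the intermediate dimension is shrunk to keep the overall parameter count comparable.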

rotary position embeddings

Instead of absolute or learned positional encodings, LLaMA 2 uses rotary position embeddings (RoPE), introduced by Su et al. (2021). RoPE encodes position by rotating the query and key vectors at each layer through angles that depend on token position, so that the dot product between two rotated vectors naturally encodes their relative position. RoPE generalizes more gracefully to sequence lengths not seen during training and forms the basis for later context-extension techniques like NTK-aware and YaRN scaling.
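
The relative-position property can be checked numerically with a small sketch. The code below assumes the common formulation in which adjacent pairs of dimensions are rotated with frequencies theta^(-i/d); the function names and the toy vectors are hypothetical, and real implementations differ in how they pair dimensions.

```python
import math

def rope(vec, position, theta=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = position * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (3 tokens) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

Both scores agree up to floating-point error, because each rotated dot product depends only on the difference between the two positions.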

grouped-query attention (70B only)

The 70B model uses grouped-query attention (GQA, Ainslie et al. 2023) with 8 key-value heads shared across 64 query heads, an 8x reduction in the size of the key-value cache compared to full multi-head attention. The smaller cache makes long-sequence inference dramatically more memory-efficient, which becomes critical when serving the 70B model at production scale. GQA sits between standard multi-head attention (one KV pair per query head) and multi-query attention (a single KV pair shared across all query heads), trading a small quality loss for a large efficiency gain. The 7B and 13B models, which fit more easily in GPU memory, retain conventional multi-head attention.
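
The 8x figure follows directly from the cache-size arithmetic. The sketch below is a back-of-the-envelope estimate for a single 4,096-token sequence, assuming 16-bit cache entries and a head dimension of hidden size divided by query heads; the function name is illustrative.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for storing both K and V.
    bytes_per_value=2 assumes fp16/bf16 cache entries.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

head_dim = 8192 // 64  # hidden size / query heads for the 70B model

gqa_bytes = kv_cache_bytes(layers=80, kv_heads=8, head_dim=head_dim, seq_len=4096)
mha_bytes = kv_cache_bytes(layers=80, kv_heads=64, head_dim=head_dim, seq_len=4096)

print(f"GQA cache: {gqa_bytes / 2**30:.2f} GiB per sequence")
print(f"MHA cache: {mha_bytes / 2**30:.2f} GiB per sequence")
```

Under these assumptions the grouped-query cache is 1.25 GiB per full-length sequence against 10 GiB for a hypothetical multi-head 70B, a difference that multiplies with every concurrent request.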

context length

All models support a context window of 4,096 tokens, double the 2,048-token context of LLaMA 1. The longer window allows the model to process roughly 3,000 words of input plus output and was a key enabler for the chat use case, where multi-turn conversations and long system prompts can quickly exhaust shorter contexts.

other choices

The architecture does not use bias terms in the linear layers, a choice inherited from LLaMA 1 that slightly reduces parameter count and has been shown not to harm performance. The vocabulary is 32,000 tokens trained with byte-pair encoding (BPE) using the SentencePiece implementation, identical to LLaMA 1. Numbers are split into individual digits and unknown UTF-8 characters fall back to byte-level decomposition, which improves robustness on technical and multilingual text.

# Approximate hidden / feed-forward dimensions per variant
# (intermediate_size ~= 8/3 * hidden_size, rounded up for hardware alignment;
#  the 70B applies an extra multiplier, giving 3.5 * hidden_size)
LLAMA2_CONFIG = {
    '7B':  {'layers': 32, 'heads': 32, 'kv_heads': 32, 'hidden': 4096, 'ffn': 11008},
    '13B': {'layers': 40, 'heads': 40, 'kv_heads': 40, 'hidden': 5120, 'ffn': 13824},
    '34B': {'layers': 48, 'heads': 64, 'kv_heads':  8, 'hidden': 8192, 'ffn': 22016},
    '70B': {'layers': 80, 'heads': 64, 'kv_heads':  8, 'hidden': 8192, 'ffn': 28672},
}
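
The feed-forward sizes in the released checkpoints can be reproduced with a short rounding rule. The sketch below follows the logic of Meta's reference code as commonly described; the `multiple_of` and `ffn_dim_multiplier` values are believed to match the released configuration files but should be treated as assumptions here.

```python
def ffn_dim(hidden, multiple_of=256, ffn_dim_multiplier=None):
    """Intermediate size: 2/3 of 4*hidden, optionally scaled, rounded up."""
    dim = int(2 * (4 * hidden) / 3)
    if ffn_dim_multiplier is not None:
        dim = int(ffn_dim_multiplier * dim)
    # Round up to the next multiple for hardware-friendly shapes.
    return multiple_of * ((dim + multiple_of - 1) // multiple_of)

ffn_7b = ffn_dim(4096)
ffn_13b = ffn_dim(5120)
# Assumed 70B settings: a 1.3x multiplier and rounding to multiples of 4096.
ffn_70b = ffn_dim(8192, multiple_of=4096, ffn_dim_multiplier=1.3)
```

The extra multiplier explains why the 70B ratio of feed-forward to hidden size is 3.5 rather than roughly 8/3.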

training

pretraining data

LLaMA 2 was pretrained on 2 trillion tokens drawn from publicly available sources, a 40% increase over the 1.4 trillion tokens used for LLaMA 1. Meta did not disclose the exact composition of the training data but stated that it included "a new mix of publicly available online data" and that data was filtered to remove sites known to contain high volumes of personal information [1]. The corpus excludes data from any of Meta's own products and services.

The paper's Table 10 reports the language distribution of the pretraining corpus: roughly 89.7% English, an 8.4% bucket of unknown-language tokens (which includes programming code), and the remaining roughly 2% spread across 27 other languages including German, French, Swedish, Chinese, Spanish, Russian, and Dutch. The strong English bias is reflected in downstream performance, where non-English benchmarks lag substantially behind English ones.

The pretraining knowledge cutoff is September 2022, although some of the fine-tuning data extends to July 2023 [4]. The data was processed with the same SentencePiece BPE tokenizer used in LLaMA 1, with a 32,000-token vocabulary.

pretraining procedure

All models were trained with a standard autoregressive language-modeling objective using the AdamW optimizer with beta1 = 0.9, beta2 = 0.95, and weight decay of 0.1. The learning rate followed a cosine schedule with 2,000 warmup steps, decaying to 10% of the peak value. Peak learning rates were 3.0 x 10^-4 for the 7B and 13B variants and 1.5 x 10^-4 for the 70B. Gradient clipping was set to 1.0, and the global batch size was 4 million tokens for all variants [1][4].
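
The schedule can be sketched as a single function. This is an illustration under stated assumptions: the warmup is taken to be linear, the batch is taken to be exactly 4,000,000 tokens so that 2 trillion tokens correspond to 500,000 steps, and the function name is hypothetical.

```python
import math

def learning_rate(step, peak_lr, total_steps, warmup_steps=2000, final_ratio=0.1):
    """Linear warmup followed by cosine decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)

# 2 trillion tokens at an assumed 4 million tokens per step.
TOTAL_STEPS = 2_000_000_000_000 // 4_000_000

for step in (0, 1000, 2000, 250_000, TOTAL_STEPS):
    print(step, learning_rate(step, peak_lr=3.0e-4, total_steps=TOTAL_STEPS))
```

With the 7B and 13B peak of 3.0 x 10^-4, the rate reaches its peak at step 2,000 and ends the run at 3.0 x 10^-5.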

Training was conducted on Meta's Research Super Cluster (RSC) and on internal production clusters, using NVIDIA A100-80GB GPUs. The total compute and carbon impact were disclosed at unusual granularity in the model card [4]:

Model	GPU Hours (A100-80GB)	Power per GPU	CO2 Emissions
Llama 2 7B	184,320	400 W	31.22 tCO2eq
Llama 2 13B	368,640	400 W	62.44 tCO2eq
Llama 2 34B (unreleased)	1,038,336	350 W	153.90 tCO2eq
Llama 2 70B	1,720,320	400 W	291.42 tCO2eq
All variants combined	3,311,616	350-400 W	539.00 tCO2eq

Meta reported that 100% of these emissions were offset through its sustainability program. At a wholesale rate of roughly $1 per A100 GPU-hour in mid-2023, the 70B run alone implied a compute cost north of $1.7 million in commodity terms, although the actual cost to Meta on owned hardware would have been substantially lower.
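
The reported figures can be cross-checked with simple arithmetic. The sketch below uses the GPU-hours, per-GPU power, and emissions from the table above, ignores datacenter overhead such as cooling, and takes the $1 per GPU-hour rate from the preceding paragraph as an assumption.

```python
# (GPU-hours, reported tCO2eq) for the three released sizes.
RUNS = {
    '7B': (184_320, 31.22),
    '13B': (368_640, 62.44),
    '70B': (1_720_320, 291.42),
}
POWER_KW = 0.4           # 400 W per GPU, as reported
USD_PER_GPU_HOUR = 1.0   # assumed commodity rate

def summarize(gpu_hours, tco2eq):
    """Derive energy, implied carbon intensity, and commodity cost for a run."""
    energy_mwh = gpu_hours * POWER_KW / 1000.0
    return {
        'energy_mwh': energy_mwh,
        'tco2eq_per_mwh': tco2eq / energy_mwh,
        'cost_usd': gpu_hours * USD_PER_GPU_HOUR,
    }

summary = {name: summarize(*run) for name, run in RUNS.items()}
for name, row in summary.items():
    print(name, row)
```

All three runs imply nearly the same carbon intensity, a little over 0.42 tCO2eq per MWh, which suggests the emissions column was derived from GPU-hours and power with a single grid factor.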

comparison with LLaMA 1 training

Aspect	LLaMA 1 (Feb 2023)	LLaMA 2 (Jul 2023)
Training tokens	1.4 trillion	2.0 trillion
Context length	2,048	4,096
Largest released model	65B	70B
Attention (largest)	MHA	GQA (8 KV heads)
Tokenizer vocab	32,000 (BPE)	32,000 (BPE, identical)
Commercial license	Research only	Yes, with MAU clause
RLHF alignment	None	Yes (SFT + 5 RLHF rounds)
Disclosed compute	About 1.02M A100-hours for the 65B	3.31M A100-hours total

The 40% increase in training data was one of the most impactful changes. Scaling laws research has consistently shown that training on more tokens improves model quality at a fixed parameter count, and the jump from 1.4T to 2T tokens produced measurable gains across benchmarks even at the same architecture and parameter count.

RLHF alignment

The Llama 2-Chat models underwent an extensive alignment process that combined supervised fine-tuning (SFT) with reinforcement learning from human feedback. The pipeline was described in unusual detail in the paper, making it one of the most transparent published accounts of an industrial-scale RLHF run [1]. Researchers and engineers at other labs treated the 76-page paper as a de facto recipe for replicating large-model alignment.

supervised fine-tuning

The first stage involved supervised fine-tuning on 27,540 high-quality prompt-response pairs written by human annotators. Meta found that a relatively small number of carefully curated examples was more effective than larger sets of lower-quality data, a finding the paper summarizes under the heading "Quality Is All You Need" [1]. The team reported that they began curating their own SFT data after observing that some publicly available instruction-tuning datasets contained noisy or generic responses.

SFT was run for two epochs with a cosine learning rate schedule peaking at 2 x 10^-5, weight decay 0.1, batch size 64, and a 4,096-token sequence length. Prompts and answers were concatenated with a special token in between, and loss was computed only over the answer tokens.
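
Restricting the loss to answer tokens amounts to masking the per-token losses before averaging. The following is a minimal sketch with made-up token probabilities; the function name and the mask layout are assumptions for illustration.

```python
import math

def answer_only_loss(token_log_probs, is_answer_token):
    """Mean negative log-likelihood over answer tokens only.

    Prompt and separator tokens are still fed to the model as context,
    but they contribute nothing to the loss.
    """
    losses = [-lp for lp, keep in zip(token_log_probs, is_answer_token) if keep]
    return sum(losses) / len(losses)

# Hypothetical probabilities the model assigned to each target token.
log_probs = [math.log(p) for p in (0.9, 0.5, 0.2, 0.8, 0.4)]
# prompt, prompt, separator, answer, answer
mask = [False, False, False, True, True]

loss = answer_only_loss(log_probs, mask)
```

Only the last two tokens affect the result, so the model is never rewarded for merely reproducing the prompt.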

reward modeling

Meta trained two separate reward models on top of the SFT checkpoint:

A helpfulness reward model trained to score responses based on how relevant, complete, and useful they are to the user's request.
A safety reward model trained to penalize responses that contain harmful, toxic, unethical, or dangerous content.

Using two reward models rather than a single combined one allowed Meta to manage the well-known tension between safety and helpfulness, where overly cautious models refuse legitimate requests and overly helpful models generate unsafe content. The two scores were combined at PPO time using a piecewise function: the safety score was used in place of the helpfulness score whenever the prompt was tagged as safety-related or the safety score fell below a threshold of 0.15, and the helpfulness score was used otherwise [1].
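
The piecewise combination can be sketched as follows, based on the paper's description in which the safety score takes over for safety-tagged prompts or when it drops below 0.15. The function and argument names are hypothetical, and the paper's subsequent score normalization is omitted.

```python
def combined_reward(helpfulness, safety, is_safety_prompt, safety_threshold=0.15):
    """Select which reward model's score drives the policy update.

    The safety score is used for safety-tagged prompts and for any response
    whose safety score falls below the threshold; otherwise helpfulness is used.
    """
    if is_safety_prompt or safety < safety_threshold:
        return safety
    return helpfulness

# A helpful but unsafe response is scored by its (low) safety score.
unsafe = combined_reward(helpfulness=0.9, safety=0.05, is_safety_prompt=False)
# A safe response to an ordinary prompt is scored on helpfulness.
ordinary = combined_reward(helpfulness=0.9, safety=0.8, is_safety_prompt=False)
```

Because the rule switches between scores instead of averaging them, a very helpful answer cannot compensate for a safety failure.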

The reward models were trained on human preference data collected through a process in which annotators compared pairs of model responses and selected the one they preferred. The annotation effort was substantial: by the time of the paper, Meta had collected 1,418,091 internal binary comparisons, supplemented by seven publicly available preference datasets bringing the total above 2.9 million comparisons [1]. Average dialog depths varied between 1.0 and 3.9 turns depending on source.
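
Pairwise comparisons of this kind are typically turned into a training signal with a binary ranking loss. The sketch below shows that loss with an optional margin term, which the paper adds for pairs that annotators rated as clearly different; the function name and example scores are illustrative.

```python
import math

def ranking_loss(score_chosen, score_rejected, margin=0.0):
    """Binary ranking loss: -log(sigmoid(chosen - rejected - margin)).

    Uses the identity -log(sigmoid(d)) = log(1 + exp(-d)).
    """
    diff = score_chosen - score_rejected - margin
    return math.log1p(math.exp(-diff))

correct_order = ranking_loss(2.0, 0.0)   # reward model agrees with the annotator
wrong_order = ranking_loss(0.0, 2.0)     # reward model disagrees
with_margin = ranking_loss(2.0, 0.0, margin=1.0)
```

The loss shrinks as the preferred response is scored further above the rejected one, and a positive margin demands a larger gap before the loss becomes small.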

iterative RLHF: rejection sampling and PPO

The RLHF process was iterative, spanning five successive versions (RLHF-V1 through RLHF-V5). Each iteration refined the model's behavior based on updated reward models and new preference data collected from the latest checkpoint. Meta employed two complementary techniques:

Rejection sampling: For each prompt, the policy generates K candidate responses (K typically between 8 and 16). The reward model scores all candidates and the highest-scoring response is used as a fine-tuning target. This is a simple but effective method that leverages the reward model as a filter rather than directly optimizing against it, and is robust to reward-model misspecification.
Proximal Policy Optimization (PPO): The standard online RLHF algorithm that adjusts the policy to maximize expected reward while staying close to the SFT model (the reference policy) via a KL penalty. PPO provides finer-grained optimization than rejection sampling but is more sensitive to reward hacking.

Meta applied rejection sampling fine-tuning for the first four rounds, then followed with PPO in the fifth round. This sequential combination allowed the model to benefit from both the broad quality improvements of rejection sampling and the targeted optimization of PPO [1]. Notably, only the 70B model used rejection sampling at full strength; smaller models inherited responses from the 70B teacher in a form of sequence-level distillation, which the paper credits with closing much of the capability gap between the 7B/13B chat models and the 70B chat model.
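
The rejection-sampling step reduces to a best-of-K selection, sketched below. The generator and reward functions are toy stand-ins (a canned list of responses and a length-based score), introduced only so the example runs; in practice they would be the policy model and the trained reward model.

```python
import itertools

def best_of_k(prompt, generate, reward, k=8):
    """Sample k candidates and keep the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda response: reward(prompt, response))

# Toy stand-ins for the policy and the reward model (hypothetical).
_canned = itertools.cycle(['short', 'a longer answer', 'the most detailed answer'])

def toy_generate(prompt):
    return next(_canned)

def toy_reward(prompt, response):
    return len(response)

target = best_of_k('Explain grouped-query attention.', toy_generate, toy_reward, k=8)
```

The selected response then serves as a supervised fine-tuning target, so the reward model acts as a filter on samples instead of supplying gradients directly as it does in PPO.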

ghost attention (GAtt)

Meta introduced a technique called Ghost Attention (GAtt) to help the model follow system-level instructions consistently throughout a multi-turn conversation. Without GAtt, chat models tend to drift away from the system prompt as the conversation grows longer, since the system message becomes a smaller fraction of the visible context. GAtt works by synthetically attaching the instruction to every user message of a multi-turn dialogue, sampling responses from the latest RLHF model under that augmented context, and then fine-tuning on the resulting dialogue with the instruction kept only in the first turn and the loss set to zero on all tokens from earlier turns. This teaches the model to attend to the original instruction even when many turns separate it from the current generation [1]. The paper reports that GAtt produced near-perfect adherence to constraints (such as "always reply in haiku" or "never mention apples") for at least 20 turns, compared to roughly 4 turns without GAtt.
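
The data construction behind GAtt can be sketched as follows. This is a simplified reading of the paper's procedure, not Meta's code: it assumes the assistant replies were already sampled with the instruction attached to every user message, and the function name and segment layout are hypothetical.

```python
def gatt_training_sample(instruction, turns):
    """Build one training dialogue from (user, assistant) turns.

    The instruction is kept only in the first user message, and only the
    final assistant reply is marked as contributing to the loss.
    """
    segments = []
    last = len(turns) - 1
    for i, (user, assistant) in enumerate(turns):
        user_text = f'{instruction} {user}' if i == 0 else user
        segments.append({'role': 'user', 'text': user_text, 'train': False})
        segments.append({'role': 'assistant', 'text': assistant, 'train': i == last})
    return segments

instruction = 'Always reply in haiku.'
dialogue = [
    ('Tell me about rain.', 'Soft rain on the roof / ...'),
    ('And about snow?', 'Snow settles slowly / ...'),
]
sample = gatt_training_sample(instruction, dialogue)
```

The final reply still obeys the instruction even though the instruction appears only once at the start, which is the behavior the fine-tuning step is meant to reinforce.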

Llama 2-Chat system prompt

The default system prompt used for Llama 2-Chat establishes the model's intended behavior:

"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature." [5]

The prompt continues with instructions for the model to acknowledge when it does not know an answer and to avoid sharing false information. Users and developers deploying Llama 2-Chat could replace this default with custom instructions, a flexibility that proved important for commercial adoption.

The verbose default prompt drew criticism in the months after release for producing models that refused benign requests (for example, refusing to give a recipe for killing a process in Linux because it pattern-matched on the word "kill"). Meta acknowledged this behavior in the paper as the safety/helpfulness trade-off, and community fine-tunes such as those by NousResearch and Eric Hartford became popular partly because they used less restrictive system prompts. Llama 2-Chat also follows a specific INST/SYS token format defined by Meta:

<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
<</SYS>>

What is the capital of France? [/INST] The capital of France is Paris. </s>

Mixing up this format (for example, omitting the SYS tags or running them on the wrong turn) silently degraded quality, and was a common source of bug reports in the weeks after release. Hugging Face's tokenizer.apply_chat_template and llama.cpp's --chat-template flag eventually standardized formatting, but raw inference scripts that hand-rolled the prompt frequently produced subtly worse responses than the reported benchmarks suggested.
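
A hand-rolled formatter that reproduces the layout shown above might look like the sketch below. The function name is hypothetical, and writing `<s>` and `</s>` as literal strings is a simplification: these are special tokens that tokenizers normally insert themselves, so a real pipeline should prefer the library-provided chat template.

```python
B_INST, E_INST = '[INST]', '[/INST]'
B_SYS, E_SYS = '<<SYS>>\n', '\n<</SYS>>\n\n'

def build_prompt(system, turns):
    """Format (user, assistant) turns; a final assistant of None leaves the
    prompt open for the model to complete."""
    parts = []
    for i, (user, assistant) in enumerate(turns):
        if i == 0 and system:
            # The system block is folded into the first user message only.
            user = f'{B_SYS}{system}{E_SYS}{user}'
        part = f'<s>{B_INST} {user.strip()} {E_INST}'
        if assistant is not None:
            part += f' {assistant.strip()} </s>'
        parts.append(part)
    return ''.join(parts)

SYSTEM = 'You are a helpful, respectful and honest assistant.'
prompt = build_prompt(SYSTEM, [('What is the capital of France?', None)])
print(prompt)
```

Placing the system block inside the first `[INST]` segment, and nowhere else, is the detail most often missed by ad hoc scripts.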

inference cost

Serving the 70B chat model in production typically required at least two NVIDIA A100-80GB GPUs in fp16, or one A100 in 4-bit quantization. Together AI, Anyscale, and Replicate published price points of roughly $0.65 to $1.00 per million output tokens for Llama 2-Chat 70B in the months after release, which was about 30% of GPT-3.5-turbo's then-current rate and made Llama 2 attractive for high-volume use cases. The 7B chat variant ran on a single consumer GPU (an RTX 3090 or 4090) at acceptable latency, and the GGUF-quantized version of 7B chat ran on Apple silicon laptops at 8 to 30 tokens per second, depending on quantization level.
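
The GPU counts quoted above follow from the size of the weights alone. The sketch below is a rough estimate that ignores the KV cache, activations, and quantization overhead; the function names are illustrative and the 80 GB figure is the A100 capacity mentioned in the text.

```python
import math

def weight_gigabytes(params_billion, bits_per_weight):
    """Memory for the weights only, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

def min_gpus(params_billion, bits_per_weight, gpu_memory_gb=80):
    """Lower bound on GPU count from weight memory alone."""
    return math.ceil(weight_gigabytes(params_billion, bits_per_weight) / gpu_memory_gb)

fp16_70b = weight_gigabytes(70, 16)   # 140 GB
int4_70b = weight_gigabytes(70, 4)    # 35 GB
int4_7b = weight_gigabytes(7, 4)      # 3.5 GB
```

At 16 bits the 70B weights exceed a single 80 GB card, while 4-bit quantization brings them well under it; the 7B at 4 bits is small enough for consumer GPUs and laptops.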

benchmarks and performance

LLaMA 2 demonstrated strong performance across academic benchmarks, consistently outperforming other open-source models available at the time of release. The paper reports both standard pretrained-model benchmarks and human evaluations of the chat variants.

pretrained model benchmarks

Benchmark	Llama 2 7B	Llama 2 13B	Llama 2 70B	Llama 1 65B	GPT-3.5	GPT-4
MMLU (5-shot)	45.3	54.8	68.9	63.4	70.0	86.4
GSM8K (8-shot)	14.6	28.7	56.8	50.9	57.1	92.0
HumanEval (pass@1)	12.8	18.3	29.9	23.7	48.1	67.0
TruthfulQA (% true & info)	33.3	41.9	50.2	43.4	47.0	n/a
BIG-Bench Hard (3-shot)	32.6	39.4	51.2	44.5	n/a	n/a
AGIEval	21.8	28.5	40.0	31.6	n/a	n/a
TriviaQA (1-shot)	72.1	79.6	85.0	84.6	n/a	n/a
NaturalQuestions (1-shot)	25.7	31.8	33.0	32.5	n/a	n/a

The 70B model achieved 68.9 on MMLU, approaching GPT-3.5's 70.0 and improving over LLaMA 1's 65B by about 5.5 points. On mathematical reasoning, the gap was even larger: GSM8K jumped from 50.9 (Llama 1 65B) to 56.8 (Llama 2 70B). Code generation (HumanEval) remained a relative weakness at 29.9, well below GPT-3.5's 48.1, and the gap to GPT-4 (67.0) was even larger. Code Llama was released a month later partly to close this gap [1][6][7].

chat model evaluations

For the chat variants, Meta conducted human evaluations on roughly 4,000 prompts spanning helpfulness and safety, with three annotators per prompt. The headline result was that Llama 2-Chat 70B was statistically tied with ChatGPT (gpt-3.5-turbo, March 2023 snapshot) on helpfulness, with a 36% win rate, 31.5% tie rate, and 32.5% loss rate against ChatGPT [1].

On the unreleased Llama 2-Chat 34B, the helpfulness win rate exceeded 75% against the open-source baselines Falcon-40B-instruct and Vicuna-33B. Meta also ran a parallel GPT-4-as-judge evaluation that broadly confirmed the human results, although with somewhat higher variance.

Independent third-party evaluations followed quickly. LMSYS's Chatbot Arena, an Elo-style human preference ranking, placed Llama 2-Chat 70B in the top tier among open-weight models for the rest of 2023, with an Arena Elo around 1100 by late September. MT-Bench, a multi-turn evaluation harness developed at LMSYS, reported a score of 6.86 for Llama 2-Chat 70B compared to GPT-4's 8.99 and ChatGPT's 7.94. AlpacaEval, an automated win-rate benchmark, scored the same model at 92.7% relative to text-davinci-003. These third-party numbers were broadly consistent with Meta's internal reports and helped legitimize the comparison with closed models.

safety evaluations

The alignment process produced large improvements on safety benchmarks:

TruthfulQA: Llama 2-Chat 70B reached 64.14% "true and informative" responses, compared to 50.18% for the base 70B model.
ToxiGen (lower is better, percentage of toxic generations): Llama 2-Chat 70B reduced toxic outputs to roughly 0%, down from 24.6% for the base model.
BOLD (sentiment regard): Chat variants showed improved sentiment balance across protected demographic groups compared to the base models.

The paper devoted nearly as much space to safety methodology as to capability training, including extensive red-teaming with both internal and external annotators. More than 350 people participated in adversarial red-teaming exercises, generating around 2,000 adversarial prompts that were used to evaluate and refine the safety reward model [1].

Llama 2 community license

The Llama 2 Community License was one of the most consequential aspects of the release. Unlike LLaMA 1 (which was research-only), LLaMA 2 was released under a license that explicitly permitted commercial use. The key terms included:

Free commercial use: Any individual, company, or organization could use, modify, and deploy LLaMA 2 models in commercial products without paying licensing fees.
Monthly active user threshold: Companies whose products had "greater than 700 million monthly active users in the preceding calendar month" as of the LLaMA 2 release date were required to request a separate license from Meta. The threshold targeted only a handful of technology companies (such as Google, Apple, Amazon, TikTok, and a few others) [8].
Attribution requirement: Licensees who redistributed the models were required to include a copy of the license agreement and to retain the notice "Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved." in a "Notice" text file distributed with the materials.
Output restriction: Outputs could not be used to improve any other large language model, with the exception of LLaMA 2 itself or its derivatives. This clause closed off using Llama 2 to train a competing closed model via distillation.
Acceptable use policy: Users were required to comply with Meta's Acceptable Use Policy, which prohibits military, surveillance, election-influence, and disinformation use cases, among others [9].
Patent termination: Filing patent litigation against Meta over the model voids the licensee's rights under the agreement.

the open source debate

The license was not technically open source by the Open Source Initiative's definition, because it imposed both use restrictions (the Acceptable Use Policy) and the 700M MAU threshold. The OSI publicly stated in 2023 that the Llama 2 license did not meet the Open Source Definition. Critics, including linguist Mark Dingemanse and the policy think tank Open Future, argued that calling LLaMA 2 "open source" was misleading because Meta did not release training data and provided only partial transparency about the data pipeline [10].

Meta and others in the industry countered that the license was far more permissive than anything previously offered at this model quality, and that "open weights" was the more accurate descriptor. The terminological debate sharpened in 2024 when the OSI published a formal Open Source AI Definition (OSAID) that required, among other things, sufficient information to recreate the model from scratch, a bar that no commercial frontier model met as of 2026.

Regardless of the debate, the practical effect was transformative. Thousands of developers and companies began building on LLaMA 2 within weeks of release, and the license became a template that influenced subsequent open-weight releases from other organizations.

Microsoft partnership

Meta designated Microsoft as the "preferred partner" for LLaMA 2, and the two companies announced their expanded AI partnership on the same day as the model release. The partnership had several components:

Azure integration: LLaMA 2 was made available through the Azure AI model catalog, allowing Azure customers to fine-tune and deploy all three model sizes directly within Microsoft's cloud. The integration included Azure AI Content Safety tools layered on top of Meta's own safety techniques [11].
Windows optimization: LLaMA 2 was optimized to run locally on Windows devices through the DirectML execution provider via ONNX Runtime, enabling on-device inference for developers building Windows applications.
Broad distribution: Beyond Azure, LLaMA 2 was also made available through Amazon Web Services (later through Amazon Bedrock), Hugging Face, Google Cloud Vertex AI, Together AI, Replicate, Anyscale, and many other model hosting platforms.

The Microsoft partnership was unusual because Microsoft was simultaneously the primary backer and largest investor in OpenAI, which operated a closed-model strategy. By partnering with Meta on open models, Microsoft hedged its position, ensuring that Azure customers could access both closed (OpenAI) and open (Meta) ecosystems through the same control plane [12].

Code Llama

On August 24, 2023, roughly five weeks after the LLaMA 2 release, Meta introduced Code Llama, a family of code-specialized language models built on the LLaMA 2 foundation. Code Llama was created by further training the base LLaMA 2 models on 500 billion tokens of code and code-related data, with additional long-context fine-tuning [13].

Code Llama variants

Model	Parameters	Specialization	FIM	Context	Release
Code Llama 7B	7B	General code	Yes	16,384	Aug 2023
Code Llama 13B	13B	General code	Yes	16,384	Aug 2023
Code Llama 34B	34B	General code	No	16,384	Aug 2023
Code Llama 70B	70B	General code	No	16,384	Jan 2024
Code Llama - Python 7B/13B/34B/70B	7-70B	Python-specific	Mixed	16,384	Aug 2023 / Jan 2024
Code Llama - Instruct 7B/13B/34B/70B	7-70B	Instruction-following	Mixed	16,384	Aug 2023 / Jan 2024

Three variant types were released for each size:

Code Llama (base): General-purpose code generation and understanding, further trained on a code-heavy data mixture from the base LLaMA 2 models.
Code Llama - Python: Additionally fine-tuned on 100 billion tokens of Python code, optimized for Python development tasks.
Code Llama - Instruct: Fine-tuned with instruction-following data to better understand natural-language coding prompts.

The 7B and 13B variants supported fill-in-the-middle (FIM) capability, allowing them to insert code into existing files given surrounding context, which made them suitable for IDE autocomplete. All Code Llama models supported a 16,384-token context window, four times that of base LLaMA 2, achieved through additional long-context fine-tuning with adjusted RoPE frequencies (theta = 1,000,000 instead of 10,000) [13].
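
The effect of raising the RoPE base can be illustrated by comparing the wavelength of the slowest-rotating dimension pair under both settings. The sketch assumes the standard frequency formula theta^(-i/d) and a head dimension of 128; the function name is hypothetical.

```python
import math

def longest_wavelength(head_dim, theta):
    """Token distance over which the slowest dimension pair completes one rotation."""
    slowest_freq = theta ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / slowest_freq

base = longest_wavelength(128, 10_000.0)
extended = longest_wavelength(128, 1_000_000.0)
ratio = extended / base
```

Under these assumptions the slowest pair rotates more than 90 times more slowly with the larger base, so distant tokens are separated by much smaller rotation angles than they would be with the original setting.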

Code Llama - Python 34B scored 53.7 on HumanEval (pass@1), almost double the base LLaMA 2 70B's 29.9, demonstrating the value of domain-specific continued pretraining. The Code Llama - Instruct 70B variant added in January 2024 reached 67.8 pass@1, the highest score in Meta's own Code Llama lineup.

Code Llama also seeded a wave of community fine-tunes. Phind-CodeLlama-34B (from the Phind search startup) reportedly matched GPT-4 on HumanEval at 73.8 pass@1; WizardCoder-34B from Microsoft Research used Evol-Instruct to push the same base above 70 pass@1; and DeepSeek's first code model began life as a Code Llama derivative before its team retrained from scratch. The 34B size in the Code Llama lineup partially compensated for Meta's decision not to release a 34B base text model.

ecosystem and derivatives

LLaMA 2 became the most fine-tuned foundation model of 2023. By the time Llama 3 was announced in April 2024, more than 60,000 derivative models based on the LLaMA family had been uploaded to Hugging Face [3]. Below is a non-exhaustive sample of notable derivatives.

Derivative	Creator	Base	Notes
Vicuna v1.5	LMSYS	Llama 2 7B/13B	ShareGPT-style dialogues, 4K and 16K context variants
WizardLM v1.2 / 70B v1.0	WizardLM team	Llama 2 13B/70B	Evol-Instruct synthetic data, strong MT-Bench scores
Nous-Hermes-Llama2	NousResearch	Llama 2 13B/70B	GPT-4 distilled instructions
OpenChat 3.2	OpenChat team	Llama 2 13B	C-RLFT fine-tuning, MT-Bench above 7.0
Tulu 2	AI2	Llama 2 7B/13B/70B	Diverse instruction mix with DPO alignment
Llama-2-7B-32K	Together AI	Llama 2 7B	RoPE rescaling for 32K context
Code Llama variants	Meta	Llama 2 7B/13B/34B/70B	Official code specializations
MedAlpaca, Meditron	Stanford / EPFL	Llama 2 7B/13B/70B	Medical question answering
OpenLLaMA	Berkeley AI Research	n/a (re-pretrain)	Apache 2.0 reproduction of the original LLaMA architecture trained from scratch; predates Llama 2
LLaVA-1.5	UW-Madison / Microsoft	Llama 2 7B/13B	Visual instruction tuning, image inputs
Llama Guard	Meta	Llama 2 7B	Input/output safety classifier shipped with Purple Llama

Many of these derivatives took the top spots on the Hugging Face Open LLM Leaderboard during late 2023 and early 2024, and they collectively defined what "open-weight chatbot" meant in this period. The community also developed runtime tools that made LLaMA 2 unusually portable, including llama.cpp (Georgi Gerganov's C++ port that runs the 7B on CPUs and Apple silicon), GGML/GGUF quantization formats, ExLlama, vLLM, and Text Generation Inference.

Purple Llama

In December 2023, Meta launched Purple Llama, an umbrella initiative for open trust-and-safety tools built on top of Llama 2. The initial release included Llama Guard, a 7B classifier fine-tuned to detect unsafe content in both inputs and outputs against a six-category taxonomy (violence and hate, sexual content, criminal planning, guns and illegal weapons, regulated or controlled substances, and suicide and self-harm). Purple Llama also shipped CyberSecEval, a benchmark suite for testing whether code-generating LLMs produce insecure code or assist in cyber-offense tasks. The project was significant because it gave deployers a way to compose a model with an open safety filter rather than relying solely on a proprietary moderation API.
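The composition pattern described above, a chat model wrapped by an input and output classifier, might look roughly like the following. This is a sketch under stated assumptions: `guarded_generate`, `Verdict`, `toy_classifier`, and `echo_model` are hypothetical names, and the keyword check merely stands in for a learned classifier such as Llama Guard.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    safe: bool
    category: Optional[str] = None  # violated policy category, if any


# Hypothetical interfaces: a real deployment would wrap a safety model behind
# `classify` and the chat model behind `generate`.
Classifier = Callable[[str, str], Verdict]  # (role, text) -> Verdict
Generator = Callable[[str], str]

REFUSAL = "Sorry, I can't help with that."


def guarded_generate(prompt: str, generate: Generator, classify: Classifier) -> str:
    """Screen the user prompt, generate a reply, then screen the reply."""
    if not classify("user", prompt).safe:
        return REFUSAL
    reply = generate(prompt)
    if not classify("assistant", reply).safe:
        return REFUSAL
    return reply


def toy_classifier(role: str, text: str) -> Verdict:
    """Keyword stand-in for a learned classifier; illustration only."""
    if "build a bomb" in text.lower():
        return Verdict(safe=False, category="weapons")
    return Verdict(safe=True)


def echo_model(prompt: str) -> str:
    """Stand-in for a chat model."""
    return f"You asked: {prompt}"


print(guarded_generate("How do I kill a process?", echo_model, toy_classifier))
print(guarded_generate("How do I build a bomb?", echo_model, toy_classifier))
```

Keeping the filter as a separate callable is the point of the design: the classifier can be swapped, retrained, or audited without touching the generating model.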

Llama 2 Long

In September 2023, Meta researchers published a follow-up paper on "Effective Long-Context Scaling of Foundation Models" [14] describing Llama 2 Long, a continuation pretraining run that extended the context window from 4K to 32K tokens. Llama 2 Long modified the RoPE base frequency and trained on an additional 400 billion tokens with longer sequences. Although the weights were not publicly released, the paper became influential as a recipe for context extension and informed both Code Llama's long context and later Llama 3 work.
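The effect of changing the RoPE base frequency can be illustrated numerically. In rotary embeddings, dimension pair i rotates with angular frequency base^(-2i/d), so its wavelength in tokens is 2π·base^(2i/d). The sketch below compares the commonly used base of 10,000 with 500,000, the larger value reported in the long-context paper [14]; the head dimension of 128 and the helper names are assumptions for illustration.

```python
import math
from typing import List


def rope_wavelengths(head_dim: int, base: float) -> List[float]:
    """Wavelength, in tokens, of each rotary dimension pair."""
    if head_dim % 2:
        raise ValueError("head_dim must be even")
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def pairs_completing_rotation(head_dim: int, base: float, context_len: int) -> int:
    """Count dimension pairs that complete at least one full turn within the context."""
    return sum(w <= context_len for w in rope_wavelengths(head_dim, base))


HEAD_DIM = 128  # assumed per-head dimension
for base in (10_000.0, 500_000.0):
    longest = rope_wavelengths(HEAD_DIM, base)[-1]
    full = pairs_completing_rotation(HEAD_DIM, base, 32_768)
    print(f"base={base:>9,.0f}: longest wavelength ~{longest:,.0f} tokens, "
          f"{full}/{HEAD_DIM // 2} pairs complete a turn within 32K")
```

A larger base stretches every wavelength except the first, so positional phases change more slowly with distance, which the paper relates to weaker attention decay between distant tokens.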

adoption metrics

LLaMA 2's release had an outsized impact on the open AI ecosystem. Several factors contributed:

community adoption

Within the first ten days of release (July 18 to 28, 2023), early adopters demonstrated successful implementations spanning model deployment, chatbot development, multilingual fine-tuning, domain-specific adaptation (including medical applications), and runtime optimization for resource-constrained environments [15]. The pace of adoption reflected both the quality of the models and the pent-up demand for commercially usable open weights.

industry influence

The commercial license established a precedent that other model developers followed. Mistral AI's decision to release Mistral 7B and Mixtral under the Apache 2.0 license, and the broader trend toward open-weight releases from companies like Alibaba (Qwen), 01.AI (Yi), and DeepSeek, were all influenced by LLaMA 2's demonstration that open distribution could be commercially viable. By 2025 a majority of widely used non-frontier LLMs were distributed under either Apache 2.0 or a Llama-style community license.

scale of adoption

Meta reported that Llama usage (across all versions) grew 10x from January to July 2024, with token volume among major cloud providers more than doubling between May and July 2024 [3]. Cumulative downloads of Llama models approached 350 million by August 2024 and exceeded 600 million by the end of 2024. Meta also issued more than $2 million in Llama Impact Grants and Awards to support community projects.

policy and regulatory reception

LLaMA 2's release prompted immediate policy discussion. In the United States, the model became a reference case in debates over how the Biden administration's October 2023 Executive Order on AI should treat dual-use foundation models with widely available weights. The National Telecommunications and Information Administration (NTIA) ran a public comment period in early 2024 specifically asking whether "open foundation model weights" should be subject to additional reporting or restriction. In the United Kingdom, the AI Safety Institute used Llama 2-Chat as one of its early reference models for evaluation methodology. Within the European Union, LLaMA 2 was cited during AI Act trilogue negotiations as an example of a model whose distribution would not fit neatly into either the "general-purpose AI model" or the "high-risk AI system" categories then being drafted. Meta's policy team argued in submissions to all three jurisdictions that openness should be treated as a feature for safety, not a risk factor, citing the security benefits of independent red-teaming.

comparison with LLaMA 1

Feature	LLaMA 1 (Feb 2023)	LLaMA 2 (Jul 2023)
Model sizes	7B, 13B, 33B, 65B	7B, 13B, 70B (34B trained, not released)
Training data	1.4T tokens	2.0T tokens (40% more)
Context window	2,048 tokens	4,096 tokens
License	Research only (gated)	Llama 2 Community License (commercial)
Chat variants	None (community-created)	Official Llama 2-Chat (SFT + RLHF)
RLHF alignment	None	5 iterations (RS x4, then PPO)
GQA support	No	Yes (70B model)
MMLU (largest)	63.4 (65B)	68.9 (70B)
HumanEval (largest)	23.7 (65B)	29.9 (70B)
GSM8K (largest)	50.9 (65B)	56.8 (70B)
Code variants	None	Code Llama (Aug 2023)
Disclosed compute	partial	full per-variant breakdown
Distribution	Gated research access (then leaked)	Open download + cloud catalogs

The most impactful differences were the commercial license and the RLHF-aligned chat variants. LLaMA 1 required the community to create its own chat-tuned versions (Alpaca, Vicuna, etc.), which varied widely in quality and safety. Llama 2-Chat provided an official baseline that developers could use directly or further customize.
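The "RS" entries in the RLHF row of the table refer to rejection sampling: for each prompt, several candidate replies are sampled, a reward model scores them, and the best one is kept as a fine-tuning target. A minimal sketch of that selection step follows, with the sampler and reward model passed in as hypothetical callables; it does not reproduce Meta's training code.

```python
from typing import Callable, List, Sequence, Tuple

Sampler = Callable[[str], str]             # prompt -> one sampled reply
RewardModel = Callable[[str, str], float]  # (prompt, reply) -> scalar score


def best_of_k(prompt: str, sample: Sampler, reward: RewardModel, k: int) -> Tuple[str, float]:
    """Sample k replies and return the highest-scoring one with its score."""
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = [sample(prompt) for _ in range(k)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(k), key=scores.__getitem__)
    return candidates[best], scores[best]


def build_finetuning_set(
    prompts: Sequence[str], sample: Sampler, reward: RewardModel, k: int
) -> List[Tuple[str, str]]:
    """Pair each prompt with its best-of-k reply, ready for supervised fine-tuning."""
    return [(p, best_of_k(p, sample, reward, k)[0]) for p in prompts]
```

Increasing k raises the expected reward of the selected reply at the cost of k times more generation per prompt, which is the basic trade-off of the method.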

comparison with successors

Feature	Llama 2 70B (Jul 2023)	Llama 3 70B (Apr 2024)	Llama 4 Maverick (Apr 2025)
Tokenizer vocab	32,000	128,000	128,000 (extended)
Pretraining tokens	2T	15T	>30T
Native context	4K	8K (128K with 3.1)	1M (10M for Llama 4 Scout)
Architecture	Dense decoder	Dense decoder	Mixture-of-Experts
MMLU	68.9	82	89+
HumanEval	29.9	81.7	90+
Multimodal	Text only	Text only (3.2 added vision)	Native multimodal
License	Llama 2 Community	Llama 3 Community	Llama 4 Community

limitations

Despite its strengths, LLaMA 2 had several notable limitations:

Coding performance: Even the 70B model scored only 29.9 on HumanEval, roughly 60% of GPT-3.5's score and less than half of GPT-4's. Code Llama partially addressed this gap, but general-purpose coding remained weak in the base models.
Mathematical reasoning: Although improved over LLaMA 1, the 70B model's GSM8K score of 56.8 still trailed GPT-4 by a wide margin, and the model remained weak on harder benchmarks such as MATH, where it scored 13.5.
Context length: The 4,096-token window, double LLaMA 1's, was short compared to the 8K (GPT-4) and 32K (GPT-4-32k) windows then available from OpenAI, and far shorter than the 100K-token contexts already shipped by Anthropic's Claude at the time of release.
Knowledge cutoff: The September 2022 cutoff meant the model lacked knowledge of events from the final 10 months before release, including the launch of ChatGPT itself.
Refusal behavior: The default Llama 2-Chat refused a substantial fraction of benign queries ("how do I kill a process", "what's the best way to murder a video file"), an issue acknowledged by Meta and subsequently mitigated in Llama 3.
Hallucination: Like all large language models of its generation, LLaMA 2 was prone to generating plausible but factually incorrect information, particularly on topics underrepresented in the training data.
Multilinguality: With training data only about 10% non-English (and most of that in a narrow set of European languages), performance on languages like Hindi, Arabic, or Vietnamese was weak. Community efforts such as Bode (Portuguese), Sabia (Portuguese), Chinese-LLaMA-2 (Chinese), and OpenBuddy attempted to fill this gap, with mixed success.
Tokenizer: The 32K-token vocabulary was efficient for English but inflated token counts for non-Latin scripts. A Mandarin sentence might use 2-3x as many Llama 2 tokens as its English equivalent, raising both latency and cost. Llama 3 expanded the vocabulary to 128K tokens largely to address this.
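The context-length and tokenizer limitations compound each other. As a rough illustration, if a language needs a given multiple of the tokens English would, the usable window shrinks by the same factor; the helper below is a hypothetical back-of-the-envelope calculation, not a measurement of the actual tokenizer.

```python
def effective_context(window_tokens: int, inflation: float) -> int:
    """English-equivalent tokens that fit when text needs `inflation` times more tokens."""
    if inflation <= 0:
        raise ValueError("inflation must be positive")
    return int(window_tokens / inflation)


WINDOW = 4096  # Llama 2 context window
for inflation in (1.0, 2.0, 3.0):
    print(f"{inflation:.0f}x tokens -> about {effective_context(WINDOW, inflation)} "
          f"English-equivalent tokens of context")
```

At 2-3x inflation the 4,096-token window holds only about 1,400 to 2,000 English-equivalent tokens, and per-token latency and cost scale up by the same multiple.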

legacy and successors

LLaMA 2 served as the foundation for Meta's continued investment in open AI. Its successors built directly on its codebase and conventions:

Llama 3 (April 18, 2024): Released in 8B and 70B sizes initially, later expanded with a 405B variant. Llama 3 used a 128K-token tokenizer, trained on more than 15 trillion tokens, expanded context to 8,192 tokens, and applied GQA at all sizes. Performance improvements were substantial: Llama 3 70B matched or exceeded GPT-4 on several benchmarks at release.
Llama 3.1 (July 2024): Extended context to 128K tokens and introduced the 405B variant, the largest openly available language model at that time.
Llama 3.2 (September 2024): Added multimodal capabilities (vision) and lightweight 1B and 3B models for edge deployment.
Llama 3.3 (December 2024): A 70B-only update that approached the 405B's quality at a fraction of the inference cost.
Llama 4 (April 2025): A complete architectural redesign using mixture of experts, native multimodality, and context windows up to 10 million tokens.
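Grouped-query attention, used in Llama 2 only at 70B and then at all sizes in Llama 3, mainly saves memory in the key-value cache during inference. The arithmetic can be sketched as follows; the layer and head counts are taken as assumptions from the publicly released Llama 2 70B configuration (80 layers, 64 query heads, 8 key-value heads, head dimension 128), and the function name is illustrative.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # 16-bit cache entries
    batch: int = 1,
) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch


LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM, CONTEXT = 80, 64, 8, 128, 4096

mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, CONTEXT)   # one KV head per query head
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)  # 8 shared KV heads

print(f"multi-head attention:    {mha / 2**30:.2f} GiB per 4K-token sequence")
print(f"grouped-query attention: {gqa / 2**30:.2f} GiB per 4K-token sequence")
```

Under these assumptions the cache for one full-length sequence shrinks from 10 GiB to 1.25 GiB, an 8x reduction equal to the ratio of query heads to key-value heads.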

As of early 2026, LLaMA 2 weights remain freely downloadable and are still used in some production systems, particularly where regulatory or supply-chain audits favor a well-understood older model over a frontier one. The 7B variant in particular continues to appear in research papers as a standard baseline, and llama.cpp's GGUF distribution of Llama 2 7B chat is one of the most-downloaded GGUF files on Hugging Face. For new deployments, however, the Llama 3 and Llama 4 families offer substantially better performance across all benchmarks, and most active users have migrated.

LLaMA 2's most lasting contribution is not the models themselves but the precedent they set. By demonstrating that a major technology company could release high-quality models under a permissive license and still benefit strategically, Meta shifted the industry's expectations about openness. The model proved that open and commercial were not opposing goals, and that proof has continued to shape how AI models are developed, distributed, and regulated.

references

Touvron, H., Martin, L., Stone, K., Albert, P., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. https://arxiv.org/abs/2307.09288
Meta. (2023). "Meta and Microsoft Introduce the Next Generation of Llama." https://about.fb.com/news/2023/07/llama-2/
Meta. (2024). "With 10x growth since 2023, Llama is the leading engine of AI innovation." https://ai.meta.com/blog/llama-usage-doubled-may-through-july-2024/
Meta. (2023). "Llama 2 Model Card." https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md
Meta. (2023). "Meta Llama 2: Model Cards and Prompt Formats." https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-2/
Wolfe, C. R. (2023). "LLaMA-2 from the Ground Up." Substack. https://cameronrwolfe.substack.com/p/llama-2-from-the-ground-up
PromptEngineering.org. (2023). "How Does Llama-2 Compare to GPT-4/3.5 and Other AI Language Models." https://promptengineering.org/how-does-llama-2-compare-to-gpt-and-other-ai-language-models/
Meta. (2023). "Llama 2 Community License Agreement." https://www.llama.com/llama2/license/
Meta. (2023). "Llama 2 Acceptable Use Policy." https://www.llama.com/use-policy/
Open Future. (2023). "The Mirage of Open-Source AI: Analyzing Meta's Llama 2 Release Strategy." https://openfuture.eu/blog/the-mirage-of-open-source-ai-analyzing-metas-llama-2-release-strategy/
Microsoft. (2023). "Microsoft and Meta expand their AI partnership with Llama 2 on Azure and Windows." https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/
CNBC. (2023). "Microsoft goes beyond OpenAI, makes Meta's new A.I. model available to Azure customers." https://www.cnbc.com/2023/07/18/microsoft-makes-metas-new-ai-model-available-to-azure-customers.html
Roziere, B., Gehring, J., Gloeckle, F., et al. (2023). "Code Llama: Open Foundation Models for Code." arXiv:2308.12950. https://arxiv.org/abs/2308.12950
Xiong, W., Liu, J., Molybog, I., et al. (2023). "Effective Long-Context Scaling of Foundation Models." arXiv:2309.16039. https://arxiv.org/abs/2309.16039
Ayala, O., et al. (2023). "Llama 2: Early Adopters' Utilization of Meta's New Open-Source Pretrained Model." Preprints. https://www.preprints.org/manuscript/202307.2142
Hugging Face. (2023). "Llama 2 Documentation." https://huggingface.co/docs/transformers/model_doc/llama2
Wikipedia. "Llama (language model)." https://en.wikipedia.org/wiki/Llama_(language_model)