LLaMA 2 (Large Language Model Meta AI 2) is a family of large language models developed and released by Meta in July 2023. As the second generation of Meta's LLaMA model series, LLaMA 2 represented a major step forward in the open-weight AI movement by making powerful foundation models freely available for both research and commercial use. The release included pretrained base models and fine-tuned chat variants at three sizes (7 billion, 13 billion, and 70 billion parameters), all trained on 2 trillion tokens of publicly available data. LLaMA 2 was accompanied by a detailed research paper and a new community license that allowed commercial deployment, subject to certain conditions. Together with a high-profile partnership with Microsoft, the release positioned Meta as one of the leading advocates for open AI development [1].
Meta released the original LLaMA (LLaMA 1) in February 2023, initially restricting access to researchers through an application process. The model weights were leaked online within days of the announcement, and the resulting community activity demonstrated massive demand for openly available large language models. Fine-tuned derivatives such as Alpaca (from Stanford) and Vicuna (from LMSYS) quickly appeared, showing that even relatively small open models could be adapted for a wide range of tasks.
Building on this experience, Meta took a different approach with LLaMA 2. Rather than limiting distribution to researchers, the company released the model weights openly on July 18, 2023, alongside a permissive community license that explicitly permitted commercial use. The announcement was made jointly with Microsoft at a dedicated event, signaling the strategic importance both companies placed on open AI models [2].
The timing of the release was significant. In mid-2023, OpenAI's GPT-4 and ChatGPT dominated public attention, and the prevailing narrative in the industry favored closed, proprietary models. By releasing LLaMA 2 with commercial permissions, Meta challenged that narrative directly and gave developers, startups, and enterprises a competitive open alternative.
LLaMA 2 was released in three parameter sizes, each available as both a pretrained base model and a fine-tuned chat model optimized for dialogue:
| Model | Parameters | Context Length | Attention Type | Training Tokens | Availability |
|---|---|---|---|---|---|
| Llama 2 7B | 7 billion | 4,096 | Multi-Head Attention (MHA) | 2 trillion | Base + Chat |
| Llama 2 13B | 13 billion | 4,096 | Multi-Head Attention (MHA) | 2 trillion | Base + Chat |
| Llama 2 70B | 70 billion | 4,096 | Grouped-Query Attention (GQA) | 2 trillion | Base + Chat |
The base models were designed for general-purpose text generation and could be fine-tuned for specific downstream tasks. The chat variants (Llama 2-Chat) were specifically optimized for multi-turn dialogue through supervised fine-tuning and reinforcement learning from human feedback (RLHF).
All three sizes shared the same transformer decoder-only architecture, with the key difference being that the 70B model used grouped-query attention (GQA) rather than standard multi-head attention. GQA groups multiple query heads to share the same key and value projections, reducing memory bandwidth requirements during inference and improving throughput for the largest model [1].
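The memory effect of GQA can be sketched in a few lines. The head counts below are the published 70B configuration (64 query heads, 8 key/value heads, 128-dimensional heads, 80 layers); the helper functions are illustrative, not Meta's code:

```python
def kv_head_for_query(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Map a query head to the key/value head it shares under GQA."""
    group_size = n_heads // n_kv_heads  # query heads per KV head (8 for 70B)
    return q_head // group_size

def kv_cache_bytes(n_kv_heads: int, head_dim: int, seq_len: int,
                   n_layers: int, bytes_per_elem: int = 2) -> int:
    """Per-sequence KV-cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Query heads 0-7 share KV head 0, heads 8-15 share KV head 1, and so on.
assert kv_head_for_query(5, 64, 8) == 0
assert kv_head_for_query(63, 64, 8) == 7

# At the full 4,096-token context, GQA shrinks the KV cache 8x versus MHA.
mha = kv_cache_bytes(64, 128, 4096, 80)
gqa = kv_cache_bytes(8, 128, 4096, 80)
assert mha == 8 * gqa
```

The 8x reduction in cached key/value tensors is what translates into the higher inference throughput described above.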
LLaMA 2 retained the core architectural innovations from LLaMA 1, including RMSNorm pre-normalization, the SwiGLU activation function, and rotary positional embeddings (RoPE), while making targeted improvements such as the doubled context length and grouped-query attention for the 70B model.
The architecture does not use bias terms in the linear layers, a design choice inherited from LLaMA 1 that slightly reduces parameter count and has been shown not to harm performance.
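One retained innovation, RMSNorm pre-normalization, illustrates this bias-free design: the layer has a learned per-channel scale but no shift and no mean subtraction. A pure-Python sketch (not Meta's implementation):

```python
import math

def rms_norm(x: list, weight: list, eps: float = 1e-5) -> list:
    """RMSNorm as used throughout the LLaMA family: normalize by the
    root-mean-square of the activations, then apply a learned scale.
    As with the linear layers, there is no bias term."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

# With unit weights, the output has RMS ~= 1 regardless of input scale.
out = rms_norm([3.0, -4.0], [1.0, 1.0])
assert abs(math.sqrt(sum(v * v for v in out) / len(out)) - 1.0) < 1e-3
```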
LLaMA 2 was pretrained on 2 trillion tokens drawn from publicly available sources. This represents a 40% increase over the 1.4 trillion tokens used for LLaMA 1. Meta did not disclose the exact composition of the training data but stated that it included a "new mix of publicly available online data" and that the data was filtered to remove sites known to contain high volumes of personal information [1].
The pretraining data has a knowledge cutoff of September 2022, though some of the fine-tuning data extends to July 2023. The data was processed with a byte pair encoding tokenizer with a vocabulary of 32,000 tokens, the same tokenizer used in LLaMA 1.
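The principle behind a byte pair encoding vocabulary can be shown with a toy merge step. LLaMA's actual tokenizer is a trained SentencePiece model; the functions below only illustrate the underlying idea of repeatedly merging the most frequent adjacent pair:

```python
from collections import Counter

def most_frequent_pair(tokens: list) -> tuple:
    """One step of BPE vocabulary building: find the most common
    adjacent token pair in a sequence."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens: list, pair: tuple) -> list:
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("abababc")
best = most_frequent_pair(tokens)          # ("a", "b") occurs three times
assert merge_pair(tokens, best) == ["ab", "ab", "ab", "c"]
```

Repeating this merge loop until the vocabulary reaches a target size (32,000 entries in LLaMA's case) yields the final token inventory.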
All models were trained with a standard autoregressive language modeling objective using the AdamW optimizer. Key training hyperparameters included a cosine learning rate schedule with warmup, a global batch size of 4 million tokens, and gradient clipping. The training was conducted on Meta's Research Super Cluster (RSC) and internal production clusters, using NVIDIA A100 GPUs.
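The schedule shape is easy to state precisely: per the paper, the learning rate warms up linearly for 2,000 steps and then follows a cosine decay down to 10% of the peak rate. A minimal sketch (the peak rate and step count in the assertions are illustrative):

```python
import math

def lr_at_step(step: int, max_steps: int, peak_lr: float,
               warmup_steps: int = 2000, final_ratio: float = 0.1) -> float:
    """Cosine learning-rate schedule with linear warmup, decaying to
    10% of peak, matching the shape described in the LLaMA 2 paper."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear warmup
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)

assert lr_at_step(0, 100_000, 3e-4) == 0.0
assert lr_at_step(2000, 100_000, 3e-4) == 3e-4           # end of warmup: peak
assert abs(lr_at_step(100_000, 100_000, 3e-4) - 3e-5) < 1e-12  # 10% floor
```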
Training the full suite of models required substantial compute. While Meta did not disclose exact GPU hours for LLaMA 2, the scale is comparable to other models of similar size. The 70B model, in particular, required weeks of continuous training across thousands of GPUs.
| Aspect | LLaMA 1 | LLaMA 2 |
|---|---|---|
| Training tokens | 1.4 trillion | 2 trillion |
| Context length | 2,048 | 4,096 |
| Largest model | 65B parameters | 70B parameters |
| Attention (largest) | Multi-Head Attention | Grouped-Query Attention |
| Commercial license | No | Yes |
| RLHF alignment | No | Yes (Chat variants) |
The 40% increase in training data was one of the most impactful changes. Scaling laws research has consistently shown that training on more tokens improves model quality, and the jump from 1.4T to 2T tokens produced measurable gains across benchmarks.
The Llama 2-Chat models underwent an extensive alignment process that combined supervised fine-tuning (SFT) with reinforcement learning from human feedback. The alignment pipeline was described in unusual detail in Meta's research paper, making it one of the most transparent accounts of RLHF training published by a major AI lab [1].
The first stage of alignment involved supervised fine-tuning on 27,540 high-quality prompt-response pairs. Meta found that a relatively small number of carefully curated examples was more effective than a larger set of lower-quality data. The SFT data was created by human annotators who wrote both the prompts and the desired responses.
Meta trained two separate reward models to evaluate model outputs: a helpfulness reward model and a safety reward model. Separating the two objectives avoided forcing a single model to trade one against the other when a response was helpful but unsafe, or safe but unhelpful.
The reward models were trained on human preference data collected through a process in which annotators compared pairs of model responses and selected the one they preferred. The annotation effort involved over 1 million human preference comparisons, drawn from both Meta's internal annotation pipeline and seven publicly available datasets [1].
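Each comparison trains the reward model with a binary ranking loss: the chosen response's score is pushed above the rejected one's, and the paper adds an optional margin for pairs the annotators rated as strongly preferred. A sketch of the per-pair loss:

```python
import math

def ranking_loss(r_chosen: float, r_rejected: float,
                 margin: float = 0.0) -> float:
    """Binary ranking loss on one human preference pair:
    -log sigmoid(r_chosen - r_rejected - margin). The margin term
    penalizes the reward model more heavily when annotators strongly
    preferred one response, as described in the LLaMA 2 paper."""
    delta = r_chosen - r_rejected - margin
    return -math.log(1.0 / (1.0 + math.exp(-delta)))

# Loss falls as the reward model separates the pair correctly.
assert ranking_loss(2.0, 0.0) < ranking_loss(0.0, 0.0) < ranking_loss(0.0, 2.0)
```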
The RLHF process was iterative, spanning five successive versions (RLHF-V1 through RLHF-V5). Each iteration refined the model's behavior based on updated reward models and new preference data. Meta employed two complementary techniques: rejection sampling fine-tuning and proximal policy optimization (PPO).
Across these iterations, Meta relied on rejection sampling fine-tuning alone for the first four rounds, then layered PPO on top of it in the fifth. This sequential combination allowed the model to benefit from both the broad quality improvements of rejection sampling and the targeted optimization of PPO [3].
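Rejection sampling fine-tuning is conceptually simple: sample K candidate responses per prompt, score them with the reward model, and fine-tune on the best one. A sketch with stand-in `generate` and `reward` functions (the real pipeline used the latest RLHF checkpoint as the policy and varied K):

```python
def best_of_k(prompt: str, generate, reward, k: int = 4) -> str:
    """Pick the highest-reward response out of k samples; the selected
    responses then form the fine-tuning set for the next iteration.
    `generate` and `reward` stand in for the policy and reward models."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=reward)

# Toy stand-ins: the "policy" cycles through canned replies and the
# "reward model" simply prefers longer ones.
replies = iter(["ok", "a longer, more helpful answer", "fine", "meh"])
assert best_of_k("q", lambda p: next(replies), len) == \
    "a longer, more helpful answer"
```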
Meta introduced a technique called Ghost Attention (GAtt) to help the model follow system-level instructions consistently throughout a multi-turn conversation. Without GAtt, chat models tend to "forget" their system prompt as conversations grow longer. The technique works by synthetically inserting the system message into the training data at multiple points during a conversation, teaching the model to attend to system instructions even when they appear far back in the context.
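In simplified form, the synthetic-data step looks like the function below. This is only the augmentation idea; the full method described in the paper then keeps the instruction only in the first turn and masks the loss on earlier turns:

```python
def gatt_augment(instruction: str, dialogue: list) -> list:
    """Ghost Attention, simplified: concatenate the system instruction
    to every user turn before sampling synthetic multi-turn training
    data, so responses throughout the dialogue stay consistent with it.
    `dialogue` is a list of (user, assistant) turn pairs."""
    return [(instruction + " " + user, reply) for user, reply in dialogue]

turns = [("Hi", "Ahoy!"), ("Tell me a joke", "Arr...")]
aug = gatt_augment("Always answer as a pirate.", turns)
assert all(u.startswith("Always answer as a pirate.") for u, _ in aug)
```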
The default system prompt used for Llama 2-Chat establishes the model's intended behavior:
"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature." [4]
This system prompt reflects Meta's emphasis on both helpfulness and safety. The prompt also includes instructions for the model to acknowledge when it does not know an answer and to avoid sharing false information. Users and developers deploying Llama 2-Chat could replace this default system prompt with custom instructions tailored to their specific use case, a flexibility that proved important for commercial adoption.
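Custom system prompts are supplied through the published Llama 2 chat template, which wraps the system message in `<<SYS>>` tags inside the first `[INST]` block. A single-turn sketch (BOS/EOS special tokens omitted for brevity):

```python
def llama2_single_turn(system: str, user: str) -> str:
    """Format one user turn in the Llama 2-Chat prompt template.
    Multi-turn prompts repeat the [INST] ... [/INST] wrapping around
    each user message, with model replies in between."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

prompt = llama2_single_turn("You are a helpful assistant.", "What is GQA?")
assert prompt.startswith("[INST] <<SYS>>")
assert prompt.endswith("What is GQA? [/INST]")
```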
LLaMA 2 demonstrated strong performance across a wide range of academic benchmarks, consistently outperforming other open-source models available at the time of release.
| Benchmark | Llama 2 7B | Llama 2 13B | Llama 2 70B | Llama 1 65B | GPT-3.5 (approx.) |
|---|---|---|---|---|---|
| MMLU (5-shot) | 45.3 | 54.8 | 68.9 | 63.4 | 70.0 |
| GSM8K (8-shot) | 14.6 | 28.7 | 56.8 | 50.9 | 57.1 |
| HumanEval (pass@1) | 12.8 | 18.3 | 29.9 | 23.7 | 48.1 |
| TruthfulQA | 33.3 | 41.9 | 50.2 | 43.4 | 47.0 |
| BIG-Bench Hard (3-shot) | 32.6 | 39.4 | 51.2 | 44.5 | N/A |
The 70B model achieved 68.9 on MMLU, approaching GPT-3.5's performance and representing an approximately 5-point improvement over the LLaMA 1 65B model. On mathematical reasoning (GSM8K), the improvement was even more pronounced, with the 70B model scoring 56.8 compared to LLaMA 1's 50.9. Code generation (HumanEval) remained a relative weakness, with the 70B model scoring 29.9, well below GPT-3.5's 48.1 and far behind GPT-4's 67.0 [1][5].
For the chat-optimized variants, Meta conducted extensive human evaluations. In head-to-head comparisons, Llama 2-Chat 70B achieved a 36% win rate against ChatGPT (GPT-3.5 Turbo, March 2023 version), effectively tying with OpenAI's commercial offering on helpfulness. Against other open-source models, the results were more decisive: Llama 2-Chat 34B (an intermediate size used internally) defeated Falcon-40B with a 76% win rate.
The chat models also showed significant improvements in safety. On the TruthfulQA benchmark, Llama 2-Chat 70B scored 64.1, a substantial jump from the base model's 50.2, demonstrating the effectiveness of the RLHF alignment process [1].
The Llama 2 Community License was one of the most consequential aspects of the release. Unlike LLaMA 1 (which was restricted to research), LLaMA 2 was released under a license that explicitly permitted commercial use. The key terms included a royalty-free grant for both research and commercial use, a requirement that services with more than 700 million monthly active users at the time of release obtain a separate license from Meta, a prohibition on using LLaMA 2 or its outputs to improve other large language models, and an acceptable use policy restricting harmful applications.
The license was not technically "open source" by the Open Source Initiative's definition, because it imposed use restrictions and the 700M MAU threshold. Critics, including researcher Mark Dingemanse, argued that calling LLaMA 2 "open source" was misleading because Meta did not release the training data or provide full transparency about the data pipeline [7]. Meta and others in the industry countered that the license was far more permissive than anything previously offered at this model quality level, and that "open weights" was the more accurate descriptor.
Regardless of the terminological debate, the practical effect was transformative. Thousands of developers and companies began building on LLaMA 2 within weeks of release, and the license became a template that influenced subsequent open model releases from other organizations.
Meta designated Microsoft as the "preferred partner" for LLaMA 2, and the two companies announced their expanded AI partnership on the same day as the model release. The partnership made the models available through the Azure AI model catalog, added safety and content-filtering tooling for Azure deployments, and included optimizations for running the models locally on Windows.
The Microsoft partnership was notable because Microsoft was simultaneously the primary backer and largest investor in OpenAI, which operated a closed model strategy. By partnering with Meta on open models, Microsoft hedged its position, ensuring that Azure customers could access both closed (OpenAI) and open (Meta) model ecosystems [9].
On August 24, 2023, roughly five weeks after the LLaMA 2 release, Meta introduced Code Llama, a family of code-specialized language models built on the LLaMA 2 foundation. Code Llama was created by further training the base LLaMA 2 models on 500 billion tokens of code and code-related data [10].
| Model | Parameters | Specialization | FIM Support | Context |
|---|---|---|---|---|
| Code Llama 7B | 7 billion | General code | Yes | 16,384 |
| Code Llama 13B | 13 billion | General code | Yes | 16,384 |
| Code Llama 34B | 34 billion | General code | No | 16,384 |
| Code Llama - Python 7B | 7 billion | Python-specific | Yes | 16,384 |
| Code Llama - Python 13B | 13 billion | Python-specific | Yes | 16,384 |
| Code Llama - Python 34B | 34 billion | Python-specific | No | 16,384 |
| Code Llama - Instruct 7B | 7 billion | Instruction-following | Yes | 16,384 |
| Code Llama - Instruct 13B | 13 billion | Instruction-following | Yes | 16,384 |
| Code Llama - Instruct 34B | 34 billion | Instruction-following | No | 16,384 |
Three variant types were released: the base Code Llama models for general code generation and understanding, Code Llama - Python variants further trained on Python-heavy data, and Code Llama - Instruct variants fine-tuned to follow natural-language instructions about code.
The 7B and 13B variants supported fill-in-the-middle (FIM) capability, allowing them to insert code into existing codebases given surrounding context. This was particularly useful for code completion in IDE integrations. All Code Llama models supported a context length of 16,384 tokens, four times the context window of the base LLaMA 2 models, achieved through additional long-context fine-tuning [10].
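Infilling uses dedicated sentinel tokens. In the prefix-suffix-middle (PSM) ordering from the Code Llama paper, a prompt looks roughly like the sketch below; exact token spellings and spacing are tokenizer-level details, so treat this as schematic:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Schematic fill-in-the-middle prompt for Code Llama (7B/13B).
    The model generates the missing middle after <MID> and emits an
    end-of-infill token when done."""
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

p = fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))")
assert p.startswith("<PRE>") and p.endswith("<MID>")
```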
Code Llama - Python 34B scored 53.7 on HumanEval (pass@1), with the general-purpose Code Llama 34B close behind at 48.8, a dramatic improvement over the base LLaMA 2 70B's score of 29.9 and a demonstration of the value of domain-specific continued pretraining.
LLaMA 2's release had a catalytic effect on the open AI ecosystem. Several factors contributed to its outsized impact: weights that were competitive with commercial models, sizes small enough to fine-tune and serve on accessible hardware, an officially aligned chat variant, and, above all, a license that permitted commercial use.
Within the first ten days of release (July 18-28, 2023), early adopters demonstrated successful implementations spanning model deployment, chatbot development, multilingual fine-tuning, domain-specific adaptation (including medical applications), and runtime optimization for resource-constrained environments. The speed of adoption reflected both the quality of the models and the pent-up demand for commercially usable open models [11].
LLaMA 2 became the most widely fine-tuned foundation model of 2023. By the time LLaMA 3 was announced in 2024, over 60,000 derivative models based on the LLaMA family had been uploaded to Hugging Face. These derivatives covered an enormous range of use cases: medical question-answering, legal document analysis, multilingual chat, coding assistants, role-playing characters, and many more [12].
The commercial license established a precedent that other model developers followed. Mistral AI's decision to release Mixtral under the Apache 2.0 license, and the broader trend toward open-weight releases from companies like Alibaba (Qwen), 01.AI (Yi), and others, were all influenced by LLaMA 2's demonstration that open distribution could be commercially viable.
Meta reported that Llama usage (across all versions) grew 10x from January to July 2024, with token volume among major cloud providers more than doubling in just three months from May to July 2024. Since 2023, there have been more than 400 million downloads of all Llama versions combined. Meta also issued over $2 million in Llama Impact Grants and Awards to support community projects [12].
LLaMA 2 improved on its predecessor across nearly every dimension:
| Feature | LLaMA 1 (Feb 2023) | LLaMA 2 (Jul 2023) |
|---|---|---|
| Model sizes | 7B, 13B, 33B, 65B | 7B, 13B, 70B |
| Training data | 1.4 trillion tokens | 2 trillion tokens (40% more) |
| Context window | 2,048 tokens | 4,096 tokens |
| License | Research only | Commercial use permitted |
| Chat variants | None (community-created) | Official Llama 2-Chat |
| RLHF alignment | None | SFT + RLHF (5 iterations) |
| GQA support | No | Yes (70B model) |
| MMLU (largest) | 63.4 (65B) | 68.9 (70B) |
| Code variants | None | Code Llama (Aug 2023) |
| Distribution | Gated research access | Open download + Azure, AWS, HuggingFace |
The most impactful differences were the commercial license and the RLHF-aligned chat variants. LLaMA 1 required the community to create its own chat-tuned versions (Alpaca, Vicuna, etc.), which varied widely in quality and safety. LLaMA 2-Chat provided a high-quality official baseline that developers could use directly or further customize.
Despite its strengths, LLaMA 2 had several notable limitations: code generation well behind GPT-3.5 and GPT-4 (until Code Llama partially closed the gap), a 4,096-token context window that soon looked short next to competitors, predominantly English training data that limited multilingual performance, a knowledge cutoff of September 2022, and a tendency of the chat models toward overly cautious refusals of benign requests.
LLaMA 2 served as the foundation for Meta's continued investment in open AI: the Llama 3 family, released in 2024, and subsequent generations built directly on its architecture, alignment recipe, and open distribution model.
As of early 2026, LLaMA 2 models are still used in some production systems and remain popular for research and education due to their manageable size and well-documented architecture. However, for new deployments, the Llama 3 and Llama 4 families offer substantially better performance across all benchmarks.
LLaMA 2's most lasting contribution was not the models themselves but the precedent they set. By demonstrating that a major technology company could release high-quality models under a permissive license and still benefit strategically, Meta helped shift the industry's center of gravity toward greater openness. The model proved that open and commercial were not opposing goals, a lesson that continues to shape how AI models are developed and distributed.