LLaMA 3 is a family of open-weight large language models developed by Meta and released in several waves between April and December 2024. The series introduced models ranging from compact 1B-parameter variants intended for on-device inference up to a 405B-parameter dense transformer that, at the time of release, was the largest openly available language model in the world. LLaMA 3 represented a substantial step beyond the earlier LLaMA 2 generation: it expanded training data from roughly 2 trillion tokens to more than 15 trillion, broadened multilingual coverage, lengthened context windows from LLaMA 2's 4K tokens to as much as 128K, and added multimodal vision capabilities in the LLaMA 3.2 update [1][2].
This page is a short overview of the LLaMA 3 family. For detailed coverage of architecture, training, benchmarks, the 4D parallelism strategy, fine-tuning recipes, safety tooling, and the broader ecosystem, see the canonical article: LLaMA 3. The successor generation is documented at LLaMA 4.
Meta released the LLaMA 3 family in four major waves over an eight-month period in 2024. Each wave added new sizes, extended capabilities, or improved efficiency at an existing parameter count.
| Release | Date | Sizes | Headline Change |
|---|---|---|---|
| LLaMA 3 | April 18, 2024 | 8B, 70B | Initial release; 8K context; 128K token vocabulary |
| LLaMA 3.1 | July 23, 2024 | 8B, 70B, 405B | 128K context window; 405B dense flagship; 8 languages |
| LLaMA 3.2 | September 25, 2024 | 1B, 3B, 11B-Vision, 90B-Vision | Edge models and first multimodal LLaMA variants |
| LLaMA 3.3 | December 6, 2024 | 70B | 405B-class quality at a fraction of inference cost |
The full release table, including context lengths, parameter counts per variant, and the specific tasks each model was tuned for, is maintained in the LLaMA 3 article.
All LLaMA 3 models are dense decoder-only transformers. Meta deliberately avoided mixture-of-experts routing in this generation, citing better training stability and simpler deployment. Several architectural choices are consistent across the family [3]:

- Grouped-query attention (GQA), with 8 key-value heads shared across the query heads in the three primary backbones, reducing key-value cache memory at inference time
- Rotary positional embeddings (RoPE)
- SwiGLU feed-forward activations
- RMSNorm pre-normalization
- A 128K-token vocabulary built on a tiktoken-style BPE tokenizer
The three primary backbone sizes (8B, 70B, 405B) differ in depth, width, and head count but share these building blocks. The 405B configuration uses 126 layers, a model dimension of 16,384, and 128 attention heads.
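The per-size hyperparameters below follow Meta's published configuration tables; the dataclass itself is an illustrative container, not Meta's actual code.

```python
from dataclasses import dataclass

@dataclass
class LlamaConfig:
    """Illustrative hyperparameters for the three LLaMA 3 backbones."""
    n_layers: int
    d_model: int
    n_heads: int
    n_kv_heads: int  # grouped-query attention: KV heads shared across query heads
    ffn_dim: int

LLAMA3_8B   = LlamaConfig(n_layers=32,  d_model=4096,  n_heads=32,  n_kv_heads=8, ffn_dim=14336)
LLAMA3_70B  = LlamaConfig(n_layers=80,  d_model=8192,  n_heads=64,  n_kv_heads=8, ffn_dim=28672)
LLAMA3_405B = LlamaConfig(n_layers=126, d_model=16384, n_heads=128, n_kv_heads=8, ffn_dim=53248)

# With GQA, each KV head serves n_heads / n_kv_heads query heads,
# shrinking the KV cache by that factor at inference time.
print(LLAMA3_405B.n_heads // LLAMA3_405B.n_kv_heads)  # -> 16
```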
LLaMA 3 was pretrained on more than 15 trillion tokens of publicly available text, a roughly sevenfold increase over LLaMA 2. About 5 percent of the corpus consisted of non-English content covering more than 30 languages. The data pipeline included MinHash-based document deduplication, line-level n-gram filtering, custom NSFW and quality classifiers, and prompt-tuned content classifiers that boosted the share of mathematical reasoning, STEM material, and code in the final mix [1][3].
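To illustrate the MinHash deduplication step, the sketch below uses the open-source `datasketch` library; the 5-gram shingling and 0.8 Jaccard threshold are assumed values for illustration, not Meta's actual pipeline settings.

```python
# Minimal sketch of MinHash-based near-duplicate document removal.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Hash word 5-gram shingles of a document into a MinHash signature."""
    words = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 4, 1)):
        shingle = " ".join(words[i:i + 5])
        m.update(shingle.encode("utf8"))
    return m

def deduplicate(docs: dict[str, str]) -> list[str]:
    """Keep one representative per near-duplicate cluster."""
    lsh = MinHashLSH(threshold=0.8, num_perm=128)  # approximate Jaccard cutoff
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_of(text)
        if not lsh.query(sig):  # no near-duplicate already indexed
            lsh.insert(doc_id, sig)
            kept.append(doc_id)
    return kept
```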
The LLaMA 3.1 405B model was trained on a cluster of 16,384 NVIDIA H100 80GB GPUs over approximately 54 days, consuming roughly 30.8 million GPU-hours (about 39.3 million across the full 3.1 family) and around 3.8 × 10^25 FLOPs of compute. Meta combined four axes of distributed training, often described as 4D parallelism: tensor parallelism, pipeline parallelism, context parallelism, and data parallelism. The pretraining schedule and the data mix together represented one of the most thoroughly documented large-model training runs disclosed publicly to date [3].
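One way to express such a 4D layout is PyTorch's `DeviceMesh` abstraction, sketched below. Meta's actual training stack was custom; the factorization shown follows the figures reported for the 8K-context phase, and the snippet is an illustration rather than Meta's code.

```python
# Sketch: a 4D parallel layout as a PyTorch DeviceMesh.
# DP=128, PP=16, CP=1, TP=8 -> 16,384 GPUs, matching the reported
# 8K-context configuration. Run under torchrun with world_size=16384.
from torch.distributed.device_mesh import init_device_mesh

dp, pp, cp, tp = 128, 16, 1, 8
assert dp * pp * cp * tp == 16_384

mesh = init_device_mesh(
    "cuda",
    (dp, pp, cp, tp),
    mesh_dim_names=("dp", "pp", "cp", "tp"),
)
tp_group = mesh["tp"].get_group()  # process group for tensor-parallel collectives
```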
Meta has acknowledged that data and compute scaling decisions for the 405B model went beyond the traditional Chinchilla optimum from the scaling laws literature. Compute-optimal training would have suggested a smaller dataset for a model of this size, but Meta deliberately overtrained to improve inference-time efficiency, since serving costs scale with parameter count rather than with the original training token budget.
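As a rough illustration, using the approximate 20-tokens-per-parameter heuristic often quoted from the Chinchilla paper (a rule of thumb, not an exact constant):

$$
D_{\text{opt}} \approx 20N = 20 \times (405 \times 10^{9}) \approx 8.1 \times 10^{12}\ \text{tokens},
\qquad
\frac{D_{\text{actual}}}{D_{\text{opt}}} \approx \frac{15.6 \times 10^{12}}{8.1 \times 10^{12}} \approx 1.9
$$

where 15.6 trillion is the token count Meta reported for the 405B run; by this estimate the model saw roughly twice the compute-optimal amount of data.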
Post-training combined supervised fine-tuning, rejection sampling, and direct preference optimization, plus several rounds of synthetic data generation in which earlier checkpoints were used to bootstrap higher-quality instruction data for later rounds.
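For reference, the DPO objective at the core of that preference stage can be written compactly; the PyTorch sketch below is the standard published loss, not Meta's training code, and the tensor arguments are assumed to be per-example summed log-probabilities.

```python
# Minimal sketch of the direct preference optimization (DPO) loss.
# Inputs are summed log-probabilities of the chosen and rejected
# responses under the trained policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * ((chosen ratio) - (rejected ratio)))"""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```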
The LLaMA 3.1 405B Instruct model achieved scores broadly competitive with GPT-4 and Claude 3.5 Sonnet on standard reasoning, mathematics, code, and tool-use evaluations, while remaining the only model in that performance tier with openly downloadable weights at the time of its release [3]. Selected published results for the instruction-tuned 405B variant include the following.
| Benchmark | LLaMA 3.1 405B Instruct |
|---|---|
| MMLU (5-shot) | 87.3 |
| MMLU-Pro (5-shot CoT) | 73.3 |
| HumanEval (0-shot) | 89.0 |
| GSM8K (8-shot CoT) | 96.8 |
| MATH (0-shot CoT) | 73.8 |
| ARC Challenge (0-shot) | 96.9 |
| GPQA Diamond (0-shot CoT) | 51.1 |
| BFCL (function calling) | 88.5 |
The full benchmark table for every released variant, plus head-to-head comparisons against contemporary closed models, is maintained on the LLaMA 3 page.
LLaMA 3 is distributed under the LLaMA 3 Community License, a custom source-available license that permits commercial use subject to conditions. The two most commonly cited terms are an attribution requirement ("Built with Llama" branding for derivative products and services) and a scale clause requiring a separate commercial agreement with Meta for any licensee whose products exceeded 700 million monthly active users as of the model's release date. The license was widely adopted in industry but has been contested by groups such as the Open Source Initiative, which argues that its use restrictions disqualify it from being labeled open source in the formal sense [4].
Despite the licensing debate, LLaMA 3 has consistently ranked among the most downloaded open-weight LLM families. Meta reported in early 2025 that cumulative LLaMA downloads across all generations had surpassed 400 million, a tenfold increase year over year [2].
LLaMA 3 weights are hosted on Meta's official llama.com portal, the Hugging Face Hub, Kaggle Models, and through cloud partner catalogs at AWS, Azure, Google Cloud, Databricks, Snowflake, and NVIDIA NIM. Hosted inference is offered by Together AI, Fireworks AI, Replicate, Groq, Cerebras, and several others, often at substantially lower per-token cost than the closed proprietary alternatives.
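Many of these hosts expose OpenAI-compatible endpoints, so a single client works across providers. In the sketch below, the base URL and model identifier are placeholders (naming varies by host); check your provider's catalog for the exact values.

```python
# Sketch: calling a hosted LLaMA 3.1 endpoint via an OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # identifier varies by host
    messages=[{"role": "user", "content": "Summarize the LLaMA 3 release history."}],
)
print(response.choices[0].message.content)
```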
The model family also seeded a large derivative ecosystem. Public fine-tunes built on LLaMA 3 backbones include code specialists and dedicated tool-use variants released by enterprise vendors. Meta also published reference implementations of the Llama Stack, a set of APIs and components covering inference, safety, retrieval-augmented generation, and agent orchestration, intended to give developers a uniform way to deploy LLaMA models across local, on-premises, and cloud environments.
For the multimodal tier (3.2 11B-Vision and 90B-Vision), Meta added a dedicated vision encoder bridged into the language backbone via cross-attention adapters, enabling image understanding tasks such as visual question answering, chart interpretation, and document analysis without disturbing the text-only behavior of the original models.
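The general shape of such an adapter is a gated cross-attention block: text hidden states attend to image-encoder outputs, and a zero-initialized gate makes the block an identity at the start of training, preserving the pretrained text behavior. The sketch below illustrates the technique in PyTorch under those assumptions, not Meta's exact module (which, for instance, uses its own normalization layers).

```python
# Sketch of a gated cross-attention adapter for vision-language bridging.
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity at init

    def forward(self, text_h: torch.Tensor, image_h: torch.Tensor) -> torch.Tensor:
        # text_h: (batch, text_len, d_model); image_h: (batch, img_len, d_model)
        attended, _ = self.attn(self.norm(text_h), image_h, image_h)
        return text_h + torch.tanh(self.gate) * attended
```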
LLaMA 3 was succeeded by LLaMA 4, released in April 2025. LLaMA 4 introduced Meta's first official mixture-of-experts language models (Scout, Maverick, and the in-training Behemoth) and a 10 million token context window for the Scout variant, marking a significant architectural break from the dense-only LLaMA 3 generation [5].