LLaMA (Large Language Model Meta AI), stylized as Llama from version 2 onward, is a family of large language models developed by Meta AI (formerly Facebook AI Research, or FAIR). First released in February 2023, the Llama series has grown into one of the most widely adopted open-weight model families in the history of artificial intelligence. The series spans multiple generations, from the original LLaMA with up to 65 billion parameters to Llama 4's mixture-of-experts models with nearly 2 trillion total parameters. Llama models have been downloaded over 1.2 billion times as of 2025 and have spawned tens of thousands of derivative models on platforms like Hugging Face.
The Llama family represents Meta's commitment to open-weight AI research. Unlike proprietary models from OpenAI or Google, Meta has made Llama weights freely available for research and (from Llama 2 onward) commercial use. This decision has had a transformative effect on the AI ecosystem, enabling researchers, startups, and independent developers to build on top of state-of-the-art language models without the cost of training from scratch.
Each generation of Llama has introduced significant improvements in model size, training data scale, context length, and architectural innovation. The series progressed from a text-only, dense transformer architecture in LLaMA 1 to natively multimodal mixture-of-experts models in Llama 4 that can process text, images, and video in a single unified framework.
Meta AI announced LLaMA on February 24, 2023, alongside a research paper titled "LLaMA: Open and Efficient Foundation Language Models" (arXiv:2302.13971). The project was led by the FAIR (Fundamental AI Research) team at Meta. The stated goal was to demonstrate that smaller models trained on more data could match or exceed the performance of much larger models, challenging the prevailing assumption that raw parameter count was the primary driver of capability.
LLaMA was initially released under a non-commercial research license. Access was granted on a case-by-case basis to academic researchers, government-affiliated organizations, civil society groups, and industry research laboratories.
LLaMA 1 consisted of four model sizes:
| Model | Parameters | Dimension | Attention Heads | Layers | Learning Rate | Batch Size | Training Tokens |
|---|---|---|---|---|---|---|---|
| LLaMA 7B | 7 billion | 4,096 | 32 | 32 | 3.0e-4 | 4M | 1T |
| LLaMA 13B | 13 billion | 5,120 | 40 | 40 | 3.0e-4 | 4M | 1T |
| LLaMA 33B | 33 billion | 6,656 | 52 | 60 | 1.5e-4 | 4M | 1.4T |
| LLaMA 65B | 65 billion | 8,192 | 64 | 80 | 1.5e-4 | 4M | 1.4T |
All models used a context window of 2,048 tokens. The training dataset comprised 1.4 trillion tokens drawn from publicly available sources:
| Source | Proportion |
|---|---|
| CCNet (Common Crawl) | 67% |
| C4 | 15% |
| GitHub | 4.5% |
| Wikipedia | 4.5% |
| Books | 4.5% |
| ArXiv | 2.5% |
| Stack Exchange | 2% |
The Wikipedia and Books data included text in 20 languages: Bulgarian, Catalan, Czech, Danish, German, English, Spanish, French, Croatian, Hungarian, Italian, Dutch, Polish, Portuguese, Romanian, Russian, Slovenian, Serbian, Swedish, and Ukrainian.
LLaMA 1 used a decoder-only transformer architecture with several modifications compared to the original transformer design:
- Pre-normalization with RMSNorm: the input of each transformer sub-layer is normalized (rather than the output), using RMSNorm in place of LayerNorm, to improve training stability.
- SwiGLU activation: the feed-forward networks use the SwiGLU activation function rather than ReLU.
- Rotary position embeddings (RoPE): absolute positional embeddings are replaced by rotary embeddings applied to the query and key vectors at each layer.
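To make the overall structure concrete, the following is a minimal sketch of one pre-norm decoder block of this kind in PyTorch. The attention and feed-forward modules are passed in as placeholders (the names and dimensions are illustrative, not Meta's code), and `torch.nn.RMSNorm` requires PyTorch 2.4 or later.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Minimal pre-norm decoder block in the LLaMA style (sketch only)."""
    def __init__(self, dim: int, attention: nn.Module, feed_forward: nn.Module):
        super().__init__()
        # Pre-normalization: inputs to each sub-layer are normalized with RMSNorm.
        self.attention_norm = nn.RMSNorm(dim)
        self.ffn_norm = nn.RMSNorm(dim)
        self.attention = attention        # placeholder for causal self-attention
        self.feed_forward = feed_forward  # placeholder for the SwiGLU feed-forward net

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connections wrap each normalized sub-layer.
        x = x + self.attention(self.attention_norm(x))
        x = x + self.feed_forward(self.ffn_norm(x))
        return x

# Toy usage with identity sub-layers, just to exercise the wiring.
block = DecoderBlock(dim=64, attention=nn.Identity(), feed_forward=nn.Identity())
out = block(torch.randn(2, 16, 64))
```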
LLaMA demonstrated that smaller, well-trained models could compete with much larger ones. LLaMA-13B outperformed GPT-3 (175B parameters) on most benchmarks despite being more than 10 times smaller. LLaMA-65B was competitive with Chinchilla-70B and PaLM-540B on standard evaluation tasks.
| Model | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA |
|---|---|---|---|---|---|---|---|---|
| LLaMA 7B | 76.5 | 79.8 | 48.9 | 76.1 | 70.1 | 76.7 | 47.6 | 57.2 |
| LLaMA 13B | 78.1 | 80.1 | 50.4 | 79.2 | 73.0 | 78.1 | 52.7 | 56.4 |
| LLaMA 33B | 83.1 | 82.3 | 50.4 | 82.8 | 76.0 | 81.4 | 57.8 | 58.6 |
| LLaMA 65B | 85.3 | 82.8 | 52.3 | 84.2 | 77.0 | 81.5 | 56.0 | 60.2 |
Although Meta intended LLaMA 1 for controlled distribution to vetted researchers, the model weights were leaked to the public on March 3, 2023. A torrent containing the weights was uploaded and shared on the 4chan imageboard, then spread rapidly through online AI communities. Within days, the full model was available to anyone via BitTorrent.
Meta responded by filing takedown requests with Hugging Face and a DMCA takedown request with GitHub on March 20, 2023. Both platforms complied. However, the leak had already spread widely, and copies of the weights remained accessible through various channels.
The incident drew attention from U.S. lawmakers. Senators Richard Blumenthal and Josh Hawley wrote to Meta CEO Mark Zuckerberg expressing concern over the leak. They argued that Meta appeared to have "failed to conduct any meaningful risk assessment in advance of release" and that the company's approach was "unrestrained and permissive." The letter cited potential misuse for spam, fraud, malware, privacy violations, and harassment.
Paradoxically, the leak accelerated the open-source AI movement. Developers and researchers who gained access to the weights quickly began experimenting, producing fine-tuned variants and adaptations that demonstrated the potential of open-weight models. This groundswell of community activity is widely credited with influencing Meta's decision to release subsequent Llama versions under more permissive terms.
On July 18, 2023, Meta released Llama 2 in partnership with Microsoft. In a significant shift from LLaMA 1's restricted license, Llama 2 was made freely available for both research and commercial use. The license allowed most commercial applications but included restrictions for organizations with more than 700 million monthly active users, effectively requiring the largest technology companies to negotiate separate agreements.
This release represented Meta's strategic bet that an open ecosystem around Llama would benefit the company more than a closed approach. The partnership with Microsoft meant Llama 2 was available from day one in the Azure AI model catalog, as well as through Amazon Web Services, Hugging Face, and other cloud providers.
Llama 2 was available in three primary sizes: 7B, 13B, and 70B parameters. Meta also trained a 34B-parameter variant that was tested internally but not publicly released with the initial batch. Each model was trained on 2 trillion tokens of publicly available data, a 40 percent increase over LLaMA 1's training corpus. The context length was doubled from 2,048 to 4,096 tokens.
| Model | Parameters | Training Tokens | Context Length |
|---|---|---|---|
| Llama 2 7B | 7 billion | 2T | 4,096 |
| Llama 2 13B | 13 billion | 2T | 4,096 |
| Llama 2 70B | 70 billion | 2T | 4,096 |
Alongside the base pretrained models, Meta released Llama 2-Chat, a set of models fine-tuned specifically for dialogue applications. Llama 2-Chat was trained through a multi-stage process:
- Supervised fine-tuning (SFT) on high-quality instruction-following demonstrations.
- Reinforcement learning from human feedback (RLHF), using separate reward models for helpfulness and safety together with iterative rounds of rejection sampling and proximal policy optimization (PPO).
- Ghost Attention (GAtt), a fine-tuning technique that helps the model keep following system-level instructions across multi-turn dialogues.
Llama 2-Chat models were available in 7B, 13B, and 70B sizes. The RLHF process improved the model's ability to follow instructions, produce helpful responses, and refuse harmful or inappropriate requests.
Llama 2 retained most of the architectural choices from LLaMA 1 (RMSNorm, SwiGLU, RoPE) but introduced Grouped-Query Attention (GQA) in the 70B model. GQA is a compromise between standard Multi-Head Attention (MHA) and Multi-Query Attention (MQA). It allows multiple query heads to share the same set of key and value heads, reducing the memory footprint and computational overhead of the KV cache during inference. This improvement made the 70B model substantially more efficient to deploy.
On August 24, 2023, Meta released Code Llama, a specialized variant of Llama 2 fine-tuned for code generation and understanding. Code Llama supported many popular programming languages including Python, C++, Java, PHP, TypeScript, C#, and Bash.
Code Llama was released in three sizes (7B, 13B, and 34B parameters), each trained on an additional 500 billion tokens of code and code-related data. Meta also provided two specialized variants:
- Code Llama - Python, further specialized on an additional 100 billion tokens of Python code.
- Code Llama - Instruct, fine-tuned to follow natural-language instructions about code.
The 7B and 13B models additionally supported fill-in-the-middle (FIM) capability, allowing them to insert code into existing code blocks for tasks like code completion. Code Llama was released under the same permissive license as Llama 2.
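As a rough illustration of how fill-in-the-middle prompting is structured, the sketch below assembles an infilling prompt from a prefix and a suffix. The sentinel token names are illustrative placeholders rather than Code Llama's exact special tokens, which should be taken from the official tokenizer.

```python
# Sketch of a fill-in-the-middle (FIM) prompt. The sentinel strings below are
# illustrative placeholders, not necessarily Code Llama's exact special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    # The model is asked to generate the span that belongs between `prefix`
    # and `suffix`, stopping at an end-of-infill token.
    return f"{PRE} {prefix} {SUF}{suffix} {MID}"

prompt = build_fim_prompt(
    prefix="def fibonacci(n):\n    ",
    suffix="\n    return a",
)
print(prompt)
```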
Meta released Llama 3 on April 18, 2024, with pretrained and instruction-tuned models in two sizes: 8B and 70B parameters. Meta described Llama 3 as "the most capable openly available LLM to date" at the time of its release.
Llama 3 represented a major leap in training scale. The models were pretrained on over 15 trillion tokens of publicly available data, seven times more than Llama 2. Compared to its predecessor, Llama 3 was three times more efficient to train, and the training data contained four times more code.
| Model | Parameters | Training Tokens | Context Length | Vocabulary Size |
|---|---|---|---|---|
| Llama 3 8B | 8 billion | 15T+ | 8,192 | 128K |
| Llama 3 70B | 70 billion | 15T+ | 8,192 | 128K |
One of the most significant changes in Llama 3 was a new tokenizer with a vocabulary of 128,000 tokens, four times larger than Llama 2's 32,000-token vocabulary. This larger vocabulary allowed the tokenizer to encode text much more efficiently, producing up to 15 percent fewer tokens for the same input text. Fewer tokens per input means faster inference and the ability to fit more content within the context window.
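As a hedged illustration of this efficiency gain, the snippet below counts tokens for the same sentence under both tokenizers using the Hugging Face transformers library. The repository names are the commonly used gated Llama checkpoints and assume the licenses have been accepted and an access token configured.

```python
# Counting tokens for the same text under the Llama 2 and Llama 3 tokenizers.
# Both repositories are gated: accepting Meta's license and logging in with a
# Hugging Face token is assumed.
from transformers import AutoTokenizer

text = "Large language models compress text into tokens before processing it."

tok_llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tok_llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print("Llama 2 tokens:", len(tok_llama2.encode(text)))
print("Llama 3 tokens:", len(tok_llama3.encode(text)))
```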
Llama 3 retained the decoder-only transformer architecture with RMSNorm, SwiGLU, and RoPE. A notable change was the adoption of Grouped-Query Attention (GQA) across both the 8B and 70B model sizes, whereas in Llama 2, GQA was used only in the 70B model. This improved inference efficiency across the entire model family.
The fine-tuning process for the instruction-tuned models incorporated publicly available instruction datasets as well as over 10 million human-annotated examples, a substantial increase over Llama 2's fine-tuning data.
On July 23, 2024, Meta released Llama 3.1 with updated versions of the 8B and 70B models and a new flagship: the 405B-parameter model. This was the largest openly available language model at the time and the first open model that Meta claimed could rival leading proprietary models like GPT-4, GPT-4o, and Claude 3.5 Sonnet.
Training the 405B model required over 16,000 NVIDIA H100 GPUs and over 15 trillion tokens of training data. Meta deliberately chose a dense transformer architecture rather than a mixture-of-experts design to maximize training stability at this unprecedented scale. For production deployment, the model was quantized from 16-bit (BF16) to 8-bit (FP8) precision to reduce resource requirements.
All Llama 3.1 models (8B, 70B, and 405B) supported a 128K-token context length, a 16-fold increase over Llama 3's 8,192-token context window. This extended context enabled use cases like long-form document summarization, codebase analysis, and multi-turn conversational agents that need to maintain context across many exchanges.
Llama 3.1 added official multilingual support for eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
| Model | Parameters | Context Length | Training Tokens | Languages |
|---|---|---|---|---|
| Llama 3.1 8B | 8 billion | 128K | 15T+ | 8 |
| Llama 3.1 70B | 70 billion | 128K | 15T+ | 8 |
| Llama 3.1 405B | 405 billion | 128K | 15T+ | 8 |
Meta evaluated Llama 3.1 on over 150 benchmark datasets. The 405B model demonstrated strong performance in general knowledge, long-form text generation, multilingual translation, coding, mathematics, tool use, and advanced reasoning. It was the first openly available model to be broadly competitive with frontier proprietary models across these categories.
At Meta Connect 2024 in September, Meta released Llama 3.2, which split the Llama family in two new directions: multimodal vision models and lightweight edge models.
The Llama 3.2 11B and 90B vision language models (VLMs) were Meta's first multimodal Llama releases. These models could process both text and images, enabling tasks like image captioning, visual question answering, and document understanding. They were trained on a dataset of 6 billion image-text pairs.
The vision models were designed as drop-in replacements for their text-only counterparts, meaning existing applications using Llama 3.1 could upgrade to gain image understanding capabilities with minimal code changes. Meta reported that the 11B and 90B vision models exceeded Claude 3 Haiku on image understanding tasks.
The Llama 3.2 1B and 3B models were designed for on-device deployment on edge and mobile hardware. Despite their small size, they supported the full 128K-token context length and were trained on 9 trillion tokens. These models were optimized from day one for Qualcomm and MediaTek hardware and for Arm processors.
The 3B model outperformed Gemma 2 2.6B and Phi 3.5-mini on instruction following, summarization, prompt rewriting, and tool use benchmarks.
| Model | Parameters | Type | Context Length | Key Capability |
|---|---|---|---|---|
| Llama 3.2 1B | 1 billion | Text-only | 128K | Edge/mobile deployment |
| Llama 3.2 3B | 3 billion | Text-only | 128K | Edge/mobile deployment |
| Llama 3.2 11B | 11 billion | Vision + Text | 128K | Image understanding |
| Llama 3.2 90B | 90 billion | Vision + Text | 128K | Image understanding |
On December 6, 2024, Meta released Llama 3.3, a text-only instruction-tuned model with 70 billion parameters. Llama 3.3 70B delivered performance comparable to the much larger Llama 3.1 405B while requiring only a fraction of the computational resources.
The model showed substantial improvements in reasoning, mathematical understanding, coding, tool calling, and multilingual text support compared to Llama 3.1 70B. It was pretrained on approximately 15 trillion tokens and fine-tuned with over 25 million synthetically generated examples in addition to publicly available instruction datasets. Training utilized a cumulative 39.3 million GPU hours on H100-80GB hardware.
Llama 3.3 supported the same eight languages as Llama 3.1: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Meta released Llama 4 on April 5, 2025, marking the most significant architectural shift in the series. Llama 4 introduced two major changes: a mixture-of-experts (MoE) architecture and native multimodality through early fusion.
Llama 4 was Meta's first model family to use a mixture-of-experts architecture. In an MoE model, each input token is routed to only a subset of the model's total parameters (the "active" parameters), while the remaining parameters (organized as specialized "expert" sub-networks) stay dormant for that token. This design allows the model to have a very large total parameter count for knowledge capacity while keeping per-token computation costs manageable.
Each token in a Llama 4 model is processed by a shared expert plus one routed expert selected from the available expert pool. The architecture also uses alternating dense layers alongside the MoE layers.
Unlike Llama 3.2's vision models (which added multimodal capabilities on top of a text-only foundation), Llama 4 was natively multimodal from the start of pretraining. Meta used an "early fusion" approach in which text, image, and video tokens are combined into a single unified representation during pretraining itself. This means the model does not freeze text parameters or use separate multimodal parameters when training with images and videos. Instead, all modalities share the same representational space from the beginning.
The vision encoder in Llama 4 is based on MetaCLIP but was trained separately in conjunction with a frozen Llama model to better adapt the encoder to the LLM's internal representations.
Llama 4 Scout has 109 billion total parameters organized into 16 experts, with 17 billion active parameters per token. Its most notable feature is an industry-leading context window of 10 million tokens, achieved through a new architecture called iRoPE, which interleaves attention layers that use rotary position embeddings with attention layers that use no positional embeddings. The model was pretrained with a 256K-token context and then extended.
Despite its large context window and total parameter count, Scout fits on a single NVIDIA H100 GPU thanks to its MoE architecture (only 17B parameters are active per token). Meta reported that Scout outperformed Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a broad range of benchmarks.
Llama 4 Maverick scales up the expert count to 128 routed experts (plus a shared expert), giving it 400 billion total parameters while maintaining the same 17 billion active parameters per token as Scout. Maverick fits on a single NVIDIA H100 DGX host.
Meta described Maverick as the best multimodal model in its class, reporting that it beat GPT-4o and Gemini 2.0 Flash across a broad range of benchmarks. An experimental chat-optimized version of Maverick achieved an ELO score of 1,417 on LMArena. Meta also noted that Maverick achieved comparable results to DeepSeek v3 on reasoning and coding tasks.
Llama 4 Behemoth is the largest model in the family, with 288 billion active parameters, 16 experts, and nearly 2 trillion total parameters. As of mid-2025, Behemoth was still in training and had not been publicly released. Meta disclosed that Behemoth serves as a teacher model for distilling knowledge into the smaller Scout and Maverick models.
Even in its unfinished state, Meta reported that Behemoth outperformed GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks such as MATH-500 and GPQA Diamond.
Pre-training Llama 4 Behemoth with FP8 precision across 32,000 GPUs achieved a throughput of 390 TFLOPs per GPU.
All Llama 4 models were trained on over 30 trillion tokens, more than double the Llama 3 pretraining mixture. The training data included diverse text, image, and video datasets with coverage of over 200 languages, with 100 or more languages having at least 1 billion tokens each.
The post-training pipeline for Llama 4 consisted of three stages: lightweight supervised fine-tuning (SFT), online reinforcement learning (RL), and lightweight Direct Preference Optimization (DPO).
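The DPO objective itself is published independently of Meta's pipeline; the following is a minimal sketch of the pairwise DPO loss over chosen and rejected responses, not Meta's training code, with an illustrative beta value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss. Inputs are summed log-probabilities of the chosen and
    rejected responses under the trainable policy and the frozen reference model."""
    # Implicit reward: scaled log-ratio of policy vs. reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of four preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```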
| Model | Active Parameters | Total Parameters | Experts | Context Length | Status |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B | 109B | 16 | 10M | Released (April 2025) |
| Llama 4 Maverick | 17B | 400B | 128 (+1 shared) | Not specified | Released (April 2025) |
| Llama 4 Behemoth | 288B | ~2T | 16 | Not specified | Training (as of mid-2025) |
The following table summarizes all major Llama releases:
| Version | Release Date | Model Sizes | Max Parameters | Context Length | Training Tokens | Architecture | License |
|---|---|---|---|---|---|---|---|
| LLaMA 1 | February 2023 | 7B, 13B, 33B, 65B | 65B | 2,048 | 1.4T | Dense transformer | Non-commercial research |
| Llama 2 | July 2023 | 7B, 13B, 70B | 70B | 4,096 | 2T | Dense transformer + GQA (70B) | Commercial (with restrictions) |
| Code Llama | August 2023 | 7B, 13B, 34B | 34B | 16K (stable up to ~100K) | 500B additional | Dense transformer | Commercial (with restrictions) |
| Llama 3 | April 2024 | 8B, 70B | 70B | 8,192 | 15T+ | Dense transformer + GQA (all sizes) | Commercial (Llama 3 license) |
| Llama 3.1 | July 2024 | 8B, 70B, 405B | 405B | 128K | 15T+ | Dense transformer + GQA | Commercial (Llama 3.1 license) |
| Llama 3.2 | September 2024 | 1B, 3B, 11B, 90B | 90B | 128K | Up to 9T (small models) | Dense transformer; vision adapters | Commercial (Llama 3.2 license) |
| Llama 3.3 | December 2024 | 70B | 70B | 128K | ~15T | Dense transformer + GQA | Commercial (Llama 3.3 license) |
| Llama 4 | April 2025 | 109B, 400B, ~2T (total) | ~2T total (288B active) | Up to 10M | 30T+ | MoE + early fusion multimodal | Llama 4 license |
The Llama series has undergone steady architectural refinement across its generations. The core building blocks established in LLaMA 1 have persisted, but each generation introduced targeted improvements.
RMSNorm (Root Mean Square Normalization): All Llama models use pre-normalization with RMSNorm rather than the standard LayerNorm used in the original transformer. RMSNorm omits the mean-centering step, reducing computation by 5 to 15 percent per normalization layer while maintaining training stability.
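A minimal RMSNorm implementation matching this description follows; the dimension in the usage line is illustrative.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square normalization: no mean-centering and no bias term."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector by the reciprocal of its root mean square.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

y = RMSNorm(dim=4096)(torch.randn(2, 8, 4096))
```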
SwiGLU Activation: The feed-forward network in every Llama transformer block uses the SwiGLU activation function, which combines a gating mechanism with the Swish activation. SwiGLU provides better expressiveness than ReLU and avoids the dead neuron problem, at the cost of requiring three weight projections instead of two (offset by reducing the intermediate dimension).
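A sketch of a SwiGLU feed-forward block in this style is shown below: a gate projection passed through SiLU (Swish) is multiplied elementwise with an up projection, then mapped back down. The hidden size used here is illustrative rather than the value of any particular released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU: gate, up, and down projections."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (Swish) on the gate path, multiplied elementwise with the up path.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

y = SwiGLUFeedForward(dim=4096, hidden_dim=11008)(torch.randn(2, 8, 4096))
```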
Rotary Position Embeddings (RoPE): All Llama models encode positional information through RoPE, which applies rotation matrices to query and key vectors based on their positions. RoPE naturally encodes relative distances between tokens without additional learned parameters.
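The sketch below illustrates the rotation idea: pairs of feature channels are rotated by a position-dependent angle, so query-key dot products depend only on relative offsets. It is a simplified reference implementation (half-split channel pairing, base frequency 10,000), not the optimized kernel or the exact convention of any official release.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to `x` of shape
    [batch, seq, heads, head_dim]. Minimal sketch, not an optimized kernel."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per channel pair.
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    # Rotation angle = position index * frequency.
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), freqs)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied to each (x1, x2) channel pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q_rotated = apply_rope(torch.randn(1, 16, 32, 128))
```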
Introduced in Llama 2 (70B only) and expanded to all sizes in Llama 3, Grouped-Query Attention (GQA) groups multiple query heads to share a single set of key-value heads. This reduces the memory required for the KV cache during inference, improving throughput and enabling longer sequences without proportional memory increases.
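A minimal sketch of grouped-query attention follows: key and value heads are repeated so that each group of query heads shares one KV head before a standard attention call. Head counts and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: [batch, n_q_heads, seq, head_dim]; k, v: [batch, n_kv_heads, seq, head_dim].
    Each group of query heads shares a single key/value head."""
    group_size = q.shape[1] // k.shape[1]
    # Repeat each KV head so it lines up with its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

out = grouped_query_attention(
    torch.randn(1, 32, 16, 128),  # 32 query heads
    torch.randn(1, 8, 16, 128),   # 8 shared key/value heads
    torch.randn(1, 8, 16, 128),
)
```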
Llama 4 introduced MoE layers where each token is routed to a shared expert plus one selected routed expert. This allows Llama 4 models to have very large total parameter counts (for storing broad knowledge) while keeping active computation per token at just 17 billion parameters. The architecture alternates MoE layers with standard dense layers.
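The sketch below illustrates this routing pattern: every token passes through a shared expert, and a learned router adds the output of a single routed expert on top. Expert sizes, gating details, and load-balancing terms are simplified away; this is not Meta's implementation.

```python
import torch
import torch.nn as nn

class SimpleMoELayer(nn.Module):
    """Shared expert plus top-1 routed expert per token (sketch only)."""
    def __init__(self, dim: int, hidden_dim: int, n_routed_experts: int):
        super().__init__()
        def make_expert() -> nn.Module:
            return nn.Sequential(
                nn.Linear(dim, hidden_dim, bias=False),
                nn.SiLU(),
                nn.Linear(hidden_dim, dim, bias=False),
            )
        self.shared_expert = make_expert()
        self.routed_experts = nn.ModuleList(make_expert() for _ in range(n_routed_experts))
        self.router = nn.Linear(dim, n_routed_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, dim]. The router scores every routed expert per token.
        gate = self.router(x).softmax(dim=-1)
        weight, expert_idx = gate.max(dim=-1)  # top-1 routing
        routed_out = torch.zeros_like(x)
        for i, expert in enumerate(self.routed_experts):
            mask = expert_idx == i
            if mask.any():
                # Only tokens routed to expert i are processed by it.
                routed_out[mask] = weight[mask, None] * expert(x[mask])
        # Every token also passes through the always-on shared expert.
        return self.shared_expert(x) + routed_out

y = SimpleMoELayer(dim=512, hidden_dim=2048, n_routed_experts=4)(torch.randn(10, 512))
```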
Llama 4 Scout introduced iRoPE (interleaved Rotary Position Embeddings), a variant of RoPE that uses interleaved attention layers with and without rotary position embeddings. This technique enabled the 10-million-token context window, a massive jump from the 128K context in Llama 3.1.
Prior multimodal Llama models (Llama 3.2 vision) added image understanding on top of a pretrained text model. Llama 4 instead uses early fusion, integrating text, image, and video tokens into a shared representation during pretraining. The vision encoder is based on MetaCLIP and was co-trained with the language model, producing better cross-modal understanding.
The release (and leak) of LLaMA 1 ignited an explosion of community-built derivative models. This ecosystem has grown with each successive Llama release, making the Llama family one of the most forked and adapted model families in AI history.
One of the earliest and most influential derivatives, Stanford Alpaca was created by Stanford University researchers in March 2023. The team fine-tuned the LLaMA 7B model on 52,000 instruction-following demonstrations generated using OpenAI's text-davinci-003 API. Alpaca demonstrated that a relatively small, inexpensive fine-tuning process could produce a model with instruction-following capabilities comparable to much larger systems. The total fine-tuning cost was reported at under $600.
Vicuna-13B was developed by researchers at UC Berkeley, CMU, Stanford, and UCSD. It was created by fine-tuning LLaMA-13B on approximately 70,000 user-shared conversations collected from ShareGPT. The researchers reported that Vicuna achieved more than 90 percent of the quality of ChatGPT responses, as evaluated by GPT-4. The training cost was approximately $300.
The Llama ecosystem has produced numerous other important models:
- Koala, a UC Berkeley chatbot built by fine-tuning LLaMA-13B on dialogue data gathered from the web.
- Guanaco, a family of chat models introduced alongside the QLoRA paper, fine-tuned from LLaMA with 4-bit quantized low-rank adapters.
- WizardLM, fine-tuned on instructions automatically expanded with the Evol-Instruct method.
- Llama Guard, Meta's own family of safety-classifier models built on Llama and used to filter prompts and responses.
The Llama architecture and training techniques influenced several independent model families that, while not direct derivatives, drew significant inspiration from Meta's work:
- Mistral 7B, whose decoder-only design closely follows the Llama recipe (RMSNorm, SwiGLU, RoPE, grouped-query attention) while adding sliding-window attention.
- OpenLLaMA, an open reproduction of the LLaMA architecture trained on RedPajama, a public re-creation of the LLaMA 1 training mixture.
- Open-weight families such as Qwen (Alibaba) and Yi (01.AI), which adopted similar decoder-only architectures and permissive weight releases.
By 2025, the Llama ecosystem had reached remarkable scale. Meta reported over 1.2 billion cumulative downloads across all Llama models. On Hugging Face alone, tens of thousands of Llama derivative models were published, with monthly downloads of community-created variants reaching into the hundreds of thousands. The usage of Llama models doubled between May and July 2024 alone, following the release of Llama 3.1.
The open availability of Llama weights has enabled a rich ecosystem of fine-tuning tools and deployment options.
Several frameworks and techniques have become standard for adapting Llama models:
- LoRA (Low-Rank Adaptation) and QLoRA, which train small low-rank adapter matrices, optionally on top of a 4-bit quantized base model, instead of updating all weights (see the configuration sketch after this list).
- Hugging Face libraries such as transformers, PEFT, and TRL, which provide reference implementations of supervised fine-tuning, LoRA, and preference optimization for Llama checkpoints.
- Purpose-built fine-tuning tools such as Axolotl, Unsloth, and Meta's own torchtune and llama-cookbook (formerly llama-recipes).
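The following is a minimal sketch of attaching LoRA adapters to a Llama checkpoint with Hugging Face PEFT. The model identifier is the gated Llama 3.1 8B repository (license acceptance and an access token are assumed), and the rank, scaling, and target-module choices shown are common illustrative defaults rather than official recommendations.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Loading the full model needs roughly 16 GB of memory in bf16 and gated access.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the update
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```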
Llama models can be deployed through multiple channels:
- Managed cloud services, including Amazon Bedrock, Microsoft Azure AI, and Google Cloud Vertex AI, alongside various inference-as-a-service providers.
- Hugging Face, which hosts the official checkpoints and offers hosted inference endpoints.
- Self-hosted GPU inference servers such as vLLM and TensorRT-LLM (see the sketch after this list).
- Local runtimes such as llama.cpp and Ollama, which run quantized models on consumer CPUs and GPUs.
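As a hedged example of the self-hosted route, the snippet below performs offline batch generation with vLLM. The model identifier is the gated Llama 3.1 8B Instruct repository; license acceptance and sufficient GPU memory are assumed.

```python
from vllm import LLM, SamplingParams

# Loads the gated instruct checkpoint onto the local GPU(s).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain grouped-query attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```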
To make large Llama models practical for deployment on consumer and edge hardware, several quantization approaches are commonly used:
- Post-training weight quantization methods such as GPTQ and AWQ, which compress weights to roughly 4-bit precision with modest accuracy loss.
- The GGUF format used by llama.cpp, which offers a range of integer quantization levels for CPU and consumer-GPU inference.
- bitsandbytes 8-bit and 4-bit (NF4) quantization, frequently combined with QLoRA fine-tuning (see the sketch after this list).
- FP8 quantization, used by Meta itself for production serving of the Llama 3.1 405B model.
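A minimal sketch of 4-bit NF4 loading with bitsandbytes via transformers follows. The model identifier is the gated Llama 3.1 8B repository, and the quantization settings shown are common defaults rather than an official recommendation.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto",  # requires the accelerate package
)
```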
The Llama series has had a profound impact on the broader AI field. Before LLaMA 1, state-of-the-art language models were almost exclusively controlled by a handful of well-funded labs (OpenAI, Google, Anthropic). The release of competitive open-weight models changed the dynamics of the field in several ways.
By making high-quality model weights freely available, Meta enabled researchers at universities and smaller organizations to conduct experiments that previously required millions of dollars in compute budgets. This led to a surge in published research on topics like fine-tuning efficiency, alignment techniques, model merging, and quantization.
The permissive licensing of Llama 2 and subsequent versions allowed startups and enterprises to build commercial products on top of Llama without paying per-token API fees. Companies could run Llama models on their own infrastructure, maintaining data privacy and reducing costs compared to proprietary API-based approaches.
Open-weight models enabled independent safety researchers to study model behavior, test for biases, and develop alignment techniques without relying on API access that could be revoked. This transparency has been both praised (for enabling scrutiny) and criticized (for making it easier to remove safety guardrails).
The availability of strong open-weight models put competitive pressure on proprietary model providers, contributing to price reductions and more generous free tiers across the industry. The open-weight movement also prompted other organizations (Mistral AI, 01.AI, Alibaba, and others) to release their own model weights.
As with all large language models, the Llama family carries risks related to misuse and harm.
Llama models are trained on data from the web and therefore reflect biases present in their training data. Meta has evaluated Llama models for biases related to gender, religion, race, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Meta applied data filtering during training (using Kneser-Ney language models and fastText classifiers to filter based on proximity to Wikipedia-quality text) and RLHF during fine-tuning to reduce harmful outputs.
The open availability of Llama weights means that safety guardrails applied during fine-tuning can potentially be removed through additional fine-tuning. This has raised concerns from policymakers and safety researchers about the potential for misuse in generating misinformation, malware, or other harmful content. Meta has argued that the benefits of open access (including enabling independent safety research) outweigh these risks.
Meta publishes responsible use guides alongside each Llama release, providing guidance on safe deployment practices, content filtering, and risk mitigation. The Llama license includes an acceptable use policy that prohibits specific harmful applications.