Code Llama is a family of large language models specialized for code generation and understanding, developed by Meta AI. Released in August 2023, Code Llama is built on top of Llama 2 and further trained on code-heavy datasets, producing models that achieve state-of-the-art performance among open-weight models on standard coding benchmarks. The model family includes three variants (Code Llama base, Code Llama-Python, and Code Llama-Instruct) offered in four parameter sizes: 7B, 13B, 34B, and 70B. Code Llama supports advanced capabilities such as fill-in-the-middle (FIM) code infilling, extended context windows of up to 100,000 tokens, and zero-shot instruction following for programming tasks.
The accompanying research paper, "Code Llama: Open Foundation Models for Code," was authored by Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jeremy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Defossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. The paper was first posted on arXiv on August 24, 2023, and was later published as a conference paper at the Conference on Language Modeling (COLM) in 2024.
Before Code Llama, several models had demonstrated the potential of language models trained on code. OpenAI's Codex, the model behind GitHub Copilot, showed that GPT-based architectures could be adapted for code generation. The open-source community produced models such as StarCoder (from the BigCode project, a collaboration between Hugging Face and ServiceNow) and CodeGen (from Salesforce), which offered publicly available alternatives but lagged behind proprietary systems in performance or carried restrictive licensing terms.
Meta's motivation for Code Llama was to provide a fully open-weight, commercially usable family of code models that could match or exceed the performance of proprietary systems. By starting from the already capable Llama 2 foundation and applying targeted code specialization, the research team aimed to create models that developers could run locally, fine-tune for domain-specific tasks, and integrate into their own tools without reliance on closed API services.
Code Llama inherits the transformer architecture of Llama 2, which is a decoder-only autoregressive transformer. The architecture incorporates several modern design choices that distinguish it from the original transformer formulation:
The tokenizer is a byte-pair encoding (BPE) tokenizer shared with Llama 2, featuring a vocabulary of 32,000 tokens.
| Parameter | 7B | 13B | 34B | 70B |
|---|---|---|---|---|
| Transformer layers | 32 | 40 | 48 | 80 |
| Attention heads | 32 | 40 | 64 | 64 |
| Embedding dimension | 4,096 | 5,120 | 8,192 | 8,192 |
| Context length (training) | 16,384 | 16,384 | 16,384 | 16,384 |
| Context length (inference) | 100,000 | 100,000 | 100,000 | 100,000 |
| Vocabulary size | 32,000 | 32,000 | 32,000 | 32,000 |
| GQA | No | No | Yes | Yes |
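The GQA row refers to grouped-query attention, in which several query heads share a single key/value head, shrinking the KV cache for the larger models. A minimal sketch of the head-expansion step (the 64-query/8-KV grouping shown is illustrative, based on the Llama 2 70B configuration):

```python
def repeat_kv(kv, n_rep):
    """kv is a list of n_kv_heads per-head tensors; replicate each
    shared head n_rep times so every query head has a matching KV head."""
    return [head for head in kv for _ in range(n_rep)]

n_heads, n_kv_heads = 64, 8                      # illustrative GQA grouping
kv = [f"kv_head_{i}" for i in range(n_kv_heads)]  # stand-ins for KV tensors
expanded = repeat_kv(kv, n_heads // n_kv_heads)   # 8 query heads per KV head
```

With 8 KV heads serving 64 query heads, the KV cache is one eighth the size it would be under standard multi-head attention, which matters most at long context lengths.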
Code Llama is offered in three distinct variants, each tailored to different use cases:
The foundation model serves as the general-purpose code generation backbone. It is initialized from Llama 2 and further trained on a large corpus of code and code-adjacent data. The base model supports a wide range of programming languages and is suitable for code completion, code generation from natural language descriptions, and general code understanding tasks.
This variant undergoes additional training on a Python-specific dataset of 100 billion tokens after the initial code training phase. The Python-focused dataset comprises 75% Python code, 10% code in other languages, 10% natural language related to code, and 5% general natural language. Code Llama-Python consistently outperforms the base variant on Python-specific benchmarks such as HumanEval and is designed for developers working primarily in Python.
The instruction-tuned variant is fine-tuned to follow natural language instructions for programming tasks. It is designed for conversational coding assistance, where a user describes a task in plain English and the model generates the corresponding code. The instruction tuning uses a combination of proprietary instruction data and a self-instruct dataset generated through an automated pipeline (described in the Training section below). Code Llama-Instruct is the recommended variant for interactive and safety-sensitive applications.
| Variant | Description | Best For | FIM Support (7B/13B) | FIM Support (34B) | FIM Support (70B) |
|---|---|---|---|---|---|
| Code Llama (Base) | General-purpose code model | Code completion, generation | Yes | No | No |
| Code Llama-Python | Python-specialized | Python development | No | No | No |
| Code Llama-Instruct | Instruction-following | Conversational coding, Q&A | Yes | No | No |
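For the 7B and 13B Instruct models, conversational prompts are expected to follow the Llama 2 chat template. A minimal sketch of the prompt assembly (the `[INST]`/`<<SYS>>` layout is inherited from Llama 2; the 70B Instruct model uses a different format, and exact whitespace handling is tokenizer-dependent):

```python
def build_instruct_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap a user request in the Llama 2 chat template used by the
    7B/13B Code Llama-Instruct models."""
    if system_prompt:
        # The optional system prompt is folded into the first user turn.
        user_message = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message}"
    return f"[INST] {user_message} [/INST]"

prompt = build_instruct_prompt(
    "Write a Python function that checks if a string is a palindrome.",
    system_prompt="Answer with code only.",
)
```

The model then generates its answer as a continuation after the closing `[/INST]` tag.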
The training of Code Llama follows a multi-stage pipeline, where each stage progressively adds capabilities to the base Llama 2 model. The entire training process, covering all 12 models (three variants at four sizes), required approximately 1.4 million GPU hours on NVIDIA A100-80GB hardware, performed on Meta's Research Super Cluster.
Starting from the pre-trained Llama 2 checkpoints, the models are further trained on a near-deduplicated dataset of publicly available code. The 7B, 13B, and 34B models were each trained on 500 billion tokens, while the 70B model (released later, in January 2024) was trained on 1 trillion tokens.
The training data composition for this stage is:
| Data Source | Proportion |
|---|---|
| Open-source code from GitHub | 85% |
| Natural language about code (e.g., Stack Overflow discussions, code documentation) | 8% |
| General natural language | 7% |
The inclusion of natural language data, both code-related and general, is a deliberate design choice. It helps the model retain the strong natural language understanding capabilities of Llama 2, which is important for tasks like generating code from English descriptions or understanding code comments.
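The 85/8/7 mixture above can be realized by weighted sampling over the source datasets. A toy sketch of such a sampler (the source names and sampling scheme are illustrative, not Meta's actual data pipeline):

```python
import random

# Stage-1 training mixture proportions from the paper
sources = {
    "github_code": 0.85,      # open-source code from GitHub
    "code_related_nl": 0.08,  # e.g., Stack Overflow, documentation
    "general_nl": 0.07,       # general natural language
}

def sample_source(rng: random.Random) -> str:
    """Pick which dataset the next training document is drawn from."""
    names, weights = zip(*sources.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
code_share = draws.count("github_code") / len(draws)  # close to 0.85
```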
The model is trained on code from multiple programming languages. The primary languages supported include Python, C++, Java, PHP, TypeScript (JavaScript), C#, and Bash.
A key capability of Code Llama is its ability to perform code infilling, where the model generates code that fits between a given prefix and suffix. This is trained using a fill-in-the-middle (FIM) objective, applied to the 7B and 13B models (but not the 34B or 70B models).
During FIM training, a document is randomly split at two points into three segments: a prefix, a middle section, and a suffix. The model is then trained in a multitask fashion on two orderings: prefix-suffix-middle (PSM), in which the model sees the prefix and suffix before predicting the middle, and suffix-prefix-middle (SPM), in which the suffix is presented first.
The FIM transformation is applied with approximately 90% probability on training documents. Importantly, the authors demonstrated that FIM training does not degrade standard left-to-right autoregressive performance. Models trained with FIM achieve comparable scores to models trained without it on standard benchmarks, while gaining the additional infilling capability.
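The split-and-reorder transformation can be sketched as follows (the `<PRE>`/`<SUF>`/`<MID>` strings stand in for the model's special FIM tokens, and the exact token layout is an implementation detail):

```python
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"  # placeholders for special tokens

def fim_transform(doc: str, rng: random.Random, spm: bool = False) -> str:
    """Split a document at two random points and emit it in
    prefix-suffix-middle (PSM) or suffix-prefix-middle (SPM) order;
    either way, the model learns to generate the middle segment last."""
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    if spm:
        return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"  # SPM ordering
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"      # PSM ordering

rng = random.Random(42)
doc = "def add(a, b):\n    return a + b\n"
psm = fim_transform(doc, rng)
```

Because the three segments are only reordered, never altered, the transformed example still supervises every token of the original document.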
The 34B model was trained without the FIM objective, a decision based on the observation that it was intended primarily for large-scale code generation tasks where infilling was considered less critical. The 70B models, released later, were likewise trained without FIM support, leaving infilling a 7B- and 13B-only capability.
All Code Llama models undergo a dedicated long context fine-tuning stage that extends their effective context window. While Llama 2 was pre-trained with a context length of 4,096 tokens, Code Llama extends this to 16,384 tokens during training and demonstrates stable performance on sequences of up to 100,000 tokens at inference time.
The key technique enabling this extension is a modification to the rotary position embeddings (RoPE). The base period of the RoPE frequencies is increased from 10,000 (the Llama 2 default) to 1,000,000. This rescaling lengthens the wavelengths of the rotary frequencies, reducing the decay of attention scores between distant tokens and allowing the positional encoding to distinguish positions across much longer sequences. The long context fine-tuning phase uses a relatively small number of additional training steps, making it an efficient adaptation.
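The effect of the base-period change can be seen directly in the rotary frequencies. A small sketch comparing the longest wavelength (slowest-rotating dimension) under both bases, assuming a 128-dimensional attention head:

```python
import math

def rope_wavelengths(base: float, head_dim: int = 128):
    """Wavelength (in token positions) of each rotary frequency pair:
    lambda_i = 2*pi / theta_i, where theta_i = base**(-2i/head_dim)."""
    return [2 * math.pi * base ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

llama2 = rope_wavelengths(10_000)        # Llama 2 default base
codellama = rope_wavelengths(1_000_000)  # Code Llama long-context base
longest_before = llama2[-1]     # tens of thousands of tokens
longest_after = codellama[-1]   # millions of tokens
```

With the original base, the slowest dimension wraps around well before 100,000 tokens; with the larger base it completes only a small fraction of a period over that span, so distant positions remain distinguishable.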
The extended context capability is particularly valuable for code-related tasks, where developers frequently need to work with files containing thousands of lines. Repository-level code understanding, long function implementations, and cross-file dependency analysis all benefit from the larger context window.
The Code Llama-Instruct variants undergo an additional fine-tuning stage to learn how to follow natural language instructions. The instruction-tuning dataset is composed of two main sources: a proprietary instruction dataset used in Llama 2's own instruction tuning, and a machine-generated self-instruct dataset of coding questions, unit tests, and candidate solutions filtered by executing the tests.
This self-instruct approach uses execution feedback as a quality signal rather than relying on expensive human annotation or reinforcement learning from human feedback (RLHF). The instruction fine-tuning dataset also includes a rehearsal component: a small proportion of data from the original code training stage is mixed in to prevent catastrophic forgetting of the model's code generation capabilities.
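The execution-feedback filter at the heart of this pipeline can be sketched as follows (the candidate solutions and tests here are toy stand-ins; in the real pipeline both are generated by language models):

```python
def passes_tests(solution_src: str, test_src: str) -> bool:
    """Run a candidate solution against its unit tests; any exception
    or failed assertion counts as a rejection."""
    env: dict = {}
    try:
        exec(solution_src, env)   # define the candidate function
        exec(test_src, env)       # execute the assertions
        return True
    except Exception:
        return False

# Toy candidates standing in for model-generated solutions
candidates = [
    "def add(a, b):\n    return a - b",   # buggy candidate
    "def add(a, b):\n    return a + b",   # correct candidate
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
kept = [c for c in candidates if passes_tests(c, tests)]
```

Only solutions that actually pass their tests enter the instruction-tuning set, which is what makes execution a usable substitute for human quality judgments. (A production pipeline would sandbox the execution rather than call `exec` directly.)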
The paper also describes an experimental model called "Unnatural Code Llama," which is the Code Llama-Python 34B model fine-tuned on 15,000 samples from an "unnatural instructions" dataset (following the methodology of Honovich et al., 2023). This model achieved notably strong results (reported as 62.2% on HumanEval and 61.2% on MBPP), demonstrating that even a small set of high-quality, diverse coding data can yield significant improvements. However, this model was not publicly released.
Code infilling is one of the most practically useful capabilities of Code Llama, directly applicable to integrated development environments (IDEs) and code editors. When a developer is writing code, the model can predict what should appear at the cursor position given the code that comes before and after the cursor.
The infilling capability is supported in the following models:
| Model Size | Code Llama (Base) | Code Llama-Python | Code Llama-Instruct |
|---|---|---|---|
| 7B | Yes | No | Yes |
| 13B | Yes | No | Yes |
| 34B | No | No | No |
| 70B | No | No | No |
The FIM approach uses special tokens to delineate the prefix, middle, and suffix regions. During inference, a user provides the code before the cursor (prefix) and the code after the cursor (suffix), and the model generates the missing middle section. This is distinct from standard autoregressive generation, which can only continue from the end of a given prompt.
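An infilling request for the 7B/13B models is assembled from the prefix and suffix with the FIM sentinel tokens. A sketch of the prompt construction (the `<PRE>`/`<SUF>`/`<MID>` spellings follow common Code Llama tooling; exact whitespace handling is tokenizer-dependent):

```python
def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt: the model generates the
    code that belongs between prefix and suffix after the <MID> token."""
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

# Code before and after the cursor position
prefix = "def fib(n):\n    "
suffix = "\n    return fib(n - 1) + fib(n - 2)"
prompt = build_infill_prompt(prefix, suffix)
```

The model's completion after `<MID>` is the inferred middle section (here, presumably a base case for the recursion), which the editor splices back between the prefix and suffix.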
Applications of code infilling include cursor-position code completion in editors, generating docstrings for existing function bodies, and filling in missing code between a function signature and the surrounding context.
Code Llama was evaluated on several widely used coding benchmarks. The primary benchmarks are HumanEval (a set of 164 hand-written Python programming problems from OpenAI) and MBPP (Mostly Basic Python Problems, containing 974 crowd-sourced Python tasks from Google). Results are reported as pass@1, the percentage of problems solved correctly on the first attempt.
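pass@1 is the k=1 case of the unbiased pass@k estimator introduced with HumanEval: given n generated samples per problem of which c pass the unit tests, pass@k = 1 − C(n−c, k)/C(n, k). A direct implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without
    replacement from n generations passes, given that c of n pass."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 of 10 generations pass -> pass@1 = 0.5
score = pass_at_k(n=10, c=5, k=1)
```

For pass@1 this reduces to the fraction of passing samples, but the general formula gives an unbiased estimate of pass@k from n > k samples.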
| Model | Size | HumanEval (pass@1) |
|---|---|---|
| Llama 2 | 7B | 12.2% |
| Llama 2 | 13B | 20.1% |
| Llama 2 | 70B | 30.5% |
| StarCoder Base | 15.5B | 30.4% |
| code-cushman-001 (Codex) | 12B | 33.5% |
| Code Llama (Base) | 7B | 33.5% |
| Code Llama (Base) | 13B | 36.0% |
| Code Llama (Base) | 34B | 48.8% |
| Code Llama (Base) | 70B | 53.0% |
| Code Llama-Python | 7B | 38.4% |
| Code Llama-Python | 13B | 43.3% |
| Code Llama-Python | 34B | 53.7% |
| Code Llama-Python | 70B | 57.3% |
| Code Llama-Instruct | 7B | 34.8% |
| Code Llama-Instruct | 13B | 42.7% |
| Code Llama-Instruct | 34B | 41.5% |
| Code Llama-Instruct | 70B | 67.8% |
| GPT-3.5 (ChatGPT) | N/A | 48.1% |
| GPT-4 | N/A | 67.0% |
A notable finding is that Code Llama-Python 7B (38.4%) outperforms Llama 2 70B (30.5%) on HumanEval, demonstrating the effectiveness of domain-specific code training. At the largest scale, Code Llama-Instruct 70B (67.8%) slightly surpasses GPT-4 (67.0%) on the HumanEval benchmark in zero-shot evaluation.
| Model | Size | MBPP (pass@1) |
|---|---|---|
| Llama 2 | 70B | 45.4% |
| StarCoder Base | 15.5B | 43.6% |
| code-cushman-001 (Codex) | 12B | 45.9% |
| Code Llama (Base) | 7B | 41.4% |
| Code Llama (Base) | 13B | 47.0% |
| Code Llama (Base) | 34B | 55.0% |
| Code Llama (Base) | 70B | 62.4% |
| Code Llama-Python | 7B | 47.6% |
| Code Llama-Python | 13B | 49.0% |
| Code Llama-Python | 34B | 56.2% |
| Code Llama-Python | 70B | 65.6% |
| Code Llama-Instruct | 7B | 44.4% |
| Code Llama-Instruct | 13B | 49.4% |
| Code Llama-Instruct | 34B | 57.0% |
| Code Llama-Instruct | 70B | 62.2% |
| GPT-3.5 (ChatGPT) | N/A | 52.2% |
Code Llama was also evaluated on the MultiPL-E benchmark, which extends HumanEval to multiple programming languages. The Code Llama 70B model demonstrated strong performance across languages:
| Language | Code Llama 70B (pass@1) |
|---|---|
| Python | 52.8% |
| C++ | 51.9% |
| Java | 50.9% |
| PHP | 43.5% |
| TypeScript | 51.3% |
| C# | 42.1% |
| Bash | 31.5% |
These results show that while Code Llama performs strongest in Python (especially the Python variant), it maintains competitive performance across a variety of programming languages.
The following table summarizes how Code Llama's best results compare with other prominent code generation models at the time of its release:
| Model | Organization | Parameters | HumanEval (pass@1) | MBPP (pass@1) | Open Weights |
|---|---|---|---|---|---|
| Code Llama-Instruct 70B | Meta | 70B | 67.8% | 62.2% | Yes |
| GPT-4 | OpenAI | Undisclosed | 67.0% | N/A | No |
| GPT-3.5 (ChatGPT) | OpenAI | Undisclosed | 48.1% | 52.2% | No |
| Code Llama-Python 34B | Meta | 34B | 53.7% | 56.2% | Yes |
| StarCoder Base | BigCode / Hugging Face | 15.5B | 30.4% | 43.6% | Yes |
| code-cushman-001 (Codex) | OpenAI | 12B | 33.5% | 45.9% | No |
| Llama 2 | Meta | 70B | 30.5% | 45.4% | Yes |
Code Llama-Instruct 70B was the first open-weight model to match GPT-4's originally reported HumanEval score, a significant milestone for the open-source AI community.
Code Llama was released in two phases:
| Date | Release | Details |
|---|---|---|
| August 24, 2023 | Code Llama 7B, 13B, 34B | Initial release of all three variants (base, Python, Instruct) at 7B, 13B, and 34B sizes. Nine models total. |
| January 29, 2024 | Code Llama 70B | Release of the 70B parameter models in all three variants. Trained on 1 trillion tokens (double the smaller models). |
The 70B release was notable because it was trained on twice the number of tokens as the earlier models (1 trillion vs. 500 billion), although, like the 34B models, the 70B models do not support FIM infilling.
Code Llama is released under the same community license as Llama 2, which permits both research and commercial use. The key terms of the license include: free use for research and commercial purposes; a requirement that organizations whose products or services exceeded 700 million monthly active users at the time of release obtain a separate license from Meta; and compliance with Meta's Acceptable Use Policy.
It is worth noting that the Open Source Initiative has stated that the Llama 2 Community License does not meet the traditional definition of "open source" due to its commercial restrictions for very large companies and other usage limitations. Nonetheless, the license is substantially more permissive than those of comparable proprietary models like Codex or GPT-4.
Code Llama has seen broad adoption across the developer community and has been integrated into numerous tools and platforms.
One of Code Llama's key advantages is that developers can run the models locally without relying on cloud APIs. Several tools facilitate local deployment, including llama.cpp (which runs quantized versions of the models on consumer hardware), Ollama, and the Hugging Face transformers library.
Code Llama has been integrated into code editors through extensions and plugins such as Continue for Visual Studio Code and JetBrains IDEs, which can connect to locally or remotely hosted Code Llama models.
The open nature of Code Llama's weights has enabled the community to create numerous fine-tuned derivatives, such as Phind-CodeLlama-34B and WizardCoder-Python-34B, both of which surpassed the original models' HumanEval scores.
Code Llama models are available through various cloud inference platforms, including Amazon Web Services (via SageMaker and Bedrock), NVIDIA NIM, Hugging Face Inference Endpoints, Perplexity, Together AI, and Replicate, making them accessible to developers who prefer API-based access.
StarCoder is a 15.5B parameter code generation model developed by the BigCode project, a collaboration between Hugging Face and ServiceNow. StarCoder was trained on The Stack, a curated dataset of permissively licensed source code. While StarCoder achieved strong results for its time (30.4% on HumanEval), Code Llama surpassed it across all model sizes. Even Code Llama 7B (33.5% on HumanEval) outperforms StarCoder Base despite having fewer than half the parameters. StarCoder offers a native context window of 8,192 tokens, double Llama 2's 4,096, though Code Llama's long context fine-tuning extends its effective context far beyond this.
OpenAI's Codex (code-cushman-001, a 12B parameter model) was the model powering GitHub Copilot at launch. Codex scored 33.5% on HumanEval, matching Code Llama 7B. However, Codex was a proprietary model with no publicly available weights, while Code Llama provides comparable or superior performance in a fully downloadable, modifiable package. OpenAI deprecated the Codex API in March 2023, months before Code Llama's release.
At the time of Code Llama's initial release in August 2023, GPT-4 scored 67.0% on HumanEval, well above any Code Llama model. However, the January 2024 release of Code Llama 70B-Instruct closed this gap entirely, achieving 67.8% on HumanEval, a score slightly above GPT-4's. It is important to note that GPT-4 is a much larger, general-purpose model, while Code Llama is specialized for code. Also, subsequent GPT-4 updates and the release of models like GPT-4o and o1 have since raised the bar for proprietary models on coding benchmarks.
DeepSeek Coder, released in late 2023 by the Chinese AI lab DeepSeek, offered competitive performance with Code Llama across various benchmarks. DeepSeek Coder models ranged from 1.3B to 33B parameters and were trained from scratch on a 2-trillion-token code corpus. DeepSeek Coder 33B achieved around 56% on HumanEval, competitive with Code Llama-Python 34B's 53.7%.
Despite its strong performance, Code Llama has several limitations: like all code LLMs, it can generate plausible-looking but incorrect or insecure code; its knowledge is fixed at its training cutoff, leaving it unaware of newer libraries and APIs; its performance on lower-resource programming languages lags well behind Python (as the Bash results above illustrate); and generated code must still be reviewed and tested before use.
Code Llama represented a significant milestone in the democratization of AI-assisted coding. It was one of the first open-weight code models to match the performance of proprietary alternatives like GPT-4 on standard benchmarks. The release of model weights, combined with the relatively permissive community license, enabled a wide ecosystem of tools, fine-tuned models, and applications that would not have been possible with closed models.
The model's influence extends beyond its direct use. Its training methodology, particularly the multi-stage pipeline of code specialization, FIM training, and long context fine-tuning, has informed subsequent work on code LLMs. The self-instruct approach for generating coding instruction data without human annotation has been adopted and refined by other research groups.
Code Llama also demonstrated the viability of the "foundation model plus specialization" approach to building code models. Rather than training a code model from scratch, Meta showed that starting from a strong general-purpose language model and applying targeted code training could yield competitive results with lower total compute cost.