# StarCoder

> Source: https://aiwiki.ai/wiki/starcoder
> Updated: 2026-06-21
> Categories: AI Code Generation, Large Language Models, Open Source AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**StarCoder** is a family of open-access [large language models](/wiki/large_language_model) for code generation and code understanding, developed by the [BigCode](/wiki/bigcode) project, an open scientific collaboration led by [Hugging Face](/wiki/hugging_face) and [ServiceNow](/wiki/servicenow). The original StarCoder, released on May 4, 2023, is a 15.5 billion parameter model trained on over 80 programming languages from permissively licensed source code, and it scores 33.6% pass@1 on the [HumanEval](/wiki/humaneval) benchmark (40.8% when prompted).[1][13] Its successor, StarCoder2, launched on February 28, 2024, expanded coverage to more than 600 programming languages, introduced three model sizes (3B, 7B, and 15B parameters), and raised the HumanEval score to 46.3% with StarCoder2-15B, which matches or outperforms [Code Llama](/wiki/code_llama)-34B on several benchmarks despite having fewer than half the parameters.[2][14] Both generations are released under the BigCode OpenRAIL-M license, which permits commercial use while enforcing ethical usage restrictions.[10]

StarCoder is widely regarded as one of the most significant open-access contributions to AI-assisted programming, in large part because it demonstrated that a model trained exclusively on permissively licensed data could rival proprietary systems. The StarCoder paper frames the project's transparency goal directly: "the largest open-access LLM for code to date, surpassing the largest models such as PaLM, LaMDA, and LLaMA."[1]

## What is the BigCode project?

The BigCode project was officially launched in September 2022 as an open scientific collaboration between Hugging Face and ServiceNow Research. Its stated goal is the responsible development of large language models for code.[1] The project operates under open governance, with a steering committee jointly led by ServiceNow and Hugging Face, and has attracted more than 1,200 members from institutions and companies across 62 countries.[1]

BigCode distinguishes itself from proprietary alternatives like [Codex](/wiki/openai_codex) and [GitHub Copilot](/wiki/github_copilot) through its commitment to transparency, reproducibility, and ethical data practices. The organizers ensure that only files from repositories with permissive licenses go into the training datasets, and the project has established multiple working groups covering topics such as licensing, attribution of generated code, the handling of personally identifiable information (PII), and risks of malicious code.[1]

Before releasing StarCoder, the BigCode community produced SantaCoder in December 2022, a 1.1 billion parameter model trained on the Python, Java, and JavaScript subsets of The Stack (v1.1).[4] Despite its relatively small size, SantaCoder outperformed larger open-source multilingual code models available at the time, including InCoder-6.7B and CodeGen-Multi-2.7B, in both left-to-right generation and infilling tasks.[4] SantaCoder served as a proof of concept and technical stepping stone for the much larger StarCoder models that followed.[4]

## StarCoder (May 2023)

### Architecture

StarCoder uses a decoder-only [Transformer](/wiki/transformer) architecture with learned absolute positional embeddings. The model consists of 40 layers, a hidden dimension of 6,144, and 48 attention heads.[1] A defining architectural choice is the use of Multi-Query [Attention](/wiki/attention) (MQA), which shares key and value projections across all attention heads. This approach significantly reduces the memory footprint during inference and increases throughput, making the model more practical for deployment compared to standard multi-head attention.[1]
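To make the memory saving concrete, the following back-of-the-envelope sketch compares the per-token key/value cache of standard multi-head attention with that of MQA, using the layer, head, and hidden sizes quoted above. The 16-bit (2-byte) element size and the helper name `kv_cache_bytes_per_token` are assumptions made for illustration, and the estimate ignores weights and activations.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of key + value cache stored for each token in the context."""
    # Factor 2: one key vector and one value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


N_LAYERS = 40                  # from the architecture description
HIDDEN = 6144
N_HEADS = 48
HEAD_DIM = HIDDEN // N_HEADS   # 128 dimensions per head

# Multi-head attention: every head keeps its own keys and values.
mha = kv_cache_bytes_per_token(N_LAYERS, N_HEADS, HEAD_DIM)
# Multi-query attention: all heads share a single key/value head.
mqa = kv_cache_bytes_per_token(N_LAYERS, 1, HEAD_DIM)

print(f"MHA: {mha} bytes/token, MQA: {mqa} bytes/token, ratio: {mha // mqa}x")
```

Under these assumptions the cache shrinks by a factor equal to the number of attention heads, which is why the saving grows with long contexts and large batch sizes.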

The model supports a context window of 8,192 tokens, which was among the longest context lengths for code models at the time of release. It uses a byte-level [byte pair encoding](/wiki/bpe) (BPE) vocabulary of 49,152 tokens.[1]

### Training Data: The Stack v1

Both StarCoderBase and StarCoder were trained on data from The Stack (v1.2), a large-scale dataset of permissively licensed source code.[1] The Stack was first released on October 27, 2022, and contains over 6 terabytes of permissively licensed source code files covering 358 programming languages collected from public GitHub repositories.[1]

The BigCode team applied careful curation to the dataset. They filtered repositories based on license type, selecting only those with permissive licenses that impose minimal restrictions on copying, modifying, and redistributing the code.[1] The dataset also includes an opt-out mechanism that allows code owners to request the removal of their code from the training data. Near-deduplication and comment-to-code ratio filtering were applied as additional quality controls.[1]

### StarCoderBase and StarCoder

The first-generation StarCoder family consists of two variants:

| Model | Parameters | Training Data | Training Tokens | Context Length | Attention Type |
|---|---|---|---|---|---|
| StarCoderBase | 15.5B | The Stack v1.2 (80+ languages) | 1 trillion | 8,192 | Multi-Query Attention |
| StarCoder | 15.5B | StarCoderBase + Python fine-tuning | 1 trillion + 35B Python tokens | 8,192 | Multi-Query Attention |

**StarCoderBase** is the foundational model, pretrained on 1 trillion tokens sourced from over 80 programming languages in The Stack.[1] The pretraining was conducted over approximately 250,000 steps across 24 days, using 512 NVIDIA Tesla A100 GPUs. Pretraining alone is reported to have taken 320,256 GPU hours.[1]
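The token and step counts above imply a rough batch size per optimizer step. The short calculation below is a sketch that assumes every training sequence is packed to the full 8,192-token context quoted earlier; the variable names are illustrative only.

```python
TOTAL_TOKENS = 1_000_000_000_000  # 1 trillion pretraining tokens
STEPS = 250_000                   # approximate number of training steps
CONTEXT = 8_192                   # tokens per sequence (assumed fully packed)

tokens_per_step = TOTAL_TOKENS // STEPS
sequences_per_step = tokens_per_step / CONTEXT

print(f"about {tokens_per_step:,} tokens per step")
print(f"about {sequences_per_step:.0f} full-length sequences per step")
```

In other words, each step processes on the order of four million tokens.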

**StarCoder** was created by further fine-tuning StarCoderBase on 35 billion tokens of Python code. This additional training was performed for two more epochs on the same Python data from the pretraining set, adding approximately 11,208 GPU hours.[1] The result is a model that retains its multilingual code generation capabilities while excelling particularly at Python tasks.[1]

Both models were trained using the [Fill-in-the-Middle](/wiki/fill_in_the_middle) (FIM) objective, which allows the model to complete code not only at the end of a sequence but also in the middle.[1] This is accomplished through special sentinel tokens (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`) that indicate where the model should insert code, making the models suitable for code completion, insertion, and infilling tasks commonly needed in integrated development environments.[1]
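As an illustration of how an editor might use these sentinel tokens, the sketch below turns a source string and a cursor offset into an infilling prompt. The prefix-suffix-middle ordering is the arrangement commonly shown for these models and is treated here as an assumption; `build_fim_prompt` is a hypothetical helper, not part of any BigCode library.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(source: str, cursor: int) -> str:
    """Build an infilling prompt for the gap at character offset `cursor`."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor must lie inside the source text")
    prefix, suffix = source[:cursor], source[cursor:]
    # The model is expected to continue after <fim_middle> with the missing code.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


source = "def add(a, b):\n    return \n"
cursor = source.index("return ") + len("return ")  # cursor right after "return "
prompt = build_fim_prompt(source, cursor)
print(prompt)
```

The text the model generates after `<fim_middle>` would then be inserted at the cursor position.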

### Benchmark Results (StarCoder v1)

The following table summarizes the performance of StarCoderBase and StarCoder on key code generation benchmarks, measured by pass@1 (the probability that the first generated sample passes all unit tests):

| Model | Parameters | HumanEval (pass@1) | MBPP (pass@1) |
|---|---|---|---|
| StarCoderBase | 15.5B | 30.4% | 49.0% |
| StarCoder | 15.5B | 33.6% | 52.7% |
| StarCoder (prompted) | 15.5B | 40.8% | -- |

StarCoder also demonstrated strong multilingual performance on the [MultiPL-E](/wiki/multipl_e) benchmark, which translates HumanEval problems into 18 additional programming languages.[9] StarCoder's pass@1 scores on selected languages include:

| Language | StarCoder pass@1 | StarCoderBase pass@1 |
|---|---|---|
| Python | 33.6% | 30.4% |
| JavaScript | 30.8% | 31.7% |
| C++ | 31.6% | 30.6% |
| Java | 30.2% | 28.5% |
| Rust | 21.8% | 24.5% |
| Go | 17.6% | 21.5% |
| PHP | 26.1% | 26.8% |
| Scala | 27.6% | 28.8% |

At the time of release, StarCoder matched or outperformed OpenAI's code-cushman-001 model across multiple programming languages, making it one of the most capable open-access code generation models available.[1]

## StarCoder2 (February 2024)

### Overview

StarCoder2 represents the second generation of the StarCoder model family and was released on February 28, 2024.[2] This generation introduced three model sizes to serve different computational budgets, and it brought [NVIDIA](/wiki/nvidia) on board as a training partner alongside Hugging Face and ServiceNow. Each organization took responsibility for training one of the three model variants.[2][14]

### Architecture

StarCoder2 models use an updated architecture compared to the original StarCoder. Key changes include:

- **Grouped Query Attention (GQA):** Replacing the Multi-Query Attention of StarCoder v1, GQA provides a balance between the memory efficiency of MQA and the representational capacity of full multi-head attention. GQA groups query heads into clusters that share key-value pairs, reducing memory footprint and inference latency while maintaining model quality.[2]
- **Sliding Window Attention:** All StarCoder2 models use a sliding window attention mechanism with a window size of 4,096 tokens, combined with an expanded context window of 16,384 tokens. This approach enables efficient processing of longer code sequences.[2]
- **Fill-in-the-Middle:** Like its predecessor, StarCoder2 supports the FIM objective for code infilling tasks.[2]
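The sliding window attention described in the list above can be pictured as a banded causal mask: each position attends only to itself and a fixed number of preceding positions. The sketch below builds such a mask for a toy sequence. Whether the window counts the current token is a convention that varies between implementations, so the inclusive choice here is an assumption, and the tiny sizes stand in for the 4,096-token window.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        # Causal (j <= i) and within the last `window` positions, current one included.
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so the attention cost per token stays bounded as the sequence grows.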

### Model Variants

| Model | Parameters | Trained By | Training Framework | Languages | Training Tokens |
|---|---|---|---|---|---|
| StarCoder2-3B | 3 billion | ServiceNow | Fast LLM | 17 | 3+ trillion |
| StarCoder2-7B | 7 billion | Hugging Face | nanotron | 17 | 3.5+ trillion |
| StarCoder2-15B | 15 billion | NVIDIA | NeMo | 600+ | 4+ trillion |

The StarCoder2-15B model was trained on NVIDIA's Eos Supercomputer using 1,024 NVIDIA H100 GPUs over approximately 1 million training steps.[2] It supports more than 600 programming languages, while the smaller 3B and 7B variants were trained on 17 carefully selected high-resource programming languages.[2]

### Training Data: The Stack v2

StarCoder2 models are trained on The Stack v2, a significantly expanded dataset built in partnership with Software Heritage, a nonprofit initiative founded by Inria in partnership with UNESCO to collect, preserve, and share all publicly available source code.[2]

| Metric | The Stack v1 | The Stack v2 |
|---|---|---|
| Full Size | 6.4 TB | 67.5 TB |
| Deduplicated Size | 2.9 TB | 32.1 TB |
| Training Tokens | ~200B | ~900B |
| Programming Languages | 358 | 619 |

By training-token count, The Stack v2 is roughly 4x larger than The Stack v1 (about 900B versus about 200B tokens); by raw size it is roughly 10x larger (67.5 TB versus 6.4 TB).[2] Beyond raw source code, The Stack v2 includes additional high-quality data sources such as GitHub pull requests, Kaggle notebooks, and code documentation.[2] The dataset features improved language and license detection, better filtering heuristics, and repository-grouped training data that helps models learn contextual relationships between files in the same project.[2]

### Benchmark Results (StarCoder2)

StarCoder2 models improve substantially on the first generation at comparable or smaller sizes. The table below reports HumanEval and HumanEval+ results:

| Model | Parameters | HumanEval (pass@1) | HumanEval+ (pass@1) | Context Length |
|---|---|---|---|---|
| StarCoder2-3B | 3B | 31.7% | 27.4% | 16,384 |
| StarCoder2-7B | 7B | 35.4% | 29.9% | 16,384 |
| StarCoder2-15B | 15B | 46.3% | 37.8% | 16,384 |

Notably, StarCoder2-3B matches the performance of the original StarCoderBase-15B (30.4% on HumanEval), despite being roughly 5x smaller.[2] StarCoder2-15B significantly outperforms models of comparable size and matches or outperforms [Code Llama](/wiki/code_llama)-34B on multiple benchmarks, including surpassing it on both MBPP and MBPP+ and matching or exceeding it on 10 out of 18 programming languages in MultiPL-E.[2]

Additional benchmark results for StarCoder2-15B include:

| Benchmark | Metric | Score |
|---|---|---|
| DS-1000 | pass@1 | 33.8% |
| GSM8K (PAL) | accuracy | 65.1% |
| RepoBench v1.1 | edit-similarity | 74.08% |
| CruxEval-I | pass@1 | 48.1% |

## How does StarCoder compare with other code generation models?

The following table compares StarCoder models with other prominent code generation models on standard benchmarks. All scores represent pass@1 using greedy decoding on the [HumanEval](/wiki/humaneval) and [MBPP](/wiki/mbpp) Python code generation benchmarks for base (non-instruction-tuned) models:

| Model | Organization | Parameters | HumanEval (pass@1) | MBPP (pass@1) | Training Data | Release Date |
|---|---|---|---|---|---|---|
| Codex | [OpenAI](/wiki/openai) | 12B | 28.8% | -- | GitHub code (proprietary) | August 2021 |
| StarCoderBase | BigCode | 15.5B | 30.4% | 49.0% | The Stack v1 (1T tokens) | May 2023 |
| StarCoder | BigCode | 15.5B | 33.6% | 52.7% | The Stack v1 + Python FT | May 2023 |
| Code Llama | [Meta](/wiki/meta_ai) | 7B | 33.5% | 41.4% | Code-heavy dataset (500B tokens) | August 2023 |
| Code Llama | Meta | 13B | 36.0% | 47.0% | Code-heavy dataset (500B tokens) | August 2023 |
| Code Llama | Meta | 34B | 48.8% | 55.0% | Code-heavy dataset (500B tokens) | August 2023 |
| DeepSeek-Coder | [DeepSeek](/wiki/deepseek) | 6.7B | 47.6% | 60.6% | 87% code + 13% NL (2T tokens) | November 2023 |
| DeepSeek-Coder | DeepSeek | 33B | 56.1% | 66.0% | 87% code + 13% NL (2T tokens) | November 2023 |
| StarCoder2-3B | BigCode | 3B | 31.7% | -- | The Stack v2 (3T+ tokens) | February 2024 |
| StarCoder2-7B | BigCode | 7B | 35.4% | -- | The Stack v2 (3.5T+ tokens) | February 2024 |
| StarCoder2-15B | BigCode | 15B | 46.3% | -- | The Stack v2 (4T+ tokens) | February 2024 |

Several observations stand out from this comparison. StarCoder2-15B reaches 46.3% on HumanEval with 15 billion parameters, close to Code Llama-34B's 48.8% from a model with more than twice as many parameters.[2][6] DeepSeek-Coder is particularly strong for its size: its 6.7B model (47.6%) outscores every StarCoder model in the table as well as Code Llama-13B, and trails Code Llama-34B by about one point.[7] The progression from the original Codex (28.8%) to StarCoder2-15B (46.3%) over roughly two and a half years illustrates how quickly code generation models have advanced.[8]

## What is Fill-in-the-Middle (FIM)?

One of the most practically useful features of the StarCoder models is their support for the Fill-in-the-Middle (FIM) objective. Traditional [language models](/wiki/large_language_model) can only generate text from left to right, appending tokens to the end of a given prefix. FIM allows the model to generate code that fills in a gap between a prefix and a suffix, which is critical for real-world code editing workflows.[1]

During training, a fraction of the training sequences are randomly split into three parts: a prefix, a middle section, and a suffix. These parts are rearranged using special tokens so the model learns to predict the middle given the prefix and suffix.[1] At inference time, a developer can specify code before and after a cursor position, and the model will generate the appropriate code to bridge the gap.[1]
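The training-time rearrangement can be sketched as follows. This is a simplified, character-level illustration: real pipelines operate on tokens, and the 0.5 default for `fim_rate`, the uniform choice of cut points, and the function name are all assumptions made for the example.

```python
import random


def fim_transform(document: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Return `document` unchanged or rearranged into prefix-suffix-middle order."""
    if len(document) < 2 or rng.random() >= fim_rate:
        return document  # keep as an ordinary left-to-right example
    # Two distinct cut points split the text into prefix | middle | suffix.
    lo, hi = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:lo], document[lo:hi], document[hi:]
    # The middle moves to the end, so next-token prediction learns to infill it.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"


doc = "def square(x):\n    return x * x\n"
out = fim_transform(doc, random.Random(0), fim_rate=1.0)
print(out)
```

Because the transformed sequence is still trained with the usual next-token loss, no architectural change is needed to support infilling.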

This capability makes StarCoder models particularly well-suited for integration into code editors and IDEs, where developers frequently need completions at arbitrary positions rather than just at the end of a file. Both StarCoder v1 and StarCoder2 support FIM through sentinel tokens.[1][2]

## OctoPack: Instruction Tuning for Code

OctoPack, introduced in August 2023, is the BigCode project's approach to [instruction tuning](/wiki/fine_tuning) code language models.[3] The work leverages the natural structure of Git commits, which pair code changes with human-written commit messages, to create training data for instruction following.[3]

### CommitPack

The core dataset behind OctoPack is CommitPack, a 4-terabyte collection of Git commits spanning 350 programming languages.[3] CommitPack was compiled by extracting commit diffs and their associated commit messages from public repositories. A filtered subset called CommitPackFT was curated for higher quality instruction-response pairs.[3]
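The idea of treating a commit as an instruction-following example can be sketched like this: the commit message plays the role of the instruction, the code before the change is the input, and the code after the change is the target. The `Commit` class and the dictionary keys are hypothetical names chosen for illustration and do not reflect the actual CommitPack schema.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    message: str   # human-written commit message
    old_code: str  # file contents before the commit
    new_code: str  # file contents after the commit


def commit_to_example(commit: Commit) -> dict[str, str]:
    """Map one commit to an (instruction, input, output) training example."""
    return {
        "instruction": commit.message.strip(),
        "input": commit.old_code,
        "output": commit.new_code,
    }


commit = Commit(
    message="Fix off-by-one error in range\n",
    old_code="for i in range(1, n):\n    total += i\n",
    new_code="for i in range(1, n + 1):\n    total += i\n",
)
example = commit_to_example(commit)
print(example["instruction"])
```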

### OctoCoder and OctoGeeX

Two instruction-tuned models were produced from this work:

| Model | Base Model | Parameters | Fine-tuning Data | HumanEval (pass@1) |
|---|---|---|---|---|
| OctoCoder | StarCoder | 15.5B | CommitPackFT + OASST | 46.2% |
| OctoGeeX | CodeGeeX2 | 6B | CommitPackFT + OASST | -- |

OctoCoder achieved 46.2% pass@1 on HumanEval, which at the time represented state-of-the-art performance among models not trained on outputs from OpenAI models.[3] The OctoPack paper also introduced HumanEvalPack, an expanded benchmark that covers three coding tasks (code synthesis, code repair, and code explanation) across six programming languages (Python, JavaScript, Java, Go, C++, and Rust).[3]

### StarCoder2-15B-Instruct

For the StarCoder2 generation, an instruction-tuned variant called StarCoder2-15B-Instruct-v0.1 was released in April 2024. It was described by the BigCode team as "the first entirely self-aligned code LLM trained with a fully permissive and transparent pipeline."[2][15] The model was created through a self-alignment process (later formalized as SelfCodeAlign), where instruction-response pairs were generated by the StarCoder2-15B base model itself rather than relying on outputs from a separate teacher model.[2][15] StarCoder2-15B-Instruct achieved 72.6% pass@1 on HumanEval, a large jump over the base model's 46.3%, and edged out CodeLlama-70B-Instruct (72.0%) despite being roughly one fifth the size.[15] BigCode reported that on LiveCodeBench the self-aligned model outperformed the same base model trained on data distilled from [GPT-4](/wiki/gpt_4), arguing that a model "could learn more effectively from data within its own distribution than a shifted distribution from a teacher LLM."[15]

## Community Fine-Tunes and Derivatives

StarCoder's open-access licensing and strong base performance have made it a popular foundation for community fine-tuning efforts. Several notable derivative models have emerged from the StarCoder ecosystem:

### StarChat

StarChat (also known as StarChat-Beta) is a conversational coding assistant built by fine-tuning StarCoderBase on the Dolly and OpenAssistant datasets.[11] It transforms the base code completion model into one that can follow conversational instructions, answer coding questions, and provide explanations. StarChat demonstrated that relatively straightforward fine-tuning on dialogue data could produce a usable coding chatbot from the StarCoder foundation.

### WizardCoder

WizardCoder, developed by researchers at Microsoft and Hong Kong Baptist University, applies the Evol-Instruct method to StarCoder.[5] Evol-Instruct generates increasingly complex instruction-response pairs through iterative "evolution" of simpler prompts.[5] WizardCoder achieved 57.3% pass@1 on HumanEval, 23.7 percentage points above the 33.6% StarCoder score reported above.[5] This result was remarkable because it surpassed several closed-source models, including [Anthropic](/wiki/anthropic)'s [Claude](/wiki/claude) and Google's [Bard](/wiki/bard), on the HumanEval benchmark at the time.[5] The WizardCoder paper was published in June 2023 and later presented at ICLR 2024.[5]

### Other Derivatives

The broader community has produced numerous additional fine-tunes and adaptations of StarCoder models, including quantized versions for running on consumer hardware, domain-specific fine-tunes for particular programming languages or frameworks, and integrations with popular code editors. The availability of StarCoder on platforms like [Ollama](/wiki/ollama) and through [GGUF](/wiki/gguf) quantized formats has made it accessible for local deployment.

## Is StarCoder open source?

The StarCoder models use a specialized licensing framework developed by the BigCode project in collaboration with the [Responsible AI](/wiki/responsible_ai) Licenses (RAIL) initiative.[10] StarCoder is best described as open-access rather than strictly open source: the weights, training data, and training code are public, but the license attaches behavioral use restrictions that the Open Source Initiative's definition does not allow.[10]

### BigCode OpenRAIL-M License

Both StarCoder v1 and StarCoder2 are released under the BigCode OpenRAIL-M v1 license. "OpenRAIL-M" stands for Open Responsible AI License for Models.[10] This license is designed to balance open access with responsible use:

- **Open access:** The license permits royalty-free access, downstream use, redistribution, and modification of the model for both research and commercial purposes.[10]
- **Use restrictions:** The license includes a set of specific use restrictions that prohibit certain applications. For example, the model cannot be used for generating malware, and outputs must be disclosed as machine-generated when presented to users.[10]
- **Propagation of restrictions:** Any derivative models or applications built on StarCoder must maintain the same use restrictions, ensuring responsible use practices propagate through the ecosystem.[10]

Even with these behavioral restrictions, the BigCode OpenRAIL-M license is substantially more permissive than many proprietary model licenses and allows for broad commercial adoption.[10]

The license document itself is released under a CC-BY-4.0 license, allowing other organizations to adapt it for their own models.[10]

### Training Data Licensing

A distinguishing feature of the StarCoder project is its strict approach to training data licensing. The Stack and The Stack v2 are composed exclusively of source code from repositories with permissive licenses (such as MIT, Apache 2.0, and BSD).[1][2] This stands in contrast to some other code models that train on broader collections of code without license filtering. The BigCode project also provides opt-out tools that allow code authors to remove their code from the training dataset.[1]

The BigCode project maintains a full-text search tool (BigCode Search v2) that allows users to check whether specific code snippets appear in the training data. This transparency tool helps users identify potential attribution requirements and understand the provenance of generated code.[2]

## What is StarCoder used for?

### Code Completion

The primary use case for StarCoder models is code completion, where the model generates the continuation of a given code prefix.[1] Both StarCoder v1 and StarCoder2 support this capability across all languages in their respective training sets. The models can generate function implementations from signatures and docstrings, complete partial code blocks, and extend existing code with new functionality.

### Code Infilling

Through the FIM objective, StarCoder models can fill in missing code between a prefix and suffix.[1] This is particularly useful in IDE integrations where a developer positions their cursor in the middle of existing code and requests a completion. The model considers both the preceding and following context to generate appropriate code.

### Repository-Level Context

StarCoder2 models benefit from repository-grouped training data, meaning they have some understanding of cross-file dependencies and project structure.[2] The RepoBench v1.1 benchmark measures this capability, and StarCoder2-15B achieves an edit-similarity score of 74.08%, indicating meaningful ability to work with repository-level context.[2]

### Natural Language Understanding

While primarily trained on code, StarCoder models also incorporate natural language text from GitHub issues, commits, and documentation.[1] This gives them the ability to understand and generate natural language descriptions of code, respond to code-related questions when instruction-tuned, and process docstrings and comments as context for code generation.

## Evaluation Benchmarks

StarCoder models are evaluated on several established code generation benchmarks:

### HumanEval

[HumanEval](/wiki/humaneval) is a benchmark created by OpenAI consisting of 164 hand-crafted Python programming problems.[8] Each problem includes a function signature, a docstring, and an average of 7.7 unit tests.[8] The pass@1 metric measures the probability that a single generated code sample passes all unit tests. HumanEval has become the de facto standard for comparing code generation models.
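In practice pass@k is usually estimated from several samples per problem using the unbiased estimator introduced alongside HumanEval.[8] The sketch below implements that estimator with the standard library; the sample counts in the usage lines are made-up values for illustration, not results from any model.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results: 20 samples each, with 5, 0, and 20 passing.
results = [(20, 5), (20, 0), (20, 20)]
score = benchmark_pass_at_k(results, k=1)
print(f"pass@1 = {score:.3f}")
```

For k = 1 the estimator reduces to the fraction of passing samples per problem, averaged over the benchmark.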

### MBPP

The Mostly Basic Programming Problems ([MBPP](/wiki/mbpp)) benchmark contains 974 programming tasks designed to be solvable by entry-level programmers. Each problem provides a natural language description and three input/output test cases written as assert statements. MBPP differs from HumanEval in its formatting and the level of difficulty of the tasks.

### MultiPL-E

MultiPL-E extends HumanEval and MBPP by translating the Python problems into 18 additional programming languages, enabling multilingual evaluation of code models.[9] This benchmark is particularly important for StarCoder, which supports a wide range of programming languages. StarCoder v1 was evaluated on MultiPL-E and showed competitive performance across languages including JavaScript, C++, Java, Rust, Go, and others.[1]

### HumanEval+ and MBPP+

HumanEval+ and MBPP+ are stricter versions of the original benchmarks that include additional test cases to reduce the rate of false positives (solutions that pass the original tests but contain bugs). StarCoder2-15B scores 37.8% on HumanEval+, compared to 46.3% on the original HumanEval, illustrating how additional test cases catch edge case failures.[2]

## Deployment and Usage

StarCoder models are available through multiple channels:

- **Hugging Face Hub:** All StarCoder and StarCoder2 models are hosted on the [Hugging Face](/wiki/hugging_face) Hub with model cards, documentation, and code examples.[11][12]
- **Hugging Face Transformers:** The models are integrated into the Transformers library, allowing straightforward loading with `AutoModelForCausalLM` and `AutoTokenizer`.[12]
- **NVIDIA [TensorRT](/wiki/tensorrt)-LLM:** StarCoder2-15B can be optimized for inference using NVIDIA's TensorRT-LLM software.[14]
- **Ollama:** Quantized versions of StarCoder2 are available through Ollama for local deployment.
- **VSCode Extension:** A Visual Studio Code extension is available for using StarCoder2 as an in-editor code assistant.
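A minimal sketch of the Transformers route listed above might look like the following. It assumes a recent `transformers` release with PyTorch installed, and it assumes the smaller `bigcode/starcoder2-3b` checkpoint identifier on the Hub; the `complete` wrapper and its defaults are illustrative rather than an official recipe.

```python
DEFAULT_CHECKPOINT = "bigcode/starcoder2-3b"  # assumed Hub identifier


def complete(prompt: str,
             checkpoint: str = DEFAULT_CHECKPOINT,
             max_new_tokens: int = 64) -> str:
    """Generate a code continuation for `prompt` with a StarCoder-family model."""
    # Imported lazily so that defining this function needs no heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated code is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads the checkpoint on first use):
#   print(complete("def fibonacci(n):\n"))
```

Loading in full precision on CPU is slow for models of this size, so a real deployment would typically add a device and dtype choice or use one of the quantized routes.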

For the StarCoder2-15B model, memory requirements vary by precision:

| Precision | Memory Footprint |
|---|---|
| Full (bfloat16) | ~32 GB |
| 8-bit Quantized | ~17 GB |
| 4-bit Quantized | ~9 GB |

The availability of 4-bit quantized versions means that StarCoder2-15B can run on consumer-grade GPUs with 12 GB or more of VRAM, making advanced code generation accessible without enterprise hardware.
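The figures in the table are broadly consistent with a simple weights-only estimate of parameters multiplied by bytes per parameter. The sketch below uses the nominal 15 billion parameter count and ignores activations, the key/value cache, and quantization metadata, which is presumably why its results come out slightly lower than the table; treat it as a lower bound rather than a measurement.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 15e9  # nominal parameter count; an assumption for this estimate

estimates = {
    name: weight_memory_gb(N_PARAMS, bits)
    for name, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]
}
for name, gb in estimates.items():
    print(f"{name}: about {gb:.1f} GB for weights")
```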

## Impact and Significance

StarCoder has had a broad impact on the open-source AI ecosystem for code generation:

1. **Demonstrating open-access viability:** StarCoder proved that open-access code models trained on permissively licensed data could match or exceed the performance of proprietary models like OpenAI's code-cushman-001, challenging the assumption that competitive code generation required proprietary training data.[1]

2. **Responsible AI practices:** The BigCode project's approach to data licensing, opt-out mechanisms, and transparent governance set a precedent for responsible development of code generation models. The BigCode OpenRAIL-M license framework has influenced licensing practices across the broader AI community.[10]

3. **Foundation for research:** StarCoder models have been widely used as base models for academic and industry research, including work on instruction tuning (OctoPack, WizardCoder), code understanding, and specialized code generation for particular domains.[3][5]

4. **Scaling insights:** The StarCoder2 family provided valuable data points on how code model performance scales with training data size and model parameters. The finding that StarCoder2-3B matches StarCoderBase-15B highlights the importance of data quality and training methodology alongside raw model size.[2]

5. **Community building:** With more than 1,200 members, BigCode has fostered one of the largest open scientific collaborations in AI research, demonstrating a viable model for community-driven development of large AI systems.[1]

## Limitations

Despite their capabilities, StarCoder models have several known limitations:

- **Not instruction-tuned by default:** The base StarCoder and StarCoder2 models are designed for code completion, not for following natural language instructions. Instruction-tuned variants like OctoCoder or StarCoder2-15B-Instruct are needed for conversational or instruction-following use cases.[3]
- **No guarantee of correctness:** Generated code may contain bugs, security vulnerabilities, or inefficiencies. The models should not be used as the sole source of production code without human review.[1]
- **Training data memorization:** StarCoder models can occasionally reproduce code verbatim from their training data. The BigCode project provides a search index tool to help users identify when generated code matches training data and apply proper attribution.[1]
- **Language coverage imbalance:** While StarCoder2-15B covers over 600 programming languages, performance varies significantly across languages. High-resource languages like Python, JavaScript, and Java generally see much better performance than low-resource languages.[2]

## When were the StarCoder models released?

The following table summarizes the key milestones in the StarCoder project:

| Date | Milestone |
|---|---|
| September 2022 | BigCode project launched by Hugging Face and ServiceNow |
| October 2022 | The Stack v1 dataset released |
| December 2022 | SantaCoder (1.1B) released |
| May 4, 2023 | StarCoder and StarCoderBase (15.5B) released |
| May 2023 | "StarCoder: may the source be with you!" paper published (arXiv:2305.06161) |
| June 2023 | WizardCoder fine-tune achieves 57.3% on HumanEval |
| August 2023 | OctoPack: instruction tuning with CommitPack released |
| February 28, 2024 | StarCoder2 (3B, 7B, 15B) and The Stack v2 released |
| February 2024 | "StarCoder 2 and The Stack v2: The Next Generation" paper published (arXiv:2402.19173) |
| April 2024 | StarCoder2-15B-Instruct-v0.1 released (72.6% HumanEval) |
| 2024 | StarCoder paper published at COLM 2024 conference |

## See Also

- [Code Llama](/wiki/code_llama)
- [Codex](/wiki/openai_codex)
- [DeepSeek](/wiki/deepseek)
- [GitHub Copilot](/wiki/github_copilot)
- [Hugging Face](/wiki/hugging_face)
- [Fill-in-the-Middle](/wiki/fill_in_the_middle)
- [HumanEval](/wiki/humaneval)
- [MBPP](/wiki/mbpp)

## References

1. Li, R., Allal, L. B., Zi, Y., et al. (2023). "StarCoder: may the source be with you!" *arXiv preprint arXiv:2305.06161*. Published at COLM 2024.
2. Lozhkov, A., Li, R., Allal, L. B., et al. (2024). "StarCoder 2 and The Stack v2: The Next Generation." *arXiv preprint arXiv:2402.19173*.
3. Muennighoff, N., Liu, Q., Zebaze, A., et al. (2023). "OctoPack: Instruction Tuning Code Large Language Models." *arXiv preprint arXiv:2308.07124*. Published at ICLR 2024.
4. Allal, L. B., Li, R., Kocetkov, D., et al. (2023). "SantaCoder: don't reach for the stars!" *arXiv preprint arXiv:2301.03988*.
5. Luo, Z., Xu, C., Zhao, P., et al. (2023). "WizardCoder: Empowering Code Large Language Models with Evol-Instruct." *arXiv preprint arXiv:2306.08568*. Published at ICLR 2024.
6. Rozière, B., Gehring, J., Gloeckle, F., et al. (2023). "Code Llama: Open Foundation Models for Code." *arXiv preprint arXiv:2308.12950*.
7. Guo, D., Zhu, Q., Yang, D., et al. (2024). "DeepSeek-Coder: When the Large Language Model Meets Programming." *arXiv preprint arXiv:2401.14196*.
8. Chen, M., Tworek, J., Jun, H., et al. (2021). "Evaluating Large Language Models Trained on Code." *arXiv preprint arXiv:2107.03374*.
9. Cassano, F., Gouwar, J., Nguyen, D., et al. (2023). "MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation." *IEEE Transactions on Software Engineering*.
10. BigCode Project. "BigCode OpenRAIL-M License." *bigcode-project.org*. Accessed March 2026.
11. Hugging Face. "StarCoder Model Card." *huggingface.co/bigcode/starcoder*. Accessed March 2026.
12. Hugging Face. "StarCoder2-15B Model Card." *huggingface.co/bigcode/starcoder2-15b*. Accessed March 2026.
13. ServiceNow Newsroom. "ServiceNow and Hugging Face release StarCoder." May 4, 2023.
14. NVIDIA Newsroom. "ServiceNow, Hugging Face, and NVIDIA Release New Open-Access LLMs." February 28, 2024.
15. Wei, Y., Cassano, F., Liu, J., et al. (2024). "StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation." *Hugging Face Blog, huggingface.co/blog/sc2-instruct*. April 2024. (SelfCodeAlign, arXiv:2410.24198.)
