StarCoder is a family of open-access large language models designed specifically for code generation and code understanding tasks. Developed by the BigCode project, a collaborative initiative led by Hugging Face and ServiceNow, the StarCoder models represent one of the most significant open-source contributions to the field of AI-assisted programming. The original StarCoder, released on May 4, 2023, featured 15.5 billion parameters and was trained on over 80 programming languages. Its successor, StarCoder2, launched on February 28, 2024, expanded support to more than 600 programming languages and introduced three model sizes (3B, 7B, and 15B parameters) trained on a substantially larger dataset. Both model generations are released under responsible AI licenses that permit commercial use while enforcing ethical usage restrictions.
The BigCode project was officially launched in September 2022 as an open scientific collaboration between Hugging Face and ServiceNow Research. Its stated goal is the responsible development of large language models for code. The project operates under open governance, with a steering committee jointly led by ServiceNow and Hugging Face, and has attracted more than 1,200 members from institutions and companies across 62 countries.
BigCode distinguishes itself from proprietary alternatives like Codex and GitHub Copilot through its commitment to transparency, reproducibility, and ethical data practices. The organizers ensure that only files from repositories with permissive licenses go into the training datasets, and the project has established multiple working groups covering topics such as licensing, attribution of generated code, the handling of personally identifiable information (PII), and risks of malicious code.
Before releasing StarCoder, the BigCode community produced SantaCoder in December 2022, a 1.1 billion parameter model trained on the Python, Java, and JavaScript subsets of The Stack (v1.1). Despite its relatively small size, SantaCoder outperformed larger open-source multilingual code models available at the time, including InCoder-6.7B and CodeGen-Multi-2.7B, in both left-to-right generation and infilling tasks. SantaCoder served as a proof of concept and technical stepping stone for the much larger StarCoder models that followed.
StarCoder uses a decoder-only Transformer architecture with learned absolute positional embeddings. The model consists of 40 layers, a hidden dimension of 6,144, and 48 attention heads. A defining architectural choice is the use of Multi-Query Attention (MQA), which shares key and value projections across all attention heads. This approach significantly reduces the memory footprint during inference and increases throughput, making the model more practical for deployment compared to standard multi-head attention.
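The memory saving from MQA comes from collapsing the per-head key and value projections into a single shared pair, shrinking the KV parameters (and the KV cache) by a factor of the head count. A minimal NumPy sketch of the idea (illustrative only, with hypothetical names; not the production implementation):

```python
import numpy as np

def multi_query_attention(x, wq, wk, wv, n_heads):
    """Toy Multi-Query Attention: every query head attends over a
    single shared key/value projection."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q = x @ wq                      # (seq, d_model), split into heads below
    k = x @ wk                      # (seq, d_head)  -- ONE shared key head
    v = x @ wv                      # (seq, d_head)  -- ONE shared value head
    outs = []
    for h in range(n_heads):
        qh = q[:, h * d_head:(h + 1) * d_head]     # (seq, d_head)
        scores = qh @ k.T / np.sqrt(d_head)        # every head reuses the same k
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outs.append(weights @ v)                   # ...and the same v
    return np.concatenate(outs, axis=-1)           # (seq, d_model)

rng = np.random.default_rng(0)
seq, d_model, n_heads = 4, 16, 4
x = rng.normal(size=(seq, d_model))
wq = rng.normal(size=(d_model, d_model))
wk = rng.normal(size=(d_model, d_model // n_heads))  # KV weights are n_heads x smaller
wv = rng.normal(size=(d_model, d_model // n_heads))
out = multi_query_attention(x, wq, wk, wv, n_heads)
print(out.shape)  # (4, 16)
```

With standard multi-head attention, `wk` and `wv` would each be `(d_model, d_model)`; here they are `(d_model, d_head)`, which is exactly where the inference-time memory saving comes from.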
The model supports a context window of 8,192 tokens, which was among the longest context lengths for code models at the time of release. It uses a byte-level byte pair encoding (BPE) vocabulary of 49,152 tokens.
Both StarCoderBase and StarCoder were trained on data from The Stack (v1.2), a large-scale dataset of permissively licensed source code. The Stack was first released on October 27, 2022, and contains over 6 terabytes of permissively licensed source code files covering 358 programming languages collected from public GitHub repositories.
The BigCode team applied careful curation to the dataset. They filtered repositories based on license type, selecting only those with permissive licenses that impose minimal restrictions on copying, modifying, and redistributing the code. The dataset also includes an opt-out mechanism that allows code owners to request the removal of their code from the training data. Near-deduplication and comment-to-code ratio filtering were applied as additional quality controls.
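The actual pipeline performed MinHash-based near-deduplication at scale; the underlying idea can be sketched with exact Jaccard similarity over character shingles (a simplified stand-in with hypothetical helper names, not the BigCode implementation):

```python
def shingles(code, n=5):
    """Character n-gram set used for approximate similarity."""
    text = " ".join(code.split())           # normalize whitespace first
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def near_duplicates(files, threshold=0.7):
    """Return index pairs of files whose shingle sets overlap heavily."""
    sets = [shingles(f) for f in files]
    return [(i, j)
            for i in range(len(files))
            for j in range(i + 1, len(files))
            if jaccard(sets[i], sets[j]) >= threshold]

files = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b  # sum\n",
    "class Stack:\n    pass\n",
]
print(near_duplicates(files))  # [(0, 1)]
```

Exact pairwise Jaccard is quadratic in the number of files; MinHash with locality-sensitive hashing approximates the same comparison in roughly linear time, which is why it is used on terabyte-scale corpora.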
The StarCoder model family from the first generation consists of two variants:
| Model | Parameters | Training Data | Training Tokens | Context Length | Attention Type |
|---|---|---|---|---|---|
| StarCoderBase | 15.5B | The Stack v1.2 (80+ languages) | 1 trillion | 8,192 | Multi-Query Attention |
| StarCoder | 15.5B | StarCoderBase + Python fine-tuning | 1 trillion + 35B Python tokens | 8,192 | Multi-Query Attention |
StarCoderBase is the foundational model, pretrained on 1 trillion tokens sourced from over 80 programming languages in The Stack. The pretraining was conducted over approximately 250,000 steps across 24 days, using 512 NVIDIA A100 GPUs. This amounted to roughly 320,256 GPU hours for pretraining alone.
StarCoder was created by further fine-tuning StarCoderBase on 35 billion tokens of Python code. This additional training was performed for two more epochs on the same Python data from the pretraining set, adding approximately 11,208 GPU hours. The result is a model that retains its multilingual code generation capabilities while excelling particularly at Python tasks.
Both models were trained using the Fill-in-the-Middle (FIM) objective, which allows the model to complete code not only at the end of a sequence but also in the middle. This is accomplished through special sentinel tokens (<fim_prefix>, <fim_suffix>, <fim_middle>) that indicate where the model should insert code, making the models suitable for code completion, insertion, and infilling tasks commonly needed in integrated development environments.
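Concretely, an infilling request is assembled by placing the code before and after the gap into prefix-suffix-middle (PSM) order with the sentinel tokens; the model then generates the missing middle after the final sentinel. A sketch of the prompt format (the completion shown in the comment is hypothetical):

```python
def build_fim_prompt(prefix, suffix):
    """Arrange code around the cursor into the prefix-suffix-middle
    (PSM) order used for StarCoder-style FIM prompting."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prefix = "def factorial(n):\n    if n == 0:\n        return 1\n    "
suffix = "\n\nprint(factorial(5))"
prompt = build_fim_prompt(prefix, suffix)
# The model's continuation after <fim_middle> is the infilled code,
# e.g. "return n * factorial(n - 1)".
print(prompt.endswith("<fim_middle>"))  # True
```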
The following table summarizes the performance of StarCoderBase and StarCoder on key code generation benchmarks, measured by pass@1 (the probability that the first generated sample passes all unit tests):
| Model | Parameters | HumanEval (pass@1) | MBPP (pass@1) |
|---|---|---|---|
| StarCoderBase | 15.5B | 30.4% | 49.0% |
| StarCoder | 15.5B | 33.6% | 52.7% |
| StarCoder (prompted) | 15.5B | 40.8% | -- |
StarCoder also demonstrated strong multilingual performance on the MultiPL-E benchmark, which translates HumanEval problems into 18 additional programming languages. StarCoder's pass@1 scores on selected languages include:
| Language | StarCoder pass@1 | StarCoderBase pass@1 |
|---|---|---|
| Python | 33.6% | 30.3% |
| JavaScript | 30.8% | 31.7% |
| C++ | 31.6% | 30.6% |
| Java | 30.2% | 28.5% |
| Rust | 21.8% | 24.5% |
| Go | 17.6% | 21.5% |
| PHP | 26.1% | 26.8% |
| Scala | 27.6% | 28.8% |
At the time of release, StarCoder matched or outperformed OpenAI's code-cushman-001 model across multiple programming languages, making it one of the most capable open-access code generation models available.
StarCoder2 represents the second generation of the StarCoder model family and was released on February 28, 2024. This generation introduced three model sizes to serve different computational budgets, and it brought NVIDIA on board as a training partner alongside Hugging Face and ServiceNow. Each organization took responsibility for training one of the three model variants.
StarCoder2 models use an updated architecture compared to the original StarCoder, including a context window doubled to 16,384 tokens. The three variants and their training setups are summarized below:
| Model | Parameters | Trained By | Training Framework | Languages | Training Tokens |
|---|---|---|---|---|---|
| StarCoder2-3B | 3 billion | ServiceNow | Fast LLM | 17 | 3+ trillion |
| StarCoder2-7B | 7 billion | Hugging Face | nanotron | 17 | 3.5+ trillion |
| StarCoder2-15B | 15 billion | NVIDIA | NeMo | 600+ | 4+ trillion |
The StarCoder2-15B model was trained on NVIDIA's Eos Supercomputer using 1,024 NVIDIA H100 GPUs over approximately 1 million training steps. It supports more than 600 programming languages, while the smaller 3B and 7B variants were trained on 17 carefully selected high-resource programming languages.
StarCoder2 models are trained on The Stack v2, a significantly expanded dataset built in partnership with Software Heritage, a nonprofit initiative founded by Inria in partnership with UNESCO to collect, preserve, and share all publicly available source code.
| Metric | The Stack v1 | The Stack v2 |
|---|---|---|
| Full Size | 6.4 TB | 67.5 TB |
| Deduplicated Size | 2.9 TB | 32.1 TB |
| Training Tokens | ~200B | ~900B |
| Programming Languages | 358 | 619 |
The Stack v2 is roughly 4x larger than The Stack v1 in terms of training data. Beyond raw source code, The Stack v2 includes additional high-quality data sources such as GitHub pull requests, Kaggle notebooks, and code documentation. The dataset features improved language and license detection, better filtering heuristics, and repository-grouped training data that helps models learn contextual relationships between files in the same project.
StarCoder2 models show substantial improvements over the first generation across all benchmarks:
| Model | Parameters | HumanEval (pass@1) | HumanEval+ (pass@1) | Context Length |
|---|---|---|---|---|
| StarCoder2-3B | 3B | 31.7% | 27.4% | 16,384 |
| StarCoder2-7B | 7B | 35.4% | 29.9% | 16,384 |
| StarCoder2-15B | 15B | 46.3% | 37.8% | 16,384 |
Notably, StarCoder2-3B matches the performance of the original StarCoderBase-15B (30.4% on HumanEval), despite being roughly 5x smaller. StarCoder2-15B significantly outperforms models of comparable size and matches or outperforms Code Llama-34B on multiple benchmarks, including surpassing it on both MBPP and MBPP+ and matching or exceeding it on 10 out of 18 programming languages in MultiPL-E.
Additional benchmark results for StarCoder2-15B include:
| Benchmark | Metric | Score |
|---|---|---|
| DS-1000 | pass@1 | 33.8% |
| GSM8K (PAL) | accuracy | 65.1% |
| RepoBench v1.1 | edit-similarity | 74.08% |
| CruxEval-I | pass@1 | 48.1% |
The following table compares StarCoder models with other prominent code generation models on standard benchmarks. All scores represent pass@1 using greedy decoding on the HumanEval and MBPP Python code generation benchmarks for base (non-instruction-tuned) models:
| Model | Organization | Parameters | HumanEval (pass@1) | MBPP (pass@1) | Training Data | Release Date |
|---|---|---|---|---|---|---|
| Codex | OpenAI | 12B | 28.8% | -- | GitHub code (proprietary) | August 2021 |
| StarCoderBase | BigCode | 15.5B | 30.4% | 49.0% | The Stack v1 (1T tokens) | May 2023 |
| StarCoder | BigCode | 15.5B | 33.6% | 52.7% | The Stack v1 + Python FT | May 2023 |
| Code Llama | Meta | 7B | 33.5% | 41.4% | Code-heavy dataset (500B tokens) | August 2023 |
| Code Llama | Meta | 13B | 36.0% | 47.0% | Code-heavy dataset (500B tokens) | August 2023 |
| Code Llama | Meta | 34B | 48.8% | 55.0% | Code-heavy dataset (500B tokens) | August 2023 |
| DeepSeek-Coder | DeepSeek | 6.7B | 47.6% | 60.6% | 87% code + 13% NL (2T tokens) | November 2023 |
| DeepSeek-Coder | DeepSeek | 33B | 56.1% | 66.0% | 87% code + 13% NL (2T tokens) | November 2023 |
| StarCoder2-3B | BigCode | 3B | 31.7% | -- | The Stack v2 (3T+ tokens) | February 2024 |
| StarCoder2-7B | BigCode | 7B | 35.4% | -- | The Stack v2 (3.5T+ tokens) | February 2024 |
| StarCoder2-15B | BigCode | 15B | 46.3% | -- | The Stack v2 (4T+ tokens) | February 2024 |
Several observations stand out from this comparison. StarCoder2-15B closes the gap between open-access code models and proprietary systems, achieving 46.3% on HumanEval with only 15 billion parameters compared to Code Llama-34B's 48.8% with more than double the parameters. DeepSeek-Coder demonstrates particularly strong performance for its size, with its 6.7B model outperforming much larger alternatives. The progression from the original Codex (28.8%) to StarCoder2-15B (46.3%) over roughly two and a half years illustrates the rapid advancement in open code generation models.
One of the most practically useful features of the StarCoder models is their support for the Fill-in-the-Middle (FIM) objective. Traditional language models can only generate text from left to right, appending tokens to the end of a given prefix. FIM allows the model to generate code that fills in a gap between a prefix and a suffix, which is critical for real-world code editing workflows.
During training, a fraction of the training sequences are randomly split into three parts: a prefix, a middle section, and a suffix. These parts are rearranged using special tokens so the model learns to predict the middle given the prefix and suffix. At inference time, a developer can specify code before and after a cursor position, and the model will generate the appropriate code to bridge the gap.
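The rearrangement can be sketched as follows (a simplified token-level illustration; real pipelines operate on token ids and also use a suffix-prefix-middle variant):

```python
import random

def fim_transform(tokens, rng, fim_rate=0.5):
    """With probability fim_rate, split a sequence at two random cut
    points and emit it in prefix-suffix-middle order, so the model is
    trained to predict the middle given the prefix and suffix."""
    if rng.random() > fim_rate:
        return list(tokens)                 # keep ordinary left-to-right order
    a, b = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:a], tokens[a:b], tokens[b:]
    return (["<fim_prefix>"] + prefix +
            ["<fim_suffix>"] + suffix +
            ["<fim_middle>"] + middle)      # training target: the middle

seq = list("abcdefgh")
out = fim_transform(seq, random.Random(1), fim_rate=1.0)
print(out[0])  # <fim_prefix>
```

Because the sentinel tokens mark each segment, the original sequence can always be reconstructed as prefix + middle + suffix, so no training data is lost by the transformation.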
This capability makes StarCoder models particularly well-suited for integration into code editors and IDEs, where developers frequently need completions at arbitrary positions rather than just at the end of a file. Both StarCoder v1 and StarCoder2 support FIM through sentinel tokens.
OctoPack, introduced in August 2023, is the BigCode project's approach to instruction tuning code language models. The work leverages the natural structure of Git commits, which pair code changes with human-written commit messages, to create training data for instruction following.
The core dataset behind OctoPack is CommitPack, a 4-terabyte collection of Git commits spanning 350 programming languages. CommitPack was compiled by extracting commit diffs and their associated commit messages from public repositories. A filtered subset called CommitPackFT was curated for higher quality instruction-response pairs.
Two instruction-tuned models were produced from this work:
| Model | Base Model | Parameters | Fine-tuning Data | HumanEval (pass@1) |
|---|---|---|---|---|
| OctoCoder | StarCoder | 15.5B | CommitPackFT + OASST | 46.2% |
| OctoGeeX | CodeGeeX2 | 6B | CommitPackFT + OASST | -- |
OctoCoder achieved 46.2% pass@1 on HumanEval, which at the time represented state-of-the-art performance among models not trained on outputs from OpenAI models. The OctoPack paper also introduced HumanEvalPack, an expanded benchmark that covers three coding tasks (code synthesis, code repair, and code explanation) across six programming languages (Python, JavaScript, Java, Go, C++, and Rust).
For the StarCoder2 generation, an instruction-tuned variant called StarCoder2-15B-Instruct-v0.1 was released. This model was created through a novel self-alignment process, where instruction-response pairs were generated by the StarCoder2-15B base model itself rather than relying on outputs from a separate teacher model. StarCoder2-15B-Instruct achieved 72.6% pass@1 on HumanEval, a significant jump over the base model's 46.3%.
StarCoder's open-access licensing and strong base performance have made it a popular foundation for community fine-tuning efforts. Several notable derivative models have emerged from the StarCoder ecosystem:
StarChat is a conversational coding assistant built by fine-tuning StarCoderBase on dialogue data; the initial StarChat-Alpha release used the OpenAssistant and Dolly datasets. It transforms the base code completion model into one that can follow conversational instructions, answer coding questions, and provide explanations. StarChat demonstrated that relatively straightforward fine-tuning on dialogue data could produce a usable coding chatbot from the StarCoder foundation.
WizardCoder, developed by researchers at Microsoft and Hong Kong Baptist University, applies the Evol-Instruct method to StarCoder. Evol-Instruct generates increasingly complex instruction-response pairs through iterative "evolution" of simpler prompts. WizardCoder achieved 57.3% pass@1 on HumanEval, an improvement of more than 20 percentage points over the base StarCoder model. This result was remarkable because it surpassed several closed-source models, including Anthropic's Claude and Google's Bard, on the HumanEval benchmark at the time. The WizardCoder paper was published in June 2023 and later presented at ICLR 2024.
The broader community has produced numerous additional fine-tunes and adaptations of StarCoder models, including quantized versions for running on consumer hardware, domain-specific fine-tunes for particular programming languages or frameworks, and integrations with popular code editors. The availability of StarCoder on platforms like Ollama and through GGUF quantized formats has made it accessible for local deployment.
The StarCoder models use a specialized licensing framework developed by the BigCode project in collaboration with the Responsible AI Licenses (RAIL) initiative.
Both StarCoder v1 and StarCoder2 are released under the BigCode OpenRAIL-M v1 license. "OpenRAIL-M" stands for Open Responsible AI License for Models. This license is designed to balance open access with responsible use.
The BigCode OpenRAIL-M license is not considered an "open source" license under the Open Source Initiative's definition because it includes behavioral restrictions. However, it is substantially more permissive than many proprietary model licenses and allows for broad commercial adoption.
The license document itself is released under a CC-BY-4.0 license, allowing other organizations to adapt it for their own models.
A distinguishing feature of the StarCoder project is its strict approach to training data licensing. The Stack and The Stack v2 are composed exclusively of source code from repositories with permissive licenses (such as MIT, Apache 2.0, and BSD). This stands in contrast to some other code models that train on broader collections of code without license filtering. The BigCode project also provides opt-out tools that allow code authors to remove their code from the training dataset.
The BigCode project maintains a full-text search tool (BigCode Search v2) that allows users to check whether specific code snippets appear in the training data. This transparency tool helps users identify potential attribution requirements and understand the provenance of generated code.
The primary use case for StarCoder models is code completion, where the model generates the continuation of a given code prefix. Both StarCoder v1 and StarCoder2 support this capability across all languages in their respective training sets. The models can generate function implementations from signatures and docstrings, complete partial code blocks, and extend existing code with new functionality.
Through the FIM objective, StarCoder models can fill in missing code between a prefix and suffix. This is particularly useful in IDE integrations where a developer positions their cursor in the middle of existing code and requests a completion. The model considers both the preceding and following context to generate appropriate code.
StarCoder2 models benefit from repository-grouped training data, meaning they have some understanding of cross-file dependencies and project structure. The RepoBench v1.1 benchmark measures this capability, and StarCoder2-15B achieves an edit-similarity score of 74.08%, indicating meaningful ability to work with repository-level context.
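Edit similarity here is a normalized Levenshtein score between the predicted and reference code. A minimal sketch of the metric (the benchmark's own implementation may differ in normalization details):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def edit_similarity(pred, ref):
    """1 - normalized edit distance, expressed as a percentage."""
    if not pred and not ref:
        return 100.0
    return 100.0 * (1 - levenshtein(pred, ref) / max(len(pred), len(ref)))

print(f"{edit_similarity('return x + 1', 'return x + 2'):.1f}")  # 91.7
```

A score of 74.08% therefore means the model's repository-level completions are, on average, one quarter of an edit away (per character, normalized) from the ground-truth lines.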
While primarily trained on code, StarCoder models also incorporate natural language text from GitHub issues, commits, and documentation. This gives them the ability to understand and generate natural language descriptions of code, respond to code-related questions when instruction-tuned, and process docstrings and comments as context for code generation.
StarCoder models are evaluated on several established code generation benchmarks:
HumanEval is a benchmark created by OpenAI consisting of 164 hand-crafted Python programming problems. Each problem includes a function signature, a docstring, and an average of 7.7 unit tests. The pass@1 metric measures the probability that a single generated code sample passes all unit tests. HumanEval has become the de facto standard for comparing code generation models.
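pass@1, and pass@k generally, is usually computed with the unbiased estimator from the HumanEval paper: draw n samples per problem, count the c that pass, and estimate the probability that at least one of k drawn samples is correct:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the HumanEval (Codex) paper:
    1 - C(n-c, k) / C(n, k), computed in product form for stability."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

print(pass_at_k(n=200, c=61, k=1))  # ~0.305, i.e. pass@1 reduces to c/n
```

For k = 1 the formula collapses to the empirical pass rate c/n; for larger k it corrects for the bias of simply checking whether any of k samples passed.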
The Mostly Basic Programming Problems (MBPP) benchmark contains 974 programming tasks designed to be solvable by entry-level programmers. Each problem provides a natural language description and three input/output test cases written as assert statements. MBPP differs from HumanEval in its formatting and the level of difficulty of the tasks.
MultiPL-E extends HumanEval and MBPP by translating the Python problems into 18 additional programming languages, enabling multilingual evaluation of code models. This benchmark is particularly important for StarCoder, which supports a wide range of programming languages. StarCoder v1 was evaluated on MultiPL-E and showed competitive performance across languages including JavaScript, C++, Java, Rust, Go, and others.
HumanEval+ and MBPP+ are stricter versions of the original benchmarks that include additional test cases to reduce the rate of false positives (solutions that pass the original tests but contain bugs). StarCoder2-15B scores 37.8% on HumanEval+, compared to 46.3% on the original HumanEval, illustrating how additional test cases catch edge case failures.
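The gap comes from solutions that satisfy a sparse test suite while still being wrong. A toy illustration of such a false positive (a hypothetical problem, not an actual HumanEval task):

```python
def median(xs):
    """Buggy median: correct for odd-length lists, wrong for even."""
    return sorted(xs)[len(xs) // 2]

# Sparse, HumanEval-style tests (odd lengths only): all pass.
assert median([3, 1, 2]) == 2
assert median([5]) == 5

# A HumanEval+-style extra test exposes the even-length bug:
try:
    assert median([1, 2, 3, 4]) == 2.5
    print("passed")
except AssertionError:
    print("caught a false positive")  # this branch runs: median returns 3
```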
StarCoder models are available through multiple channels, most directly on the Hugging Face Hub, where they can be loaded with the Transformers library's AutoModelForCausalLM and AutoTokenizer classes. For the StarCoder2-15B model, memory requirements vary by precision:
| Precision | Memory Footprint |
|---|---|
| Full (bfloat16) | ~32 GB |
| 8-bit Quantized | ~17 GB |
| 4-bit Quantized | ~9 GB |
The availability of 4-bit quantized versions means that StarCoder2-15B can run on consumer-grade GPUs with 12 GB or more of VRAM, making advanced code generation accessible without enterprise hardware.
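These footprints follow a simple parameters × bytes-per-parameter rule of thumb. A weights-only sketch (taking roughly 16 billion parameters for StarCoder2-15B; the table's figures are somewhat higher because they include runtime overhead such as activations, the KV cache, and quantization metadata):

```python
def model_memory_gb(n_params, bits_per_param):
    """Weights-only memory estimate in decimal gigabytes.
    Treat this as a lower bound on real inference memory."""
    return n_params * bits_per_param / 8 / 1e9

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{model_memory_gb(16e9, bits):.0f} GB weights")
```

The loop prints roughly 32, 16, and 8 GB respectively, consistent with the table once a few gigabytes of overhead are added at each precision.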
StarCoder has had a broad impact on the open-source AI ecosystem for code generation:
Demonstrating open-access viability: StarCoder proved that open-access code models trained on permissively licensed data could match or exceed the performance of proprietary models like OpenAI's code-cushman-001, challenging the assumption that competitive code generation required proprietary training data.
Responsible AI practices: The BigCode project's approach to data licensing, opt-out mechanisms, and transparent governance set a precedent for responsible development of code generation models. The BigCode OpenRAIL-M license framework has influenced licensing practices across the broader AI community.
Foundation for research: StarCoder models have been widely used as base models for academic and industry research, including work on instruction tuning (OctoPack, WizardCoder), code understanding, and specialized code generation for particular domains.
Scaling insights: The StarCoder2 family provided valuable data points on how code model performance scales with training data size and model parameters. The finding that StarCoder2-3B matches StarCoderBase-15B highlights the importance of data quality and training methodology alongside raw model size.
Community building: With over 1,200 contributors, BigCode has fostered one of the largest open scientific collaborations in AI research, demonstrating a viable model for community-driven development of large AI systems.
Despite their capabilities, StarCoder models have known limitations: like all code language models, they can generate code that is incorrect, inefficient, or insecure; the base models do not follow natural-language instructions without fine-tuning; and outputs may occasionally reproduce training data closely enough to carry attribution requirements.
The following table summarizes the key milestones in the StarCoder project:
| Date | Milestone |
|---|---|
| September 2022 | BigCode project launched by Hugging Face and ServiceNow |
| October 2022 | The Stack v1 dataset released |
| December 2022 | SantaCoder (1.1B) released |
| May 4, 2023 | StarCoder and StarCoderBase (15.5B) released |
| May 2023 | "StarCoder: may the source be with you!" paper published (arXiv:2305.06161) |
| June 2023 | WizardCoder fine-tune achieves 57.3% on HumanEval |
| August 2023 | OctoPack: instruction tuning with CommitPack released |
| February 28, 2024 | StarCoder2 (3B, 7B, 15B) and The Stack v2 released |
| February 2024 | "StarCoder 2 and The Stack v2: The Next Generation" paper published (arXiv:2402.19173) |
| 2024 | StarCoder paper published at COLM 2024 conference |