SantaCoder
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 3,473 words
SantaCoder is a 1.1 billion parameter large language model for code generation, released in early 2023 by the BigCode project, an open scientific collaboration co-led by Hugging Face and ServiceNow Research. It was trained on the Python, Java, and JavaScript portions of The Stack, a large corpus of permissively licensed source code, and it is built on a GPT-2 style decoder-only transformer augmented with Multi-Query Attention and a Fill-in-the-Middle training objective. SantaCoder can both write code from a natural-language prompt or function signature and fill in missing code inside an existing file, and despite its modest size it matched or beat substantially larger open code models of its day on the MultiPL-E benchmark.
The model is best understood less as a finished product and more as a public proof of concept. It was the first model the BigCode community trained, and the accompanying technical report, titled "SantaCoder: don't reach for the stars!", documents the experiments the team ran to de-risk the architecture and to refine how training data is filtered and cleaned. Many of the decisions validated through SantaCoder, from the choice of Multi-Query Attention to the data-governance pipeline for redacting personal information, were carried directly into StarCoder, the much larger 15.5 billion parameter model BigCode released a few months later. SantaCoder is distributed on the Hugging Face Hub as bigcode/santacoder under the BigCode OpenRAIL-M license.
SantaCoder belongs to the category of code language models, sometimes called code LLMs, which are transformer networks trained on large bodies of source code so that they can autocomplete, generate, and edit programs. By the time SantaCoder appeared, this space already included commercial systems such as OpenAI's Codex (the engine behind GitHub Copilot) and several open releases including Salesforce's CodeGen and Meta AI's InCoder. What set SantaCoder apart was not raw capability but the principles behind it: full transparency about the training data, an explicit opt-out mechanism for developers whose code might be included, an effort to remove personally identifiable information before training, and the open release of weights, data, and preprocessing methods so that the work could be reproduced and scrutinized.
Because SantaCoder is a base model rather than an instruction-tuned assistant, it does not respond well to conversational commands. The model card is explicit that a prompt like "Write a function that computes the square root" will not work reliably. Instead the model expects input that looks like code in progress: a comment such as "# the following function computes the sqrt", or a function signature together with a docstring, after which the model continues the program. This reflects its training objective, which is simply to predict the next token of source code, optionally with a region of the file masked out for infilling.
The model handles three languages, Python, Java, and JavaScript, and has a context window of 2,048 tokens. It is released in float16 precision and is small enough to run on a single modern GPU, which made it attractive for experimentation, fine-tuning, and as a teaching example of how a responsible code model can be built end to end.
SantaCoder is a product of BigCode, which describes itself as an open scientific collaboration working on the responsible development and use of large language models for code. The project was launched on September 26, 2022, jointly by Hugging Face and ServiceNow Research, and it was explicitly modeled on BigScience, the earlier open collaboration that culminated in July 2022 with the multilingual BLOOM language model. Like BigScience, BigCode is organized around working groups that each own a slice of the problem: collecting and curating datasets, building the training and evaluation infrastructure, and, importantly, a Legal, Ethics, and Governance working group that studies questions such as code licensing, attribution of generated code to its original authors, the redaction of personal data, and the risk of models emitting malicious or insecure code.
The collaboration was open to anyone with a professional research background who could commit time, and the SantaCoder technical report carries a long author list of more than 40 contributors drawn from Hugging Face, ServiceNow Research, and a wide range of universities and companies including IBM Research, Carnegie Mellon University, Northeastern University, EleutherAI, and others. The work was led by Loubna Ben Allal and Harm de Vries, with Leandro von Werra also among the corresponding authors. This breadth was deliberate. Part of the point of BigCode was to show that a capable code model could be produced in the open, with governance decisions made transparently and with input from the broader community, rather than behind the closed doors of a single company.
The motivation was partly a response to the controversy surrounding GitHub Copilot, which had been trained on public code without a clear consent or opt-out mechanism and which raised unresolved questions about licensing and attribution. BigCode set out to do the opposite: build only from permissively licensed code, give developers tools to see whether their code was included and to remove it, and publish the methods rather than keep them proprietary.
SantaCoder was trained on a subset of The Stack version 1.1, the dataset BigCode released in 2022 containing permissively licensed source code gathered from GitHub across 384 programming languages. For SantaCoder the team used only the Python, Java, and JavaScript files. After applying opt-out removals, near-deduplication, personal-information redaction, and filtering based on line length and the fraction of alphanumeric characters, the base training set came to roughly 268 GB of code. The corpus was also decontaminated by removing files that contained test samples from the benchmarks used for evaluation, namely HumanEval, APPS, MBPP, and MultiPL-E, so that the model could not simply memorize answers.
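The line-length and alphanumeric-fraction filters mentioned above can be sketched as a single predicate. This is a minimal illustration only: the function names and all three thresholds are assumptions chosen for the example, not values taken from this article.

```python
def alnum_fraction(text: str) -> float:
    """Fraction of characters that are letters or digits."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def keep_file(
    text: str,
    max_line_len: int = 1000,          # assumed threshold
    max_avg_line_len: float = 100.0,   # assumed threshold
    min_alnum_fraction: float = 0.25,  # assumed threshold
) -> bool:
    """Return True if a source file passes the simple quality heuristics."""
    lines = text.splitlines()
    if not lines:
        return False
    longest = max(len(line) for line in lines)
    average = sum(len(line) for line in lines) / len(lines)
    return (
        longest <= max_line_len
        and average <= max_avg_line_len
        and alnum_fraction(text) >= min_alnum_fraction
    )
```

Heuristics of this kind are typically aimed at minified JavaScript, embedded data blobs, and other machine-generated files, which tend to have very long lines or few alphanumeric characters.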
Data governance was a central concern rather than an afterthought, and two mechanisms stand out. The first is the opt-out process. Alongside The Stack, BigCode published a tool called "Am I in The Stack" that lets developers check whether files from their repositories appear in the dataset, together with a form for requesting removal. The SantaCoder report records that the project received nine opt-out requests before the October 31, 2022 cutoff for the data used in the paper, covering 299 repositories; of these, 161 were already excluded because they lacked a permissive license, and 138 were present in The Stack v1.0 and were removed for the next iteration.
The second mechanism is the redaction of personally identifiable information (PII). Source code frequently contains sensitive strings such as email addresses, IP addresses, and secret keys, and BigCode built a pipeline to detect and mask them. To measure how well that pipeline worked, the team constructed a PII benchmark by hand-annotating 400 code files, which together contained 214 emails, 99 IP addresses, and 34 secret keys. For this first iteration they focused on those three categories and left names, usernames, and passwords for future work. Emails were found with a regular expression and replaced with a randomized address at example.com; IP addresses were detected with regular expressions and validated with a Python library, with private addresses and common DNS servers deliberately left untouched, then replaced with randomly generated values; secret keys were caught using the open-source detect-secrets tool augmented with extra filters, including a gibberish detector, to cut down on false positives. Email and IP detection performed well, with precision and recall above 90 percent for emails and above 80 percent for IP addresses, while key detection achieved nearly 80 percent precision but only roughly 50 percent recall, which the authors attributed to the difficult precision-recall trade-off inherent in spotting secrets. The named-entity model later developed from this line of work, StarPII, became a reusable PII detector for code with six target classes.
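The email and IP-address steps described above can be sketched with the Python standard library. This is an illustrative approximation, not the BigCode pipeline: the regular expressions, the `KEEP_AS_IS` list of DNS servers, and the function names are all assumptions, and secret-key detection (which the report handled with detect-secrets) is not shown.

```python
import ipaddress
import random
import re
import string

# Deliberately simple patterns, assumed for illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
KEEP_AS_IS = {"8.8.8.8", "8.8.4.4", "1.1.1.1"}  # assumed list of common DNS servers


def redact_emails(text: str, rng: random.Random) -> str:
    """Replace each email with a random address at example.com."""
    def fake(_match: re.Match) -> str:
        user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        return f"{user}@example.com"
    return EMAIL_RE.sub(fake, text)


def redact_ips(text: str, rng: random.Random) -> str:
    """Replace public IPv4 addresses; leave invalid, private, and DNS ones alone."""
    def fake(match: re.Match) -> str:
        candidate = match.group(0)
        try:
            addr = ipaddress.ip_address(candidate)  # validation step
        except ValueError:
            return candidate  # matched the pattern but is not a valid address
        if addr.is_private or candidate in KEEP_AS_IS:
            return candidate
        return ".".join(str(rng.randint(1, 254)) for _ in range(4))
    return IPV4_RE.sub(fake, text)
```

A pattern this loose will also match dotted version strings such as 1.2.3.4, one example of the false positives that make precision and recall worth measuring on an annotated benchmark.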
SantaCoder is a decoder-only transformer in the GPT-2 family, meaning it generates code left to right by repeatedly predicting the next token. The base configuration has 24 layers, 16 attention heads, and a hidden size of 2,048, for a total of about 1.1 billion parameters, and it uses a context length of 2,048 tokens. Inputs are encoded with a Byte-Pair Encoding tokenizer trained on raw bytes with a vocabulary of 49,152 tokens; the tokenizer was trained on 600,000 rows of code, 200,000 for each language, pre-tokenized with a digit splitter and the default GPT-2 pre-tokenizer regular expression. Two architectural choices distinguish SantaCoder from a vanilla GPT-2 model, and the report devotes careful ablation experiments to each.
The first is Multi-Query Attention (MQA), an idea introduced by Noam Shazeer in 2019 in which all attention heads share a single set of key and value projections rather than each head having its own. This sharing sharply lowers the memory bandwidth needed during generation, which speeds up inference at large batch sizes, a property that matters a great deal for serving a code-completion model to many users at once. The same technique had already been adopted in DeepMind's AlphaCode and Google's PaLM. The BigCode team compared an MQA model against an otherwise identical model using ordinary Multi-Head Attention (MHA), and found that MHA scored 1 to 4 percent higher on HumanEval pass@100 and 1 to 3 percent higher on MBPP. They judged this drop acceptable, noting that the MHA model was actually larger at 1.3 billion parameters because of its extra projection weights, so the comparison was not perfectly fair, and that the inference speedups of MQA outweighed the modest accuracy cost.
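A rough parameter count shows where the 1.1 billion and 1.3 billion figures come from, and why sharing keys and values shrinks the per-token cache that must be read during generation. The sketch below assumes a standard GPT-2 layout (4x feed-forward expansion, learned position embeddings, tied input and output embeddings) and ignores biases and layer norms; the function names are hypothetical.

```python
def transformer_params(n_layers, d_model, n_heads, vocab, n_positions, multi_query):
    """Approximate weight count for a GPT-2 style decoder."""
    head_dim = d_model // n_heads
    # MQA keeps one shared key/value head; MHA keeps one per attention head.
    kv_width = head_dim if multi_query else d_model
    attention = d_model * d_model            # query projection
    attention += 2 * d_model * kv_width      # key and value projections
    attention += d_model * d_model           # output projection
    mlp = 2 * d_model * (4 * d_model)        # assumed 4x feed-forward expansion
    embeddings = vocab * d_model + n_positions * d_model  # assumed tied embeddings
    return n_layers * (attention + mlp) + embeddings


def kv_cache_bytes_per_token(n_layers, d_model, n_heads, multi_query, bytes_per_value=2):
    """Bytes of cached keys and values per generated token (float16 by default)."""
    head_dim = d_model // n_heads
    kv_width = head_dim if multi_query else d_model
    return 2 * n_layers * kv_width * bytes_per_value


CONFIG = dict(n_layers=24, d_model=2048, n_heads=16)
mqa = transformer_params(**CONFIG, vocab=49152, n_positions=2048, multi_query=True)
mha = transformer_params(**CONFIG, vocab=49152, n_positions=2048, multi_query=False)
print(f"MQA: {mqa / 1e9:.2f}B parameters, MHA: {mha / 1e9:.2f}B parameters")
# prints "MQA: 1.12B parameters, MHA: 1.31B parameters"

cache_mha = kv_cache_bytes_per_token(**CONFIG, multi_query=False)
cache_mqa = kv_cache_bytes_per_token(**CONFIG, multi_query=True)
print(cache_mha // cache_mqa)  # prints 16, one shared head instead of sixteen
```

Under these assumptions the key/value cache is sixteen times smaller with MQA, which is the source of the bandwidth saving at large batch sizes.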
The second choice is Fill-in-the-Middle (FIM), a training transformation that lets a left-to-right model learn to infill. Following the approach of Bavarian and colleagues at OpenAI, each training document is split at random into three parts, a prefix, a middle, and a suffix; the pieces are each tagged with a sentinel token and rearranged so the middle is moved to the end, after which ordinary next-token prediction teaches the model to reconstruct the missing middle given the surrounding context. SantaCoder applies FIM at the character level with a rate of 0.5 using a joint SPM and PSM formulation. Comparing the FIM model with a no-FIM baseline, the team again observed only a small and consistent cost on left-to-right benchmarks, with FIM scoring 2 to 4 percent lower on HumanEval pass@100 and about 1 percent lower on MBPP. The trade is worthwhile because FIM is what gives the model its practical infilling ability, letting it complete code inside a file rather than only appending to the end.
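The transformation can be sketched for the prefix-suffix-middle (PSM) ordering only. The sentinel strings are the hyphenated forms shown on the SantaCoder model card; the function name, the way cut points are sampled, and the omission of the SPM variant are simplifying assumptions for illustration.

```python
import random

# Sentinel strings as written on the SantaCoder model card (hyphenated).
FIM_PREFIX = "<fim-prefix>"
FIM_MIDDLE = "<fim-middle>"
FIM_SUFFIX = "<fim-suffix>"


def fim_transform(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """Return doc unchanged, or rearranged so the middle span comes last."""
    if rng.random() >= fim_rate:
        return doc  # kept as ordinary left-to-right training text
    # Two character-level cut points, sorted so that lo <= hi.
    lo, hi = sorted(rng.randint(0, len(doc)) for _ in range(2))
    prefix, middle, suffix = doc[:lo], doc[lo:hi], doc[hi:]
    # PSM order: the model sees prefix and suffix, then learns to emit the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```

Because the rearranged string is still trained with plain next-token prediction, no architectural change is needed: predicting the tokens after the middle sentinel is exactly the infilling task.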
SantaCoder was evaluated on MultiPL-E, a benchmark that extends the popular Python HumanEval and MBPP problem sets to additional programming languages by automatically translating the docstrings, function signatures, and hidden unit tests; for SantaCoder only the Java, JavaScript, and Python portions were used. Performance on the text-to-code task is reported with pass@k, the probability that at least one of k sampled completions passes all the hidden tests, following the methodology established for Codex.
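In practice pass@k is usually estimated from a larger pool of samples rather than by drawing exactly k. The sketch below uses the unbiased estimator described in the Codex paper, where n completions are sampled for a problem and c of them pass; the function name and argument checks are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    # 1 minus the probability that a random size-k subset contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is then the mean of this value over all problems, so a model that solves 3 of 10 samples on a given problem contributes 0.3 to that problem's pass@1.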
The central result is that SantaCoder, at just 1.1 billion parameters, matched or outperformed the larger open multilingual code models of its time. The table below reproduces the headline comparison from the report, covering left-to-right generation (HumanEval pass@100) and Fill-in-the-Middle single-line exact match across the three languages. The Codex figures are quoted from the original Codex paper.
| Model | Size | L2R Java | L2R JavaScript | L2R Python | FIM Java | FIM JavaScript | FIM Python |
|---|---|---|---|---|---|---|---|
| InCoder | 6.7B | 0.36 | 0.38 | 0.47 | 0.49 | 0.51 | 0.31 |
| CodeGen-multi | 2.7B | 0.42 | 0.39 | 0.39 | n/a | n/a | n/a |
| CodeGen-mono | 2.7B | n/a | n/a | 0.57 | n/a | n/a | n/a |
| Codex | 2.5B | n/a | n/a | 0.60 | n/a | n/a | n/a |
| SantaCoder | 1.1B | 0.41 | 0.47 | 0.49 | 0.62 | 0.60 | 0.44 |
On left-to-right generation SantaCoder leads InCoder-6.7B on all three languages. Against CodeGen-multi-2.7B it is ahead on JavaScript (0.47 versus 0.39) and Python (0.49 versus 0.39) and essentially level on Java (0.41 versus 0.42), while it trails the larger Python-specialized CodeGen-mono and Codex on Python alone. On infilling the gap is wider in SantaCoder's favor: it beats InCoder, which was itself designed for infilling, on every language tested, while the CodeGen and Codex models could not perform the FIM task at all. The Hugging Face model card additionally reports self-measured MultiPL-E numbers for the released model, including a Python HumanEval pass@1 of about 0.18 rising to 0.49 at pass@100, and a Python MBPP pass@1 of about 0.35.
Perhaps the most cited finding from the project is captured in the report's playful title. The team tested four data filters, and the one that selected files only from repositories with five or more GitHub stars, a heuristic that earlier work had used as a proxy for code quality, actually hurt the model. The stars filter removed more than 60 percent of the data by volume and dropped text-to-code pass@100 by 3 to 6 percent and infilling performance by 5 to 11 percent. Crucially, the HumanEval curve for the starred subset diverged early in training, before the smaller dataset could be blamed, which indicated that the loss was due to data quality and not merely quantity. By contrast, more aggressive near-deduplication of the training data gave a consistent gain of 1 to 3 percent, and filtering on the comment-to-code ratio helped slightly. The final SantaCoder model was therefore trained with MQA, FIM, near-deduplication, and the comment-to-code filter, but with no stars filter; "don't reach for the stars" is both a joke and a genuine empirical recommendation.
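Near-deduplication drops files that are almost, but not exactly, identical to one already kept. The sketch below uses exact Jaccard similarity over token shingles as a stand-in for whatever approximate method a large-scale pipeline would use; the shingle size, the 0.7 threshold, and the function names are assumptions for illustration.

```python
import re


def shingles(text: str, size: int = 5) -> set:
    """Set of overlapping token n-grams used as a file's fingerprint."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < size:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_deduplicate(files: list, threshold: float = 0.7) -> list:
    """Keep a file only if it is not too similar to any file already kept."""
    kept, kept_shingles = [], []
    for text in files:
        fingerprint = shingles(text)
        if all(jaccard(fingerprint, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(fingerprint)
    return kept
```

This pairwise comparison is quadratic in the number of files, so it only suits small collections; lowering the threshold makes the filter more aggressive.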
SantaCoder was explicitly a stepping stone toward a larger model. The BigCode collaboration had set itself the goal of training a roughly 15 billion parameter code model, and SantaCoder was the vehicle for de-risking that effort at one-tenth the scale before committing the much larger compute budget. Nearly every design decision validated on SantaCoder reappears in StarCoder: the GPT-2 style decoder, Multi-Query Attention for fast inference, the Fill-in-the-Middle objective, training on The Stack with opt-out and PII redaction, and release under the BigCode OpenRAIL-M license.
StarCoder, released in May 2023, scaled the recipe up dramatically. It has 15.5 billion parameters across 40 layers with a hidden size of 6,144 and 48 attention heads, was trained on roughly one trillion tokens spanning more than 80 programming languages from The Stack, and uses FlashAttention to extend the context window to about 8,000 tokens, far beyond SantaCoder's 2,048. Where SantaCoder covered three languages and served mainly as a research artifact, StarCoder was positioned as a genuinely useful coding assistant competitive with closed models, and it in turn was followed in 2024 by StarCoder2 trained on the larger, more carefully governed The Stack v2. In this lineage SantaCoder occupies a clear and important place: it is the early experiment that proved the BigCode approach worked, both technically and from a governance standpoint, and that gave the team the confidence to scale up.
As a base code model, SantaCoder is suited to the standard tasks of program synthesis and completion in its three supported languages. The most direct use is autocompletion: given a partial function or a comment describing intent, the model continues the code. Because it was trained with Fill-in-the-Middle, it can also perform infilling, generating code to bridge a gap between an existing prefix and suffix, which mirrors how a developer edits inside a file rather than only appending to the end. This makes it a natural backbone for editor and IDE completion features and for tools that insert code at the cursor.
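For insertion at the cursor, an editor integration would split the buffer at the cursor position and wrap the two halves in the model's sentinel strings, leaving the model to generate the missing middle. The sketch below follows the prompt layout shown on the SantaCoder model card; the function name and the cursor-offset interface are hypothetical.

```python
def infill_prompt(source: str, cursor: int) -> str:
    """Build a fill-in-the-middle prompt for the text missing at `cursor`."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor must lie inside the source text")
    prefix, suffix = source[:cursor], source[cursor:]
    # The model is expected to continue after the final sentinel with the middle.
    return f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"


buffer = "def print_hello_world():\n    \n    print('Hello world!')"
cursor = len("def print_hello_world():\n    ")
prompt = infill_prompt(buffer, cursor)
```

The generated text would then be inserted at the cursor, typically after truncating it at an end-of-text token or a chosen stop sequence.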
Beyond direct use, SantaCoder found a substantial role as a research and engineering platform. Its small size relative to commercial code models meant that academics and practitioners could fine-tune it on their own codebases or on new languages, study code-model behavior, or use it as a baseline in experiments, all without the cost of training from scratch or the access restrictions of proprietary systems. The openness of its weights, data, and methods made it especially valuable for reproducible research, and it served as a reference implementation for the responsible-development practices, opt-out and PII redaction in particular, that BigCode wanted to promote. It is also commonly cited as a teaching example of a code LLM built transparently from end to end.
SantaCoder carries the limitations typical of an early, small, base code model, and its own documentation is candid about them. It is not an instruction-tuned model, so it does not follow conversational requests; prompts must be phrased as code in progress, such as a comment or a function signature with a docstring. It supports only Python, Java, and JavaScript, so it is not useful for code written in other languages. Its context window of 2,048 tokens is short by later standards, which constrains how much surrounding code it can take into account.
More fundamentally, the model card warns that generated code is not guaranteed to work as intended and may be inefficient, contain bugs, or include security vulnerabilities or exploits. Because it was trained on a large volume of public code, it can also reproduce patterns from its training data, and generated output should be reviewed before use rather than trusted blindly. The PII redaction pipeline, while a meaningful step, was a first iteration that targeted only emails, IP addresses, and keys, and key detection in particular had limited recall, so the training data was not perfectly scrubbed of sensitive strings.
SantaCoder is released under the BigCode OpenRAIL-M license, a Responsible AI License variant. OpenRAIL-M permits broad use, including commercial use, and allows redistribution and the creation of derivatives, while attaching a set of use-based restrictions that prohibit harmful applications and require those restrictions to be passed along to downstream users. This licensing choice was itself part of the project's governance philosophy: keep the model genuinely open and usable, but place ethical guardrails on how it may be deployed. The license terms are published as an agreement on the Hugging Face Hub.
The path to SantaCoder began with the formation of BigCode on September 26, 2022, when Hugging Face and ServiceNow Research announced the collaboration with the stated aim of building state-of-the-art code-generation systems openly and responsibly. The project drew direct inspiration from BigScience, which had finished a few months earlier, and it adopted a similar working-group structure. In October 2022 the community released the first version of The Stack, the permissively licensed multi-language code dataset, along with the "Am I in The Stack" inspection tool and the opt-out form that would underpin the project's data governance. Version 1.1, which applied the first round of opt-out removals, is the one SantaCoder was trained on.
With the dataset in hand, the team turned to model experiments through late 2022, running the architecture and data-filtering ablations that the technical report describes. The base ablation models were 1.1 billion parameter decoder-only transformers trained in float16 for 300,000 iterations, seeing about 118 billion tokens, with a global batch size of 192 using the Adam optimizer and a learning rate of 2e-4 on a cosine decay schedule after a short warmup; each such run took about 3.1 days on 96 NVIDIA Tesla V100 GPUs. Having identified the best combination of choices, MQA, FIM, near-deduplication, and the comment-to-code filter with no stars filter, the team trained the final SantaCoder model for 600,000 iterations, doubling the compute and exposing the model to roughly 236 billion tokens, which substantially improved its benchmark scores.
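The token counts quoted above follow directly from the iteration count, batch size, and context length. The short check below assumes the batch size is counted in sequences and that every sequence is packed to the full 2,048 tokens; the function name is illustrative.

```python
def tokens_seen(iterations: int, global_batch_size: int, context_length: int) -> int:
    """Total training tokens, assuming fully packed sequences."""
    return iterations * global_batch_size * context_length


ablation = tokens_seen(300_000, 192, 2048)
final = tokens_seen(600_000, 192, 2048)
print(f"{ablation / 1e9:.0f}B tokens, {final / 1e9:.0f}B tokens")
# prints "118B tokens, 236B tokens"

# Compute for one ablation run, using the 3.1 days on 96 GPUs quoted above.
ablation_gpu_days = 3.1 * 96
```

Doubling the iterations at the same batch size and context length doubles both the tokens seen and, to a first approximation, the GPU time.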
The technical report, "SantaCoder: don't reach for the stars!", was posted to arXiv on January 9, 2023 (arXiv:2301.03988), summarizing the collaboration's progress through December 2022, and the model was published on the Hugging Face Hub as bigcode/santacoder. Within months the lessons learned fed directly into StarCoder, released in May 2023, and then StarCoder2 in 2024, establishing SantaCoder as the foundational first step in one of the most prominent open efforts to build code language models responsibly.