The Stack v2
Last reviewed
Jun 8, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,512 words
The Stack v2 is a large open dataset of source code assembled by BigCode, an open scientific collaboration on code language models led by Hugging Face and ServiceNow. It is the training dataset behind the StarCoder2 family of code models, and both were released in February 2024. The Stack v2 is built on the Software Heritage archive, the largest public archive of software source code. In its full form it contains roughly 3.28 billion unique files spanning 658 programming languages, totaling about 67.5 terabytes of uncompressed code [1][2][3]. A curated training subset, augmented with sources such as GitHub pull requests, Kaggle notebooks, and code documentation, yielded a pretraining corpus of roughly 900 billion tokens, about four times the size of the one used for the original StarCoder [1][2].
The Stack v2 is the successor to The Stack v1, the 6.4-terabyte permissively licensed code dataset used to train SantaCoder and StarCoder in 2022 and 2023 [4][5]. Like its predecessor, it was developed with an emphasis on responsible data governance, including a public "Am I in The Stack" tool and an opt-out process for developers who do not want their code used to train models [6][7].
BigCode was launched in September 2022 as an open scientific collaboration co-stewarded by Hugging Face and ServiceNow Research, with the stated goal of developing large language models for code (Code LLMs) responsibly [4][8]. The project releases its datasets, models, and tooling openly, and it treats data governance practices, including licensing checks and developer opt-out, as part of the research process [7][8].
The collaboration's first major dataset was The Stack, introduced in the 2022 paper "The Stack: 3 TB of permissively licensed source code" by Denis Kocetkov, Raymond Li, Loubna Ben Allal, and collaborators [4]. The Stack v1 was sourced from public GitHub repositories and filtered down to files under permissive licenses. It grew across versions: the v1.1 release covered 358 programming languages and the v1.2 release reached about 6.4 terabytes of code [4][5]. The Stack v1 served as the pretraining corpus for SantaCoder, a 1.1-billion-parameter model focused on Python, Java, and JavaScript, and then for StarCoder and StarCoderBase, 15.5-billion-parameter models trained in 2023 on a large multilingual slice of the dataset [5]. The Stack v2 was conceived as the next generation of that effort, expanding both the scale and the breadth of the source material by moving from a GitHub-derived collection to the much larger Software Heritage archive [1][2].
The defining change in The Stack v2 is its data source. Rather than re-collecting code from GitHub, BigCode partnered with Software Heritage, a nonprofit initiative launched by the French research institute Inria that maintains the largest public archive of software source code and its development history [1][6]. Building on this "digital commons" gave the project access to a far deeper and more stable corpus than a fresh GitHub crawl. The full Stack v2 dataset is assembled from Software Heritage and covers roughly 104 million GitHub repositories, with repository metadata drawn from GitHub Archive data through September 2023 [2].
In its complete form, The Stack v2 holds about 3.28 billion unique files across 658 programming languages, totaling roughly 67.5 terabytes uncompressed [2][3]. BigCode also published a near-deduplicated variant of about 32 terabytes, along with the curated training subsets used for StarCoder2 [2].
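To put these totals in perspective, the short sketch below derives two rough figures from the numbers quoted above: the average size of a file in the full set and the share of the full set, by size, that remains in the near-deduplicated variant. It assumes decimal units (1 TB = 10^12 bytes), which the sources cited here do not specify, and the function names are illustrative only.

```python
# Back-of-the-envelope figures derived from the totals quoted in the article.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 KB = 10**3 bytes).

TOTAL_BYTES = 67.5e12        # full dataset, uncompressed
TOTAL_FILES = 3.28e9         # unique files in the full dataset
NEAR_DEDUP_BYTES = 32e12     # near-deduplicated variant


def average_file_size_kb(total_bytes: float, total_files: float) -> float:
    """Mean file size in kilobytes."""
    return total_bytes / total_files / 1e3


def retained_fraction(subset_bytes: float, total_bytes: float) -> float:
    """Fraction of the full dataset's size that a subset represents."""
    return subset_bytes / total_bytes


avg_kb = average_file_size_kb(TOTAL_BYTES, TOTAL_FILES)
kept = retained_fraction(NEAR_DEDUP_BYTES, TOTAL_BYTES)

print(f"average file size: ~{avg_kb:.1f} KB")          # ~20.6 KB
print(f"near-deduplicated share by size: ~{kept:.0%}")  # ~47%
```

Under these assumptions, a typical file is on the order of 20 KB, and near-deduplication removes a little over half of the data by size.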
For the StarCoder2 training data, BigCode selected code in 619 programming languages from the Software Heritage repositories and combined it with several other high-quality sources to broaden coverage beyond raw source files [1][2]. These additional sources include GitHub pull requests, Kaggle notebooks, and code documentation. The StarCoder2 report also describes assembling other code-adjacent corpora alongside the main code, such as issues, natural-language text related to code, and small math and reasoning datasets [1]. The dataset is distributed in two main training configurations: a "full" set spanning all 619 languages, and a smaller "smol" set restricted to 17 widely used languages, which lets the smaller StarCoder2 models train on a more focused mixture [2].
The table below summarizes the principal components of The Stack v2 as reported by BigCode.
| Aspect | The Stack v2 |
|---|---|
| Steward | BigCode (Hugging Face and ServiceNow) |
| Primary source | Software Heritage archive |
| Full set, files | About 3.28 billion unique files |
| Full set, languages | 658 |
| Full set, size | About 67.5 TB uncompressed |
| Languages used for StarCoder2 | 619 |
| Extra sources | GitHub pull requests, Kaggle notebooks, code documentation, issues, math and text corpora |
| Training tokens (approx.) | About 900 billion in the full training set |
| Models trained | StarCoder2 3B, 7B, 15B |
| Release | February 2024 |
At approximately 900 billion tokens, the curated full training set is roughly four times larger than the corpus used to train the first StarCoder, a jump driven by the move to Software Heritage and the inclusion of new data sources [1][2].
The dataset was built specifically to train StarCoder2, released by BigCode in February 2024 in three sizes: 3 billion, 7 billion, and 15 billion parameters [1]. The models were trained on between 3.3 and 4.3 trillion tokens, drawing repeatedly from The Stack v2 training mixtures, and then evaluated across a comprehensive set of Code LLM benchmarks [1]. BigCode reported that the smallest model, StarCoder2-3B, outperformed other code models of comparable size on most benchmarks and even surpassed the earlier 15-billion-parameter StarCoderBase, illustrating how the larger and more diverse Stack v2 data improved downstream performance per parameter [1].
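The phrase "drawing repeatedly" can be made concrete with a rough calculation. The sketch below divides the reported token budgets by the approximate size of the full training set to estimate how many passes over the data they imply. This is an illustration under a simplifying assumption, not a figure from the report: it treats the roughly 900-billion-token full set as the corpus for every model, whereas the smaller models trained on the "smol" mixture, whose token count is not given here.

```python
# Rough estimate of passes over the training data implied by the token budgets.
# Assumption: every model draws from the ~900B-token "full" training set.
# The smaller StarCoder2 models used a different mixture, so treat the result
# as indicative only.

FULL_TRAINING_SET_TOKENS = 900e9        # approximate size of the full set
TOKENS_SEEN_RANGE = (3.3e12, 4.3e12)    # reported training budgets


def passes_over_corpus(tokens_seen: float, corpus_tokens: float) -> float:
    """Number of times a corpus would be traversed to see `tokens_seen` tokens."""
    return tokens_seen / corpus_tokens


low, high = (
    passes_over_corpus(tokens, FULL_TRAINING_SET_TOKENS)
    for tokens in TOKENS_SEEN_RANGE
)

print(f"roughly {low:.1f} to {high:.1f} passes over the full set")
```

Under that assumption, the budgets correspond to somewhere between three and five passes over the data, which is why the text speaks of repeated use rather than a single pass.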
The dataset and the models were described together in the technical report "StarCoder 2 and The Stack v2: The Next Generation," first posted to arXiv on February 29, 2024, with Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, and a large group of BigCode contributors as authors [1]. Both the dataset and the StarCoder2 models were released openly, the dataset on the Hugging Face Hub and the models under the BigCode OpenRAIL-M responsible AI license [1][2].
The Stack v2 continues the data governance approach that distinguished the BigCode project from many earlier code corpora [6][7]. Source code in the dataset is drawn from the Software Heritage archive and is centered on files with permissive licenses or no license, and BigCode documents the licensing and provenance of the material it includes [2][6].
A central element of this approach is developer agency over personal data. BigCode operates an "Am I in The Stack" web tool, first introduced for The Stack v1 in November 2022, that lets developers check whether code from their GitHub repositories appears in the dataset [6][7]. Developers who wish to be excluded can file an opt-out request: the tool can automatically generate an issue in a dedicated BigCode opt-out repository, and listed repositories are removed in the next iteration of The Stack [7]. This mechanism, together with the partnership with Software Heritage and the project's published governance card, reflects BigCode's stated goal of giving people the ability to decide whether their source code is used to develop and evaluate language models [6][7].
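The effect of an opt-out on a later dataset iteration can be sketched as a simple filter. The example below is a hypothetical illustration, not BigCode's pipeline: the `SourceFile` record, its fields, and the `apply_opt_outs` helper are invented for this sketch, and the repository names are made up. It assumes opt-outs are recorded per repository as `owner/name` strings and compares them case-insensitively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFile:
    """Hypothetical record for one file in a code dataset."""
    repo: str   # repository identifier in "owner/name" form (assumed layout)
    path: str   # file path inside the repository


def normalize_repo(name: str) -> str:
    """Normalize a repository identifier for case-insensitive comparison."""
    return name.strip().lower()


def apply_opt_outs(files, opted_out_repos):
    """Return only the files whose repository is not on the opt-out list."""
    blocked = {normalize_repo(repo) for repo in opted_out_repos}
    return [f for f in files if normalize_repo(f.repo) not in blocked]


# Made-up example data.
files = [
    SourceFile("alice/parser", "src/lexer.py"),
    SourceFile("bob/webapp", "app/main.go"),
    SourceFile("Alice/Parser", "README.md"),
]
kept = apply_opt_outs(files, ["alice/parser"])

print([f.path for f in kept])  # ['app/main.go']
```

Filtering at the repository level mirrors the description above, in which whole listed repositories are dropped from the next iteration rather than individual files.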
The Stack v2 is one of the largest openly documented pretraining datasets for AI code generation, and it marked a step change from its predecessor by grounding the corpus in the Software Heritage archive rather than a one-off GitHub crawl [1][2]. Its scale, breadth across more than 600 languages, and inclusion of pull requests, notebooks, and documentation made it a reference resource for training and studying open code models [1][2].
It sits alongside other large open corpora of the same era. Where general-purpose web datasets such as RedPajama, The Pile, RefinedWeb, and Dolma aim to reproduce broad language-model training mixtures, The Stack and The Stack v2 are specialized code datasets, and they remain among the most widely cited open sources of permissively governed source code for large language model research [1][4]. By pairing a very large dataset with openly released models and a concrete opt-out path for developers, The Stack v2 became a prominent example of how a large code corpus can be built with explicit attention to provenance and consent [6][7].