The Stack v2

Updated Jun 28, 2026

The Stack v2 is a large open dataset of source code released by BigCode in February 2024 as the training dataset behind the StarCoder2 family of code models. It is built on the Software Heritage archive, the largest public archive of software source code. In its full form it contains about 3.28 billion unique files spanning 658 programming languages and totaling roughly 67.5 terabytes of uncompressed code, which BigCode described at release as the largest open code dataset suitable for large language model pretraining ^[1]^[2]^[9]. BigCode, an open scientific collaboration led by Hugging Face and ServiceNow, paired the dataset with a developer opt-out process and the public "Am I in The Stack" tool, so that developers can check whether their code is included and request its removal ^[6]^[7].

What is The Stack v2?

The Stack v2 is an open dataset of source code assembled by BigCode, an open scientific collaboration on code language models led by Hugging Face and ServiceNow. It is the training dataset behind the StarCoder2 family of code models, released in February 2024. The Stack v2 is built on top of the Software Heritage archive, the largest public archive of software source code, and in its full form contains roughly 3.28 billion unique files spanning 658 programming languages, totaling about 67.5 terabytes of uncompressed code ^[1]^[2]^[3]. A curated training subset, augmented with sources such as GitHub pull requests, Kaggle notebooks, and code documentation, produced a pretraining corpus about four times larger than the one used for the original StarCoder, on the order of 900 billion tokens ^[1]^[2].

The Stack v2 is the successor to The Stack v1, the 6.4-terabyte permissively licensed code dataset used to train SantaCoder and StarCoder in 2022 and 2023 ^[4]^[5]. Like its predecessor, it was developed with an emphasis on responsible data governance, including a public "Am I in The Stack" tool and an opt-out process for developers who do not want their code used to train models ^[6]^[7].

Where did The Stack v2 come from? Lineage from BigCode and The Stack v1

BigCode was launched in September 2022 as an open scientific collaboration co-stewarded by Hugging Face and ServiceNow Research, with the stated goal of developing large language models for code (Code LLMs) responsibly ^[4]^[8]. The project releases its datasets, models, and tooling openly, and it treats data governance practices, including licensing checks and developer opt-out, as part of the research process ^[7]^[8].

The collaboration's first major dataset was The Stack, introduced in the 2022 paper "The Stack: 3 TB of permissively licensed source code" by Denis Kocetkov, Raymond Li, Loubna Ben Allal, and collaborators ^[4]. The Stack v1 was sourced from public GitHub repositories and filtered down to files under permissive licenses. It grew across versions: the v1.1 release covered 358 programming languages and the v1.2 release reached about 6.4 terabytes of code ^[4]^[5]. The Stack v1 served as the pretraining corpus for SantaCoder, a 1.1-billion-parameter model focused on Python, Java, and JavaScript, and then for StarCoder and StarCoderBase, 15.5-billion-parameter models trained in 2023 on a large multilingual slice of the dataset ^[5]. The Stack v2 was conceived as the next generation of that effort, expanding both the scale and the breadth of the source material by moving from a GitHub-derived collection to the much larger Software Heritage archive ^[1]^[2].

What does The Stack v2 contain?

Why does The Stack v2 use Software Heritage as its foundation?

The defining change in The Stack v2 is its data source. Rather than recollecting code from GitHub, BigCode partnered with Software Heritage, a nonprofit initiative launched by the French research institute Inria that maintains the largest public archive of software source code and its development history ^[1]^[6]. As the StarCoder2 announcement puts it, "This dataset is derived from the Software Heritage archive, the largest public archive of software source code and accompanying development history" ^[9]. Building on this "digital commons" gave the project access to a far deeper and more stable corpus than a fresh GitHub crawl. The full Stack v2 dataset is assembled from Software Heritage and reflects on the order of 104 million GitHub repositories, with repository metadata drawn from GitHub Archive data through September 2023 ^[2].

In its complete form, The Stack v2 holds about 3.28 billion unique files across 658 programming languages, totaling roughly 67.5 terabytes uncompressed ^[2]^[3]. BigCode also published a near-deduplicated variant of about 32 terabytes, along with the curated training subsets used for StarCoder2 ^[2]^[9]. BigCode describes the result plainly: "The Stack v2 is the largest open code dataset suitable for LLM pretraining" ^[9].
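The size figures above can be related with some simple arithmetic. The following is a back-of-the-envelope sketch rather than anything published by BigCode: it assumes a terabyte means 10^12 bytes (the source does not specify) and uses the rounded figures quoted in this section, so the results are only approximate.

```python
# Back-of-the-envelope arithmetic on the figures quoted above.
# Assumption: "terabyte" means 10**12 bytes; the source does not say.
TB = 10**12

full_size_bytes = 67.5 * TB   # full set, uncompressed
full_file_count = 3.28e9      # unique files in the full set
dedup_size_bytes = 32 * TB    # near-deduplicated variant

# Mean size of one file in the full set, in kilobytes (10**3 bytes).
avg_file_kb = full_size_bytes / full_file_count / 1e3

# Share of the full set's bytes that remain after near-deduplication.
dedup_fraction = dedup_size_bytes / full_size_bytes

print(f"average file size: about {avg_file_kb:.1f} KB")
print(f"near-dedup keeps about {dedup_fraction:.0%} of the bytes")
```

Under these assumptions the average file is on the order of 20 KB, and the near-deduplicated variant retains a little under half of the full set's bytes.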

How many languages and sources does The Stack v2 cover?

For the StarCoder2 training data, BigCode selected code in 619 programming languages from the Software Heritage repositories and combined it with several other high-quality sources to broaden coverage beyond raw source files ^[1]^[2]. These additional sources include GitHub pull requests, Kaggle notebooks, and code documentation, as well as other code-adjacent corpora such as issues, natural-language text related to code, and small math and reasoning datasets that the StarCoder2 report describes assembling alongside the main code ^[1]. The dataset is distributed in two main training configurations: a "full" set spanning the 600-plus languages, and a smaller "smol" set restricted to 17 widely used languages, allowing the smaller StarCoder2 models to train on a more focused mixture ^[2].
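The relationship between the "full" and "smol" configurations is essentially a language filter over the same kind of records. The sketch below illustrates that idea on in-memory data only. The `language` and `path` field names and the three-language allow-list are assumptions made for illustration; the actual 17-language list and the dataset's real schema are defined by BigCode and are not reproduced here.

```python
from typing import Iterable, Iterator

# Hypothetical stand-in for the 17-language "smol" list; the real list is
# defined by BigCode and is not reproduced here.
EXAMPLE_SMOL_LANGUAGES = frozenset({"Python", "Java", "JavaScript"})


def select_languages(records: Iterable[dict], allowed: frozenset) -> Iterator[dict]:
    """Yield only the records whose 'language' field is in `allowed`."""
    for record in records:
        if record.get("language") in allowed:
            yield record


# Toy records with assumed field names, not real dataset rows.
sample = [
    {"path": "a.py", "language": "Python"},
    {"path": "b.f90", "language": "Fortran"},
    {"path": "c.java", "language": "Java"},
]

smol_sample = list(select_languages(sample, EXAMPLE_SMOL_LANGUAGES))
print([r["path"] for r in smol_sample])  # ['a.py', 'c.java']
```

Writing the filter as a generator means the same function would also work over a streamed iterable, which matters for a corpus far too large to hold in memory.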

The table below summarizes the principal components of The Stack v2 as reported by BigCode.

Aspect	The Stack v2
Steward	BigCode (Hugging Face and ServiceNow)
Primary source	Software Heritage archive
Full set, files	About 3.28 billion unique files
Full set, languages	658
Full set, size	About 67.5 TB uncompressed
Near-deduplicated set	About 32 TB
Languages used for StarCoder2	619
Extra sources	GitHub pull requests, Kaggle notebooks, code documentation, issues, math and text corpora
Training tokens (approx.)	About 900 billion in the full training set
Models trained	StarCoder2 3B, 7B, 15B
Release	February 2024

How big is The Stack v2 and how does it relate to StarCoder2?

The Stack v2 training set is roughly four times larger than the corpus used to train the first StarCoder, a jump driven by the move to Software Heritage and the inclusion of new data sources. In the words of the StarCoder2 paper, "this results in a training set that is 4x larger than the first StarCoder dataset" ^[1]^[2]. The curated full training set amounts to approximately 900 billion tokens of code ^[2].

The dataset was built specifically to train StarCoder2, released by BigCode in February 2024 in three sizes: 3 billion, 7 billion, and 15 billion parameters ^[1]. The models were trained on between 3.3 and 4.3 trillion tokens, drawing repeatedly from The Stack v2 training mixtures, and then evaluated across a comprehensive set of Code LLM benchmarks ^[1]. BigCode reported that the smallest model, StarCoder2-3B, outperformed other code models of comparable size on most benchmarks and even surpassed the earlier 15-billion-parameter StarCoderBase, illustrating how the larger and more diverse Stack v2 data improved downstream performance per parameter ^[1]^[9].
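To make "drawing repeatedly" concrete, the token counts quoted above can be turned into a rough number of passes over the data. This is an illustrative calculation, not a figure from the StarCoder2 report: it assumes, hypothetically, that a model's mixture is the full set of about 900 billion tokens. Because the smaller models trained on a more focused mixture, their actual number of passes would differ, so the lower figure in particular should be read only as a reference point.

```python
def passes_over_corpus(tokens_seen: float, unique_tokens: float) -> float:
    """Return how many times a corpus of `unique_tokens` is traversed."""
    return tokens_seen / unique_tokens


# Rounded figures quoted in the text.
FULL_TRAINING_SET_TOKENS = 900e9   # approximate size of the full training set
MIN_TOKENS_SEEN = 3.3e12           # lower end of the reported training budget
MAX_TOKENS_SEEN = 4.3e12           # upper end of the reported training budget

# Hypothetical: treats every model as if it trained on the full set.
low = passes_over_corpus(MIN_TOKENS_SEEN, FULL_TRAINING_SET_TOKENS)
high = passes_over_corpus(MAX_TOKENS_SEEN, FULL_TRAINING_SET_TOKENS)
print(f"roughly {low:.1f} to {high:.1f} passes over a 900B-token set")

# The "4x larger" claim implies a first-StarCoder corpus of about this size.
implied_starcoder_tokens = FULL_TRAINING_SET_TOKENS / 4
print(f"implied first StarCoder corpus: about {implied_starcoder_tokens / 1e9:.0f}B tokens")
```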

The dataset and the models were described together in the technical report "StarCoder 2 and The Stack v2: The Next Generation," first posted to arXiv on February 29, 2024, with Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, and a large group of BigCode contributors as authors ^[1]. Both the dataset and the StarCoder2 models were released openly, the dataset on the Hugging Face Hub and the models under the BigCode OpenRAIL-M responsible AI license ^[1]^[2].

How does The Stack v2 handle code licensing and opt-out?

The Stack v2 continues the data governance approach that distinguished the BigCode project from many earlier code corpora ^[6]^[7]. Source code in the dataset is drawn from the Software Heritage archive and is centered on files with permissive licenses or no license, and BigCode documents the licensing and provenance of the material it includes ^[2]^[6].

A central element of this approach is giving developers control over whether their code is included. BigCode operates an "Am I in The Stack" web tool, first introduced for The Stack v1 in November 2022, that lets developers check whether code from their GitHub repositories appears in the dataset ^[6]^[7]. Developers who wish to be excluded can file an opt-out request: the tool can automatically generate an issue in a dedicated BigCode opt-out repository, and the listed repositories are removed in the next iteration of The Stack ^[7]. This mechanism, together with the partnership with Software Heritage and the project's published governance card, reflects BigCode's stated goal of giving people the ability to decide whether their source code is used to develop and evaluate language models ^[6]^[7].
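The two steps described above, looking up a developer's repositories and dropping opted-out repositories from the next iteration, can be sketched as simple operations over repository names. This is a minimal illustration and not BigCode's actual tooling: the `repo_name` field, the `owner/repo` naming format, and both function names are assumptions introduced for the example.

```python
def repos_for_user(records: list, username: str) -> list:
    """Return the sorted repository names owned by `username` (lookup step)."""
    prefix = username.lower() + "/"
    return sorted({r["repo_name"] for r in records
                   if r["repo_name"].lower().startswith(prefix)})


def apply_opt_out(records: list, opted_out_repos: list) -> list:
    """Return the records whose repository is not on the opt-out list."""
    # Compare case-insensitively, since GitHub treats names that way.
    blocked = {name.lower() for name in opted_out_repos}
    return [r for r in records if r["repo_name"].lower() not in blocked]


# Toy records with an assumed 'repo_name' field of the form "owner/repo".
records = [
    {"repo_name": "alice/parser", "path": "src/lex.py"},
    {"repo_name": "alice/site", "path": "index.js"},
    {"repo_name": "bob/tools", "path": "main.go"},
]

print(repos_for_user(records, "alice"))       # ['alice/parser', 'alice/site']
remaining = apply_opt_out(records, ["Alice/Parser"])
print([r["repo_name"] for r in remaining])    # ['alice/site', 'bob/tools']
```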

Why does The Stack v2 matter?

The Stack v2 is one of the largest openly documented pretraining datasets for AI code generation, and it marked a step change from its predecessor by grounding the corpus in the Software Heritage archive rather than a one-off GitHub crawl ^[1]^[2]. Its scale, breadth across more than 600 languages, and inclusion of pull requests, notebooks, and documentation made it a reference resource for training and studying open code models ^[1]^[2].

It sits alongside other large open corpora of the same era. Where general-purpose web datasets such as RedPajama, The Pile, RefinedWeb, and Dolma aim to reproduce broad language-model training mixtures, The Stack and The Stack v2 are specialized code datasets, and they remain among the most widely cited open sources of permissively-governed source code for large language model research ^[1]^[4]. By pairing a very large dataset with openly released models and a concrete opt-out path for developers, The Stack v2 became a prominent example of how a large code corpus can be built with explicit attention to provenance and consent ^[6]^[7].

References

1. Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, et al. (BigCode). "StarCoder 2 and The Stack v2: The Next Generation." arXiv:2402.19173, February 29, 2024. https://arxiv.org/abs/2402.19173
2. BigCode. "bigcode/the-stack-v2." Hugging Face Datasets. https://huggingface.co/datasets/bigcode/the-stack-v2
3. BigCode. "bigcode/the-stack-v2-dedup." Hugging Face Datasets. https://huggingface.co/datasets/bigcode/the-stack-v2-dedup
4. Denis Kocetkov, Raymond Li, Loubna Ben Allal, et al. "The Stack: 3 TB of permissively licensed source code." arXiv:2211.15533, November 2022. https://arxiv.org/abs/2211.15533
5. Raymond Li, Loubna Ben Allal, Yangtian Zi, et al. "StarCoder: may the source be with you!" arXiv:2305.06161, May 2023. https://arxiv.org/abs/2305.06161
6. BigCode. "Datasets" and project documentation. https://www.bigcode-project.org/docs/about/the-stack/
7. BigCode. "opt-out-v2: Repository for opt-out requests." GitHub. https://github.com/bigcode-project/opt-out-v2
8. BigCode Project. "The BigCode Project Governance Card." arXiv:2312.03872, December 2023. https://arxiv.org/abs/2312.03872
9. BigCode (Hugging Face). "StarCoder2 and The Stack v2." Hugging Face Blog, February 28, 2024. https://huggingface.co/blog/starcoder2
