The Stack (BigCode dataset)

Updated Jun 27, 2026

The Stack is a family of large, permissively-licensed source-code datasets built by the BigCode project, an open scientific collaboration jointly led by Hugging Face and ServiceNow Research, to train and evaluate code large language models. The first release (The Stack v1, described in Kocetkov et al., arXiv:2211.15533, November 2022) gathered roughly 6.4 TB of code in 358 programming languages from public GitHub repositories filtered for permissive open-source licences such as MIT, Apache-2.0 and BSD, and shipped with an "Am I in The Stack?" inspection tool and a public opt-out workflow so developers could remove their code ^[1]^[2]. The Stack v2 (introduced alongside StarCoder2 in Lozhkov et al., arXiv:2402.19173, February 2024) rebuilt the pipeline on the Software Heritage archive instead of direct GitHub scraping, expanding raw size to 67.5 TB across 658 languages and yielding a 32.1 TB deduplicated training pool ^[3]^[4].

The two corpora were used to train the BigCode model family (SantaCoder, StarCoder and StarCoder2). Between 2023 and 2024 they became one of the most widely cited open foundations for code large language models, as well as a focal point for debates about scraped-code provenance, attribution, and developer consent in AI code generation.

What is The Stack?

The Stack is the largest publicly-released source-code corpus with a documented licence-filtering pipeline, assembled specifically so that open code models can be trained on inspectable, permissively-licensed data rather than undisclosed GitHub scrapes. It is published on the Hugging Face Hub, openly documented in two arXiv papers, and paired with a developer-facing consent mechanism (the "Am I in The Stack?" tool and a GitHub opt-out registry). BigCode states the dataset's governance goal directly: "BigCode aims to give people agency over their source code by letting them decide whether or not it should be used to develop and evaluate LLMs" ^[14].

Before The Stack, public source-code corpora for training code language models were either small, narrowly scoped, or opaque about their licence filtering. The CodeParrot dataset, released by Hugging Face in 2021, contained roughly 50 GB of deduplicated Python files from GitHub: about 5.36 million files, kept out of roughly 22 million Python files in the raw crawl ^[5]. The GitHub Code dataset (also hosted on Hugging Face) extended this to roughly 1 TB across 32 languages and 60 extensions but did not enforce a permissive-licence filter at the file level ^[5]. Industrial code models such as OpenAI's Codex, DeepMind's AlphaCode, and Salesforce CodeGen were trained on private, much larger crawls of GitHub whose composition, deduplication, and licence handling were not fully documented ^[1].

The BigCode collaboration was launched in 2022 by ServiceNow Research and Hugging Face explicitly to build an open counterpart: an inspectable code dataset of comparable scale with a published licence-filtering pipeline, a public removal mechanism, and an associated open-weights model family ^[2]^[6]. The first concrete deliverable was The Stack v1, posted to arXiv on 20 November 2022 with the title "The Stack: 3 TB of permissively licensed source code", authored by Denis Kocetkov, Raymond Li, Loubna Ben Allal and colleagues at ServiceNow Research and Hugging Face ^[1]. The initial v1.0 release covered 30 languages at 3.1 TB of permissively licensed code after exact deduplication (before near-deduplication); v1.1 expanded coverage to 358 languages and 193 permissive licences at roughly 6.4 TB; v1.2 incorporated opt-out requests received through 9 February 2023 and removed malicious files flagged during preparation of StarCoder ^[2]^[6].

A second-generation effort began in mid-2023 in partnership with Software Heritage, an INRIA-affiliated non-profit that maintains the largest public archive of software source code. The Stack v2 and the accompanying StarCoder2 paper were posted to arXiv on 29 February 2024, with 66 contributing authors and three model sizes (3B, 7B, 15B) trained by ServiceNow, Hugging Face and Nvidia respectively ^[3]^[4]^[7].

What is in The Stack? (dataset overview)

Property	The Stack v1.1 / v1.2	The Stack v2
Paper	arXiv:2211.15533 (2022-11-20) ^[1]	arXiv:2402.19173 (2024-02-29) ^[3]
Source	Public GitHub repositories	Software Heritage archive (snapshot 2023-09-06) ^[4]
Raw size	~6.4 TB ^[2]	67.5 TB uncompressed ^[4]^[7]
Deduplicated size	~2.9 TB ^[7]	32.1 TB ^[4]^[7]
Programming languages	358 ^[2]	658 (619 from SWH plus additional sources) ^[4]
Unique files	5.28 billion (from 51.76 billion crawled) ^[2]	3.28 billion ^[4]
Permissive licences	193 SPDX identifiers ^[2]	Blue Oak Council-approved permissive licences plus public domain ^[4]
Hugging Face repo	`bigcode/the-stack` ^[2]	`bigcode/the-stack-v2`, `the-stack-v2-dedup` ^[4]
Downstream models	SantaCoder, StarCoder, StarCoderBase, Replit Code, CodeGen2 / CodeGen2.5 ^[6]^[8]^[9]^[10]	StarCoder2 (3B / 7B / 15B) ^[3]^[7]

The v1 figures come from the v1.1 update card on the bigcode/the-stack dataset page, which is the version cited by most downstream model releases. The v2 figures come jointly from the dataset card and the StarCoder2 paper ^[2]^[4]^[3].

What was in The Stack v1 (2022)?

Why was The Stack v1 built?

The v1 paper opens by arguing that "openness around the pre-training data is important given the ongoing legal discussions around the use of open-source code repositories for training" code models, citing the absence of a published, licence-filtered code corpus at the scale used by Codex or AlphaCode ^[1]. The authors set three explicit goals: collect a permissively-licensed corpus large enough to train a competitive code model, document the pipeline in enough detail to be reproducible, and provide a governance mechanism through which developers can inspect and opt out ^[1]^[2].

How was The Stack v1 collected?

To find candidate repositories, BigCode pulled the event archives published on GHArchive between 1 January 2015 and 31 March 2022 and extracted 220.92 million unique repository names ^[1]. Of these, 137.36 million repositories were still public and reachable on GitHub; the remainder had been deleted or made private by their owners ^[1]. Files were then cloned from these public repositories between November 2021 and June 2022, yielding 51.76 billion files in total, of which 5.28 billion were unique by content hash ^[1]^[2].

Licence metadata was obtained primarily from GHArchive's repository-level licence field and supplemented with the go-license-detector library where repository-level metadata was missing ^[2]. The v1.0 release accepted 18 permissive licences; v1.1 expanded the allow-list to 193 SPDX identifiers covering MIT, Apache-2.0, BSD-2/3-Clause, ISC, Unlicense and similar licences, while explicitly removing weak-copyleft licences such as MPL, EPL and LGPL after community feedback ^[2]^[6]. The full SPDX list is shipped in the licenses.json file inside the dataset repository ^[2].
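
The allow-list logic described above can be sketched in a few lines. This is a minimal illustration, not the BigCode pipeline: the set below is a small hand-picked subset of the 193 identifiers, and the function names `resolve_licence` and `is_permissive` are hypothetical.

```python
from typing import Optional

# Illustrative subset only. The real allow-list has 193 SPDX identifiers
# and ships as licenses.json in the dataset repository.
PERMISSIVE_SPDX = {
    "MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "Unlicense",
}


def resolve_licence(repo_licence: Optional[str],
                    detected_licence: Optional[str]) -> Optional[str]:
    """Prefer repository-level metadata; fall back to a detector result."""
    return repo_licence or detected_licence


def is_permissive(repo_licence: Optional[str],
                  detected_licence: Optional[str] = None) -> bool:
    """True when the resolved SPDX identifier is on the allow-list."""
    return resolve_licence(repo_licence, detected_licence) in PERMISSIVE_SPDX
```

Under this sketch a weak-copyleft identifier such as `MPL-2.0` is rejected simply because it is absent from the set, which mirrors how removing a licence from the allow-list removes its repositories from the next release.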

How was The Stack v1 deduplicated?

Exact deduplication by SHA-1 content hash collapsed the 51.76 billion crawled files to 5.28 billion unique files ^[1]. On top of this, the team applied near-deduplication using MinHash with 256 permutations combined with Locality-Sensitive Hashing (LSH), declaring two files near-duplicates when their estimated Jaccard similarity over 5-gram token shingles exceeded 0.85 ^[11]. For the permissive subset reported in the paper, near-deduplication removed 38.6 percent of files, representing 53.7 percent of dataset volume, and reduced the permissive segment from 3.1 TB to 1.45 TB ^[11]. Ablations in the v1 paper showed that near-deduplication gave a roughly ten-percentage-point boost in HumanEval pass@100 for a 350M-parameter decoder, consistent with results observed earlier on natural-language corpora ^[1]^[11]. The reference MinHash + LSH implementation used by the project is open-sourced as near_deduplication/minhash_deduplication.py in the bigcode-project/bigcode-dataset repository ^[11].
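
The two stages can be illustrated with a standard-library sketch. It is not the project's minhash_deduplication.py: the word tokenizer and the use of salted BLAKE2 hashes in place of random permutations are assumptions, and the LSH banding step that avoids all-pairs comparison is omitted for brevity.

```python
import hashlib
import re

NUM_PERM = 256     # MinHash permutations, as stated above
SHINGLE_SIZE = 5   # 5-gram token shingles
THRESHOLD = 0.85   # Jaccard threshold for near-duplicates


def exact_dedup(files):
    """Keep the first file seen for each SHA-1 content hash."""
    seen, unique = set(), []
    for text in files:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def shingles(text):
    """Set of overlapping 5-token windows (assumed tokenizer: \\w+ runs)."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < SHINGLE_SIZE:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + SHINGLE_SIZE])
            for i in range(len(tokens) - SHINGLE_SIZE + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_signature(shingle_set):
    """One minimum per salted hash function stands in for one permutation."""
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8,
                                salt=salt).digest(), "little")
            for s in shingle_set))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a, text_b):
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) > THRESHOLD
```

With 256 slots the standard error of the estimate is at most about 0.03, which is why a threshold as tight as 0.85 is workable; a production pipeline would add LSH banding so that only candidate pairs sharing a band are compared.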

Which languages does The Stack v1 cover?

The Stack v1.1 spans 358 languages, but the bulk of the byte volume is concentrated in a small number of widely-used languages. In the 3.1 TB permissive subset reported in the v1 paper, measured before near-deduplication, the largest single language is HTML at 746 GB, followed by JavaScript at 486 GB, Java at 271 GB and C at 222 GB; these four languages together comprise over 55 percent of that subset ^[11]. The dataset card additionally lists prominent secondary languages including C++, C#, CSS, Dockerfile, Fortran, Go, Haskell, Julia, Lua, Markdown, Perl, PHP, PowerShell, Python, Ruby, Rust, Scala, Shell, SQL, TeX, TypeScript and Visual Basic ^[2]. Approximately 40 natural languages appear in source comments and docstrings, including English, Chinese, French, Portuguese, Spanish, Russian, German, Korean and Japanese ^[2].
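
The 55 percent share can be reproduced from the four sizes quoted above. The denominator is an assumption: the 3.1 TB permissive total given elsewhere in this article, taken as 3,100 GB.

```python
# Sizes in GB as quoted above for the four largest languages.
top_languages_gb = {"HTML": 746, "JavaScript": 486, "Java": 271, "C": 222}

# Assumption: the 3.1 TB permissive subset, read as 3,100 GB.
permissive_total_gb = 3100

top_four_gb = sum(top_languages_gb.values())
share = top_four_gb / permissive_total_gb
print(f"{top_four_gb} GB of {permissive_total_gb} GB = {share:.1%}")
# prints "1725 GB of 3100 GB = 55.6%"
```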

How is The Stack v1 distributed on Hugging Face?

The Stack v1 is hosted on the Hugging Face Hub under bigcode/the-stack, with an additional bigcode/the-stack-dedup repository providing the near-deduplicated split that most downstream models consumed ^[2]. The card exposes content, size, lang, ext, avg_line_length, max_line_length, alphanum_fraction, hexsha, and per-file repository metadata such as max_stars_repo_name and max_forks_count, and supports both bulk and streaming loading via the datasets library ^[2]. The dataset reported approximately 20,223 downloads in the month prior to the snapshot we consulted, and accumulated over 50,000 downloads by the time of StarCoder's release on 4 May 2023 ^[2]^[12].
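
A minimal loading sketch follows, assuming the third-party `datasets` package is installed and the dataset's terms have been accepted on the Hub with a logged-in account. The `data_dir` layout follows the pattern shown on the dataset card; the helper names and the filter thresholds are hypothetical and chosen only to show how the per-file statistics columns can be used.

```python
def stream_stack_v1(language_dir="data/python"):
    """Return a streaming iterator over one language folder of The Stack v1.

    Needs network access, a Hugging Face login and accepted dataset terms.
    """
    from datasets import load_dataset  # imported lazily on purpose
    return load_dataset("bigcode/the-stack", data_dir=language_dir,
                        split="train", streaming=True)


def looks_like_source(record, max_line_length=1000, min_alphanum=0.25):
    """Hypothetical quality filter over the card's per-file statistics."""
    return (record["max_line_length"] <= max_line_length
            and record["alphanum_fraction"] >= min_alphanum)


# Usage (not executed here because it downloads data):
#   for record in stream_stack_v1("data/python"):
#       if looks_like_source(record):
#           print(record["hexsha"], record["size"])
#           break
```

Streaming avoids downloading a multi-terabyte corpus up front, at the cost of sequential access only.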

How well do models trained on The Stack v1 perform?

The v1 paper benchmarks the dataset by training 350M-parameter decoder-only language models on the permissive Python subset and evaluating on HumanEval and MBPP. Models trained on the near-deduplicated Stack reach 37 percent pass@100 on HumanEval and 54.69 percent pass@100 on MBPP, matching or exceeding the published Codex and CodeGen baselines at the same parameter count, while using only permissively-licensed training data ^[1]. The authors frame this as feasibility evidence: an openly-licensed corpus of this scale is enough to recover the performance previously reported by closed models trained on broader, less-filtered GitHub scrapes ^[1].
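
For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: draw n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k samples passes. The sketch below implements that estimator; the sampling settings used in the v1 paper are not restated here, so the values in the usage comment are purely illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative call: 200 samples for one problem, 3 of them correct.
# score = pass_at_k(n=200, c=3, k=100)
```

The benchmark score is the mean of this quantity over all problems.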

What changed in The Stack v2 (2024)?

Why does The Stack v2 use the Software Heritage archive?

The most substantive architectural change between v1 and v2 is the move from direct GitHub crawling to using the Software Heritage (SWH) graph as the canonical source. SWH is a non-profit digital archive that ingests public source code from GitHub, GitLab, Bitbucket, Debian and other forges, producing a Merkle-DAG of files (blobs), directories, revisions and snapshots identified by SoftWare Heritage persistent IDentifiers (SWHIDs) ^[4]. BigCode extracted The Stack v2 from the 2023-09-06 snapshot of the SWH graph dataset, focusing on the latest commit on the main branch of each repository ^[4]. Using SWH rather than re-scraping GitHub gives The Stack v2 two properties that v1 lacked: every file is addressable by a permanent, content-based identifier that downstream researchers can resolve independently, and the dataset can be reconstructed from the SWH archive without re-hitting GitHub's rate limits ^[4]^[13].
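
To make "content-based identifier" concrete: under the published SWHID scheme, the identifier of a file (a content object) is the same salted SHA-1 that Git uses for blobs, prefixed with `swh:1:cnt:`. The sketch below computes it locally; the function name is hypothetical, and identifiers for directories, revisions and snapshots follow different rules that are not shown.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """SWHID (schema version 1) of a file's bytes, Git-blob compatible."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


print(swhid_for_content(b""))
# prints "swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

Because the identifier depends only on the bytes, two researchers holding the same file derive the same SWHID without contacting any server.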

The base archive covers 104.2 million GitHub repositories and 619 programming languages from SWH; on top of this, the v2 release adds supplementary high-quality sources including GitHub pull requests, Jupyter and Kaggle notebooks, package-manager documentation from 13 ecosystems, and curated Stack Exchange, arXiv and Wikipedia text used for reasoning-friendly pretraining ^[4]^[7]. GitHub Archive metadata up to 2023-09-14 supplies repository-level signals such as star counts and licence labels used during filtering ^[4].

How does The Stack v2 filter licences?

For licence labelling, v2 adopts the Blue Oak Council's published list of permissively-licensed open-source licences as the canonical permissive allow-list, supplemented by repository-level metadata from GitHub Archive and file-level fallback detection using the ScanCode Toolkit when repository-level licence information is missing ^[4]. The dataset distinguishes three licence buckets: permissively-licensed files (always included), public-domain files (included), and unlicensed files. A notable change from v1 is that v2 explicitly includes unlicensed files alongside permissive ones, while excluding copyleft and commercial licences. The unlicensed bucket was included to widen language and domain coverage, on the rationale that absence of a licence does not necessarily preclude research use under the SWH framework, but this choice has been contested in subsequent governance discussions (see Criticisms below) ^[4]^[14]. Users of v2 are required to follow Software Heritage's October 2023 statement of principles for language-model training on archived code ^[4]^[13].
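
The bucketing described above can be sketched as a small classifier. Everything here is illustrative: the two sets are placeholders rather than the Blue Oak Council list, the function names are hypothetical, and the real pipeline resolves licences with GitHub Archive metadata and ScanCode rather than with pre-supplied strings.

```python
from typing import Optional

# Placeholder sets for illustration only.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause", "ISC"}
PUBLIC_DOMAIN = {"CC0-1.0"}


def licence_bucket(repo_licence: Optional[str],
                   file_licence: Optional[str] = None) -> str:
    """Assign a file to one of the buckets discussed above."""
    licence = repo_licence or file_licence  # repo metadata first, then file
    if licence is None:
        return "unlicensed"
    if licence in PERMISSIVE:
        return "permissive"
    if licence in PUBLIC_DOMAIN:
        return "public-domain"
    return "excluded"  # copyleft, commercial or unrecognised


def keep_in_v2(repo_licence: Optional[str],
               file_licence: Optional[str] = None) -> bool:
    """v2 keeps permissive, public-domain and unlicensed files."""
    return licence_bucket(repo_licence, file_licence) != "excluded"
```

The contested design choice is visible in the first branch: a file with no detectable licence is kept rather than dropped.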

How big is The Stack v2?

The Stack v2 contains 3.28 billion unique files across 658 languages before near-deduplication, totalling 67.5 TB uncompressed ^[4]. After near-deduplication (bigcode/the-stack-v2-dedup) the corpus shrinks to 32.1 TB, a reduction of roughly 52 percent by volume (1 − 32.1/67.5 ≈ 0.52) ^[4]^[7]. Near-deduplication is performed with the same MinHash + LSH approach as v1, scaled up using the BigCode dataset pipeline ^[11]^[7]. The StarCoder2 training mixture additionally undergoes more aggressive per-source deduplication: for example, Jupyter notebook ingestion reduces 11 million notebooks to 4 million scripts (roughly 64 percent removed), and Kaggle notebooks are reduced from 3.6 million to 580,000 ^[7].

The training corpus fed to StarCoder2-15B is reported as approximately 900 billion unique tokens for the full multi-language set, with the language-restricted slice used for the 3B and 7B models being smaller ^[7]. The StarCoder2 paper notes that the training token budgets (ranging from 3.3 trillion tokens for the 3B model to 4.3 trillion for the 15B model) substantially exceed compute-optimal recommendations under Chinchilla scaling laws, following a strategy that spends extra training compute to obtain smaller, more deployable models ^[3]^[7]. Hugging Face describes the result as "the largest open code dataset suitable for LLM pretraining" ^[7].
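
A rough sense of how far these budgets exceed the compute-optimal regime comes from the tokens-per-parameter ratio. The sketch below assumes the commonly quoted rule of thumb of about 20 training tokens per parameter as a reading of the Chinchilla result, and it uses the nominal parameter counts (3B, 15B) rather than exact ones.

```python
# Assumption: ~20 tokens per parameter as a compute-optimal rule of thumb.
CHINCHILLA_TOKENS_PER_PARAM = 20


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


# (training tokens, nominal parameter count) as quoted above.
budgets = {
    "StarCoder2-3B": (3.3e12, 3e9),
    "StarCoder2-15B": (4.3e12, 15e9),
}

for name, (tokens, params) in budgets.items():
    ratio = tokens_per_parameter(tokens, params)
    multiple = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:.0f} tokens/param, about {multiple:.0f}x the heuristic")
```

On these assumptions the 3B model sees roughly 1,100 tokens per parameter and the 15B model roughly 287, both far above the heuristic.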

How is The Stack v2 distributed on Hugging Face?

The Stack v2 is hosted at bigcode/the-stack-v2 and bigcode/the-stack-v2-dedup, with a language_stats.csv listing per-language file counts and a license_stats.csv listing per-licence file counts ^[4]. The card distributes file content via Software Heritage's blob-storage endpoints rather than redistributing raw bytes from GitHub, again leaning on the SWH archive as the canonical store; bulk users can contact datasets@softwareheritage.org for access to the underlying graph snapshot ^[4]. The exact version used to train StarCoder2 is v2.0.1, which integrates opt-out requests received before 20 October 2023 ^[4].

How can developers opt out of The Stack?

What is the "Am I in The Stack?" tool?

A central design choice of the BigCode project is to surface dataset membership to the developer community rather than relying on after-the-fact takedowns. The "Am I in The Stack?" Hugging Face Space lets any user enter a GitHub username and receive a list of their repositories present in the current dataset version, with one-click links to the opt-out workflow ^[2]^[14]. The project frames this as a deliberate consent mechanism, in line with its stated goal of giving developers agency over whether their code is used to develop and evaluate LLMs ^[14].

How does the opt-out workflow work?

Opt-out requests are filed as GitHub issues in the bigcode-project/opt-out-v2 repository, listing the repositories (and optionally the commits or issues) the requester wants excluded ^[15]. The BigCode team verifies ownership through GitHub's account graph, queues validated requests for the next dataset iteration, and retains the request list to prevent flagged repositories from reappearing in future versions ^[14]^[15]. Dataset updates are issued roughly quarterly, and users of the dataset are required by its terms of use to switch to the most recent compliant version when notified ^[2]^[14].
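
The membership lookup and the retained exclusion list can be sketched over records shaped like the dataset's rows. This is an illustration under assumptions, not the project's tooling: the function names are hypothetical, repository names are compared case-insensitively, and only the `max_stars_repo_name` column is consulted even though a file may also occur in other repositories.

```python
def repos_for_user(records, username):
    """List repositories in the records that belong to one GitHub account."""
    prefix = username.lower() + "/"
    return sorted({r["max_stars_repo_name"] for r in records
                   if r["max_stars_repo_name"].lower().startswith(prefix)})


def apply_opt_outs(records, opted_out_repos):
    """Drop every record whose repository is on the opt-out list."""
    blocked = {name.lower() for name in opted_out_repos}
    return [r for r in records
            if r["max_stars_repo_name"].lower() not in blocked]


# Usage with toy records:
# records = [{"max_stars_repo_name": "alice/tool"},
#            {"max_stars_repo_name": "bob/lib"}]
# apply_opt_outs(records, ["alice/tool"])  # keeps only bob/lib
```

Re-applying the same list on every rebuild is what keeps a removed repository from reappearing in a later version.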

By the time of StarCoder's release on 4 May 2023, 44 developers had opted out of The Stack and their repositories had been excluded from training ^[14]. The opt-out repository has since accumulated substantially more activity: between 2024 and 2026 the open-issues count grew to several hundred, and one community estimate cites more than 1,700 opt-out issues filed across The Stack's lifetime ^[15]^[16].

How is personal data (PII) handled?

Personally identifiable information was treated as a separate problem from licence filtering. The StarCoder paper documents a dedicated PII detection model trained on annotations collected through Toloka from 1,399 crowd workers across 35 countries, covering a corpus of 22,950 secrets; it reports an F1 score of roughly 90 percent and outperforms regex-based baselines, with particularly large gains on secret keys (API tokens, private keys) ^[12]^[14]. PII redaction is applied to the training set fed to StarCoder rather than to the published Stack v1 itself; v2 inherits the same pipeline for StarCoder2 training ^[3]^[12].
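
For contrast with the learned detector, a regex baseline of the kind mentioned above might look like the following sketch. The patterns and placeholder tokens are assumptions chosen for illustration and are not the ones used by BigCode.

```python
import re

# Deliberately simple patterns; real-world PII is messier than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text: str) -> str:
    """Replace e-mail addresses and IPv4-looking strings with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub("<IP_ADDRESS>", text)


print(redact("maintainer: jane.doe@example.com, host = 192.168.0.1"))
# prints "maintainer: <EMAIL>, host = <IP_ADDRESS>"
```

The weakness is easy to see: a four-part version string such as 1.2.3.4 is redacted as an IP address, and free-form secrets such as API tokens have no fixed shape to match, which is where a trained model has room to improve on regexes.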

How are malicious files removed?

Stack v1.2 added removal of files flagged as malicious during preparation of StarCoder, on the basis that publishing malware samples in pretraining data risks producing models capable of regenerating known exploits ^[2]. The v2 dataset card states the same policy is inherited in the SWH-based pipeline ^[4].

Which models were trained on The Stack?

The Stack has been the explicit pretraining corpus or a major component for an unusually wide range of open code language models. The table below summarises the most prominent downstream uses.

Model	Year	Parameters	Pretraining corpus	Stack version
SantaCoder	2023	1.1B	Java + JavaScript + Python subset of The Stack	v1.1 ^[9]
StarCoder / StarCoderBase	2023	15.5B	1 trillion tokens from The Stack across 80+ languages	v1.2 ^[12]
Replit Code v1	2023	2.7B	Subset of The Stack near-deduplicated	v1.2 ^[10]
Salesforce CodeGen2	2023	1B, 3.7B, 7B, 16B	Permissive subset of dedup Stack	v1.1 ^[8]
Salesforce CodeGen2.5	2023	7B	StarCoderData (derived from The Stack)	v1.2 derivative ^[8]
StarCoder2 (3B / 7B / 15B)	2024	3B, 7B, 15B	900B-token training mix from Stack v2 plus supplementary sources	v2.0.1 ^[3]^[7]

SantaCoder, released January 2023, was BigCode's first technical demonstration that The Stack could train a competitive 1.1B-parameter Multi-Query-Attention decoder on the Java/JavaScript/Python slice of v1.1, outperforming the previous open Python baselines on MultiPL-E ^[9]. StarCoder (May 2023) scaled this to 15.5B parameters trained on 1 trillion tokens from v1.2, with 8K-token context, fill-in-the-middle objective, and the PII-redaction pipeline described above ^[12]. StarCoder2 (February 2024) replaced the v1-based corpus with the Stack v2 mixture and grew the largest model's training budget to 4.3 trillion tokens, with the 3B trained by ServiceNow, the 7B by Hugging Face, and the 15B by Nvidia ^[3]^[7].

External adopters include Replit, whose Replit-Code-v1-3B is trained on a near-deduplicated subset of Stack v1.2 ^[10], and Salesforce, whose CodeGen2 series and CodeGen2.5 use respectively the strict permissive subset of dedup Stack v1.1 and the StarCoderData derivative built from v1.2 ^[8]. The Stack also serves as the substrate for community fine-tunes such as santacoder-finetuned-the-stack-clojure, which extends SantaCoder to languages outside the original Java/JavaScript/Python slice using The Stack's per-language splits ^[17].

How does The Stack compare with other code datasets?

The Stack sits in a small family of public source-code corpora used to train code large language models. Differences in scale, licence stringency, and data source shape how each is used.

Dataset	Year	Size	Languages	Licence filter	Source
CodeParrot	2021	~50 GB (180 GB raw)	Python only	None enforced	GitHub Code crawl ^[5]
GitHub Code	2021	~1 TB	32	None enforced	GitHub crawl ^[5]
AlphaCode pretraining set	2022	~715 GB	Multiple	Permissive (described, not released)	GitHub ^[1]
CodeGen pretraining set	2022	Multiple TB (private)	Multiple	Permissive (described, not released)	GitHub + BigQuery ^[1]
The Stack v1.1	2022	6.4 TB raw, 2.9 TB dedup	358	193 SPDX permissive licences	GitHub ^[2]
The Stack v2	2024	67.5 TB raw, 32.1 TB dedup	658	Blue Oak permissive + public domain + unlicensed	Software Heritage ^[4]
Common Pile (code split)	2024	Within ~8 TB total mixed-domain	Multiple	Permissive + public domain	Mixed ^[18]

Common Crawl is also occasionally used to extract code-like content for general-purpose pretraining, but its code coverage is incidental rather than curated and is not subject to a licence filter ^[1]. The Stack therefore occupies a distinctive niche: it is the only publicly-released, file-level licence-filtered source-code corpus at multi-terabyte scale, and the only one with a structured opt-out and inspection workflow. Surveys of code LLMs published in 2023 to 2024 consistently treat it as the de facto open analogue of the unreleased AlphaCode and Codex training sets ^[19].

What are the limitations and criticisms of The Stack?

Why is licence detection imperfect?

The v1 licence filter relies on a mixture of GHArchive repository metadata and file-level scanning, neither of which is perfect. Repository-level licence labels can disagree with the licence stated inside individual files, especially in large monorepos that vendor third-party code under non-permissive licences; the file-level fallback detector can miss licences expressed in non-standard headers ^[2]. The v2 card moves to file-level ScanCode detection with the Blue Oak Council allow-list, but the same class of false positives and false negatives still applies ^[4].

Why are unlicensed files in v2 controversial?

The most-discussed governance change between v1 and v2 is the explicit inclusion of unlicensed files in v2. Critics on the bigcode/the-stack-v2 discussion forum have argued that under most jurisdictions absence of a licence means the default copyright reservation applies, so including unlicensed files conflicts with the BigCode project's earlier framing of permissive-only training data ^[4]^[16]. The same threads observe that some files carrying attribution-requiring licences (for example, BSD) are present in the dataset without an accompanying attribution mechanism in downstream models trained on it ^[16].

How well does opt-out actually work?

Although the opt-out mechanism has processed dozens of requests since 2023, community reports note that the volume of incoming requests has at times outstripped the cadence of dataset re-releases, and some early requests waited multiple releases for full removal ^[15]^[16]. Discussions on the v2 dataset page also raise the concern that opt-outs filed against v1 repositories did not always propagate cleanly into the v2 pipeline, because v2 is reconstructed from the Software Heritage graph rather than the original GitHub repository identifiers used by the v1 opt-out registry ^[4]^[16].

Does The Stack contain vulnerable or low-quality code?

A 2025 follow-on study, "Cracks in The Stack" (arXiv:2501.02628), audits a subset of The Stack for known-vulnerable code patterns and licence-attribution failures, reporting that a substantial fraction of files contain CWE-mapped vulnerabilities or non-standard licence headers that the BigCode pipeline did not flag ^[20]. The study does not call for the dataset to be withdrawn but argues that pretraining on it without an additional vulnerability filter risks teaching code models to reproduce known exploits, echoing the rationale for the v1.2 malicious-file removal ^[20].

What about memorisation and attribution?

Like any pretraining-scale code corpus, The Stack contains many near-duplicates of widely-copied code, increasing the risk that downstream models memorise and reproduce identifiable spans verbatim. The StarCoder release ships an attribution-tracing tool that lets users search a generated snippet against the training set, partly to mitigate this concern, but the tool requires opt-in use and does not absolve downstream applications of attribution responsibilities under the licences of the original files ^[12].

Why is The Stack significant?

The Stack is significant in several overlapping ways. First, it is the largest publicly-released code training dataset with a documented licence-filtering pipeline, making it the de facto reproducibility baseline for any open code large language model released after late 2022; the entire BigCode model family (SantaCoder, StarCoder, StarCoder2) and most third-party open code models from Salesforce and Replit in 2023 to 2024 list it as their primary corpus ^[6]^[8]^[10]^[12].

Second, the "Am I in The Stack?" tool and the GitHub-based opt-out workflow set a widely-imitated template for dataset-level consent in machine-learning pretraining, predating most regulatory consent mandates for training-data use; the workflow has since been studied as a governance case by the Turing Way and cited in surveys of AI code generation practice ^[14]^[19]. Third, by relocating from raw GitHub scraping to the Software Heritage archive in v2, BigCode demonstrated that an existing open-source preservation infrastructure can serve as a stable, addressable substrate for AI training data, an architecture pattern that competing open code datasets and parts of the broader open-data community have begun to emulate ^[4]^[13].

Finally, the explicit comparison between models trained on The Stack and earlier closed corpora helped establish that openly-licensed pretraining data is sufficient to recover competitive code-generation performance on standard benchmarks such as HumanEval and MBPP, removing a key empirical argument against transparent code-LLM development ^[1]^[7].

See also

StarCoder and the StarCoderBase / StarCoder2 model family, the canonical BigCode models trained on The Stack ^[3]^[12]
SantaCoder, the 1.1B-parameter precursor model demonstrating feasibility of training on Stack v1.1 ^[9]
The Pile, the EleutherAI general-purpose pretraining corpus, which includes a GitHub split that predates The Stack but has no permissive licence filter
Common Pile, a 2024 successor to The Pile aiming for permissively-licensed content across modalities, which overlaps in goals with The Stack's licence-filtering approach ^[18]
RedPajama, an open reproduction of the LLaMA pretraining mix, used together with The Stack in several mixed-domain pretraining recipes
CodeContests and HumanEval / MBPP, standard evaluation suites for models trained on The Stack
AI code generation, the broader application area for which The Stack is the dominant open training resource

References

1. Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Munoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries, "The Stack: 3 TB of permissively licensed source code", arXiv preprint, 2022-11-20. https://arxiv.org/abs/2211.15533. Accessed 2026-05-20.
2. BigCode / Hugging Face, "bigcode/the-stack dataset card", Hugging Face Hub, 2023 (versions v1.0, v1.1, v1.2). https://huggingface.co/datasets/bigcode/the-stack. Accessed 2026-05-20.
3. Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei et al., "StarCoder 2 and The Stack v2: The Next Generation", arXiv preprint, 2024-02-29. https://arxiv.org/abs/2402.19173. Accessed 2026-05-20.
4. BigCode / Hugging Face, "bigcode/the-stack-v2 dataset card", Hugging Face Hub, 2024. https://huggingface.co/datasets/bigcode/the-stack-v2. Accessed 2026-05-20.
5. Hugging Face, "CodeParrot and GitHub Code dataset documentation", Hugging Face Hub, 2021 to 2022. https://huggingface.co/datasets/codeparrot/codeparrot-clean. Accessed 2026-05-20.
6. BigCode project, "The Stack overview page", BigCode docs, 2023. https://www.bigcode-project.org/docs/about/the-stack/. Accessed 2026-05-20.
7. Leandro von Werra and Loubna Ben Allal, "StarCoder2 and The Stack v2", Hugging Face blog, 2024-02-28. https://huggingface.co/blog/starcoder2. Accessed 2026-05-20.
8. Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, Yingbo Zhou, "CodeGen2.5: Small, but mighty", Salesforce blog and CodeGen GitHub repository, 2023-07. https://www.salesforce.com/blog/codegen25/. Accessed 2026-05-20.
9. Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey et al., "SantaCoder: don't reach for the stars!", arXiv preprint 2301.03988, 2023-01-09. https://arxiv.org/abs/2301.03988. Accessed 2026-05-20.
10. Replit, "Replit-Code-v1-3B model card", Hugging Face Hub, 2023. https://huggingface.co/replit/replit-code-v1-3b. Accessed 2026-05-20.
11. BigCode project, "bigcode-dataset / near_deduplication / minhash_deduplication.py", GitHub repository, 2022 to 2023. https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/minhash_deduplication.py. Accessed 2026-05-20.
12. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim et al., "StarCoder: may the source be with you!", arXiv preprint 2305.06161, 2023-05-09. https://arxiv.org/abs/2305.06161. Accessed 2026-05-20.
13. Software Heritage, "Statement on LLM for code", softwareheritage.org blog, 2023-10-19. https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/. Accessed 2026-05-20.
14. The Turing Way, "BigCode data governance case study", The Turing Way handbook, 2023. https://book.the-turing-way.org/project-design/data-security/data-governance/bigcode-casestudy/. Accessed 2026-05-20.
15. BigCode project, "opt-out-v2 repository", GitHub, 2023 to 2026. https://github.com/bigcode-project/opt-out-v2. Accessed 2026-05-20.
16. Hugging Face community, "bigcode/the-stack-v2 Report: Legal issue(s) discussion", Hugging Face dataset discussion, 2024. https://huggingface.co/datasets/bigcode/the-stack-v2/discussions/31. Accessed 2026-05-20.
17. Manuel Romero, "santacoder-finetuned-the-stack-clojure model card", Hugging Face Hub, 2023. https://huggingface.co/mrm8488/santacoder-finetuned-the-stack-clojure. Accessed 2026-05-20.
18. EleutherAI, "Common Pile project page", 2024. https://github.com/EleutherAI/common-pile. Accessed 2026-05-20.
19. Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, Sunghun Kim, "A Survey on Large Language Models for Code Generation", arXiv preprint 2406.00515, 2024-06. https://arxiv.org/abs/2406.00515. Accessed 2026-05-20.
20. "Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets", arXiv preprint 2501.02628, 2025-01. https://arxiv.org/abs/2501.02628. Accessed 2026-05-20.
