The Stack (BigCode dataset)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,364 words
The Stack is a family of permissively-licensed source-code datasets assembled by the BigCode project, an open scientific collaboration jointly led by ServiceNow Research and Hugging Face. The v1 series, introduced in Kocetkov et al. (arXiv:2211.15533, November 2022), grew by v1.1 to roughly 6.4 TB of code in 358 programming languages drawn from public GitHub repositories filtered for permissive open-source licences such as MIT, Apache-2.0 and BSD, and shipped with an "Am I in The Stack?" inspection tool and a public opt-out workflow [1][2]. The Stack v2, introduced alongside StarCoder2 in Lozhkov et al. (arXiv:2402.19173, February 2024), reorganises the pipeline around the Software Heritage archive rather than direct GitHub scraping, expanding raw size to 67.5 TB across 658 languages and producing a 32.1 TB deduplicated training pool that underlies the StarCoder2 model family [3][4]. Together the two corpora became one of the most widely-cited foundations for openly-trained code large language models in 2023 and 2024, and a focal point for debates about scraped-code provenance, attribution, and developer consent in AI code generation.
Before The Stack, public source-code corpora for training code language models were either small, narrowly scoped, or opaque about their licence filtering. The CodeParrot dataset, released by Hugging Face in 2021, contained roughly 50 GB of deduplicated Python files from GitHub Code, comprising about 5.36 million files and 22 million Python entries in total [5]. The GitHub Code dataset (also hosted on Hugging Face) extended this to roughly 1 TB across 32 languages and 60 extensions but did not enforce a permissive-licence filter at the file level [5]. Industrial code models such as OpenAI's Codex, DeepMind's AlphaCode, and Salesforce CodeGen were trained on private, much larger crawls of GitHub whose composition, deduplication, and licence handling were not fully documented [1].
The BigCode collaboration was launched in 2022 by ServiceNow Research and Hugging Face explicitly to build an open counterpart: an inspectable code dataset of comparable scale with a published licence-filtering pipeline, a public removal mechanism, and an associated open-weights model family [2][6]. The first concrete deliverable was The Stack v1, posted to arXiv on 20 November 2022 with the title "The Stack: 3 TB of permissively licensed source code", authored by Denis Kocetkov, Raymond Li, Loubna Ben Allal and colleagues at ServiceNow Research and Hugging Face [1]. The initial v1.0 release covered 30 languages at 3.1 TB after deduplication; v1.1 expanded coverage to 358 languages and 193 permissive licences at roughly 6.4 TB; v1.2 incorporated opt-out requests received through 9 February 2023 and removed malicious files flagged during preparation of StarCoder [2][6].
A second-generation effort began in mid-2023 in partnership with Software Heritage, an INRIA-affiliated non-profit that maintains the largest public archive of software source code. The Stack v2 and the accompanying StarCoder2 paper were posted to arXiv on 29 February 2024, with 66 contributing authors and three model sizes (3B, 7B, 15B) trained by ServiceNow, Hugging Face and Nvidia respectively [3][4][7].
| Property | The Stack v1.1 / v1.2 | The Stack v2 |
|---|---|---|
| Paper | arXiv:2211.15533 (2022-11-20) [1] | arXiv:2402.19173 (2024-02-29) [3] |
| Source | Public GitHub repositories | Software Heritage archive (snapshot 2023-09-06) [4] |
| Raw size | ~6.4 TB [2] | 67.5 TB uncompressed [4][7] |
| Deduplicated size | ~2.9 TB [7] | 32.1 TB [4][7] |
| Programming languages | 358 [2] | 658 (619 from SWH plus additional sources) [4] |
| Unique files | 5.28 billion (from 51.76 billion crawled) [2] | 3.28 billion [4] |
| Permissive licences | 193 SPDX identifiers [2] | Blue Oak Council-approved permissive licences plus public domain [4] |
| Hugging Face repo | bigcode/the-stack [2] | bigcode/the-stack-v2, the-stack-v2-dedup [4] |
| Downstream models | SantaCoder, StarCoder, StarCoderBase, Replit Code, CodeGen2 / CodeGen2.5 [6][8][9][10] | StarCoder2 (3B / 7B / 15B) [3][7] |
The v1 figures come from the v1.1 update card on the bigcode/the-stack dataset page, which is the version cited by most downstream model releases. The v2 figures come jointly from the dataset card and the StarCoder2 paper [2][4][3].
The v1 paper opens by arguing that "openness around the pre-training data is important given the ongoing legal discussions around the use of open-source code repositories for training" code models, citing the absence of a published, licence-filtered code corpus at the scale used by Codex or AlphaCode [1]. The authors set three explicit goals: collect a permissively-licensed corpus large enough to train a competitive code model, document the pipeline in enough detail to be reproducible, and provide a governance mechanism through which developers can inspect and opt out [1][2].
To find candidate repositories, BigCode pulled the event archives published on GHArchive between 1 January 2015 and 31 March 2022 and extracted 220.92 million unique repository names [1]. Of these, 137.36 million repositories were still public and reachable on GitHub; the remainder had been deleted or made private by their owners [1]. These public repositories were then cloned between November 2021 and June 2022, yielding 51.76 billion files in total, of which 5.28 billion were unique by content hash [1][2].
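The repository-discovery step can be illustrated with a minimal sketch. It assumes GH Archive's hourly dumps are gzipped JSON-lines files in which each event carries a `repo.name` field of the form `owner/name`; the function names and the sample events are hypothetical, and this is not the BigCode crawler itself.

```python
import gzip
import json
from typing import Iterable, Set


def unique_repo_names(lines: Iterable[str]) -> Set[str]:
    """Collect distinct "owner/name" strings from GH Archive-style event lines."""
    names: Set[str] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the scan
        if not isinstance(event, dict):
            continue
        repo = event.get("repo") or {}
        name = repo.get("name") if isinstance(repo, dict) else None
        if name:
            names.add(name)
    return names


def unique_repo_names_from_gz(path: str) -> Set[str]:
    """Read one gzipped hourly archive from a (hypothetical) local path."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return unique_repo_names(handle)


# Made-up events: two touch the same repository, one line is not JSON.
sample = [
    '{"type": "PushEvent", "repo": {"id": 1, "name": "octo/alpha"}}',
    '{"type": "WatchEvent", "repo": {"id": 1, "name": "octo/alpha"}}',
    '{"type": "ForkEvent", "repo": {"id": 2, "name": "octo/beta"}}',
    "not json",
]
print(sorted(unique_repo_names(sample)))
```

At the scale described above, the set of names would not fit comfortably in one process, so a real pipeline would more plausibly shard the archives and merge per-shard results.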
Licence metadata was obtained primarily from GHArchive's repository-level licence field and cross-checked with the go-license-detector library where repository-level metadata was missing [2]. The v1.0 release accepted 18 permissive licences; v1.1 expanded the allow-list to 193 SPDX identifiers covering MIT, Apache-2.0, BSD-2/3-Clause, ISC, Unlicense and similar licences, while explicitly removing weak-copyleft licences such as MPL, EPL and LGPL after community feedback [2][6]. The full SPDX list is shipped in the licenses.json file inside the dataset repository [2].
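A minimal sketch of allow-list filtering follows. The record fields `metadata_licence` and `detected_licence` are hypothetical names standing in for the repository-level label and the detector fallback, and the allow-list shown is a small illustrative subset rather than the 193-entry list; the structure of licenses.json is not assumed here.

```python
from typing import Dict, Iterable, Iterator, Optional

# Illustrative subset only; the published allow-list has 193 SPDX identifiers.
PERMISSIVE = frozenset(
    {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "Unlicense"}
)


def resolve_licence(repo: Dict[str, Optional[str]]) -> Optional[str]:
    """Prefer the repository-level label; fall back to a detector result."""
    return repo.get("metadata_licence") or repo.get("detected_licence")


def keep_permissive(
    repos: Iterable[Dict[str, Optional[str]]],
    allow: frozenset = PERMISSIVE,
) -> Iterator[Dict[str, Optional[str]]]:
    """Yield only repositories whose resolved SPDX identifier is allow-listed."""
    for repo in repos:
        if resolve_licence(repo) in allow:
            yield repo


repos = [
    {"name": "octo/alpha", "metadata_licence": "MIT", "detected_licence": None},
    {"name": "octo/beta", "metadata_licence": None, "detected_licence": "Apache-2.0"},
    {"name": "octo/gamma", "metadata_licence": "MPL-2.0", "detected_licence": None},
    {"name": "octo/delta", "metadata_licence": None, "detected_licence": None},
]
print([repo["name"] for repo in keep_permissive(repos)])
```

Treating a missing licence as "not allowed" mirrors the v1 behaviour described above: repositories with weak-copyleft or no detectable licence are simply dropped.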
Exact deduplication by SHA-1 content hash collapsed the 51.76 billion crawled files to 5.28 billion unique files [1]. On top of this the team applied near-deduplication using MinHash with 256 permutations combined with Locality-Sensitive Hashing (LSH), declaring two files near-duplicates when their estimated Jaccard similarity over 5-gram token shingles exceeded 0.85 [11]. For the permissive subset reported in the paper, near-deduplication removed 38.6 percent of files representing 53.7 percent of dataset volume, reducing the permissive segment from 3.1 TB to 1.45 TB [11]. Ablations in the v1 paper showed that near-dedup gave a roughly ten-percentage-point boost in HumanEval pass@100 for a 350M-parameter decoder, confirming results that had earlier been observed on natural-language corpora [1][11]. The reference MinHash + LSH implementation used by the project is open-sourced as near_deduplication/minhash_deduplication.py in the bigcode-project/bigcode-dataset repository [11].
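The two deduplication stages can be sketched in plain Python. This is an illustration under stated assumptions, not the project's script: the regex tokeniser and the salted BLAKE2 hash family are stand-ins, and the LSH banding step is omitted, so candidate pairs are compared directly rather than bucketed.

```python
import hashlib
import re
from typing import Dict, List, Set, Tuple

NUM_PERM = 256      # number of hash functions, as stated in the text
SHINGLE_SIZE = 5    # 5-gram token shingles
THRESHOLD = 0.85    # Jaccard cut-off for "near-duplicate"

Shingle = Tuple[str, ...]


def exact_dedup(files: Dict[str, str]) -> Dict[str, str]:
    """Keep the first path seen for each distinct SHA-1 content hash."""
    seen: Set[str] = set()
    kept: Dict[str, str] = {}
    for path, content in files.items():
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept[path] = content
    return kept


def shingles(content: str, n: int = SHINGLE_SIZE) -> Set[Shingle]:
    """Token n-grams; the regex tokeniser is an assumption for illustration."""
    tokens = re.findall(r"\w+|[^\w\s]", content)
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash(shingle_set: Set[Shingle], num_perm: int = NUM_PERM) -> List[int]:
    """For each salted hash function, record the minimum hash over all shingles."""
    encoded = [" ".join(s).encode("utf-8") for s in shingle_set]
    signature: List[int] = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big"
            )
            for item in encoded
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of agreeing signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(a: str, b: str) -> bool:
    """Compare two files directly; a real pipeline would use LSH to find pairs."""
    return estimated_jaccard(minhash(shingles(a)), minhash(shingles(b))) > THRESHOLD


files = {"a.py": "x = 1\n", "b.py": "x = 1\n", "c.py": "y = 2\n"}
print(sorted(exact_dedup(files)))
```

Comparing every pair of signatures is quadratic in the number of files, which is why LSH is needed at corpus scale: it groups signatures into bands so that only files sharing a band are compared.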
The Stack v1.1 spans 358 languages, but the bulk of the byte volume is concentrated in a small number of widely-used languages. Within the 3.1 TB permissive subset, before near-deduplication, the largest single language is HTML at 746 GB, followed by JavaScript at 486 GB, Java at 271 GB and C at 222 GB; these four languages total about 1,725 GB, just over 55 percent of the permissive corpus [11]. The dataset card additionally lists prominent secondary languages including C++, C#, CSS, Dockerfile, Fortran, Go, Haskell, Julia, Lua, Markdown, Perl, PHP, PowerShell, Python, Ruby, Rust, Scala, Shell, SQL, TeX, TypeScript and Visual Basic [2]. Approximately 40 natural languages appear in source comments and docstrings, including English, Chinese, French, Portuguese, Spanish, Russian, German, Korean and Japanese [2].
The Stack v1 is hosted on the Hugging Face Hub under bigcode/the-stack, with an additional bigcode/the-stack-dedup repository providing the near-deduplicated split that most downstream models consumed [2]. The card exposes content, size, lang, ext, avg_line_length, max_line_length, alphanum_fraction, hexsha, and per-file repository metadata such as max_stars_repo_name and max_forks_count, and supports both bulk and streaming loading via the datasets library [2]. The dataset reported approximately 20,223 downloads in the month prior to the snapshot we consulted, and accumulated over 50,000 downloads by the time of StarCoder's release on 4 May 2023 [2][12].
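Streaming access and the per-file quality columns can be sketched as follows. The `data_dir="data/python"` layout is an assumption about how the repository is organised, access is gated behind accepting the dataset terms on the Hub, and the filter thresholds are illustrative values rather than anything prescribed by the card.

```python
from typing import Any, Dict, Iterable, Iterator


def stream_stack_language(lang_dir: str = "data/python"):
    """Stream one language directory of bigcode/the-stack (needs Hub login)."""
    # Imported lazily so the pure-Python filter below has no hard dependency.
    from datasets import load_dataset

    return load_dataset(
        "bigcode/the-stack", data_dir=lang_dir, split="train", streaming=True
    )


def plausible_source_files(
    rows: Iterable[Dict[str, Any]],
    max_avg_line_length: float = 100.0,   # illustrative threshold
    max_line_length: int = 1000,          # illustrative threshold
    min_alphanum_fraction: float = 0.25,  # illustrative threshold
) -> Iterator[Dict[str, Any]]:
    """Drop rows whose line statistics suggest minified or generated content."""
    for row in rows:
        if (
            row["avg_line_length"] <= max_avg_line_length
            and row["max_line_length"] <= max_line_length
            and row["alphanum_fraction"] >= min_alphanum_fraction
        ):
            yield row


# Made-up rows using the column names listed on the dataset card.
fake_rows = [
    {"hexsha": "aa", "avg_line_length": 32.0, "max_line_length": 88, "alphanum_fraction": 0.71},
    {"hexsha": "bb", "avg_line_length": 4096.0, "max_line_length": 4096, "alphanum_fraction": 0.60},
    {"hexsha": "cc", "avg_line_length": 20.0, "max_line_length": 40, "alphanum_fraction": 0.05},
]
print([row["hexsha"] for row in plausible_source_files(fake_rows)])
```

In practice the filter would be applied to the iterator returned by `stream_stack_language()`, which avoids downloading a full language split before inspecting it.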
The v1 paper benchmarks the dataset by training 350M-parameter decoder-only language models on the permissive Python subset and evaluating on HumanEval and MBPP. Models trained on the near-deduplicated Stack reach 37 percent pass@100 on HumanEval and 54.69 percent pass@100 on MBPP, matching or exceeding the published Codex and CodeGen baselines at the same parameter count, while using only permissively-licensed training data [1]. The authors frame this as feasibility evidence: an openly-licensed corpus of this scale is enough to recover the performance previously reported by closed models trained on broader, less-filtered GitHub scrapes [1].
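For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct. Whether the v1 paper used exactly this procedure and sample count is an assumption here; the sketch only shows the estimator.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for a problem
    c: samples that passed the tests
    k: evaluation budget (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimates; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes: (samples generated, samples passing).
print(mean_pass_at_k([(200, 0), (200, 1), (200, 200)], k=100))
```

The benchmark score is the mean of the per-problem estimates, so a few problems solved by a single rare sample can move pass@100 noticeably while leaving pass@1 almost unchanged.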
The most substantive architectural change between v1 and v2 is the move from direct GitHub crawling to using the Software Heritage (SWH) graph as the canonical source. SWH is a non-profit digital archive that ingests public source code from GitHub, GitLab, Bitbucket, Debian and other forges, producing a Merkle-DAG of files (blobs), directories, revisions and snapshots identified by SoftWare Heritage persistent IDentifiers (SWHIDs) [4]. BigCode extracted The Stack v2 from the 2023-09-06 snapshot of the SWH graph dataset, focusing on the latest commit on the main branch of each repository [4]. Using SWH rather than re-scraping GitHub gives The Stack v2 two properties that v1 lacked: every file is addressable by a permanent, content-based identifier that downstream researchers can resolve independently, and the dataset can be reconstructed from the SWH archive without re-hitting GitHub's rate limits [4][13].
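Content-based addressing can be made concrete with a short sketch. A SWHID for a file has the form `swh:1:cnt:<hash>`, where the hash is computed the same way Git hashes a blob: SHA-1 over a `blob <length>\0` header followed by the raw bytes. The helper name below is hypothetical.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the SWHID of a file's raw bytes (Git-style blob hash)."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


# The empty file has a well-known Git blob hash.
print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(content_swhid(b"print('hello')\n"))
```

Because the identifier depends only on the bytes, two researchers holding the same file derive the same SWHID without consulting GitHub or the archive, which is what makes the identifiers independently resolvable.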
The base archive covers 104.2 million GitHub repositories and 619 programming languages from SWH; on top of this, the v2 release adds supplementary high-quality sources including GitHub pull requests, Jupyter and Kaggle notebooks, package-manager documentation from 13 ecosystems, and curated Stack Exchange, arXiv and Wikipedia text used for reasoning-friendly pretraining [4][7]. GitHub Archive metadata up to 2023-09-14 supplies repository-level signals such as star counts and licence labels used during filtering [4].
For licence labelling, v2 adopts the Blue Oak Council's published list of permissively-licensed open-source licences as the canonical permissive allow-list, supplemented by repository-level metadata from GitHub Archive and file-level fallback detection using the ScanCode Toolkit when repository-level licence information is missing [4]. The dataset distinguishes three licence buckets: permissively-licensed files (always included), public-domain files (included), and unlicensed files. A notable change from v1 is that v2 explicitly includes unlicensed files alongside permissive ones, while excluding copyleft and commercial licences. The unlicensed bucket was included to widen language and domain coverage, on the rationale that absence of a licence does not necessarily preclude research use under the SWH framework, but this choice has been contested in subsequent governance discussions (see Criticisms below) [4][14]. Users of v2 are required to follow Software Heritage's October 2023 statement of principles for language-model training on archived code [4][13].
The Stack v2 contains 3.28 billion unique files across 658 languages in its undeduplicated form, totalling 67.5 TB uncompressed [4]. After near-deduplication (bigcode/the-stack-v2-dedup) the corpus shrinks to 32.1 TB, a reduction of roughly 52 percent by volume [4][7]. Near-deduplication is performed with the same MinHash + LSH approach as v1, scaled up using the BigCode dataset pipeline [11][7]. The StarCoder2 training mixture additionally undergoes more aggressive per-source deduplication: for example, Jupyter notebook ingestion reduces 11 million notebooks to 4 million scripts (roughly 64 percent fewer items), and Kaggle notebooks are reduced from 3.6 million to 580,000 [7].
The training corpus actually fed to StarCoder2-15B is reported as approximately 900 billion unique tokens for the full multi-language set, with the language-restricted slice used for the 3B and 7B models being smaller [7]. The StarCoder2 paper notes that the training token budgets (between 3.3 and 4.3 trillion tokens across the 3B, 7B and 15B models) substantially exceed compute-optimal recommendations under Chinchilla scaling laws, following a strategy that trades compute for smaller, more deployable models [3][7].
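A rough calculation shows how far these budgets sit from the compute-optimal regime. The sketch assumes the commonly quoted Chinchilla rule of thumb of about 20 training tokens per parameter, nominal parameter counts of 3 and 15 billion, and that the 3.3 and 4.3 trillion-token budgets belong to the smallest and largest models respectively; all three are simplifications.

```python
TOKENS_PER_PARAM_RULE = 20.0  # assumed Chinchilla-style heuristic


def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def overtraining_factor(
    params: float, tokens: float, rule: float = TOKENS_PER_PARAM_RULE
) -> float:
    """How many times the rule-of-thumb budget the actual budget represents."""
    return tokens_per_param(params, tokens) / rule


# Nominal (parameter count, training tokens) pairs; values are approximations.
budgets = {
    "StarCoder2-3B": (3e9, 3.3e12),
    "StarCoder2-15B": (15e9, 4.3e12),
}

for name, (params, tokens) in budgets.items():
    print(
        f"{name}: {tokens_per_param(params, tokens):.0f} tokens per parameter, "
        f"{overtraining_factor(params, tokens):.1f}x the rule of thumb"
    )
```

Under these assumptions the smallest model sees on the order of fifty times the heuristic budget and the largest about fourteen times, which is the sense in which the models are trained well past compute-optimal.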
The Stack v2 is hosted at bigcode/the-stack-v2 and bigcode/the-stack-v2-dedup, with a language_stats.csv listing per-language file counts and a license_stats.csv listing per-licence file counts [4]. The card distributes file content via Software Heritage's blob-storage endpoints rather than redistributing raw bytes from GitHub, again leaning on the SWH archive as the canonical store; bulk users can contact datasets@softwareheritage.org for access to the underlying graph snapshot [4]. The exact version used to train StarCoder2 is v2.0.1, which integrates opt-out requests received before 20 October 2023 [4].
A central design choice of the BigCode project is to surface dataset membership to the developer community rather than relying on after-the-fact takedowns. The "Am I in The Stack?" Hugging Face Space lets any user enter a GitHub username and receive a list of their repositories present in the current dataset version, with one-click links to the opt-out workflow [2][14]. The card frames this as a deliberate consent mechanism: "BigCode aims to give people agency over their source code by letting them decide whether or not it should be used to develop and evaluate LLMs" [14].
Opt-out requests are filed as GitHub issues in the bigcode-project/opt-out-v2 repository, listing the repositories (and optionally the commits or issues) the requester wants excluded [15]. The BigCode team verifies ownership through GitHub's account graph, queues validated requests for the next dataset iteration, and retains the request list to prevent flagged repositories from reappearing in future versions [14][15]. Dataset updates are issued roughly quarterly, and users of the dataset are required by its terms of use to switch to the most recent compliant version when notified [2][14].
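The exclusion step can be sketched as a filter over dataset rows. This is an assumption-laden illustration: it keys on the `max_stars_repo_name` column only, treats repository names case-insensitively, and uses a hypothetical in-memory registry; the actual BigCode tooling is not documented here.

```python
from typing import Any, Dict, Iterable, Iterator, Set


def normalise(repo_name: str) -> str:
    """GitHub owner/repo names are case-insensitive, so compare in lower case."""
    return repo_name.strip().lower()


def apply_opt_outs(
    rows: Iterable[Dict[str, Any]], opted_out_repos: Iterable[str]
) -> Iterator[Dict[str, Any]]:
    """Yield only rows whose repository is absent from the opt-out registry."""
    blocked: Set[str] = {normalise(name) for name in opted_out_repos}
    for row in rows:
        if normalise(row["max_stars_repo_name"]) not in blocked:
            yield row


# Hypothetical registry and rows.
registry = ["Octo/Alpha"]
rows = [
    {"hexsha": "aa", "max_stars_repo_name": "octo/alpha"},
    {"hexsha": "bb", "max_stars_repo_name": "octo/beta"},
]
print([row["hexsha"] for row in apply_opt_outs(rows, registry)])
```

Keeping the registry across releases, as the text describes, means the same filter can be re-applied to every new crawl so that an excluded repository does not silently return.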
By the time of StarCoder's release on 4 May 2023, 44 developers had opted out of The Stack and their repositories had been excluded from training [14]. The opt-out repository has since accumulated substantially more activity: between 2024 and 2026 the open-issues count grew to several hundred, with one community estimate citing more than 1,700 historical opt-out issues filed across The Stack's lifetime [15][16].
Personally identifiable information was treated as a separate problem from licence filtering. The StarCoder paper documents a dedicated PII detection model trained on annotations collected through Toloka from 1,399 crowd workers across 35 countries, covering a corpus of 22,950 secrets, and reports an F1 score of roughly 90 percent, improving on regex-based baselines with particularly large gains on secret keys (API tokens, private keys) [12][14]. PII redaction is applied to the training set fed to StarCoder rather than to the published Stack v1 itself; v2 inherits the same pipeline for StarCoder2 training [3][12].
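To show what a regex-based baseline of the kind mentioned above looks like, here is a minimal sketch that redacts e-mail addresses and IPv4-like strings. The patterns and the placeholder strings are hypothetical and deliberately simple; this is not the StarCoder detector, and a pattern-only approach like this is exactly what struggles with free-form secrets such as API keys.

```python
import re

# Deliberately simple patterns; both will miss edge cases and over-match others.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical placeholder strings chosen for this example.
EMAIL_PLACEHOLDER = "<EMAIL>"
IP_PLACEHOLDER = "<IP_ADDRESS>"


def redact(text: str) -> str:
    """Replace e-mail addresses and IPv4-like strings with placeholders."""
    text = EMAIL_RE.sub(EMAIL_PLACEHOLDER, text)
    text = IPV4_RE.sub(IP_PLACEHOLDER, text)
    return text


snippet = "# contact: jane@example.com, staging host 10.0.0.1\n"
print(redact(snippet), end="")
```

The IPv4 pattern also matches version strings such as `1.2.3.4`, which illustrates why a learned detector with surrounding context can outperform fixed patterns.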
The Stack v1.2 also removed files flagged as malicious during preparation of StarCoder, on the basis that publishing malware samples in pretraining data risks producing models capable of regenerating known exploits [2]. The v2 dataset card states that the same policy carries over to the SWH-based pipeline [4].
The Stack has been the explicit pretraining corpus or a major component for an unusually wide range of open code language models. The table below summarises the most prominent downstream uses.
| Model | Year | Parameters | Pretraining corpus | Stack version |
|---|---|---|---|---|
| SantaCoder | 2023 | 1.1B | Java + JavaScript + Python subset of The Stack | v1.1 [9] |
| StarCoder / StarCoderBase | 2023 | 15.5B | 1 trillion tokens from The Stack across 80+ languages | v1.2 [12] |
| Replit Code v1 | 2023 | 2.7B | Subset of The Stack near-deduplicated | v1.2 [10] |
| Salesforce CodeGen2 | 2023 | 1B, 3.7B, 7B, 16B | Permissive subset of dedup Stack | v1.1 [8] |
| Salesforce CodeGen2.5 | 2023 | 7B | StarCoderData (derived from The Stack) | v1.2 derivative [8] |
| StarCoder2 (3B / 7B / 15B) | 2024 | 3 / 7 / 15B | 900B-token training mix from Stack v2 plus supplementary sources | v2.0.1 [3][7] |
SantaCoder, released January 2023, was BigCode's first technical demonstration that The Stack could train a competitive 1.1B-parameter Multi-Query-Attention decoder on the Java/JavaScript/Python slice of v1.1, outperforming the previous open Python baselines on MultiPL-E [9]. StarCoder (May 2023) scaled this to 15.5B parameters trained on 1 trillion tokens from v1.2, with 8K-token context, fill-in-the-middle objective, and the PII-redaction pipeline described above [12]. StarCoder2 (February 2024) replaced the v1-based corpus with the Stack v2 mixture and grew the largest model's training budget to 4.3 trillion tokens [3][7].
External adopters include Replit, whose Replit-Code-v1-3B is trained on a near-deduplicated subset of Stack v1.2 [10], and Salesforce, whose CodeGen2 series and CodeGen2.5 use respectively the strict permissive subset of dedup Stack v1.1 and the StarCoderData derivative built from v1.2 [8]. The Stack also serves as the substrate for community fine-tunes such as santacoder-finetuned-the-stack-clojure, which extends SantaCoder to languages outside the original Java/JavaScript/Python slice using The Stack's per-language splits [17].
The Stack sits in a small family of public source-code corpora used to train code large language models. Differences in scale, licence stringency, and data source shape how each is used.
| Dataset | Year | Size | Languages | Licence filter | Source |
|---|---|---|---|---|---|
| CodeParrot | 2021 | ~50 GB (180 GB raw) | Python only | None enforced | GitHub Code crawl [5] |
| GitHub Code | 2021 | ~1 TB | 32 | None enforced | GitHub crawl [5] |
| AlphaCode pretraining set | 2022 | ~715 GB | Multiple | Permissive (described, not released) | GitHub [1] |
| CodeGen pretraining set | 2022 | Multiple TB (private) | Multiple | Permissive (described, not released) | GitHub + BigQuery [1] |
| The Stack v1.1 | 2022 | 6.4 TB raw, 2.9 TB dedup | 358 | 193 SPDX permissive licences | GitHub [2] |
| The Stack v2 | 2024 | 67.5 TB raw, 32.1 TB dedup | 658 | Blue Oak permissive + public domain + unlicensed | Software Heritage [4] |
| Common Pile (code split) | 2024 | Within ~8 TB total mixed-domain | Multiple | Permissive + public domain | Mixed [18] |
Common Crawl is also occasionally used to extract code-like content for general-purpose pretraining, but its code coverage is incidental rather than curated and is not subject to a licence filter [1]. The Stack therefore occupies a distinctive niche: it is the only publicly-released, file-level licence-filtered source-code corpus at multi-terabyte scale, and the only one with a structured opt-out and inspection workflow. Surveys of code LLMs published in 2023 to 2024 consistently treat it as the de facto open analogue of the unreleased AlphaCode and Codex training sets [19].
The v1 licence filter relies on a mixture of GHArchive repository metadata and file-level scanning, neither of which is perfect. Repository-level licence labels can disagree with the licence stated inside individual files, especially in large monorepos that vendor third-party code under non-permissive licences; the file-level fallback detector can miss licences expressed in non-standard headers [2]. The v2 pipeline pairs the Blue Oak Council allow-list with file-level ScanCode detection where repository-level information is missing, but the same class of false positives and false negatives still applies [4].
The most-discussed governance change between v1 and v2 is the explicit inclusion of unlicensed files in v2. Critics on the bigcode/the-stack-v2 discussion forum have argued that under most jurisdictions absence of a licence means the default copyright reservation applies, so including unlicensed files conflicts with the BigCode project's earlier framing of permissive-only training data [4][16]. The same threads observe that some files carrying attribution-requiring licences (for example, BSD) are present in the dataset without an accompanying attribution mechanism in downstream models trained on it [16].
Although the opt-out mechanism has processed dozens of requests since 2023, community reports note that the volume of incoming requests has at times outstripped the cadence of dataset re-releases, and some early requests waited multiple releases for full removal [15][16]. Discussions on the v2 dataset page also raise the concern that opt-outs filed against v1 repositories did not always propagate cleanly into the v2 pipeline, because v2 is reconstructed from the Software Heritage graph rather than the original GitHub repository identifiers used by the v1 opt-out registry [4][16].
A 2025 follow-on study, "Cracks in The Stack" (arXiv:2501.02628), audits a subset of The Stack for known-vulnerable code patterns and licence-attribution failures, reporting that a substantial fraction of files contain CWE-mapped vulnerabilities or non-standard licence headers that the BigCode pipeline did not flag [20]. The study does not call for the dataset to be withdrawn but argues that pretraining on it without an additional vulnerability filter risks teaching code models to reproduce known exploits, echoing the rationale for the v1.2 malicious-file removal [20].
Like any pretraining-scale code corpus, The Stack contains many near-duplicates of widely-copied code, increasing the risk that downstream models memorise and reproduce identifiable spans verbatim. The StarCoder release ships an attribution-tracing tool that lets users search a generated snippet against the training set, partly to mitigate this concern, but the tool requires opt-in use and does not absolve downstream applications of attribution responsibilities under the licences of the original files [12].
The Stack is significant in several overlapping ways. First, it is the largest publicly-released code training dataset with a documented licence-filtering pipeline, making it the de facto reproducibility baseline for any open code large language model released after late 2022; the entire BigCode model family (SantaCoder, StarCoder, StarCoder2) and most third-party open code models from Salesforce and Replit in 2023 to 2024 list it as their primary corpus [6][8][10][12].
Second, the "Am I in The Stack?" tool and the GitHub-based opt-out workflow set a widely-imitated template for dataset-level consent in machine-learning pretraining, predating most regulatory consent mandates for training-data use; the workflow has since been studied as a governance case by the Turing Way and cited in surveys of AI code generation practice [14][19]. Third, by relocating from raw GitHub scraping to the Software Heritage archive in v2, BigCode demonstrated that an existing open-source preservation infrastructure can serve as a stable, addressable substrate for AI training data, an architecture pattern that competing open code datasets and parts of the broader open-data community have begun to emulate [4][13].
Finally, the explicit comparison between models trained on The Stack and earlier closed corpora helped establish that openly-licensed pretraining data is sufficient to recover competitive code-generation performance on standard benchmarks such as HumanEval and MBPP, removing a key empirical argument against transparent code-LLM development [1][7].