SlimPajama
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 5,248 words
SlimPajama is a 627-billion-token English-language pre-training corpus for large language models, produced by Cerebras Systems in collaboration with the Opentensor Foundation by extensively cleaning and deduplicating the 1.21-trillion-token RedPajama corpus from Together AI.[1][2] The dataset was released on 9 June 2023 on the Hugging Face Hub at cerebras/SlimPajama-627B, accompanied by an open-source preprocessing pipeline in the Cerebras modelzoo repository and validation and test holdouts of 500 million tokens each.[1][2] SlimPajama removes 49.6% of the bytes in RedPajama by combining NFC Unicode normalization, low-quality short-document filtering, and document-level MinHash Locality-Sensitive Hashing (MinHashLSH) deduplication with a Jaccard similarity threshold of 0.8.[1][2] The release came with the stated motivation that fewer but higher-quality tokens train better language models than larger but heavily duplicated corpora, a position that builds on prior deduplication research and which Cerebras subsequently validated by training the 3-billion-parameter BTLM-3B-8K language model on a single epoch of SlimPajama.[1][3] SlimPajama was the largest extensively deduplicated open multi-source corpus available at the time of its release and has since been adopted as the pre-training mix or as a study substrate by several open language-model projects, including TinyLlama, Crystal/LLM360, and the SlimPajama-DC ablation series.[1][4][5]
| Field | Value |
|---|---|
| Full name | SlimPajama-627B |
| Type | English-language LLM pre-training corpus |
| Tokens | 627 billion (after dedup and cleaning) |
| Source corpus | RedPajama 1.21T (Together AI replication of the LLaMA data mix) |
| Sources retained | CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange |
| Release date | 9 June 2023 |
| Released by | Cerebras Systems with Opentensor Foundation |
| Hosting | Hugging Face Hub (cerebras/SlimPajama-627B) |
| Validation / test | 500M tokens each, decontaminated against train |
| Deduplication | Document-level MinHashLSH, 13-gram signatures, Jaccard threshold 0.8 |
| License | Per-source (Common Crawl Foundation Terms; C4 license; GitHub MIT/BSD/Apache only; ArXiv ToU; Wikipedia license; StackExchange Internet Archive license; Books from The Pile and PG-19); Cerebras tooling released under Apache 2.0 |
| Notable trained models | BTLM-3B-8K, TinyLlama, Crystal/LLM360, SlimPajama-DC 1.3B and 7B ablations |
| Authors of release post | Daria Soboleva, Faisal Al-Khateeb, Joel Hestness, Nolan Dey (Cerebras); Robert Myers, Jacob Robert Steeves (Opentensor) |
SlimPajama is best understood inside the chain of open-data efforts that followed Meta's February 2023 release of the LLaMA paper. LLaMA was trained on a private mix of 1.0 to 1.4 trillion tokens drawn from CommonCrawl, C4, GitHub, Wikipedia, books, ArXiv, and Stack Exchange, with the per-source proportions disclosed in the paper but the resulting tokens themselves never released.[6] In April 2023, Together AI launched RedPajama-Data-1T, a community project with Ontocord.ai, ETH DS3Lab, Stanford University CRFM, Hazy Research, and the MILA Quebec AI Institute, whose goal was to reproduce the LLaMA training mix by extracting roughly the same number of tokens from each public source and releasing them on the Hugging Face Hub.[6][7] RedPajama followed LLaMA's pre-processing steps closely, including the CCNet pipeline over five CommonCrawl dumps and a Wikipedia-like classifier as the final quality filter for the CommonCrawl shards.[6][7] By the spring of 2023 RedPajama had become the de facto open substitute for the LLaMA mix and was already powering early downstream training runs such as RedPajama-INCITE, the joint effort with the EleutherAI community to train a Pythia-style model on the new corpus.[6][7]
The shift from RedPajama toward SlimPajama was driven by two observations that were widely discussed in the open-LLM research community during the first half of 2023. The first was the empirical finding from Lee et al. and follow-up work that web-scale pre-training corpora contain large amounts of near-duplicate text, that models trained on duplicated data emit memorized strings far more often, and that aggressive deduplication can reach the same or better validation loss with markedly fewer training steps.[8] The second was the Chinchilla scaling result from DeepMind, which argued that a fixed compute budget implies a compute-optimal number of training tokens for a given model size; once that quantity is available, token quality matters more than adding further redundant tokens.[9] Against this backdrop Cerebras decided to systematically clean the RedPajama corpus rather than scale it up. The company had already shipped its open Cerebras-GPT series in March 2023 and so had infrastructure and motivation to release another open artifact useful for studying scaling behavior on its CS-2 wafer-scale hardware.[1][3]
SlimPajama was announced on 9 June 2023 by Daria Soboleva, Faisal Al-Khateeb, Joel Hestness and Nolan Dey of Cerebras, together with Robert Myers and Jacob Robert Steeves of the Opentensor Foundation.[1] The accompanying blog post framed SlimPajama as the first open multi-source corpus to receive MinHashLSH deduplication at trillion-token scale, and it released both the dataset and the open-source preprocessing tooling so that other groups could reproduce or extend the procedure for new corpora.[1][2] The dataset card on Hugging Face describes the per-source proportions, license inheritance, byte-level deduplication rates, and the dataset-structure schema used by the released files, and it positions SlimPajama as a drop-in replacement for RedPajama for groups that want fewer-but-cleaner pre-training tokens.[2]
The central claim behind SlimPajama is that, for a fixed compute budget, training on a smaller but cleaner, deduplicated dataset produces a better language model than training on a larger but more redundant one. Cerebras presented this not as a novel scientific claim but as the disciplined operational consequence of two well-established results.
First, the Lee, Ippolito and Carlini line of work showed that large pre-training corpora are saturated with near-duplicate documents and long repeated substrings; for example, C4 contained a single 61-word English sentence repeated more than 60,000 times. Models trained on the deduplicated versions of these datasets memorize and regurgitate verbatim training text roughly ten times less often, and they reach the same or better validation loss with fewer training steps.[8] Subsequent work by Kandpal, Wallace and Raffel showed that duplicates concentrate privacy risk: a small fraction of high-duplication strings dominate the rate at which models can be induced to emit memorized content, so even modest deduplication delivers outsized privacy benefits.[10] Cerebras cited these results, and the SlimPajama blog post and dataset card both motivate the work in part as a way to reduce regurgitation risk by removing redundancy at source.[1][2][8][10]
Second, the Chinchilla paper from Hoffmann et al. (2022) argued that the compute-optimal balance between model parameters and training tokens for transformer language models was, at the scale of contemporary frontier models, more token-heavy than was typical practice in 2022. The conclusion was not that "more tokens is always better" but that tokens and parameters should be scaled together; if available tokens are redundant, the effective token count for learning is lower than the nominal count.[9] In that framing, deduplication directly raises the effective Chinchilla-relevant token count of a corpus.
Cerebras combined these observations into a single operational stance: rather than gathering more web text, the team chose to subtract the duplicated and trivially low-quality content from an already large open corpus. The SlimPajama post characterizes this as eliminating duplication "as a side effect of cleaning rather than as an intentional upsampling strategy," and presents the resulting 627 billion tokens as a higher-quality budget than the original 1.21 trillion.[1] The empirical defense of this position came later in the BTLM-3B-8K technical report and in the SlimPajama-DC ablation paper, both of which compare configurations trained on SlimPajama against equally-sized configurations trained on RedPajama or on alternative mixes and report better downstream accuracy from the cleaned data at matched training tokens.[3][5]
The SlimPajama pipeline is documented in the Hugging Face dataset card, the Cerebras blog post, and the open-source preprocessing code in the Cerebras/modelzoo repository. It consists of six ordered stages, each implemented to run at trillion-token scale on commodity multi-core hardware.[1][2][11]
The first stage applies Unicode NFC (Normalization Form C) normalization to every document so that, as the dataset card puts it, "letters followed by combining characters become single combined characters, following GPT-2." The result is that variants such as a base letter plus a combining diacritic and a precomposed accented character collapse to one canonical byte sequence. This step removes spurious near-duplicates that differ only in their Unicode encoding and ensures that subsequent n-gram hashing operates on a single normalized form. The choice to follow GPT-2's normalization keeps the released tokens compatible with tokenizers trained on GPT-2-style normalized text.[1][2]
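The effect of NFC normalization can be illustrated with Python's standard library. This is a minimal sketch of the normalization step only, not the Cerebras implementation; the helper name `normalize_nfc` and the sample strings are chosen here for illustration.

```python
import unicodedata


def normalize_nfc(text: str) -> str:
    """Return the NFC (canonical composition) form of `text`."""
    return unicodedata.normalize("NFC", text)


# Two encodings of the same visible word "café".
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "caf\u00e9"   # single code point U+00E9

# Before normalization the strings differ code point by code point.
print(decomposed == precomposed)

# After NFC both collapse to the same precomposed sequence.
print(normalize_nfc(decomposed) == normalize_nfc(precomposed))
```

Without this step the two spellings would produce different n-grams and could escape near-duplicate detection even though they render identically.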
The second stage removes documents that are too short to carry useful signal. Cerebras filters out documents that contain fewer than 200 characters after punctuation, consecutive whitespace, newlines, and tabs are stripped. The dataset card reports the per-source rejection rates from this step, ranging from 0.00% on GitHub, Books, and Wikipedia, through 0.02% on CommonCrawl, 0.32% on StackExchange, and 0.62% on ArXiv, up to 4.70% on C4; the overall rate is 1.86% of bytes removed.[2] The relatively high C4 rate is consistent with the structure of C4, which already contains a large number of short crawled snippets, and the near-zero rate on Books and Wikipedia reflects the long-form nature of those sources.
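A short-document filter of this kind might look like the following sketch. The 200-character threshold comes from the description above; treating "punctuation" as ASCII punctuation and collapsing each whitespace run to a single space are assumptions made for illustration, and the function names are hypothetical.

```python
import re
import string

MIN_CHARS = 200  # threshold quoted in the text

# Assumption: "punctuation" means ASCII punctuation only.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
_WHITESPACE_RUN = re.compile(r"\s+")


def stripped_length(text: str) -> int:
    """Length after removing punctuation and collapsing whitespace runs."""
    no_punct = text.translate(_PUNCT_TABLE)
    collapsed = _WHITESPACE_RUN.sub(" ", no_punct).strip()
    return len(collapsed)


def keep_document(text: str, min_chars: int = MIN_CHARS) -> bool:
    """Keep a document only if enough characters survive the stripping."""
    return stripped_length(text) >= min_chars


print(keep_document("Too short!!!\n\n\t"))
print(keep_document("word " * 60))
```

Measuring length after stripping prevents documents padded with whitespace or punctuation from passing the threshold.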
The third stage performs document-level near-duplicate detection via MinHash Locality-Sensitive Hashing. For each document Cerebras computes a set of 13-grams over the lowercased text after stripping punctuation, consecutive whitespace, newlines, and tabs. The document signature is a MinHash sketch of that 13-gram set, and signatures are inserted into a MinHashLSH index configured to retrieve pairs of documents whose estimated Jaccard similarity is 0.8 or higher.[1][2] The implementation builds on the open-source datasketch library but rewrites the in-memory data structures so that the index fits on a single multi-core machine at trillion-token scale and is more efficient in a distributed setting.[1][11]
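The signature step can be sketched in plain Python without the datasketch dependency. The 13-gram width and the preprocessing order follow the description above; the use of character (rather than word) n-grams, the signature length of 128, and the BLAKE2b-based hash family are assumptions made for this illustration, and the LSH index itself is omitted.

```python
import hashlib
import re
import string

NGRAM = 13       # n-gram width quoted in the text
NUM_PERM = 128   # signature length: an assumption, not a documented value

_PUNCT = str.maketrans("", "", string.punctuation)


def shingles(text: str, n: int = NGRAM) -> set:
    """Lowercase, strip punctuation, collapse whitespace, take character n-grams."""
    clean = re.sub(r"\s+", " ", text.lower().translate(_PUNCT)).strip()
    return {clean[i:i + n] for i in range(max(len(clean) - n + 1, 1))}


def _hash(shingle: str, seed: int) -> int:
    """One member of a seeded hash family (illustrative choice: salted BLAKE2b)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str, num_perm: int = NUM_PERM) -> list:
    """MinHash signature: the minimum hash value under each seeded hash function."""
    grams = shingles(text)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "The quick brown fox jumps over the lazy dog near the river bank today."
doc_b = "the quick brown fox   jumps over the lazy dog, near the river bank today"
doc_c = "Wafer-scale accelerators simplify data-parallel training of language models."

print(estimated_jaccard(minhash(doc_a), minhash(doc_b)))
print(estimated_jaccard(minhash(doc_a), minhash(doc_c)))
```

Because `doc_a` and `doc_b` differ only in case, punctuation, and spacing, they yield the same n-gram set and therefore identical signatures. An LSH index then avoids comparing every pair of signatures by bucketing documents whose signatures agree on whole bands of slots.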
The deduplication runs in multiple passes: (i) building the MinHashLSH index over all signatures; (ii) querying the index to enumerate near-duplicate pairs; (iii) constructing a graph in which documents are nodes and near-duplicate pairs are edges, then computing the connected components of that graph; and (iv) within each connected component, keeping one representative document and discarding the rest. The pipeline performs deduplication both within each source and across sources, so a document that appears once in CommonCrawl and again in C4 is collapsed to a single copy.[1][2]
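Steps (iii) and (iv) can be sketched with a union-find structure. This is an illustrative stand-in, not the released code: documents are represented by integer IDs, and keeping the lowest ID in each component is an assumption, since the description only says that one representative is kept.

```python
from collections import defaultdict


def connected_components(num_docs: int, duplicate_pairs) -> list:
    """Group document IDs into components linked by near-duplicate pairs."""
    parent = list(range(num_docs))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for doc in range(num_docs):
        groups[find(doc)].append(doc)
    return list(groups.values())


def select_survivors(num_docs: int, duplicate_pairs) -> list:
    """Keep one representative (here: the lowest ID) per component."""
    return sorted(min(c) for c in connected_components(num_docs, duplicate_pairs))


# Documents 0-1-2 form a chain, 4-5 are a pair, 3 has no duplicates.
pairs = [(0, 1), (1, 2), (4, 5)]
print(select_survivors(6, pairs))
```

Working on connected components rather than individual pairs means that duplication is treated transitively: documents 0 and 2 are collapsed together even though only the pairs (0, 1) and (1, 2) were reported.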
The reported per-source byte-deduplication rates make the case that duplication is heavily concentrated in CommonCrawl and in code:
| Source | Bytes removed by dedup |
|---|---|
| CommonCrawl | 63.76% |
| C4 | 6.85% |
| GitHub | 46.16% |
| Books | 2.01% |
| ArXiv | 0.06% |
| Wikipedia | 2.24% |
| StackExchange | 0.20% |
| Overall | 49.60% |
The two large numbers, 63.76% on CommonCrawl and 46.16% on GitHub, dominate the overall byte reduction. They reflect, respectively, the well-known fact that CommonCrawl contains many copies of the same canonical pages across snapshots and the equally well-known fact that public GitHub mirrors and re-publishes large swathes of identical source code through forks and vendored copies. The low number on ArXiv, 0.06%, is the operational confirmation that ArXiv is essentially deduplicated at source.[2]
Once duplicates are removed, the surviving documents from each source are interleaved according to target proportions and then shuffled with a two-pass shuffling algorithm adapted from The Pile to prevent residual ordering bias from the original RedPajama files. The interleaving step is what sets the final composition of the dataset by source.[1][11]
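A generic two-pass shuffle can be sketched as follows. The text does not describe the exact variant used, so this shows only the general idea: in-memory lists stand in for the on-disk bucket files that a corpus-scale run would need, and the function name and parameters are hypothetical.

```python
import random


def two_pass_shuffle(records, num_buckets: int, seed: int = 0):
    """Shuffle a stream too large for memory, one bucket at a time."""
    rng = random.Random(seed)

    # Pass 1: scatter every record into a randomly chosen bucket.
    buckets = [[] for _ in range(num_buckets)]
    for record in records:
        buckets[rng.randrange(num_buckets)].append(record)

    # Pass 2: shuffle each bucket independently and emit its contents.
    for bucket in buckets:
        rng.shuffle(bucket)
        yield from bucket


shuffled = list(two_pass_shuffle(range(1000), num_buckets=8, seed=1))
print(shuffled[:10])
```

Only one bucket has to fit in memory during the second pass, which is what makes the approach practical when the full corpus cannot be loaded at once.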
The shuffled corpus is split into a training set, a 500-million-token validation set, and a 500-million-token test set. The two holdouts are drawn first; a final deduplication pass against the training set then ensures that downstream evaluation is not contaminated by leakage from training.[1][2]
Finally, Cerebras runs a second MinHashLSH pass between the training shards and the held-out validation and test shards so that any near-duplicate of a held-out document is removed from training. The result is a 627-billion-token training set with decontaminated 500-million-token validation and test sets.[1][2]
The release notes report that the full pipeline runs on the 1.21-trillion-token RedPajama corpus in approximately 2.5 days on a single 64-core CPU machine, with peak memory consumption of approximately 1.4 terabytes during the duplicate-pair generation phase. Cerebras emphasizes that, to the best of the authors' knowledge, the released tooling is the first open implementation able to clean and deduplicate text at trillion-token scale on commodity hardware, and that it uses a producer-consumer schema in I/O-bound stages to keep memory usage bounded.[1][11]
After deduplication and short-document filtering, the SlimPajama-627B training set has the following composition by source. The two columns show the SlimPajama proportions alongside the RedPajama proportions for reference; the shift is the consequence of CommonCrawl losing most of its bytes to deduplication while sources such as C4, Books, ArXiv, Wikipedia, and StackExchange lose almost none.[2]
| Source | SlimPajama-627B | RedPajama-1T |
|---|---|---|
| CommonCrawl | 52.2% | 72.6% |
| C4 | 26.7% | 14.4% |
| GitHub | 5.2% | 4.9% |
| Books | 4.2% | 2.1% |
| ArXiv | 4.6% | 2.3% |
| Wikipedia | 3.8% | 2.0% |
| StackExchange | 3.3% | 1.7% |
CommonCrawl remains the single largest source but loses substantial share relative to RedPajama because of its high intra-source and cross-source duplication. C4 nearly doubles its share even though it loses only 6.85% of its own bytes, because the reduction in CommonCrawl is so much larger. GitHub remains essentially constant in share even after losing nearly half its bytes, because its 46.16% loss is close to the corpus-wide reduction of 49.6%. Books, ArXiv, Wikipedia, and StackExchange roughly double their share because they were nearly free of duplicates to begin with and so were almost untouched by the dedup pass.[2]
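The shift in proportions can be approximately reproduced from the two tables in this article. The sketch below applies each source's deduplication rate to its RedPajama share and renormalizes; it ignores the short-document filter and mixes token shares with byte-level removal rates, so it is a rough consistency check rather than the procedure Cerebras used.

```python
# RedPajama-1T shares (%) and per-source dedup removal rates (%), from the tables.
redpajama_share = {
    "CommonCrawl": 72.6, "C4": 14.4, "GitHub": 4.9, "Books": 2.1,
    "ArXiv": 2.3, "Wikipedia": 2.0, "StackExchange": 1.7,
}
dedup_removed = {
    "CommonCrawl": 63.76, "C4": 6.85, "GitHub": 46.16, "Books": 2.01,
    "ArXiv": 0.06, "Wikipedia": 2.24, "StackExchange": 0.20,
}
# SlimPajama-627B shares (%) as reported, for comparison.
slimpajama_share = {
    "CommonCrawl": 52.2, "C4": 26.7, "GitHub": 5.2, "Books": 4.2,
    "ArXiv": 4.6, "Wikipedia": 3.8, "StackExchange": 3.3,
}


def reweighted_shares(shares: dict, removed: dict) -> dict:
    """Scale each share by its surviving fraction, then renormalize to 100%."""
    kept = {src: shares[src] * (1 - removed[src] / 100) for src in shares}
    total = sum(kept.values())
    return {src: 100 * value / total for src, value in kept.items()}


estimate = reweighted_shares(redpajama_share, dedup_removed)
for src, value in estimate.items():
    print(f"{src:14s} estimated {value:5.1f}%  reported {slimpajama_share[src]:5.1f}%")
```

Under these simplifications every estimate lands within about 0.2 percentage points of the reported figure. The calculation also shows why GitHub's share barely moves: its surviving fraction is close to the surviving fraction of the corpus as a whole.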
The licensing of SlimPajama is inherited rather than uniform. The dataset card explicitly directs users to the Common Crawl Foundation terms for CommonCrawl, the C4 license for C4, the GitHub partition's permissive-only filter (MIT, BSD, or Apache repositories), the licenses of The Pile and PG-19 for the Books partition, the ArXiv Terms of Use for ArXiv, the relevant Wikipedia license for Wikipedia, and the StackExchange Internet Archive license for StackExchange. The Cerebras preprocessing code itself is released under the Apache 2.0 license.[2][11]
The released files use a uniform JSONL schema in which each document is an object with a text field containing the document content and a meta field containing a redpajama_set_name value identifying its origin (RedPajamaCommonCrawl, RedPajamaC4, RedPajamaGithub, and so on). This makes it trivial for downstream pipelines to filter, upsample, or re-mix by source without re-running the full SlimPajama pipeline.[2]
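Filtering by source under this schema might look like the following sketch. The field names `text`, `meta`, and `redpajama_set_name` come from the description above; the two sample records are invented for illustration, and reading the released files themselves (including any decompression) is left out.

```python
import io
import json


def iter_source(lines, set_name: str):
    """Yield the text of every JSONL record whose origin matches `set_name`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        if doc["meta"]["redpajama_set_name"] == set_name:
            yield doc["text"]


# Invented sample records that follow the documented schema.
sample = io.StringIO(
    '{"text": "def f(): pass", "meta": {"redpajama_set_name": "RedPajamaGithub"}}\n'
    '{"text": "A web page.", "meta": {"redpajama_set_name": "RedPajamaCommonCrawl"}}\n'
)

github_docs = list(iter_source(sample, "RedPajamaGithub"))
print(github_docs)
```

The same loop can be adapted to count documents per source or to build a custom mix by sampling each source at a chosen rate.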
The first language model explicitly trained on SlimPajama-627B was the Bittensor Language Model BTLM-3B-8K, released by Cerebras and the Opentensor Foundation in July 2023 and described in an arXiv technical report by Dey, Soboleva, Al-Khateeb and colleagues.[3] BTLM-3B-8K was positioned as the direct empirical demonstration that fewer, cleaner tokens train better LLMs.
BTLM-3B-8K has 2.6 billion parameters in a decoder-only transformer with 32 layers, hidden size 2,560, head dimension 80, and feed-forward dimension 6,826. It uses SwiGLU feed-forward activations, ALiBi positional biases, and the maximal-update parameterization scheme muP with tuned multipliers including an embedding multiplier of 14.6 and an output-logit multiplier of 2.22.[3] The model is trained on a single epoch of the entire SlimPajama-627B training set (627 billion tokens) in a two-phase sequence-length curriculum: 470 billion tokens at sequence length 2,048 followed by 157 billion tokens at sequence length 8,192. Training used 64 Cerebras CS-2 wafer-scale accelerators in the Condor Galaxy-1 (CG-1) cluster in Santa Clara, with data parallelism only and no tensor or pipeline parallelism; the report attributes the smoothness of the loss curve, which experienced only two minor spikes, to muP.[3]
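The quoted dimensions are consistent with the stated parameter count, as a rough back-of-the-envelope calculation shows. The vocabulary size (GPT-2's 50,257), tied input and output embeddings, and the omission of biases and normalization weights are all assumptions made here; none of them is stated in the text above.

```python
# Figures quoted in the text.
N_LAYERS = 32
D_MODEL = 2_560
D_FFN = 6_826

# Assumption: GPT-2 vocabulary size; the tokenizer size is not stated above.
VOCAB = 50_257

attention = 4 * D_MODEL * D_MODEL   # Q, K, V and output projections
swiglu_ffn = 3 * D_MODEL * D_FFN    # gate, up and down projections of SwiGLU
per_layer = attention + swiglu_ffn
embeddings = VOCAB * D_MODEL        # assumes tied input/output embeddings
total_params = N_LAYERS * per_layer + embeddings

# Two-phase sequence-length curriculum, in billions of tokens.
phase_tokens_b = {2_048: 470, 8_192: 157}
total_tokens_b = sum(phase_tokens_b.values())

print(f"approx. parameters: {total_params / 1e9:.2f}B")
print(f"training tokens:    {total_tokens_b}B")
```

Under these assumptions the estimate comes to about 2.6 billion parameters, and the two curriculum phases add up to the full 627 billion tokens of a single SlimPajama epoch. ALiBi adds no learned positional parameters, so none are counted.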
The downstream evaluation, reported using the EleutherAI evaluation harness in zero-shot mode (five-shot for MMLU), shows BTLM-3B-8K outperforming all contemporary 3-billion-parameter open models on common-sense reasoning, world knowledge, reading comprehension and code, and matching or surpassing several 7-billion-parameter open models including RedPajama-INCITE-7B-Base, OpenLLaMA-7B, and StableLM-Alpha-7B-v2 despite using 71% fewer training FLOPs than typical 7B runs.[3][12] On long-context tasks such as QMSum and GovReports at 8,192 input length, BTLM-3B-8K exceeds the much larger MPT-7B-8K and XGen-7B-8K, which Cerebras attributes both to the ALiBi extrapolation behavior of the model and to the absence of duplicate documents in the long-document partitions of SlimPajama.[3][12]
Because BTLM-3B-8K was the first end-to-end model trained on SlimPajama with publicly reported recipes, it became the canonical reference point for the claim that the SlimPajama recipe produces a stronger model per FLOP than the unprocessed RedPajama recipe at matched scale. The BTLM-3B-8K technical report explicitly attributes its compute efficiency to the combination of ALiBi, SwiGLU, muP, and "the extensively deduplicated and cleaned SlimPajama-627B dataset."[3]
A complementary investigation of how to mix SlimPajama's sources appeared in September 2023 in the arXiv preprint SlimPajama-DC: Understanding Data Combinations for LLM Training by Shen, Tao, Ma, Neiswanger, Liu, Wang, Tan, Hestness, Vassilieva, Soboleva and Xing.[5] The DC paper trains six different 1.3-billion-parameter Cerebras-GPT-style models on six different 330-billion-token subsets of SlimPajama and then extends the strongest configuration to a 7-billion-parameter model trained with large-batch strategies on the 16-CS-2 cluster delivering 80 PFLOP/s in bfloat16 mixed precision.[5]
The six 330B-token configurations (DC-1 through DC-6) vary the source proportions:
| Source | DC-1 | DC-2 | DC-3 | DC-4 | DC-5 | DC-6 |
|---|---|---|---|---|---|---|
| CommonCrawl | 100.0% | 90.9% | 75.8% | 75.8% | 75.8% | 52.2% |
| C4 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 26.7% |
| GitHub | 0.0% | 9.1% | 24.2% | 0.0% | 9.1% | 5.2% |
| Books | 0.0% | 0.0% | 0.0% | 0.0% | 7.9% | 4.2% |
| ArXiv | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4.6% |
| Wikipedia | 0.0% | 0.0% | 0.0% | 24.2% | 7.3% | 3.8% |
| StackExchange | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3.3% |
DC-6 reproduces the SlimPajama-627B native proportions. The authors report that DC-6 attains the highest average accuracy across all SlimPajama configurations, with an average score of 40.0 over their evaluation suite, while a separately trained 330B configuration on RefinedWeb (DC-7) reaches 41.0. The headline conclusions are that (i) after rigorous global deduplication across sources, increasing diversity of sources becomes more important to downstream accuracy than further increasing the size of any single source, and (ii) the native SlimPajama mix is competitive with the alternative web-only RefinedWeb mix at the same compute budget.[5] The 7B-parameter extension validates the trend at larger scale and confirms that the relative ordering of configurations is consistent across the 1.3B and 7B regimes.[5]
In the same way that BTLM-3B-8K is the canonical demonstration that SlimPajama trains a better LLM than RedPajama at matched FLOPs, SlimPajama-DC is the canonical demonstration that source diversity becomes more, not less, important once a corpus has been globally deduplicated. The two artifacts together turn SlimPajama into a study substrate rather than a one-off corpus.
By late 2023 and into 2024 SlimPajama had become one of the standard open pre-training corpora for academic and open-source efforts that wanted a high-quality multi-source mix without running their own deduplication pipeline. Notable adopters include TinyLlama, Crystal/LLM360, and the SlimPajama-DC ablation series.[1][4][5]
A subtler form of adoption is by reference: open-data initiatives such as Dolma, FineWeb and DCLM cite SlimPajama as one of the canonical multi-source baselines against which to compare new English pre-training corpora, and the SlimPajama deduplication recipe (MinHashLSH on 13-grams with Jaccard threshold 0.8) is now a frequently used reference configuration in pipelines for new corpora.[2][5]
Conversely, some prominent open models that use RedPajama-derived training data did not adopt SlimPajama. The Together AI RedPajama-INCITE models were trained directly on RedPajama-Data-1T because they were developed alongside that corpus's release. MosaicML's MPT-7B, released in May 2023 before SlimPajama existed, used a custom MosaicML mix that includes RedPajama elements but is not identical to either RedPajama or SlimPajama; subsequent MPT variants have used a variety of mixes.[16] These cases illustrate that SlimPajama supplements, rather than uniformly replaces, the original RedPajama corpus in the open-LLM ecosystem.
SlimPajama belongs to a family of openly released English pre-training corpora that emerged between 2020 and 2024. The most relevant peers are RedPajama (the parent corpus), The Pile, RefinedWeb, Dolma, FineWeb, and DCLM.
| Corpus | Approximate size | Multi-source? | Deduplication | Open code |
|---|---|---|---|---|
| The Pile | ~825 GiB | Yes (22 sources) | Source-level only | Yes |
| RedPajama-Data-1T | 1.21T tokens | Yes (LLaMA mix) | Lightweight | Yes |
| SlimPajama-627B | 627B tokens | Yes (LLaMA mix) | MinHashLSH, J=0.8 | Yes |
| RefinedWeb | ~600B tokens (released subset) | No (web only) | URL + MinHash + exact substring | Yes |
| Dolma | 3T tokens | Yes | URL + MinHash | Yes |
| FineWeb | 15T tokens | No (web only) | URL + MinHash | Yes |
| DCLM-Baseline | 4T tokens | Mostly web | Multi-stage | Yes |
Within this family SlimPajama occupies a specific niche: it is multi-source (matching the LLaMA mix) and aggressively deduplicated, but at 627B tokens it is much smaller than the more recent web-only corpora such as FineWeb and DCLM-Baseline. Its appeal for academic and small-team training runs is precisely that it is small enough to be tractable on modest hardware and yet diverse enough to mirror the LLaMA-style training mix.[1][2][5]
Compared to RedPajama, SlimPajama is roughly half the size in bytes but is widely reported (by Cerebras and by independent groups including the SlimPajama-DC authors) to produce stronger downstream language models at matched training tokens, with the benefit attributed primarily to deduplication.[1][3][5] Compared to The Pile, SlimPajama follows the LLaMA mix rather than EleutherAI's 22-source mix and includes substantially more web content and less academic and technical content; Cerebras' own preprocessing pipeline draws explicitly on The Pile's two-pass shuffling algorithm.[1][11] Compared to RefinedWeb and FineWeb, SlimPajama retains non-web sources such as Books, ArXiv, Wikipedia and StackExchange, which makes it more suitable for models that need explicit exposure to long-form academic and reference text rather than only crawled web text.[17][18]
SlimPajama has three durable contributions to the open-LLM ecosystem.
First, the dataset itself: a 627-billion-token, multi-source, globally deduplicated, openly licensed corpus that is small enough to be trained on by single-cluster academic groups and large enough to train models in the 1B-to-7B-parameter range to competitive quality. It has become a stable reference corpus for projects ranging from BTLM-3B-8K and TinyLlama to SlimLM, Crystal and the SlimPajama-DC ablation series.[1][3][4][13][15]
Second, the open-source tooling for cleaning and MinHashLSH deduplication at trillion-token scale, including a re-implementation of MinHashLSH on top of datasketch engineered to fit in memory on a 64-core machine and to run end-to-end in about 2.5 days. Because the code is released under Apache 2.0 in the Cerebras/modelzoo repository, subsequent corpora can adopt the same dedup recipe with the same parameters (13-gram signatures, Jaccard threshold 0.8) and report comparable numbers. This is reflected in the fact that 13-gram MinHash with J=0.8 has become a common reference configuration in the documentation of newer corpora.[1][11][17][18]
Third, an empirical case for the "fewer-but-cleaner" position. The BTLM-3B-8K release demonstrates a 3-billion-parameter model trained on a single epoch of SlimPajama that matches several 7-billion-parameter open models on standard benchmarks, and the SlimPajama-DC paper demonstrates that within SlimPajama itself, source diversity matters more than source size once global deduplication has been performed. Together these two results have been cited in subsequent open-data work as a primary justification for investing in aggressive deduplication and source diversity rather than in larger, lightly cleaned corpora.[3][5]
Several limitations of SlimPajama have been noted in primary and secondary sources and are worth recording in an encyclopedic discussion.
These limitations are not so much defects as design choices: SlimPajama is explicitly the deduplicated and minimally filtered cousin of RedPajama, not an attempt to apply every subsequent data-curation technique. They do, however, mark out where the corpus is and is not appropriate to use in 2026.