SlimPajama

Updated Jul 16, 2026

SlimPajama is a 627-billion-token English-language pre-training corpus for large language models, produced by Cerebras Systems in collaboration with the Opentensor Foundation by extensively cleaning and deduplicating the 1.21-trillion-token RedPajama corpus from Together AI.^[1]^[2] The dataset was released on 9 June 2023 on the Hugging Face Hub at cerebras/SlimPajama-627B, accompanied by an open-source preprocessing pipeline in the Cerebras modelzoo repository and validation and test holdouts of 500 million tokens each.^[1]^[2] SlimPajama removes 49.6% of the bytes in RedPajama by combining NFC Unicode normalization, low-quality short-document filtering, and document-level MinHash Locality-Sensitive Hashing (MinHashLSH) deduplication with a Jaccard similarity threshold of 0.8.^[1]^[2] The release came with the stated motivation that fewer but higher-quality tokens train better language models than larger but heavily duplicated corpora, a position that builds on prior deduplication research and which Cerebras subsequently validated by training the 3-billion-parameter BTLM-3B-8K language model on a single epoch of SlimPajama.^[1]^[3] SlimPajama was the largest extensively deduplicated open multi-source corpus available at the time of its release and has since been adopted as the pre-training mix or as a study substrate by several open language-model projects, including TinyLlama, Crystal/LLM360, and the SlimPajama-DC ablation series.^[1]^[4]^[5]

Infobox

Field	Value
Full name	SlimPajama-627B
Type	English-language LLM pre-training corpus
Tokens	627 billion (after dedup and cleaning)
Source corpus	RedPajama 1.21T (Together AI replication of the LLaMA data mix)
Sources retained	CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange
Release date	9 June 2023
Released by	Cerebras Systems with Opentensor Foundation
Hosting	Hugging Face Hub (`cerebras/SlimPajama-627B`)
Validation / test	500M tokens each, decontaminated against train
Deduplication	Document-level MinHashLSH, 13-gram signatures, Jaccard threshold 0.8
License	Per-source (Common Crawl Foundation Terms; C4 license; GitHub MIT/BSD/Apache only; ArXiv ToU; Wikipedia license; StackExchange Internet Archive license; Books from The Pile and PG-19); Cerebras tooling released under Apache 2.0
Notable trained models	BTLM-3B-8K, TinyLlama, Crystal/LLM360, SlimPajama-DC 1.3B and 7B ablations
Authors of release post	Daria Soboleva, Faisal Al-Khateeb, Joel Hestness, Nolan Dey (Cerebras); Robert Myers, Jacob Robert Steeves (Opentensor)

Background

SlimPajama is best understood inside the chain of open-data efforts that followed Meta's February 2023 release of the LLaMA paper. LLaMA was trained on a private mix of 1.0 to 1.4 trillion tokens drawn from CommonCrawl, C4, GitHub, Wikipedia, books, ArXiv, and Stack Exchange, with the per-source proportions disclosed in the paper but the resulting tokens themselves never released.^[6] In April 2023, Together AI launched RedPajama-Data-1T, a community project with Ontocord.ai, ETH DS3Lab, Stanford University CRFM, Hazy Research, and the MILA Quebec AI Institute, whose goal was to reproduce the LLaMA training mix by extracting roughly the same number of tokens from each public source and releasing them on the Hugging Face Hub.^[6]^[7] RedPajama followed LLaMA's pre-processing steps closely, including the CCNet pipeline over five CommonCrawl dumps and a Wikipedia-like classifier as the final quality filter for the CommonCrawl shards.^[6]^[7] By the spring of 2023 RedPajama had become the de facto open substitute for the LLaMA mix and was already powering early downstream training runs such as RedPajama-INCITE, the joint effort with the EleutherAI community to train a Pythia-style model on the new corpus.^[6]^[7]

The shift from RedPajama toward SlimPajama was driven by two observations that were widely discussed in the open-LLM research community during the first half of 2023. The first was the empirical finding from Lee et al. and follow-up work that web-scale pre-training corpora contain large amounts of near-duplicate text, that models trained on duplicated data emit memorized strings far more often, and that aggressive deduplication can reach the same or better validation loss with markedly fewer training steps.^[8] The second was the Chinchilla scaling result from DeepMind, which argued that for a fixed compute budget there is a compute-optimal number of training tokens for a given model size, which sharpens the importance of token quality once that token quantity is satisfied.^[9] Against this backdrop Cerebras decided to systematically clean the RedPajama corpus rather than scale it up. The company had already shipped its open Cerebras-GPT series in March 2023 and so had both the infrastructure and the motivation to release another open artifact useful for studying scaling behavior on its CS-2 wafer-scale hardware.^[1]^[3]

SlimPajama was announced on 9 June 2023 by Daria Soboleva, Faisal Al-Khateeb, Joel Hestness and Nolan Dey of Cerebras, together with Robert Myers and Jacob Robert Steeves of the Opentensor Foundation.^[1] The accompanying blog post framed SlimPajama as the first open multi-source corpus to receive MinHashLSH deduplication at trillion-token scale, and it released both the dataset and the open-source preprocessing tooling so that other groups could reproduce or extend the procedure for new corpora.^[1]^[2] The dataset card on Hugging Face describes the per-source proportions, license inheritance, byte-level deduplication rates, and the dataset-structure schema used by the released files, and it positions SlimPajama as a drop-in replacement for RedPajama for groups that want fewer-but-cleaner pre-training tokens.^[2]

Motivation: data quality versus data quantity

The central claim behind SlimPajama is that, for a fixed compute budget, training on a smaller but more uniformly clean and de-duplicated dataset produces a better language model than training on a larger but more redundant dataset. Cerebras presented this not as a novel scientific claim but as the disciplined operational consequence of two well-established results.

First, the Lee, Ippolito and Carlini line of work showed that large pre-training corpora are saturated with near-duplicate documents and long repeated substrings; for example, C4 contained a single 61-word English sentence repeated more than 60,000 times. Models trained on the deduplicated versions of these datasets memorize and regurgitate verbatim training text roughly ten times less often, and they reach the same or better validation loss with fewer training steps.^[8] Subsequent work by Kandpal, Wallace and Raffel showed that duplicates concentrate privacy risk: a small fraction of high-duplication strings dominate the rate at which models can be induced to emit memorized content, so even modest deduplication delivers outsized privacy benefits.^[10] Cerebras cited these results, and the SlimPajama blog post and dataset card both motivate the work in part as a way to reduce regurgitation risk by removing redundancy at source.^[1]^[2]^[8]^[10]

Second, the Chinchilla paper from Hoffmann et al. (2022) argued that the compute-optimal balance between model parameters and training tokens for transformer language models was, at the scale of contemporary frontier models, more token-heavy than was typical practice in 2022. The conclusion was not that "more tokens is always better" but that tokens and parameters should be scaled together; if available tokens are redundant, the effective token count for learning is lower than the nominal count.^[9] In that framing, deduplication directly raises the effective Chinchilla-relevant token count of a corpus.

Cerebras combined these observations into a single operational stance: rather than gathering more web text, the team chose to subtract the duplicated and trivially low-quality content from an already large open corpus. The SlimPajama post characterizes this as eliminating duplication "as a side effect of cleaning rather than as an intentional upsampling strategy," and presents the resulting 627 billion tokens as a higher-quality budget than the original 1.21 trillion.^[1] The empirical defense of this position came later in the BTLM-3B-8K technical report and in the SlimPajama-DC ablation paper, both of which compare configurations trained on SlimPajama against equally-sized configurations trained on RedPajama or on alternative mixes and report better downstream accuracy from the cleaned data at matched training tokens.^[3]^[5]

The cleaning and deduplication pipeline

The SlimPajama pipeline is documented in the Hugging Face dataset card, the Cerebras blog post, and the open-source preprocessing code in the Cerebras/modelzoo repository. It consists of six ordered stages, each implemented to run at trillion-token scale on commodity multi-core hardware.^[1]^[2]^[11]

Stage 1: NFC normalization

The first stage applies Unicode NFC (Normalization Form C) normalization to every document so that, as the dataset card puts it, "letters followed by combining characters become single combined characters, following GPT-2." The result is that variants such as a base letter plus a combining diacritic and a precomposed accented character collapse to one canonical byte sequence. This step removes spurious near-duplicates that differ only in their Unicode encoding and ensures that subsequent n-gram hashing operates on a single normalized form. The choice to follow GPT-2's normalization keeps the released tokens compatible with tokenizers trained on GPT-2-style normalized text.^[1]^[2]
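
The effect of NFC normalization can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch rather than the Cerebras implementation; the helper name `normalize_nfc` and the sample strings are illustrative assumptions.

```python
import unicodedata


def normalize_nfc(text: str) -> str:
    """Return the NFC (canonically composed) form of text."""
    return unicodedata.normalize("NFC", text)


# Two encodings of the same visible word (sample strings are illustrative):
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "caf\u00e9"   # single precomposed code point U+00E9

# Before normalization the strings differ code point by code point...
assert decomposed != precomposed
# ...after NFC both collapse to the same canonical sequence.
assert normalize_nfc(decomposed) == normalize_nfc(precomposed)
```

Because both spellings map to one sequence, later n-gram hashing sees them as the same text.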

Stage 2: Short-document filtering

The second stage removes documents that are too short to carry useful signal. Cerebras filters out documents that contain fewer than 200 characters after punctuation, consecutive whitespace, newlines, and tabs are stripped. The dataset card reports the per-source rejection rates from this step, ranging from 0.00% on GitHub, Books, and Wikipedia, through 0.02% on CommonCrawl, 0.32% on StackExchange, and 0.62% on ArXiv, up to 4.70% on C4; the overall rate is 1.86% of bytes removed.^[2] The relatively high C4 rate is consistent with the structure of C4, which already contains a large number of short crawled snippets, and the near-zero rate on Books and Wikipedia reflects the long-form nature of those sources.
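
A filter of this kind could look roughly like the following sketch. It assumes that "punctuation" means ASCII punctuation and that runs of whitespace (including newlines and tabs) are collapsed to a single space before counting; the text does not pin down either detail, and the function names are hypothetical.

```python
import re
import string

MIN_CHARS = 200  # threshold stated in the text

# Assumption: only ASCII punctuation is removed.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text: str) -> str:
    """Drop punctuation and collapse whitespace runs to a single space."""
    no_punct = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", no_punct).strip()


def keep_document(text: str, min_chars: int = MIN_CHARS) -> bool:
    """Return True if the cleaned document meets the minimum length."""
    return len(clean(text)) >= min_chars


if __name__ == "__main__":
    short_doc = "Click here!\n\n\tRead more..."
    long_doc = "word " * 100
    print(keep_document(short_doc), keep_document(long_doc))
```

Measuring length after cleaning means a page that is mostly markup-like punctuation or blank lines cannot pass the threshold on padding alone.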

Stage 3: MinHashLSH near-deduplication

The third stage performs document-level near-duplicate detection via MinHash Locality-Sensitive Hashing. For each document Cerebras computes a set of 13-grams over the lowercased text after stripping punctuation, consecutive whitespace, newlines, and tabs. The document signature is a MinHash sketch of that 13-gram set, and signatures are inserted into a MinHashLSH index configured to retrieve pairs of documents whose estimated Jaccard similarity is 0.8 or higher.^[1]^[2] The implementation builds on the open-source datasketch library but rewrites the in-memory data structures so that the index fits on a single multi-core machine at trillion-token scale and is more efficient in a distributed setting.^[1]^[11]
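
The signature step can be sketched in pure Python as follows. This is an illustration of MinHash over 13-grams, not the Cerebras or datasketch code: it assumes character-level 13-grams (the text does not say whether n-grams are over characters or words), uses 128 hash functions as an arbitrary choice, and omits the LSH banding index that makes lookup scalable.

```python
import hashlib
import re
import string

NGRAM = 13      # n-gram width from the text
NUM_PERM = 128  # number of hash functions; an assumed value

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def ngrams(text: str, n: int = NGRAM) -> set:
    """Lowercase, strip punctuation, collapse whitespace, return char n-grams."""
    cleaned = re.sub(r"\s+", " ", text.lower().translate(_PUNCT_TABLE)).strip()
    if len(cleaned) < n:
        return {cleaned}
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def _hash(seed: int, token: str) -> int:
    """64-bit hash of token, salted so each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(features: set, num_perm: int = NUM_PERM) -> list:
    """Keep the minimum hash value per hash function."""
    return [min(_hash(seed, f) for f in features) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = "The quick brown fox jumps over the lazy dog near the old river bank."
    doc_b = "The quick brown fox jumps over the lazy dog near the new river bank."
    fa, fb = ngrams(doc_a), ngrams(doc_b)
    print("exact:", exact_jaccard(fa, fb))
    print("estimate:", estimated_jaccard(minhash_signature(fa), minhash_signature(fb)))
```

A pair would count as a near-duplicate when the estimate reaches the 0.8 threshold; a production pipeline avoids comparing every pair by bucketing signatures in an LSH index first.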

The deduplication runs in multiple passes: (i) building the MinHashLSH index over all signatures; (ii) querying the index to enumerate near-duplicate pairs; (iii) constructing a graph in which documents are nodes and near-duplicate pairs are edges, then computing the connected components of that graph; and (iv) within each connected component, keeping one representative document and discarding the rest. The pipeline performs deduplication both within each source and across sources, so a document that appears once in CommonCrawl and again in C4 is collapsed to a single copy.^[1]^[2]
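
Steps (iii) and (iv) amount to a connected-components computation, which can be sketched with a union-find structure. The choice of which document to keep is an assumption here (the smallest id in each component); the text only says that one representative survives.

```python
def find(parent: dict, x):
    """Return the root of x, shortening the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def dedup_by_components(doc_ids, duplicate_pairs):
    """Keep one document per connected component of the near-duplicate graph.

    Assumption: the representative is the smallest id in each component.
    """
    parent = {d: d for d in doc_ids}
    for a, b in duplicate_pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a == root_b:
            continue
        # Attach the larger root under the smaller so roots stay minimal.
        if root_a < root_b:
            parent[root_b] = root_a
        else:
            parent[root_a] = root_b
    return sorted({find(parent, d) for d in doc_ids})


if __name__ == "__main__":
    ids = [0, 1, 2, 3, 4, 5]
    pairs = [(0, 1), (1, 2), (4, 5)]  # 0-1-2 form one cluster, 4-5 another
    print(dedup_by_components(ids, pairs))
```

Working on components rather than raw pairs matters because near-duplication is not transitive under a threshold: documents 0 and 2 may fall below 0.8 with each other yet still be merged through document 1.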

The reported per-source byte-deduplication rates show that duplication is heavily concentrated in CommonCrawl and in GitHub code:

Source	Bytes removed by dedup
CommonCrawl	63.76%
C4	6.85%
GitHub	46.16%
Books	2.01%
ArXiv	0.06%
Wikipedia	2.24%
StackExchange	0.20%
Overall	49.60%

The two large numbers, 63.76% on CommonCrawl and 46.16% on GitHub, dominate the overall byte reduction. They reflect, respectively, the well-known fact that CommonCrawl contains many copies of the same canonical pages across snapshots and the equally well-known fact that public GitHub mirrors and re-publishes large swathes of identical source code through forks and vendored copies. The low number on ArXiv, 0.06%, is the operational confirmation that ArXiv is essentially deduplicated at source.^[2]

Stage 4: Interleaving and shuffling

Once duplicates are removed, the surviving documents from each source are interleaved according to target proportions and then shuffled with a two-pass shuffling algorithm adapted from The Pile to prevent residual ordering bias from the original RedPajama files. The interleaving step is what sets the final composition of the dataset by source.^[1]^[11]
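
The general idea of a two-pass shuffle for data that does not fit in memory can be sketched as below. This is a generic illustration, not The Pile's or Cerebras's code: in-memory lists stand in for the temporary files a real pipeline would write, and the function name and bucket count are assumptions.

```python
import random


def two_pass_shuffle(items, num_buckets: int, seed: int = 0) -> list:
    """Shuffle items in two passes.

    Pass 1 scatters each item into a randomly chosen bucket (a temporary
    file in a real pipeline). Pass 2 shuffles each bucket on its own, which
    only requires one bucket in memory at a time, and concatenates them.
    """
    rng = random.Random(seed)
    buckets = [[] for _ in range(num_buckets)]

    # Pass 1: random scatter.
    for item in items:
        buckets[rng.randrange(num_buckets)].append(item)

    # Pass 2: per-bucket shuffle, then concatenate.
    shuffled = []
    for bucket in buckets:
        rng.shuffle(bucket)
        shuffled.extend(bucket)
    return shuffled


if __name__ == "__main__":
    print(two_pass_shuffle(range(20), num_buckets=4, seed=42))
```

The number of buckets trades memory for file handles: more buckets mean smaller per-bucket shuffles in the second pass.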

Stage 5: Train/holdout split

The shuffled corpus is split into a training set, a 500-million-token validation set, and a 500-million-token test set. The two holdouts are drawn before the final deduplication pass against the training set so that downstream evaluation is not contaminated by leakage from training.^[1]^[2]

Stage 6: Train-against-holdout deduplication

Finally, Cerebras runs a second MinHashLSH pass between the training shards and the held-out validation and test shards so that any near-duplicate of a held-out document is removed from training. The result is a 627-billion-token training set with decontaminated 500-million-token validation and test sets.^[1]^[2]
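
The decontamination rule can be illustrated with a brute-force version that compares exact Jaccard similarity instead of MinHashLSH estimates. It is a sketch for small inputs only: it assumes character-level 13-grams and the same cleaning as the deduplication stage, and all names are hypothetical.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def features(text: str, n: int = 13) -> set:
    """Character n-grams of the lowercased, punctuation-free text."""
    cleaned = re.sub(r"\s+", " ", text.lower().translate(_PUNCT_TABLE)).strip()
    if len(cleaned) < n:
        return {cleaned}
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


def decontaminate(train_docs, holdout_docs, threshold: float = 0.8) -> list:
    """Drop training documents that are near-duplicates of any holdout document.

    Brute force (every train doc against every holdout doc); a scalable
    pipeline would query an LSH index instead.
    """
    holdout_features = [features(doc) for doc in holdout_docs]
    kept = []
    for doc in train_docs:
        doc_features = features(doc)
        if all(jaccard(doc_features, h) < threshold for h in holdout_features):
            kept.append(doc)
    return kept
```

Note the direction of removal: the holdout sets stay fixed and the training set shrinks, so evaluation data is never altered to fit the training data.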

Engineering metrics

The release notes report that the full pipeline runs on the 1.21-trillion-token RedPajama corpus in approximately 2.5 days on a single 64-core CPU machine, with peak memory consumption of approximately 1.4 terabytes during the duplicate-pair generation phase. Cerebras emphasizes that, to the best of the authors' knowledge, the released tooling is the first open implementation able to clean and deduplicate text at trillion-token scale on commodity hardware, and that it uses a producer-consumer schema in I/O-bound stages to keep memory usage bounded.^[1]^[11]

Composition by source

After deduplication and short-document filtering, the SlimPajama-627B training set has the following composition by source. The two columns show the SlimPajama proportions alongside the RedPajama proportions for reference; the shift is the consequence of CommonCrawl losing most of its bytes to deduplication while C4, Books, ArXiv, Wikipedia, and StackExchange lose comparatively little.^[2]

Source	SlimPajama-627B	RedPajama-1T
CommonCrawl	52.2%	72.6%
C4	26.7%	14.4%
GitHub	5.2%	4.9%
Books	4.2%	2.1%
ArXiv	4.6%	2.3%
Wikipedia	3.8%	2.0%
StackExchange	3.3%	1.7%

CommonCrawl remains the single largest source but loses substantial share relative to RedPajama because of its high intra-source and cross-source duplication. C4 takes a much larger share even though it loses a small absolute amount of bytes, because the reduction in CommonCrawl is so much larger. GitHub remains essentially constant in share even after losing nearly half its bytes, because the GitHub partition is small relative to the web shards. Books, ArXiv, Wikipedia, and StackExchange roughly double their share because they were nearly free of duplicates to begin with and so were almost untouched by the dedup pass.^[2]
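
The shift in shares can be checked with a back-of-the-envelope calculation that scales each RedPajama share by the fraction of bytes surviving deduplication and renormalizes. This is only an approximation under stated assumptions: it mixes token shares with byte-level removal rates and ignores the short-document filter.

```python
# RedPajama-1T shares (%) and per-source dedup removal rates (%) from the tables above.
redpajama_share = {
    "CommonCrawl": 72.6, "C4": 14.4, "GitHub": 4.9, "Books": 2.1,
    "ArXiv": 2.3, "Wikipedia": 2.0, "StackExchange": 1.7,
}
dedup_removed = {
    "CommonCrawl": 63.76, "C4": 6.85, "GitHub": 46.16, "Books": 2.01,
    "ArXiv": 0.06, "Wikipedia": 2.24, "StackExchange": 0.20,
}
reported_slimpajama_share = {
    "CommonCrawl": 52.2, "C4": 26.7, "GitHub": 5.2, "Books": 4.2,
    "ArXiv": 4.6, "Wikipedia": 3.8, "StackExchange": 3.3,
}


def estimated_shares(shares: dict, removed_pct: dict) -> dict:
    """Scale each share by its surviving fraction, then renormalize to 100%."""
    surviving = {s: shares[s] * (1 - removed_pct[s] / 100) for s in shares}
    total = sum(surviving.values())
    return {s: 100 * v / total for s, v in surviving.items()}


if __name__ == "__main__":
    estimate = estimated_shares(redpajama_share, dedup_removed)
    for source, value in estimate.items():
        print(f"{source:14s} estimated {value:5.1f}%  reported {reported_slimpajama_share[source]:5.1f}%")
```

Under these assumptions the estimate lands within a few tenths of a percentage point of the reported SlimPajama column for every source, which is consistent with deduplication alone explaining almost all of the change in composition.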

The licensing of SlimPajama is inherited rather than uniform. The dataset card explicitly directs users to the Common Crawl Foundation terms for CommonCrawl, the C4 license for C4, the GitHub partition's permissive-only filter (MIT, BSD, or Apache repositories), the licenses of The Pile and PG-19 for the Books partition, the ArXiv Terms of Use for ArXiv, the relevant Wikipedia license for Wikipedia, and the StackExchange Internet Archive license for StackExchange. The Cerebras preprocessing code itself is released under the Apache 2.0 license.^[2]^[11]

The released files use a uniform JSONL schema in which each document is an object with a text field containing the document content and a meta field containing a redpajama_set_name value identifying its origin (RedPajamaCommonCrawl, RedPajamaC4, RedPajamaGithub, and so on). This makes it trivial for downstream pipelines to filter, upsample, or re-mix by source without re-running the full SlimPajama pipeline.^[2]
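
Filtering by source with this schema could look like the sketch below. The sample records are invented for illustration and only use the three `redpajama_set_name` values named above; the helper names are hypothetical, and a real pipeline would read lines from the released files rather than from a string.

```python
import json
from collections import Counter

# Invented sample records in the schema described above.
SAMPLE_JSONL = """\
{"text": "def add(a, b): return a + b", "meta": {"redpajama_set_name": "RedPajamaGithub"}}
{"text": "A crawled web page.", "meta": {"redpajama_set_name": "RedPajamaCommonCrawl"}}
{"text": "Another crawled page.", "meta": {"redpajama_set_name": "RedPajamaCommonCrawl"}}
{"text": "A cleaned C4 document.", "meta": {"redpajama_set_name": "RedPajamaC4"}}
"""


def iter_documents(lines):
    """Yield one parsed document per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_source(docs, source: str) -> list:
    """Keep only documents whose origin matches source."""
    return [d for d in docs if d["meta"]["redpajama_set_name"] == source]


def count_by_source(docs) -> Counter:
    """Count documents per origin, e.g. to verify a re-mix."""
    return Counter(d["meta"]["redpajama_set_name"] for d in docs)


if __name__ == "__main__":
    docs = list(iter_documents(SAMPLE_JSONL.splitlines()))
    print(count_by_source(docs))
    print([d["text"] for d in filter_by_source(docs, "RedPajamaGithub")])
```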

BTLM-3B-8K: proof of value

The first language model explicitly trained on SlimPajama-627B was the Bittensor Language Model BTLM-3B-8K, released by Cerebras and the Opentensor Foundation in July 2023 and described in an arXiv technical report by Dey, Soboleva, Al-Khateeb and colleagues.^[3] BTLM-3B-8K was positioned as the direct empirical demonstration that fewer, cleaner tokens train better LLMs.

Despite the "3B" in its name, BTLM-3B-8K has 2.6 billion parameters in a decoder-only transformer with 32 layers, hidden size 2,560, head dimension 80, and feed-forward dimension 6,826. It uses SwiGLU feed-forward activations, ALiBi positional biases, and the maximal-update parameterization (muP) with tuned multipliers, including an embedding multiplier of 14.6 and an output-logit multiplier of 2.22.^[3] The model is trained on a single epoch of the entire SlimPajama-627B training set (627 billion tokens) in a two-phase sequence-length curriculum: 470 billion tokens at sequence length 2,048 followed by 157 billion tokens at sequence length 8,192. Training used 64 Cerebras CS-2 wafer-scale accelerators in the Condor Galaxy-1 (CG-1) cluster in Santa Clara, with data parallelism only and no tensor or pipeline parallelism; the report attributes the smoothness of the loss curve, which experienced only two minor spikes, to muP.^[3]

The downstream evaluation, reported using the EleutherAI evaluation harness in zero-shot mode (five-shot for MMLU), shows BTLM-3B-8K outperforming all contemporary 3-billion-parameter open models on common-sense reasoning, world knowledge, reading comprehension and code, and matching or surpassing several 7-billion-parameter open models including RedPajama-INCITE-7B-Base, OpenLLaMA-7B, and StableLM-Alpha-7B-v2 despite using 71% fewer training FLOPs than typical 7B runs.^[3]^[12] On long-context tasks such as QMSum and GovReports at 8,192 input length, BTLM-3B-8K exceeds the much larger MPT-7B-8K and XGen-7B-8K, which Cerebras attributes both to the ALiBi extrapolation behavior of the model and to the absence of duplicate documents in the long-document partitions of SlimPajama.^[3]^[12]

Because BTLM-3B-8K was the first end-to-end model trained on SlimPajama with publicly reported recipes, it became the canonical reference point for the claim that the SlimPajama recipe produces a stronger model per FLOP than the unprocessed RedPajama recipe at matched scale. The BTLM-3B-8K technical report explicitly attributes its compute efficiency to the combination of ALiBi, SwiGLU, muP, and "the extensively deduplicated and cleaned SlimPajama-627B dataset."^[3]

SlimPajama-DC: data-combination ablations

A complementary investigation of how to mix SlimPajama's sources appeared in September 2023 in the arXiv preprint SlimPajama-DC: Understanding Data Combinations for LLM Training by Shen, Tao, Ma, Neiswanger, Liu, Wang, Tan, Hestness, Vassilieva, Soboleva and Xing.^[5] The DC paper trains six different 1.3-billion-parameter Cerebras-GPT-style models on six different 330-billion-token subsets of SlimPajama and then extends the strongest configuration to a 7-billion-parameter model trained with large-batch strategies on the 16-CS-2 cluster delivering 80 PFLOP/s in bfloat16 mixed precision.^[5]

The six 330B-token configurations (DC-1 through DC-6) vary the source proportions:

Source	DC-1	DC-2	DC-3	DC-4	DC-5	DC-6
CommonCrawl	100.0%	90.9%	75.8%	75.8%	75.8%	52.2%
C4	0.0%	0.0%	0.0%	0.0%	0.0%	26.7%
GitHub	0.0%	9.1%	24.2%	0.0%	9.1%	5.2%
Books	0.0%	0.0%	0.0%	0.0%	7.9%	4.2%
ArXiv	0.0%	0.0%	0.0%	0.0%	0.0%	4.6%
Wikipedia	0.0%	0.0%	0.0%	24.2%	7.3%	3.8%
StackExchange	0.0%	0.0%	0.0%	0.0%	0.0%	3.3%

DC-6 reproduces the SlimPajama-627B native proportions. The authors report that DC-6 attains the highest average accuracy across all SlimPajama configurations, with an average score of 40.0 over their evaluation suite, while a separately trained 330B configuration on RefinedWeb (DC-7) reaches 41.0. The headline conclusions are that (i) after rigorous global deduplication across sources, increasing diversity of sources becomes more important to downstream accuracy than further increasing the size of any single source, and (ii) the native SlimPajama mix is competitive with the alternative web-only RefinedWeb mix at the same compute budget.^[5] The 7B-parameter extension validates the trend at larger scale and confirms that the relative ordering of configurations is consistent across the 1.3B and 7B regimes.^[5]

In the same way that BTLM-3B-8K is the canonical demonstration that SlimPajama trains a better LLM than RedPajama at matched FLOPs, SlimPajama-DC is the canonical demonstration that source diversity becomes more, not less, important once a corpus has been globally deduplicated. The two artifacts together turn SlimPajama into a study substrate rather than a one-off corpus.

Downstream adoption

By late 2023 and into 2024 SlimPajama had become one of the standard open pre-training corpora for academic and open-source efforts that wanted a high-quality multi-source mix without running their own deduplication pipeline. Notable adopters include the following.

TinyLlama. The TinyLlama project trained a 1.1-billion-parameter decoder-only model on roughly 3 trillion tokens, sampled at a ratio of approximately 7:3 from SlimPajama (natural language) and StarCoderData (source code). The TinyLlama paper cites SlimPajama (Soboleva et al., 2023) as its natural-language data source and notes that the deduplicated nature of SlimPajama is what makes a multi-epoch 3T-token regime over a 627B-token corpus a reasonable choice without overfitting on duplicates.^[4]
Crystal / LLM360. The Crystal 7B language model, released as part of the LLM360 open community-driven LLM initiative, was trained on a mix of SlimPajama and StarCoderData, with SlimPajama serving as its English text source.^[13]
SlimPajama-DC. The MBZUAI-led 1.3B and 7B ablation models discussed above are themselves a class of SlimPajama-trained models and have been re-used by other researchers as off-the-shelf baselines for studying data mixing.^[5]^[14]
SlimLM. SlimLM, an efficient small language model targeted at on-device document assistance, was pre-trained on SlimPajama-627B and fine-tuned for summarization, question answering and suggestion tasks on a document-assistance dataset called DocAssist.^[15]

A subtler form of adoption is by reference: open-data initiatives such as Dolma, FineWeb and DCLM cite SlimPajama as one of the canonical multi-source baselines against which to compare new English pre-training corpora, and the SlimPajama deduplication recipe (MinHashLSH on 13-grams with Jaccard threshold 0.8) is now a frequently used reference configuration in pipelines for new corpora.^[2]^[5]

Conversely, some prominent open models that use RedPajama-derived training data did not adopt SlimPajama. The Together AI RedPajama-INCITE models were trained directly on RedPajama-Data-1T because they were trained alongside that corpus' release. MosaicML's MPT-7B, released in May 2023 before SlimPajama existed, used a custom MosaicML mix that includes RedPajama elements but is not identical to either RedPajama or SlimPajama; subsequent MPT variants have used a variety of mixes.^[16] These cases illustrate that SlimPajama supplements, rather than uniformly replaces, the original RedPajama corpus in the open-LLM ecosystem.

Comparison with related corpora

SlimPajama belongs to a family of openly released multi-source English pre-training corpora that emerged in 2022 to 2024. The most relevant peers are RedPajama (the parent corpus), The Pile, RefinedWeb, Dolma, FineWeb, and DCLM.

Corpus	Approximate size	Multi-source?	Deduplication	Open code
The Pile	~825 GiB	Yes (22 sources)	Source-level only	Yes
RedPajama-Data-1T	1.21T tokens	Yes (LLaMA mix)	Lightweight	Yes
SlimPajama-627B	627B tokens	Yes (LLaMA mix)	MinHashLSH, J=0.8	Yes
RefinedWeb	~600B tokens (released subset)	No (web only)	URL + MinHash + fuzzy	Yes
Dolma	3T tokens	Yes	URL + MinHash	Yes
FineWeb	15T tokens	No (web only)	URL + MinHash	Yes
DCLM-Baseline	4T tokens	Mostly web	Multi-stage	Yes

Within this family SlimPajama occupies a specific niche: it is multi-source (matching the LLaMA mix) and aggressively deduplicated, but at 627B tokens it is much smaller than the more recent web-only corpora such as FineWeb and DCLM-Baseline. Its appeal for academic and small-team training runs is precisely that it is small enough to be tractable on modest hardware and yet diverse enough to mirror the LLaMA-style training mix.^[1]^[2]^[5]

Compared to RedPajama, SlimPajama is roughly half the size in bytes but is widely reported (by Cerebras and by independent groups including the SlimPajama-DC authors) to produce stronger downstream language models at matched training tokens, with the benefit attributed primarily to deduplication.^[1]^[3]^[5] Compared to The Pile, SlimPajama follows the LLaMA mix rather than EleutherAI's 22-source mix and includes substantially more web content and less academic and technical content; Cerebras' own preprocessing pipeline draws explicitly on The Pile's two-pass shuffling algorithm.^[1]^[11] Compared to RefinedWeb and FineWeb, SlimPajama retains non-web sources such as Books, ArXiv, Wikipedia and StackExchange, which makes it more suitable for models that need explicit exposure to long-form academic and reference text rather than only crawled web text.^[17]^[18]

Significance

SlimPajama has three durable contributions to the open-LLM ecosystem.

First, the dataset itself: a 627-billion-token, multi-source, globally deduplicated, openly licensed corpus that is small enough to be trained on by single-cluster academic groups and large enough to train models in the 1B-to-7B-parameter range to competitive quality. It has become a stable reference corpus for projects ranging from BTLM-3B-8K and TinyLlama to SlimLM, Crystal and the SlimPajama-DC ablation series.^[1]^[3]^[4]^[13]^[15]

Second, the open-source tooling for cleaning and MinHashLSH deduplication at trillion-token scale, including a re-implementation of MinHashLSH on top of datasketch engineered to fit in memory on a 64-core machine and to run end-to-end in about 2.5 days. Because the code is released under Apache 2.0 in the Cerebras/modelzoo repository, subsequent corpora can adopt the same dedup recipe with the same parameters (13-gram signatures, Jaccard threshold 0.8) and report comparable numbers. This is reflected in the fact that 13-gram MinHash with J=0.8 has become a common reference configuration in the documentation of newer corpora.^[1]^[11]^[17]^[18]

Third, an empirical case for the "fewer-but-cleaner" position. The BTLM-3B-8K release demonstrates a 3-billion-parameter model trained on a single epoch of SlimPajama that matches several 7-billion-parameter open models on standard benchmarks, and the SlimPajama-DC paper demonstrates that within SlimPajama itself, source diversity matters more than source size once global deduplication has been performed. Together these two results have been cited in subsequent open-data work as a primary justification for investing in aggressive deduplication and source diversity rather than in larger, lightly cleaned corpora.^[3]^[5]

Limitations and criticisms

Several limitations of SlimPajama have been noted in primary and secondary sources.

English-only. SlimPajama is essentially an English corpus: it inherits its content from RedPajama, which itself follows the LLaMA English-data recipe. It is therefore not directly suitable for multilingual pre-training without supplementation. The Cerebras blog post and dataset card describe SlimPajama as an English dataset and make no claims of multilingual coverage.^[1]^[2]
Bound to the LLaMA-1 mix. SlimPajama's source list (CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange) and source ratios derive from the LLaMA paper. Subsequent advances in pre-training data, including the rise of code-heavy mixes, math-heavy mixes, and synthetic instruction data, are not represented. SlimPajama is therefore best understood as a high-quality reproduction of the 2023 LLaMA-style corpus, not as a state-of-the-art current corpus for frontier models.^[2]^[6]
Smaller than current web-only corpora. At 627B tokens, SlimPajama is roughly 20 to 25 times smaller than FineWeb (~15T tokens) and 6 to 7 times smaller than DCLM-Baseline (~4T tokens). For training compute-optimal models at the 7B-parameter range or above at modern token-per-parameter ratios, SlimPajama alone is too small; recent practice typically combines SlimPajama with code corpora (such as StarCoderData) and additional web corpora to reach multi-trillion-token budgets.^[4]^[17]^[18]
Per-source licensing complexity. The dataset is not released under a single permissive license. Each source partition carries its own upstream license terms (Common Crawl Foundation, C4, GitHub permissive-only, The Pile/PG-19 books, ArXiv ToU, Wikipedia license, StackExchange Internet Archive license), so commercial users must evaluate license compatibility partition by partition.^[2]
Document-level dedup leaves substring duplication. MinHashLSH at the document level with a Jaccard threshold of 0.8 captures near-duplicate documents but does not specifically target shorter repeated substrings within otherwise distinct documents. Subsequent research (for example, on FineWeb and DCLM) has explored substring-level and semantic-level deduplication on top of document-level dedup; SlimPajama does not include such layers.^[17]^[18]
No model-based quality filtering. Beyond the 200-character minimum and the source-specific quality filters inherited from RedPajama, SlimPajama does not apply additional learned-classifier or model-based quality filtering of the kind used in RefinedWeb, FineWeb-Edu, or DCLM-Baseline. The dataset is therefore "cleaner than RedPajama" but not "filtered for educational quality" in the sense of those later corpora.^[17]^[18]

These limitations are not so much defects as design choices: SlimPajama is explicitly the deduplicated and minimally filtered cousin of RedPajama, not an attempt to apply every subsequent data-curation technique. They do, however, mark out where the corpus is and is not appropriate to use in 2026.

See also

RedPajama: the parent 1.21-trillion-token corpus from Together AI.
The Pile: the EleutherAI 22-source corpus that influenced SlimPajama's shuffling design.
C4 (Colossal Clean Crawled Corpus): a component of both RedPajama and SlimPajama.
Common Crawl: the underlying web archive for the largest SlimPajama partition.
RefinedWeb: the contemporaneous web-only deduplicated corpus from TII Falcon.
FineWeb: a 2024 large-scale web-only corpus from Hugging Face.
DCLM (DataComp for Language Models): a 2024 benchmark and corpus for data curation.
Dolma: a 2024 3T-token multi-source corpus from AI2.
Chinchilla and Chinchilla scaling laws: the compute-optimal motivation for caring about token quality.
Scaling Laws for Neural Language Models: the underlying scaling-law framework.
muP (Maximal Update Parametrization), ALiBi, SwiGLU: the three architectural ingredients used together with SlimPajama in BTLM-3B-8K.
Falcon (language model): a contemporary open 7B and 40B model trained on RefinedWeb.

References

1. Soboleva, Daria; Al-Khateeb, Faisal; Myers, Robert; Steeves, Jacob Robert; Hestness, Joel; Dey, Nolan. "SlimPajama: A 627B token, cleaned and deduplicated version of RedPajama", Cerebras Systems blog, 2023-06-09. https://www.cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama. Accessed 2026-05-20.
2. Cerebras Systems. "cerebras/SlimPajama-627B dataset card", Hugging Face Hub, 2023-06-09. https://huggingface.co/datasets/cerebras/SlimPajama-627B. Accessed 2026-05-20.
3. Dey, Nolan; Soboleva, Daria; Al-Khateeb, Faisal; Yang, Bowen; Pathria, Ribhu; Khachane, Hemant; Muhammad, Shaheer; Chen, Zhiming; Myers, Robert; Steeves, Jacob Robert; Vassilieva, Natalia; Tom, Marvin; Hestness, Joel. "BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model", arXiv:2309.11568, 2023-09-20. https://arxiv.org/abs/2309.11568. Accessed 2026-05-20.
4. Zhang, Peiyuan; Zeng, Guangtao; Wang, Tianduo; Lu, Weizhu. "TinyLlama: An Open-Source Small Language Model", arXiv:2401.02385, 2024-01-04. https://arxiv.org/abs/2401.02385. Accessed 2026-05-20.
5. Shen, Zhiqiang; Tao, Tianhua; Ma, Liqun; Neiswanger, Willie; Liu, Zhengzhong; Wang, Hongyi; Tan, Bowen; Hestness, Joel; Vassilieva, Natalia; Soboleva, Daria; Xing, Eric. "SlimPajama-DC: Understanding Data Combinations for LLM Training", arXiv:2309.10818, 2023-09-19. https://arxiv.org/abs/2309.10818. Accessed 2026-05-20.
6. Together AI. "RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens", Together AI blog, 2023-04-17. https://www.together.ai/blog/redpajama. Accessed 2026-05-20.
7. Together Computer. "togethercomputer/RedPajama-Data-1T dataset", Hugging Face Hub, 2023-04. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T. Accessed 2026-05-20.
8. Lee, Katherine; Ippolito, Daphne; Nystrom, Andrew; Zhang, Chiyuan; Eck, Douglas; Callison-Burch, Chris; Carlini, Nicholas. "Deduplicating Training Data Makes Language Models Better", arXiv:2107.06499, 2021-07-14. https://arxiv.org/abs/2107.06499. Accessed 2026-05-20.
9. Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; et al. "Training Compute-Optimal Large Language Models", arXiv:2203.15556, 2022-03-29. https://arxiv.org/abs/2203.15556. Accessed 2026-05-20.
10. Kandpal, Nikhil; Wallace, Eric; Raffel, Colin. "Deduplicating Training Data Mitigates Privacy Risks in Language Models", Proceedings of ICML 2022, 2022-07. https://proceedings.mlr.press/v162/kandpal22a/kandpal22a.pdf. Accessed 2026-05-20.
11. Cerebras Systems. "Cerebras/modelzoo: SlimPajama data preparation scripts", GitHub repository, 2023-06. https://github.com/Cerebras/modelzoo/tree/main/src/cerebras/modelzoo/data_preparation/nlp/slimpajama. Accessed 2026-05-20.
12. Cerebras Systems. "BTLM-3B-8K: 7B Performance in a 3 Billion Parameter Model", Cerebras Systems blog, 2023-07-24. https://www.cerebras.ai/blog/btlm-3b-8k-7b-performance-in-a-3-billion-parameter-model. Accessed 2026-05-20.
13. LLM360 project. "LLM360: Open-Source LLMs towards Community-Driven AGI", project site, 2023-12. https://www.llm360.ai/. Accessed 2026-05-20.
14. MBZUAI-LLM. "SlimPajama-DC datasets and models", Hugging Face Hub, 2023-09. https://huggingface.co/MBZUAI-LLM/SlimPajama-DC. Accessed 2026-05-20.
15. SlimLM authors. "SlimLM: An Efficient Small Language Model for On-Device Document Assistance", arXiv:2411.09944, 2024-11. https://arxiv.org/abs/2411.09944. Accessed 2026-05-20.
16. MosaicML / Databricks. "Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs", Databricks blog, 2023-05-05. https://www.databricks.com/blog/mpt-7b. Accessed 2026-05-20.
17. Penedo, Guilherme; Malartic, Quentin; Hesslow, Daniel; Cojocaru, Ruxandra; Cappelli, Alessandro; Alobeidli, Hamza; Pannier, Baptiste; Almazrouei, Ebtesam; Launay, Julien. "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only", arXiv:2306.01116, 2023-06-01. https://arxiv.org/abs/2306.01116. Accessed 2026-05-20.
18. Penedo, Guilherme; Kydlicek, Hynek; Lozhkov, Anton; Mitchell, Margaret; Raffel, Colin; Werra, Leandro von; Wolf, Thomas. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale", arXiv:2406.17557, 2024-06-25. https://arxiv.org/abs/2406.17557. Accessed 2026-05-20.
