RedPajama is a family of large-scale open datasets for training large language models (LLMs), developed by Together AI in collaboration with academic and open-source research groups. First released in April 2023, the project aimed to create a fully transparent and openly licensed reproduction of the training data used by Meta AI's LLaMA model. The project has since expanded into one of the largest publicly available pretraining corpora in existence, with RedPajama-Data-v2 providing over 30 trillion tokens drawn from 84 Common Crawl snapshots in five languages. RedPajama datasets have been widely adopted by the open-source AI community and serve as foundational training data for several production language models, including Snowflake Arctic, Salesforce XGen, AI2's OLMo, and Apple's OpenELM.
The RedPajama paper was accepted to the Datasets and Benchmarks Track at NeurIPS 2024, formally documenting the dataset's design, quality signals, and downstream impact.
The release of Meta AI's LLaMA models in February 2023 represented a turning point for open-source language model development. LLaMA demonstrated that smaller, well-trained models could match or exceed the performance of much larger models when given sufficient high-quality training data. However, while Meta released the model weights (initially under a research-only license), the training dataset itself remained proprietary. This created a gap: researchers and developers could study the model architecture but could not reproduce the training pipeline or build upon the same data foundation.
Together AI, a company founded in June 2022 by Vipul Ved Prakash, Ce Zhang, Chris Ré, and Percy Liang, set out to address this gap. The company's core mission centers on making AI development more open and accessible by reducing the concentration of training resources among a small number of well-funded organizations. Together AI assembled a coalition of academic and research partners to build an open reproduction of the LLaMA training data recipe. This coalition included Ontocord.ai, the ETH Zurich DS3Lab, Stanford's Center for Research on Foundation Models (CRFM), Stanford's Hazy Research group, and Mila (the Quebec AI Institute).
The project was named "RedPajama" as a playful nod to LLaMA, referencing the children's book *Llama Llama Red Pajama*.
RedPajama-Data-v1 (also known as RedPajama-Data-1T) was released on April 17, 2023. It is a 1.2 trillion token dataset that closely follows the data recipe described in the original LLaMA paper by Touvron et al. (2023). The dataset was designed to replicate both the composition and scale of LLaMA's training corpus, using only publicly available data sources. The full dataset occupies approximately 5 TB of disk space when uncompressed (roughly 3 TB compressed) and is hosted on Hugging Face.
RedPajama-v1 draws from seven distinct data sources, each processed with filters tuned to approximate the token counts reported in the LLaMA paper.
| Data Source | Tokens (Billions) | Percentage | Description |
|---|---|---|---|
| Common Crawl | 878 | 72.6% | Five web crawl snapshots (2019-30, 2020-05, 2021-04, 2022-05, 2023-06) processed through the CCNet pipeline |
| C4 | 175 | 14.5% | The Colossal Clean Crawled Corpus (c4_en variant from Allen AI) |
| GitHub | 59 | 4.9% | Public repositories under permissive licenses (Apache, BSD, MIT) |
| ArXiv | 28 | 2.3% | LaTeX source files from scientific preprints |
| Books | 26 | 2.1% | Project Gutenberg (PG19 subset) with near-duplicate removal |
| Wikipedia | 24 | 2.0% | Dumps from June to August 2022, covering 20 languages |
| Stack Exchange | 20 | 1.7% | Dumps from the 28 largest Stack Exchange sites |
| Total | 1,210 | 100% | |
Each data source underwent specific preprocessing and quality filtering steps:
Common Crawl. The five snapshots were processed using the CCNet pipeline developed by Meta AI. This pipeline applies language identification, deduplication, and quality classification based on a perplexity score computed using a language model trained on Wikipedia. Documents are sorted into "head," "middle," and "tail" buckets by perplexity. The head and middle buckets, which contain text that is more similar to Wikipedia in style and quality, were retained. A fastText classifier trained on Wikipedia reference pages provided additional filtering to remove low-quality content.
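The head/middle/tail bucketing can be sketched as follows. This is a minimal sketch: the cutoff values below are illustrative placeholders, since CCNet derives the actual thresholds per language and per snapshot from the perplexity distribution.

```python
def bucket_by_perplexity(perplexity, head_cutoff, tail_cutoff):
    """Assign a CCNet-style quality bucket from a document's perplexity
    under the Wikipedia-trained language model. Lower perplexity means
    the text is more Wikipedia-like."""
    if perplexity <= head_cutoff:
        return "head"
    elif perplexity <= tail_cutoff:
        return "middle"
    return "tail"

def keep_document(perplexity, head_cutoff=320.0, tail_cutoff=650.0):
    """RedPajama-v1 retains only head and middle documents.
    The default cutoffs here are invented for illustration."""
    return bucket_by_perplexity(perplexity, head_cutoff, tail_cutoff) != "tail"
```

Sorting by a single perplexity score keeps the pipeline cheap, at the cost of favoring Wikipedia-style prose over other legitimate registers.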
C4. The C4 portion was sourced directly from Allen AI's hosted version on Hugging Face. C4 was originally created by Google for training the T5 model by applying a set of heuristic filters to an April 2019 Common Crawl snapshot.
GitHub. Code data was collected from public repositories licensed under Apache, BSD, or MIT licenses. Filtering removed low-quality files based on file length and the proportion of alphanumeric characters, and retained only files whose extensions appeared on a whitelist covering more than 50 programming languages.
ArXiv. LaTeX source files were preprocessed to remove preambles, comments, and bibliography sections, and to expand user-defined macros inline, retaining the core scientific text and mathematical notation.
Books. The PG19 subset of Project Gutenberg was used, with SimHash-based near-duplicate removal applied. The Books3 corpus from The Pile was initially included but was later removed due to copyright concerns.
Wikipedia. Dumps spanning 20 languages were preprocessed to strip hyperlinks, HTML comments, and formatting boilerplate, leaving clean encyclopedic text.
Stack Exchange. Data was extracted from dumps of the 28 largest Stack Exchange websites. HTML tags were removed, and answers were ranked by their community score to prioritize higher-quality content.
RedPajama-v1 saw rapid adoption after its release. According to Together AI, the dataset was downloaded more than 190,000 times in its first year. It became one of the most widely used open pretraining corpora and served as the training data for several notable models, including OpenLLaMA and the RedPajama-INCITE family.
RedPajama-Data-v2 was released on October 30, 2023, and represents a major expansion in both scale and design philosophy compared to v1. Where v1 attempted to replicate a specific training recipe, v2 takes a fundamentally different approach: it provides a massive pool of minimally processed web data alongside rich quality annotations, allowing researchers to construct their own custom filtered subsets.
At the time of its release, RedPajama-Data-v2 was the largest publicly available dataset specifically designed for LLM pretraining. The raw dataset contains over 100 billion text documents totaling more than 100 trillion tokens, sourced from 84 Common Crawl snapshots spanning 2014 through April 2023. After deduplication and filtering to the head and middle quality partitions, the dataset provides approximately 30.4 trillion tokens across 20.8 billion documents.
Unlike v1, which focused primarily on English, v2 covers five languages: English (the largest share of documents), German, French, Spanish, and Italian.
The core processing step is the CCNet pipeline, chosen for its lightweight approach that preserves as much information as possible from the raw data. Each Common Crawl snapshot passes through the pipeline's standard stages: paragraph-level deduplication, fastText-based language identification, and perplexity scoring with the Wikipedia-trained language model, which assigns each document to a head, middle, or tail quality bucket.
The full dataset (head + middle + tail) contains 113.3 billion documents with 123.7 trillion tokens. The head and middle partition alone contains 32.8 billion documents with 50.7 trillion tokens before deduplication, and 20.8 billion documents with 30.4 trillion tokens after deduplication.
Documents for each Common Crawl snapshot are partitioned into 5,000 shards, with filenames encoding the shard number, document language, and perplexity bucket.
A defining feature of RedPajama-Data-v2 is its set of 46 pre-computed quality annotations for every document. Rather than making opinionated filtering decisions that discard data permanently, the project computes and distributes these signals so that downstream users can apply their own filtering strategies. The quality signals fall into five categories:
Natural Language Indicators. These heuristics measure how closely a document resembles well-formed natural language. Specific signals include word count, sentence count, mean word length, the fraction of words in all capitals, the fraction of lines ending with an ellipsis, the ratio of unique words to total words, and the presence of terminal punctuation.
Repetitiveness Signals. Repetitive content is a known contributor to language model degeneration during training. These signals measure the character fraction occupied by the most frequent n-grams (for n = 2, 3, 4) and duplicated n-grams (for n = 5 through 10), helping identify documents with excessive boilerplate or template-generated text.
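As an illustration, the fraction of characters covered by the most frequent word 2-gram might be computed along these lines. This is a sketch; the exact signal definitions in the released pipeline may differ.

```python
from collections import Counter

def top_ngram_char_fraction(text, n):
    """Fraction of the document's word characters covered by the single
    most frequent word n-gram. High values indicate boilerplate or
    template-generated text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    gram, count = Counter(ngrams).most_common(1)[0]
    covered = count * sum(len(w) for w in gram)
    total = sum(len(w) for w in words)
    return covered / total
```

A spammy string like `"buy now buy now buy now ..."` scores far higher than ordinary prose of the same length, which is exactly what a downstream filter would threshold on.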
Content-Based Signals. These flags identify potentially problematic content. They include word density scores computed against the LDNOOBW blocklist for detecting NSFW material and the UT1 blocklist for flagging URLs associated with specific domain categories.
ML-Based Heuristics. Several machine learning classifiers provide quality estimates. FastText unigram classifiers distinguish between unfiltered RedPajama-v2 text and high-quality reference domains such as Wikipedia, Wikipedia-referenced websites, books, and OpenWebText (for English). For non-English languages, Wikipedia alone serves as the reference domain. Additionally, DSIR (Data Selection via Importance Resampling) importance weights estimate the relevance of each sample to target domains using word unigram and bigram language models.
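A DSIR-style importance weight can be sketched with raw unigram counts. In the released pipeline the n-grams are hashed into buckets; the counts, smoothing, and names below are illustrative only.

```python
import math
from collections import Counter

def dsir_log_weight(text, target_counts, source_counts, vocab_size, alpha=1.0):
    """Log importance weight: sum over the document's word unigrams of
    log p_target(w) / p_source(w), with add-alpha smoothing. A positive
    score means the document looks more like the target domain than the
    raw source pool."""
    t_total = sum(target_counts.values())
    s_total = sum(source_counts.values())
    score = 0.0
    for w in text.lower().split():
        p_t = (target_counts[w] + alpha) / (t_total + alpha * vocab_size)
        p_s = (source_counts[w] + alpha) / (s_total + alpha * vocab_size)
        score += math.log(p_t / p_s)
    return score

# Toy target (Wikipedia-like) and source (raw web) unigram counts:
target = Counter({"theorem": 8, "the": 20})
source = Counter({"click": 8, "the": 20})
```

Documents can then be resampled with probability proportional to the exponentiated weights, which is the "importance resampling" half of DSIR.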
Deduplication Signals. MinHash signatures are computed at multiple Jaccard similarity thresholds (0.7, 0.8, 0.9, and 1.0) to support fuzzy deduplication at varying levels of strictness. A Bloom filter-based system identifies exact duplicates with an approximately 1% false positive rate.
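A downstream user's custom filter over these signals might look like the following. Both the field names and the thresholds are illustrative, not the dataset's exact keys.

```python
def passes_filters(signals):
    """Keep a document only if it clears a handful of quality-signal
    thresholds. Field names and cutoffs are invented for illustration."""
    return (
        signals["word_count"] >= 50
        and signals["unique_word_ratio"] >= 0.3
        and signals["top_2gram_char_frac"] <= 0.2
        and signals["nsfw_word_frac"] == 0.0
    )

docs = [
    {"word_count": 420, "unique_word_ratio": 0.55,
     "top_2gram_char_frac": 0.04, "nsfw_word_frac": 0.0},  # kept
    {"word_count": 12, "unique_word_ratio": 0.90,
     "top_2gram_char_frac": 0.00, "nsfw_word_frac": 0.0},  # too short
]
kept = [d for d in docs if passes_filters(d)]
```

Because the signals ship with the data, changing a threshold means re-running a cheap predicate like this rather than reprocessing raw Common Crawl.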
The guiding principle behind v2 is to provide data in its rawest usable form while equipping researchers with the tools to filter it according to their specific needs. This stands in contrast to datasets like C4 or FineWeb, which ship pre-filtered data and do not include the removed documents. By preserving even the "tail" quality bucket and providing granular quality signals, RedPajama-v2 enables research into data selection strategies, curriculum learning, and the relationship between data quality and model performance.
SlimPajama is a cleaned and deduplicated derivative of RedPajama-Data-v1, created by Cerebras Systems and released on June 9, 2023. The dataset reduces the original 1.21 trillion tokens down to 627 billion tokens by removing 49.6% of the data through aggressive deduplication and quality filtering. It is released under the Apache 2.0 license and is available on Hugging Face.
Cerebras applied MinHashLSH (Locality-Sensitive Hashing) with a Jaccard similarity threshold of 0.8 for near-duplicate detection. Document signatures were constructed from lower-cased 13-grams after preprocessing to remove punctuation, consecutive spaces, newlines, and tabs. Critically, deduplication was performed both within and across all seven data sources in RedPajama-v1, meaning a document appearing in both CommonCrawl and C4 would have its duplicate removed.
Additionally, 1.86% of documents were filtered as low-quality content based on having fewer than 200 characters.
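The preprocessing and similarity computation can be sketched as follows. This is a simplified sketch: the real pipeline runs MinHashLSH over these 13-gram sets rather than computing exact Jaccard similarity pairwise.

```python
import re
import string

def normalize(text):
    """SlimPajama-style normalization before n-gram construction:
    lower-case, strip punctuation, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def word_ngrams(text, n=13):
    """Set of lower-cased word 13-grams used as the document signature."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity; pairs above the 0.8 threshold are
    treated as near-duplicates."""
    return len(a & b) / len(a | b) if a | b else 0.0

def is_low_quality(text, min_chars=200):
    """The separate short-document filter (fewer than 200 characters)."""
    return len(text) < min_chars
```

MinHashLSH approximates these Jaccard comparisons in sub-quadratic time, which is what makes global deduplication across all seven sources tractable at the trillion-token scale.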
The end-to-end preprocessing pipeline required approximately 2.5 days on a 64-core CPU with a peak memory consumption of 1.4 TB. Cerebras rewrote the datasketch library implementation to enable distributed, multi-threaded, and memory-efficient processing at the trillion-token scale. These tools were open-sourced alongside the dataset.
The deduplication process did not affect all sources equally. Web-sourced data (CommonCrawl and C4) saw the largest reductions, while curated sources retained a higher proportion of their content.
| Data Source | Original (RedPajama-v1) | SlimPajama | Percentage of SlimPajama |
|---|---|---|---|
| CommonCrawl | 878B tokens | ~327B tokens | 52.2% |
| C4 | 175B tokens | ~167B tokens | 26.7% |
| GitHub | 59B tokens | ~33B tokens | 5.2% |
| ArXiv | 28B tokens | ~29B tokens | 4.6% |
| Books | 26B tokens | ~26B tokens | 4.2% |
| Wikipedia | 24B tokens | ~24B tokens | 3.8% |
| Stack Exchange | 20B tokens | ~21B tokens | 3.3% |
| Total | 1,210B tokens | 627B tokens | 100% |
SlimPajama also includes separate validation and test sets of 500 million tokens each, which have been decontaminated against the training data to support reliable evaluation.
SlimPajama demonstrated that aggressive global deduplication could significantly improve training efficiency without sacrificing model quality. Cerebras reported that models trained on SlimPajama achieve equal or better accuracy compared to models trained on the full RedPajama-v1 dataset for the same compute budget. The dataset became a popular choice for researchers who wanted a high-quality, ready-to-use pretraining corpus without needing to run their own deduplication pipeline.
The RedPajama-INCITE family of models was released by Together AI on May 5, 2023. These models were trained on the Summit supercomputer at the Oak Ridge Leadership Computing Facility (OLCF) as part of the INCITE 2023 compute grant on "Scalable Foundation Models for Transferrable Generalist AI," awarded to MILA, LAION, and EleutherAI in fall 2022. Training used the DeeperSpeed codebase developed by EleutherAI.
The release included six model variants:
| Model | Parameters | Training Tokens | Description |
|---|---|---|---|
| RedPajama-INCITE-Base-3B-v1 | 2.8B | 800B | Base pretrained model |
| RedPajama-INCITE-Instruct-3B-v1 | 2.8B | 800B + fine-tuning | Instruction-tuned using GPT-JT recipe |
| RedPajama-INCITE-Chat-3B-v1 | 2.8B | 800B + fine-tuning | Chat model fine-tuned on OASST1 and Dolly v2.0 |
| RedPajama-INCITE-Base-7B | 6.9B | 1T | Base pretrained model |
| RedPajama-INCITE-Instruct-7B | 6.9B | 1T + fine-tuning | Instruction-tuned variant |
| RedPajama-INCITE-Chat-7B | 6.9B | 1T + fine-tuning | Chat model variant |
All models were released under the Apache 2.0 license, permitting both research and commercial use.
Training infrastructure. The 3B models were trained on 256 nodes (1,536 NVIDIA V100 GPUs), while the 7B models required 512 nodes (3,072 V100 GPUs). Because the IBM Power9 architecture of Summit was not natively supported by standard PyTorch distributions, the team compiled PyTorch from source. The V100 GPUs lack support for bfloat16, so training used fp16 precision with loss scaling. Pipeline parallelism (12-way for 3B, 6-way for 7B) was combined with 2-way tensor parallelism.
Benchmark results. The 3B instruction-tuned model scored 0.453 on HELM's 16 core scenarios, compared to 0.465 for LLaMA-7B and 0.377 for Pythia-2.8B. On the lm-evaluation-harness zero-shot benchmark suite, the 3B base model averaged 0.6662, compared to 0.6451 for Pythia-2.8B and 0.6197 for GPT-Neo-2.7B. The 7B model scored 1.0 points below Falcon-7B and 4.1 points below LLaMA-7B on HELM-classic metrics.
OpenLLaMA is a permissively licensed open-source reproduction of Meta AI's LLaMA, developed by Xinyang Geng and Hao Liu at UC Berkeley AI Research (BAIR). Released in May 2023, OpenLLaMA provides models at the 3B, 7B, and 13B parameter scales, all trained on 1 trillion tokens.
The v1 models were trained exclusively on RedPajama-Data-v1, following the original LLaMA paper's preprocessing steps and training setup: the same model architecture, context length, training steps, learning rate schedule, and optimizer. The v2 models used a mixed dataset combining the Falcon RefinedWeb dataset, the StarCoder dataset, and the Wikipedia, ArXiv, Books, and Stack Exchange subsets from RedPajama.
Both the training framework (EasyLM) and the model weights are licensed under Apache 2.0.
RedPajama data has been incorporated into the training pipelines of several additional production models:
| Model | Organization | How RedPajama Was Used |
|---|---|---|
| Snowflake Arctic | Snowflake | Used RedPajama alongside RefinedWeb, C4, and StarCoder for pretraining a 480B parameter MoE model |
| XGen | Salesforce | Included the GitHub subset from RedPajama in the code data mixture during pretraining |
| OLMo | Allen Institute for AI (AI2) | Used RedPajama data as part of the Dolma training corpus |
| OpenELM | Apple | Included RedPajama data in its pretraining data mixture |
RedPajama exists within a broader ecosystem of open pretraining datasets, each with different design goals, scales, and trade-offs.
| Dataset | Organization | Release | Total Tokens | Sources | Languages | Raw Data Available | Composite (Multi-Source) | License |
|---|---|---|---|---|---|---|---|---|
| C4 | Google | 2019 | ~156B (en) | Single Common Crawl snapshot | English (+ mC4 multilingual) | No | No | ODC-BY |
| The Pile | EleutherAI | 2020 | ~300B (825 GiB) | 22 diverse sources | Primarily English | No | Yes | MIT |
| RedPajama-v1 | Together AI | April 2023 | 1.2T | 7 sources (LLaMA recipe) | Primarily English (20 languages in Wikipedia) | No | Yes | Apache 2.0 |
| SlimPajama | Cerebras | June 2023 | 627B | 7 sources (deduplicated RedPajama-v1) | Primarily English | No | Yes | Apache 2.0 |
| RedPajama-v2 | Together AI | October 2023 | 30.4T (deduplicated head+middle) | 84 Common Crawl snapshots | English, German, French, Spanish, Italian | Yes | No (web-only) | Apache 2.0 |
| Dolma | Allen Institute for AI (AI2) | 2024 | ~3T | Web, academic papers, code, books, encyclopedic | English | No | Yes | ODC-BY |
| FineWeb | Hugging Face | 2024 | ~15T | 96 Common Crawl dumps | English | No | No (web-only) | ODC-BY |
| DCLM-Pool | DataComp | 2024 | ~240T+ | Common Crawl | Multilingual | Yes | No | Various |
Scale. RedPajama-v2's 30+ trillion deduplicated tokens (and 100+ trillion raw tokens) place it among the largest open pretraining datasets ever released. Only DCLM-Pool is comparable in raw scale.
Transparency. RedPajama-v2 is one of the few datasets that provides raw, unfiltered data alongside pre-computed quality signals. Most other datasets (C4, FineWeb, The Pile) ship only the filtered output, making it impossible to study the effects of different filtering strategies on the same base data.
Composability. The 46 quality annotations in v2 allow users to construct custom filtered subsets without needing to reprocess the raw data. This modular design supports reproducible research into data curation methods.
Multi-source composition. RedPajama-v1, SlimPajama, The Pile, and Dolma all combine multiple data sources (web, code, books, academic papers), which has been shown to improve model performance across diverse downstream tasks. RedPajama-v2 and FineWeb, by contrast, focus exclusively on web data but at much larger scale.
The RedPajama NeurIPS 2024 paper includes ablation studies testing different filtering strategies on v2 data, using 468M and 1.6B parameter decoder-only transformer models.
Models trained on RedPajama-v2 filtered with Gopher rules (a set of heuristic quality filters) and fuzzy deduplication achieved the highest aggregated benchmark scores, averaging 37.6 normalized accuracy across 13 evaluation tasks. These tasks included ANLI, ARC, Winogrande, HellaSwag, LAMBADA, CoQA, MMLU, OpenBookQA, PIQA, PubMedQA, SciQ, SocialIQA, and TruthfulQA.
An interesting finding was that unfiltered RedPajama-v2 data yielded the lowest perplexity on the Paloma validation set, suggesting that the broad domain coverage of unfiltered web data provides value for general language modeling even if it underperforms on specific benchmarks.
At the 1.6B parameter scale, RedPajama-v2 filtered with the full Gopher rule set approached the quality of RefinedWeb, scoring 50.0 average accuracy compared to RefinedWeb's 52.0. On the natural-language subset of benchmarks, adding natural-language filtering on top of the Gopher rules raised accuracy to 47.9.
These results demonstrate that RedPajama-v2, when combined with appropriate filtering, can produce training data competitive with other high-quality web corpora.
Both RedPajama-v1 and v2 are distributed in JSON Lines format with shard-based partitioning. RedPajama-v2 documents follow the CCNet schema, which includes fields for the URL, download date, content digest, length metrics, source domain, language identification, perplexity scores, and quality bucket classification. Quality signals use span-level annotation structures that enable filtering at multiple granularity levels.
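Reading one shard might look like this. It is a sketch under assumptions: the gzip compression and field names such as `language` and `bucket` are illustrative stand-ins for the CCNet schema described above, not guaranteed keys.

```python
import gzip
import json

def iter_documents(path):
    """Stream documents from a gzip-compressed JSON Lines shard,
    yielding one parsed record per non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def select(path, language="en", bucket="head"):
    """Keep only documents matching a language and quality bucket
    (field names are illustrative)."""
    return [d for d in iter_documents(path)
            if d.get("language") == language and d.get("bucket") == bucket]
```

Streaming line by line keeps memory flat regardless of shard size, which matters when a single snapshot spans thousands of shards.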
The datasets are hosted on Hugging Face and can be downloaded directly or streamed; approximate sizes are:
| Dataset | Compressed Size | Uncompressed Size |
|---|---|---|
| RedPajama-v1 | ~3 TB | ~5 TB |
| RedPajama-v2 (full) | Not publicly stated | ~270 TB |
| SlimPajama | Not publicly stated | ~900 GB |
RedPajama played a significant role in the rapid growth of open-source language model development during 2023 and 2024. Before its release, researchers who wanted to train LLaMA-class models had limited options for open, high-quality pretraining data at the trillion-token scale. The Pile provided 825 GiB (roughly 300 billion tokens), and C4 offered about 156 billion tokens in English, but neither matched the 1.2 trillion token scale that LLaMA demonstrated was necessary for strong performance at the 7B-65B parameter range.
RedPajama-v1 filled this gap directly, enabling projects like OpenLLaMA to produce fully open reproductions of LLaMA using the same data recipe. SlimPajama then showed that careful deduplication could extract a more efficient training signal from the same base data. Together, these datasets lowered the barrier to entry for training competitive language models.
RedPajama-v2 pushed the frontier further by providing data at a scale previously available only to large corporations with direct access to Common Crawl processing infrastructure. The inclusion of pre-computed quality signals was particularly valuable, as computing these annotations from scratch requires substantial computational resources. By distributing these signals alongside the data, Together AI effectively subsidized the data curation step for the entire research community.
The project also contributed to a broader shift in how the AI community thinks about training data. Rather than treating pretraining corpora as fixed artifacts, RedPajama-v2's design encourages viewing data curation as an ongoing research problem. The availability of raw data with quality annotations has enabled new lines of research into data selection, curriculum learning, and the relationship between data characteristics and model capabilities.
Like all web-sourced datasets, RedPajama inherits the biases and quality issues present in internet text. Common Crawl data contains noise, duplicated content, machine-generated text, and content that may reflect societal biases. While the quality signals in v2 help mitigate some of these issues, no filtering strategy eliminates them entirely.
The Books3 component was removed from RedPajama-v1 after copyright concerns were raised, highlighting the ongoing legal uncertainty surrounding training data sourced from copyrighted materials. RedPajama-v2 sidesteps this issue by drawing exclusively from Common Crawl web data, though web-scraped content itself raises separate intellectual property questions.
The sheer size of RedPajama-v2 (approximately 270 TB uncompressed) presents practical challenges for researchers with limited storage and compute resources. While Hugging Face streaming support helps with access, many filtering and preprocessing operations still require substantial infrastructure.