RedPajama is a family of large-scale open datasets for training large language models (LLMs), developed by Together AI in collaboration with academic and open-source research groups. First released in April 2023, the project aimed to create a fully transparent and openly licensed reproduction of the training data used by Meta AI's LLaMA model. The project has since expanded into one of the largest publicly available pretraining corpora in existence, with RedPajama-Data-v2 providing over 30 trillion tokens drawn from 84 Common Crawl snapshots in five languages. RedPajama datasets have been widely adopted by the open-source AI community and serve as foundational training data for several production language models, including Snowflake Arctic, Salesforce XGen, AI2's OLMo, and Apple's OpenELM.
The RedPajama paper was accepted to the Datasets and Benchmarks Track at NeurIPS 2024, formally documenting the dataset's design, quality signals, and downstream impact.
The release of Meta AI's LLaMA models in February 2023 represented a turning point for open-source language model development. LLaMA demonstrated that smaller, well-trained models could match or exceed the performance of much larger models when given sufficient high-quality training data. However, while Meta released the model weights (initially under a research-only license), the training dataset itself remained proprietary. This created a gap: researchers and developers could study the model architecture but could not reproduce the training pipeline or build upon the same data foundation.
Together AI, a company founded in June 2022 by Vipul Ved Prakash, Ce Zhang, Chris Ré, and Percy Liang, set out to address this gap. The company's core mission centers on making AI development more open and accessible by reducing the concentration of training resources among a small number of well-funded organizations. Together AI assembled a coalition of academic and research partners to build an open reproduction of the LLaMA training data recipe. This coalition included Ontocord.ai, the ETH Zurich DS3Lab, Stanford's Center for Research on Foundation Models (CRFM), Stanford's Hazy Research group, and Mila (the Quebec AI Institute).
The project was named "RedPajama" as a playful nod to LLaMA, referencing the children's book *Llama Llama Red Pajama*.
RedPajama-Data-v1 (also known as RedPajama-Data-1T) was released on April 17, 2023. It is a 1.2 trillion token dataset that closely follows the data recipe described in the original LLaMA paper by Touvron et al. (2023). The dataset was designed to replicate both the composition and scale of LLaMA's training corpus, using only publicly available data sources. The full dataset occupies approximately 5 TB of disk space when uncompressed (roughly 3 TB compressed) and is hosted on Hugging Face.
RedPajama-v1 draws from seven distinct data sources, each processed with filters tuned to approximate the token counts reported in the LLaMA paper.
| Data Source | Tokens (Billions) | Percentage | Description |
|---|---|---|---|
| Common Crawl | 878 | 72.6% | Five web crawl snapshots (2019-30, 2020-05, 2021-04, 2022-05, 2023-06) processed through the CCNet pipeline |
| C4 | 175 | 14.5% | The Colossal Clean Crawled Corpus (c4_en variant from Allen AI) |
| GitHub | 59 | 4.9% | Public repositories under permissive licenses (Apache, BSD, MIT) |
| ArXiv | 28 | 2.3% | LaTeX source files from scientific preprints |
| Books | 26 | 2.1% | Project Gutenberg (PG19 subset) with near-duplicate removal |
| Wikipedia | 24 | 2.0% | Dumps from June to August 2022, covering 20 languages |
| Stack Exchange | 20 | 1.7% | Dumps from the 28 largest Stack Exchange sites |
| Total | 1,210 | 100% | |
Each data source underwent specific preprocessing and quality filtering steps:
Common Crawl. The five snapshots were processed using the CCNet pipeline developed by Meta AI. This pipeline applies language identification, deduplication, and quality classification based on a perplexity score computed using a language model trained on Wikipedia. Documents are sorted into "head," "middle," and "tail" buckets by perplexity. The head and middle buckets, which contain text that is more similar to Wikipedia in style and quality, were retained. A fastText classifier trained on Wikipedia reference pages provided additional filtering to remove low-quality content.
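The head/middle/tail bucketing can be sketched as follows. This is a minimal sketch: the cutoff values below are illustrative placeholders, since CCNet derives the actual thresholds per language and per snapshot from the perplexity distribution.

```python
def bucket_by_perplexity(perplexity, head_cutoff, tail_cutoff):
    """Assign a CCNet-style quality bucket from a document's perplexity
    under the Wikipedia-trained language model. Lower perplexity means
    the text is more Wikipedia-like."""
    if perplexity <= head_cutoff:
        return "head"
    elif perplexity <= tail_cutoff:
        return "middle"
    return "tail"

def keep_document(perplexity, head_cutoff=320.0, tail_cutoff=650.0):
    """RedPajama-v1 retains only head and middle documents.
    The default cutoffs here are invented for illustration."""
    return bucket_by_perplexity(perplexity, head_cutoff, tail_cutoff) != "tail"
```

Sorting by a single perplexity score keeps the pipeline cheap, at the cost of favoring Wikipedia-style prose over other legitimate registers.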
C4. The C4 portion was sourced directly from Allen AI's hosted version on Hugging Face. C4 was originally created by Google for training the T5 model by applying a set of heuristic filters to an April 2019 Common Crawl snapshot.
GitHub. Code data was collected from public repositories licensed under Apache, BSD, or MIT licenses. Filtering removed low-quality files based on file length and the proportion of alphanumeric characters, and retained only files whose extensions appeared on a whitelist covering more than 50 programming languages.
ArXiv. LaTeX source files were preprocessed to remove preambles, comments, and bibliography sections, and to expand user-defined macros inline, retaining the core scientific text and mathematical notation.
Books. The PG19 subset of Project Gutenberg was used, with SimHash-based near-duplicate removal applied. The Books3 corpus from The Pile was initially included but was later removed due to copyright concerns.
Wikipedia. Dumps spanning 20 languages were preprocessed to strip hyperlinks, HTML comments, and formatting boilerplate, leaving clean encyclopedic text.
Stack Exchange. Data was extracted from dumps of the 28 largest Stack Exchange websites. HTML tags were removed, and answers were ranked by their community score to prioritize higher-quality content.
RedPajama-v1 saw rapid adoption after its release. According to Together AI, the dataset was downloaded more than 190,000 times in its first year. It became one of the most widely used open pretraining corpora and served as the training data for several notable models, including OpenLLaMA and the RedPajama-INCITE family.
RedPajama-Data-v2 was released on October 30, 2023, and represents a major expansion in both scale and design philosophy compared to v1. Where v1 attempted to replicate a specific training recipe, v2 takes a fundamentally different approach: it provides a massive pool of minimally processed web data alongside rich quality annotations, allowing researchers to construct their own custom filtered subsets.
At the time of its release, RedPajama-Data-v2 was the largest publicly available dataset specifically designed for LLM pretraining. The raw dataset contains over 100 billion text documents totaling more than 100 trillion tokens, sourced from 84 Common Crawl snapshots spanning 2014 through April 2023. After deduplication and filtering to the head and middle quality partitions, the dataset provides approximately 30.4 trillion tokens across 20.8 billion documents.
Unlike v1, which focused primarily on English, v2 covers five languages: English (the largest share of documents), German, French, Spanish, and Italian.
The core processing step is the CCNet pipeline, chosen for its lightweight approach that preserves as much information as possible from the raw data. Each Common Crawl snapshot passes through the pipeline's standard stages: paragraph-level deduplication, fastText-based language identification, and perplexity scoring with the Wikipedia-trained language model, which assigns each document to a head, middle, or tail quality bucket.
The full dataset (head + middle + tail) contains 113.3 billion documents with 123.7 trillion tokens. The head and middle partition alone contains 32.8 billion documents with 50.7 trillion tokens before deduplication, and 20.8 billion documents with 30.4 trillion tokens after deduplication.
Documents for each Common Crawl snapshot are partitioned into 5,000 shards, with filenames encoding the shard number, document language, and perplexity bucket.
A defining feature of RedPajama-Data-v2 is its set of 46 pre-computed quality annotations for every document. Rather than making opinionated filtering decisions that discard data permanently, the project computes and distributes these signals so that downstream users can apply their own filtering strategies. The quality signals fall into five categories:
Natural Language Indicators. These heuristics measure how closely a document resembles well-formed natural language. Specific signals include word count, sentence count, mean word length, the fraction of words in all capitals, the fraction of lines ending with an ellipsis, the ratio of unique words to total words, and the presence of terminal punctuation.
Repetitiveness Signals. Repetitive content is a known contributor to language model degeneration during training. These signals measure the character fraction occupied by the most frequent n-grams (for n = 2, 3, 4) and duplicated n-grams (for n = 5 through 10), helping identify documents with excessive boilerplate or template-generated text.
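As an illustration, the fraction of characters covered by the most frequent word 2-gram might be computed along these lines. This is a sketch; the exact signal definitions in the released pipeline may differ.

```python
from collections import Counter

def top_ngram_char_fraction(text, n):
    """Fraction of the document's word characters covered by the single
    most frequent word n-gram. High values indicate boilerplate or
    template-generated text."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    gram, count = Counter(ngrams).most_common(1)[0]
    covered = count * sum(len(w) for w in gram)
    total = sum(len(w) for w in words)
    return covered / total
```

A spammy string like `"buy now buy now buy now ..."` scores far higher than ordinary prose of the same length, which is exactly what a downstream filter would threshold on.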
Content-Based Signals. These flags identify potentially problematic content. They include word density scores computed against the LDNOOBW blocklist for detecting NSFW material and the UT1 blocklist for flagging URLs associated with specific domain categories.
ML-Based Heuristics. Several machine learning classifiers provide quality estimates. FastText unigram classifiers distinguish between unfiltered RedPajama-v2 text and high-quality reference domains such as Wikipedia, Wikipedia-referenced websites, books, and OpenWebText (for English). For non-English languages, Wikipedia alone serves as the reference domain. Additionally, DSIR (Data Selection via Importance Resampling) importance weights estimate the relevance of each sample to target domains using word unigram and bigram language models.
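A DSIR-style importance weight can be sketched with raw unigram counts. In the released pipeline the n-grams are hashed into buckets; the counts, smoothing, and names below are illustrative only.

```python
import math
from collections import Counter

def dsir_log_weight(text, target_counts, source_counts, vocab_size, alpha=1.0):
    """Log importance weight: sum over the document's word unigrams of
    log p_target(w) / p_source(w), with add-alpha smoothing. A positive
    score means the document looks more like the target domain than the
    raw source pool."""
    t_total = sum(target_counts.values())
    s_total = sum(source_counts.values())
    score = 0.0
    for w in text.lower().split():
        p_t = (target_counts[w] + alpha) / (t_total + alpha * vocab_size)
        p_s = (source_counts[w] + alpha) / (s_total + alpha * vocab_size)
        score += math.log(p_t / p_s)
    return score

# Toy target (Wikipedia-like) and source (raw web) unigram counts:
target = Counter({"theorem": 8, "the": 20})
source = Counter({"click": 8, "the": 20})
```

Documents can then be resampled with probability proportional to the exponentiated weights, which is the "importance resampling" half of DSIR.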
Deduplication Signals. MinHash signatures are computed at multiple Jaccard similarity thresholds (0.7, 0.8, 0.9, and 1.0) to support fuzzy deduplication at varying levels of strictness. A Bloom filter-based system identifies exact duplicates with an approximately 1% false positive rate.
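A downstream user's custom filter over these signals might look like the following. Both the field names and the thresholds are illustrative, not the dataset's exact keys.

```python
def passes_filters(signals):
    """Keep a document only if it clears a handful of quality-signal
    thresholds. Field names and cutoffs are invented for illustration."""
    return (
        signals["word_count"] >= 50
        and signals["unique_word_ratio"] >= 0.3
        and signals["top_2gram_char_frac"] <= 0.2
        and signals["nsfw_word_frac"] == 0.0
    )

docs = [
    {"word_count": 420, "unique_word_ratio": 0.55,
     "top_2gram_char_frac": 0.04, "nsfw_word_frac": 0.0},  # kept
    {"word_count": 12, "unique_word_ratio": 0.90,
     "top_2gram_char_frac": 0.00, "nsfw_word_frac": 0.0},  # too short
]
kept = [d for d in docs if passes_filters(d)]
```

Because the signals ship with the data, changing a threshold means re-running a cheap predicate like this rather than reprocessing raw Common Crawl.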
The guiding principle behind v2 is to provide data in its rawest usable form while equipping researchers with the tools to filter it according to their specific needs. This stands in contrast to datasets like C4 or FineWeb, which ship pre-filtered data and do not include the removed documents. By preserving even the "tail" quality bucket and providing granular quality signals, RedPajama-v2 enables research into data selection strategies, curriculum learning, and the relationship between data quality and model performance.
SlimPajama is a cleaned and deduplicated derivative of RedPajama-Data-v1, created by Cerebras Systems and released on June 9, 2023. The dataset reduces the original 1.21 trillion tokens down to 627 billion tokens by removing 49.6% of the data through aggressive deduplication and quality filtering. It is released under the Apache 2.0 license and is available on Hugging Face.
Cerebras applied MinHashLSH (Locality-Sensitive Hashing) with a Jaccard similarity threshold of 0.8 for near-duplicate detection. Document signatures were constructed from lower-cased 13-grams after preprocessing to remove punctuation, consecutive spaces, newlines, and tabs. Critically, deduplication was performed both within and across all seven data sources in RedPajama-v1, meaning a document appearing in both CommonCrawl and C4 would have its duplicate removed.
Additionally, 1.86% of documents were filtered as low-quality content based on having fewer than 200 characters.
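The preprocessing and similarity computation can be sketched as follows. This is a simplified sketch: the real pipeline runs MinHashLSH over these 13-gram sets rather than computing exact Jaccard similarity pairwise.

```python
import re
import string

def normalize(text):
    """SlimPajama-style normalization before n-gram construction:
    lower-case, strip punctuation, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def word_ngrams(text, n=13):
    """Set of lower-cased word 13-grams used as the document signature."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity; pairs above the 0.8 threshold are
    treated as near-duplicates."""
    return len(a & b) / len(a | b) if a | b else 0.0

def is_low_quality(text, min_chars=200):
    """The separate short-document filter (fewer than 200 characters)."""
    return len(text) < min_chars
```

MinHashLSH approximates these Jaccard comparisons in sub-quadratic time, which is what makes global deduplication across all seven sources tractable at the trillion-token scale.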
The end-to-end preprocessing pipeline required approximately 2.5 days on a 64-core CPU with a peak memory consumption of 1.4 TB. Cerebras rewrote the datasketch library implementation to enable distributed, multi-threaded, and memory-efficient processing at the trillion-token scale. These tools were open-sourced alongside the dataset.
The deduplication process did not affect all sources equally. Web-sourced data (CommonCrawl and C4) saw the largest reductions, while curated sources retained a higher proportion of their content.
| Data Source | Original (RedPajama-v1) | SlimPajama | Percentage of SlimPajama |
|---|---|---|---|
| CommonCrawl | 878B tokens | ~327B tokens | 52.2% |
| C4 | 175B tokens | ~167B tokens | 26.7% |
| GitHub | 59B tokens | ~33B tokens | 5.2% |
| ArXiv | 28B tokens | ~29B tokens | 4.6% |
| Books | 26B tokens | ~26B tokens | 4.2% |
| Wikipedia | 24B tokens | ~24B tokens | 3.8% |
| Stack Exchange | 20B tokens | ~21B tokens | 3.3% |
| Total | 1,210B tokens | 627B tokens | 100% |
SlimPajama also includes separate validation and test sets of 500 million tokens each, which have been decontaminated against the training data to support reliable evaluation.
SlimPajama demonstrated that aggressive global deduplication could significantly improve training efficiency without sacrificing model quality. Cerebras reported that models trained on SlimPajama achieve equal or better accuracy compared to models trained on the full RedPajama-v1 dataset for the same compute budget. The dataset became a popular choice for researchers who wanted a high-quality, ready-to-use pretraining corpus without needing to run their own deduplication pipeline.
The RedPajama-INCITE family of models was released by Together AI on May 5, 2023. These models were trained on the Summit supercomputer at the Oak Ridge Leadership Computing Facility (OLCF) as part of the INCITE 2023 compute grant on "Scalable Foundation Models for Transferrable Generalist AI," awarded to MILA, LAION, and EleutherAI in fall 2022. Training used the DeeperSpeed codebase developed by EleutherAI.
The release included six model variants:
| Model | Parameters | Training Tokens | Description |
|---|---|---|---|
| RedPajama-INCITE-Base-3B-v1 | 2.8B | 800B | Base pretrained model |
| RedPajama-INCITE-Instruct-3B-v1 | 2.8B | 800B + fine-tuning | Instruction-tuned using GPT-JT recipe |
| RedPajama-INCITE-Chat-3B-v1 | 2.8B | 800B + fine-tuning | Chat model fine-tuned on OASST1 and Dolly v2.0 |
| RedPajama-INCITE-Base-7B | 6.9B | 1T | Base pretrained model |
| RedPajama-INCITE-Instruct-7B | 6.9B | 1T + fine-tuning | Instruction-tuned variant |
| RedPajama-INCITE-Chat-7B | 6.9B | 1T + fine-tuning | Chat model variant |
All models were released under the Apache 2.0 license, permitting both research and commercial use.
Training infrastructure. The 3B models were trained on 256 nodes (1,536 NVIDIA V100 GPUs), while the 7B models required 512 nodes (3,072 V100 GPUs). Because the IBM Power9 architecture of Summit was not natively supported by standard PyTorch distributions, the team compiled PyTorch from source. The V100 GPUs lack support for bfloat16, so training used fp16 precision with loss scaling. Pipeline parallelism (12-way for 3B, 6-way for 7B) was combined with 2-way tensor parallelism.
Benchmark results. The 3B instruction-tuned model scored 0.453 on HELM's 16 core scenarios, compared to 0.465 for LLaMA-7B and 0.377 for Pythia-2.8B. On the lm-evaluation-harness zero-shot benchmark suite, the 3B base model averaged 0.6662, compared to 0.6451 for Pythia-2.8B and 0.6197 for GPT-Neo-2.7B. The 7B model scored 1.0 points below Falcon-7B and 4.1 points below LLaMA-7B on HELM-classic metrics.
OpenLLaMA is a permissively licensed open-source reproduction of Meta AI's LLaMA, developed by Xinyang Geng and Hao Liu at UC Berkeley AI Research (BAIR). Released in May 2023, OpenLLaMA provides models at the 3B, 7B, and 13B parameter scales, all trained on 1 trillion tokens.
The v1 models were trained exclusively on RedPajama-Data-v1, following the original LLaMA paper's preprocessing steps and training setup: the same model architecture, context length, training steps, learning rate schedule, and optimizer. The v2 models used a mixed dataset combining the Falcon RefinedWeb dataset, the StarCoder dataset, and the Wikipedia, ArXiv, Books, and Stack Exchange subsets from RedPajama.
Both the training framework (EasyLM) and the model weights are licensed under Apache 2.0.
RedPajama data has been incorporated into the training pipelines of several additional production models:
| Model | Organization | How RedPajama Was Used |
|---|---|---|
| Snowflake Arctic | Snowflake | Used RedPajama alongside RefinedWeb, C4, and StarCoder for pretraining a 480B parameter MoE model |
| XGen | Salesforce | Included the GitHub subset from RedPajama in the code data mixture during pretraining |
| OLMo | Allen Institute for AI (AI2) | Used RedPajama data as part of the Dolma training corpus |
| OpenELM | Apple | Included RedPajama data in its pretraining data mixture |
RedPajama exists within a broader ecosystem of open pretraining datasets, each with different design goals, scales, and trade-offs.
| Dataset | Organization | Release | Total Tokens | Sources | Languages | Raw Data Available | Composite (Multi-Source) | License |
|---|---|---|---|---|---|---|---|---|
| C4 | Google | 2019 | ~156B (en) | Single Common Crawl snapshot | English (+ mC4 multilingual) | No | No | ODC-BY |
| The Pile | EleutherAI | 2020 | ~300B (825 GiB) | 22 diverse sources | Primarily English | No | Yes | MIT |
| RedPajama-v1 | Together AI | April 2023 | 1.2T | 7 sources (LLaMA recipe) | Primarily English (20 languages in Wikipedia) | No | Yes | Apache 2.0 |
| SlimPajama | Cerebras | June 2023 | 627B | 7 sources (deduplicated RedPajama-v1) | Primarily English | No | Yes | Apache 2.0 |
| RedPajama-v2 | Together AI | October 2023 | 30.4T (deduplicated head+middle) | 84 Common Crawl snapshots | English, German, French, Spanish, Italian | Yes | No (web-only) | Apache 2.0 |
| Dolma | Allen Institute for AI (AI2) | 2024 | ~3T | Web, academic papers, code, books, encyclopedic | English | No | Yes | ODC-BY |
| FineWeb | Hugging Face | 2024 | ~15T | 96 Common Crawl dumps | English | No | No (web-only) | ODC-BY |
| DCLM-Pool | DataComp | 2024 | ~240T+ | Common Crawl | Multilingual | Yes | No | Various |
Scale. RedPajama-v2's 30+ trillion deduplicated tokens (and 100+ trillion raw tokens) place it among the largest open pretraining datasets ever released. Only DCLM-Pool is comparable in raw scale.
Transparency. RedPajama-v2 is one of the few datasets that provides raw, unfiltered data alongside pre-computed quality signals. Most other datasets (C4, FineWeb, The Pile) ship only the filtered output, making it impossible to study the effects of different filtering strategies on the same base data.
Composability. The 46 quality annotations in v2 allow users to construct custom filtered subsets without needing to reprocess the raw data. This modular design supports reproducible research into data curation methods.
Multi-source composition. RedPajama-v1, SlimPajama, The Pile, and Dolma all combine multiple data sources (web, code, books, academic papers), which has been shown to improve model performance across diverse downstream tasks. RedPajama-v2 and FineWeb, by contrast, focus exclusively on web data but at much larger scale.
The RedPajama NeurIPS 2024 paper includes ablation studies testing different filtering strategies on v2 data, using 468M and 1.6B parameter decoder-only transformer models.
Models trained on RedPajama-v2 filtered with Gopher rules (a set of heuristic quality filters) and fuzzy deduplication achieved the highest aggregated benchmark scores, averaging 37.6 normalized accuracy across 13 evaluation tasks. These tasks included ANLI, ARC, Winogrande, HellaSwag, LAMBADA, CoQA, MMLU, OpenBookQA, PIQA, PubMedQA, SciQ, SocialIQA, and TruthfulQA.
An interesting finding was that unfiltered RedPajama-v2 data yielded the lowest perplexity on the Paloma validation set, suggesting that the broad domain coverage of unfiltered web data provides value for general language modeling even if it underperforms on specific benchmarks.
At the 1.6B parameter scale, RedPajama-v2 filtered with the full Gopher rule set approached the quality of RefinedWeb, scoring 50.0 average accuracy compared to RefinedWeb's 52.0. On the natural-language subset of benchmarks, adding natural-language filtering on top of the Gopher rules raised accuracy to 47.9.
These results demonstrate that RedPajama-v2, when combined with appropriate filtering, can produce training data competitive with other high-quality web corpora.
Both RedPajama-v1 and v2 are distributed in JSON Lines format with shard-based partitioning. RedPajama-v2 documents follow the CCNet schema, which includes fields for the URL, download date, content digest, length metrics, source domain, language identification, perplexity scores, and quality bucket classification. Quality signals use span-level annotation structures that enable filtering at multiple granularity levels.
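Reading one shard might look like this. It is a sketch under assumptions: the gzip compression and field names such as `language` and `bucket` are illustrative stand-ins for the CCNet schema described above, not guaranteed keys.

```python
import gzip
import json

def iter_documents(path):
    """Stream documents from a gzip-compressed JSON Lines shard,
    yielding one parsed record per non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def select(path, language="en", bucket="head"):
    """Keep only documents matching a language and quality bucket
    (field names are illustrative)."""
    return [d for d in iter_documents(path)
            if d.get("language") == language and d.get("bucket") == bucket]
```

Streaming line by line keeps memory flat regardless of shard size, which matters when a single snapshot spans thousands of shards.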
The datasets are hosted on Hugging Face and can be downloaded directly or streamed; approximate sizes are:
| Dataset | Compressed Size | Uncompressed Size |
|---|---|---|
| RedPajama-v1 | ~3 TB | ~5 TB |
| RedPajama-v2 (full) | Not publicly stated | ~270 TB |
| SlimPajama | Not publicly stated | ~900 GB |
RedPajama played a significant role in the rapid growth of open-source language model development during 2023 and 2024. Before its release, researchers who wanted to train LLaMA-class models had limited options for open, high-quality pretraining data at the trillion-token scale. The Pile provided 825 GiB (roughly 300 billion tokens), and C4 offered about 156 billion tokens in English, but neither matched the 1.2 trillion token scale that LLaMA demonstrated was necessary for strong performance at the 7B-65B parameter range.
RedPajama-v1 filled this gap directly, enabling projects like OpenLLaMA to produce fully open reproductions of LLaMA using the same data recipe. SlimPajama then showed that careful deduplication could extract a more efficient training signal from the same base data. Together, these datasets lowered the barrier to entry for training competitive language models.
RedPajama-v2 pushed the frontier further by providing data at a scale previously available only to large corporations with direct access to Common Crawl processing infrastructure. The inclusion of pre-computed quality signals was particularly valuable, as computing these annotations from scratch requires substantial computational resources. By distributing these signals alongside the data, Together AI effectively subsidized the data curation step for the entire research community.
The project also contributed to a broader shift in how the AI community thinks about training data. Rather than treating pretraining corpora as fixed artifacts, RedPajama-v2's design encourages viewing data curation as an ongoing research problem. The availability of raw data with quality annotations has enabled new lines of research into data selection, curriculum learning, and the relationship between data characteristics and model capabilities.
Like all web-sourced datasets, RedPajama inherits the biases and quality issues present in internet text. Common Crawl data contains noise, duplicated content, machine-generated text, and content that may reflect societal biases. While the quality signals in v2 help mitigate some of these issues, no filtering strategy eliminates them entirely.
The Books3 component was removed from RedPajama-v1 after copyright concerns were raised, highlighting the ongoing legal uncertainty surrounding training data sourced from copyrighted materials. RedPajama-v2 sidesteps this issue by drawing exclusively from Common Crawl web data, though web-scraped content itself raises separate intellectual property questions.
The sheer size of RedPajama-v2 (approximately 270 TB uncompressed) presents practical challenges for researchers with limited storage and compute resources. While Hugging Face streaming support helps with access, many filtering and preprocessing operations still require substantial infrastructure.