TxT360

Updated May 31, 2026

TxT360 is an open large-scale pretraining corpus for large language models, released in October 2024 by the LLM360 project, a collaboration led by Petuum and the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) ^[1]^[2]. The name stands for Trillion eXtracted Text. Its distinguishing feature is that it pulls together 99 Common Crawl web snapshots and 14 curated, high-quality non-web sources, then runs a single global deduplication pass over all of them at once. That produces roughly 5 trillion unique tokens of cleaned text, which a documented upsampling recipe expands into a corpus of more than 15 trillion tokens ^[1]^[3]. The team reported that, with that recipe, TxT360 matches or beats FineWeb on several downstream evaluation metrics ^[1].

Motivation

By 2024 the open pretraining-data landscape had split into two camps. On one side were enormous web corpora distilled from Common Crawl, such as FineWeb, RedPajama, and the C4-style cleanups, which gave models broad coverage of long-tail and everyday text but carried the noise and redundancy of the raw web. On the other side were curated collections like The Pile and Dolma that mixed in books, code, scientific papers, and reference text to raise quality, but did not deduplicate web and curated material against each other in a unified way.

TxT360 was built to close that gap. The idea is straightforward to state and hard to execute: take a very large slice of the web, add a set of trusted curated sources, deduplicate everything together so the same passage does not appear once from a crawl and again from a book or a Wikipedia mirror, and keep enough bookkeeping that a practitioner can decide afterward how heavily to weight each part. The project also treats transparency as a first-class goal, releasing the processing scripts and per-document metadata rather than only the final token stream ^[1]^[4].

Data sources

The web portion comes from 99 Common Crawl snapshots, processed into about 4.83 trillion tokens that occupy roughly 9.2 TB on disk. The most recent snapshot dates from around mid-2024 ^[3]. Alongside the web data sit 14 curated sources covering legal text, patents, encyclopedic reference, scientific literature, mathematics, forum discussion, chat logs, parliamentary proceedings, and books. The published statistics list them as ten lines, because the five paper sources (arXiv, S2ORC, PhilPapers, PubMed Abstracts, and PubMed Central) are reported together as a single "Papers" entry ^[3]^[5].

The table below lists the token counts and raw sizes published in the dataset card. Counts are given before global deduplication.

Source	Raw size	Tokens	Notes
Common Crawl (99 snapshots)	9.2 TB	4.83T	Open web text
Papers (arXiv, S2ORC, PhilPapers, PubMed Abstracts, PubMed Central)	712 GB	154.96B	Scientific literature
Wikipedia	199 GB	35.98B	310+ languages
StackExchange	81 GB	27.76B	Q&A forums
FreeLaw	71 GB	16.70B	Court opinions and legal filings
DM Math	22 GB	5.23B	Synthetic mathematics problems
USPTO	45 GB	4.95B	US patent grants
PG-19	11 GB	2.63B	Pre-1919 books from Project Gutenberg
Europarl	6.1 GB	1.96B	European Parliament proceedings
Ubuntu IRC	6.0 GB	1.89B	Chat logs
HackerNews	4.2 GB	1.05B	Forum posts and comments
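
As a quick sanity check on the figures above, the short Python sketch below adds up the per-source token counts from the table and reports how much of the total comes from the web. The numbers are copied from the table; the variable names are illustrative only.

```python
# Token counts in billions, copied from the table above (before global deduplication).
WEB_TOKENS_B = 4830.0  # Common Crawl, 4.83T

CURATED_TOKENS_B = {
    "Papers": 154.96,
    "Wikipedia": 35.98,
    "StackExchange": 27.76,
    "FreeLaw": 16.70,
    "DM Math": 5.23,
    "USPTO": 4.95,
    "PG-19": 2.63,
    "Europarl": 1.96,
    "Ubuntu IRC": 1.89,
    "HackerNews": 1.05,
}

curated_total_b = sum(CURATED_TOKENS_B.values())
grand_total_b = WEB_TOKENS_B + curated_total_b
web_share = WEB_TOKENS_B / grand_total_b

print(f"curated total: {curated_total_b:.2f}B tokens")
print(f"grand total:   {grand_total_b / 1000:.2f}T tokens")
print(f"web share:     {web_share:.1%}")
```

By this tally the curated sources add up to roughly 253 billion tokens, so the web track accounts for about 95 percent of the listed tokens.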

The processing pipeline

The web data goes through a filtering pipeline before deduplication. Text is first extracted from raw WARC records, then passed through language identification, URL-based filtering to drop blocklisted and low-value domains, repetition removal to cut boilerplate and spam, and line-level and document-level quality filters that discard pages failing heuristics for length, symbol ratios, and other signals. A privacy step removes personally identifiable information such as email addresses and IP addresses ^[5]^[6]. The cumulative effect of these filters is large: LLM360 reported that about 97.65 percent of the original raw web data was removed before the remaining text reached the deduplication stage ^[6].
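
The kinds of filters described above can be illustrated with a minimal Python sketch. It is not the TxT360 code: the thresholds, regular expressions, and function names are hypothetical, and the real pipeline includes steps (WARC extraction, language identification, URL blocklists) that are omitted here.

```python
import re
from typing import Optional

# Hypothetical thresholds for illustration; not the values used by TxT360.
MIN_WORDS = 50
MAX_SYMBOL_RATIO = 0.10

# Simplified patterns; production PII detection is considerably more careful.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def drop_repeated_lines(text: str) -> str:
    """Line-level filter: keep only the first occurrence of each non-empty line."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


def passes_quality(text: str) -> bool:
    """Document-level filter: reject short or symbol-heavy pages."""
    if len(text.split()) < MIN_WORDS:
        return False
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / max(len(text), 1) <= MAX_SYMBOL_RATIO


def scrub_pii(text: str) -> str:
    """Privacy step: replace email and IPv4 addresses with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub("<IP>", text)


def clean_document(text: str) -> Optional[str]:
    """Return the cleaned text, or None if the document is filtered out."""
    text = drop_repeated_lines(text)
    if not passes_quality(text):
        return None
    return scrub_pii(text)
```

Running the cheap rejection checks before the PII scrub means discarded pages cost no extra work, which matters when most of the input is thrown away.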

The 14 curated sources are processed with their own per-source cleaning rather than the full web pipeline, since material like court opinions, patents, and books does not carry the same kinds of web noise. After both tracks are cleaned, they are combined and deduplicated as one corpus.

Global deduplication

Deduplication is the part of TxT360 the authors stress most. Most prior datasets deduplicated each snapshot or each source on its own, which leaves cross-source and cross-snapshot copies in place. A popular article can be crawled in dozens of monthly snapshots, and the same Wikipedia paragraph can show up in a web crawl, a curated Wikipedia dump, and a third-party mirror. TxT360 instead deduplicates the entire 99-snapshot web set and all 14 curated sources together ^[1]^[2].

The system uses two passes. Exact duplicates are caught with a Bloom filter, a compact probabilistic structure that can test whether a document has been seen before without storing every document in memory. Near-duplicates are caught with MinHash, which produces a signature for each document so that texts sharing most of their content collide even when they differ in small ways. To make MinHash scale to trillions of tokens, the signature is split into bands, and each band gets its own Bloom filter; documents whose bands collide are treated as fuzzy duplicates ^[5]^[6]^[7]. Running this globally is what brings the cleaned corpus down to roughly 5 trillion unique tokens ^[1].
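
The two-pass idea can be sketched in a few dozen lines of Python. This is a toy illustration under stated assumptions, not the TxT360 implementation: the class and function names, the five-word shingles, the 32 hash functions, the 8 bands, and the Bloom filter sizes are all hypothetical, and BLAKE2b with a seed stands in for whatever hash family a production system would use.

```python
import hashlib


def hash64(data: bytes, seed: int) -> int:
    """Seeded 64-bit hash built from BLAKE2b (a stand-in for any fast hash family)."""
    salt = seed.to_bytes(8, "little")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8, salt=salt).digest(), "little")


class BloomFilter:
    """Fixed-size bit array probed at several hashed positions per item."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        for seed in range(self.num_hashes):
            yield hash64(item, seed) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def minhash_signature(text: str, num_perm: int, shingle_size: int = 5) -> list:
    """One minimum hash value per seed over the document's word shingles."""
    words = text.lower().split()
    count = max(len(words) - shingle_size + 1, 1)
    grams = {" ".join(words[i:i + shingle_size]).encode() for i in range(count)}
    return [min(hash64(g, seed) for g in grams) for seed in range(num_perm)]


class Deduplicator:
    """Exact pass via one Bloom filter, fuzzy pass via one Bloom filter per band."""

    def __init__(self, num_perm: int = 32, bands: int = 8):
        self.num_perm = num_perm
        self.rows = num_perm // bands
        self.exact = BloomFilter()
        self.band_filters = [BloomFilter() for _ in range(bands)]

    def is_duplicate(self, text: str) -> bool:
        """Return True for a probable duplicate; otherwise record the document."""
        raw = text.encode()
        if raw in self.exact:
            return True
        sig = minhash_signature(text, self.num_perm)
        keys = [
            repr(sig[i * self.rows:(i + 1) * self.rows]).encode()
            for i in range(len(self.band_filters))
        ]
        if any(key in bf for key, bf in zip(keys, self.band_filters)):
            return True
        self.exact.add(raw)
        for key, bf in zip(keys, self.band_filters):
            bf.add(key)
        return False
```

In this sketch a document is flagged as a fuzzy duplicate as soon as any single band of its signature has been seen before, so more bands with fewer rows each makes matching more lenient. Only Bloom filter bits are stored, never the documents or signatures themselves, which is what keeps memory use bounded as the corpus grows.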

The payoff for training is concrete. Duplicated text wastes compute, can cause a model to memorize repeated passages verbatim, and distorts the effective data mixture because heavily mirrored content is silently overweighted. Removing duplicates across the whole corpus gives a cleaner starting distribution and a more honest token count.

The upsampling recipe

Deduplicating down to 5 trillion unique tokens raises an obvious problem: many strong open models train on 15 trillion tokens or more, so a 5T corpus is not large enough on its own. Rather than dump the duplicates back in indiscriminately, TxT360 keeps detailed metadata about how often and where each document appeared, then uses that to rebuild a larger corpus in a controlled way.

During deduplication, every kept document records its duplicate count and the snapshots it came from. Documents are bucketed by how many times they were seen, with ranges such as 1-1 (unique), 2-5, 6-10, 11-100, 101-1000, and a top bucket reaching into the tens of millions for Common Crawl ^[3]. The published upsampling recipe is a "rehydration" strategy: documents are repeated in proportion to how often they originally occurred, on the reasoning that text appearing across many independent pages tends to be more canonical and worth seeing more than once ^[8]. Applying this weighting brings the corpus past 15 trillion tokens while letting practitioners reconstruct the original distribution, or design their own weighting, because the metadata that drives the process ships with the data ^[1]^[3]. The team also noted that mixing in a dedicated code dataset alongside TxT360 produced more stable training curves in their experiments ^[8].
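
A minimal Python sketch of bucket-based upsampling follows. The bucket boundaries mirror the ranges quoted above, with the top bucket treated as open-ended ("1001+") for simplicity. The repeat factors are hypothetical placeholders, not the published TxT360 weights, and the function names are illustrative.

```python
from bisect import bisect_right

# Lower bounds of the duplicate-count buckets; the last bucket is open-ended.
BUCKET_STARTS = [1, 2, 6, 11, 101, 1001]
BUCKET_LABELS = ["1-1", "2-5", "6-10", "11-100", "101-1000", "1001+"]

# Hypothetical repeat factors for illustration only; not the published weights.
REPEATS = {"1-1": 1, "2-5": 2, "6-10": 3, "11-100": 5, "101-1000": 8, "1001+": 10}


def bucket_for(dup_count: int) -> str:
    """Map a document's recorded duplicate count to its bucket label."""
    if dup_count < 1:
        raise ValueError("dup_count must be at least 1")
    return BUCKET_LABELS[bisect_right(BUCKET_STARTS, dup_count) - 1]


def upsampled_tokens(docs) -> int:
    """Total tokens after repeating each document by its bucket's factor.

    docs is an iterable of (token_count, dup_count) pairs for kept documents.
    """
    return sum(tokens * REPEATS[bucket_for(dups)] for tokens, dups in docs)


# Three toy documents of 100 tokens each, seen 1, 3, and 50 times originally.
sample = [(100, 1), (100, 3), (100, 50)]
print(upsampled_tokens(sample))  # 100*1 + 100*2 + 100*5 = 800
```

Going from about 5 trillion unique tokens to more than 15 trillion implies an average repeat factor of at least 3 across the corpus. Because the duplicate counts ship as metadata, a practitioner can swap in a different `REPEATS` table without rerunning deduplication.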

Comparison to other open corpora

TxT360 sits in the same family as FineWeb, RedPajama, SlimPajama, and Dolma, and it borrows ideas from each. Its closest reference point is FineWeb, the Common Crawl distillation from Hugging Face. LLM360 reported that a model trained on the 15T-token upsampled TxT360 matched or surpassed FineWeb on several evaluation metrics, including MMLU and Natural Questions, though the released materials describe the comparison qualitatively rather than publishing a full per-benchmark score table ^[1].

Corpus	Primary scope	Global cross-source dedup	Approx. scale
TxT360	Web plus 14 curated sources	Yes, web and curated together	~5T unique, 15T+ upsampled
FineWeb	Common Crawl web only	Per-snapshot	~15T tokens
RedPajama (v1)	Web plus curated, LLaMA-style mix	Per-source	~1.2T tokens
SlimPajama	Deduplicated RedPajama	Yes, across RedPajama	~627B tokens
Dolma	Web plus curated	Per-source	~3T tokens

The contrasts are matters of design rather than simple ranking. RedPajama and Dolma mix web and curated text but deduplicate within sources; SlimPajama showed how much aggressive deduplication shrinks a corpus, taking RedPajama from about 1.2 trillion tokens to roughly 627 billion; FineWeb focused on getting the web track right at very large scale. TxT360's contribution is to unify the web-plus-curated mix with a single global dedup and then publish a recipe and metadata for rebuilding scale on top of the deduplicated base ^[1]^[8].

Open-science context

TxT360 fits LLM360's broader program of fully transparent model development. The project had already released open models with unusually complete artifacts: Amber, a 7-billion-parameter model trained on 1.3 trillion tokens and shipped with 360 intermediate checkpoints; CrystalCoder, a 7-billion-parameter code-and-language model trained on 1.4 trillion tokens with 143 checkpoints; and K2, a 65-billion-parameter model released with its checkpoints, training code, logs, and data ^[9]^[10]. The recurring "360" branding signals the goal of a full-circle view of training, with data, code, checkpoints, and metrics all open.

For TxT360 specifically, LLM360 published the processing scripts, the per-document duplication metadata, and documentation of the filtering and weighting decisions, distributed under the permissive ODC-By license ^[3]^[4]. That level of disclosure lets other groups audit the pipeline, reproduce the corpus, or fork it with a different mixture, which is the project's central argument for releasing it.

Limitations

TxT360 inherits the constraints of the data it is built from. The web track is dominated by English and other high-resource languages, and even with 310-plus-language Wikipedia in the mix, low-resource language coverage is thin. The curated set is large but bounded by source cutoffs in 2023 and 2024, so the corpus does not reflect events after those dates. Aggressive filtering removes most raw web text, which improves quality but can also drop legitimate content that fails the heuristics, and quality filters trained on reference corpora can carry their own biases about what counts as good text.

The deduplication and upsampling design also leaves open questions that the project frames as research directions rather than solved problems. Probabilistic structures like Bloom filters trade a small false-positive rate for scalability, and the rehydration recipe assumes that frequently duplicated text deserves more weight, which is a reasonable default but not a proven optimum for every training objective. Because the released comparison to FineWeb is reported without a complete public score table, independent replication is needed to pin down exactly where and by how much the gains hold.
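
To make the Bloom filter trade-off concrete, the sketch below evaluates the standard textbook approximation for the false-positive rate, p ≈ (1 − e^(−kn/m))^k, where n is the number of stored items, m the number of bits, and k the number of hash functions. The function names and the example numbers are illustrative and are not taken from the TxT360 configuration.

```python
import math


def bloom_false_positive_rate(num_items: int, num_bits: int, num_hashes: int) -> float:
    """Standard approximation: p = (1 - exp(-k * n / m)) ** k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def bits_needed(num_items: int, target_rate: float) -> int:
    """Bits required at the optimal hash count: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-num_items * math.log(target_rate) / (math.log(2) ** 2))


# Example: one million items at a 1 percent target rate.
n = 1_000_000
m = bits_needed(n, 0.01)
k = round(m / n * math.log(2))  # optimal number of hash functions
print(m / n, k, bloom_false_positive_rate(n, m, k))
```

In a deduplication setting a false positive means a unique document is wrongly treated as already seen and dropped, so the rate translates directly into a small amount of lost data. The cost of lowering it is memory: each tenfold reduction in the target rate adds roughly 4.8 bits per stored item.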

References

[1] LLM360. "TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend." LLM360 News, October 2024. https://www.llm360.ai/news/txt360-blogpost.html
[2] LLM360. "TxT360" (GitHub repository). 2024. https://github.com/LLM360/TxT360
[3] LLM360. "LLM360/TxT360" (dataset card). Hugging Face, 2024. https://huggingface.co/datasets/LLM360/TxT360
[4] LLM360. "TxT360: Trillion Extracted Text" (Hugging Face Space). 2024. https://huggingface.co/spaces/LLM360/TxT360
[5] LLM360. "TxT360 README." GitHub, 2024. https://github.com/LLM360/TxT360/blob/main/README.md
[6] Nelson, Asif Razzaq. "LLM360 Group Introduces TxT360: A Top-Quality LLM Pre-Training Dataset with 15T Tokens." MarkTechPost, October 8, 2024. https://www.marktechpost.com/2024/10/08/llm360-group-introduces-txt360-a-top-quality-llm-pre-training-dataset-with-15t-tokens/
[7] Khan, Faraz; et al. "LSHBloom: Memory-efficient, Extreme-scale Document Deduplication." arXiv preprint arXiv:2411.04257, 2024. https://arxiv.org/abs/2411.04257
[8] "The Birth of 5.7 Trillion Quality Tokens: Large Language Model Training Dataset TxT360." AIBase, October 2024. https://www.aibase.com/news/12194
[9] Liu, Zhengzhong; et al. "LLM360: Towards Fully Transparent Open-Source LLMs." arXiv preprint arXiv:2312.06550, 2023. https://arxiv.org/abs/2312.06550
[10] LLM360. "LLM360: Open-Source LLMs towards Community-Driven AGI." 2024. https://www.llm360.ai/