TxT360
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,891 words
TxT360 is an open large-scale pretraining corpus for large language models, released in October 2024 by the LLM360 project, a collaboration led by Petuum and the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) [1][2]. The name stands for Trillion eXtracted Text. Its distinguishing feature is that it pulls together 99 Common Crawl web snapshots and 14 curated, high-quality non-web sources, then runs a single global deduplication pass over all of them at once. That produces roughly 5 trillion unique tokens of cleaned text, which a documented upsampling recipe expands into a corpus of more than 15 trillion tokens [1][3]. The team reported that, with that recipe, TxT360 matches or beats FineWeb on several downstream evaluation metrics [1].
By 2024 the open pretraining-data landscape had split into two camps. On one side were enormous web corpora distilled from Common Crawl, such as FineWeb, RedPajama, and the C4-style cleanups, which gave models broad coverage of long-tail and everyday text but carried the noise and redundancy of the raw web. On the other side were curated collections like The Pile and Dolma that mixed in books, code, scientific papers, and reference text to raise quality, but did not deduplicate web and curated material against each other in a unified way.
TxT360 was built to close that gap. The idea is straightforward to state and hard to execute: take a very large slice of the web, add a set of trusted curated sources, deduplicate everything together so the same passage does not appear once from a crawl and again from a book or a Wikipedia mirror, and keep enough bookkeeping that a practitioner can decide afterward how heavily to weight each part. The project also treats transparency as a first-class goal, releasing the processing scripts and per-document metadata rather than only the final token stream [1][4].
The web portion comes from 99 Common Crawl snapshots, processed into about 4.83 trillion tokens that occupy roughly 9.2 TB on disk. The latest snapshot used carried a cutoff around mid-2024 [3]. Alongside the web data sit 14 curated sources spread across roughly ten domains, including legal text, patents, encyclopedic reference, scientific literature, mathematics, forum discussion, chat logs, parliamentary proceedings, and books. The five paper sources (arXiv, S2ORC, PhilPapers, PubMed Abstracts, and PubMed Central) are reported together as a single "Papers" line in the published statistics [3][5].
The table below lists the token counts and raw sizes published in the dataset card. Counts are given before global deduplication.
| Source | Raw size | Tokens | Notes |
|---|---|---|---|
| Common Crawl (99 snapshots) | 9.2 TB | 4.83T | Open web text |
| Papers (arXiv, S2ORC, PhilPapers, PubMed Abstracts, PubMed Central) | 712 GB | 154.96B | Scientific literature |
| Wikipedia | 199 GB | 35.98B | 310+ languages |
| StackExchange | 81 GB | 27.76B | Q&A forums |
| FreeLaw | 71 GB | 16.70B | Court opinions and legal filings |
| DM Math | 22 GB | 5.23B | Synthetic mathematics problems |
| USPTO | 45 GB | 4.95B | US patent grants |
| PG-19 | 11 GB | 2.63B | Pre-1919 books from Project Gutenberg |
| Europarl | 6.1 GB | 1.96B | European Parliament proceedings |
| Ubuntu IRC | 6.0 GB | 1.89B | Chat logs |
| HackerNews | 4.2 GB | 1.05B | Forum posts and comments |
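The rows of the table can be totalled with a few lines of Python. This is only arithmetic on the published figures: the dictionary keys are shortened labels chosen for this example, and the Common Crawl entry uses the rounded 4.83T value, so the totals are approximate.

```python
# Token counts in billions, copied from the table above.
# Labels are shortened for this example; 4830.0 is the rounded 4.83T web figure.
TOKENS_B = {
    "Common Crawl": 4830.0,
    "Papers": 154.96,
    "Wikipedia": 35.98,
    "StackExchange": 27.76,
    "FreeLaw": 16.70,
    "DM Math": 5.23,
    "USPTO": 4.95,
    "PG-19": 2.63,
    "Europarl": 1.96,
    "Ubuntu IRC": 1.89,
    "HackerNews": 1.05,
}

# Everything except the web track counts as curated.
curated_b = sum(v for k, v in TOKENS_B.items() if k != "Common Crawl")
total_b = sum(TOKENS_B.values())
web_share = TOKENS_B["Common Crawl"] / total_b

print(f"curated: {curated_b:.2f}B")        # curated: 253.11B
print(f"total: {total_b / 1000:.2f}T")     # total: 5.08T
print(f"web share: {web_share:.1%}")       # web share: 95.0%
```

By token count the curated sources add up to about 253 billion tokens, roughly 5 percent of the listed total, with the web track supplying the rest.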
The web data goes through a filtering pipeline before deduplication. Text is first extracted from raw WARC records, then passed through language identification, URL-based filtering to drop blocklisted and low-value domains, repetition removal to cut boilerplate and spam, and line-level and document-level quality filters that discard pages failing heuristics for length, symbol ratios, and other signals. A privacy step removes personally identifiable information such as email addresses and IP addresses [5][6]. The cumulative effect of these filters is large: LLM360 reported that about 97.65 percent of the original raw web data was removed before the remaining text reached the deduplication stage [6].
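The stages described above can be sketched as a chain of small checks. The following is a minimal illustration, not the LLM360 implementation: the blocklist, the thresholds, the regular expressions, and every function name are assumptions made for this example, language identification is left out because it needs an external model, and PII is replaced with placeholder tokens, which is one possible reading of "removes".

```python
import re
from urllib.parse import urlparse

# Hypothetical values for illustration only; the real blocklists and
# thresholds used by the pipeline are not reproduced here.
BLOCKED_DOMAINS = {"spam.example"}
MIN_WORDS = 5
MAX_SYMBOL_RATIO = 0.3
MAX_DUPLICATE_LINE_FRACTION = 0.5

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def url_allowed(url: str) -> bool:
    """URL filter: reject blocklisted domains and their subdomains."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def symbol_ratio(text: str) -> float:
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 1.0
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text)


def duplicate_line_fraction(text: str) -> float:
    """Share of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 1.0
    return 1 - len(set(lines)) / len(lines)


def scrub_pii(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholders."""
    return IPV4_RE.sub("<IP>", EMAIL_RE.sub("<EMAIL>", text))


def clean_document(url: str, text: str):
    """Return the cleaned text, or None if any filter drops the document."""
    if not url_allowed(url):
        return None
    if len(text.split()) < MIN_WORDS:
        return None
    if symbol_ratio(text) > MAX_SYMBOL_RATIO:
        return None
    if duplicate_line_fraction(text) > MAX_DUPLICATE_LINE_FRACTION:
        return None
    return scrub_pii(text)
```

Ordering the cheap checks first (URL, length) means most rejected pages never reach the more expensive per-character and per-line passes, which matters when the large majority of input is discarded.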
The 14 curated sources are processed with their own per-source cleaning rather than the full web pipeline, since material like court opinions, patents, and books does not carry the same kinds of web noise. After both tracks are cleaned, they are combined and deduplicated as one corpus.
Deduplication is the part of TxT360 the authors stress most. Most prior datasets deduplicated each snapshot or each source on its own, which leaves cross-source and cross-snapshot copies in place. A popular article can be crawled in dozens of monthly snapshots, and the same Wikipedia paragraph can show up in a web crawl, a curated Wikipedia dump, and a third-party mirror. TxT360 instead deduplicates the entire 99-snapshot web set and all 14 curated sources together [1][2].
The system uses two passes. Exact duplicates are caught with a Bloom filter, a compact probabilistic structure that can test whether a document has been seen before without storing every document in memory. Near-duplicates are caught with MinHash, which produces a signature for each document so that texts sharing most of their content collide even when they differ in small ways. To make MinHash scale to trillions of tokens, the signature is split into bands, and each band gets its own Bloom filter; documents whose bands collide are treated as fuzzy duplicates [5][6][7]. Running this globally is what brings the cleaned corpus down to roughly 5 trillion unique tokens [1].
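A toy version of the two passes shows how the pieces fit together. It is a sketch under stated assumptions rather than the project's code: the filter size, the number of hash functions, the word-trigram shingles, the 32-value signature, the 8 bands, and all class and function names are choices made for this example, and BLAKE2b from the Python standard library stands in for whatever hash functions a production system would use.

```python
import hashlib


def _hash64(data: bytes, seed: int) -> int:
    """Seeded 64-bit hash built on BLAKE2b from the standard library."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


class BloomFilter:
    """Fixed-size bit array probed by several seeded hashes."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def add_if_new(self, item: bytes) -> bool:
        """Insert item; return True if it was definitely not present before."""
        was_new = False
        for seed in range(self.num_hashes):
            byte, bit = divmod(_hash64(item, seed) % self.num_bits, 8)
            if not self.bits[byte] & (1 << bit):
                was_new = True
                self.bits[byte] |= 1 << bit
        return was_new


def shingles(text: str, n: int = 3) -> set:
    """Overlapping word n-grams; very short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text: str, num_perm: int = 32) -> list:
    """For each seed, keep the smallest hash value over all shingles."""
    grams = [g.encode("utf-8") for g in shingles(text)]
    return [min(_hash64(g, seed) for g in grams) for seed in range(num_perm)]


class NearDuplicateDetector:
    """Splits each signature into bands and keeps one Bloom filter per band."""

    def __init__(self, num_perm: int = 32, num_bands: int = 8):
        self.num_perm = num_perm
        self.rows = num_perm // num_bands
        self.band_filters = [BloomFilter() for _ in range(num_bands)]

    def seen_similar(self, text: str) -> bool:
        """Record text; return True if any band matched an earlier document."""
        signature = minhash_signature(text, self.num_perm)
        collided = False
        for index, band_filter in enumerate(self.band_filters):
            band = signature[index * self.rows:(index + 1) * self.rows]
            key = ",".join(map(str, band)).encode("ascii")
            if not band_filter.add_if_new(key):
                collided = True
        return collided
```

Calling `BloomFilter.add_if_new` on the raw bytes of each document handles the exact pass, and `NearDuplicateDetector.seen_similar` handles the fuzzy pass. More bands with fewer rows each make the detector more willing to flag loosely similar texts; fewer bands with more rows make it stricter.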
The payoff for training is concrete. Duplicated text wastes compute, can cause a model to memorize repeated passages verbatim, and distorts the effective data mixture because heavily mirrored content gets silently overweighted. Removing duplicates across the whole corpus gives a cleaner starting distribution and a more honest token count.
Deduplicating down to 5 trillion unique tokens raises an obvious problem: many strong open models train on 15 trillion tokens or more, so a 5T corpus is not large enough on its own. Rather than dump the duplicates back in indiscriminately, TxT360 keeps detailed metadata about how often and where each document appeared, then uses that to rebuild a larger corpus in a controlled way.
During deduplication, every kept document records its duplicate count and the snapshots it came from. Documents are bucketed by how many times they were seen, with ranges such as 1-1 (unique), 2-5, 6-10, 11-100, 101-1000, and a top bucket reaching into the tens of millions for Common Crawl [3]. The published upsampling recipe is a "rehydration" strategy: documents are repeated in proportion to how often they originally occurred, on the reasoning that text appearing across many independent pages tends to be more canonical and worth seeing more than once [8]. Applying this weighting brings the corpus past 15 trillion tokens while letting practitioners reconstruct the original distribution, or design their own weighting, because the metadata that drives the process ships with the data [1][3]. The team also noted that mixing in a dedicated code dataset alongside TxT360 produced more stable training curves in their experiments [8].
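The bucketing and reweighting step can be illustrated with a short sketch. The bucket boundaries follow the ranges named above, but the `"1001+"` label for the top bucket, the repeat factors in `REPEATS`, and the function names are hypothetical values invented for this example; they are not the published recipe, which should be taken from the dataset card.

```python
import bisect

# Upper bounds of the duplicate-count buckets named in the text.
BUCKET_UPPER = [1, 5, 10, 100, 1000]
# "1001+" is a placeholder label for the open-ended top bucket.
BUCKET_NAMES = ["1-1", "2-5", "6-10", "11-100", "101-1000", "1001+"]

# Hypothetical repeat factors for illustration; NOT the published recipe.
REPEATS = {"1-1": 1, "2-5": 2, "6-10": 3, "11-100": 5, "101-1000": 8, "1001+": 10}


def bucket_for(dup_count: int) -> str:
    """Map a document's duplicate count to its bucket label."""
    if dup_count < 1:
        raise ValueError("a kept document has been seen at least once")
    return BUCKET_NAMES[bisect.bisect_left(BUCKET_UPPER, dup_count)]


def upsampled_tokens(docs) -> int:
    """Total tokens after repeating each (token_count, dup_count) document."""
    return sum(tokens * REPEATS[bucket_for(dups)] for tokens, dups in docs)


# Three toy documents: unique, seen 4 times, seen 50 times.
docs = [(1000, 1), (500, 4), (200, 50)]
base = sum(tokens for tokens, _ in docs)
print(base, upsampled_tokens(docs))  # 1700 3000
```

Going from roughly 5 trillion unique tokens to more than 15 trillion implies an average repeat factor of about 3 across the corpus, whatever the per-bucket values are. Because the duplicate counts ship as metadata, swapping in a different `REPEATS` table is all it takes to try another weighting.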
TxT360 sits in the same family as FineWeb, RedPajama, SlimPajama, and Dolma, and it borrows ideas from each. Its closest reference point is FineWeb, the Common Crawl distillation from Hugging Face. LLM360 reported that a model trained on the 15T-token upsampled TxT360 matched or surpassed FineWeb on several evaluation metrics, including MMLU and Natural Questions, though the released materials describe the comparison qualitatively rather than publishing a full per-benchmark score table [1].
| Corpus | Primary scope | Global cross-source dedup | Approx. scale |
|---|---|---|---|
| TxT360 | Web plus 14 curated sources | Yes, web and curated together | ~5T unique, 15T+ upsampled |
| FineWeb | Common Crawl web only | Per-snapshot | ~15T tokens |
| RedPajama (v1) | Web plus curated, LLaMA-style mix | Per-source | ~1.2T tokens |
| SlimPajama | Deduplicated RedPajama | Yes, across RedPajama | ~627B tokens |
| Dolma | Web plus curated | Per-source | ~3T tokens |
The contrasts are matters of design rather than simple ranking. RedPajama and Dolma mix web and curated text but deduplicate within sources; SlimPajama showed how much aggressive deduplication shrinks a corpus, taking RedPajama from about 1.2 trillion tokens to roughly 627 billion; FineWeb focused on getting the web track right at very large scale. TxT360's contribution is to unify the web-plus-curated mix with a single global dedup and then publish a recipe and metadata for rebuilding scale on top of the deduplicated base [1][8].
TxT360 fits LLM360's broader program of fully transparent model development. The project had already released open models with unusually complete artifacts: Amber, a 7-billion-parameter model trained on 1.3 trillion tokens and shipped with 360 intermediate checkpoints; CrystalCoder, a 7-billion-parameter code-and-language model trained on 1.4 trillion tokens with 143 checkpoints; and K2, a 65-billion-parameter model released with its checkpoints, training code, logs, and data [9][10]. The recurring "360" branding signals the goal of a full-circle view of training, with data, code, checkpoints, and metrics all open.
For TxT360 specifically, LLM360 published the processing scripts, the per-document duplication metadata, and documentation of the filtering and weighting decisions, distributed under the permissive ODC-By license [3][4]. That level of disclosure lets other groups audit the pipeline, reproduce the corpus, or fork it with a different mixture, which is the central argument the project makes for releasing it.
TxT360 inherits the constraints of the data it is built from. The web track is dominated by English and other high-resource languages, and even with 310-plus-language Wikipedia in the mix, low-resource language coverage is thin. The curated set is large but bounded by source cutoffs in 2023 and 2024, so the corpus does not reflect events after those dates. Aggressive filtering removes most raw web text, which improves quality but can also drop legitimate content that fails the heuristics, and quality heuristics tuned against reference corpora can carry their own biases about what counts as good text.
The deduplication and upsampling design also leaves open questions that the project frames as research directions rather than solved problems. Probabilistic structures like Bloom filters trade a small false-positive rate for scalability, and the rehydration recipe assumes that frequently duplicated text deserves more weight, which is a reasonable default but not a proven optimum for every training objective. Because the released comparison to FineWeb is reported without a complete public score table, independent replication is needed to pin down exactly where and by how much the gains hold.
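The Bloom filter trade-off mentioned above can be quantified with the standard approximation for the false-positive rate, p ≈ (1 − e^(−kn/m))^k, where n is the number of inserted items, m the number of bits, and k the number of hash functions. The sketch below evaluates it for hypothetical sizes picked for this example; they are not TxT360's actual filter parameters.

```python
import math


def bloom_false_positive_rate(num_items: int, num_bits: int, num_hashes: int) -> float:
    """Standard approximation: p = (1 - exp(-k * n / m)) ** k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_num_hashes(num_items: int, num_bits: int) -> int:
    """Hash count that minimizes p for a given size: k = (m / n) * ln 2."""
    return max(1, round(num_bits / num_items * math.log(2)))


# Hypothetical sizing: one billion documents at 10 bits per document.
n, m = 1_000_000_000, 10_000_000_000
k = optimal_num_hashes(n, m)
p = bloom_false_positive_rate(n, m, k)
print(k, f"{p:.2%}")  # 7 0.82%
```

In a deduplication setting a false positive means a document that was never seen gets discarded as a duplicate, so at this sizing a little under one in a hundred novel documents would be lost; adding bits per document lowers that rate at the cost of memory.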