FineWeb-2
Last reviewed: Jun 8, 2026
Sources: 5 citations
Review status: Source-backed
Revision: v1 · 1,388 words
FineWeb-2 (also written FineWeb2) is a massively multilingual web pretraining dataset released by Hugging Face in December 2024. It is the multilingual successor to FineWeb, Hugging Face's English-only web corpus, and it extends the same processing philosophy to more than 1,000 languages drawn from Common Crawl [1][2]. The dataset is organized into 1,868 language-script pairs and was produced by applying a single, language-adaptive curation pipeline to 96 Common Crawl snapshots spanning the summer of 2013 through April 2024 [3]. FineWeb-2 is released under the permissive Open Data Commons Attribution License (ODC-By 1.0), making it usable for both research and commercial pretraining [1][2].
FineWeb-2 was built by the same team behind FineWeb and is documented in the paper "FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language," posted to arXiv on 26 June 2025 [3]. On a curated set of evaluation tasks across diverse languages, the authors report that FineWeb-2 outperforms earlier open multilingual corpora including CC-100, mC4, CulturaX, and HPLT, while being substantially larger, and in some cases matches or exceeds datasets hand-curated for a single language [1][3].
FineWeb-2 builds directly on the original FineWeb dataset, a 15-trillion-token English-language web corpus derived from 96 Common Crawl snapshots and released by Hugging Face in 2024 [4]. FineWeb established a reproducible recipe for turning raw Common Crawl dumps into high-quality pretraining text, with carefully ablated design choices for text extraction, language identification, quality filtering, and deduplication. The associated paper, "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale," was published at the NeurIPS 2024 Datasets and Benchmarks Track [4].
The FineWeb release also introduced FineWeb-Edu, a 1.3-trillion-token subset filtered for educational content using a classifier trained to score the educational value of web pages. Models pretrained on FineWeb-Edu showed markedly stronger performance on knowledge- and reasoning-intensive benchmarks such as MMLU and ARC [4]. FineWeb-2 inherits FineWeb's emphasis on transparency, reproducibility, and ablation-driven design, but its central problem is different: extending a fundamentally English-tuned pipeline to work well across a very large number of languages and writing systems, where filtering heuristics and deduplication thresholds cannot simply be copied from English [3].
FineWeb-2 is a text-only dataset intended for training multilingual AI language models. Reported size figures vary by measurement and by source, so they are attributed below.
| Attribute | Value | Source |
|---|---|---|
| Languages | Over 1,000 | [1][3] |
| Language-script pairs | 1,868 | [3] |
| Compressed size | About 8 TB | [2] |
| Approximate word count | Almost 3 trillion words | [2] |
| Total dataset size (paper) | About 20 TB | [3] |
| Document count (paper) | About 5 billion documents | [3] |
| Common Crawl snapshots | 96 (summer 2013 to April 2024) | [1][3] |
| Token scale | Over 1 trillion tokens | [1] |
| License | ODC-By 1.0 | [1][2] |
The Hugging Face dataset card describes the public release as roughly 8 TB of compressed text, equivalent to almost 3 trillion words [2], while the paper characterizes the full multilingual collection as roughly 20 TB and 5 billion documents [3]. The gap appears to reflect compressed versus total measured size and the inclusion of per-language data variants. Each language-script pair is published as its own configuration on the Hugging Face Hub, with its own splits, so users can download individual languages rather than the entire corpus [2]. News coverage at launch sometimes cited 1,893 language-script pairs; this article uses the paper's figure of 1,868 as the canonical count [1][3].
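Per-language access can be sketched as follows. This is a minimal sketch under assumptions that are not stated in this article: the repository id `HuggingFaceFW/fineweb-2`, a configuration naming scheme of three-letter language code plus four-letter script code (for example `fra_Latn`), and a split named `train`. Check the dataset card [2] before relying on any of them. The helper names `config_name` and `stream_sample` are hypothetical.

```python
from itertools import islice

# Assumed repository id; verify it on the Hugging Face Hub before use.
REPO_ID = "HuggingFaceFW/fineweb-2"


def config_name(language: str, script: str) -> str:
    """Build a configuration name such as 'fra_Latn' (assumed naming scheme)."""
    if len(language) != 3 or not language.isalpha():
        raise ValueError(f"expected a 3-letter language code, got {language!r}")
    if len(script) != 4 or not script.isalpha():
        raise ValueError(f"expected a 4-letter script code, got {script!r}")
    return f"{language.lower()}_{script.capitalize()}"


def stream_sample(language: str, script: str, n: int = 3, split: str = "train"):
    """Return the first n records of one language-script configuration."""
    # Imported lazily so config_name works without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(
        REPO_ID,
        name=config_name(language, script),
        split=split,       # assumed split name
        streaming=True,    # iterate lazily instead of downloading everything
    )
    return list(islice(dataset, n))
```

Streaming is used here because a single large configuration can be far bigger than local disk; calling `stream_sample("fra", "Latn")` would fetch only the first few records over the network.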
The core technical contribution of FineWeb-2 is a curation pipeline, based on FineWeb and implemented in Hugging Face's datatrove data-processing library, that automatically adapts to any language rather than relying on hand-tuned, language-specific rules [3]. The pipeline proceeds through several stages: language identification, deduplication, filtering, and a deduplication-informed upsampling step the authors call rehydration [3].
To choose pipeline settings, the team ablated design decisions on nine diverse "canary" languages: Arabic, Chinese, French, Hindi, Russian, Swahili, Telugu, Thai, and Turkish [3]. These languages were selected to span different scripts, resource levels, and language families so that choices made for them would generalize to the long tail of lower-resource languages. Deduplication is performed per language using MinHash-based near-duplicate detection, with thresholds tuned for each language rather than borrowed from English [3].
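The idea behind MinHash near-duplicate detection can be illustrated in plain Python. This is a sketch of the general technique, not the datatrove implementation and not the paper's configuration: the signature length, the word n-gram size, and the function names are all illustrative assumptions.

```python
import hashlib
import re

NUM_HASHES = 64  # illustrative signature length, not the paper's setting
NGRAM_SIZE = 3   # illustrative word n-gram size, not the paper's setting


def shingles(text: str, n: int = NGRAM_SIZE) -> set[str]:
    """Split text into overlapping word n-grams."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(seed: int, shingle: str) -> int:
    """Hash a shingle with a seed, giving one hash function per seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> list[int]:
    """Keep the minimum hash value of the shingle set under each hash function."""
    items = shingles(text)
    if not items:
        raise ValueError("cannot build a signature for empty text")
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, which estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents would be treated as near-duplicates when the estimate exceeds a chosen threshold. The `\w+` word split used here does not segment text in languages written without spaces, such as Chinese or Thai, which is one concrete reason why settings of this kind cannot simply be copied from English.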
A distinctive element is the rehydration (rebalancing) strategy, which decides how much to upsample documents after deduplication. Instead of treating all duplicates as noise to be removed, the method uses each document's MinHash cluster as a signal: clusters whose members survive filtering at high rates are treated as higher quality and are upsampled more aggressively (up to roughly 10x), while heavily duplicated low-quality clusters and isolated singleton documents receive lower weights (around 1x) [3]. The authors describe this as a principled, scalable, and affordable way to rebalance datasets by jointly considering duplication count and quality, which matters because for many languages there is far less web text available than for English [3].
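The rebalancing idea can be sketched as a mapping from duplication count to an upsampling weight. This is a toy illustration, not the authors' formula: the linear mapping, the comparison against the global survival rate, and the names `Doc`, `survival_by_cluster_size`, and `upsampling_weights` are hypothetical. Only the overall behavior (about 1x for singletons and low-quality clusters, up to about 10x for clusters that survive filtering at high rates) follows the description above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    cluster_size: int     # number of near-duplicates in the document's MinHash cluster
    passed_filters: bool  # whether the document survived quality filtering


def survival_by_cluster_size(docs: list[Doc]) -> dict[int, float]:
    """Share of documents that pass filtering, grouped by cluster size."""
    totals: dict[int, int] = defaultdict(int)
    kept: dict[int, int] = defaultdict(int)
    for doc in docs:
        totals[doc.cluster_size] += 1
        kept[doc.cluster_size] += doc.passed_filters
    return {size: kept[size] / totals[size] for size in totals}


def upsampling_weights(docs: list[Doc], max_weight: int = 10) -> dict[int, int]:
    """Map each cluster size to a repetition count between 1 and max_weight.

    Hypothetical rule: singletons and groups that survive filtering no better
    than the corpus average get weight 1; better groups scale linearly up to
    max_weight for the best-surviving group.
    """
    rates = survival_by_cluster_size(docs)
    if not rates:
        return {}
    global_rate = sum(doc.passed_filters for doc in docs) / len(docs)
    best_rate = max(rates.values())
    weights: dict[int, int] = {}
    for size, rate in rates.items():
        if size == 1 or rate <= global_rate:
            weights[size] = 1
        else:
            fraction = (rate - global_rate) / (best_rate - global_rate)
            weights[size] = 1 + round(fraction * (max_weight - 1))
    return weights
```

In this sketch the survival rate of a duplication bucket acts as a cheap quality signal: no extra classifier is needed, because the filtering stage has already judged every document in the bucket.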
Designing a multilingual data pipeline requires reliable ways to measure whether a change actually improves model quality in each language. As a precursor to FineWeb-2, the Hugging Face team published FineTasks, a curated suite of multilingual evaluation tasks intended to provide high-signal, robust measurements across many languages, released alongside expanded multilingual support in the LightEval evaluation framework [5]. The FineWeb-2 paper similarly emphasizes that its pipeline choices were guided by a set of meaningful, informative evaluation tasks chosen through a novel selection process based on measurable criteria, rather than relying on a single noisy benchmark per language [3][5].
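What "measurable criteria" for task selection might look like can be sketched with two simple checks. The specific criteria below (scores that rise as training progresses, low variation across random seeds, and a final score above the random baseline), the thresholds, and all function names are assumptions made for illustration; this article does not list the criteria the authors used.

```python
from statistics import mean, pstdev


def _ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def monotonicity(scores: list[float]) -> float:
    """Spearman rank correlation between checkpoint order and task score."""
    if len(scores) < 2:
        raise ValueError("need at least two checkpoints")
    rx, ry = _ranks(list(range(len(scores)))), _ranks(scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_y == 0:
        return 0.0  # a flat score curve carries no trend
    return cov / (var_x * var_y) ** 0.5


def signal_to_noise(final_scores_by_seed: list[float]) -> float:
    """Mean final score divided by its spread across seeds."""
    spread = pstdev(final_scores_by_seed)
    if spread == 0:
        return float("inf")
    return mean(final_scores_by_seed) / spread


def is_informative(
    scores: list[float],
    final_scores_by_seed: list[float],
    random_baseline: float,
    min_monotonicity: float = 0.8,  # hypothetical threshold
    min_snr: float = 10.0,          # hypothetical threshold
) -> bool:
    """Keep a task only if it trends upward, is stable, and beats chance."""
    return (
        monotonicity(scores) >= min_monotonicity
        and signal_to_noise(final_scores_by_seed) >= min_snr
        and mean(final_scores_by_seed) > random_baseline
    )
```

A task whose score curve jumps around between checkpoints would fail the first check, so differences between pipeline variants measured on it would be hard to distinguish from noise.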
For evaluation, the authors trained ablation models on hundreds of billions of tokens to compare pipeline variants, and trained smaller models to compare FineWeb-2 against existing corpora on the canary and additional unseen languages [3]. The paper reports that FineWeb-2 outperforms prior multilingual datasets, specifically CC-100, mC4, CulturaX, and HPLT, on 11 of 14 evaluated languages, while being substantially larger than those baselines [1][3]. The authors note an important caveat: in some languages, pipelines hand-designed by native-language experts can still beat the automatically adapted FineWeb-2 pipeline, which sets a target for future work [3]. The release was validated through hundreds of ablation experiments and is described as fully reproducible [1].
FineWeb-2 is one of the largest fully open multilingual pretraining corpora released to date. It addresses a persistent gap in open language-model development: high-quality open English pretraining data advanced quickly, while comparable resources for most other languages lagged behind [3]. By packaging a transparent, ablated, language-adaptive pipeline together with the resulting data, FineWeb-2 lowers the barrier to training capable models in non-English and lower-resource languages. It has also become a widely used community resource, including as a base for derived datasets such as quality-filtered variants built by other groups [2][3].
Within the open-data ecosystem, FineWeb-2 is the multilingual counterpart to FineWeb and FineWeb-Edu, and it sits alongside other open pretraining corpora such as RedPajama, Dolma, CulturaX, and HPLT. It is distributed under the ODC-By 1.0 license, which permits reuse and redistribution, including for commercial purposes, provided attribution is given [1][2]. The dataset, the datatrove pipeline code, and the accompanying documentation are all openly available through the HuggingFaceFW organization on the Hugging Face Hub [1][2][3].