FineWeb-2

Updated Jun 8, 2026

Overview

FineWeb-2 (also written FineWeb2) is a massively multilingual web pretraining dataset released by Hugging Face in December 2024. It is the multilingual successor to FineWeb, Hugging Face's English-only web corpus, and it extends the same processing approach to Common Crawl text in more than 1,000 languages ^[1]^[2]. The dataset is organized into 1,868 language-script pairs and was produced by applying a single, language-adaptive curation pipeline to 96 Common Crawl snapshots spanning the summer of 2013 through April 2024 ^[3]. FineWeb-2 is released under the permissive Open Data Commons Attribution License (ODC-By 1.0), which permits both research and commercial use ^[1]^[2].

FineWeb-2 was built by the same team behind FineWeb and is documented in the paper "FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language," posted to arXiv on 26 June 2025 ^[3]. On a curated set of evaluation tasks across diverse languages, the authors report that models pretrained on FineWeb-2 outperform those pretrained on earlier open multilingual corpora, including CC-100, mC4, CulturaX, and HPLT. They also report that FineWeb-2 is substantially larger than those corpora and in some cases matches or exceeds datasets hand-curated for a single language ^[1]^[3].

Lineage: FineWeb and FineWeb-Edu

FineWeb-2 builds directly on the original FineWeb dataset, a 15-trillion-token English-language web corpus derived from 96 Common Crawl snapshots and released by Hugging Face in 2024 ^[4]. FineWeb established a reproducible recipe for turning raw Common Crawl dumps into high-quality pretraining text, with carefully ablated design choices for text extraction, language identification, quality filtering, and deduplication. The associated paper, "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale," was published at the NeurIPS 2024 Datasets and Benchmarks Track ^[4].

The FineWeb release also introduced FineWeb-Edu, a 1.3-trillion-token subset filtered for educational content using a classifier trained to score the educational value of web pages. Models pretrained on FineWeb-Edu showed markedly stronger performance on knowledge- and reasoning-intensive benchmarks such as MMLU and ARC ^[4]. FineWeb-2 inherits FineWeb's emphasis on transparency, reproducibility, and ablation-driven design, but its central problem is different: extending a fundamentally English-tuned pipeline to work well across a very large number of languages and writing systems, where filtering heuristics and deduplication thresholds cannot simply be copied from English ^[3].

What FineWeb-2 contains

FineWeb-2 is a text-only dataset intended for training multilingual language models. Reported size figures vary by measurement and by source, so each figure below is attributed to the source that reports it.

Attribute	Value	Source
Languages	Over 1,000	^[1]^[3]
Language-script pairs	1,868	^[3]
Compressed size	About 8 TB	^[2]
Approximate word count	Almost 3 trillion words	^[2]
Total dataset size (paper)	About 20 TB	^[3]
Document count (paper)	About 5 billion documents	^[3]
Common Crawl snapshots	96 (summer 2013 to April 2024)	^[1]^[3]
Token scale	Over 1 trillion tokens	^[1]
License	ODC-By 1.0	^[1]^[2]

The Hugging Face dataset card describes the public release as roughly 8 TB of compressed text, equivalent to almost 3 trillion words ^[2], while the academic paper characterizes the full multilingual collection as a roughly 20 TB, 5-billion-document dataset ^[3]. The difference likely reflects compressed versus total measured sizes and the inclusion of per-language data variants. Each language-script combination is provided as its own configuration on the Hugging Face Hub, with separate splits, so users can download data for individual languages rather than the entire corpus ^[2]. News coverage at launch sometimes cited 1,893 language-script pairs; the figure of 1,868 comes from the paper and is used here as the canonical count ^[1]^[3].
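As an illustration of per-language access, the following minimal sketch streams a few documents from one configuration with the Hugging Face `datasets` library. Several details are assumptions rather than facts stated in this article: the configuration naming pattern (a lowercase language code joined to a capitalized script code, such as `fra_Latn`), the `train` split name, and the `text` column. The helper names `config_name` and `stream_texts` are hypothetical. Check the dataset card for the actual configuration list before relying on it.

```python
from typing import Iterator


def config_name(lang_code: str, script_code: str) -> str:
    """Build a '<language>_<Script>' configuration name, e.g. 'fra_Latn'.

    The naming pattern is an assumption; verify it against the dataset card.
    """
    return f"{lang_code.lower()}_{script_code.capitalize()}"


def stream_texts(lang_code: str, script_code: str, limit: int = 3) -> Iterator[str]:
    """Yield the text of the first `limit` documents for one language-script pair."""
    # Imported lazily so config_name() works without the `datasets` package.
    from datasets import load_dataset

    ds = load_dataset(
        "HuggingFaceFW/fineweb-2",
        name=config_name(lang_code, script_code),
        split="train",      # assumed split name
        streaming=True,     # iterate without downloading the whole configuration
    )
    for i, row in enumerate(ds):
        if i >= limit:
            break
        yield row["text"]   # assumed column name


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    for text in stream_texts("fra", "Latn"):
        print(text[:200])
```

Streaming is used here because individual configurations can be large; dropping `streaming=True` would download the full configuration to local disk first.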

Methodology: a per-language pipeline

The core technical contribution of FineWeb-2 is a curation pipeline, based on FineWeb and implemented in Hugging Face's datatrove data-processing library, that automatically adapts to any language rather than relying on hand-tuned, language-specific rules ^[3]. The pipeline proceeds through several stages: language identification, deduplication, filtering, and a deduplication-informed upsampling step the authors call rehydration ^[3].

To choose pipeline settings, the team ablated design decisions on nine diverse "canary" languages: Arabic, Chinese, French, Hindi, Russian, Swahili, Telugu, Thai, and Turkish ^[3]. These languages were selected to span different scripts, resource levels, and linguistic families so that choices made for them would generalize to the long tail of lower-resource languages. Deduplication is performed per language using MinHash-based near-duplicate detection, with thresholds tuned so that the deduplication behavior is appropriate for each language rather than borrowed from English ^[3].
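To make the near-duplicate detection step concrete, here is a small, self-contained sketch of the general MinHash technique. It is not the datatrove implementation: the shingle size, the number of hash functions, and the whitespace tokenizer are all illustrative assumptions, and whitespace splitting in particular would not work for languages written without spaces, which is one reason settings cannot simply be copied from English.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of word n-grams in `text` (naive whitespace tokenization)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item`, varied by `seed` via the BLAKE2b salt."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = 64, n: int = 3) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text, n)
    if not items:
        return [0] * num_hashes
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
    sim = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
    print(f"estimated similarity: {sim:.2f}")
```

Two documents would be treated as near-duplicates when the estimated similarity exceeds a chosen threshold; the specific threshold is a per-language tuning decision and is deliberately left unspecified here.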

A distinctive element is the rehydration (rebalancing) strategy, which decides how much to upsample documents after deduplication. Instead of treating all duplicates as noise to be removed, the method uses each document's MinHash cluster as a signal: clusters whose members survive filtering at high rates are treated as higher quality and are upsampled more aggressively (up to roughly 10x), while heavily duplicated low-quality clusters and isolated singleton documents receive lower weights (around 1x) ^[3]. The authors describe this as a principled, scalable, and affordable way to rebalance datasets by jointly considering duplication count and quality, which matters because for many languages there is far less web text available than for English ^[3].
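The idea can be sketched in a few lines. The example below groups documents by the size of their duplicate cluster, measures what fraction of each group survives filtering, and maps that rate onto a weight between 1 and 10. The linear min-max mapping, the function names, and the toy data are assumptions made for illustration; the article does not give the paper's actual weighting formula, so this should not be read as the authors' method.

```python
from collections import defaultdict
from typing import Iterable


def survival_rate_by_cluster_size(docs: Iterable[tuple[int, bool]]) -> dict[int, float]:
    """Map each duplicate-cluster size to the fraction of its documents kept by filtering.

    `docs` yields (cluster_size, passed_filters) pairs, one per document.
    """
    kept: dict[int, int] = defaultdict(int)
    total: dict[int, int] = defaultdict(int)
    for cluster_size, passed in docs:
        total[cluster_size] += 1
        kept[cluster_size] += int(passed)
    return {size: kept[size] / total[size] for size in total}


def upsampling_weights(rates: dict[int, float], max_weight: float = 10.0) -> dict[int, float]:
    """Linearly rescale survival rates onto [1, max_weight] (illustrative mapping only)."""
    lo, hi = min(rates.values()), max(rates.values())
    if hi == lo:
        # No quality signal separates the groups, so nothing is upsampled.
        return {size: 1.0 for size in rates}
    return {
        size: 1.0 + (max_weight - 1.0) * (rate - lo) / (hi - lo)
        for size, rate in rates.items()
    }


if __name__ == "__main__":
    # Toy data: singletons and a huge cluster mostly fail filtering,
    # while a moderately duplicated cluster mostly passes.
    toy_docs = (
        [(1, True)] + [(1, False)] * 3
        + [(5, True)] * 3 + [(5, False)]
        + [(500, True)] + [(500, False)] * 3
    )
    rates = survival_rate_by_cluster_size(toy_docs)
    print(upsampling_weights(rates))
```

In this toy setup the moderately duplicated group receives the highest weight, while singletons and the very large cluster stay at the minimum, mirroring the qualitative behavior described above.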

Evaluation and quality

Designing a multilingual data pipeline requires reliable ways to measure whether a change actually improves model quality in each language. As a precursor to FineWeb-2, the Hugging Face team published FineTasks, a curated suite of multilingual evaluation tasks intended to provide high-signal, robust measurements across many languages, released alongside expanded multilingual support in the LightEval evaluation framework ^[5]. The FineWeb-2 paper similarly emphasizes that its pipeline choices were guided by a set of meaningful, informative evaluation tasks chosen through a novel selection process based on measurable criteria, rather than relying on a single noisy benchmark per language ^[3]^[5].

For evaluation, the authors trained ablation models on hundreds of billions of tokens to compare pipeline variants, and trained smaller models to compare FineWeb-2 against existing corpora on the canary and additional unseen languages ^[3]. The paper reports that FineWeb-2 outperforms prior multilingual datasets, specifically CC-100, mC4, CulturaX, and HPLT, on 11 of 14 evaluated languages, while being substantially larger than those baselines ^[1]^[3]. The authors note an important caveat: in some languages, pipelines hand-designed by native-language experts can still beat the automatically adapted FineWeb-2 pipeline, which sets a target for future work ^[3]. The release was validated through hundreds of ablation experiments and is described as fully reproducible ^[1].

Significance and license

FineWeb-2 is one of the largest fully open multilingual pretraining corpora released to date, and it addresses a persistent gap in open language-model development: while high-quality open English pretraining data had advanced quickly, comparable resources for the world's many other languages lagged behind ^[3]. By packaging a transparent, ablated, and language-adaptive pipeline together with the resulting data, FineWeb-2 lowers the barrier to training capable models in non-English and lower-resource languages, and it has become a widely used community resource, including as a base for derived datasets such as quality-filtered variants built by other groups ^[2]^[3].

Within the open-data ecosystem, FineWeb-2 is the multilingual counterpart to FineWeb and FineWeb-Edu, and it sits alongside other open pretraining corpora such as RedPajama, Dolma, CulturaX, and HPLT. It is distributed under the ODC-By 1.0 license, which permits reuse and redistribution, including for commercial purposes, provided attribution is given ^[1]^[2]. The dataset, the datatrove pipeline code, and the accompanying documentation are all openly available through the HuggingFaceFW organization on the Hugging Face Hub ^[1]^[2]^[3].

References

1. HuggingFaceFW, "fineweb-2 (dataset card)," Hugging Face. https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
2. MarkTechPost, "Hugging Face Releases FineWeb2: 8TB of Compressed Text Data with Almost 3T Words and 1000 Languages Outperforming Other Datasets," 8 December 2024. https://www.marktechpost.com/2024/12/08/hugging-face-releases-fineweb2-8tb-of-compressed-text-data-with-almost-3t-words-and-1000-languages-outperforming-other-datasets/
3. Guilherme Penedo, Hynek Kydlicek, Vinko Sabolcec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf, "FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language," arXiv:2506.20920, 26 June 2025. https://arxiv.org/abs/2506.20920
4. Guilherme Penedo, Hynek Kydlicek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale," NeurIPS 2024 Datasets and Benchmarks Track; arXiv:2406.17557. https://arxiv.org/abs/2406.17557
5. HuggingFaceFW, "Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks (FineTasks)," Hugging Face blog. https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks
