Nemotron-CC
Last reviewed
May 16, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 2,746 words
Nemotron-CC is a large-scale, open English-language pretraining dataset for large language models released by NVIDIA in December 2024. The corpus contains approximately 6.3 trillion tokens drawn from 99 Common Crawl snapshots, of which roughly 4.4 trillion are globally deduplicated original tokens and approximately 1.9 trillion are synthetically generated tokens produced by rephrasing or restructuring web documents with an instruction-tuned model. The dataset was introduced in the paper "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset" by Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro of NVIDIA, which was first posted to arXiv on 3 December 2024 and subsequently accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) as a long paper.
Nemotron-CC was built using NVIDIA's NeMo Curator data-curation framework and is distributed openly through the Common Crawl Foundation's contributed dataset repository. The corpus is notable for combining an ensemble of model-based quality classifiers with large-scale synthetic data generation to address a trade-off the authors call the "quality versus quantity dilemma": aggressive quality filtering, as used in datasets such as FineWeb-Edu, removes up to roughly 90 percent of source tokens, while lightly filtered corpora retain volume but suffer in benchmark accuracy. By rephrasing low-quality documents and diversifying high-quality ones, the Nemotron-CC pipeline retains approximately four times more unique real tokens than the DataComp-LM (DCLM) baseline at comparable downstream accuracy. NVIDIA reported that an 8-billion-parameter dense transformer trained for 15 trillion tokens, of which 7.2 trillion came from Nemotron-CC, outperforms the Llama 3.1 8B base model by 5 points on the Massive Multitask Language Understanding (MMLU) benchmark and by 3.1 points on ARC-Challenge, with a 0.5-point average improvement across ten common downstream tasks.
The Common Crawl Foundation has published monthly web snapshots of broad-coverage crawls since 2008, and the resulting petabytes of WARC files have become the dominant source of pretraining data for open foundation models. Earlier curated derivatives include C4 (2019), the Pile (2020), RefinedWeb (2023), RedPajama-V1 and V2 (2023), Dolma (2023), FineWeb and FineWeb-Edu (2024), and DCLM-Baseline (2024). These projects converged on a multi-stage recipe of language identification, text extraction from HTML, exact and fuzzy deduplication, heuristic quality filters, and increasingly model-based quality classifiers. The Chinchilla scaling laws put the compute-optimal budget near 20 training tokens per parameter, and recent models are routinely trained far beyond that point: 15 trillion tokens for an 8-billion-parameter model is roughly 1,900 tokens per parameter. This has pushed the demand for clean English web text into the tens of trillions of tokens.
By late 2024 the most accurate open Common Crawl derivative, by reported MMLU scores at fixed compute, was DCLM-Baseline from the DataComp-LM benchmark, which used a fastText classifier trained on instruction-tuned outputs to filter Common Crawl down to roughly 3.8 trillion tokens. FineWeb-Edu had taken an alternative approach, using a Llama-3-70B-derived classifier to score documents for educational value and retaining only the top 10 percent or so of the source corpus, yielding around 1.3 trillion tokens with strong knowledge benchmark performance. Both approaches sacrificed a substantial fraction of available tokens, motivating the Nemotron-CC team to design a pipeline that could preserve more of the underlying material while still matching or exceeding accuracy baselines.
The Nemotron-CC pipeline begins with 99 Common Crawl snapshots spanning CC-MAIN-2013-20 through CC-MAIN-2024-30. WARC files are processed using the jusText HTML-to-text extractor, and a FastText language identifier retains documents predicted to be English. The team reports that English content constitutes roughly 73 percent of the raw token stream after extraction. Exact deduplication removes byte-identical documents through hashing, and fuzzy deduplication applies MinHash signatures with locality-sensitive hashing to remove near-duplicates with high Jaccard similarity. After global deduplication across all snapshots the corpus contains approximately 4.4 trillion unique English tokens.
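The two deduplication stages described above can be illustrated with a minimal, standard-library sketch. It is not the NeMo Curator implementation: the shingle size, number of hash functions, band count, and all function names are assumptions chosen for illustration.

```python
import hashlib
from itertools import combinations


def exact_dedup(docs: dict[str, str]) -> dict[str, str]:
    """Keep the first document seen for each distinct byte sequence."""
    seen: set[str] = set()
    kept: dict[str, str] = {}
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept[doc_id] = text
    return kept


def shingles(text: str, n: int = 3) -> set[str]:
    """Lower-cased word n-grams; short texts become a single shingle."""
    words = text.lower().split()
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> tuple[int, ...]:
    """One minimum per seeded hash function over the shingle set."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return tuple(signature)


def lsh_candidate_pairs(
    signatures: dict[str, tuple[int, ...]], bands: int = 16
) -> set[tuple[str, str]]:
    """Documents that agree on every row of at least one band become candidates."""
    buckets: dict[tuple, list[str]] = {}
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, []).append(doc_id)
    pairs: set[tuple[str, str]] = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

In a real pipeline the candidate pairs would then be verified or clustered (for example with connected components) before one document per cluster is kept. More bands with fewer rows each lower the Jaccard similarity at which two documents are likely to collide.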
A central design choice in Nemotron-CC is the use of an ensemble of three model-based quality classifiers rather than a single scoring model.
Each classifier produces a real-valued score, which is mapped via percentile thresholds to an integer rank from 0 (worst) to 19 (best). The maximum rank across the three classifiers becomes the ensemble score, and documents are then bucketed into five quality tiers ranging from low to high. The authors found that taking the maximum, rather than the mean, allowed each classifier to surface high-quality documents that the others missed, expanding the high-quality pool relative to single-classifier filtering.
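The scoring scheme above (percentile rank per classifier, maximum across classifiers, then five tiers) can be sketched as follows. The classifier names are placeholders, and the equal-width grouping of four ranks per tier is an assumption made for illustration; the source only states that ranks are bucketed into five tiers, so the paper's actual boundaries may differ.

```python
from bisect import bisect_right

TIERS = ["low", "medium-low", "medium", "medium-high", "high"]


def rank_thresholds(reference_scores: list[float], num_ranks: int = 20) -> list[float]:
    """Percentile cut points that split a reference sample into num_ranks bins."""
    ordered = sorted(reference_scores)
    n = len(ordered)
    return [ordered[(i * n) // num_ranks] for i in range(1, num_ranks)]


def to_rank(score: float, thresholds: list[float]) -> int:
    """Integer rank from 0 (worst) to len(thresholds) (best)."""
    return bisect_right(thresholds, score)


def ensemble_rank(scores: dict[str, float], thresholds: dict[str, list[float]]) -> int:
    """Maximum rank across classifiers, so one strong vote is enough."""
    return max(to_rank(score, thresholds[name]) for name, score in scores.items())


def quality_tier(rank: int, num_ranks: int = 20) -> str:
    """Assumed equal-width grouping: four consecutive ranks per tier."""
    return TIERS[rank * len(TIERS) // num_ranks]


# Hypothetical reference distribution shared by three placeholder classifiers.
reference = [float(i) for i in range(100)]
cuts = {name: rank_thresholds(reference) for name in ("clf_a", "clf_b", "clf_c")}

# One classifier rates the document highly, the other two do not.
doc_scores = {"clf_a": 2.0, "clf_b": 97.0, "clf_c": 50.0}
rank = ensemble_rank(doc_scores, cuts)
print(rank, quality_tier(rank))
```

With the mean instead of the maximum, the same document would land near the middle of the scale, which is the effect the authors describe when they say the maximum lets each classifier surface documents the others miss.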
The team also disables conventional heuristic filters such as length, symbol-to-word ratio, and perplexity thresholds for documents in the highest-quality buckets, observing that those filters tended to remove genuinely useful technical content that simply differed from prose norms. Heuristic filtering and KenLM perplexity filtering are retained for lower-quality tiers, where they continue to reduce noise.
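The second pillar of Nemotron-CC is large-scale synthetic data generation. The team applied an instruction-tuned 12B-parameter Mistral NeMo model to rewrite or augment documents in two regimes: low-quality documents are rephrased with a single prompt, and high-quality documents are diversified with four prompts.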
The five prompt templates are described in the paper as Wikipedia-style rephrasing, diverse question-answer pair generation, distillation (concise rewrites), knowledge extraction (informative content restated), and knowledge lists (organized factual compilations). In total the synthetic generation step produces roughly 1.9 trillion tokens, of which approximately 1.5 trillion come from high-quality diversification and around 336 billion from low-quality rephrasing. The synthetic component constitutes about 30 percent of the final corpus.
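The routing of documents to prompt templates can be sketched as below. The template identifiers are placeholders derived from the names listed above, not the released prompt texts, and the assumption that only the `low` and `high` tiers are sent to the generator is made for illustration.

```python
# Placeholder identifiers for the five templates; the actual prompt texts are not shown here.
LOW_QUALITY_PROMPTS = ("wikipedia_style_rephrase",)
HIGH_QUALITY_PROMPTS = (
    "diverse_qa_pairs",
    "distill",
    "extract_knowledge",
    "knowledge_list",
)


def prompts_for_tier(tier: str) -> tuple[str, ...]:
    """One rephrasing prompt for low-quality text, four prompts for high-quality text."""
    if tier == "low":
        return LOW_QUALITY_PROMPTS
    if tier == "high":
        return HIGH_QUALITY_PROMPTS
    return ()  # assumption: other tiers are not sent to the generator


def build_generation_jobs(docs: list[tuple[str, str, str]]) -> list[dict[str, str]]:
    """Expand (doc_id, tier, text) records into one job per applicable prompt."""
    jobs = []
    for doc_id, tier, text in docs:
        for prompt_name in prompts_for_tier(tier):
            jobs.append({"doc_id": doc_id, "prompt": prompt_name, "input": text})
    return jobs
```

Each job would then be sent to the instruction-tuned model; a high-quality document yields up to four synthetic variants, which is consistent with the larger token count reported for the diversified portion.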
The final Nemotron-CC corpus is organized into quality buckets that mix real and synthetic content. The approximate composition reported in the paper is summarized below.
| Subset | Approximate tokens | Notes |
|---|---|---|
| High quality (HQ) bucket | 553 B | Top ensemble-rank real documents |
| Medium-high quality | 504 B | Second tier of real documents |
| Medium quality | 2,023 B | Bulk of real corpus |
| Medium-low quality | 894 B | Lighter filtering |
| Low quality | 402 B | Retained with caution |
| Synthetic from HQ (diversified) | ~1,500 B | Four-prompt diversification |
| Synthetic from LQ (rephrased) | ~336 B | Single rephrasing prompt |
| Total | ~6.3 T | 4.4 T real plus 1.9 T synthetic |
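As a quick consistency check, the rows of the table can be summed directly. The figures below are copied from the table (in billions of tokens); the two synthetic rows are approximate, so the totals are approximate as well.

```python
# Token counts in billions, copied from the composition table.
REAL_B = {
    "high": 553,
    "medium-high": 504,
    "medium": 2023,
    "medium-low": 894,
    "low": 402,
}
SYNTHETIC_B = {"from_high_quality": 1500, "from_low_quality": 336}

real_total = sum(REAL_B.values())            # 4376, i.e. about 4.4 T
synthetic_total = sum(SYNTHETIC_B.values())  # 1836, i.e. about 1.8 to 1.9 T
grand_total = real_total + synthetic_total   # 6212
synthetic_share = synthetic_total / grand_total

print(f"real: {real_total} B, synthetic: {synthetic_total} B, total: {grand_total} B")
print(f"synthetic share: {synthetic_share:.1%}")
```

The rows sum to roughly 6.2 trillion tokens, slightly below the rounded 6.3 trillion headline, and the synthetic share comes out just under 30 percent, in line with the figure quoted in the text.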
NVIDIA also released a curated high-quality subset, Nemotron-CC-HQ, containing approximately 1.1 trillion tokens (about 0.6 trillion real and 0.5 trillion synthetic), intended for shorter training runs or as a high-grade mix-in within larger blends.
The paper evaluates Nemotron-CC in two regimes: a controlled short-horizon comparison against peer datasets using 8B-parameter models trained on 1 trillion tokens, and a long-horizon training run of 15 trillion tokens designed to test scaling.
Using the same 8B Llama-3-style architecture and training recipe across datasets, the authors report MMLU scores after 1 trillion training tokens. Nemotron-CC-HQ outperforms both DCLM-Baseline and FineWeb-Edu by substantial margins.
| Dataset | Tokens trained | MMLU (5-shot) |
|---|---|---|
| FineWeb-Edu | 1 T | 42.9 |
| RefinedWeb | 1 T | ~46 |
| DCLM-Baseline | 1 T | 53.4 |
| Nemotron-CC (full) | 1 T | ~53 |
| Nemotron-CC-HQ | 1 T | 59.0 |
The 5.6-point MMLU gap between Nemotron-CC-HQ and DCLM-Baseline at matched compute is the headline short-horizon result. The full Nemotron-CC corpus is also notable in that it matches DCLM-Baseline's accuracy while containing roughly four times as many unique real tokens, addressing the long-horizon supply problem.
The team trained an 8B-parameter dense transformer on 15 trillion tokens, drawing 7.2 trillion from Nemotron-CC and the remainder from a mixture of code, math, multilingual, and curated knowledge sources. The resulting model is compared with Meta's Llama 3.1 8B base across ten downstream benchmarks. The Nemotron-CC-trained model meets or exceeds Llama 3.1 8B on eight of the ten tasks, including a 5-point gain on MMLU and a 3.1-point gain on ARC-Challenge.
| Benchmark | Nemotron-CC 8B (15T) | Llama 3.1 8B |
|---|---|---|
| MMLU (5-shot) | 70.3 | 65.3 |
| ARC-Challenge | 58.1 | 55.0 |
| ARC-Easy | 82.7 | 81.4 |
| HellaSwag | 80.8 | 81.6 |
| Winogrande | 73.8 | 73.5 |
| RACE | 37.8 | 37.5 |
| PIQA | 81.1 | 81.1 |
| Social IQA | 47.4 | 47.2 |
| CommonsenseQA | 69.9 | 70.4 |
| OpenBookQA | 45.4 | 45.0 |
| Average | 64.7 | 63.8 |
The average gain is modest in aggregate: the paper's headline figure is 0.5 points, while the rounded scores tabulated above differ by 0.9 points on average. The gain is concentrated on the knowledge-intensive MMLU and ARC-Challenge tasks, which the authors interpret as evidence that the synthetic component and aggressive quality bucketing help the model acquire factual and reasoning content.
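The concentration of the gain can be checked by recomputing the averages from the table. The scores below are copied from the table as printed; this is arithmetic on those rounded values, not a reproduction of the paper's evaluation.

```python
# (Nemotron-CC 8B at 15T tokens, Llama 3.1 8B), copied from the table above.
SCORES = {
    "MMLU": (70.3, 65.3),
    "ARC-Challenge": (58.1, 55.0),
    "ARC-Easy": (82.7, 81.4),
    "HellaSwag": (80.8, 81.6),
    "Winogrande": (73.8, 73.5),
    "RACE": (37.8, 37.5),
    "PIQA": (81.1, 81.1),
    "Social IQA": (47.4, 47.2),
    "CommonsenseQA": (69.9, 70.4),
    "OpenBookQA": (45.4, 45.0),
}


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


# Per-task difference, rounded to one decimal to avoid floating-point noise.
deltas = {task: round(ours - theirs, 1) for task, (ours, theirs) in SCORES.items()}

nemotron_avg = mean(ours for ours, _ in SCORES.values())
llama_avg = mean(theirs for _, theirs in SCORES.values())
mean_delta = mean(deltas.values())

# Average difference once the two knowledge-heavy tasks are set aside.
knowledge_tasks = {"MMLU", "ARC-Challenge"}
rest_delta = mean(d for task, d in deltas.items() if task not in knowledge_tasks)

print(f"averages: {nemotron_avg:.1f} vs {llama_avg:.1f}")
print(f"mean delta, all ten tasks: {mean_delta:.2f}")
print(f"mean delta, excluding MMLU and ARC-Challenge: {rest_delta:.2f}")
```

On the tabulated values, removing MMLU and ARC-Challenge shrinks the mean difference to a small fraction of a point, which is why the overall improvement is best read as a knowledge-task effect rather than a uniform one.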
Nemotron-CC sits alongside a growing family of large open English pretraining corpora. The comparison below summarizes scale and design choices.
| Dataset | Year | Approx tokens | Source | Quality filter | Synthetic data |
|---|---|---|---|---|---|
| C4 | 2019 | 156 B (en) | Common Crawl | Heuristics | None |
| RefinedWeb | 2023 | 5 T (600 B public) | Common Crawl | Heuristics (MacroData Refinement) | None |
| RedPajama-V2 | 2023 | 30 T (raw) | Common Crawl | Quality signals | None |
| Dolma v1.7 | 2024 | 2.3 T | Mixed web, books, code | Heuristics, classifiers | None |
| FineWeb | 2024 | 15 T | Common Crawl | Heuristics, dedup | None |
| FineWeb-Edu | 2024 | 1.3 T | Common Crawl | Educational classifier | None |
| DCLM-Baseline | 2024 | 3.8 T | Common Crawl | fastText classifier | None |
| Common Pile v0.1 | 2025 | 8 TB of text | Permissively licensed sources | Heuristics | None |
| Common Corpus | 2024 | 2 T | Permissively licensed sources | Heuristics | None |
| Nemotron-CC | 2024 | 6.3 T | Common Crawl | Classifier ensemble | 1.9 T |
Relative to FineWeb-Edu, Nemotron-CC adopts a less aggressive filter and replaces lost tokens with synthetic generations, trading some pure educational density for substantially more total volume. Relative to DCLM-Baseline, Nemotron-CC uses a multi-classifier ensemble rather than a single fastText filter and adds the synthetic component, which the team argues is the primary driver of the MMLU gap at fixed compute. Compared with Dolma and Common Pile, which emphasize source diversity or licensing transparency rather than scale alone, Nemotron-CC is specialized for the English Common Crawl slice and aimed squarely at maximizing downstream benchmark accuracy under fixed training horizons.
The end-to-end pipeline that produced Nemotron-CC is implemented in NeMo Curator, NVIDIA's open-source GPU-accelerated data-curation library. NeMo Curator provides distributed implementations of HTML extraction, language identification, exact and fuzzy deduplication, heuristic filtering, classifier-based filtering, perplexity scoring, and synthetic data generation. NVIDIA has published the rephrasing prompts, classifier checkpoints, and pipeline configurations used for Nemotron-CC in the NeMo Curator GitHub repository, allowing third parties to reproduce the construction or apply the same recipe to non-English crawls or domain-specific corpora.
The dataset itself is hosted on the Common Crawl Foundation's data servers at data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/ and on Hugging Face under nvidia/Nemotron-CC and nvidia/Nemotron-CC-HQ. The corpus inherits Common Crawl's terms of use; the synthetic component is released under a permissive license attached to NVIDIA's contributions. Quality metadata is included alongside text, allowing downstream users to filter or reweight buckets according to their training objectives.
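A minimal sketch of filtering and reweighting by quality metadata, assuming the records are JSON Lines. The field name `quality_tier` and the multipliers are hypothetical; the released files should be inspected for the actual schema before adapting this.

```python
import json
from typing import Iterable, Iterator


def filter_by_tier(
    lines: Iterable[str], keep: set[str], tier_field: str = "quality_tier"
) -> Iterator[dict]:
    """Yield parsed records whose (hypothetical) tier field is in `keep`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get(tier_field) in keep:
            yield record


def sampling_weights(
    token_counts: dict[str, float], multipliers: dict[str, float]
) -> dict[str, float]:
    """Scale each bucket's token count by a multiplier, then normalize to sum to 1."""
    scaled = {
        name: count * multipliers.get(name, 1.0)
        for name, count in token_counts.items()
    }
    total = sum(scaled.values())
    if total <= 0:
        raise ValueError("at least one bucket needs a positive weight")
    return {name: value / total for name, value in scaled.items()}


# Example with in-memory lines standing in for a downloaded shard.
sample = [
    '{"text": "a", "quality_tier": "high"}',
    '{"text": "b", "quality_tier": "low"}',
    "",
    '{"text": "c", "quality_tier": "medium"}',
]
selected = list(filter_by_tier(sample, keep={"high", "medium"}))
weights = sampling_weights({"high": 553, "medium": 2023}, {"high": 2.0})
```

Upweighting a small, high-quality bucket in this way means repeating its documents more often over a long run, so the multiplier is a trade-off between quality and repetition.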
Nemotron-CC was received as a significant contribution to the open pretraining-data ecosystem on release. Practitioners on Hugging Face and in academic venues highlighted three aspects: the explicit demonstration that synthetic rephrasing can substitute for discarded low-quality text without degrading downstream accuracy, the ensemble-of-classifiers approach as an alternative to single-classifier pipelines such as DCLM, and the unusually transparent reporting of per-bucket token counts and per-benchmark scores. Within months of release, Nemotron-CC and its HQ subset appeared as components in pretraining mixtures used by several open-model efforts, and the corpus was used as a baseline for subsequent dataset releases.
NVIDIA itself extended the recipe in 2025 with two follow-ups. Nemotron-CC-Math, introduced in an arXiv preprint dated August 2025, applied a similar pipeline to mathematics-bearing web pages, using layout-aware rendering with the Lynx browser and an LLM-based cleaning stage to preserve MathJax, KaTeX, and MathML equation content. The math corpus is released in two variants, Nemotron-CC-Math-4+ at 52.3 billion tokens (top quality scores 4 to 5) and Nemotron-CC-Math-3+ at 133.3 billion tokens (scores 3 to 5), under an Apache 2.0 license. When used in pretraining a Nemotron-T 8B model, the math corpus produced reported gains of +4.8 to +12.6 points on the MATH benchmark and +4.6 to +14.3 points on MBPP+ over strong baselines, and was reported to outperform FineMath, MegaMath, and OpenWebMath on math reasoning evaluations.
NVIDIA also released Nemotron-CC-v2 and a broader Nemotron pretraining dataset family in 2025 as part of the Nemotron Nano 2 model release. Nemotron-CC-v2 expanded the original recipe with additional Common Crawl snapshots, refined classifiers, and additional synthetic prompt templates, and was used together with code, math, and multilingual mixes to train the Nemotron Nano 2 hybrid Mamba-Transformer reasoning model.
Several limitations of Nemotron-CC have been noted. The synthetic component is generated by a single instruction-tuned model, Mistral-NeMo 12B Instruct, which raises questions about model-induced bias propagating into pretraining data and about possible benchmark contamination if the teacher model was itself exposed to evaluation material. The authors discuss decontamination procedures applied to the synthetic outputs, but third-party audits at the scale of 1.9 trillion synthetic tokens are difficult.
The corpus is English-only, which restricts its direct utility for multilingual model training, although the same NeMo Curator recipe has been demonstrated on non-English subsets. The use of Common Crawl as the sole source means that Nemotron-CC inherits known issues of web data, including over-representation of certain genres, copyrighted content whose fair-use status is ambiguous, and the long tail of low-quality SEO and machine-generated text that even ensemble classifiers cannot fully filter. The dataset's license is permissive but defers to Common Crawl's terms, which some downstream users interpret cautiously for commercial deployment.
Finally, the long-horizon 15T-token comparison against Llama 3.1 8B holds the model architecture roughly constant but not the mixture: Llama 3.1's training mix is proprietary, so the comparison is suggestive rather than fully controlled. The authors are explicit that the 0.5-point average gain across ten benchmarks is small, and the principal claim is the substantial improvement on knowledge-heavy tasks such as MMLU rather than uniform dominance.