# Nemotron-CC

> Source: https://aiwiki.ai/wiki/nemotron_cc
> Updated: 2026-06-24
> Categories: Data & Datasets, NVIDIA, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Nemotron-CC** is a large-scale, open English-language [pretraining](/wiki/pretraining) dataset for [large language models](/wiki/large_language_models) released by [NVIDIA](/wiki/nvidia) in December 2024. The corpus contains approximately 6.3 trillion tokens drawn from 99 [Common Crawl](/wiki/common_crawl) snapshots, of which roughly 4.4 trillion are globally deduplicated original tokens and approximately 1.9 trillion are synthetically generated tokens produced by rephrasing or restructuring web documents with an instruction-tuned model [1][3]. The dataset was introduced in the paper "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset" by Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and [Bryan Catanzaro](/wiki/bryan_catanzaro) of NVIDIA, which was first posted to arXiv on 3 December 2024 and subsequently accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) as a long paper [1][2].

Nemotron-CC was built using NVIDIA's [NeMo Curator](/wiki/nemo_curator) data-curation framework and is distributed openly through the Common Crawl Foundation's contributed dataset repository [4][6]. The corpus is notable for combining an ensemble of model-based quality classifiers with large-scale [synthetic data](/wiki/synthetic_data) generation to address a trade-off the authors call the "quality versus quantity dilemma": aggressive classifier-based filtering, as used in datasets such as [FineWeb](/wiki/fineweb)-Edu, removes up to roughly 90 percent of source tokens, while lightly filtered corpora retain volume but score lower on benchmarks [1][3]. By rephrasing low-quality documents and diversifying high-quality ones, the Nemotron-CC pipeline retains approximately four times more unique real tokens than the DataComp-LM (DCLM) baseline at comparable downstream accuracy [3][5]. NVIDIA reported that an 8-billion-parameter dense transformer trained for 15 trillion tokens, of which 7.2 trillion came from Nemotron-CC, outperforms the Llama 3.1 8B base model on the Massive Multitask Language Understanding (MMLU) benchmark by 5 points and on ARC-Challenge by 3.1 points, with a 0.5-point average improvement across ten common downstream tasks [1][5].

## What problem does Nemotron-CC solve?

The Common Crawl Foundation has published broad-coverage web crawl snapshots since 2008, and the resulting petabytes of WARC files have become the dominant source of pretraining data for open foundation models. Earlier curated derivatives include C4 (2019), the Pile (2020), RefinedWeb (2023), RedPajama-V1 and V2 (2023), [Dolma](/wiki/dolma) (2023), FineWeb and FineWeb-Edu (2024), and [DCLM](/wiki/dclm)-Baseline (2024). These projects converged on a multi-stage recipe of language identification, text extraction from HTML, exact and fuzzy deduplication, heuristic quality filters, and, increasingly, model-based quality classifiers. The Chinchilla scaling laws put the compute-optimal budget at roughly 20 tokens per parameter, and later models have been trained far beyond that ratio (a 15-trillion-token horizon for an 8B-parameter model is close to 1,900 tokens per parameter), which has pushed the demand for clean English web text into the tens of trillions of tokens.

By late 2024 the most accurate open Common Crawl derivative, by reported MMLU scores at fixed compute, was DCLM-Baseline from the DataComp-LM benchmark, which used a fastText classifier trained with instruction-style text as positive examples to filter Common Crawl down to roughly 3.8 trillion tokens [10]. FineWeb-Edu had taken an alternative approach, using a classifier trained on Llama-3-70B-Instruct annotations to score documents for educational value and retaining only about the top 10 percent of the source corpus, yielding around 1.3 trillion tokens with strong knowledge benchmark performance [11]. Both approaches sacrificed a substantial fraction of available tokens, motivating the Nemotron-CC team to design a pipeline that could preserve more of the underlying material while still matching or exceeding accuracy baselines. The authors summarize the core insight in two findings: "Ensembling different model-based classifiers can help select a larger and more diverse set of high quality tokens," and "Rephrasing can effectively reduce noise and errors in low-quality data and produce diverse variants with fresh unique tokens" [5].

## How was Nemotron-CC built?

The Nemotron-CC pipeline begins with 99 Common Crawl snapshots spanning CC-MAIN-2013-20 through CC-MAIN-2024-30 [1]. WARC files are processed using the jusText HTML-to-text extractor, and a fastText language identifier retains documents predicted to be English. The team reports that English content constitutes roughly 73 percent of the raw token stream after extraction. Exact deduplication removes byte-identical documents through hashing, and fuzzy deduplication applies MinHash signatures with locality-sensitive hashing to remove near-duplicates with high Jaccard similarity. After global deduplication across all snapshots the corpus contains approximately 4.4 trillion unique English tokens [3].
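The two deduplication stages described above can be illustrated with a small, standard-library-only sketch. It is not the NeMo Curator implementation: the shingle size, the number of hash functions, the band count, and all function names are assumptions chosen for readability, and a production pipeline would run distributed on GPUs.

```python
import hashlib
import re


def exact_dedup(docs):
    """Keep the first occurrence of each byte-identical document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Set of word n-grams; n=3 is an illustrative choice."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash64(value, seed):
    """64-bit hash of a shingle under a seeded (salted) hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_perm=64):
    """MinHash signature: the minimum hash value per seeded hash function."""
    items = shingles(text)
    return [min(_hash64(s, seed) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Documents that share at least one identical band become candidates."""
    rows = len(signatures[0]) // bands
    buckets = {}
    for idx, sig in enumerate(signatures):
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs
```

In this sketch, candidate pairs returned by `lsh_candidate_pairs` would still be checked against a similarity threshold (for example with `estimated_jaccard`) before one document of each pair is dropped; the threshold used for Nemotron-CC is not stated here.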

### Quality classifier ensemble

A central design choice in Nemotron-CC is the use of an ensemble of three model-based quality classifiers rather than a single scoring model:

- The DCLM fastText classifier, originally trained on instruction-tuned text contrasted with random web crawl samples.
- A FineWeb-Edu-style classifier built from a Snowflake-arctic-embed-m encoder and trained on annotations from a Mixtral-derived teacher.
- A FineWeb Nemotron-4 Edu classifier trained with annotations from the Nemotron-4 340B Instruct model.

Each classifier produces a real-valued score, which is mapped via percentile thresholds to an integer rank from 0 (worst) to 19 (best). The maximum rank across the three classifiers becomes the ensemble score, and documents are then bucketed into five quality tiers ranging from low to high. The authors found that taking the maximum, rather than the mean, allowed each classifier to surface high-quality documents that the others missed, expanding the high-quality pool relative to single-classifier filtering [1][5].
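A minimal sketch of the rank-and-maximum ensemble follows. The percentile cut points are computed from a sample of scores, each score is mapped to a rank from 0 to 19, and the ensemble takes the maximum rank. The grouping of the 20 ranks into five tiers is an assumption made for illustration (equal-width groups of four ranks); the actual grouping used by the authors is not given in this article, and all function names are hypothetical.

```python
from bisect import bisect_right

NUM_RANKS = 20
TIER_NAMES = ["low", "medium-low", "medium", "medium-high", "high"]


def percentile_thresholds(scores, num_ranks=NUM_RANKS):
    """Cut points splitting the sorted scores into num_ranks roughly equal groups."""
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[(k * n) // num_ranks] for k in range(1, num_ranks)]


def to_rank(score, thresholds):
    """Integer rank from 0 (worst) to len(thresholds) (best)."""
    return bisect_right(thresholds, score)


def ensemble_rank(ranks):
    """Ensemble score: the maximum rank assigned by any classifier."""
    return max(ranks)


def quality_tier(rank, ranks_per_tier=4):
    """ASSUMPTION: equal-width grouping of ranks 0-19 into five tiers."""
    return TIER_NAMES[min(rank // ranks_per_tier, len(TIER_NAMES) - 1)]


# Example: a document ranked 3, 17, and 9 by the three classifiers.
example_rank = ensemble_rank([3, 17, 9])
example_tier = quality_tier(example_rank)
```

Taking the maximum means a document only needs one classifier to rate it highly, which is the mechanism the authors credit for enlarging the high-quality pool.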

The team also disables conventional heuristic filters such as length, symbol-to-word ratio, and perplexity thresholds for documents in the highest-quality buckets, observing that those filters tended to remove genuinely useful technical content that simply differed from prose norms [1]. Heuristic filtering and KenLM perplexity filtering are retained for lower-quality tiers, where they continue to reduce noise.
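The tier-dependent filtering policy can be sketched as below. The thresholds, the symbol list, and the choice to exempt only the tier labelled "high" are hypothetical values picked for illustration, not the settings used for Nemotron-CC, and the KenLM perplexity filter is omitted because it requires a trained language model.

```python
def symbol_to_word_ratio(text, symbols=("#", "...")):
    """Occurrences of the given symbols divided by the whitespace word count."""
    words = text.split()
    if not words:
        return float("inf")
    return sum(text.count(s) for s in symbols) / len(words)


def keep_document(text, tier, min_words=50, max_symbol_ratio=0.1):
    """Apply heuristic filters only outside the top tier.

    ASSUMPTION: min_words, max_symbol_ratio, and the exempt tier name
    are illustrative; the real pipeline's values are not stated here.
    """
    if tier == "high":
        # Heuristics are disabled for top-tier documents so that unusual
        # but useful technical content is not discarded.
        return True
    n_words = len(text.split())
    return n_words >= min_words and symbol_to_word_ratio(text) <= max_symbol_ratio
```

A short code snippet full of `#` characters would pass when its ensemble tier is "high" but be removed in a lower tier, which is the behavior the paragraph above describes.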

### Synthetic data generation

The second pillar of Nemotron-CC is large-scale synthetic data generation. The team applied an instruction-tuned 12B-parameter Mistral NeMo model to rewrite or augment documents in two regimes [1][4]:

- For high-quality documents, four diversification prompts produce additional pretraining material that varies stylistically and structurally from the source while retaining its information content.
- For low-quality documents, a single rephrasing prompt rewrites noisy or poorly formatted text into clearer prose, recovering some of the underlying knowledge that would otherwise be discarded.

The five prompt templates are described in the paper as Wikipedia-style rephrasing, diverse question-answer pair generation, distillation (concise rewrites), knowledge extraction (informative content restated), and knowledge lists (organized factual compilations) [1]. In total the synthetic generation step produces roughly 1.9 trillion tokens, of which approximately 1.5 trillion come from high-quality diversification and around 336 billion from low-quality rephrasing. The synthetic component constitutes about 30 percent of the final corpus [1][3].
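The routing of documents to prompt templates can be sketched as a small job builder. The template identifiers below are hypothetical labels for the five templates named above, the assignment of the Wikipedia-style template to low-quality documents is an assumption based on the template descriptions, and no model call is made; the sketch only shows which generation jobs each document would produce.

```python
# ASSUMPTION: hypothetical identifiers for the five templates named in the text.
HIGH_QUALITY_PROMPTS = (
    "diverse_qa_pairs",
    "distill",
    "extract_knowledge",
    "knowledge_list",
)
LOW_QUALITY_PROMPTS = ("wikipedia_style_rephrase",)


def prompts_for(tier):
    """Return the prompt templates applied to a document of the given tier."""
    if tier == "high":
        return HIGH_QUALITY_PROMPTS
    if tier == "low":
        return LOW_QUALITY_PROMPTS
    # Other tiers are kept as real text only in this sketch.
    return ()


def build_generation_jobs(doc_id, text, tier):
    """One job per (document, prompt) pair, ready to send to a rewriting model."""
    return [
        {"doc_id": doc_id, "prompt": name, "source_text": text}
        for name in prompts_for(tier)
    ]
```

Because each high-quality document yields four jobs and each low-quality document yields one, the diversification branch produces far more synthetic tokens than the rephrasing branch, in line with the roughly 1.5 trillion versus 336 billion split reported above.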

### Dataset composition

The final Nemotron-CC corpus is organized into quality buckets that mix real and synthetic content. The approximate composition reported in the paper is summarized below [1].

| Subset | Approximate tokens | Notes |
|---|---|---|
| High quality (HQ) bucket | 553 B | Top ensemble-rank real documents |
| Medium-high quality | 504 B | Second tier of real documents |
| Medium quality | 2,023 B | Bulk of real corpus |
| Medium-low quality | 894 B | Lighter filtering |
| Low quality | 402 B | Retained with caution |
| Synthetic from HQ (diversified) | ~1,500 B | Four-prompt diversification |
| Synthetic from LQ (rephrased) | ~336 B | Single rephrasing prompt |
| Total | ~6.3 T | 4.4 T real plus 1.9 T synthetic |
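As a quick arithmetic check on the table, the following sketch sums the rows as listed (in billions of tokens). The dictionary names are illustrative, and the two synthetic rows are approximate figures, so the totals are approximate as well.

```python
# Token counts in billions, copied from the table above.
REAL_B = {
    "high": 553,
    "medium-high": 504,
    "medium": 2023,
    "medium-low": 894,
    "low": 402,
}
SYNTHETIC_B = {"from_high_quality": 1500, "from_low_quality": 336}

real_total = sum(REAL_B.values())            # 4376 B, about 4.4 T
synthetic_total = sum(SYNTHETIC_B.values())  # 1836 B, about 1.8 T
grand_total = real_total + synthetic_total   # 6212 B, about 6.2 T
synthetic_share = synthetic_total / grand_total

print(f"real: {real_total} B, synthetic: {synthetic_total} B")
print(f"total: {grand_total} B, synthetic share: {synthetic_share:.1%}")
```

The real buckets sum to 4,376 billion tokens, matching the 4.4 trillion figure. The listed synthetic rows sum to about 1.84 trillion, slightly below the rounded 1.9 trillion headline, which is consistent with those rows being approximations.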

NVIDIA also released a curated high-quality subset, Nemotron-CC-HQ, containing approximately 1.1 trillion tokens (about 0.6 trillion real and 0.5 trillion synthetic), intended for shorter training runs or as a high-grade mix-in within larger blends [3][5].

## How does Nemotron-CC perform on benchmarks?

The paper evaluates Nemotron-CC in two regimes: a controlled short-horizon comparison against peer datasets using 8B-parameter models trained on 1 trillion tokens, and a long-horizon training run of 15 trillion tokens designed to test scaling [1].

### Short-horizon comparison

Using the same 8B Llama-3-style architecture and training recipe across datasets, the authors report MMLU scores after 1 trillion training tokens. Nemotron-CC-HQ outperforms both DCLM-Baseline and FineWeb-Edu by substantial margins [1][3].

| Dataset | Tokens trained | MMLU (5-shot) |
|---|---|---|
| FineWeb-Edu | 1 T | 42.9 |
| RefinedWeb | 1 T | ~46 |
| DCLM-Baseline | 1 T | 53.4 |
| Nemotron-CC (full) | 1 T | ~53 |
| Nemotron-CC-HQ | 1 T | 59.0 |

The 5.6-point MMLU gap between Nemotron-CC-HQ and DCLM-Baseline at matched compute is the headline short-horizon result [3][5]. As NVIDIA states, the full corpus "matches DCLM on MMLU, but contains four times more unique real tokens," addressing the long-horizon supply problem [3].

### Long-horizon training

The team trained an 8B-parameter dense transformer on 15 trillion tokens, drawing 7.2 trillion from Nemotron-CC and the remainder from a mixture of code, math, multilingual, and curated knowledge sources [1][5]. The resulting model is compared with Meta's Llama 3.1 8B base across ten downstream benchmarks. The Nemotron-CC-trained model meets or exceeds Llama 3.1 8B on eight of the ten tasks, including a 5-point gain on MMLU and a 3.1-point gain on ARC-Challenge [1][5].

| Benchmark | Nemotron-CC 8B (15T) | Llama 3.1 8B |
|---|---|---|
| MMLU (5-shot) | 70.3 | 65.3 |
| ARC-Challenge | 58.1 | 55.0 |
| ARC-Easy | 82.7 | 81.4 |
| HellaSwag | 80.8 | 81.6 |
| Winogrande | 73.8 | 73.5 |
| RACE | 37.8 | 37.5 |
| PIQA | 81.1 | 81.1 |
| Social IQA | 47.4 | 47.2 |
| CommonsenseQA | 69.9 | 70.4 |
| OpenBookQA | 45.4 | 45.0 |
| Average | 64.7 | 63.8 |

The average gain across the ten benchmarks is modest in aggregate (0.5 points as reported by the authors; the rounded scores listed above give about 0.9) and is concentrated on the knowledge-intensive MMLU and ARC-Challenge tasks, which the authors interpret as evidence that the synthetic component and aggressive quality bucketing help the model acquire factual and reasoning content [1].
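The per-task differences can be recomputed from the long-horizon table with a few lines of Python. The scores are copied from the table as listed in this article, so the output reflects those rounded values rather than an independent evaluation; the variable names are illustrative.

```python
# (Nemotron-CC 8B, Llama 3.1 8B) scores as listed in the table above.
SCORES = {
    "MMLU": (70.3, 65.3),
    "ARC-Challenge": (58.1, 55.0),
    "ARC-Easy": (82.7, 81.4),
    "HellaSwag": (80.8, 81.6),
    "Winogrande": (73.8, 73.5),
    "RACE": (37.8, 37.5),
    "PIQA": (81.1, 81.1),
    "Social IQA": (47.4, 47.2),
    "CommonsenseQA": (69.9, 70.4),
    "OpenBookQA": (45.4, 45.0),
}

deltas = {task: round(ours - base, 1) for task, (ours, base) in SCORES.items()}
wins = sum(d > 0 for d in deltas.values())
ties = sum(d == 0 for d in deltas.values())
losses = sum(d < 0 for d in deltas.values())

mean_ours = sum(ours for ours, _ in SCORES.values()) / len(SCORES)
mean_base = sum(base for _, base in SCORES.values()) / len(SCORES)

for task, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:15s} {delta:+.1f}")
print(f"wins={wins} ties={ties} losses={losses}")
print(f"mean difference: {mean_ours - mean_base:+.2f}")
```

With these inputs the largest differences are on MMLU and ARC-Challenge, while HellaSwag and CommonsenseQA are slightly lower, which matches the pattern described in the text.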

## How does Nemotron-CC compare to DCLM and FineWeb-Edu?

Nemotron-CC sits alongside a growing family of large open English pretraining corpora. The comparison below summarizes scale and design choices.

| Dataset | Year | Approx tokens | Source | Quality filter | Synthetic data |
|---|---|---|---|---|---|
| C4 | 2019 | 156 B (en) | Common Crawl | Heuristics | None |
| RefinedWeb | 2023 | 5 T (600 B public) | Common Crawl | Heuristics (MacroData Refinement) | None |
| RedPajama-V2 | 2023 | 30 T (raw) | Common Crawl | Quality signals | None |
| Dolma v1.7 | 2024 | 2.3 T | Mixed web, books, code | Heuristics, classifiers | None |
| FineWeb | 2024 | 15 T | Common Crawl | Heuristics, dedup | None |
| FineWeb-Edu | 2024 | 1.3 T | Common Crawl | Educational classifier | None |
| DCLM-Baseline | 2024 | 3.8 T | Common Crawl | fastText classifier | None |
| [Common Pile](/wiki/common_pile) v0.1 | 2025 | 8 TB of text (size, not tokens) | Permissively licensed sources | Heuristics | None |
| [Common Corpus](/wiki/common_corpus) | 2024 | 2 T | Permissively licensed sources | Heuristics | None |
| Nemotron-CC | 2024 | 6.3 T | Common Crawl | Classifier ensemble | 1.9 T |

Relative to FineWeb-Edu, Nemotron-CC adopts a less aggressive filter and replaces lost tokens with synthetic generations, trading some pure educational density for substantially more total volume. Relative to DCLM-Baseline, Nemotron-CC uses a multi-classifier ensemble rather than a single fastText filter and adds the synthetic component, which the team argues is the primary driver of the MMLU gap at fixed compute [1][5]. Compared with Dolma and Common Pile, which emphasize source diversity or licensing transparency rather than scale alone, Nemotron-CC is specialized for the English Common Crawl slice and aimed squarely at maximizing downstream benchmark accuracy under fixed training horizons.

## What tools built Nemotron-CC, and is it reproducible?

The end-to-end pipeline that produced Nemotron-CC is implemented in [NeMo Curator](/wiki/nemo_curator), NVIDIA's open-source GPU-accelerated data-curation library [4][12]. NeMo Curator provides distributed implementations of HTML extraction, language identification, exact and fuzzy deduplication, heuristic filtering, classifier-based filtering, perplexity scoring, and synthetic data generation. NVIDIA has published the rephrasing prompts, classifier checkpoints, and pipeline configurations used for Nemotron-CC in the NeMo Curator GitHub repository, allowing third parties to reproduce the construction or apply the same recipe to non-English crawls or domain-specific corpora [4][12].

The dataset itself is hosted on the Common Crawl Foundation's data servers at data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/ and on Hugging Face under nvidia/Nemotron-CC and nvidia/Nemotron-CC-HQ [6][9]. The corpus inherits Common Crawl's terms of use; the synthetic component is released under a permissive license attached to NVIDIA's contributions. Quality metadata is included alongside text, allowing downstream users to filter or reweight buckets according to their training objectives.
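How the quality metadata might be used downstream can be sketched as follows. The record layout is hypothetical: the field names `text` and `quality_tier` and the JSON Lines framing are assumptions for illustration and should be checked against the actual files before use.

```python
import json
import random


def iter_records(lines):
    """Parse JSON Lines input, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_tier(records, allowed_tiers):
    """Keep only records whose (hypothetical) quality_tier field is allowed."""
    return [r for r in records if r.get("quality_tier") in allowed_tiers]


def sample_with_weights(records, tier_weights, k, seed=0):
    """Sample k records with replacement, reweighting by quality tier."""
    rng = random.Random(seed)
    weights = [tier_weights.get(r.get("quality_tier"), 0.0) for r in records]
    return rng.choices(records, weights=weights, k=k)


# Toy in-memory data standing in for downloaded shards.
RAW_LINES = [
    '{"text": "A clear explanation of MinHash.", "quality_tier": "high"}',
    '{"text": "buy cheap stuff now", "quality_tier": "low"}',
    '{"text": "Forum thread about GPUs.", "quality_tier": "medium"}',
]

records = list(iter_records(RAW_LINES))
top_only = filter_by_tier(records, {"high", "medium"})
upweighted = sample_with_weights(records, {"high": 3.0, "medium": 1.0}, k=5)
```

Filtering drops whole buckets, while weighted sampling keeps every bucket available but changes how often each is seen during training; which option suits a given run depends on the token budget.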

## How has Nemotron-CC been used and extended?

Nemotron-CC was received as a significant contribution to the open pretraining-data ecosystem on release. Practitioners on Hugging Face and in academic venues highlighted three aspects: the explicit demonstration that synthetic rephrasing can substitute for discarded low-quality text without degrading downstream accuracy, the ensemble-of-classifiers approach as an alternative to single-classifier pipelines such as DCLM, and the unusually transparent reporting of per-bucket token counts and per-benchmark scores. Within months of release, Nemotron-CC and its HQ subset appeared as components in pretraining mixtures used by several open-model efforts, and the corpus was used as a baseline for subsequent dataset releases.

NVIDIA itself extended the recipe in 2025 with two follow-ups. Nemotron-CC-Math, introduced in an arXiv preprint dated August 2025, applied a similar pipeline to mathematics-bearing web pages, using layout-aware rendering with the Lynx browser and an LLM-based cleaning stage to preserve MathJax, KaTeX, and MathML equation content [7]. The math corpus is released in two variants, Nemotron-CC-Math-4+ at 52.3 billion tokens (top quality scores 4 to 5) and Nemotron-CC-Math-3+ at 133.3 billion tokens (scores 3 to 5), under an Apache 2.0 license [7]. When used in pretraining a Nemotron-T 8B model, the math corpus produced reported gains of +4.8 to +12.6 points on the MATH benchmark and +4.6 to +14.3 points on MBPP+ over strong baselines, and was reported to outperform FineMath, MegaMath, and OpenWebMath on math reasoning evaluations [7].

NVIDIA also released Nemotron-CC-v2 and a broader Nemotron pretraining dataset family in 2025 as part of the [Nemotron Nano 2](/wiki/nemotron_nano_2) model release. Nemotron-CC-v2 expanded the original recipe with additional Common Crawl snapshots, refined classifiers, and additional synthetic prompt templates, and was used together with code, math, and multilingual mixes to train the Nemotron Nano 2 hybrid Mamba-Transformer reasoning model [8].

## Limitations and criticism

Several limitations of Nemotron-CC have been noted. The synthetic component is generated by a single instruction-tuned model, Mistral NeMo 12B Instruct, which raises questions about model-induced bias propagating into pretraining data and about possible benchmark contamination if the teacher model was itself exposed to evaluation material [1]. The authors discuss decontamination procedures applied to the synthetic outputs, but third-party audits at the scale of 1.9 trillion synthetic tokens are difficult.

The corpus is English-only, which restricts its direct utility for multilingual model training, although the same NeMo Curator recipe has been demonstrated on non-English subsets. The use of Common Crawl as the sole source means that Nemotron-CC inherits known issues of web data, including over-representation of certain genres, copyrighted content whose fair-use status is ambiguous, and a long tail of low-quality SEO and machine-generated text that even ensemble classifiers cannot fully filter. The dataset's license is permissive but defers to Common Crawl's terms, which some downstream users interpret cautiously for commercial deployment.

Finally, the long-horizon 15T-token comparison against Llama 3.1 8B holds the model architecture roughly constant but not the mixture: Llama 3.1's training mix is proprietary, so the comparison is suggestive rather than fully controlled. The authors are explicit that the 0.5-point average gain across ten benchmarks is small, and the principal claim is the substantial improvement on knowledge-heavy tasks such as MMLU rather than uniform dominance [1].

## See also

- [FineWeb](/wiki/fineweb)
- [Dolma](/wiki/dolma)
- [DCLM](/wiki/dclm)
- [Common Pile](/wiki/common_pile)
- [Common Corpus](/wiki/common_corpus)
- [Common Crawl](/wiki/common_crawl)
- [NeMo Curator](/wiki/nemo_curator)
- [Synthetic data](/wiki/synthetic_data)
- [Pretraining](/wiki/pretraining)
- [Large language models](/wiki/large_language_models)
- [NVIDIA](/wiki/nvidia)

## References

1. Su, D., Kong, K., Lin, Y., Jennings, J., Norick, B., Kliegl, M., Patwary, M., Shoeybi, M., and Catanzaro, B. (2024). "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset." arXiv:2412.02595. https://arxiv.org/abs/2412.02595
2. Su, D. et al. (2025). "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Volume 1: Long Papers, pages 2459-2475. https://aclanthology.org/2025.acl-long.123/
3. NVIDIA Technical Blog. (2025). "Announcing Nemotron-CC: A Trillion-Token English Language Dataset for LLM Pretraining." https://developer.nvidia.com/blog/announcing-nemotron-cc-a-trillion-token-english-language-dataset-for-llm-pretraining/
4. NVIDIA Technical Blog. (2025). "Building Nemotron-CC, A High-Quality Trillion Token Dataset for LLM Pretraining from Common Crawl Using NVIDIA NeMo Curator." https://developer.nvidia.com/blog/building-nemotron-cc-a-high-quality-trillion-token-dataset-for-llm-pretraining-from-common-crawl-using-nvidia-nemo-curator/
5. NVIDIA Applied Deep Learning Research. "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset." https://research.nvidia.com/labs/adlr/Nemotron-CC/
6. Common Crawl Foundation. "Nemotron-CC Dataset Index." https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html
7. NVIDIA. (2025). "Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset." arXiv:2508.15096. https://arxiv.org/abs/2508.15096
8. NVIDIA. (2025). "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model." arXiv:2508.14444. https://arxiv.org/abs/2508.14444
9. Hugging Face. "nvidia/Nemotron-CC and nvidia/Nemotron-CC-HQ Datasets." https://huggingface.co/datasets/nvidia/Nemotron-CC
10. Li, J. et al. (2024). "DataComp-LM: In Search of the Next Generation of Training Sets for Language Models." arXiv:2406.11794. https://arxiv.org/abs/2406.11794
11. Penedo, G. et al. (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv:2406.17557. https://arxiv.org/abs/2406.17557
12. NVIDIA. "NeMo Curator: Scalable Data Curation for LLMs." https://github.com/NVIDIA/NeMo-Curator