Nemotron-CC

Updated Jun 24, 2026

Nemotron-CC is a large-scale, open English-language pretraining dataset for large language models released by NVIDIA in December 2024. The corpus contains approximately 6.3 trillion tokens drawn from 99 Common Crawl snapshots, of which roughly 4.4 trillion are globally deduplicated original tokens and approximately 1.9 trillion are synthetically generated tokens produced by rephrasing or restructuring web documents with an instruction-tuned model ^[1]^[3]. The dataset was introduced in the paper "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset" by Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro of NVIDIA, which was first posted to arXiv on 3 December 2024 and subsequently accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) as a long paper ^[1]^[2].

Nemotron-CC was built using NVIDIA's NeMo Curator data-curation framework and is distributed openly through the Common Crawl Foundation's contributed dataset repository ^[4]^[6]. The corpus is notable for combining an ensemble of model-based quality classifiers with large-scale synthetic data generation to address a trade-off the authors call the "quality versus quantity dilemma": aggressive model-based filtering, as used in datasets such as FineWeb-Edu, removes up to roughly 90 percent of source tokens, while lightly filtered corpora retain volume but suffer in benchmark accuracy ^[1]^[3]. By rephrasing low-quality documents and diversifying high-quality ones, the Nemotron-CC pipeline retains approximately four times more unique real tokens than the DataComp-LM (DCLM) baseline at comparable downstream accuracy ^[3]^[5]. NVIDIA reported that an 8-billion-parameter dense transformer trained for 15 trillion tokens, of which 7.2 trillion came from Nemotron-CC, outperforms the Llama 3.1 8B base model on the Massive Multitask Language Understanding (MMLU) benchmark by 5 points and on ARC-Challenge by 3.1 points, with a 0.5-point average improvement across ten common downstream tasks ^[1]^[5].

What problem does Nemotron-CC solve?

The Common Crawl Foundation has published monthly web snapshots of broad-coverage crawls since 2008, and the resulting petabytes of WARC files have become the dominant source of pretraining data for open foundation models. Earlier curated derivatives include C4 (2019), the Pile (2020), RefinedWeb (2023), RedPajama-V1 and V2 (2023), Dolma (2023), FineWeb and FineWeb-Edu (2024), and DCLM-Baseline (2024). These projects converged on a multi-stage recipe of language identification, text extraction from HTML, exact and fuzzy deduplication, heuristic quality filters, and increasingly model-based quality classifiers. The Chinchilla scaling laws put the compute-optimal training budget at roughly 20 tokens per parameter, but later models have been trained far beyond that point (Llama 3.1 8B on about 15 trillion tokens), which has pushed the demand for clean English web text into the tens of trillions of tokens.

By late 2024 the most accurate open Common Crawl derivative, by reported MMLU scores at fixed compute, was DCLM-Baseline from the DataComp-LM benchmark, which used a fastText classifier trained to separate instruction-style text from random web samples and filtered Common Crawl down to roughly 3.8 trillion tokens ^[10]. FineWeb-Edu had taken an alternative approach, using a Llama-3-70B-derived classifier to score documents for educational value and retaining roughly the top 10 percent of the source corpus, yielding around 1.3 trillion tokens with strong knowledge benchmark performance ^[11]. Both approaches sacrificed a substantial fraction of available tokens, motivating the Nemotron-CC team to design a pipeline that could preserve more of the underlying material while still matching or exceeding accuracy baselines. The authors summarize the core insight in two findings: "Ensembling different model-based classifiers can help select a larger and more diverse set of high quality tokens," and "Rephrasing can effectively reduce noise and errors in low-quality data and produce diverse variants with fresh unique tokens" ^[5].

How was Nemotron-CC built?

The Nemotron-CC pipeline begins with 99 Common Crawl snapshots spanning CC-MAIN-2013-20 through CC-MAIN-2024-30 ^[1]. WARC files are processed using the jusText HTML-to-text extractor, and a fastText language identifier retains documents predicted to be English. The team reports that English content constitutes roughly 73 percent of the raw token stream after extraction. Exact deduplication removes byte-identical documents through hashing, and fuzzy deduplication applies MinHash signatures with locality-sensitive hashing to remove near-duplicates with high Jaccard similarity. After global deduplication across all snapshots the corpus contains approximately 4.4 trillion unique English tokens ^[3].
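
The two deduplication stages described above can be illustrated with a small, standard-library-only sketch. It is not the NeMo Curator implementation: the shingle size, the number of hash functions, the band count, and all function names are assumptions chosen for readability.

```python
import hashlib


def exact_key(text: str) -> str:
    """Digest used to drop byte-identical documents (exact deduplication)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Lower-cased word n-grams; n=3 is an illustrative choice."""
    words = text.lower().split()
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items: set, num_perm: int = 128) -> list:
    """For each salted hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salt per simulated permutation
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, which estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_band_keys(signature: list, bands: int = 16) -> list:
    """Split the signature into bands; documents sharing any band key become candidates."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
    doc_b = "the quick brown fox jumps over the lazy dog near the river bank yesterday"
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    print("estimated Jaccard:", estimated_jaccard(sig_a, sig_b))
    print("shared LSH bands:", len(set(lsh_band_keys(sig_a)) & set(lsh_band_keys(sig_b))))
```

Banding trades recall against cost: more, shorter bands flag more candidate pairs at lower similarity, while fewer, longer bands only catch very close duplicates. A production pipeline would also have to pick a similarity threshold, which the text above does not specify.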

Quality classifier ensemble

A central design choice in Nemotron-CC is the use of an ensemble of three model-based quality classifiers rather than a single scoring model:

The DCLM fastText classifier, originally trained on instruction-tuned text contrasted with random web crawl samples.
A FineWeb-Edu-style classifier built from a Snowflake-arctic-embed-m encoder and trained on annotations from a Mixtral-derived teacher.
A FineWeb Nemotron-4 Edu classifier trained with annotations from the Nemotron-4 340B Instruct model.

Each classifier produces a real-valued score, which is mapped via percentile thresholds to an integer rank from 0 (worst) to 19 (best). The maximum rank across the three classifiers becomes the ensemble score, and documents are then bucketed into five quality tiers ranging from low to high. The authors found that taking the maximum, rather than the mean, allowed each classifier to surface high-quality documents that the others missed, expanding the high-quality pool relative to single-classifier filtering ^[1]^[5].
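
A minimal sketch of the rank-and-ensemble step follows. The percentile mapping and the use of the maximum come from the description above; the grouping of the 20 ranks into five equal tiers of four ranks each is an assumption made purely for illustration, as are all function names.

```python
from bisect import bisect_right

NUM_RANKS = 20  # integer ranks 0..19, as described in the text


def percentile_thresholds(reference_scores, num_ranks=NUM_RANKS):
    """Cut points that split a reference sample into num_ranks equal-mass bins."""
    ordered = sorted(reference_scores)
    n = len(ordered)
    return [ordered[(k * n) // num_ranks] for k in range(1, num_ranks)]


def to_rank(score, thresholds):
    """Count the thresholds at or below the score, giving an integer in 0..19."""
    return bisect_right(thresholds, score)


def ensemble_rank(ranks):
    """Ensemble score: the best rank any single classifier assigns."""
    return max(ranks)


def quality_tier(rank, ranks_per_tier=4):
    """ASSUMPTION: five equal-width tiers (0 = low, 4 = high); the real grouping is not stated here."""
    return rank // ranks_per_tier


if __name__ == "__main__":
    # Hypothetical per-classifier ranks for one document.
    ranks = [3, 17, 9]
    print("max-ensemble tier:", quality_tier(ensemble_rank(ranks)))
    print("mean-ensemble tier:", quality_tier(sum(ranks) // len(ranks)))
```

With the hypothetical ranks in the usage example, one classifier rates the document highly and the other two do not. Taking the maximum keeps the document in the top tier, whereas averaging would push it toward the middle, which is the behaviour the authors cite as the reason for preferring the maximum.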

The team also disables conventional heuristic filters such as length, symbol-to-word ratio, and perplexity thresholds for documents in the highest-quality buckets, observing that those filters tended to remove genuinely useful technical content that simply differed from prose norms ^[1]. Heuristic filtering and KenLM perplexity filtering are retained for lower-quality tiers, where they continue to reduce noise.

Synthetic data generation

The second pillar of Nemotron-CC is large-scale synthetic data generation. The team applied an instruction-tuned 12B-parameter Mistral NeMo model to rewrite or augment documents in two regimes ^[1]^[4]:

For high-quality documents, four diversification prompts produce additional pretraining material that varies stylistically and structurally from the source while retaining its information content.
For low-quality documents, a single rephrasing prompt rewrites noisy or poorly formatted text into clearer prose, recovering some of the underlying knowledge that would otherwise be discarded.

The five prompt templates are described in the paper as Wikipedia-style rephrasing, diverse question-answer pair generation, distillation (concise rewrites), knowledge extraction (informative content restated), and knowledge lists (organized factual compilations) ^[1]. In total the synthetic generation step produces roughly 1.9 trillion tokens, of which approximately 1.5 trillion come from high-quality diversification and around 336 billion from low-quality rephrasing. The synthetic component constitutes about 30 percent of the final corpus ^[1]^[3].
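
The routing of documents to prompt templates can be sketched as below. The template identifiers are hypothetical labels for the five templates named above, not the released prompt names. The sketch also assumes that Wikipedia-style rephrasing is the single low-quality prompt, which the text implies but does not state outright, and that middle tiers receive no synthetic treatment. The model call itself is left out.

```python
# Hypothetical identifiers for the five templates named in the text.
HQ_PROMPTS = ("diverse_qa_pairs", "distill", "extract_knowledge", "knowledge_list")
LQ_PROMPTS = ("wikipedia_style_rephrase",)


def prompts_for(tier: str) -> tuple:
    """Return the prompt templates applied to a document of the given quality tier."""
    if tier == "high":
        return HQ_PROMPTS  # four diversification prompts
    if tier == "low":
        return LQ_PROMPTS  # single rephrasing prompt
    return ()              # ASSUMPTION: other tiers are left untouched


def build_jobs(documents):
    """Expand (doc_id, tier) pairs into (doc_id, prompt_name) generation jobs."""
    return [(doc_id, prompt) for doc_id, tier in documents for prompt in prompts_for(tier)]


if __name__ == "__main__":
    docs = [("doc-1", "high"), ("doc-2", "low"), ("doc-3", "medium")]
    for job in build_jobs(docs):
        print(job)
```

Each high-quality document fans out into four generation jobs and each low-quality document into one, which is consistent with the bulk of the synthetic tokens coming from the high-quality side.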

Dataset composition

The final Nemotron-CC corpus is organized into quality buckets that mix real and synthetic content. The approximate composition reported in the paper is summarized below ^[1].

Subset	Approximate tokens	Notes
High quality (HQ) bucket	553 B	Top ensemble-rank real documents
Medium-high quality	504 B	Second tier of real documents
Medium quality	2,023 B	Bulk of real corpus
Medium-low quality	894 B	Lighter filtering
Low quality	402 B	Retained with caution
Synthetic from HQ (diversified)	~1,500 B	Four-prompt diversification
Synthetic from LQ (rephrased)	~336 B	Single rephrasing prompt
Total	~6.3 T	4.4 T real plus 1.9 T synthetic
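
As a quick arithmetic check on the table, the rows can be summed directly. The figures below are copied from the table (in billions of tokens); the dictionary keys are illustrative names only.

```python
# Token counts in billions, copied from the composition table.
REAL_B = {
    "high": 553,
    "medium_high": 504,
    "medium": 2023,
    "medium_low": 894,
    "low": 402,
}
SYNTHETIC_B = {"from_high": 1500, "from_low": 336}  # both rows are approximate

real_total = sum(REAL_B.values())            # 4376 B, i.e. about 4.4 T
synthetic_total = sum(SYNTHETIC_B.values())  # 1836 B, i.e. a little under 1.9 T
grand_total = real_total + synthetic_total   # 6212 B, i.e. about 6.2 T
synthetic_share = synthetic_total / grand_total

if __name__ == "__main__":
    print(f"real: {real_total} B, synthetic: {synthetic_total} B, total: {grand_total} B")
    print(f"synthetic share: {synthetic_share:.1%}")
```

The real-token rows add up to the stated 4.4 T. The rounded rows sum to about 6.2 T, slightly under the ~6.3 T headline total, presumably because the synthetic rows are approximate; either way the synthetic share comes out at roughly 30 percent.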

NVIDIA also released a curated high-quality subset, Nemotron-CC-HQ, containing approximately 1.1 trillion tokens (about 0.6 trillion real and 0.5 trillion synthetic), intended for shorter training runs or as a high-grade mix-in within larger blends ^[3]^[5].

How does Nemotron-CC perform on benchmarks?

The paper evaluates Nemotron-CC in two regimes: a controlled short-horizon comparison against peer datasets using 8B-parameter models trained on 1 trillion tokens, and a long-horizon training run of 15 trillion tokens designed to test scaling ^[1].

Short-horizon comparison

Using the same 8B Llama-3-style architecture and training recipe across datasets, the authors report MMLU scores after 1 trillion training tokens. Nemotron-CC-HQ outperforms both DCLM-Baseline and FineWeb-Edu by substantial margins ^[1]^[3].

Dataset	Tokens trained	MMLU (5-shot)
FineWeb-Edu	1 T	42.9
RefinedWeb	1 T	~46
DCLM-Baseline	1 T	53.4
Nemotron-CC (full)	1 T	~53
Nemotron-CC-HQ	1 T	59.0

The 5.6-point MMLU gap between Nemotron-CC-HQ and DCLM-Baseline at matched compute is the headline short-horizon result ^[3]^[5]. As NVIDIA states, the full corpus "matches DCLM on MMLU, but contains four times more unique real tokens," addressing the long-horizon supply problem ^[3].

Long-horizon training

The team trained an 8B-parameter dense transformer on 15 trillion tokens, drawing 7.2 trillion from Nemotron-CC and the remainder from a mixture of code, math, multilingual, and curated knowledge sources ^[1]^[5]. The resulting model is compared with Meta's Llama 3.1 8B base across ten downstream benchmarks. The Nemotron-CC-trained model meets or exceeds Llama 3.1 8B on eight of the ten tasks, including a 5-point gain on MMLU and a 3.1-point gain on ARC-Challenge ^[1]^[5].

Benchmark	Nemotron-CC 8B (15T)	Llama 3.1 8B
MMLU (5-shot)	70.3	65.3
ARC-Challenge	58.1	55.0
ARC-Easy	82.7	81.4
HellaSwag	80.8	81.6
Winogrande	73.8	73.5
RACE	37.8	37.5
PIQA	81.1	81.1
Social IQA	47.4	47.2
CommonsenseQA	69.9	70.4
OpenBookQA	45.4	45.0
Average	64.7	63.8

The average gain across the ten benchmarks is modest (0.9 points on the scores tabulated above, against the 0.5 points the paper reports) and is concentrated on the knowledge-intensive MMLU and ARC-Challenge tasks, which the authors interpret as evidence that the synthetic component and aggressive quality bucketing help the model acquire factual and reasoning content ^[1].
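
The per-task differences and the two averages can be recomputed from the table with a few lines of Python. The scores are copied from the table above; the helper names are illustrative.

```python
# (Nemotron-CC 8B at 15T tokens, Llama 3.1 8B), copied from the table above.
SCORES = {
    "MMLU (5-shot)": (70.3, 65.3),
    "ARC-Challenge": (58.1, 55.0),
    "ARC-Easy": (82.7, 81.4),
    "HellaSwag": (80.8, 81.6),
    "Winogrande": (73.8, 73.5),
    "RACE": (37.8, 37.5),
    "PIQA": (81.1, 81.1),
    "Social IQA": (47.4, 47.2),
    "CommonsenseQA": (69.9, 70.4),
    "OpenBookQA": (45.4, 45.0),
}


def deltas(scores):
    """Per-benchmark difference, Nemotron-CC minus Llama 3.1, rounded to one decimal."""
    return {name: round(ours - theirs, 1) for name, (ours, theirs) in scores.items()}


def averages(scores):
    """Unweighted mean score of each model across all benchmarks."""
    n = len(scores)
    ours = sum(pair[0] for pair in scores.values()) / n
    theirs = sum(pair[1] for pair in scores.values()) / n
    return ours, theirs


if __name__ == "__main__":
    for name, diff in sorted(deltas(SCORES).items(), key=lambda kv: -kv[1]):
        print(f"{name:16s} {diff:+.1f}")
    ours, theirs = averages(SCORES)
    print(f"average: {ours:.1f} vs {theirs:.1f}")
```

Sorting by difference makes the concentration visible: MMLU and ARC-Challenge account for most of the aggregate gain, while the remaining tasks move by about a point or less in either direction.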

How does Nemotron-CC compare to DCLM and FineWeb-Edu?

Nemotron-CC sits alongside a growing family of large open English pretraining corpora. The comparison below summarizes scale and design choices.

Dataset	Year	Approx tokens	Source	Quality filter	Synthetic data
C4	2019	156 B (en)	Common Crawl	Heuristics	None
RefinedWeb	2023	5 T (600 B public)	Common Crawl	Heuristics (MacroData Refinement)	None
RedPajama-V2	2023	30 T (raw)	Common Crawl	Quality signals	None
Dolma v1.7	2024	2.3 T	Mixed web, books, code	Heuristics, classifiers	None
FineWeb	2024	15 T	Common Crawl	Heuristics, dedup	None
FineWeb-Edu	2024	1.3 T	Common Crawl	Educational classifier	None
DCLM-Baseline	2024	3.8 T	Common Crawl	fastText classifier	None
Common Pile v0.1	2025	8 TB (text size)	Permissively licensed sources	Heuristics	None
Common Corpus	2024	2 T	Permissively licensed sources	Heuristics	None
Nemotron-CC	2024	6.3 T	Common Crawl	Classifier ensemble	1.9 T

Relative to FineWeb-Edu, Nemotron-CC adopts a less aggressive filter and replaces lost tokens with synthetic generations, trading some pure educational density for substantially more total volume. Relative to DCLM-Baseline, Nemotron-CC uses a multi-classifier ensemble rather than a single fastText filter and adds the synthetic component, which the team argues is the primary driver of the MMLU gap at fixed compute ^[1]^[5]. Compared with Dolma and Common Pile, which emphasize source diversity or licensing transparency rather than scale alone, Nemotron-CC is specialized for the English Common Crawl slice and aimed squarely at maximizing downstream benchmark accuracy under fixed training horizons.

What tools built Nemotron-CC, and is it reproducible?

The end-to-end pipeline that produced Nemotron-CC is implemented in NeMo Curator, NVIDIA's open-source GPU-accelerated data-curation library ^[4]^[12]. NeMo Curator provides distributed implementations of HTML extraction, language identification, exact and fuzzy deduplication, heuristic filtering, classifier-based filtering, perplexity scoring, and synthetic data generation. NVIDIA has published the rephrasing prompts, classifier checkpoints, and pipeline configurations used for Nemotron-CC in the NeMo Curator GitHub repository, allowing third parties to reproduce the construction or apply the same recipe to non-English crawls or domain-specific corpora ^[4]^[12].

The dataset itself is hosted on the Common Crawl Foundation's data servers at data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/ and on Hugging Face under nvidia/Nemotron-CC and nvidia/Nemotron-CC-HQ ^[6]^[9]. The corpus inherits Common Crawl's terms of use; the synthetic component is released under a permissive license attached to NVIDIA's contributions. Quality metadata is included alongside text, allowing downstream users to filter or reweight buckets according to their training objectives.
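
How a downstream user might filter or reweight by quality metadata is sketched below. The record layout is hypothetical: the field names `text` and `quality_tier`, the tier labels, and the JSON Lines encoding are assumptions for illustration and should be checked against the released files before use.

```python
import json
import random


def parse_jsonl(lines):
    """Parse JSON Lines input, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def filter_by_tier(records, allowed_tiers):
    """Keep only records whose (hypothetical) quality_tier field is in allowed_tiers."""
    return [r for r in records if r.get("quality_tier") in allowed_tiers]


def reweight(records, keep_prob, seed=0):
    """Downsample each tier by keeping a record with probability keep_prob[tier]."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return [
        r for r in records
        if rng.random() < keep_prob.get(r.get("quality_tier"), 0.0)
    ]


if __name__ == "__main__":
    sample = [
        '{"text": "A clear explanation of photosynthesis.", "quality_tier": "high"}',
        '{"text": "buy cheap widgets click here", "quality_tier": "low"}',
        '{"text": "Forum thread about bicycle repair.", "quality_tier": "medium"}',
    ]
    records = parse_jsonl(sample)
    print(len(filter_by_tier(records, {"high", "medium"})))
    print(len(reweight(records, {"high": 1.0, "medium": 0.5, "low": 0.1})))
```

Hard filtering suits short training runs that can afford to see only the best material, while probabilistic reweighting keeps some lower-tier text in the mix for longer horizons where unique tokens matter more.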

How has Nemotron-CC been used and extended?

Nemotron-CC was received as a significant contribution to the open pretraining-data ecosystem on release. Practitioners on Hugging Face and in academic venues highlighted three aspects: the explicit demonstration that synthetic rephrasing can substitute for discarded low-quality text without degrading downstream accuracy, the ensemble-of-classifiers approach as an alternative to single-classifier pipelines such as DCLM, and the unusually transparent reporting of per-bucket token counts and per-benchmark scores. Within months of release, Nemotron-CC and its HQ subset appeared as components in pretraining mixtures used by several open-model efforts, and the corpus was used as a baseline for subsequent dataset releases.

NVIDIA itself extended the recipe in 2025 with two follow-ups. Nemotron-CC-Math, introduced in an arXiv preprint dated August 2025, applied a similar pipeline to mathematics-bearing web pages, using layout-aware rendering with the Lynx browser and an LLM-based cleaning stage to preserve MathJax, KaTeX, and MathML equation content ^[7]. The math corpus is released in two variants, Nemotron-CC-Math-4+ at 52.3 billion tokens (top quality scores 4 to 5) and Nemotron-CC-Math-3+ at 133.3 billion tokens (scores 3 to 5), under an Apache 2.0 license ^[7]. When used in pretraining a Nemotron-T 8B model, the math corpus produced reported gains of +4.8 to +12.6 points on the MATH benchmark and +4.6 to +14.3 points on MBPP+ over strong baselines, and was reported to outperform FineMath, MegaMath, and OpenWebMath on math reasoning evaluations ^[7].

NVIDIA also released Nemotron-CC-v2 and a broader Nemotron pretraining dataset family in 2025 as part of the Nemotron Nano 2 model release. Nemotron-CC-v2 expanded the original recipe with additional Common Crawl snapshots, refined classifiers, and additional synthetic prompt templates, and was used together with code, math, and multilingual mixes to train the Nemotron Nano 2 hybrid Mamba-Transformer reasoning model ^[8].

Limitations and criticism

Several limitations of Nemotron-CC have been noted. The synthetic component is generated by a single instruction-tuned model, Mistral-NeMo 12B Instruct, which raises questions about model-induced bias propagating into pretraining data and about possible benchmark contamination if the teacher model was itself exposed to evaluation material ^[1]. The authors discuss decontamination procedures applied to the synthetic outputs, but third-party audits at the scale of 1.9 trillion synthetic tokens are difficult.

The corpus is English-only, which restricts its direct utility for multilingual model training, although the same NeMo Curator recipe has been demonstrated on non-English subsets. The use of Common Crawl as the sole source means that Nemotron-CC inherits known issues of web data, including over-representation of certain genres, copyrighted content of ambiguous fair-use status, and the long tail of low-quality SEO and machine-generated text that even ensemble classifiers cannot fully filter. The dataset's license is permissive but defers to Common Crawl's terms, which some downstream users interpret cautiously for commercial deployment.

Finally, the long-horizon 15T-token comparison against Llama 3.1 8B holds the model architecture roughly constant but not the mixture: Llama 3.1's training mix is proprietary, so the comparison is suggestive rather than fully controlled. The authors are explicit that the 0.5-point average gain across ten benchmarks is small, and the principal claim is the substantial improvement on knowledge-heavy tasks such as MMLU rather than uniform dominance ^[1].

References

1. Su, D., Kong, K., Lin, Y., Jennings, J., Norick, B., Kliegl, M., Patwary, M., Shoeybi, M., and Catanzaro, B. (2024). "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset." arXiv:2412.02595. https://arxiv.org/abs/2412.02595
2. Su, D. et al. (2025). "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Volume 1: Long Papers, pages 2459-2475. https://aclanthology.org/2025.acl-long.123/
3. NVIDIA Technical Blog. (2025). "Announcing Nemotron-CC: A Trillion-Token English Language Dataset for LLM Pretraining." https://developer.nvidia.com/blog/announcing-nemotron-cc-a-trillion-token-english-language-dataset-for-llm-pretraining/
4. NVIDIA Technical Blog. (2025). "Building Nemotron-CC, A High-Quality Trillion Token Dataset for LLM Pretraining from Common Crawl Using NVIDIA NeMo Curator." https://developer.nvidia.com/blog/building-nemotron-cc-a-high-quality-trillion-token-dataset-for-llm-pretraining-from-common-crawl-using-nvidia-nemo-curator/
5. NVIDIA Applied Deep Learning Research. "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset." https://research.nvidia.com/labs/adlr/Nemotron-CC/
6. Common Crawl Foundation. "Nemotron-CC Dataset Index." https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html
7. NVIDIA. (2025). "Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset." arXiv:2508.15096. https://arxiv.org/abs/2508.15096
8. NVIDIA. (2025). "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model." arXiv:2508.14444. https://arxiv.org/abs/2508.14444
9. Hugging Face. "nvidia/Nemotron-CC and nvidia/Nemotron-CC-HQ Datasets." https://huggingface.co/datasets/nvidia/Nemotron-CC
10. Li, J. et al. (2024). "DataComp-LM: In Search of the Next Generation of Training Sets for Language Models." arXiv:2406.11794. https://arxiv.org/abs/2406.11794
11. Penedo, G. et al. (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv:2406.17557. https://arxiv.org/abs/2406.17557
12. NVIDIA. "NeMo Curator: Scalable Data Curation for LLMs." https://github.com/NVIDIA/NeMo-Curator
