Nemotron Nano 2
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,655 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,655 words
Add missing citations, update stale details, or suggest a clearer explanation.
Nemotron Nano 2 is a family of small, open-weight reasoning language models released by NVIDIA in August 2025 as part of its Nemotron line. The models use a hybrid architecture that replaces most of the self-attention layers found in a standard Transformer with Mamba-2 state space model layers, retaining only a handful of attention layers. This design is optimized for high-throughput generation of the long "thinking" traces that reasoning models produce, and NVIDIA reports up to roughly 6 times higher inference throughput than a similarly sized pure-Transformer model in reasoning workloads while matching or exceeding its accuracy.[1][2] Alongside the model weights, NVIDIA released most of the pre-training corpus, the Nemotron-Pre-Training-Dataset-v1, making Nemotron Nano 2 one of the more openly documented small reasoning model releases of 2025.[2][3]
"Nano" denotes the small and efficient tier of the Nemotron program, intended to run a full 128,000-token context window on a single mid-range data center GPU. The flagship aligned model, NVIDIA-Nemotron-Nano-9B-v2, has about 9 billion parameters and was compressed and distilled from a 12-billion-parameter base model, NVIDIA-Nemotron-Nano-12B-v2-Base.[1][4]
Nemotron Nano 2 was announced on August 18, 2025, with a companion technical report, "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model," posted to arXiv (2508.14444) and credited to a team of 215 NVIDIA authors.[1] The release comprised three checkpoints on Hugging Face: the aligned reasoning model NVIDIA-Nemotron-Nano-9B-v2, plus two base checkpoints, NVIDIA-Nemotron-Nano-9B-v2-Base and NVIDIA-Nemotron-Nano-12B-v2-Base.[2][4]
The family targets a specific deployment regime: agentic and reasoning tasks where a model must generate long chains of intermediate tokens before producing a final answer. Because the cost of generating those tokens dominates inference, NVIDIA prioritized decode-time throughput, using the Mamba-heavy hybrid backbone to lower the per-token compute and memory cost relative to attention.[1][2] The 9B model supports a maximum context length of 128,000 tokens and, after compression, can serve that full context on a single NVIDIA A10G GPU (22 GiB of memory) in bfloat16 precision.[2][4]
The models support toggleable reasoning. By default the aligned model emits an explicit reasoning trace, which a developer can disable with a control token, and it exposes a configurable "thinking budget" that caps how many internal reasoning tokens are spent before the model must answer.[4][5] Supported languages include English, German, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, and Chinese, among others.[4]
Nemotron is NVIDIA's family of open language models and associated datasets, spanning model lines such as Nemotron 3, Llama Nemotron (built on Meta's Llama base models), and the efficiency-focused research model Jet-Nemotron. The Nano 2 architecture is derived from Nemotron-H, an earlier NVIDIA hybrid Mamba-Transformer line that established the approach of substituting Mamba-2 layers for most attention layers.[1][2]
NVIDIA has positioned the Nemotron program around openness, releasing not only weights but also large training datasets and recipes. The Nano 2 release continued this by publishing the Nemotron-Pre-Training-Dataset-v1, a corpus that NVIDIA describes as a major portion of the data used to train the models, alongside post-training data.[2][3] The dataset builds on NVIDIA's earlier Nemotron-CC web-crawl curation work.
The Nano 2 models are built on the Nemotron-H hybrid design. In a conventional Transformer, every block uses self-attention; in the Nano 2 backbone, the great majority of those blocks are replaced with Mamba-2 state space model layers, which scale linearly with sequence length and maintain a fixed-size recurrent state rather than a growing key-value cache.[1][2] A small number of attention layers are kept to preserve the in-context retrieval behavior that pure state space models struggle with.
The 12-billion-parameter base model, Nemotron-Nano-12B-v2-Base, is reported to use 62 layers in total: 6 self-attention layers, 28 feed-forward (MLP) layers, and 28 Mamba-2 layers, with a hidden dimension of 5,120, a feed-forward intermediate dimension of 20,480, and grouped-query attention using 40 query heads and 8 key-value heads.[6] The base model was pre-trained on roughly 20 trillion tokens using an FP8 training recipe.[1][4]
To produce the deployable 9B model, NVIDIA applied the Minitron compression strategy, which combines structured pruning with knowledge distillation from the larger model. The resulting NVIDIA-Nemotron-Nano-9B-v2 keeps 56 of the original layers, including just 4 attention layers, with the embedding dimension pruned from 5,120 to 4,480 and the feed-forward intermediate dimension pruned from 20,480 to 15,680.[6] This compression is what allows the model to fit a 128K-token context on a single A10G GPU.[1][2]
| Attribute | Detail |
|---|---|
| Developer | NVIDIA |
| Announced | August 18, 2025[1] |
| Architecture | Hybrid Mamba-2 + Transformer attention (Nemotron-H lineage)[1][2] |
| Aligned model | NVIDIA-Nemotron-Nano-9B-v2 (~9B parameters)[4] |
| Base models | Nemotron-Nano-12B-v2-Base, Nemotron-Nano-9B-v2-Base[2] |
| Base pre-training | ~20 trillion tokens, FP8 precision[1][4] |
| 12B base layers | 62 total: 6 attention, 28 MLP, 28 Mamba-2; hidden 5,120; GQA 40 query / 8 KV heads[6] |
| 9B model layers | 56 total, including 4 attention; hidden 4,480[6] |
| Compression method | Minitron pruning and distillation[1] |
| Context length | 128,000 tokens[2][4] |
| Single-GPU target | 128K context on one NVIDIA A10G (22 GiB), bfloat16[2][4] |
| Reasoning control | Toggleable thinking, configurable thinking-token budget[4][5] |
| Throughput claim | Up to ~6x vs Qwen3-8B in reasoning settings (NVIDIA)[1][2] |
| Weights license | NVIDIA Open Model License Agreement[4] |
| Dataset license | Creative Commons Attribution 4.0 (CC BY 4.0)[3] |
The headline checkpoint, NVIDIA-Nemotron-Nano-9B-v2, is an aligned, post-trained reasoning model "ready for commercial use" under the NVIDIA Open Model License Agreement.[4] The two base checkpoints are provided for further fine-tuning: Nemotron-Nano-12B-v2-Base is the full pre-trained model before compression and alignment, and Nemotron-Nano-9B-v2-Base is the pruned base prior to alignment.[2][4]
A central feature of the release is the openness of the training data. NVIDIA published the Nemotron-Pre-Training-Dataset-v1, which it describes as comprising about 6.6 trillion tokens of web crawl, mathematics, code, supervised fine-tuning, and multilingual question-answer data.[2][3] The collection includes several named components: Nemotron-CC-v2 (a multilingual web-crawl dataset with synthetic question-answer rephrasings), Nemotron-CC-Math-v1 (about 133 billion tokens of mathematics-focused data), Nemotron-Pretraining-Code-v1 (a curated code corpus), and Nemotron-Pretraining-SFT-v1 (synthetic instruction-tuned data).[2][3] The dataset is distributed under the Creative Commons Attribution 4.0 (CC BY 4.0) license, while the model weights are governed by the separate NVIDIA Open Model License.[3][4]
NVIDIA reported data-quality gains from these components, for example improvements on the MATH benchmark of roughly +4.8 to +12.6 points over strong baselines from its math data, +4.6 to +14.3 points on MBPP+ for code generation, and about +10.0 on Global-MMLU from its multilingual question-answer data versus using multilingual Common Crawl alone.[2] These figures are NVIDIA's own ablation results.
NVIDIA presents Nemotron Nano 2 as achieving leading accuracy among open models of comparable size while running substantially faster. In the technical report, the 9B model is benchmarked primarily against Qwen3-8B, an open small model from Alibaba's Qwen3 family that NVIDIA identifies as the strongest comparably sized baseline.[1][2]
With reasoning enabled, NVIDIA's published scores for NVIDIA-Nemotron-Nano-9B-v2 include MATH500 at 97.8 percent, AIME 2025 at 72.1 percent, GPQA at 64.0 percent, LiveCodeBench at 71.1 percent, BFCL v3 (tool use) at 66.9 percent, IFEval (instruction strict) at 90.3 percent, and RULER at 128K context at 78.9 percent.[4] In head-to-head comparisons reported by NVIDIA, the model scored 56.67 versus 20.00 for Qwen3-8B on AIME 2024 (pass@32), 64.48 versus 59.61 on GPQA-Diamond, and 97.75 versus 96.3 on MATH-500.[6] NVIDIA characterizes the overall result as on-par or better accuracy than Qwen3-8B while delivering up to 6 times higher inference throughput in reasoning settings such as 8,000 input and 16,000 output tokens, with reported speedups varying roughly from 3x to 6x depending on the workload.[1][6] All of these are vendor-reported numbers and have not been independently audited here.
A distinguishing capability is fine-grained control over reasoning. The model defaults to producing an explicit reasoning trace, activated by a /think control token, and a /no_think token disables the trace for latency-sensitive or simple requests.[4][5] Developers can additionally set a thinking budget through a max_thinking_tokens parameter, capping the internal reasoning length before the model must commit to an answer. NVIDIA reports that this behavior was instilled during post-training by including a fraction of the data, about 5 percent, with deliberately truncated reasoning traces, teaching the model to produce a useful answer even when its thinking is cut short.[2][5]
Nemotron Nano 2 is notable on two fronts. Architecturally, it is one of the more prominent demonstrations that a hybrid Mamba-Transformer model can match strong Transformer reasoning models at the small-model scale while offering a large throughput advantage, an advantage that grows with the length of the reasoning traces these models generate. By keeping only a few attention layers, the design avoids the quadratic attention cost and the large key-value cache that constrain long-context decoding in conventional small models.[1][2]
On openness, the release goes further than most open-weight small models by publishing the majority of the pre-training corpus under a permissive license, in addition to the weights. This places Nemotron Nano 2 among the more reproducible small reasoning model efforts of 2025, comparable in spirit to other open small-model families such as Qwen3, Microsoft's Phi, and Google's Gemma, while distinguishing itself through the combination of a state-space hybrid backbone and open data.[2][3] The work also extended NVIDIA's Nemotron-H and Minitron research, and the hybrid Mamba-Transformer approach was carried forward into subsequent Nemotron releases.[1][6]