Nemotron Nano 2

AI Models Large Language Models Open Source AI

9 min read

Updated Jun 28, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 28, 2026

Fact-checked

In review queue

Sources

6 citations

Revision

v2 · 1,732 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Nemotron Nano 2 is a family of small, open-weight reasoning language models released by NVIDIA on August 18, 2025, built on a hybrid Mamba-2 state space model plus Transformer architecture that replaces most self-attention layers with Mamba-2 layers to generate long reasoning traces faster. Its flagship checkpoint, NVIDIA-Nemotron-Nano-9B-v2, runs a 128,000-token context window on a single mid-range data center GPU and, according to NVIDIA, delivers up to roughly 6 times higher inference throughput than the similarly sized Qwen3-8B while matching or exceeding its reasoning accuracy.^[1]^[2] NVIDIA released the model weights together with most of the pre-training corpus, the Nemotron-Pre-Training-Dataset-v1, making Nemotron Nano 2 one of the more openly documented small reasoning model releases of 2025.^[2]^[3]

"Nano" denotes the small and efficient tier of the Nemotron program, intended to run a full 128,000-token context window on a single mid-range data center GPU. The flagship aligned model, NVIDIA-Nemotron-Nano-9B-v2, has about 9 billion parameters and was compressed and distilled from a 12-billion-parameter base model, NVIDIA-Nemotron-Nano-12B-v2-Base.^[1]^[4]

What is Nemotron Nano 2?

Nemotron Nano 2 was announced on August 18, 2025, with a companion technical report, "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model," posted to arXiv (2508.14444) and credited to a team of 215 NVIDIA authors.^[1] The release comprised three checkpoints on Hugging Face: the aligned reasoning model NVIDIA-Nemotron-Nano-9B-v2, plus two base checkpoints, NVIDIA-Nemotron-Nano-9B-v2-Base and NVIDIA-Nemotron-Nano-12B-v2-Base.^[2]^[4]

The family targets a specific deployment regime: agentic and reasoning tasks where a model must generate long chains of intermediate tokens before producing a final answer. Because the cost of generating those tokens dominates inference, NVIDIA prioritized decode-time throughput, using the Mamba-heavy hybrid backbone to lower the per-token compute and memory cost relative to attention.^[1]^[2] The 9B model supports a maximum context length of 128,000 tokens and, after compression, can serve that full context on a single NVIDIA A10G GPU (22 GiB of memory) in bfloat16 precision.^[2]^[4]

The models support toggleable reasoning. By default the aligned model emits an explicit reasoning trace, which a developer can disable with a control token, and it exposes a configurable "thinking budget" that caps how many internal reasoning tokens are spent before the model must answer.^[4]^[5] Supported languages include English, German, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, and Chinese, among others.^[4]

Who developed Nemotron Nano 2?

Nemotron is NVIDIA's family of open language models and associated datasets, spanning model lines such as Nemotron 3, Llama Nemotron (built on Meta's Llama base models), and the efficiency-focused research model Jet-Nemotron. The Nano 2 architecture is derived from Nemotron-H, an earlier NVIDIA hybrid Mamba-Transformer line that established the approach of substituting Mamba-2 layers for most attention layers.^[1]^[2]

NVIDIA has positioned the Nemotron program around openness, releasing not only weights but also large training datasets and recipes. The Nano 2 release continued this by publishing the Nemotron-Pre-Training-Dataset-v1, a corpus that NVIDIA describes as a major portion of the data used to train the models, alongside post-training data.^[2]^[3] The dataset builds on NVIDIA's earlier Nemotron-CC web-crawl curation work.

What is the hybrid Mamba-Transformer architecture?

The Nano 2 models are built on the Nemotron-H hybrid design. In a conventional Transformer, every block uses self-attention; in the Nano 2 backbone, the great majority of those blocks are replaced with Mamba-2 state space model layers, which scale linearly with sequence length and maintain a fixed-size recurrent state rather than a growing key-value cache.^[1]^[2] A small number of attention layers are kept to preserve the in-context retrieval behavior that pure state space models struggle with.

The 12-billion-parameter base model, Nemotron-Nano-12B-v2-Base, is reported to use 62 layers in total: 6 self-attention layers, 28 feed-forward (MLP) layers, and 28 Mamba-2 layers, with a hidden dimension of 5,120, a feed-forward intermediate dimension of 20,480, and grouped-query attention using 40 query heads and 8 key-value heads.^[6] The base model was pre-trained on roughly 20 trillion tokens using an FP8 training recipe.^[1]^[4]

To produce the deployable 9B model, NVIDIA applied the Minitron compression strategy, which combines structured pruning with knowledge distillation from the larger model. The resulting NVIDIA-Nemotron-Nano-9B-v2 keeps 56 of the original layers, including just 4 attention layers, with the embedding dimension pruned from 5,120 to 4,480 and the feed-forward intermediate dimension pruned from 20,480 to 15,680.^[6] This compression is what allows the model to fit a 128K-token context on a single A10G GPU.^[1]^[2]

Specifications

Attribute	Detail
Developer	NVIDIA
Announced	August 18, 2025^[1]
Architecture	Hybrid Mamba-2 + Transformer attention (Nemotron-H lineage)^[1]^[2]
Aligned model	NVIDIA-Nemotron-Nano-9B-v2 (~9B parameters)^[4]
Base models	Nemotron-Nano-12B-v2-Base, Nemotron-Nano-9B-v2-Base^[2]
Base pre-training	~20 trillion tokens, FP8 precision^[1]^[4]
12B base layers	62 total: 6 attention, 28 MLP, 28 Mamba-2; hidden 5,120; GQA 40 query / 8 KV heads^[6]
9B model layers	56 total, including 4 attention; hidden 4,480^[6]
Compression method	Minitron pruning and distillation^[1]
Context length	128,000 tokens^[2]^[4]
Single-GPU target	128K context on one NVIDIA A10G (22 GiB), bfloat16^[2]^[4]
Reasoning control	Toggleable thinking, configurable thinking-token budget^[4]^[5]
Throughput claim	Up to ~6x vs Qwen3-8B in reasoning settings (NVIDIA)^[1]^[2]
Weights license	NVIDIA Open Model License Agreement^[4]
Dataset license	Creative Commons Attribution 4.0 (CC BY 4.0)^[3]

Is Nemotron Nano 2 open source, and what data was released?

The headline checkpoint, NVIDIA-Nemotron-Nano-9B-v2, is an aligned, post-trained reasoning model that NVIDIA states is "ready for commercial use" under the NVIDIA Open Model License Agreement.^[4] The two base checkpoints are provided for further fine-tuning: Nemotron-Nano-12B-v2-Base is the full pre-trained model before compression and alignment, and Nemotron-Nano-9B-v2-Base is the pruned base prior to alignment.^[2]^[4]

A central feature of the release is the openness of the training data. NVIDIA published the Nemotron-Pre-Training-Dataset-v1, which it describes as comprising about 6.6 trillion tokens of web crawl, mathematics, code, supervised fine-tuning, and multilingual question-answer data.^[2]^[3] The collection includes several named components: Nemotron-CC-v2 (a multilingual web-crawl dataset with synthetic question-answer rephrasings), Nemotron-CC-Math-v1 (about 133 billion tokens of mathematics-focused data), Nemotron-Pretraining-Code-v1 (a curated code corpus), and Nemotron-Pretraining-SFT-v1 (synthetic instruction-tuned data).^[2]^[3] The dataset is distributed under the Creative Commons Attribution 4.0 (CC BY 4.0) license, while the model weights are governed by the separate NVIDIA Open Model License.^[3]^[4]

NVIDIA reported data-quality gains from these components, for example improvements on the MATH benchmark of roughly +4.8 to +12.6 points over strong baselines from its math data, +4.6 to +14.3 points on MBPP+ for code generation, and about +10.0 on Global-MMLU from its multilingual question-answer data versus using multilingual Common Crawl alone.^[2] These figures are NVIDIA's own ablation results.

How fast is Nemotron Nano 2, and how accurate is it?

NVIDIA presents Nemotron Nano 2 as achieving leading accuracy among open models of comparable size while running substantially faster. In the technical report, the 9B model is benchmarked primarily against Qwen3-8B, an open small model from Alibaba's Qwen3 family that NVIDIA identifies as the strongest comparably sized baseline.^[1]^[2] In its own announcement, NVIDIA states that Nemotron-Nano-9B-v2 "achieves comparable or better accuracies on complex reasoning benchmarks than the leading comparably sized open model Qwen3-8B at up to 6x higher throughput."^[2]

With reasoning enabled, NVIDIA's published scores for NVIDIA-Nemotron-Nano-9B-v2 include MATH500 at 97.8 percent, AIME 2025 at 72.1 percent, GPQA at 64.0 percent, LiveCodeBench at 71.1 percent, BFCL v3 (tool use) at 66.9 percent, IFEval (instruction strict) at 90.3 percent, and RULER at 128K context at 78.9 percent.^[4] In head-to-head comparisons reported by NVIDIA, the model scored 56.67 versus 20.00 for Qwen3-8B on AIME 2024 (pass@32), 64.48 versus 59.61 on GPQA-Diamond, and 97.75 versus 96.3 on MATH-500.^[6] NVIDIA characterizes the overall result as on-par or better accuracy than Qwen3-8B while delivering up to 6 times higher inference throughput in reasoning settings such as 8,000 input and 16,000 output tokens, with reported speedups varying roughly from 3x to 6x depending on the workload.^[1]^[6] All of these are vendor-reported numbers and have not been independently audited here.

How does reasoning control work in Nemotron Nano 2?

A distinguishing capability is fine-grained control over reasoning. The model defaults to producing an explicit reasoning trace, activated by a /think control token, and a /no_think token disables the trace for latency-sensitive or simple requests.^[4]^[5] Developers can additionally set a thinking budget through a max_thinking_tokens parameter, capping the internal reasoning length before the model must commit to an answer; per the model card, "the thinking budget allows developers to keep accuracy high and meet response-time targets."^[4]^[5] NVIDIA reports that this behavior was instilled during post-training by including a fraction of the data, about 5 percent, with deliberately truncated reasoning traces, teaching the model to produce a useful answer even when its thinking is cut short.^[2]^[5]

Why does Nemotron Nano 2 matter?

Nemotron Nano 2 is notable on two fronts. Architecturally, it is one of the more prominent demonstrations that a hybrid Mamba-Transformer model can match strong Transformer reasoning models at the small-model scale while offering a large throughput advantage, an advantage that grows with the length of the reasoning traces these models generate. By keeping only a few attention layers, the design avoids the quadratic attention cost and the large key-value cache that constrain long-context decoding in conventional small models.^[1]^[2]

On openness, the release goes further than most open-weight small models by publishing the majority of the pre-training corpus under a permissive license, in addition to the weights. This places Nemotron Nano 2 among the more reproducible small reasoning model efforts of 2025, comparable in spirit to other open small-model families such as Qwen3, Microsoft's Phi, and Google's Gemma, while distinguishing itself through the combination of a state-space hybrid backbone and open data.^[2]^[3] The work also extended NVIDIA's Nemotron-H and Minitron research, and the hybrid Mamba-Transformer approach was carried forward into subsequent Nemotron releases.^[1]^[6]

References

NVIDIA. "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model." arXiv:2508.14444, August 2025. https://arxiv.org/abs/2508.14444 ↩
NVIDIA ADLR. "NVIDIA Nemotron Nano 2 and the Nemotron Pretraining Dataset v1." NVIDIA Research, August 18, 2025. https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/ ↩
NVIDIA. "Nemotron-Pre-Training-Dataset-v1." Hugging Face dataset collection, 2025. https://huggingface.co/collections/nvidia/nemotron-pre-training-dataset ↩
NVIDIA. "NVIDIA-Nemotron-Nano-9B-v2 (model card)." Hugging Face, 2025. https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2 ↩
NVIDIA. "Supercharge Edge AI With High-Accuracy Reasoning Using NVIDIA Nemotron Nano 2 9B." Hugging Face Blog, 2025. https://huggingface.co/blog/nvidia/supercharge-ai-reasoning-with-nemotron-nano-2 ↩
NVIDIA. "NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model" (HTML technical report). arXiv, 2025. https://arxiv.org/html/2508.14444v2 ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

Amazon Bedrock Nemotron Nemotron 3 Nemotron-CC Nemotron-H

What is Nemotron Nano 2?

Who developed Nemotron Nano 2?

What is the hybrid Mamba-Transformer architecture?

Specifications

Is Nemotron Nano 2 open source, and what data was released?

How fast is Nemotron Nano 2, and how accurate is it?

How does reasoning control work in Nemotron Nano 2?

Why does Nemotron Nano 2 matter?

References

Improve this article

Related Articles

Llama 3

OLMo

DeepSeek V4

Kimi K2

DeepSeek V3

Hunyuan

What links here

Related Articles

Llama 3

OLMo

DeepSeek V4

Kimi K2

DeepSeek V3

Hunyuan

What links here