Llama Nemotron
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,776 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,776 words
Add missing citations, update stale details, or suggest a clearer explanation.
Llama Nemotron is a family of open reasoning large language models developed by Nvidia by post-training Meta's Llama models. Positioned as a foundation for enterprise "agentic" AI, the family is distinguished by a toggleable reasoning mode, controlled through the system prompt, that lets a single model either answer directly or produce extended step-by-step chains of thought. Nvidia previewed the family at CES in January 2025 and launched the reasoning-enabled models at its GTC conference in March 2025, releasing open weights, the post-training datasets, and the techniques used to build them. The models are distributed both as downloadable checkpoints on Hugging Face and as NIM microservices.
The family belongs to Nvidia's broader Nemotron line of open models and was built using the same toolchain, including the NeMo framework and techniques related to those behind Minitron pruning and distillation. Its public release coincided with intense industry interest in reasoning models following the appearance of DeepSeek-R1 in early 2025, and Nvidia presented Llama Nemotron as an openly licensed, inference-efficient alternative for developers building AI agents.
Nvidia first introduced the Llama Nemotron name on January 6, 2025, during the Consumer Electronics Show (CES) in Las Vegas, where chief executive Jensen Huang outlined a push into agentic AI. The CES announcement described a three-tier family, Nano, Super, and Ultra, built on Meta's Llama foundation models and aimed at enterprise tasks such as instruction following, function calling, chat, coding, and math. At that stage the models were positioned as agentic building blocks rather than explicit reasoning systems, and the smallest tier was promoted for RTX AI PCs and workstations.
The reasoning-focused models were formally launched on March 18, 2025, at Nvidia's GTC conference. This release added the "detailed thinking" reasoning toggle and published open weights for the Nano and Super tiers, along with post-training datasets and documentation of the methods. Nvidia stated that its post-training refinement improved accuracy by up to 20 percent over the corresponding base Llama model and optimized inference throughput by up to 5 times relative to other leading open reasoning models. The largest tier, Ultra, followed a few weeks later, with its open weights published in early April 2025.
The launched family consists of three tiers, each derived from a different Meta Llama checkpoint and tuned for a distinct deployment target. The Nano tier is intended for PCs and edge devices, Super for a single data-center GPU, and Ultra for multi-GPU servers.
| Tier | Base Llama model | Parameters | Context | Target deployment | Initial release |
|---|---|---|---|---|---|
| Llama 3.1 Nemotron Nano | Llama 3.1 8B Instruct | 8B | 128K | PCs and edge devices | March 18, 2025 |
| Llama 3.3 Nemotron Super | Llama 3.3 70B Instruct | 49B | 128K | Single data-center GPU | March 18, 2025 |
| Llama 3.1 Nemotron Ultra | Llama 3.1 405B Instruct | 253B | 128K | Multi-GPU servers | April 7, 2025 |
Each model is a dense decoder-only Transformer; none uses a mixture-of-experts design, so all parameters are active on every forward pass. The Nano model is a fine-tune that preserves the Llama 3.1 8B architecture, while the Super and Ultra models are substantially restructured through neural architecture search to shrink the parameter count relative to their 70B and 405B base models. The Ultra model represents roughly a 62 percent parameter reduction versus the 405B baseline, and Nvidia designed it to fit on a single eight-GPU node of H100 80GB accelerators for inference.
Llama Nemotron is the product of a multi-stage post-training pipeline applied to off-the-shelf Llama weights, described in Nvidia's technical report "Llama-Nemotron: Efficient Reasoning Models" (Bercovich et al., 2025). The pipeline combines architecture optimization with reasoning-focused training.
For the Super and Ultra tiers, Nvidia first applies neural architecture search using a system it calls Puzzle, which produces non-uniform, non-repetitive block structures. Some attention blocks are skipped entirely or replaced with a single linear layer, and feed-forward network (FFN) layers are given variable expansion and compression ratios across the model. A technique called FFN Fusion merges consecutive layers whose attention has been removed into fewer, wider FFN layers, which reduces sequential depth and improves latency. These structural changes lower parameter count and inference cost while a block-wise knowledge distillation step transfers capability from the larger reference model. For the Ultra tier, this stage used roughly 65 billion tokens of knowledge distillation followed by about 88 billion tokens of continued pretraining; for the Super tier, distillation used about 40 billion tokens drawn from corpora including FineWeb, Buzz-V1.2, and Dolma.
After the architecture is fixed, the models undergo reasoning-focused supervised fine-tuning on large volumes of synthetic data spanning math, code, science, instruction following, chat, safety, and tool calling. A final stage applies large-scale reinforcement learning, including Group Relative Policy Optimization (GRPO) for reasoning, chat, and instruction following, together with reinforcement learning from human feedback for alignment. Training was performed on Nvidia DGX Cloud.
The defining feature of Llama Nemotron is its dynamic reasoning toggle. Nvidia describes the models as the first open-source models to support switching between standard chat and extended reasoning during inference without loading a separate model. Control is exercised entirely through the system prompt: setting the system message to "detailed thinking on" causes the model to generate long internal reasoning traces before its answer, while "detailed thinking off" produces a direct response in the manner of a conventional instruction-tuned model. According to the technical report, the two behaviors were trained simultaneously and differ only by this system prompt, so the same checkpoint serves both modes.
The reasoning and non-reasoning modes call for different decoding settings. Nvidia recommends temperature 0.6 with top-p 0.95 when reasoning is enabled and greedy decoding when it is disabled. The mechanism lets developers trade latency and cost against accuracy on a per-request basis, using fast direct answers for simple queries and full reasoning for harder problems.
Nvidia released the model weights under the commercially permissive NVIDIA Open Model License, in combination with the relevant Llama Community License Agreement. Beyond the weights, the company published the post-training corpus as the Llama-Nemotron-Post-Training-Dataset on Hugging Face, comprising roughly 30 million or more samples across math, code, science, instruction following, chat, and safety. The dataset is dominated by mathematics and code, and its responses were synthetically generated by a mixture of open models, including DeepSeek-R1, several Qwen 2.5 variants, and Llama models, with the bulk released under the CC-BY-4.0 license. Nvidia also released the HelpSteer preference data used for alignment and pointed to the open-source training stack, including NeMo, NeMo-Aligner, and Megatron-LM. This combination of open weights, open data, and documented methods was intended to let enterprises reproduce or customize their own reasoning models.
Nvidia reported that the models achieve leading accuracy among open models on standard reasoning, math, science, and agentic benchmarks, with the largest gains visible when reasoning is enabled. The tables below give published pass@1 scores from the model cards, illustrating the effect of the reasoning toggle.
Llama 3.1 Nemotron Nano (8B):
| Benchmark | Reasoning off | Reasoning on |
|---|---|---|
| MATH500 | 36.6 | 95.4 |
| AIME 2025 | 0.0 | 47.1 |
| GPQA Diamond | 39.4 | 54.1 |
| MBPP (0-shot) | 66.1 | 84.6 |
Llama 3.3 Nemotron Super (49B):
| Benchmark | Reasoning off | Reasoning on |
|---|---|---|
| MATH500 | 74.0 | 96.6 |
| AIME 2025 | 13.33 | 58.4 |
| GPQA Diamond | 50.0 | 66.67 |
| Arena Hard | 88.3 | (not reported) |
| BFCL V2 Live | 73.7 | (not reported) |
Llama 3.1 Nemotron Ultra (253B):
| Benchmark | Reasoning off | Reasoning on |
|---|---|---|
| GPQA Diamond | 56.60 | 76.01 |
| AIME 2025 | 16.67 | 72.50 |
| MATH500 | 80.40 | 97.00 |
| LiveCodeBench | 29.03 | 66.31 |
| IFEval (instruction, strict) | 88.85 | 89.45 |
Nvidia positioned the Ultra model as competitive with, and on several benchmarks ahead of, much larger open reasoning models, claiming roughly four times the inference throughput of DeepSeek-R1's 671-billion-parameter mixture-of-experts model while using fewer than half as many parameters. The technical report frames the family as performing competitively with state-of-the-art reasoning systems such as DeepSeek-R1 across reasoning and agentic evaluations.
In addition to raw checkpoints, the Llama Nemotron models are distributed as NVIDIA NIM (Nvidia Inference Microservices) containers, part of the Nvidia AI Enterprise software platform. NIM packages each model as a prebuilt, GPU-optimized inference service with a standard API, allowing deployment across clouds, data centers, workstations, and PCs. The Nano tier was offered as a NIM microservice suitable for RTX-class PCs and workstations, while Super and Ultra target data-center GPUs. Nvidia listed enterprise and partner adopters of the family including Accenture, Amdocs, Atlassian, Box, Cadence, CrowdStrike, Deloitte, IQVIA, Microsoft, SAP, and ServiceNow.
Llama Nemotron illustrated several themes in the 2025 generative-AI landscape. It showed that reasoning behavior could be added to existing open base models through post-training rather than trained from scratch, and that a single model could expose reasoning as a switchable feature rather than a fixed property. By coupling neural architecture search and distillation with reasoning-focused fine-tuning, Nvidia compressed very large Llama checkpoints into more deployable sizes while raising benchmark accuracy, and by releasing weights, datasets, and methods under permissive licenses it gave the open-source community a reproducible recipe. The family also served Nvidia's commercial strategy: by making efficient open reasoning models that run well on its hardware and ship as NIM microservices, the company encouraged enterprises to build agentic applications on its software and accelerated computing stack. Subsequent Nemotron releases continued the approach, extending it toward multimodal and newer-architecture open models.