# Llama Nemotron

> Source: https://aiwiki.ai/wiki/llama_nemotron
> Updated: 2026-06-24
> Categories: Large Language Models, NVIDIA, Reasoning Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Llama Nemotron** is a family of open [reasoning](/wiki/reasoning_model) large language models built by [Nvidia](/wiki/nvidia) by post-training Meta's [Llama](/wiki/llama) models for math, coding, and agentic tasks. Its defining feature is a toggleable reasoning mode, controlled entirely through the system prompt, that lets a single model either answer directly or produce extended step-by-step chains of thought. Nvidia previewed the family at CES in January 2025 and launched the reasoning-enabled models at its GTC conference on March 18, 2025, releasing open weights, the open post-training dataset, and the techniques used to build them [1][8]. The technical report states that the models "are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference" [8]. The models ship in three sizes, Nano (8B), Super (49B), and Ultra (253B), and are distributed both as downloadable checkpoints on Hugging Face and as [NIM](/wiki/nvidia_nim) microservices [1][8].

The family belongs to Nvidia's broader [Nemotron](/wiki/nemotron_3) line of open models and was built using the same toolchain, including the [NeMo](/wiki/nvidia_nemo) framework and techniques related to those behind [Minitron](/wiki/minitron) pruning and distillation. Its public release coincided with intense industry interest in reasoning models following the appearance of [DeepSeek-R1](/wiki/deepseek_r1) in early 2025, and Nvidia presented Llama Nemotron as an openly licensed, inference-efficient alternative for developers building AI agents [9].

## When was Llama Nemotron announced?

Nvidia first introduced the Llama Nemotron name on January 6, 2025, during the Consumer Electronics Show (CES) in Las Vegas, where chief executive Jensen Huang outlined a push into agentic AI [3]. The CES announcement described a three-tier family, Nano, Super, and Ultra, built on Meta's Llama foundation models and aimed at enterprise tasks such as instruction following, function calling, chat, coding, and math [3]. At that stage the models were positioned as agentic building blocks rather than explicit reasoning systems, and the smallest tier was promoted for RTX AI PCs and workstations [3].

The reasoning-focused models were formally launched on March 18, 2025, at Nvidia's GTC conference [1]. This release added the "detailed thinking" reasoning toggle and published open weights for the Nano and Super tiers, along with post-training datasets and documentation of the methods [1][2]. Nvidia stated that its post-training refinement "boosts accuracy of the models by up to 20% compared with the base model and optimizes inference speed by 5x compared with other leading open reasoning models" [1]. The largest tier, Ultra, followed a few weeks later, with its open weights published in early April 2025 [10].

At the launch, Jensen Huang, founder and chief executive of Nvidia, said: "Reasoning and agentic AI adoption is incredible. NVIDIA's open reasoning models, software and tools give developers and enterprises everywhere the building blocks to create an accelerated agentic AI workforce." [1]

## What models are in the Llama Nemotron family?

The launched family consists of three tiers, each derived from a different Meta Llama checkpoint and tuned for a distinct deployment target. The Nano tier delivers the highest accuracy on PCs and edge devices, Super offers the best accuracy and highest throughput on a single GPU, and Ultra provides maximum agentic accuracy on multi-GPU servers [1].

| Tier | Base Llama model | Parameters | Context | Target deployment | Initial release |
|------|------------------|-----------|---------|-------------------|-----------------|
| Llama 3.1 Nemotron Nano | [Llama 3.1](/wiki/llama_3_1) 8B Instruct | 8B | 128K | PCs and edge devices | March 18, 2025 |
| Llama 3.3 Nemotron Super | [Llama 3.3](/wiki/llama_3_3) 70B Instruct | 49B | 128K | Single data-center GPU | March 18, 2025 |
| Llama 3.1 Nemotron Ultra | Llama 3.1 405B Instruct | 253B | 128K | Multi-GPU servers | April 7, 2025 |

Each model is a dense decoder-only Transformer; none uses a mixture-of-experts design, so all parameters are active on every forward pass [8]. The Nano model is a fine-tune that preserves the Llama 3.1 8B architecture, while the Super and Ultra models are substantially restructured through neural architecture search to shrink the parameter count relative to their 70B and 405B base models [8]. The Ultra model represents roughly a 62 percent parameter reduction versus the 405B baseline, and Nvidia designed it to fit on a single eight-GPU node of H100 80GB accelerators for inference [6][8].

## How is Llama Nemotron post-trained?

Llama Nemotron is the product of a multi-stage post-training pipeline applied to off-the-shelf Llama weights, described in Nvidia's technical report "Llama-Nemotron: Efficient Reasoning Models" (Bercovich et al., 2025) [8]. The pipeline combines architecture optimization with reasoning-focused training.

For the Super and Ultra tiers, Nvidia first applies neural architecture search using a system it calls Puzzle, which produces non-uniform, non-repetitive block structures [8]. Some attention blocks are skipped entirely or replaced with a single linear layer, and feed-forward network (FFN) layers are given variable expansion and compression ratios across the model. A technique called FFN Fusion merges consecutive layers whose attention has been removed into fewer, wider FFN layers, which reduces sequential depth and improves latency [8]. These structural changes lower parameter count and inference cost while a block-wise knowledge distillation step transfers capability from the larger reference model. For the Ultra tier, this stage used roughly 65 billion tokens of knowledge distillation followed by about 88 billion tokens of continued pretraining; for the Super tier, distillation used about 40 billion tokens drawn from corpora including FineWeb, Buzz-V1.2, and Dolma [8].

After the architecture is fixed, the models undergo reasoning-focused supervised fine-tuning on large volumes of synthetic data spanning math, code, science, instruction following, chat, safety, and tool calling [8]. A final stage applies large-scale reinforcement learning, including Group Relative Policy Optimization (GRPO) for reasoning, chat, and instruction following, together with reinforcement learning from human feedback for alignment. Training was performed on Nvidia DGX Cloud [8].

## How does the toggleable reasoning mode work?

The defining feature of Llama Nemotron is its dynamic reasoning toggle. Nvidia describes the models as the first open-source models to support switching between standard chat and extended reasoning during inference without loading a separate model [8]. Control is exercised entirely through the system prompt: setting the system message to "detailed thinking on" causes the model to generate long internal reasoning traces before its answer, while "detailed thinking off" produces a direct response in the manner of a conventional instruction-tuned model [2][8]. According to the technical report, the two behaviors were trained simultaneously and differ only by this system prompt, so the same checkpoint serves both modes [8].

The reasoning and non-reasoning modes call for different decoding settings. Nvidia recommends temperature 0.6 with top-p 0.95 when reasoning is enabled and greedy decoding when it is disabled [4][5]. The mechanism lets developers trade latency and cost against accuracy on a per-request basis, using fast direct answers for simple queries and full reasoning for harder problems.

## Is Llama Nemotron open source?

Nvidia released the model weights under the commercially permissive NVIDIA Open Model License, in combination with the relevant Llama Community License Agreement [4][8]. Beyond the weights, the company published the post-training corpus as the Llama-Nemotron-Post-Training-Dataset on Hugging Face, comprising roughly 30 million or more samples across math, code, science, instruction following, chat, and safety [7]. The dataset is dominated by mathematics and code, and its responses were synthetically generated by a mixture of open models, including DeepSeek-R1, several Qwen 2.5 variants, and Llama models, with the bulk released under the CC-BY-4.0 license [7]. Nvidia also released the HelpSteer preference data used for alignment and pointed to the open-source training stack, including NeMo, NeMo-Aligner, and Megatron-LM [8]. The technical report states that "we release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset" alongside the training codebases [8]. This combination of open weights, open data, and documented methods was intended to let enterprises reproduce or customize their own reasoning models.

## How does Llama Nemotron perform on benchmarks?

Nvidia reported that the models achieve leading accuracy among open models on standard reasoning, math, science, and agentic benchmarks, with the largest gains visible when reasoning is enabled [1][8]. The tables below give published pass@1 scores from the model cards, illustrating the effect of the reasoning toggle.

Llama 3.1 Nemotron Nano (8B) [4]:

| Benchmark | Reasoning off | Reasoning on |
|-----------|---------------|--------------|
| MATH500 | 36.6 | 95.4 |
| AIME 2025 | 0.0 | 47.1 |
| GPQA Diamond | 39.4 | 54.1 |
| MBPP (0-shot) | 66.1 | 84.6 |

Llama 3.3 Nemotron Super (49B) [5]:

| Benchmark | Reasoning off | Reasoning on |
|-----------|---------------|--------------|
| MATH500 | 74.0 | 96.6 |
| AIME 2025 | 13.33 | 58.4 |
| GPQA Diamond | 50.0 | 66.67 |
| Arena Hard | 88.3 | (not reported) |
| BFCL V2 Live | 73.7 | (not reported) |

Llama 3.1 Nemotron Ultra (253B) [6]:

| Benchmark | Reasoning off | Reasoning on |
|-----------|---------------|--------------|
| GPQA Diamond | 56.60 | 76.01 |
| AIME 2025 | 16.67 | 72.50 |
| MATH500 | 80.40 | 97.00 |
| LiveCodeBench | 29.03 | 66.31 |
| IFEval (instruction, strict) | 88.85 | 89.45 |

Nvidia positioned the Ultra model as competitive with, and on several benchmarks ahead of, much larger open reasoning models, claiming roughly four times the inference throughput of DeepSeek-R1's 671-billion-parameter mixture-of-experts model while using fewer than half as many parameters [10]. The technical report concludes that the family "performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency" [8].

## How is Llama Nemotron packaged as NIM microservices?

In addition to raw checkpoints, the Llama Nemotron models are distributed as NVIDIA NIM (Nvidia Inference Microservices) containers, part of the Nvidia AI Enterprise software platform [1]. NIM packages each model as a prebuilt, GPU-optimized inference service with a standard API, allowing deployment across clouds, data centers, workstations, and PCs [1]. The Nano tier was offered as a NIM microservice suitable for RTX-class PCs and workstations, while Super and Ultra target data-center GPUs. The Nano and Super models and NIM microservices were available at launch as a hosted API from build.nvidia.com and Hugging Face, free to members of the NVIDIA Developer Program for development, testing, and research [1]. Nvidia listed enterprise and partner adopters of the family including Accenture, Amdocs, Atlassian, Box, Cadence, CrowdStrike, Deloitte, IQVIA, Microsoft, SAP, and ServiceNow [1].

## Why does Llama Nemotron matter?

Llama Nemotron illustrated several themes in the 2025 generative-AI landscape. It showed that reasoning behavior could be added to existing open base models through post-training rather than trained from scratch, and that a single model could expose reasoning as a switchable feature rather than a fixed property [8]. By coupling neural architecture search and distillation with reasoning-focused fine-tuning, Nvidia compressed very large Llama checkpoints into more deployable sizes while raising benchmark accuracy, and by releasing weights, datasets, and methods under permissive licenses it gave the open-source community a reproducible recipe [8]. The family also served Nvidia's commercial strategy: by making efficient open reasoning models that run well on its hardware and ship as NIM microservices, the company encouraged enterprises to build agentic applications on its software and accelerated computing stack [1]. Subsequent Nemotron releases continued the approach, extending it toward multimodal and newer-architecture open models.

## References

[1] NVIDIA Newsroom, "NVIDIA Launches Family of Open Reasoning AI Models for Developers and Enterprises to Build Agentic AI Platforms," March 18, 2025. https://nvidianews.nvidia.com/news/nvidia-launches-family-of-open-reasoning-ai-models-for-developers-and-enterprises-to-build-agentic-ai-platforms

[2] NVIDIA Technical Blog, "Build Enterprise AI Agents with Advanced Open NVIDIA Llama Nemotron Reasoning Models," March 18, 2025. https://developer.nvidia.com/blog/build-enterprise-ai-agents-with-advanced-open-nvidia-llama-nemotron-reasoning-models/

[3] NVIDIA Blog, "NVIDIA Announces Nemotron Model Families to Advance Agentic AI," January 6, 2025. https://blogs.nvidia.com/blog/nemotron-model-families/

[4] Hugging Face, "nvidia/Llama-3.1-Nemotron-Nano-8B-v1" model card. https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1

[5] Hugging Face, "nvidia/Llama-3_3-Nemotron-Super-49B-v1" model card. https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1

[6] Hugging Face, "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1" model card. https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1

[7] Hugging Face, "nvidia/Llama-Nemotron-Post-Training-Dataset." https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset

[8] A. Bercovich et al., "Llama-Nemotron: Efficient Reasoning Models," arXiv:2505.00949, 2025. https://arxiv.org/abs/2505.00949

[9] VentureBeat, "Nvidia debuts Llama Nemotron open reasoning models in a bid to advance agentic AI," March 18, 2025. https://venturebeat.com/ai/nvidia-debuts-llama-nemotron-open-reasoning-models-in-a-bid-to-advance-agentic-ai

[10] VentureBeat, "Nvidia's new Llama-3.1 Nemotron Ultra outperforms DeepSeek R1 at half the size," April 8, 2025. https://venturebeat.com/ai/nvidias-new-llama-3-1-nemotron-ultra-outperforms-deepseek-r1-at-half-the-size

