Llama Nemotron

Large Language Models NVIDIA Reasoning Models

10 min read

Updated Jul 6, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 6, 2026

Fact-checked

In review queue

Sources

10 citations

Revision

v3 · 1,995 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Llama Nemotron is a family of open reasoning large language models built by Nvidia by post-training Meta's Llama models for math, coding, and agentic tasks. Its defining feature is a toggleable reasoning mode, controlled entirely through the system prompt, that lets a single model either answer directly or produce extended step-by-step chains of thought. Nvidia previewed the family at CES in January 2025 and launched the reasoning-enabled models at its GTC conference on March 18, 2025, releasing open weights, the open post-training dataset, and the techniques used to build them ^[1]^[8]. The technical report states that the models "are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference" ^[8]. The models ship in three sizes, Nano (8B), Super (49B), and Ultra (253B), and are distributed both as downloadable checkpoints on Hugging Face and as NIM microservices ^[1]^[8].

The family belongs to Nvidia's broader Nemotron line of open models and was built using the same toolchain, including the NeMo framework and techniques related to those behind Minitron pruning and distillation. Its public release coincided with intense industry interest in reasoning models following the appearance of DeepSeek-R1 in early 2025, and Nvidia presented Llama Nemotron as an openly licensed, inference-efficient alternative for developers building AI agents ^[9].

When was Llama Nemotron announced?

Nvidia first introduced the Llama Nemotron name on January 6, 2025, during the Consumer Electronics Show (CES) in Las Vegas, where chief executive Jensen Huang outlined a push into agentic AI ^[3]. The CES announcement described a three-tier family, Nano, Super, and Ultra, built on Meta's Llama foundation models and aimed at enterprise tasks such as instruction following, function calling, chat, coding, and math ^[3]. At that stage the models were positioned as agentic building blocks rather than explicit reasoning systems, and the smallest tier was promoted for RTX AI PCs and workstations ^[3].

The reasoning-focused models were formally launched on March 18, 2025, at Nvidia's GTC conference ^[1]. This release added the "detailed thinking" reasoning toggle and published open weights for the Nano and Super tiers, along with post-training datasets and documentation of the methods ^[1]^[2]. Nvidia stated that its post-training refinement "boosts accuracy of the models by up to 20% compared with the base model and optimizes inference speed by 5x compared with other leading open reasoning models" ^[1]. The largest tier, Ultra, followed a few weeks later, with its open weights published in early April 2025 ^[10].

At the launch, Jensen Huang, founder and chief executive of Nvidia, said: "Reasoning and agentic AI adoption is incredible. NVIDIA's open reasoning models, software and tools give developers and enterprises everywhere the building blocks to create an accelerated agentic AI workforce." ^[1]

What models are in the Llama Nemotron family?

The launched family consists of three tiers, each derived from a different Meta Llama checkpoint and tuned for a distinct deployment target. The Nano tier delivers the highest accuracy on PCs and edge devices, Super offers the best accuracy and highest throughput on a single GPU, and Ultra provides maximum agentic accuracy on multi-GPU servers ^[1].

Tier	Base Llama model	Parameters	Context	Target deployment	Initial release
Llama 3.1 Nemotron Nano	Llama 3.1 8B Instruct	8B	128K	PCs and edge devices	March 18, 2025
Llama 3.3 Nemotron Super	Llama 3.3 70B Instruct	49B	128K	Single data-center GPU	March 18, 2025
Llama 3.1 Nemotron Ultra	Llama 3.1 405B Instruct	253B	128K	Multi-GPU servers	April 7, 2025

Each model is a dense decoder-only Transformer; none uses a mixture-of-experts design, so all parameters are active on every forward pass ^[8]. The Nano model is a fine-tune that preserves the Llama 3.1 8B architecture, while the Super and Ultra models are substantially restructured through neural architecture search to shrink the parameter count relative to their 70B and 405B base models ^[8]. The Ultra model represents roughly a 62 percent parameter reduction versus the 405B baseline, and Nvidia designed it to fit on a single eight-GPU node of H100 80GB accelerators for inference ^[6]^[8].

How is Llama Nemotron post-trained?

Llama Nemotron is the product of a multi-stage post-training pipeline applied to off-the-shelf Llama weights, described in Nvidia's technical report "Llama-Nemotron: Efficient Reasoning Models" (Bercovich et al., 2025) ^[8]. The pipeline combines architecture optimization with reasoning-focused training.

For the Super and Ultra tiers, Nvidia first applies neural architecture search using a system it calls Puzzle, which produces non-uniform, non-repetitive block structures ^[8]. Some attention blocks are skipped entirely or replaced with a single linear layer, and feed-forward network (FFN) layers are given variable expansion and compression ratios across the model. A technique called FFN Fusion merges consecutive layers whose attention has been removed into fewer, wider FFN layers, which reduces sequential depth and improves latency ^[8]. These structural changes lower parameter count and inference cost while a block-wise knowledge distillation step transfers capability from the larger reference model. For the Ultra tier, this stage used roughly 65 billion tokens of knowledge distillation followed by about 88 billion tokens of continued pretraining; for the Super tier, distillation used about 40 billion tokens drawn from corpora including FineWeb, Buzz-V1.2, and Dolma ^[8].

After the architecture is fixed, the models undergo reasoning-focused supervised fine-tuning on large volumes of synthetic data spanning math, code, science, instruction following, chat, safety, and tool calling ^[8]. A final stage applies large-scale reinforcement learning, including Group Relative Policy Optimization (GRPO) for reasoning, chat, and instruction following, together with reinforcement learning from human feedback for alignment. Training was performed on Nvidia DGX Cloud ^[8].

How does the toggleable reasoning mode work?

The defining feature of Llama Nemotron is its dynamic reasoning toggle. Nvidia describes the models as the first open-source models to support switching between standard chat and extended reasoning during inference without loading a separate model ^[8]. Control is exercised entirely through the system prompt: setting the system message to "detailed thinking on" causes the model to generate long internal reasoning traces before its answer, while "detailed thinking off" produces a direct response in the manner of a conventional instruction-tuned model ^[2]^[8]. According to the technical report, the two behaviors were trained simultaneously and differ only by this system prompt, so the same checkpoint serves both modes ^[8].

The reasoning and non-reasoning modes call for different decoding settings. Nvidia recommends temperature 0.6 with top-p 0.95 when reasoning is enabled and greedy decoding when it is disabled ^[4]^[5]. The mechanism lets developers trade latency and cost against accuracy on a per-request basis, using fast direct answers for simple queries and full reasoning for harder problems.

Is Llama Nemotron open source?

Nvidia released the model weights under the commercially permissive NVIDIA Open Model License, in combination with the relevant Llama Community License Agreement ^[4]^[8]. Beyond the weights, the company published the post-training corpus as the Llama-Nemotron-Post-Training-Dataset on Hugging Face, comprising roughly 30 million or more samples across math, code, science, instruction following, chat, and safety ^[7]. The dataset is dominated by mathematics and code, and its responses were synthetically generated by a mixture of open models, including DeepSeek-R1, several Qwen 2.5 variants, and Llama models, with the bulk released under the CC-BY-4.0 license ^[7]. Nvidia also released the HelpSteer preference data used for alignment and pointed to the open-source training stack, including NeMo, NeMo-Aligner, and Megatron-LM ^[8]. The technical report states that "we release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset" alongside the training codebases ^[8]. This combination of open weights, open data, and documented methods was intended to let enterprises reproduce or customize their own reasoning models.

How does Llama Nemotron perform on benchmarks?

Nvidia reported that the models achieve leading accuracy among open models on standard reasoning, math, science, and agentic benchmarks, with the largest gains visible when reasoning is enabled ^[1]^[8]. The tables below give published pass@1 scores from the model cards, illustrating the effect of the reasoning toggle.

Llama 3.1 Nemotron Nano (8B) ^[4]:

Benchmark	Reasoning off	Reasoning on
MATH500	36.6	95.4
AIME 2025	0.0	47.1
GPQA Diamond	39.4	54.1
MBPP (0-shot)	66.1	84.6

Llama 3.3 Nemotron Super (49B) ^[5]:

Benchmark	Reasoning off	Reasoning on
MATH500	74.0	96.6
AIME 2025	13.33	58.4
GPQA Diamond	50.0	66.67
Arena Hard	88.3	(not reported)
BFCL V2 Live	73.7	(not reported)

Llama 3.1 Nemotron Ultra (253B) ^[6]:

Benchmark	Reasoning off	Reasoning on
GPQA Diamond	56.60	76.01
AIME 2025	16.67	72.50
MATH500	80.40	97.00
LiveCodeBench	29.03	66.31
IFEval (instruction, strict)	88.85	89.45

Nvidia positioned the Ultra model as competitive with, and on several benchmarks ahead of, much larger open reasoning models, claiming roughly four times the inference throughput of DeepSeek-R1's 671-billion-parameter mixture-of-experts model while using fewer than half as many parameters ^[10]. The technical report concludes that the family "performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency" ^[8].

How is Llama Nemotron packaged as NIM microservices?

In addition to raw checkpoints, the Llama Nemotron models are distributed as NVIDIA NIM (Nvidia Inference Microservices) containers, part of the Nvidia AI Enterprise software platform ^[1]. NIM packages each model as a prebuilt, GPU-optimized inference service with a standard API, allowing deployment across clouds, data centers, workstations, and PCs ^[1]. The Nano tier was offered as a NIM microservice suitable for RTX-class PCs and workstations, while Super and Ultra target data-center GPUs. The Nano and Super models and NIM microservices were available at launch as a hosted API from build.nvidia.com and Hugging Face, free to members of the NVIDIA Developer Program for development, testing, and research ^[1]. Nvidia listed enterprise and partner adopters of the family including Accenture, Amdocs, Atlassian, Box, Cadence, CrowdStrike, Deloitte, IQVIA, Microsoft, SAP, and ServiceNow ^[1].

Why does Llama Nemotron matter?

Llama Nemotron illustrated several themes in the 2025 generative-AI landscape. It showed that reasoning behavior could be added to existing open base models through post-training rather than trained from scratch, and that a single model could expose reasoning as a switchable feature rather than a fixed property ^[8]. By coupling neural architecture search and distillation with reasoning-focused fine-tuning, Nvidia compressed very large Llama checkpoints into more deployable sizes while raising benchmark accuracy, and by releasing weights, datasets, and methods under permissive licenses it gave the open-source community a reproducible recipe ^[8]. The family also served Nvidia's commercial strategy: by making efficient open reasoning models that run well on its hardware and ship as NIM microservices, the company encouraged enterprises to build agentic applications on its software and accelerated computing stack ^[1]. Subsequent Nemotron releases continued the approach, extending it toward multimodal and newer-architecture open models.

References

NVIDIA Newsroom, "NVIDIA Launches Family of Open Reasoning AI Models for Developers and Enterprises to Build Agentic AI Platforms," March 18, 2025. https://nvidianews.nvidia.com/news/nvidia-launches-family-of-open-reasoning-ai-models-for-developers-and-enterprises-to-build-agentic-ai-platforms ↩
NVIDIA Technical Blog, "Build Enterprise AI Agents with Advanced Open NVIDIA Llama Nemotron Reasoning Models," March 18, 2025. https://developer.nvidia.com/blog/build-enterprise-ai-agents-with-advanced-open-nvidia-llama-nemotron-reasoning-models/ ↩
NVIDIA Blog, "NVIDIA Announces Nemotron Model Families to Advance Agentic AI," January 6, 2025. https://blogs.nvidia.com/blog/nemotron-model-families/ ↩
Hugging Face, "nvidia/Llama-3.1-Nemotron-Nano-8B-v1" model card. https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1 ↩
Hugging Face, "nvidia/Llama-3_3-Nemotron-Super-49B-v1" model card. https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1 ↩
Hugging Face, "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1" model card. https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 ↩
Hugging Face, "nvidia/Llama-Nemotron-Post-Training-Dataset." https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset ↩
A. Bercovich et al., "Llama-Nemotron: Efficient Reasoning Models," arXiv:2505.00949, 2025. https://arxiv.org/abs/2505.00949 ↩
VentureBeat, "Nvidia debuts Llama Nemotron open reasoning models in a bid to advance agentic AI," March 18, 2025. https://venturebeat.com/ai/nvidia-debuts-llama-nemotron-open-reasoning-models-in-a-bid-to-advance-agentic-ai ↩
VentureBeat, "Nvidia's new Llama-3.1 Nemotron Ultra outperforms DeepSeek R1 at half the size," April 8, 2025. https://venturebeat.com/ai/nvidias-new-llama-3-1-nemotron-ultra-outperforms-deepseek-r1-at-half-the-size ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributors · full history

Suggest edit

What links here

ChipNeMo Jamba Reasoning 3B Llama-3.1-Nemotron-70B-Instruct Minitron NVIDIA Cosmos Reason NVLM Nemotron Nemotron Nano 2 ServiceNow

When was Llama Nemotron announced?

What models are in the Llama Nemotron family?

How is Llama Nemotron post-trained?

How does the toggleable reasoning mode work?

Is Llama Nemotron open source?

How does Llama Nemotron perform on benchmarks?

How is Llama Nemotron packaged as NIM microservices?

Why does Llama Nemotron matter?

References

Improve this article

Related Articles

OpenAI o1

OpenAI o3

DeepSeek-R1

OpenAI o-series

Test-time compute

GSM8K

What links here

Related Articles

OpenAI o1

OpenAI o3

DeepSeek-R1

OpenAI o-series

Test-time compute

GSM8K

What links here