Nemotron 3
Last reviewed
May 16, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 · 4,268 words
Nemotron 3 is a family of open-weights large language models released by NVIDIA on December 15, 2025, consisting of three sparse mixture-of-experts variants named Nano, Super, and Ultra. The family uses a hybrid Mamba 2 and Transformer architecture combined with a novel LatentMoE expert design, and ships with 1 million token context windows. Nemotron 3 Nano launched at 30 billion total parameters with 3 billion active per token, while Nemotron 3 Super at roughly 100 billion total with 10 billion active and Nemotron 3 Ultra at roughly 500 billion total with 50 billion active were announced for release in the first half of 2026. The release also included a separate multimodal variant, Nemotron 3 Nano Omni, which adds vision and audio encoders to the same base.
A naming clarification matters here. NVIDIA has used the "Nemotron-3" label twice. The first use was Nemotron-3 8B, a dense 8 billion parameter chatbot model released in November 2023 for the NeMo framework. That earlier model is a separate product and is not the subject of this article. The Nemotron 3 family discussed here is the December 2025 series of open-source agentic AI models, which NVIDIA branded simply as "Nemotron 3" without any version suffix on the individual model names. NVIDIA's research site, press release, and technical report all use this convention. Reusing a version number across two unrelated product lines is unusual, but NVIDIA's official communications for the 2025 release apply the name consistently.
The family was positioned as NVIDIA's answer to the wave of open-weights reasoning and agent models that defined 2025, including Llama 3.1, Qwen 3, GPT-OSS, and DeepSeek V3. NVIDIA emphasized three things at launch: throughput efficiency on its own Blackwell hardware, openness of weights and training data, and suitability for multi-agent workflows. The Nano variant was released the same day as the announcement, with weights, a technical report, and complete training recipes published on Hugging Face and GitHub under the NVIDIA Open Model License. Super and Ultra were promised in the months following, with the technical report covering all three.
The Nemotron name traces back to November 2023, when NVIDIA released Nemotron-3 8B for enterprise chatbot and copilot development through the NeMo framework. That first Nemotron was a dense decoder-only Transformer aimed at customer service and internal assistant deployments rather than as a competitor to OpenAI or Anthropic frontier systems. It received modest attention and was followed in February 2024 by Nemotron-4 15B, a 15 billion parameter model trained on 8 trillion tokens.
The family's profile changed in June 2024 with the release of Nemotron-4 340B, a trio of Base, Instruct, and Reward models with 340 billion parameters each. Nemotron-4 340B was specifically positioned as a synthetic data generator for training other models, and NVIDIA disclosed that more than 98 percent of the alignment data for the family had itself been synthetically generated. The release fit into a broader pattern in 2024 where labs began publishing models intended as building blocks for downstream training rather than as products in their own right.
In August 2025 NVIDIA shifted architectural direction with Nemotron Nano 2, a hybrid Mamba-Transformer model trained on roughly 20 trillion tokens. Nemotron Nano 2 was the first member of the family to use state-space layers alongside attention, an approach NVIDIA had been exploring through internal research and academic collaborations. The hybrid design draws on the Mamba 2 architecture introduced by Albert Gu and Tri Dao in 2024, which uses structured state-space models with selective state updates to achieve linear-time sequence processing where standard attention scales quadratically. Nemotron Nano 2 served as the direct architectural ancestor of the December 2025 Nemotron 3 release.
Four months later, on December 15, 2025, NVIDIA debuted what it called the Nemotron 3 family. The launch was accompanied by a press release from NVIDIA's investor relations group, a research microsite at research.nvidia.com, an arXiv preprint, and a three-trillion-token data release. NVIDIA's framing emphasized that the models were built for agentic AI rather than chat, with explicit support for tool calling, structured output, and reasoning trace generation. Early enterprise partners named at launch included Accenture, Cadence, CrowdStrike, Cursor, Deloitte, EY, Oracle, Palantir, Perplexity, ServiceNow, Siemens, Synopsys, and Zoom.
The Nemotron 3 family at launch consisted of three text models plus one multimodal variant. The three sizes are differentiated by total parameter count, active parameter count, and intended use case rather than by architectural family. All three text models share the same hybrid Mamba-Transformer MoE skeleton and the same tokenizer.
| Model | Total parameters | Active parameters | Context window | Status at launch |
|---|---|---|---|---|
| Nemotron 3 Nano | 30B (31.6B with embeddings) | 3.5B (3.2B excluding embeddings) | 1M tokens | Released December 15, 2025 |
| Nemotron 3 Super | approximately 100B | approximately 10B | 1M tokens | Announced for H1 2026 release |
| Nemotron 3 Ultra | approximately 500B | approximately 50B | 1M tokens | Announced for H1 2026 release |
| Nemotron 3 Nano Omni | 30B with audio and vision encoders | 3B | not specified | Released December 15, 2025 |
Nemotron 3 Nano is the only text-only model that was available for download at the time of the announcement. It launched in two precision formats on Hugging Face: BF16 weights at NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 and FP8 weights at NVIDIA-Nemotron-3-Nano-30B-A3B-FP8. The "A3B" in the name indicates 3 billion active parameters per forward pass.
Nemotron 3 Super and Nemotron 3 Ultra were described in the technical report but the weights were not released in December 2025. NVIDIA committed to publishing both later, with the Super model targeted at high-volume agent workloads such as IT ticket automation and Ultra positioned as a state-of-the-art reasoning engine for complex planning and research. The two larger models also use the NVFP4 training format introduced with NVIDIA's Blackwell GPUs, which represents weights in 4-bit floating point during training rather than in a higher precision format that is later quantized.
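As a rough illustration of block-scaled 4-bit floating point, the sketch below quantizes a list of values block by block, snapping each scaled value to the nearest magnitude representable with one sign bit, two exponent bits, and one mantissa bit. The block size of 16, the helper names, and keeping each block scale as an ordinary Python float are simplifying assumptions made for illustration. This is not NVIDIA's training kernel, and the real format's reduced-precision scale storage is omitted.

```python
# Magnitudes representable by a 4-bit float with 2 exponent bits and 1 mantissa bit.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block to 4-bit values with a shared scale (hypothetical helper)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block), 1.0
    scale = max_abs / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    codes = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(nearest if x >= 0 else -nearest)
    return codes, scale

def fake_quantize(values, block_size=16):
    """Quantize then dequantize, block by block, to expose the rounding error."""
    out = []
    for start in range(0, len(values), block_size):
        codes, scale = quantize_block(values[start:start + block_size])
        out.extend(c * scale for c in codes)
    return out

weights = [0.01 * i for i in range(-16, 16)]
restored = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each block carries its own scale, one large value only coarsens the resolution of the other values in its block rather than the whole tensor.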
Nemotron 3 Nano Omni shares the 30B-A3B base with the text Nano model but adds a Parakeet audio encoder and a C-RADIOv4-H vision encoder, with 3D convolutions for capturing motion between video frames. The Omni variant targets document intelligence, video understanding, and multimodal agent reasoning. NVIDIA reported best-in-class results for the Omni model on MMLongBench-Doc, OCRBenchV2, WorldSense, DailyOmni, and VoiceBench, claiming up to 9.2 times greater effective system capacity for video reasoning compared with other open omni models at matched interactivity thresholds.
The core architectural choice across the Nemotron 3 family is a hybrid stack of state-space layers, attention layers, and mixture-of-experts feedforward layers. The text Nano variant is the most thoroughly documented version of this design and has been described in detail in both the Hugging Face model card and the technical report.
Nemotron 3 Nano has 52 layers in total. The composition includes 23 Mamba 2 layers handling most of the sequence mixing work, 6 attention layers using Grouped Query Attention with 2 groups, and 23 mixture-of-experts feedforward layers. Each MoE layer contains 128 routed experts plus 1 shared expert, with 6 experts activated per token. Total active parameters land at 3.5 billion out of 30 billion total, giving an activation ratio of roughly 12 percent. The model also includes Multi-Token Prediction layers to speed up generation, predicting several tokens ahead during inference rather than one at a time.
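The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; treating the 6 activated experts as a share of the 128 routed experts is an assumption about how the count is meant.

```python
# Layer counts for Nemotron 3 Nano as listed in the text.
layers = {"mamba2": 23, "attention": 6, "moe_ffn": 23}
total_layers = sum(layers.values())

# Parameter counts in billions, as quoted in the text.
total_params_b = 30.0
active_params_b = 3.5
activation_ratio = active_params_b / total_params_b

# Expert usage per MoE layer: 6 experts chosen out of 128 routed experts (assumed reading).
routed_experts, active_experts = 128, 6
routed_fraction = active_experts / routed_experts

print(total_layers)                 # 52
print(round(activation_ratio, 3))   # 0.117
print(round(routed_fraction, 3))    # 0.047
```

The parameter-level ratio (about 12 percent) is higher than the expert-level ratio (under 5 percent) because the Mamba, attention, embedding, and shared-expert parameters are used for every token.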
The hybrid approach traces to a research observation that pure attention models become inefficient at very long contexts because of the quadratic memory and compute cost of attention. State-space models like Mamba 2 process sequences in linear time and retain selective memory through structured updates, but they tend to underperform attention on tasks that require precise lookup over recent tokens. The Nemotron 3 family interleaves the two so that attention layers handle the precision-critical work while Mamba layers carry the sequence-mixing load. The same general pattern appears in other 2025 hybrid releases including Jamba 2 and Jet Nemotron, though the specific layer ratios and routing strategies differ.
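One way to see the long-context benefit is to estimate the key/value cache, which only attention layers need to keep and which grows with sequence length. The sketch below compares the 6 attention layers described above with a hypothetical stack in which all 29 sequence-mixing layers were attention. The head dimension of 128, the reading of "2 groups" as 2 key/value heads, and 2-byte cache values are assumptions, not figures from the model card.

```python
def kv_cache_gib(attn_layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Size of the key/value cache in GiB for one sequence."""
    # The leading factor of 2 covers keys and values.
    total_bytes = 2 * attn_layers * kv_heads * head_dim * tokens * bytes_per_value
    return total_bytes / 2**30

TOKENS = 1_000_000
HEAD_DIM = 128  # assumption: not stated in the text

# Hybrid stack: only the 6 attention layers hold a per-token cache.
hybrid = kv_cache_gib(attn_layers=6, kv_heads=2, head_dim=HEAD_DIM, tokens=TOKENS)

# Hypothetical comparison: the 23 Mamba 2 layers replaced by attention layers.
all_attention = kv_cache_gib(attn_layers=29, kv_heads=2, head_dim=HEAD_DIM, tokens=TOKENS)
```

Under these assumptions the hybrid cache is about 5.7 GiB at 1 million tokens, and the all-attention variant would need 29/6 (roughly 4.8) times as much.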
For the larger Super and Ultra variants, NVIDIA introduced a technique called LatentMoE, described in the research microsite as "a novel hardware-aware expert design for improved accuracy." LatentMoE projects expert computation into a lower-dimensional latent space before applying the routed experts, which reduces the parameter count required for a given level of capacity. The technical report frames this as an extension of the Multi-head Latent Attention idea popularized by DeepSeek, applied to feedforward rather than attention layers.
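The parameter saving can be illustrated with a simple count. The sketch below assumes one possible reading of the description: a shared projection into a latent space, experts that operate at the latent width, and a shared projection back out. The structure, function names, and all dimensions are illustrative assumptions; the technical report may define LatentMoE differently.

```python
def standard_moe_params(d_model, d_hidden, num_experts):
    """Each expert holds an up-projection and a down-projection at full model width."""
    return num_experts * 2 * d_model * d_hidden

def latent_moe_params(d_model, d_latent, d_hidden, num_experts):
    """Shared projections in and out of the latent space, plus experts at latent width."""
    shared = 2 * d_model * d_latent
    experts = num_experts * 2 * d_latent * d_hidden
    return shared + experts

# Illustrative sizes only; none of these come from the technical report.
D_MODEL, D_LATENT, D_HIDDEN, NUM_EXPERTS = 4096, 1024, 2048, 128

standard = standard_moe_params(D_MODEL, D_HIDDEN, NUM_EXPERTS)
latent = latent_moe_params(D_MODEL, D_LATENT, D_HIDDEN, NUM_EXPERTS)
ratio = latent / standard
```

With these made-up sizes the latent layout needs about a quarter of the parameters, since the per-expert cost scales with the latent width while the shared projections are paid only once.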
The tokenizer is shared across the family. Context length tops out at 1 million tokens, with a default deployment window of 256K tokens for most inference engines. NVIDIA reports retention of accuracy across context length using the RULER benchmark, with the Nano variant scoring 91.3 on RULER-100 at 512K tokens and 86.3 at the full 1 million token window. Supported languages are English, Spanish, French, German, Italian, and Japanese for natural language, plus 43 programming languages.
The Nano model is unified across reasoning and non-reasoning use cases. A single set of weights handles both modes, controlled by an enable_thinking flag in the chat template. With thinking enabled, the model emits a reasoning trace before its final answer; with thinking disabled, it responds directly. NVIDIA also exposes a budget control parameter for limiting the number of reasoning tokens generated, an interface choice that has become common across 2025 reasoning releases.
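For a client, the practical difference between the two modes is whether a reasoning trace precedes the answer. The following is a minimal sketch of separating the two, assuming the trace is wrapped in `<think>` and `</think>` tags. The delimiters and the helper name are assumptions for illustration, not details taken from the source.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Return (reasoning, answer); reasoning is empty when no trace is present."""
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1 or end < start:
        # No well-formed trace: treat the whole text as the answer.
        return "", text.strip()
    reasoning = text[start + len(open_tag):end].strip()
    answer = text[end + len(close_tag):].strip()
    return reasoning, answer

with_trace = "<think>2 + 2 is 4.</think>The answer is 4."
without_trace = "The answer is 4."

print(split_reasoning(with_trace))     # ('2 + 2 is 4.', 'The answer is 4.')
print(split_reasoning(without_trace))  # ('', 'The answer is 4.')
```

Handling both shapes in one function lets the same client code work whether thinking is enabled or disabled.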
The Nano variant was pre-trained on 25 trillion tokens, with the training distribution broken down across multiple categories. The total training mix listed on the Hugging Face model card sums to 13.34 trillion tokens, indicating that some sources contributed multiple epochs to the full 25 trillion token count. Synthetic data made up 3.53 trillion tokens of the mix, the largest single category. English Common Crawl contributed 3.46 trillion tokens, multilingual data 1.74 trillion, code 1.04 trillion, and STEM supervised fine-tuning data 359.8 billion. The pre-training cutoff date was June 25, 2025, with post-training data extending to November 28, 2025.
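The relationship between the 13.34 trillion token mix and the 25 trillion tokens seen during training can be worked through directly. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
# Token counts in trillions, as quoted from the model card.
listed_categories = {
    "synthetic": 3.53,
    "english_common_crawl": 3.46,
    "multilingual": 1.74,
    "code": 1.04,
    "stem_sft": 0.3598,
}
mix_total = 13.34
tokens_seen = 25.0

listed_sum = sum(listed_categories.values())
unlisted = mix_total - listed_sum          # categories not itemized in the text
average_epochs = tokens_seen / mix_total   # average passes over the mix
synthetic_share = listed_categories["synthetic"] / mix_total
```

The five itemized categories account for about 10.13 trillion tokens, leaving roughly 3.21 trillion in categories not broken out here. On average each token in the mix was seen about 1.87 times, and synthetic data is about 26 percent of the mix.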
NVIDIA released the underlying training data alongside the model. The main pretraining corpus is Nemotron-CC-v2.1, a 2.5 trillion token English web crawl derivative, paired with Nemotron-CC-Code-v1, a 428 billion token code corpus drawn primarily from GitHub. Additional sources include arXiv papers, Wikipedia and Wikimedia content, OpenStax textbooks, PubMed abstracts, NIH ExPorter records, and SEC EDGAR filings. NVIDIA also released three trillion tokens of new Nemotron pretraining, post-training, and reinforcement learning datasets as part of the launch package, along with the Nemotron Agentic Safety Dataset derived from real-world telemetry.
The training schedule used Warmup-Stable-Decay learning rate scheduling with an 8 billion token warmup, peak learning rate of 1e-3, and minimum learning rate of 1e-5. Training batch size was 3,072. The Super and Ultra variants were trained using NVFP4, a 4-bit floating point format introduced with NVIDIA's Blackwell GPU architecture. NVFP4 stores model weights and activations in 4-bit blocks during the training run itself, rather than in BF16 or FP8 and quantizing afterward, which reduces memory pressure and improves throughput on the supported hardware.
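A Warmup-Stable-Decay schedule with the quoted warmup length and learning rates can be sketched as a function of tokens processed. The linear warmup, the cosine decay shape, and the 20 percent decay fraction are assumptions chosen for illustration; the source states only the warmup length and the peak and minimum rates.

```python
import math

def wsd_learning_rate(tokens_seen, total_tokens, warmup_tokens=8e9,
                      peak_lr=1e-3, min_lr=1e-5, decay_fraction=0.2):
    """Warmup-Stable-Decay schedule indexed by tokens processed."""
    decay_start = total_tokens * (1.0 - decay_fraction)
    if tokens_seen < warmup_tokens:
        # Linear warmup from 0 to the peak rate (shape is an assumption).
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen < decay_start:
        # Stable phase: hold the peak rate.
        return peak_lr
    # Decay phase: cosine from peak to minimum (shape is an assumption).
    progress = min(1.0, (tokens_seen - decay_start) / (total_tokens - decay_start))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

TOTAL = 25e12  # 25 trillion training tokens, as quoted in the text
halfway_through_warmup = wsd_learning_rate(4e9, TOTAL)
stable_phase = wsd_learning_rate(1e12, TOTAL)
end_of_training = wsd_learning_rate(TOTAL, TOTAL)
```

Indexing the schedule by tokens rather than optimizer steps keeps it independent of the batch size and sequence length.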
Post-training was structured around multi-environment reinforcement learning, covering reasoning, tool calling, instruction following, and multilingual response quality. NVIDIA released the underlying RL libraries, called NeMo Gym and NeMo RL, as open-source projects on GitHub at the same time as the model launch. NeMo Evaluator, a companion library for validating model performance and safety, was also released. The post-training mix included synthetic code, math, and science data, along with tool calling traces and multilingual reasoning examples.
NVIDIA has not published a complete count of GPU hours used to train the family, but the Hugging Face model card and the technical report describe training infrastructure as Blackwell-based for the Super and Ultra runs. The Nano model can run inference on a wide range of NVIDIA hardware including A100, H100, B200, RTX PRO 6000, Jetson Thor, and DGX Spark systems.
NVIDIA published a substantial benchmark table for Nemotron 3 Nano at launch, focusing on the standard reasoning, coding, math, and agentic suites that have become the de facto comparison points for open-weights models. The numbers below come from the official Hugging Face model card.
| Benchmark | Category | Nemotron 3 Nano score |
|---|---|---|
| MMLU-Pro | General knowledge | 78.3 |
| AIME 2025 (with tools) | Math reasoning | 99.2 |
| LiveCodeBench v6 | Code generation | 68.3 |
| MiniF2F pass@32 | Formal math proofs | 79.9 |
| SWE-Bench | Software engineering | 38.8 |
| Arena-Hard-V2 average | Chat quality | 67.7 |
| RULER-100 @ 512K | Long context retention | 91.3 |
| RULER-100 @ 1M | Long context retention | 86.3 |
The AIME 2025 result of 99.2 percent is one of the highest published scores on that benchmark for any open or closed model, though the tools-enabled qualifier is important. AIME with tools allows the model to use a calculator or code interpreter for arithmetic, which removes a class of failure modes that would otherwise hit any language model on hard math contests. Most labs report both tool-enabled and tool-free numbers; the tool-free AIME score for Nemotron 3 Nano was not the headline figure in NVIDIA's marketing.
The LiveCodeBench v6 score of 68.3 ranks Nano ahead of Qwen3-30B-A3B-Thinking-2507 at 66.0 and GPT-OSS-20B at 61.0, according to comparisons published by NVIDIA and reproduced by third-party trackers. The Arena-Hard-V2 average of 67.7 also leads the same two models, which scored 57.8 and 48.5 respectively in NVIDIA's reported comparison. On MMLU-Pro, however, Nano's 78.3 trails Qwen3-30B's 80.9 by 2.6 points, a notable exception given its lead on most other benchmarks in the same comparison set.
The long-context numbers are particularly strong. RULER is a benchmark suite that tests retrieval, multi-hop reasoning, and aggregation across long contexts, with each task scored at multiple context lengths. Nemotron 3 Nano holds 91.3 percent at the 512K mark and 86.3 percent at the full 1 million token window, indicating that the hybrid Mamba-Transformer architecture maintains reasonable accuracy well past the point where pure attention models typically begin to degrade noticeably.
On throughput, NVIDIA reports that Nemotron 3 Nano runs 2.2 times faster than GPT-OSS-20B and 3.3 times faster than Qwen3-30B-A3B-Thinking-2507 on the same H200 hardware in an 8K input, 16K output configuration. Compared with its own predecessor, Nemotron Nano 2, NVIDIA reports 4 times higher token throughput and a 60 percent reduction in reasoning-token generation for equivalent task completion. The reasoning-token reduction is meaningful because it maps directly to inference cost in production deployments, where a model that produces fewer thinking tokens per task is cheaper to serve.
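To make the cost argument concrete, the sketch below combines the two reported gains over the predecessor into a relative generation time per task. It assumes the gains apply to the same workload and compose multiplicatively, and it treats all generated tokens as reasoning tokens. Both are simplifications, and NVIDIA does not publish a combined figure in this form.

```python
def relative_time_per_task(throughput_gain, token_reduction):
    """Generation time per task relative to the baseline model (1.0 means no change)."""
    tokens_ratio = 1.0 - token_reduction  # fraction of tokens still generated
    return tokens_ratio / throughput_gain

# Reported figures versus the predecessor: 4x throughput, 60 percent fewer reasoning tokens.
both_gains = relative_time_per_task(throughput_gain=4.0, token_reduction=0.60)

# Token reduction alone, at unchanged throughput.
tokens_only = relative_time_per_task(throughput_gain=1.0, token_reduction=0.60)
```

Under these assumptions the token reduction alone cuts generation time to 40 percent of the baseline, and combining it with the throughput gain brings it to about 10 percent.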
Nemotron 3 Nano Omni was reported separately on multimodal benchmarks. NVIDIA claimed best-in-class document intelligence results for the Omni model on MMLongBench-Doc and OCRBenchV2, leading scores on video and audio benchmarks including WorldSense, DailyOmni, and VoiceBench, and the 9.2x effective capacity figure for video reasoning relative to other open omni models. Detailed score tables for Omni were published alongside the announcement on the NVIDIA developer blog.
Nemotron 3 ships under the NVIDIA Nemotron Open Model License. This license is part of NVIDIA's broader Open Model License framework introduced earlier with Nemotron-4 340B and refined across subsequent releases. The license permits both commercial and non-commercial use of the model weights and derivatives, including for synthetic data generation, fine-tuning, and deployment of fine-tuned versions.
The license includes use restrictions that distinguish it from a pure permissive license such as MIT or Apache 2.0. NVIDIA retains the right to terminate the license if the model is used in ways that violate listed restrictions, which cover categories like critical infrastructure, weapons development, illegal activity, and certain regulated domains. Derivative models trained on outputs from Nemotron 3 are permitted but must maintain attribution and may not remove the license terms.
NVIDIA explicitly committed to releasing the training data, training code, evaluation harness, and reinforcement learning libraries alongside the weights. The release package therefore goes substantially beyond a typical "open weights" release. NeMo Gym, NeMo RL, and NeMo Evaluator are all published as separate open-source repositories on GitHub. The Nemotron-CC-v2.1 and Nemotron-CC-Code-v1 datasets are hosted on Hugging Face under their own data licenses, which generally follow the Creative Commons or similar permissive frameworks used by their underlying sources.
The model is listed as "Ready for commercial use" on the official Hugging Face card. Commercial deployment requires acceptance of the Nemotron Open Model License terms and adherence to NVIDIA's Trustworthy AI framework, which covers content filtering, bias mitigation, explainability requirements, and privacy considerations.
Nemotron 3 Nano competes most directly with the 20 to 30 billion parameter mixture-of-experts releases from other labs, including Qwen3-30B-A3B from Alibaba and GPT-OSS-20B from OpenAI, and is also commonly measured against the dense Llama 3.1 models from Meta. The comparison table below uses NVIDIA's reported numbers for Nemotron 3 Nano and the developers' own reported numbers for the other models. Cross-lab benchmark comparison is always imperfect because labs use different prompts and evaluation harnesses, so the table should be read as a rough side-by-side view rather than a controlled comparison.
| Model | Developer | Total / active params | Context | License | AIME 2025 | LiveCodeBench v6 |
|---|---|---|---|---|---|---|
| Nemotron 3 Nano | NVIDIA | 30B / 3.5B (MoE) | 1M | NVIDIA Open Model License | 99.2 (with tools) | 68.3 |
| Qwen3-30B-A3B | Alibaba | 30B / 3B (MoE) | 256K | Apache 2.0 | 70.9 | 66.0 |
| GPT-OSS-20B | OpenAI | 21B / 3.6B (MoE) | 128K | Apache 2.0 | not reported | 61.0 |
| Llama 3.1 70B | Meta | 70B (dense) | 128K | Llama 3.1 Community | not reported | not reported |
| Nemotron Nano 2 | NVIDIA | similar scale | 128K | NVIDIA Open Model License | trails Nemotron 3 Nano | trails Nemotron 3 Nano |
Versus Qwen3-30B-A3B, Nemotron 3 Nano has the same total parameter scale and a comparable active parameter count, but extends the context window from 256K to 1M and adds the hybrid Mamba-Transformer architecture. Qwen3 holds the lead on MMLU-Pro by 2.6 points but trails on AIME, LiveCodeBench, Arena-Hard, and throughput at NVIDIA's reported settings. The Apache 2.0 license on Qwen3 is more permissive than NVIDIA's Open Model License for some use cases, particularly research, where the use restrictions in NVIDIA's license could be a complication.
Versus GPT-OSS-20B, Nemotron 3 Nano is larger in total parameters, slightly lower in active parameters, and uses a less common architecture. OpenAI's open release is a mixture-of-experts Transformer with no state-space layers and a much shorter 128K context. NVIDIA's reported benchmarks place Nemotron 3 Nano ahead on every shared benchmark, with the throughput gap of 2.2x being the headline efficiency claim.
Versus Llama 3.1 from Meta, the comparison is harder to make cleanly because Llama 3.1 is a year older and was not designed for the agentic and reasoning evaluation suites that dominate late-2025 model marketing. The 70 billion parameter Llama 3.1 Instruct remains a baseline reference point for the open-weights field, and NVIDIA does not include it in its primary Nemotron 3 benchmark tables. On the kinds of tasks that Llama 3.1 was tuned for, such as general chat and instruction following, the gap between it and Nemotron 3 Nano is narrower than the headline math and code benchmarks would suggest.
Versus its own predecessor, Nemotron Nano 2, Nemotron 3 Nano is the headline efficiency story for the family. NVIDIA reports 4 times higher token throughput and 60 percent fewer reasoning tokens generated per task, achieved through a combination of the sparse mixture-of-experts design, the Mamba-Transformer ratio adjustments, and the move to FP8 inference precision by default. The context window extension from 128K to 1M tokens is a meaningful capability change, particularly for code agent applications where the model needs to hold large project contexts in working memory.
The Super and Ultra variants, when released, will compete in different weight classes. Super at roughly 100 billion total parameters falls into the same neighborhood as Llama 3.1 70B Instruct, Qwen2.5-72B, and Mistral Large 2. Ultra at roughly 500 billion total parameters competes with the frontier-scale open releases, particularly DeepSeek V3, Llama 3.1 405B, and the open variants of Mistral Large 3. Public benchmark numbers for Super and Ultra were limited at the December 2025 announcement; NVIDIA published preliminary internal results in the technical report but warned that final numbers might shift slightly with the full release.
Reception of Nemotron 3 Nano was broadly positive among open-source AI developers and frontier-model trackers, with a focus on three things: the throughput claims, the data release, and the unusually complete openness of the package.
HyperFRAME Research described the release as "a meaningful escalation in NVIDIA's open-source posture," noting that the company had moved from publishing models primarily as building blocks for synthetic data generation to publishing them as competitive end-user models. Several writers connected this shift to growing competitive pressure from Chinese open-weights labs, particularly Alibaba's Qwen team and DeepSeek, both of which had been publishing increasingly capable models on permissive licenses through 2024 and 2025.
The New Stack covered the launch under the headline "Nvidia Launches the Next Generation of Its Nemotron Models," emphasizing the data release and the agent-oriented framing rather than the raw benchmark numbers. Constellation Research's coverage was more strategic, describing Nemotron as "a much-needed open-source model champion in the US," against the backdrop of a perception that the leading open-weights frontier had shifted to Chinese labs over 2025.
Independent benchmark reproduction was generally consistent with NVIDIA's claims, though with the usual caveats. Medium technical reviewer Barnacle Goose wrote a detailed walk-through of the Nano model that confirmed the long-context retention claims on RULER and the throughput advantage over Qwen3 on H100 hardware, while noting that the SWE-Bench score of 38.8 lags behind specialized code models. LLM-stats.com published side-by-side comparison tables placing Nemotron 3 Nano ahead of Qwen3 on most agentic and math benchmarks but behind on raw MMLU-Pro and on some coding tasks.
The local-deployment community, which runs models on consumer or prosumer hardware, was particularly enthusiastic about the Nano variant. Compared with prior NVIDIA releases that targeted enterprise H100 deployments, Nano's 30B-A3B configuration runs well on the RTX PRO 6000 workstation cards and on DGX Spark units, both of which had been released earlier in 2025. The FP8 quantization variant in particular drew attention as fitting comfortably in 48 GB of VRAM with full 1M token context.
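A back-of-the-envelope estimate shows why the FP8 variant fits on a 48 GB card. The sketch below counts weight memory only, using the 31.6 billion total parameter figure (including embeddings) from the model table. It ignores activations, the key/value cache, and runtime overhead, so it gives a lower bound rather than a deployment guide.

```python
def weight_memory_gb(params_billions, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return params_billions * bytes_per_param

TOTAL_PARAMS_B = 31.6  # total parameters including embeddings, in billions

fp8_gb = weight_memory_gb(TOTAL_PARAMS_B, bytes_per_param=1)
bf16_gb = weight_memory_gb(TOTAL_PARAMS_B, bytes_per_param=2)

# Headroom left on a 48 GB card for the cache, activations, and overhead.
fp8_headroom_gb = 48 - fp8_gb
```

All experts must be resident in memory even though only a fraction are active per token, so the total parameter count, not the active count, determines the weight footprint. At FP8 that is about 31.6 GB, leaving roughly 16 GB of headroom, while BF16 weights alone (about 63.2 GB) would exceed 48 GB.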
The agentic-AI community focused on the tool calling and reasoning trace features. NVIDIA's bundled vLLM reasoning parser plugin for the Nano model became a reference implementation that other labs began copying for their own agentic releases. Several developer surveys in early 2026 listed Nemotron 3 Nano alongside Qwen3 and DeepSeek as the most-used open-weights models for agent prototyping work.
Critical reactions clustered around two concerns. The first was the NVIDIA Open Model License itself. While the license permits commercial use, the use restrictions and the absence of a clean MIT or Apache 2.0 grant raised concerns among some open-source advocates who argued that the restrictions made the model less "open" than the marketing implied. The second was the dependency on NVIDIA-specific tooling for optimal inference, with NVFP4 quantization on Super and Ultra effectively requiring Blackwell hardware to realize the full throughput claims.
NVIDIA's investor and enterprise framing also drew some commentary. Several writers noted that the Nemotron release was timed to coincide with broader NVIDIA messaging around agentic AI infrastructure, with the models functioning partly as a demonstration of the company's hardware capabilities and partly as standalone products. The investor relations press release on December 15, 2025 explicitly framed the release as part of NVIDIA's enterprise AI software strategy.
In the months following the launch, NVIDIA published incremental updates to the Nano model, including extended language support and improved tool calling. The Super and Ultra releases were tracked by the community against the announced H1 2026 timeline. As of May 2026 the Super model had been previewed but the Ultra model had not yet shipped publicly, though NVIDIA had repeatedly confirmed both were on the roadmap.