# Nemotron 3

> Source: https://aiwiki.ai/wiki/nemotron_3
> Updated: 2026-07-14
> Categories: AI Models, Large Language Models, NVIDIA, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Nemotron 3** is a family of open-weights [large language model](/wiki/large_language_model) systems released by [NVIDIA](/wiki/nvidia) beginning on December 15, 2025, built for [agentic AI](/wiki/agentic_ai) and consisting of three sparse mixture-of-experts variants named Nano, Super, and Ultra. [1] The family uses a hybrid [Mamba 2](/wiki/mamba_2) and Transformer architecture combined with a LatentMoE expert design on the larger models, and ships with context windows of up to 1 million tokens. [1][2] Nemotron 3 Nano launched first, at 30 billion total parameters with up to 3 billion active per token. NVIDIA then released Nemotron 3 Super on March 11, 2026, a 120.6 billion total parameter model with 12.7 billion active (branded 120B-A12B), and Nemotron 3 Ultra on June 4, 2026, a 550 billion total parameter model with 55 billion active (branded 550B-A55B), completing the original three-tier plan. [23][27] NVIDIA reported that Nemotron 3 Nano delivers up to 4 times higher token throughput than its predecessor Nemotron 2 Nano while reducing reasoning-token generation by up to 60 percent. [1] A separate multimodal variant, Nemotron 3 Nano Omni, which adds vision and audio encoders to the same base, was released later, on April 28, 2026. [7][8]

At launch NVIDIA founder and chief executive Jensen Huang framed the family as an open platform, stating: "Open innovation is the foundation of AI progress. With Nemotron, we're transforming advanced AI into an open platform that gives developers the transparency and efficiency they need to build agentic systems at scale." [1] The accompanying white paper summarized the lineup as models that "deliver strong agentic, reasoning, and conversational capabilities" across the three sizes. [3]

A naming clarification matters here. NVIDIA has used the "Nemotron-3" label twice. The first use was Nemotron-3 8B, a dense 8 billion parameter family released in November 2023 for the NeMo framework, with a 4,096 token context window and training on 3.8 trillion tokens spanning 53 human languages and 37 programming languages. [21] That earlier model is a separate product and is not the subject of this article. The Nemotron 3 family discussed here is the December 2025 series of agentic [open-source AI](/wiki/open_source_ai) models, which NVIDIA branded simply as "Nemotron 3" without any version suffix on the individual model names. NVIDIA's research site, press release, and technical report all use this convention. [1][2][4] Reusing a version number across two unrelated product lines is unusual, but NVIDIA's official communications for the 2025 release apply the name consistently.

The family was positioned as NVIDIA's answer to the wave of open-weights reasoning and agent models that defined 2025, including [Llama 3.1](/wiki/llama_3_1), [Qwen 3](/wiki/qwen3), [GPT-OSS](/wiki/gpt_oss), and [DeepSeek V3](/wiki/deepseek_v3). NVIDIA emphasized three things at launch: throughput efficiency on its own Blackwell hardware, openness of weights and training data, and suitability for multi-agent workflows. [1][2] The Nano variant was released the same day as the announcement, with weights, a technical report, and complete training recipes published on Hugging Face and GitHub under the NVIDIA Open Model License. [4][5] Super and Ultra followed over the next six months, released on March 11 and June 4, 2026 respectively, each with its own technical report and open weights. [23][25][27][29]

## What is NVIDIA Nemotron 3?

Nemotron 3 is NVIDIA's December 2025 family of open-weights large language models designed for building agentic AI applications, released in Nano, Super, and Ultra sizes. [1] The models combine a Mixture-of-Experts feedforward design with a hybrid stack of Mamba 2 state-space layers and Transformer attention layers, an approach NVIDIA describes as delivering "best-in-class throughput while having better or on-par accuracy than standard Transformers." [2] NVIDIA released not only the model weights but also the training data, training recipes, evaluation harness, and reinforcement learning libraries, going substantially beyond a typical open-weights release. [3][9]

The defining capabilities are long context, throughput efficiency, and agent readiness. All three sizes support context windows of up to 1 million tokens, and every model is post-trained with multi-environment reinforcement learning for reasoning, multi-step tool use, and granular reasoning-budget control. [1][3] The smallest model, Nano, is positioned as outperforming comparable open models on accuracy while remaining highly cost-efficient for inference; Super is optimized for high-volume agent workloads such as IT ticket automation and software development; and Ultra is positioned as a state-of-the-art reasoning engine for complex planning and research. [3]

## Background: how did the Nemotron line evolve?

The Nemotron name traces back to November 2023, when NVIDIA released Nemotron-3 8B for enterprise chatbot and copilot development through the NeMo framework. [21] That first Nemotron was a dense decoder-only Transformer family (base, chat, and question-answering variants) aimed at customer service and internal assistant deployments rather than at competing with OpenAI or Anthropic frontier systems. The base model carried 8 billion parameters, a 4,096 token context window, and was trained on 3.8 trillion tokens covering 53 human languages and 37 programming languages. [21] It received modest attention and was followed in February 2024 by Nemotron-4 15B, a 15 billion parameter decoder-only Transformer trained on 8 trillion tokens and designed to fit on a single A100 or H100 GPU. [22]

The family's profile changed in June 2024 with the release of Nemotron-4 340B, a trio of Base, Instruct, and Reward models with 340 billion parameters each. Nemotron-4 340B was specifically positioned as a synthetic data generator for training other models, and NVIDIA disclosed that more than 98 percent of the alignment data for the family had itself been synthetically generated. The release fit a broader 2024 pattern in which labs published models intended as building blocks for downstream training rather than as products in their own right.

In August 2025 NVIDIA shifted architectural direction with [Nemotron Nano 2](/wiki/nemotron_nano_2), a hybrid Mamba-Transformer model trained on roughly 20 trillion tokens. Nemotron Nano 2 was the first member of the family to use state-space layers alongside attention, an approach NVIDIA had been exploring through internal research and academic collaborations. The hybrid design draws on the [Mamba 2](/wiki/mamba_2) architecture introduced by Albert Gu and Tri Dao in 2024, which uses structured state-space models with selective state updates to achieve linear-time sequence processing where standard attention scales quadratically. Nemotron Nano 2 served as the direct architectural ancestor of the December 2025 Nemotron 3 release.

Four months later, on December 15, 2025, NVIDIA debuted what it called the Nemotron 3 family. [1] The launch was accompanied by a press release from NVIDIA's investor relations group, a research microsite at research.nvidia.com, an arXiv preprint, and a 3-trillion-token data release. [1][2][3] NVIDIA's framing emphasized that the models were built for agentic AI rather than chat, with explicit support for tool calling, structured output, and reasoning trace generation. Early enterprise partners named at launch included Accenture, Cadence, CrowdStrike, Cursor, Deloitte, EY, Oracle, Palantir, Perplexity, ServiceNow, Siemens, Synopsys, and Zoom. [1]

## What models are in the Nemotron 3 family?

The Nemotron 3 family consists of three text models plus one multimodal variant. The three sizes are differentiated by total parameter count, active parameter count, and intended use case rather than by architectural family. All three text models share the same hybrid Mamba-Transformer MoE skeleton and the same tokenizer.

| Model | Total parameters | Active parameters | Context window | Release |
|---|---|---|---|---|
| Nemotron 3 Nano | 30B (31.6B with embeddings) | 3.5B (3.2B excluding embeddings) | 1M tokens | Released December 15, 2025 |
| Nemotron 3 Super | 120.6B | 12.7B | 1M tokens | Released March 11, 2026 |
| Nemotron 3 Ultra | 550B | 55B | 1M tokens | Released June 4, 2026 |
| Nemotron 3 Nano Omni | 30B with audio and vision encoders | 3B | 256K tokens | Released April 28, 2026 |

At the time of the December 2025 announcement, Nemotron 3 Nano was the only model in the family available for download. [1] It launched in two precision formats on Hugging Face: BF16 weights at NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 and FP8 weights at NVIDIA-Nemotron-3-Nano-30B-A3B-FP8. [5][6] The "A3B" suffix indicates roughly 3 billion active parameters per forward pass.

Nemotron 3 Super and Nemotron 3 Ultra were described at the December 2025 launch, but their weights were not released then. NVIDIA published Super on March 11, 2026 and Ultra on June 4, 2026, each with its own technical report and Hugging Face checkpoints, keeping to the "high-volume agent" and "state-of-the-art reasoning" positioning set out at launch. [23][25][27][29] Both larger models use the [NVFP4](/wiki/nvfp4) training format introduced with NVIDIA's [Blackwell](/wiki/blackwell) GPUs, which represents weights in 4-bit floating point during training rather than in a higher precision format that is later quantized. [3][25]
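To make the idea of a block-scaled 4-bit format more concrete, the following is a minimal sketch of quantizing values into E2M1 elements grouped in 16-element micro-blocks, the layout this article attributes to NVFP4 in its training section. It is an illustration only: the per-block scale is kept as an ordinary Python float (an assumption made for simplicity; the real format stores scales compactly in low precision), and the function names are hypothetical rather than part of any NVIDIA library.

```python
# Illustrative block-scaled 4-bit quantization, loosely modeled on NVFP4.
# Assumption: the block scale is a plain float; real hardware formats store it
# in a compact low-precision encoding.

# Magnitudes representable by an E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # elements per micro-block


def quantize_block(block):
    """Return (scale, codes); each code is a signed E2M1-representable value."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest E2M1 value.
    scale = max_abs / E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a quantized block."""
    return [c * scale for c in codes]


def quantize(values):
    """Quantize a flat list whose length is a multiple of BLOCK_SIZE."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of BLOCK_SIZE")
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]


example = [i / 10 - 0.8 for i in range(BLOCK_SIZE)]
example_scale, example_codes = quantize_block(example)
example_restored = dequantize_block(example_scale, example_codes)
```

Because each block carries its own scale, an outlier only coarsens the resolution of the 16 values it shares a block with, which is the usual motivation for small micro-blocks.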

Nemotron 3 Nano Omni shares the 30B-A3B base with the text Nano model but adds a Parakeet audio encoder and a C-RADIOv4-H vision encoder, with 3D convolutions for capturing motion between video frames. [8] It accepts text, images, audio, video, documents, charts, and user-interface screenshots as input and produces text output, with a context window of 256K tokens. [7][8] The Omni variant targets document intelligence, video understanding, and multimodal agent reasoning. NVIDIA reported best-in-class results for the Omni model on MMLongBench-Doc, OCRBenchV2, WorldSense, DailyOmni, and VoiceBench, claiming up to 9.2 times greater effective system capacity for video reasoning compared with other open omni models at matched interactivity thresholds. [7][8]

### The 2025 to 2026 rollout

NVIDIA shipped the Nemotron 3 family in stages across roughly six months rather than all at once. [1][23][27] The Nano text model and the initial data release arrived on the December 15, 2025 launch day. [1] Super shipped just ahead of NVIDIA's [GTC 2026](/wiki/gtc) conference (March 16 to 19 in San Jose), with a developer blog, a research microsite, and Hugging Face weights all going live around March 10 to 11, 2026, and the formal arXiv technical report following on April 14, 2026. [23][24][25] The multimodal Nemotron 3 Nano Omni followed on April 28, 2026. [7][8] Ultra was previewed by Jensen Huang during his Computex 2026 keynote in Taipei (June 1, 2026) and released on June 4, 2026, with its own technical report appearing on arXiv on June 12, 2026. [27][28][29][31] The staggered rollout gave each tier a distinct moment and let NVIDIA reuse lessons from the smaller models when finalizing the larger ones.

| Model | Announced | Released | Total / active params | Context |
|---|---|---|---|---|
| Nemotron 3 Nano | Dec 15, 2025 | Dec 15, 2025 | 30B / 3.5B | 1M tokens |
| Nemotron 3 Super | Dec 15, 2025 | Mar 11, 2026 | 120.6B / 12.7B | 1M tokens |
| Nemotron 3 Nano Omni | Apr 28, 2026 | Apr 28, 2026 | 30B / 3B (plus encoders) | 256K tokens |
| Nemotron 3 Ultra | Dec 15, 2025 | Jun 4, 2026 | 550B / 55B | 1M tokens |

## How is Nemotron 3 built? Architecture

The core architectural choice across the Nemotron 3 family is a hybrid stack of state-space layers, attention layers, and mixture-of-experts feedforward layers. NVIDIA describes the design as using a "Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens." [3] The text Nano variant is the most thoroughly documented version of this design and has been described in detail in both the Hugging Face model card and the technical report. [4][5]

Nemotron 3 Nano has 52 layers in total. The composition includes 23 [Mamba 2](/wiki/mamba_2) layers handling most of the sequence mixing work, 6 attention layers using Grouped Query Attention with 2 groups, and 23 mixture-of-experts feedforward layers. [5] Each MoE layer contains 128 routed experts plus 1 shared expert, with 6 experts activated per token. [5] Total active parameters land at 3.5 billion out of 30 billion total, giving an activation ratio of roughly 12 percent. The model also includes Multi-Token Prediction layers to speed up generation, predicting several tokens ahead during inference rather than one at a time.
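The routing step described above (128 routed experts, 6 active per token) can be sketched as a top-k selection over router scores. This is a minimal illustration under stated assumptions: it uses softmax scoring followed by top-k selection and renormalization, a common MoE pattern, and the source does not say whether Nano's router works exactly this way. The shared expert, which is applied to every token, is omitted, and all function names are hypothetical.

```python
import math
import random

NUM_ROUTED_EXPERTS = 128  # routed experts per MoE layer, from the text above
TOP_K = 6                 # experts activated per token, from the text above


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
weights = route_token(logits)  # maps expert index -> mixing weight
```

Only the selected experts run for a given token, which is why the active parameter count stays far below the total.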

The hybrid approach traces to a research observation that pure attention models become inefficient at very long contexts because of the quadratic memory and compute cost of attention. State-space models like Mamba 2 process sequences in linear time and retain selective memory through structured updates, but they tend to underperform attention on tasks that require precise lookup over recent tokens. The Nemotron 3 family interleaves the two so that attention layers handle the precision-critical work while Mamba layers carry the sequence-mixing load. The same general pattern appears in other 2025 hybrid releases including [Jamba 2](/wiki/jamba_2) and [Jet-Nemotron](/wiki/jet_nemotron), though the specific layer ratios and routing strategies differ.
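The scaling argument can be illustrated with back-of-the-envelope arithmetic. The sketch below compares how the work of a full-attention layer and a state-space layer grows when the context expands from 8K to 1M tokens. It deliberately ignores constant factors, hidden sizes, and kernel-level optimizations, so it shows growth rates only, not measured performance; the function names are illustrative.

```python
def attention_pair_count(n_tokens):
    """Token-pair interactions in full self-attention: grows as n^2."""
    return n_tokens * n_tokens


def ssm_step_count(n_tokens):
    """Sequential state updates in a state-space layer: grows as n."""
    return n_tokens


short_ctx, long_ctx = 8_000, 1_000_000

token_growth = long_ctx / short_ctx
attention_growth = attention_pair_count(long_ctx) / attention_pair_count(short_ctx)
ssm_growth = ssm_step_count(long_ctx) / ssm_step_count(short_ctx)

print(f"tokens: x{token_growth:.0f}, attention work: x{attention_growth:.0f}, "
      f"state-space work: x{ssm_growth:.0f}")
```

A 125-fold increase in context multiplies the pairwise attention work by 125 squared, while the state-space work grows only 125-fold, which is why a stack with few attention layers and many Mamba layers is attractive at very long contexts.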

For the larger Super and Ultra variants, NVIDIA introduced a technique called LatentMoE, described in the research microsite as "a novel hardware-aware expert design for improved accuracy." [2] LatentMoE projects expert computation into a lower-dimensional latent space before applying the routed experts, which reduces the parameter count required for a given level of capacity. NVIDIA's technical reports frame it as an expert design that "optimizes for both accuracy per FLOP and accuracy per parameter," in the same spirit as the Multi-head Latent Attention idea popularized by DeepSeek but applied to feedforward rather than attention layers. [3][25]

With Super and Ultra released, the concrete shapes of the larger models are now documented. Nemotron 3 Super is a 120.6 billion total parameter model that activates 12.7 billion parameters per token. [23][25] Its stack has 88 layers built from interleaved Mamba-2 and attention blocks, with LatentMoE feedforward layers carrying 512 experts each and 22 experts activated per token, plus 2 shared-weight Multi-Token Prediction layers for native speculative decoding. [25] Nemotron 3 Ultra scales the same recipe to 550 billion total and 55 billion active parameters, roughly 90 percent sparsity, across 108 layers that again use 512 experts per MoE layer with 22 active and 2 MTP layers. [29][31] Both larger models keep the 1 million token context window of the Nano variant. [23][27]
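The sparsity figures quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the parameter and expert counts stated in the article; the dictionary layout and helper names are illustrative.

```python
# Parameter counts in billions and expert counts, as quoted in this article.
MODELS = {
    "Nano":  {"total_b": 30.0,  "active_b": 3.5,  "experts": 128, "active_experts": 6},
    "Super": {"total_b": 120.6, "active_b": 12.7, "experts": 512, "active_experts": 22},
    "Ultra": {"total_b": 550.0, "active_b": 55.0, "experts": 512, "active_experts": 22},
}


def activation_ratio(model):
    """Share of all parameters used for a single token."""
    return model["active_b"] / model["total_b"]


def expert_fraction(model):
    """Share of routed experts selected for a single token."""
    return model["active_experts"] / model["experts"]


for name, model in MODELS.items():
    print(f"{name}: {activation_ratio(model):.1%} of parameters active, "
          f"{expert_fraction(model):.1%} of routed experts active")
```

The parameter activation ratio (about 10 to 12 percent) is higher than the routed-expert fraction (about 4 to 5 percent) because components outside the routed experts, such as the attention and Mamba layers, run for every token.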

The tokenizer is shared across the family. Context length tops out at 1 million tokens for the text models, with a default deployment window of 256K tokens for most inference engines. NVIDIA reports retention of accuracy across context length using the RULER benchmark, with the Nano variant scoring 91.3 on RULER-100 at 512K tokens and 86.3 at the full 1 million token window. [5] Supported languages are English, Spanish, French, German, Italian, and Japanese for natural language, plus 43 programming languages. [5]

The Nano model is unified across reasoning and non-reasoning use cases. A single set of weights handles both modes, controlled by an enable_thinking flag in the chat template. With thinking enabled, the model emits a reasoning trace before its final answer; with thinking disabled, it responds directly. NVIDIA also exposes a budget control parameter for limiting the number of reasoning tokens generated, an interface choice that has become common across 2025 reasoning releases and that carries through to Super and Ultra. [3]
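The interface described above can be pictured with a toy stand-in for a model call. This is not NVIDIA's chat template or any real inference API: only the `enable_thinking` name comes from the text, while the function, the `reasoning_budget` parameter name, and the token lists are hypothetical. The sketch simply shows the three behaviours a caller can select: no trace, an uncapped trace, or a trace cut off at a budget.

```python
def respond(reasoning_tokens, answer, enable_thinking=True, reasoning_budget=None):
    """Toy model call illustrating thinking on/off and a reasoning-token cap.

    reasoning_tokens: the trace the model would emit if unconstrained.
    reasoning_budget: hypothetical cap on reasoning tokens (None means no cap).
    """
    if not enable_thinking:
        # Thinking disabled: respond directly with no reasoning trace.
        return {"reasoning": [], "answer": answer}
    if reasoning_budget is None:
        kept = list(reasoning_tokens)
    else:
        # Budget reached: stop reasoning early and move on to the answer.
        kept = list(reasoning_tokens[:max(0, reasoning_budget)])
    return {"reasoning": kept, "answer": answer}


trace = ["compare", "options", "check", "edge", "cases", "decide"]
direct = respond(trace, "42", enable_thinking=False)
capped = respond(trace, "42", enable_thinking=True, reasoning_budget=3)
full = respond(trace, "42")
```

In a real deployment the budget trades answer quality against the number of billed reasoning tokens, so it is typically tuned per task rather than set globally.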

## How was Nemotron 3 trained?

The Nano variant was pre-trained on 25 trillion tokens, with the training distribution broken down across multiple categories. [5] The total training mix listed on the Hugging Face model card sums to 13.34 trillion tokens, indicating that some sources contributed multiple epochs to the full 25 trillion token count. Synthetic data made up 3.53 trillion tokens of the mix, the largest single category. English Common Crawl contributed 3.46 trillion tokens, multilingual data 1.74 trillion, code 1.04 trillion, and STEM supervised fine-tuning data 359.8 billion. The pre-training cutoff date was June 25, 2025, with post-training data extending to November 28, 2025. [5]
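The relationship between the 13.34 trillion token mix and the 25 trillion token total can be worked through explicitly. The sketch below uses only figures quoted in the paragraph above and assumes the listed categories do not overlap; the variable names are illustrative.

```python
# All values in trillions of tokens, as quoted above.
TOTAL_PRETRAIN_T = 25.0
MODEL_CARD_MIX_T = 13.34
LISTED_CATEGORIES_T = {
    "synthetic": 3.53,
    "english_common_crawl": 3.46,
    "multilingual": 1.74,
    "code": 1.04,
    "stem_sft": 0.3598,
}

# Average number of passes over the mix needed to reach the full token count.
average_epochs = TOTAL_PRETRAIN_T / MODEL_CARD_MIX_T

# Tokens in the mix not covered by the categories named in this article
# (assumes the named categories are disjoint).
listed_total = sum(LISTED_CATEGORIES_T.values())
unlisted_total = MODEL_CARD_MIX_T - listed_total

print(f"average epochs over the mix: {average_epochs:.2f}")
print(f"named categories: {listed_total:.2f}T, other sources: {unlisted_total:.2f}T")
```

The average of just under two passes is only a summary figure; individual sources may have been repeated more or fewer times.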

NVIDIA released the underlying training data alongside the model. The main pretraining corpus is Nemotron-CC-v2.1, a 2.5 trillion token English web crawl derivative, paired with Nemotron-CC-Code-v1, a 428 billion token code corpus drawn primarily from GitHub. Additional sources include arXiv papers, Wikipedia and Wikimedia content, OpenStax textbooks, PubMed abstracts, NIH ExPorter records, and SEC EDGAR filings. NVIDIA also released three trillion tokens of new Nemotron pretraining, post-training, and reinforcement learning datasets as part of the launch package, along with the Nemotron Agentic Safety Dataset derived from real-world telemetry. [1]

The training schedule used Warmup-Stable-Decay learning rate scheduling with an 8 billion token warmup, peak learning rate of 1e-3, and minimum learning rate of 1e-5. Training batch size was 3,072.

Nemotron 3 Super, released in March 2026, became the first Nemotron model pre-trained end to end in NVFP4, a 4-bit floating point format (E2M1 elements in 16-element micro-blocks) introduced with NVIDIA's Blackwell GPU architecture. [23][25] Super applied NVFP4 to the weights, activations, and gradients of most linear layers while keeping the final 15 percent of the network, along with embeddings, latent projections, and attention projections, in higher precision (BF16 or MXFP8) for stability. [25] Storing values in 4-bit blocks during the training run itself, rather than training in BF16 or FP8 and quantizing afterward, reduces memory pressure and improves throughput on the supported hardware. Nemotron 3 Ultra used the same NVFP4 recipe at 550 billion parameters, again holding the final 15 percent of the network (16 layers) in higher precision, and was pre-trained on roughly 20 trillion tokens, somewhat fewer than the 25 trillion used for Nano and Super. [27][29]
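The Warmup-Stable-Decay schedule mentioned above can be sketched as a simple piecewise function of tokens seen. The peak rate, minimum rate, and warmup length come from the text, and the 25 trillion token total is the Nano pre-training figure. The linear warmup shape, the linear decay shape, and the 20 percent decay fraction are assumptions chosen for illustration, since the article does not specify them.

```python
PEAK_LR = 1e-3          # from the text
MIN_LR = 1e-5           # from the text
WARMUP_TOKENS = 8e9     # from the text
TOTAL_TOKENS = 25e12    # Nano pre-training token count
DECAY_FRACTION = 0.2    # assumption: last 20% of training is the decay phase


def wsd_lr(tokens_seen, total_tokens=TOTAL_TOKENS, warmup_tokens=WARMUP_TOKENS,
           decay_fraction=DECAY_FRACTION, peak_lr=PEAK_LR, min_lr=MIN_LR):
    """Learning rate under a Warmup-Stable-Decay schedule (linear ramps assumed)."""
    decay_start = total_tokens * (1.0 - decay_fraction)
    if tokens_seen < warmup_tokens:
        # Warmup: ramp linearly from 0 to the peak rate.
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen < decay_start:
        # Stable: hold the peak rate for most of training.
        return peak_lr
    # Decay: ramp linearly from the peak rate down to the minimum rate.
    progress = min((tokens_seen - decay_start) / (total_tokens - decay_start), 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


schedule_samples = {t: wsd_lr(t) for t in (0.0, 4e9, 10e12, 22.5e12, 25e12)}
```

The long flat phase is the practical appeal of this family of schedules: a run can be extended or branched from any stable-phase checkpoint, with the decay applied only at the end.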

Post-training was structured around multi-environment reinforcement learning, covering reasoning, tool calling, instruction following, and multilingual response quality. [3] NVIDIA released the underlying RL libraries, called NeMo Gym and NeMo RL, as open-source projects on GitHub at the same time as the model launch. [1] NeMo Evaluator, a companion library for validating model performance and safety, was also released. [1] The post-training mix included synthetic code, math, and science data, along with tool calling traces and multilingual reasoning examples. For Ultra, NVIDIA added Multi-teacher On-Policy Distillation (MOPD), a post-training method in which the student model learns on its own rollouts under scoring from more than ten domain-specialized teacher models, with rollout generation, teacher scoring, and student optimization pipelined asynchronously and run iteratively for continuous improvement. [27]

NVIDIA has not published a complete count of GPU hours used to train the family, but the Hugging Face model cards and the technical reports describe the training infrastructure as Blackwell-based for the Super and Ultra runs. [23][27][29] The Nano model can run inference on a wide range of NVIDIA hardware including A100, H100, B200, RTX PRO 6000, Jetson Thor, and DGX Spark systems.

## How well does Nemotron 3 perform on benchmarks?

NVIDIA published a substantial benchmark table for Nemotron 3 Nano at launch, focusing on the standard reasoning, coding, math, and agentic suites that have become the de facto comparison points for open-weights models. The numbers below come from the official Hugging Face model card. [5]

| Benchmark | Category | Nemotron 3 Nano score |
|---|---|---|
| MMLU-Pro | General knowledge | 78.3 |
| AIME 2025 (with tools) | Math reasoning | 99.2 |
| LiveCodeBench v6 | Code generation | 68.3 |
| MiniF2F pass@32 | Formal math proofs | 79.9 |
| SWE-Bench | Software engineering | 38.8 |
| Arena-Hard-V2 average | Chat quality | 67.7 |
| RULER-100 @ 512K | Long context retention | 91.3 |
| RULER-100 @ 1M | Long context retention | 86.3 |

The AIME 2025 result of 99.2 percent is one of the highest published scores on that benchmark for any open or closed model, though the tools-enabled qualifier is important. [5] AIME with tools allows the model to use a calculator or code interpreter for arithmetic, which removes a class of failure modes that would otherwise hit any language model on hard math contests. Most labs report both tool-enabled and tool-free numbers; the tool-free AIME score for Nemotron 3 Nano was not the headline figure in NVIDIA's marketing.

The LiveCodeBench v6 score of 68.3 ranks Nano ahead of Qwen3-30B-A3B-Thinking-2507 at 66.0 and GPT-OSS-20B at 61.0, according to comparisons published by NVIDIA and reproduced by third-party trackers. [5][19] The Arena-Hard-V2 average of 67.7 also leads the same two models, which scored 57.8 and 48.5 respectively in NVIDIA's reported comparison. On MMLU-Pro the model trails Qwen3-30B's 80.9 by 2.6 points, a notable exception to Nano's lead on most other benchmarks in the same comparison set.

The long-context numbers are particularly strong. RULER is a benchmark suite that tests retrieval, multi-hop reasoning, and aggregation across long contexts, with each task scored at multiple context lengths. Nemotron 3 Nano holds 91.3 percent at the 512K mark and 86.3 percent at the full 1 million token window, indicating that the hybrid Mamba-Transformer architecture maintains reasonable accuracy well past the point where pure attention models typically begin to degrade noticeably. [5] NVIDIA states that Nano outperforms both GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507 "on RULER across different context lengths." [2]

On throughput, NVIDIA reports that Nemotron 3 Nano runs 2.2 times faster than GPT-OSS-20B and 3.3 times faster than Qwen3-30B-A3B-Thinking-2507 on the same H200 hardware in an 8K input, 16K output configuration. [2] Compared with its own predecessor Nemotron 2 Nano, NVIDIA reports up to 4 times higher token throughput and a 60 percent reduction in reasoning-token generation for equivalent task completion. [1] The reasoning-token reduction is meaningful because it directly maps to inference cost in production deployments, where a model that produces fewer thinking tokens per task is cheaper to serve.

Nemotron 3 Nano Omni was reported separately on multimodal benchmarks. The Omni model claimed best-in-class document intelligence results on MMLongBench-Doc and OCRBenchV2, leading scores on video and audio benchmarks including WorldSense, DailyOmni, and VoiceBench, and the 9.2x effective capacity figure for video reasoning relative to other open omni models. [7][8] Detailed score tables for Omni were published alongside the announcement on the NVIDIA developer blog. [8]

### Nemotron 3 Super benchmarks

NVIDIA positioned Nemotron 3 Super against the other leading open MoE models in the 100 to 120 billion parameter class, principally GPT-OSS-120B from OpenAI and Alibaba's Qwen3.5-122B. On throughput, NVIDIA reported that Super runs up to 2.2 times faster than GPT-OSS-120B and up to 7.5 times faster than Qwen3.5-122B on NVIDIA B200 GPUs at an 8K input, 64K output setting, and that it leads both on RULER at the 1 million token context length. [23][24][25] The accuracy numbers below are NVIDIA's own, taken from the model card and evaluated with the NeMo Evaluator harness. [23][26]

| Benchmark | Category | Nemotron 3 Super score |
|---|---|---|
| [AIME 2025](/wiki/aime) (no tools) | Math reasoning | 90.21 |
| [GPQA-Diamond](/wiki/gpqa) (no tools) | Science reasoning | 79.23 |
| GPQA-Diamond (with tools) | Science reasoning | 82.70 |
| MMLU-Pro | General knowledge | 83.73 |
| LiveCodeBench v5 | Code generation | 81.19 |
| SWE-Bench Verified | Software engineering | 60.47 |
| Terminal-Bench (hard) | Agentic tool use | 25.78 |
| [Arena-Hard V2](/wiki/arena_hard) | Chat quality | 73.88 |
| PinchBench | Agent brain | 85.6 |

The SWE-Bench Verified score of 60.47 sits a few points behind Qwen3.5-122B but is paired with a reported throughput advantage of up to 7.5 times over that model (and up to 2.2 times over GPT-OSS-120B), which is the trade NVIDIA emphasized for high-volume agent deployments. [23] NVIDIA reported that Super's multi-token prediction gives it strong native speculative decoding, with an average acceptance length of 3.45 on its internal speculative-decoding benchmark, ahead of DeepSeek-R1 at 2.70. [25] On the independent [Artificial Analysis](/wiki/artificial_analysis) Intelligence Index, Super was later measured at 36, ahead of GPT-OSS-120B at 33 but well below the frontier open models. [31]

### Nemotron 3 Ultra benchmarks

Nemotron 3 Ultra was the headline accuracy release of the family. Artificial Analysis, which NVIDIA partnered with for the evaluation, scored the model at 47.7 on its Intelligence Index and called it the most intelligent United States open-weights model at release. [31] That placed Ultra ahead of the other American open models it measured, including Google's Gemma 4 31B at 39, Nemotron 3 Super at 36, and GPT-OSS-120B at 33, but behind the Chinese-led open-weights frontier, where Kimi K2.6 scored 53.9. [31] NVIDIA reported that Ultra sustains more than 300 output tokens per second on Blackwell hardware, several times the 50 to 100 tokens per second typical of comparably sized Chinese open models on the same class of endpoint, and that its token efficiency lowers the cost to complete agent tasks by around 30 percent on SWE-Bench Verified and Terminal-Bench 2.0. [27][31]

On task-level benchmarks from NVIDIA's launch materials, the post-trained Ultra scored 91 percent on the PinchBench agent test, 54 percent on Terminal-Bench 2.0, 82 percent on the IFBench instruction-following suite, between 65 and 70.4 percent on SWE-Bench Verified across different agent scaffolds, and 95 percent on RULER at 1 million tokens. [27] The base model, before post-training, scored 79.07 on MMLU-Pro, 50.0 on GPQA-Diamond, and 83.84 on HumanEval in the technical report. [29] NVIDIA framed Ultra as delivering roughly 5 times higher throughput than other open models in its class while matching or approaching their accuracy, the same efficiency-first argument the company made for the smaller variants. [27][29]

### Use in agent harnesses: the LangChain Deep Agents playbook

On July 8, 2026, [LangChain](/wiki/langchain) published an engineering playbook titled "Tuning the harness, not the model: a Nemotron 3 Ultra playbook," describing how it adapted its [Deep Agents](/wiki/deep_agents) framework, an open agent [harness](/wiki/harness) built on LangChain's [LangGraph](/wiki/langgraph) orchestration runtime, to run Nemotron 3 Ultra as a long-running agent. [32] The write-up, by LangChain engineers Nick Hollon and Srimanth Tangedipalli, is a third-party validation of the "agent readiness" and "long-running agent" positioning NVIDIA set for Ultra, reached without any change to the model weights. NVIDIA published a companion announcement and a step-by-step technical guide the same day, and the two companies released a joint open reference blueprint called NemoClaw. [33][34][35]

LangChain's thesis is that "an agent is a model plus a harness," where the harness is the surrounding scaffold of system prompt, tool descriptions, and middleware. [32] The company argues that for a strong open model, the largest agent-performance gains come from matching that scaffold to the model rather than from fine-tuning weights: "When the two are matched, the model spends its capability on the task. When they are not, it spends capability fighting the scaffolding." [32] LangChain frames evaluations as "the training data for harness engineering," running its Deep Agents test suite, finding failing or wasteful patterns, adjusting prompts, tools, and middleware, then repeating until performance converges. [32][34] It is explicit about a ceiling: harness tuning "can't add what isn't in the model," and a score that holds steady through every scaffold change is "pointing past the harness" to a limitation only further model post-training can fix. [32]

The concrete techniques LangChain reports for Nemotron 3 Ultra fall into prompting and middleware, and it stresses that only targeted changes helped: broad "be a better agent" rewrites "tend to wash out." [32] On the prompting side it favored short, single-purpose instruction blocks aimed at a specific observed failure, giving the example of a `<final_answer_completeness>` block requiring the agent to state concrete results after a tool call succeeds. [32] It also found Nemotron followed guidance more reliably when it was delivered as an in-band conversation message at the point of need rather than as a standing rule in the system prompt. [32] On the middleware side, LangChain moved guidance out of tool descriptions and into the tool results themselves. NVIDIA's technical guide documents the canonical example, a ReadFileContinuationNoticeMiddleware that annotates a `read_file` result when the read hits its 100-line cap, telling the model the file likely continues and to call `read_file` again with an offset, so the agent paginates instead of treating a truncated read as the end of the file; in NVIDIA's own harness tests this fixed all three failing read-file cases (0 of 3 to 3 of 3) and lifted the score on its 127-test suite from 94 to 96. [34] Other reported middleware included an injected planning step (the agent is asked to draft a plan, then prompted with a second message to review it before executing), explicit compaction guidance so the model summarizes long conversations correctly before starting new work, and code-level enforcement hooks that cap model and tool calls to break loops and add a one-shot retry for transient tool failures. [32][34]
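The read-file continuation idea can be sketched in plain Python. This is not the Deep Agents middleware API or NVIDIA's actual ReadFileContinuationNoticeMiddleware: the function names, the notice wording, and the in-memory "file" are all hypothetical. Only the 100-line cap and the behaviour (annotate a capped read so the agent pages forward with an offset) come from the description above.

```python
READ_LINE_CAP = 100  # per-call line cap described in the text


def read_file(lines, offset=0, limit=READ_LINE_CAP):
    """Hypothetical tool: return at most `limit` lines starting at `offset`."""
    return lines[offset:offset + limit]


def with_continuation_notice(chunk, offset, limit=READ_LINE_CAP):
    """Append a notice to a tool result when the read hit the line cap."""
    text = "\n".join(chunk)
    if len(chunk) >= limit:
        next_offset = offset + len(chunk)
        text += (f"\n[notice] Output stopped at the {limit}-line cap. "
                 f"The file likely continues; call read_file again with "
                 f"offset={next_offset}.")
    return text


def read_whole_file(lines):
    """Simulate an agent that keeps paging until a read comes back short."""
    collected, offset = [], 0
    while True:
        chunk = read_file(lines, offset)
        collected.extend(chunk)
        if len(chunk) < READ_LINE_CAP:
            return collected
        offset += len(chunk)


document = [f"line {i}" for i in range(250)]
first_result = with_continuation_notice(read_file(document, 0), 0)
last_result = with_continuation_notice(read_file(document, 200), 200)
```

Placing the hint in the tool result, rather than in the tool description, means the model sees it exactly when the truncation happens, which matches the "guidance at the point of need" pattern the playbook describes.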

NVIDIA's technical guide frames these as edits to a "harness profile," which Deep Agents exposes as a per-model extension point for changing prompts, removing tools or middleware, or adding new middleware and sub-agents; the stated goal of the tuning is "to make the calls from the agent to the model more closely resemble what the model saw in the training data." [34] That goal ties directly back to NVIDIA's own design intent for the family, which was post-trained with multi-environment reinforcement learning for multi-step tool use rather than for chat. [3][27]
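The code-level enforcement hooks mentioned earlier (capping calls to break loops, and a one-shot retry for transient tool failures) can be sketched as a small wrapper. All class and function names here are hypothetical and do not come from Deep Agents or any NVIDIA library, and treating `TimeoutError` as the transient failure type is an assumption made for the example.

```python
class CallBudgetExceeded(RuntimeError):
    """Raised when an agent run exceeds its allowed number of calls."""


class CallCounter:
    """Counts calls and refuses further ones past a fixed cap."""

    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.calls = 0

    def charge(self):
        if self.calls >= self.max_calls:
            raise CallBudgetExceeded(f"call cap of {self.max_calls} reached")
        self.calls += 1


def call_tool_with_retry(tool, counter, *args, transient=(TimeoutError,)):
    """Run a tool once, retrying exactly one time on a transient error."""
    counter.charge()
    try:
        return tool(*args)
    except transient:
        # One-shot retry: a second failure propagates to the caller.
        counter.charge()
        return tool(*args)


attempts = {"n": 0}


def flaky_tool(x):
    """Demo tool that fails on its first call and succeeds afterwards."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TimeoutError("transient failure")
    return x * 2


counter = CallCounter(max_calls=3)
result = call_tool_with_retry(flaky_tool, counter, 21)
```

Charging the retry against the same counter keeps the loop cap meaningful: a tool that fails repeatedly still exhausts the budget instead of retrying forever.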

On results, LangChain reports that tuning took Nemotron 3 Ultra from a no-profile baseline of about 0.80 to about 0.84 on a typical run of its Deep Agents suite, with a best run of 0.86, "nearly matching Opus 4.8's best of 0.87" (LangChain notes that [Claude Opus 4.8](/wiki/claude_opus_4_8) typically runs about 0.86). [32] The suite scores five agent capabilities: file work, tool calling, information retrieval, multi-turn conversation, and long-context summarization. [32] LangChain says the summarization subscale, which had the most headroom, was cleared entirely, that tool use improved significantly, and that retrieval and file operations, already strong, edged higher, while the multi-turn conversation subscale stayed flat, which it read as a signal that further gains there need model post-training rather than harness work. [32] The headline was cost: at its best run Ultra reached an aggregate 0.86 at about $4.48 per run on the full suite against about $43.48 for Opus 4.8, roughly 10 times cheaper, while median latency held at parity with Opus at around ten seconds per test. [32][35] LangChain presents the method as general rather than Nemotron-specific, citing earlier harness-only wins including moving gpt-5.2-codex from 52.8 to 66.5 on Terminal-Bench 2.0 and improving a curated tau2-bench subset by 10 to 20 points. [32]
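The headline comparison can be reproduced from the figures quoted above. The sketch uses only the reported per-run costs and suite scores; the variable names are illustrative, and the "score per dollar" ratio is a convenience metric introduced here, not one LangChain reports.

```python
# Figures as reported in the LangChain write-up quoted above.
ULTRA_COST_PER_RUN = 4.48    # USD, full suite, best run
OPUS_COST_PER_RUN = 43.48    # USD, full suite
ULTRA_BASELINE, ULTRA_TYPICAL, ULTRA_BEST = 0.80, 0.84, 0.86
OPUS_BEST = 0.87

cost_ratio = OPUS_COST_PER_RUN / ULTRA_COST_PER_RUN
typical_gain = ULTRA_TYPICAL - ULTRA_BASELINE
gap_to_opus_best = OPUS_BEST - ULTRA_BEST

# Illustrative metric (not from the source): suite score per dollar spent.
ultra_score_per_dollar = ULTRA_BEST / ULTRA_COST_PER_RUN
opus_score_per_dollar = OPUS_BEST / OPUS_COST_PER_RUN

print(f"cost ratio: {cost_ratio:.1f}x")
print(f"typical gain from tuning: {typical_gain:+.2f}")
print(f"gap to Opus best: {gap_to_opus_best:.2f}")
```

The exact ratio is slightly under 10, which is consistent with the "roughly 10 times cheaper" phrasing.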

NVIDIA's own framing echoed the cost story, calling the result the "highest accuracy among open models" at "10x lower inference cost per run than leading closed models" and stressing that "every gain came from engineering the environment around the model, not the model itself." [33] The joint NemoClaw blueprint packages LangChain's Deep Agents Code, the Nemotron-3-Ultra-tuned harness profile, and NVIDIA's OpenShell secure runtime as an open reference stack that enterprises can customize, run evals against, and deploy. [33][35] The tuned harness was made available through hosted inference on Baseten, Crusoe, DeepInfra, Fireworks, Nebius, and Together AI, several of which independently repeated the untouched-model, harness-only result and the roughly 10 times cost advantage. [33][35][36][37] The episode is a concrete, independently reported instance of the efficiency-first, agent-ready argument NVIDIA made for Ultra: a major agent framework reached agent scores approaching those of the frontier closed model it was measured against, at a fraction of the per-run cost, purely by engineering the scaffold around an unchanged open model. [32][33]

## Is Nemotron 3 open source? Licensing

Nemotron 3 ships under the NVIDIA Nemotron Open Model License. [10] This license is part of NVIDIA's broader Open Model License framework introduced earlier with Nemotron-4 340B and refined across subsequent releases. The license permits both commercial and non-commercial use of the model weights and derivatives, including for synthetic data generation, fine-tuning, and deployment of fine-tuned versions. [10]

The license includes use restrictions that distinguish it from a pure permissive license such as MIT or Apache 2.0. NVIDIA retains the right to terminate the license if the model is used in ways that violate listed restrictions, which cover categories like critical infrastructure, weapons development, illegal activity, and certain regulated domains. Derivative models trained on outputs from Nemotron 3 are permitted but must maintain attribution and may not remove the license terms. [10]

NVIDIA explicitly committed to releasing training data, training code, the evaluation harness, and reinforcement learning libraries alongside the weights, stating in its white paper that it would "openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights." [3] The release package therefore goes substantially beyond a typical open-weights release. NeMo Gym, NeMo RL, and NeMo Evaluator are all published as separate open-source repositories on GitHub. [1][11] The Nemotron-CC-v2.1 and Nemotron-CC-Code-v1 datasets are hosted on Hugging Face under their own data licenses, which generally follow the Creative Commons or similar permissive frameworks used by their underlying sources. When Super and Ultra shipped, NVIDIA followed the same pattern, releasing base, post-trained, and quantized checkpoints together with their datasets under the same Open Model License. [23][25][30]

The models are listed as "Ready for commercial use" on the official Hugging Face cards. [5][26][30] Commercial deployment requires acceptance of the Nemotron Open Model License terms and adherence to NVIDIA's Trustworthy AI framework, which covers content filtering, bias mitigation, explainability requirements, and privacy considerations.

## How does Nemotron 3 compare to peer models?

Nemotron 3 Nano competes most directly with the 20 to 30 billion parameter mixture-of-experts releases from other labs, including Qwen3-30B-A3B from Alibaba and GPT-OSS-20B from OpenAI; the dense Llama 3.1 models from Meta serve as an older baseline. The comparison table below uses NVIDIA's reported numbers for Nemotron 3 Nano and the developers' own reported numbers for the other models. Cross-lab benchmark comparison is always imperfect because labs use different prompts and evaluation harnesses (the Nemotron 3 Nano AIME figure, for instance, is a with-tools score), so the table is best read as a rough guide rather than a controlled head-to-head.

| Model | Developer | Total / active params | Context | License | AIME 2025 | LiveCodeBench v6 |
|---|---|---|---|---|---|---|
| Nemotron 3 Nano | NVIDIA | 30B / 3.5B (MoE) | 1M | NVIDIA Open Model License | 99.2 (with tools) | 68.3 |
| Qwen3-30B-A3B | Alibaba | 30B / 3B (MoE) | 256K | Apache 2.0 | 70.9 | 66.0 |
| GPT-OSS-20B | OpenAI | 21B / 3.6B (MoE) | 128K | Apache 2.0 | not reported | 61.0 |
| Llama 3.1 70B | Meta | 70B (dense) | 128K | Llama 3.1 Community | not reported | not reported |
| Nemotron 2 Nano | NVIDIA | similar scale | 128K | NVIDIA Open Model License | trails Nemotron 3 Nano | trails Nemotron 3 Nano |

Versus Qwen3-30B-A3B, Nemotron 3 Nano has the same total parameter scale and a comparable active parameter count, but extends the context window from 256K to 1M and uses a hybrid Mamba-Transformer architecture rather than a pure Transformer. Qwen3 holds the lead on MMLU-Pro by roughly 2.5 points but trails on AIME, LiveCodeBench, Arena-Hard, and throughput at NVIDIA's reported settings. [2][5] The Apache 2.0 license on Qwen3 is more permissive than NVIDIA's Open Model License for some use cases, particularly research where the use restrictions in NVIDIA's license could be a complication.

Versus GPT-OSS-20B, Nemotron 3 Nano is larger in total parameters, similar in active parameters, and uses a less common architecture. OpenAI's open release is a sparse mixture-of-experts Transformer (about 21 billion total parameters, 3.6 billion active) with no Mamba layers and a much shorter 128K context. NVIDIA's reported benchmarks place Nemotron 3 Nano ahead on every shared benchmark, with the throughput gap of 2.2x being the headline efficiency claim. [2]

Versus [Llama 3.1](/wiki/llama_3_1) from Meta, the comparison is harder to make cleanly because Llama 3.1 is more than a year older and was not designed for the agentic and reasoning evaluation suites that dominate late-2025 model marketing. The 70 billion parameter Llama 3.1 Instruct remains a baseline reference point for the open-weights field, and NVIDIA does not include it in its primary Nemotron 3 benchmark tables. On the kinds of tasks that Llama 3.1 was tuned for, such as general chat and instruction following, the gap between it and Nemotron 3 Nano is narrower than the headline math and code benchmarks would suggest.

Versus its own predecessor Nemotron 2 Nano, Nemotron 3 Nano is the headline efficiency story for the family. NVIDIA reports up to 4 times higher token throughput and up to 60 percent fewer reasoning tokens generated per task, achieved through a combination of the sparse mixture-of-experts design (the LatentMoE variant is used on the larger Super and Ultra models), the Mamba-Transformer ratio adjustments, and the move to FP8 inference precision by default. [1] The context window extension from 128K to 1M tokens is a meaningful capability change, particularly for code agent applications where the model needs to hold large project contexts in working memory.

With Super and Ultra now shipped, the family spans three competitive weight classes. Nemotron 3 Super at 120.6 billion total parameters competes directly with GPT-OSS-120B and Qwen3.5-122B, and NVIDIA's own base-model comparisons in the Super technical report set it against Ling-flash-Base-2.0 and GLM-4.5-Air-Base; the general trade is a few points of coding accuracy given up in exchange for a large throughput advantage on Blackwell hardware. [23][25] Nemotron 3 Ultra at 550 billion total parameters competes with the frontier-scale open releases, particularly DeepSeek V3, Kimi K2.6, and the largest Qwen models. On the Artificial Analysis Intelligence Index it became the strongest United States open-weights model at 47.7, though it remained behind the leading Chinese systems, a gap that NVIDIA's launch materials and outside analysts both acknowledged. [31] The three tiers together let a developer keep one architecture and tokenizer while moving between a workstation-class Nano, an agent-workhorse Super, and a research-grade Ultra as workload demands change.

## How was Nemotron 3 received?

Reception of Nemotron 3 Nano was broadly positive among open-source AI developers and frontier-model trackers, with a focus on three things: the throughput claims, the data release, and the unusually complete openness of the package.

HyperFRAME Research described the release as "a meaningful escalation in NVIDIA's open-source posture," noting that the company had moved from publishing models primarily as building blocks for synthetic data generation to publishing them as competitive end-user models. [14] Several writers connected this shift to growing competitive pressure from Chinese open-weights labs, particularly Alibaba's Qwen team and DeepSeek, both of which had been publishing increasingly capable models on permissive licenses through 2024 and 2025.

The New Stack covered the launch under the headline "Nvidia Launches the Next Generation of Its Nemotron Models," emphasizing the data release and the agent-oriented framing rather than the raw benchmark numbers. [15] Constellation Research's coverage was more strategic, describing Nemotron as "a much-needed open-source model champion in the US," against the backdrop of a perception that the leading open-weights frontier had shifted to Chinese labs over 2025. [17]

Independent benchmark reproduction was generally consistent with NVIDIA's claims, though with the usual caveats. Medium technical reviewer Barnacle Goose wrote a detailed walk-through of the Nano model that confirmed the long-context retention claims on RULER and the throughput advantage over Qwen3 on H100 hardware, while noting that the SWE-Bench score of 38.8 lags behind specialized code models. [20] LLM-stats.com published side-by-side comparison tables placing Nemotron 3 Nano ahead of Qwen3 on most agentic and math benchmarks but behind on raw MMLU-Pro and on some coding tasks. [19]

The local-deployment community, which runs models on consumer or prosumer hardware, was particularly enthusiastic about the Nano variant. Unlike prior NVIDIA releases that targeted enterprise H100 deployments, Nano's 30B-A3B configuration runs well on the RTX PRO 6000 workstation cards and on DGX Spark units, both of which had been released earlier in 2025. The FP8 quantization variant in particular drew attention as fitting comfortably in 48 GB of VRAM with full 1M token context. [6]
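A back-of-the-envelope estimate shows why the FP8 checkpoint is the one that suits a 48 GB card. The sketch assumes the nominal 30 billion parameter count and one byte per parameter for FP8 (two for BF16). It deliberately ignores any layers kept at higher precision, the KV cache, Mamba state, activations, and framework overhead, so it bounds only the weight footprint and does not by itself verify the 1M-token claim.

```python
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in decimal gigabytes (weights only)."""
    return total_params * bytes_per_param / 1e9


NANO_TOTAL_PARAMS = 30e9  # nominal count from the "30B-A3B" name (assumption)
VRAM_GB = 48.0            # memory budget quoted in the text

fp8_gb = weight_memory_gb(NANO_TOTAL_PARAMS, 1.0)   # FP8: 1 byte per parameter
bf16_gb = weight_memory_gb(NANO_TOTAL_PARAMS, 2.0)  # BF16: 2 bytes per parameter
headroom_gb = VRAM_GB - fp8_gb                      # left for cache and activations

print(f"FP8 weights: {fp8_gb:.0f} GB, BF16 weights: {bf16_gb:.0f} GB, "
      f"FP8 headroom in {VRAM_GB:.0f} GB: {headroom_gb:.0f} GB")
```

Under these assumptions the BF16 weights alone exceed 48 GB, while the FP8 weights leave room for runtime state.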

The agentic-AI community focused on the tool calling and reasoning trace features. NVIDIA's bundled vLLM reasoning parser plugin for the Nano model became a reference implementation that other labs began copying for their own agentic releases. Several developer surveys in early 2026 listed Nemotron 3 Nano alongside Qwen3 and DeepSeek as the most-used open-weights models for agent prototyping work. The mid-2026 LangChain Deep Agents playbook, which tuned an agent harness around Nemotron 3 Ultra rather than the model itself, extended this agent-focused reception to the largest tier of the family. [32]

Critical reactions clustered around two concerns. The first was the NVIDIA Open Model License itself. While the license permits commercial use, the use restrictions and the absence of a clean MIT or Apache 2.0 grant raised concerns among some open-source advocates who argued that the restrictions made the model less "open" than the marketing implied. [10] The second was the dependency on NVIDIA-specific tooling for optimal inference, with NVFP4 quantization on Super and Ultra effectively requiring Blackwell hardware to realize the full throughput claims. [3]

NVIDIA's investor and enterprise framing also drew some commentary. Several writers noted that the Nemotron release was timed to coincide with broader NVIDIA messaging around agentic AI infrastructure, with the models functioning partly as a demonstration of the company's hardware capabilities and partly as standalone products. The investor relations press release on December 15, 2025 explicitly framed the release as part of NVIDIA's enterprise AI software strategy. [1]

In the months following the launch, NVIDIA delivered the rest of the roadmap. Nemotron 3 Super shipped on March 11, 2026, around GTC 2026, and was received as a credible open competitor to GPT-OSS-120B, with reviewers focusing on its throughput lead and its status as the first Nemotron pre-trained end to end in NVFP4. [23][25] The multimodal Nemotron 3 Nano Omni followed on April 28, 2026, drawing interest for folding vision, audio, and video into the same efficient MoE base. [7] Nemotron 3 Ultra arrived on June 4, 2026, after Jensen Huang previewed it at Computex 2026, and drew the most attention of the three follow-up releases. [27][31] Coverage framed Ultra through the lens of the United States versus China open-weights race: at 47.7 on the Artificial Analysis Intelligence Index it was the strongest American open model, but it still trailed the Chinese frontier, where Kimi K2.6 scored 53.9. [31] Constellation Research's earlier description of Nemotron as a needed United States open-source champion was frequently revisited in this context, and the July 2026 LangChain result gave it a fresh angle: on an independent agent-task suite, a frontier agent framework reached near-parity with a leading closed model at roughly one tenth the per-run cost, purely by tuning the harness around the open model. [17][31][32]

## ELI5: what is Nemotron 3 in plain terms?

Think of Nemotron 3 as a free-to-use AI brain that NVIDIA built and then handed out, including the instructions for how it was made. It comes in three sizes: a small one called Nano you can run on a powerful desktop, a medium one called Super, and a very large one called Ultra. There is also a version called Omni that can see pictures and video and hear audio, not just read text. The clever trick inside is that it only switches on a small part of itself for each word it reads or writes, like a big office where only a few people work on any one task, so it stays fast and cheap to run. It can also keep the equivalent of several long books in view at once (up to a million tokens, the small chunks of words it reads) without forgetting the beginning, and it is good at doing step-by-step reasoning and using tools like a calculator or a code runner to get tasks done. NVIDIA released the small Nano first at the end of 2025, then the bigger Super and Ultra over the following months, so by mid-2026 the whole set was available to download. When another company, LangChain, wanted the biggest model to act as a hands-off worker that keeps going on its own, they found the best trick was not to retrain the brain at all but to redesign the "desk" it sits at (the prompts and tools around it), which let it do the job almost as well as a top paid model for about one tenth of the running cost.

## See also

- [NVIDIA](/wiki/nvidia)
- [Mamba 2](/wiki/mamba_2)
- [Nemotron Nano 2](/wiki/nemotron_nano_2)
- [Llama 3.1](/wiki/llama_3_1)
- [Open Source AI](/wiki/open_source_ai)
- [GPT-OSS](/wiki/gpt_oss)
- [Artificial Analysis](/wiki/artificial_analysis)
- [Jamba 2](/wiki/jamba_2)
- [Jet-Nemotron](/wiki/jet_nemotron)
- [Large Language Model](/wiki/large_language_model)
- [LangChain](/wiki/langchain)
- [Deep Agents](/wiki/deep_agents)

## References

1. NVIDIA. "NVIDIA Debuts Nemotron 3 Family of Open Models." NVIDIA Newsroom, December 15, 2025. https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models
2. NVIDIA Research. "NVIDIA Nemotron 3 Family of Models." https://research.nvidia.com/labs/nemotron/Nemotron-3/
3. NVIDIA. "NVIDIA Nemotron 3: Efficient and Open Intelligence." arXiv:2512.20856, December 2025. https://arxiv.org/abs/2512.20856
4. NVIDIA. "Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning." Technical Report. https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf
5. NVIDIA. "NVIDIA-Nemotron-3-Nano-30B-A3B-BF16." Hugging Face model card. https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
6. NVIDIA. "NVIDIA-Nemotron-3-Nano-30B-A3B-FP8." Hugging Face model card. https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
7. NVIDIA. "NVIDIA Launches Nemotron 3 Nano Omni Model, Unifying Vision, Audio and Language for AI Agents." NVIDIA Blog, April 28, 2026. https://blogs.nvidia.com/blog/nemotron-3-nano-omni-multimodal-ai-agents/
8. NVIDIA. "NVIDIA Nemotron 3 Nano Omni Powers Multimodal Agent Reasoning in a Single Efficient Open Model." NVIDIA Technical Blog, April 28, 2026. https://developer.nvidia.com/blog/nvidia-nemotron-3-nano-omni-powers-multimodal-agent-reasoning-in-a-single-efficient-open-model/
9. NVIDIA. "Nemotron AI Models." NVIDIA Developer. https://developer.nvidia.com/nemotron
10. NVIDIA. "NVIDIA Nemotron Open Model License." https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
11. GitHub. "NVIDIA-NeMo/Nemotron Developer Asset Hub." https://github.com/NVIDIA-NeMo/Nemotron
12. Wikipedia. "Nemotron." https://en.wikipedia.org/wiki/Nemotron
13. AIwire. "Nvidia Releases Nemotron 3, Expanding Its Open Models for Agentic AI." December 17, 2025. https://www.hpcwire.com/aiwire/2025/12/17/nvidia-releases-nemotron-3-expanding-its-open-models-for-agentic-ai/
14. HyperFRAME Research. "NVIDIA Releases Nemotron 3: A New Family of Open Models." December 18, 2025. https://hyperframeresearch.com/2025/12/18/nvidia-releases-nemotron-3-a-new-family-of-open-models/
15. The New Stack. "Nvidia Launches the Next Generation of Its Nemotron Models." https://thenewstack.io/nvidias-launches-the-next-generation-of-its-nemotron-models/
16. CIO Dive. "Nvidia Nemotron 3 seeks to power scalable, multi-agent systems." https://www.ciodive.com/news/nvidia-nemotron-3-power-multi-agent-systems/807952/
17. Constellation Research. "Nvidia Nemotron: Much needed open-source model champion in US." https://www.constellationr.com/insights/news/nvidia-nemotron-much-needed-open-source-model-champion-us
18. The Neuron. "NVIDIA Nemotron 3 Nano: An Open LLM With a 1-Million-Token Context Window for Agents." https://www.theneuron.ai/explainer-articles/nvidia-nemotron-3-nano-open-llm/
19. LLM-stats. "Nemotron 3 Nano (30B A3B) vs Qwen3 30B A3B Comparison." https://llm-stats.com/models/compare/nemotron-3-nano-30b-a3b-vs-qwen3-30b-a3b
20. Medium (Barnacle Goose). "A Technical Review of NVIDIA's Nemotron 3 Nano 30B A3B." https://medium.com/@leucopsis/a-technical-review-of-nvidias-nemotron-3-nano-30b-a3b-e91673f22df4
21. NVIDIA. "nemotron-3-8b-base-4k." Hugging Face model card. https://huggingface.co/nvidia/nemotron-3-8b-base-4k
22. NVIDIA. "Nemotron-4 15B Technical Report." arXiv:2402.16819, February 2024. https://arxiv.org/abs/2402.16819
23. NVIDIA. "Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning." NVIDIA Technical Blog, March 11, 2026. https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/
24. NVIDIA Research. "NVIDIA Nemotron 3 Super." Published March 10, 2026. https://research.nvidia.com/labs/nemotron/Nemotron-3-Super/
25. NVIDIA. "Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning." arXiv:2604.12374, April 14, 2026. https://arxiv.org/abs/2604.12374
26. NVIDIA. "NVIDIA-Nemotron-3-Super-120B-A12B-BF16." Hugging Face model card. https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
27. NVIDIA. "NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents." NVIDIA Technical Blog, June 4, 2026. https://developer.nvidia.com/blog/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running-agents/
28. NVIDIA Research. "NVIDIA Nemotron 3 Ultra." Published June 4, 2026. https://research.nvidia.com/labs/nemotron/Nemotron-3-Ultra/
29. NVIDIA. "Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning." arXiv:2606.15007, June 12, 2026. https://arxiv.org/abs/2606.15007
30. NVIDIA. "NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16." Hugging Face model card. https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
31. Artificial Analysis. "Nemotron 3 Ultra announced: high-speed, leading US open weights intelligence." June 2026. https://artificialanalysis.ai/articles/nvidia-nemotron-3-ultra-launch-announced
32. LangChain (Nick Hollon and Srimanth Tangedipalli). "Tuning the harness, not the model: a Nemotron 3 Ultra playbook." LangChain Blog, July 8, 2026. https://www.langchain.com/blog/tuning-the-harness-not-the-model-a-nemotron-3-ultra-playbook
33. NVIDIA. "NVIDIA Nemotron Achieves Benchmark-Leading Performance With LangChain Deep Agents Harness." NVIDIA Blog, July 8, 2026. https://blogs.nvidia.com/blog/nemotron-langchain-agents-open-stack/
34. NVIDIA (Sean Lopp, Matthew Penn, and Sukrit Rao). "Create a LangChain Deep Agents Harness Profile for NVIDIA Nemotron 3 Ultra to Improve Performance." NVIDIA Technical Blog, July 8, 2026. https://developer.nvidia.com/blog/create-a-langchain-deep-agents-harness-profile-for-nvidia-nemotron-3-ultra-to-improve-performance/
35. LangChain. "LangChain and NVIDIA Launch the NemoClaw Deep Agents Blueprint." LangChain Blog, July 8, 2026. https://www.langchain.com/blog/langchain-and-nvidia-launch-the-nemoclaw-deep-agents-blueprint
36. Nebius. "LangChain tunes Deep Agents for NVIDIA Nemotron 3 Ultra: top open-model accuracy at 10x lower cost with Nebius Agents Blueprint." Nebius Blog, July 8, 2026. https://nebius.com/blog/posts/langchain-tunes-deep-agents-for-nemotron-3-ultra
37. Fireworks AI. "Open, frontier, and yours: LangChain Deep Agents on NVIDIA Nemotron 3 Ultra, running on Fireworks." Fireworks AI Blog, July 9, 2026. https://fireworks.ai/blog/Open-frontier-and-yours-LangChain-Deep-Agents-on-NVIDIA