Jet-Nemotron

Updated Jul 7, 2026

Jet-Nemotron is a family of small hybrid-architecture language models introduced by NVIDIA Research in August 2025.^[1] The family ships in two sizes, Jet-Nemotron-2B and Jet-Nemotron-4B, and is built with an architecture exploration method the authors call Post Neural Architecture Search (PostNAS). Rather than training a new model from scratch, PostNAS starts from an existing full-attention checkpoint, freezes its MLP weights, and replaces most of the attention layers with a new linear attention block named JetBlock. The result is a decoder-only model that, on NVIDIA H100 hardware, generates tokens roughly 21 to 53.6 times faster than the Qwen3-1.7B-Base baseline at long context while, in the authors' words, it matches "or exceeds the accuracy of leading full-attention models" such as Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of knowledge, math, and code benchmarks.^[1]

The paper introducing Jet-Nemotron, "Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search," was posted to arXiv on August 21, 2025 (preprint 2508.15884) by Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, and Han Cai.^[1] The same group released the inference code on GitHub (NVlabs/Jet-Nemotron) and the model weights on Hugging Face (under the jet-ai organization) on September 29, 2025, shortly after the paper was accepted to NeurIPS 2025 on September 18, 2025.^[3]^[9] Jet-Nemotron sits in the broader Nemotron research line at NVIDIA, but it is structurally separate from the production Nemotron 3 chat models released later in 2025 and is best understood as an efficiency research artifact rather than a general-purpose assistant.

Background

The Nemotron research line at NVIDIA

NVIDIA began publishing the Nemotron family of models in 2024 as part of a broader push to make GPU-accelerated language model training and serving a first-class product line. Early entries included Nemotron-4 15B and the Nemotron-4 340B reward and instruct models, followed in 2025 by Nemotron Nano, Llama Nemotron, and the Nemotron 3 family of open-weight chat models. These releases shared two themes: open weights with permissive enough licensing for research, and architecture choices motivated by inference economics on NVIDIA's own Hopper and Blackwell GPUs.

Jet-Nemotron belongs to the same product line by branding, but the people behind it come from the Efficient AI group at NVIDIA Research, which has worked for several years on hardware-aware model compression, quantization, and architecture search.^[2] Song Han, who co-authored the paper, is a long-time researcher on model efficiency and is also a faculty member at MIT. The Jet-Nemotron paper credits influences from earlier work on linear attention, gated state space models, and hybrid architectures such as Jamba and Zamba, but its specific contribution is the search procedure rather than the underlying block designs themselves.

Why use architecture search for language models?

The context for Jet-Nemotron is that pre-training a new foundation model from scratch is enormously expensive. Even modest 1 to 4 billion parameter models cost tens of thousands of GPU hours to train on a few trillion tokens of text. That cost makes large-scale neural architecture search impractical: a normal NAS run might evaluate hundreds or thousands of candidate architectures, and at this scale each candidate would itself cost a small fortune to train. Researchers had largely abandoned NAS for transformer language models by 2023 because the search loop was unaffordable.

A second pressure is that full softmax attention does not scale gracefully at long context. Memory for the KV cache grows linearly with sequence length and number of layers, and the attention computation itself grows quadratically in the prefill phase. At 64K tokens of context, a model like Qwen3-1.7B-Base has to maintain a KV cache of roughly 7 GB (7,168 MB), which limits batch size and holds generation throughput on a single H100 to roughly 61 tokens per second.^[1] Linear attention architectures, such as Mamba 2, RWKV-7, GLA, RetNet, DeltaNet, and Gated DeltaNet, address this by replacing the softmax kernel with a recurrent or kernelized formulation that uses a fixed-size state, but they have historically lagged on knowledge benchmarks such as MMLU.^[12]^[13]
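The linear growth of the KV cache can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the Qwen3-1.7B configuration (28 layers, 8 key-value heads, head dimension 128, 2-byte bf16 values) is an assumption not stated in this article. Under those assumptions the formula lands on 7,168 MiB at 65,536 tokens (64K), the context length at which the speed table later in this article reports the 7,168 MB figure.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence: keys and values for every layer and token."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return num_layers * seq_len * per_token_per_layer

# Assumed Qwen3-1.7B configuration (not given in the article): bf16 values.
QWEN3_1P7B = dict(num_layers=28, num_kv_heads=8, head_dim=128)

for tokens in (4 * 1024, 64 * 1024, 256 * 1024):
    mib = kv_cache_bytes(seq_len=tokens, **QWEN3_1P7B) / 2**20
    print(f"{tokens:>7} tokens -> {mib:,.0f} MiB")
```

Because every term is a plain product, quadrupling the context quadruples the cache, which is why the memory pressure is most visible at long context.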

Jet-Nemotron tries to combine the inference-time efficiency of linear attention with the accuracy of a transformer that has already done its pre-training. The PostNAS procedure makes that practical because the search loop never touches the expensive MLP weights, so each candidate architecture costs only a fraction of a full retrain to evaluate.

How does PostNAS work?

PostNAS, short for Post Neural Architecture Search, is the central contribution of the paper. The name is a deliberate inversion of the usual NAS workflow. Where conventional architecture search picks a design first and trains afterward, PostNAS picks a pre-trained model first and searches the architecture around its already-trained components. As the authors put it, "Rather than pre-training models from scratch, we explore novel architectures by building on top of existing full-attention models."^[1] The pipeline has four stages, described below in the order the authors apply them.

Stage 1: optimal full-attention layer placement and elimination

The first stage starts from a pre-trained full-attention model, in practice a Qwen2.5-1.5B or Qwen2.5-3B base checkpoint.^[14] All MLP weights are frozen. The search then asks which of the original full-attention layers can be removed entirely, which can be replaced with cheaper linear attention, and which must be kept as full attention to preserve accuracy.

The authors implement this as a super-network. Each transformer layer is augmented with an optional full-attention branch and an optional linear attention branch, both of which feed into the same residual stream. The system is trained briefly with both branches active and a gating function that learns to favor one or the other. A beam search over the resulting gates produces the best assignment of attention types per layer for a given target compute budget. The key empirical finding is that a small number of full-attention layers, placed at specific depths, is enough to preserve accuracy on retrieval-style tasks. The remaining layers can be replaced with linear attention without measurable loss on most benchmarks.^[1]
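A toy version of the placement search can make the idea concrete. This is a minimal sketch, not the authors' implementation: the scoring function is a hypothetical stand-in for accuracy measured on the trained super-network, and the layer count, gains, and beam width are made-up values chosen only for illustration.

```python
def beam_search_placement(num_layers, budget, score_fn, beam_width=3):
    """Pick `budget` layers to keep as full attention, growing placements one layer at a time."""
    beam = [()]  # each entry is a sorted tuple of layer indices kept as full attention
    for _ in range(budget):
        candidates = set()
        for placement in beam:
            for layer in range(num_layers):
                if layer not in placement:
                    candidates.add(tuple(sorted(placement + (layer,))))
        # Keep the highest-scoring placements; ties are broken by index for determinism.
        beam = sorted(candidates, key=lambda p: (-score_fn(p), p))[:beam_width]
    return beam[0]

# Hypothetical per-layer usefulness; in the real pipeline this would be measured accuracy.
TOY_GAIN = {5: 2.0, 9: 1.5, 2: 0.4}

def toy_score(placement):
    return sum(TOY_GAIN.get(layer, 0.1) for layer in placement)

best = beam_search_placement(num_layers=12, budget=2, score_fn=toy_score)
print(best)  # (5, 9)
```

The beam keeps only a few partial placements alive at each step, so the number of scored candidates grows with the budget and beam width rather than with the number of possible layer subsets.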

Stage 2: linear attention block selection

The second stage searches over what type of linear attention to use in the replaced layers. The authors evaluate six candidates from the recent literature: Mamba 2, RWKV-7, Gated Linear Attention (GLA), RetNet, DeltaNet, and Gated DeltaNet.^[1] Each candidate is plugged into the same super-network skeleton in turn and evaluated on a common downstream benchmark suite. Gated DeltaNet wins this round of the search, which the paper attributes to its combination of a delta rule update and a learned gating mechanism that together preserve more useful information in the recurrent state than the alternatives.^[1]

Stage 3: designing JetBlock

The third stage takes Gated DeltaNet as a starting point and modifies it. The output of this stage is the JetBlock, a new linear attention block that incorporates dynamic causal convolution into the value path. As the authors describe it, "JetBlock uses a kernel generator to produce dynamic causal convolution kernels conditioned on the input, which are then applied to the value (V) tokens."^[1] In standard Gated DeltaNet and similar designs, a static depthwise convolution is applied to the queries, keys, and values before the recurrent kernel. JetBlock removes the static convolutions from the query and key paths, on the grounds that they add latency without adding capacity, and replaces the static value convolution with a dynamic one. The kernel weights for the value convolution are produced by a small kernel generator network conditioned on the input itself, with an 8x reduction ratio and a SiLU activation. The generator adds only a small parameter overhead but lets the block adapt its local mixing pattern to the content of each sequence.
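The value-path mechanism can be sketched in NumPy. This is an assumption-laden illustration rather than the released kernel: the function names, tensor shapes, and random weights are hypothetical, and only the overall shape of the idea (a generator with an 8x reduction and SiLU that emits per-token kernels, applied causally to V) is taken from the description above.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def dynamic_causal_conv(x, v, w_down, w_up, kernel_size):
    """Apply input-conditioned causal depthwise convolution to value tokens.

    x: (T, D_in) block input that conditions the kernels
    v: (T, D) value tokens
    w_down: (D_in, D_in // 8) and w_up: (D_in // 8, D * kernel_size) form the generator
    """
    T, D = v.shape
    hidden = silu(x @ w_down)                              # 8x reduction, then SiLU
    kernels = (hidden @ w_up).reshape(T, D, kernel_size)   # one kernel per token and channel
    # Left-pad with zeros so position t only sees v[t-K+1 .. t] (causal).
    padded = np.concatenate([np.zeros((kernel_size - 1, D)), v], axis=0)
    out = np.zeros((T, D))
    for t in range(T):
        window = padded[t:t + kernel_size]                 # rows are v[t-K+1 .. t]
        out[t] = np.einsum("kd,dk->d", window, kernels[t])
    return out

rng = np.random.default_rng(0)
T, D, K = 6, 16, 4
x, v = rng.normal(size=(T, D)), rng.normal(size=(T, D))
w_down, w_up = rng.normal(size=(D, D // 8)), rng.normal(size=(D // 8, D * K))
print(dynamic_causal_conv(x, v, w_down, w_up, K).shape)  # (6, 16)
```

A static convolution would use one kernel for every position; here the kernel at position t is a function of the input at position t, which is what lets the local mixing pattern vary with content.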

The authors report that on the same accuracy budget, JetBlock outperforms Gated DeltaNet and the other linear attention baselines, while costing essentially the same in tokens per second.^[1] They credit the dynamic kernel for the gain on tasks that involve in-context lookup, such as retrieval-augmented question answering.

Stage 4: hardware-aware hyperparameter search

The final stage is a hardware-aware search over the remaining hyperparameters: number of heads, head dimension, convolution kernel size, value expansion factor, and where exactly in the network to place the few remaining full-attention layers. Candidate configurations are scored using a combined objective that mixes accuracy on a held-out validation set with measured throughput on H100 GPUs at long context. Because each candidate inherits frozen MLP weights and only retrains the attention parameters, the search is cheap enough to evaluate hundreds of options.

The total compute budget for the full pipeline is around 18,328 H100 GPU hours: roughly 10,168 hours for the PostNAS search itself and 8,160 hours to fine-tune the chosen architecture on 400 billion tokens (50 billion in the first training stage and 350 billion in the second; these training stages are distinct from the four PostNAS stages above).^[1] That is roughly one to two orders of magnitude less than what a full pre-training run of a comparable model would cost, which is the central practical claim of the method.

What is the Jet-Nemotron architecture?

Jet-Nemotron is a decoder-only language model with a hybrid attention stack. Most of its layers use JetBlock; a small minority retain full softmax attention at depths chosen by the PostNAS search. For example, the Jet-Nemotron-2B stack has 28 blocks in total, of which just two are full-attention layers (at positions 15 and 20) and two are sliding window attention (SWA) layers (at positions 21 and 22), with the remaining 24 blocks using JetBlock.^[1] The authors note that sliding window attention "effectively preserves the accuracy" on multiple-choice tasks such as MMLU, while the two full-attention layers carry the retrieval-heavy work.^[1] The MLP weights, vocabulary, and tokenizer are inherited unchanged from the Qwen2.5 base checkpoints, which means the model uses the Qwen2.5 BPE tokenizer and has the same hidden dimension and feed-forward width as the corresponding Qwen2.5 base.^[14] Context length is supported up to 256,000 tokens during inference, although most accuracy benchmarks are evaluated at shorter context.^[4]
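The 2B layout described above can be written down as a simple list of block types. This is a sketch for illustration: the function and label names are hypothetical, and the positions are used as list indices exactly as quoted, since the article does not say whether the paper counts layers from 0 or from 1.

```python
from collections import Counter

def build_layout(num_blocks, full_attention, sliding_window):
    """Return one block-type label per layer position."""
    layout = []
    for position in range(num_blocks):
        if position in full_attention:
            layout.append("full_attention")
        elif position in sliding_window:
            layout.append("sliding_window")
        else:
            layout.append("jetblock")
    return layout

# Positions as quoted for Jet-Nemotron-2B; the indexing convention is an assumption.
layout = build_layout(28, full_attention={15, 20}, sliding_window={21, 22})
counts = Counter(layout)
print(counts)
```

The counts reproduce the split stated in the text: 24 JetBlock layers, 2 full-attention layers, and 2 sliding window layers.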

The full-attention layers in the hybrid stack are standard grouped-query attention modules carried over from the Qwen2.5 base. They consume a KV cache that grows with sequence length, but because there are only a handful of them, the total KV memory at long context is a small fraction of what a fully full-attention model would need. The JetBlock layers carry no KV cache in the usual sense; they maintain a fixed-size recurrent state that does not grow with sequence length. This is the source of most of the inference speedup at long context. As the paper puts it, "KV cache size is the most critical factor influencing long-context and long-generation throughput," more so even than raw parameter count.^[1]
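A rough calculation shows how much of the cache disappears when only a couple of layers keep full attention. The sketch assumes the Qwen2.5-1.5B attention configuration (2 key-value heads, head dimension 128, 28 layers, 2-byte bf16 values), which is not stated in this article, and it counts only the full-attention layers; the sliding window layers and the fixed-size JetBlock states would add a further, context-independent amount on top.

```python
def full_attention_cache_mib(num_full_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """MiB of KV cache contributed by layers that keep full softmax attention."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value  # keys and values
    return num_full_layers * seq_len * per_token_per_layer / 2**20

SEQ_LEN = 64 * 1024
# Assumed Qwen2.5-1.5B grouped-query attention settings.
dense = full_attention_cache_mib(28, num_kv_heads=2, head_dim=128, seq_len=SEQ_LEN)
hybrid = full_attention_cache_mib(2, num_kv_heads=2, head_dim=128, seq_len=SEQ_LEN)
print(f"all 28 layers full attention: {dense:.0f} MiB")
print(f"only 2 layers full attention: {hybrid:.0f} MiB")
```

Under these assumptions the two full-attention layers account for 128 MiB at 64K tokens, which is below the 154 MB total reported for Jet-Nemotron-2B and therefore at least consistent with it.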

The model supports flash_attention_2 for the residual full-attention layers and a custom CUDA kernel for the JetBlock layers, both of which are required to reproduce the throughput numbers.^[3]^[10] The Hugging Face model card notes that loading the model requires trust_remote_code=True because the JetBlock implementation is not yet part of the upstream transformers library.^[4]
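A loading call along these lines might look as follows. Treat it as a sketch under assumptions: the helper names are hypothetical, the repository id is taken from the model table in this article, and the snippet has not been checked against the current model card. `trust_remote_code`, `attn_implementation`, and `torch_dtype` are standard `from_pretrained` arguments in Hugging Face Transformers; actually running the load requires a CUDA GPU with the flash-attn package installed.

```python
def jet_nemotron_load_kwargs():
    """Keyword arguments implied by the requirements described above."""
    return {
        "trust_remote_code": True,                   # JetBlock code ships with the checkpoint
        "attn_implementation": "flash_attention_2",  # used by the residual full-attention layers
    }

def load_jet_nemotron(repo_id="jet-ai/Jet-Nemotron-2B"):
    """Download and load the tokenizer and model (needs a GPU, torch, transformers, flash-attn)."""
    # Imported lazily so the helper above can be inspected without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, torch_dtype=torch.bfloat16, **jet_nemotron_load_kwargs()
    )
    return tokenizer, model.eval().cuda()
```

Enabling `trust_remote_code` executes Python code from the model repository, so it is worth reviewing that code or pinning a revision before using it in a shared environment.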

What Jet-Nemotron model sizes are available?

The Jet-Nemotron family ships in two sizes, both released as base models without an instruction-tuned variant.^[4]^[5]

Variant	Parameters	Base checkpoint	Context length	Hugging Face repository
Jet-Nemotron-2B	~2 billion	Qwen2.5-1.5B	256K	jet-ai/Jet-Nemotron-2B
Jet-Nemotron-4B	~4 billion	Qwen2.5-3B	256K	jet-ai/Jet-Nemotron-4B

The two variants share the same PostNAS-derived architecture template but use different absolute layer counts and head dimensions, inherited from the corresponding Qwen2.5 base.^[1] The 2B model is the headline variant for the speed comparisons in the paper, while the 4B model is the headline variant for accuracy.

NVIDIA did not release a 1B, 7B, or larger Jet-Nemotron variant alongside the initial drop. The authors note in the paper that the PostNAS recipe should generalize to larger base checkpoints, but they did not test it on models above 3B parameters at the time of submission.

How accurate is Jet-Nemotron on benchmarks?

The paper reports benchmark numbers for both Jet-Nemotron variants against the Qwen2.5 and Qwen3 base models, and against several pure linear attention baselines. The abstract summarizes the headline result: Jet-Nemotron-2B "achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks."^[1] The 2B variant is built on Qwen2.5-1.5B and is compared most directly to Qwen3-1.7B-Base, which has a similar parameter budget. The 4B variant is built on Qwen2.5-3B and is compared to Qwen3-4B-Base. All numbers below are taken from the arXiv preprint and the model card on Hugging Face.^[1]^[4]

Benchmark	Jet-Nemotron-2B	Qwen3-1.7B-Base	Qwen2.5-1.5B	Mamba2-2.7B	RWKV7-1.5B
MMLU	60.8	60.3	59.5	25.6	41.0
MMLU-Pro	39.0	37.8	28.9	8.6	13.4
BBH	58.3	54.2	44.1	32.6	35.5
GSM8K	76.2	62.8	62.4	36.0	39.5
MATH	23.3	16.7	13.1	7.4	6.4
ARC-C	48.6	44.9	45.4	41.6	43.3
MMLU-Stem	62.7	not reported	not reported	not reported	not reported
EvalPlus	60.8	not reported	not reported	not reported	not reported
CruXEval-I-CoT	61.1	not reported	not reported	not reported	not reported
CruXEval-O-CoT	56.7	not reported	not reported	not reported	not reported
LongBench	41.1	not reported	not reported	not reported	not reported

For the 4B variant, the headline numbers are 65.2 on MMLU, 44.2 on MMLU-Pro, 65.0 on BBH, 78.7 on GSM8K, and 25.2 on MATH.^[1] The authors note that on MMLU and MMLU-Pro, Jet-Nemotron-2B scores higher than two recent MoE full-attention models, DeepSeek-V3-Small and Moonlight, despite their larger scale of 15B total and 2.2B activated parameters.^[1] Against the comparable Qwen3-1.7B-Base baseline, the gap is largest on math and reasoning benchmarks: 13.4 points on GSM8K, 6.6 on MATH, and 4.1 on BBH. Some of that gap likely reflects differences in training data rather than the architecture itself, since the Qwen3 base checkpoints are not heavily tuned for math.

The comparison to pure linear attention baselines is more lopsided. Mamba2-2.7B and RWKV7-1.5B trail Jet-Nemotron-2B by more than 20 points on MMLU-Pro, BBH, and GSM8K, and by about 16 points on MATH.^[1]^[12]^[13] The authors argue that this is the cost of refusing to keep any full-attention layers at all, and that the small number of full-attention layers preserved by PostNAS recovers most of the lost accuracy at a marginal cost in throughput.
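The point gaps between models can be recomputed directly from the 2B benchmark table. The snippet below copies a subset of that table into a dictionary; the helper name is hypothetical, and no numbers beyond those in the table are introduced.

```python
# Scores copied from the 2B benchmark table in this article.
SCORES = {
    "MMLU-Pro": {"Jet-Nemotron-2B": 39.0, "Qwen3-1.7B-Base": 37.8, "Mamba2-2.7B": 8.6, "RWKV7-1.5B": 13.4},
    "BBH": {"Jet-Nemotron-2B": 58.3, "Qwen3-1.7B-Base": 54.2, "Mamba2-2.7B": 32.6, "RWKV7-1.5B": 35.5},
    "GSM8K": {"Jet-Nemotron-2B": 76.2, "Qwen3-1.7B-Base": 62.8, "Mamba2-2.7B": 36.0, "RWKV7-1.5B": 39.5},
    "MATH": {"Jet-Nemotron-2B": 23.3, "Qwen3-1.7B-Base": 16.7, "Mamba2-2.7B": 7.4, "RWKV7-1.5B": 6.4},
}

def gap(benchmark, baseline, model="Jet-Nemotron-2B"):
    """Points by which `model` leads `baseline` on one benchmark, to one decimal place."""
    return round(SCORES[benchmark][model] - SCORES[benchmark][baseline], 1)

for benchmark in SCORES:
    row = {name: gap(benchmark, name) for name in ("Qwen3-1.7B-Base", "Mamba2-2.7B", "RWKV7-1.5B")}
    print(benchmark, row)
```

Laying the gaps out this way makes it easy to see that the lead over Qwen3-1.7B-Base is concentrated in GSM8K, while the lead over the pure linear attention baselines is large on every benchmark listed.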

How much faster is Jet-Nemotron?

The inference benchmarks are run on a single NVIDIA H100 80GB GPU using the authors' custom CUDA kernels for JetBlock and flash_attention_2 for the residual full-attention layers.^[1] The headline speedups assume long context, where the KV cache savings of the hybrid architecture matter most.

Setting	Jet-Nemotron-2B	Qwen3-1.7B-Base	Speedup
Tokens per second at 64K context, max batch size	2,885	~61	47x
Tokens per second at 256K context, max batch size	2,885	~54	53.6x
Prefill speedup at 256K context	not reported as ratio	baseline	6.14x
KV cache size at 64K context	154 MB	7,168 MB	47x reduction

The 4B variant is slower in absolute terms (around 1,271 tokens per second on the same setup) but still about 21 times faster than the Qwen3-1.7B-Base throughput baseline; the paper notes that Jet-Nemotron-4B "still achieves higher generation throughput than all full-attention models with less than 2B parameters" despite its larger size.^[1] Its KV cache is correspondingly small at 258 MB.^[1] Pure linear attention models such as Mamba2-2.7B (2,507 tokens per second) and RWKV7-1.5B (3,050 tokens per second) are competitive on throughput, but the accuracy gap on knowledge benchmarks is large enough that the comparison is rarely cited.

The paper makes a derived cost claim from these numbers. A 53.6x decoding speedup at 256K context implies roughly 98% lower inference cost per million tokens of generation at that context length, if the rest of the deployment is held constant.^[1] The MarkTechPost coverage from August 26, 2025 led with that figure under the headline that Jet-Nemotron offers a 98% cost reduction for inference at scale.^[6] The 98% number is not a measured deployment cost but a direct algebraic consequence of the throughput speedup (1 minus 1 divided by 53.6), and it applies most cleanly to long-context workloads. At short context the speedup falls off quickly; prefilling is only about 1.1x faster at 4K to 8K tokens.^[1]
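The arithmetic behind these ratios is short enough to write out. The sketch below uses only figures quoted in this section; the function names are hypothetical, and the cost model is the same simplifying assumption the text describes, namely that cost per token scales inversely with throughput and everything else is held constant.

```python
def speedup(fast_tokens_per_s, slow_tokens_per_s):
    """Throughput ratio between two models on the same setup."""
    return fast_tokens_per_s / slow_tokens_per_s

def cost_reduction(speedup_factor):
    """Fractional drop in cost per generated token if cost is inversely proportional to throughput."""
    return 1.0 - 1.0 / speedup_factor

print(round(speedup(2885, 61), 1))           # 47.3, the 64K-context row
print(round(speedup(1271, 61), 1))           # 20.8, the "about 21 times" figure for the 4B model
print(round(cost_reduction(53.6) * 100, 1))  # 98.1 percent at 256K context
print(round(cost_reduction(1.1) * 100, 1))   # 9.1 percent for a 1.1x short-context prefill speedup
```

The last line shows why the 98% figure should not be carried over to short-context workloads: a 1.1x speedup corresponds to a saving of only about 9% under the same assumption.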

A secondary practical claim is that the 2B model's KV cache is small enough to deploy on commodity hardware. NVIDIA reports running Jet-Nemotron-2B inference on a Jetson Orin and on a single RTX 3090, where it achieves roughly 8.8x and 6.5x higher throughput than Qwen2.5-1.5B, respectively. Both devices would struggle to serve a full-attention model of similar accuracy at 64K context.^[1]

Is Jet-Nemotron open source?

Partly. The release uses two licenses, one for the code and one for the model weights, in line with NVIDIA's standard practice for research model releases.^[3]^[4] The inference code on GitHub is released under a permissive open-source license, while the model weights on Hugging Face are released under the NVIDIA Open Model License (the model card for the original drop refers to the noncommercial variant as ncslv1). Jet-Nemotron is therefore open weight rather than fully permissively licensed: the model license permits research use and modification but restricts commercial use, and it requires users to comply with NVIDIA's responsible AI guidelines.

This is a more restrictive arrangement than Qwen2.5, which Jet-Nemotron is built on top of. Qwen2.5-1.5B and Qwen2.5-3B are released under Apache 2.0, which would permit commercial use.^[14] The downstream Jet-Nemotron license tightens the terms because the resulting weights include NVIDIA's own contribution from PostNAS fine-tuning. Users who want a permissively licensed efficient model in this size range still have alternatives from the open-source community, including the original Qwen2.5 bases.

The authors have not, as of the initial release, published the PostNAS search code itself in a form that would let third parties run the procedure on their own base checkpoints. The released code is sufficient to load and run the published Jet-Nemotron weights but does not include the super-network training and beam search components in turn-key form.^[3]^[10]

How does Jet-Nemotron compare to other small models?

The most natural peers for Jet-Nemotron are other small efficient language models in the 1 to 4 billion parameter range that prioritize inference cost. The comparison below uses numbers from the Jet-Nemotron paper and from each peer's own technical report, evaluated at short context where most accuracy benchmarks live.^[1]

Model	Parameters	Architecture	MMLU	MMLU-Pro	Tokens/sec at long context (H100)	License
Jet-Nemotron-2B	~2B	Hybrid (JetBlock plus residual full attention)	60.8	39.0	2,885	NVIDIA Open Model License
Qwen3-1.7B-Base	1.7B	Full attention transformer	60.3	37.8	~54	Apache 2.0
Mamba 2 2.7B	2.7B	State space (pure linear)	25.6	8.6	2,507	Apache 2.0
RWKV-7 1.5B	1.5B	RNN-style linear attention	41.0	13.4	3,050	Apache 2.0
SmolLM3	3B	Full attention transformer	~59	~30	not directly comparable	Apache 2.0
Phi-4-mini	3.8B	Full attention transformer	~67	~45	not directly comparable	MIT

The table makes clear what Jet-Nemotron trades. Against a full-attention peer of similar size such as Qwen3-1.7B-Base, it matches accuracy on knowledge benchmarks and improves on math and reasoning, while running 47 to 53.6 times faster at long context.^[1] Against pure linear attention peers such as Mamba 2 or RWKV-7 at similar parameter counts, its throughput is in the same ballpark and it is dramatically more accurate on knowledge-heavy benchmarks.^[12]^[13] Against more recent small full-attention models with stronger post-training such as Phi-4-mini, it still trails on raw MMLU and MMLU-Pro accuracy; the value proposition there is purely about throughput and KV cache footprint, not absolute leaderboard position.

How was Jet-Nemotron received?

Jet-Nemotron drew immediate attention in the technical press in late August 2025, driven by the headline 53x throughput claim. MarkTechPost, 36Kr, and several research-oriented Substacks ran detailed write-ups within a week of the arXiv posting.^[6]^[7]^[11] The 36Kr coverage framed the work as a direct answer to Mamba 2 and RWKV-7, emphasizing that NVIDIA had produced a hybrid that beats both on accuracy without losing much on throughput.^[7]^[8]

Reception in the research community has been more measured. The paper was accepted to NeurIPS 2025 in September after a relatively short review cycle, which the authors point to as validation of the search methodology.^[9] Critical commentary, mostly in independent technical newsletters, focused on three points.^[11] First, the headline 53.6x throughput number is a long-context decoding number on a single GPU, and the speedup at short context is much smaller. Second, the comparison to Mamba 2 and RWKV-7 uses smaller versions of those models than are typically run in production, and the gap narrows somewhat against the larger variants. Third, the PostNAS recipe is presented as general, but the published artifact is a specific Jet-Nemotron architecture rather than a turn-key search tool that other groups can run on their own checkpoints.

The practical impact within NVIDIA has been to put hybrid linear attention designs more squarely on the roadmap. The Nemotron 3 family announced later in 2025 also combines multiple attention types in a single stack, although it is not a direct descendant of the Jet-Nemotron architecture. As a public research result, Jet-Nemotron is most often cited as a proof point that architecture search on top of a frozen MLP backbone can produce useful efficient models without paying the full cost of a new pre-training run.^[2]

References

1. Gu, Y., Hu, Q., Yang, S., Xi, H., Chen, J., Han, S., and Cai, H. "Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search." arXiv:2508.15884, August 21, 2025. https://arxiv.org/abs/2508.15884
2. NVIDIA Research, Efficient AI Group. "Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search." Project page. https://research.nvidia.com/labs/eai/publication/jetnemotron/
3. NVlabs. "Jet-Nemotron GitHub repository." Released September 29, 2025. https://github.com/NVlabs/Jet-Nemotron
4. jet-ai. "Jet-Nemotron-2B model card." Hugging Face, September 2025. https://huggingface.co/jet-ai/Jet-Nemotron-2B
5. jet-ai. "Jet-Nemotron-4B model card." Hugging Face, September 2025. https://huggingface.co/jet-ai/Jet-Nemotron-4B
6. Sharma, A. "NVIDIA AI Released Jet-Nemotron: 53x Faster Hybrid-Architecture Language Model Series that Translates to a 98% Cost Reduction for Inference at Scale." MarkTechPost, August 26, 2025. https://www.marktechpost.com/2025/08/26/nvidia-ai-released-jet-nemotron-53x-faster-hybrid-architecture-language-model-series-that-translates-to-a-98-cost-reduction-for-inference-at-scale/
7. 36Kr. "NVIDIA Unveils New Model: 53x Surge in 4B Inference Speed, New Attention Architecture Outperforms Mamba 2." August 2025. https://eu.36kr.com/en/p/3440425121944962
8. 36Kr. "New Work from Han Song's Team at NVIDIA: An Efficient Language Model with Post-Neural Architecture Search." August 2025. https://eu.36kr.com/en/p/3440505371022981
9. OpenReview. "Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search." NeurIPS 2025. https://openreview.net/forum?id=WZQXaTNYEB
10. DeepWiki. "NVlabs/Jet-Nemotron." Auto-generated repository documentation. https://deepwiki.com/NVlabs/Jet-Nemotron
11. The Salt. "Jet-Nemotron: Searching for the Best Attention Architecture." Substack, August 2025. https://thesalt.substack.com/p/jet-nemotron-searching-for-the-best
12. Dao, T., and Gu, A. "Mamba 2: Transformers are SSMs." arXiv:2405.21060, 2024. https://arxiv.org/abs/2405.21060
13. Peng, B., et al. "RWKV-7: Goose with Expressive Dynamic State Evolution." 2025. https://www.rwkv.com
14. Qwen Team. "Qwen2.5 Technical Report." arXiv:2412.15115, December 2024. https://arxiv.org/abs/2412.15115
