Jet-Nemotron
Last reviewed
May 16, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 3,605 words
Jet-Nemotron is a family of small hybrid-architecture language models released by NVIDIA Research in August 2025. The family ships in two sizes, Jet-Nemotron-2B and Jet-Nemotron-4B, and is built with an architecture exploration method the authors call Post Neural Architecture Search (PostNAS). Rather than training a new model from scratch, PostNAS starts from an existing full-attention checkpoint, freezes its MLP weights, and surgically replaces most of the attention layers with a new linear attention block named JetBlock. The result is a decoder-only model that, on NVIDIA H100 hardware at long context, runs roughly 21 to 53 times faster than the Qwen3-1.7B-Base comparison model (depending on variant and context length) while reporting equal or higher accuracy on standard knowledge, math, and code benchmarks.
The paper introducing Jet-Nemotron, Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search, was posted to arXiv on August 21, 2025 (preprint 2508.15884) by Yuxian Gu, Qinghao Hu, Haocheng Xi, Junyu Chen, Shang Yang, Song Han, and Han Cai. The same group released the inference code on GitHub (NVlabs/Jet-Nemotron) and the model weights on Hugging Face (under the jet-ai organization) on September 29, 2025, shortly after the paper was accepted to NeurIPS 2025. Jet-Nemotron sits in the broader Nemotron research line at NVIDIA, but it is structurally separate from the production Nemotron 3 chat models released later in 2025 and is best understood as an efficiency research artifact rather than a general-purpose assistant.
NVIDIA began publishing the Nemotron family of models in 2024 as part of a broader push to make GPU-accelerated language model training and serving a first-class product line. Early entries included Nemotron-4 15B and the Nemotron-4 340B reward and instruct models, followed in 2025 by Nemotron Nano, Llama Nemotron, and the Nemotron 3 family of open-weight chat models. These releases shared two themes: open weights with permissive enough licensing for research, and architecture choices motivated by inference economics on NVIDIA's own Hopper and Blackwell GPUs.
Jet-Nemotron belongs to the same product line by branding, but the people behind it come from the Efficient AI group at NVIDIA Research, which has worked for several years on hardware-aware model compression, quantization, and architecture search. Song Han, who co-authored the paper, is a long-time researcher on model efficiency and is also a faculty member at MIT. The Jet-Nemotron paper credits influences from earlier work on linear attention, gated state space models, and hybrid architectures such as Jamba and Zamba, but its specific contribution is the search procedure rather than the underlying block designs themselves.
The context for Jet-Nemotron is that pre-training a new foundation model from scratch is enormously expensive. Even modest 1 to 4 billion parameter models cost tens of thousands of GPU hours to train on a few trillion tokens of text. That cost makes large-scale neural architecture search impractical: a normal NAS run might evaluate hundreds or thousands of candidate architectures, and at this scale each candidate would itself cost a small fortune to train. Researchers had largely abandoned NAS for transformer language models by 2023 because the search loop was unaffordable.
A second pressure is that full softmax attention does not scale gracefully at long context. Memory for the KV cache grows linearly with sequence length and number of layers, and the attention computation itself grows quadratically in the prefill phase. At 64K tokens of context, a model like Qwen3-1.7B already has to maintain a KV cache of about 7 GB, and that cache keeps growing linearly as context extends toward 256K. This limits batch size and pushes the achievable tokens per second on a single GPU into the double digits. Linear attention architectures, such as Mamba 2, RWKV-7, GLA, RetNet, DeltaNet, and Gated DeltaNet, address this by replacing the softmax kernel with a recurrent or kernelized formulation that uses a fixed-size state, but they have historically lagged on knowledge benchmarks such as MMLU.
Jet-Nemotron tries to combine the inference-time efficiency of linear attention with the accuracy of a transformer that has already done its pre-training. The PostNAS procedure makes that practical because the search loop never touches the expensive MLP weights, so each candidate architecture costs only a fraction of a full retrain to evaluate.
PostNAS, short for Post Neural Architecture Search, is the central contribution of the paper. The name is a deliberate inversion of the usual NAS workflow. Where conventional architecture search picks a design first and trains afterward, PostNAS picks a pre-trained model first and searches the architecture around its already-trained components. The pipeline has four stages, described below in the order the authors apply them.
The first stage starts from a pre-trained full-attention model, in practice a Qwen2.5-1.5B or Qwen2.5-3B base checkpoint. All MLP weights are frozen. The search then asks which of the original full-attention layers can be removed entirely, which can be replaced with cheaper linear attention, and which must be kept as full attention to preserve accuracy.
The authors implement this as a super-network. Each transformer layer is augmented with an optional full-attention branch and an optional linear attention branch, both of which feed into the same residual stream. The system is trained briefly with both branches active and a gating function that learns to favor one or the other. A beam search over the resulting gates produces the best assignment of attention types per layer for a given target compute budget. The key empirical finding is that a small number of full-attention layers, placed at specific depths, is enough to preserve accuracy on retrieval-style tasks. The remaining layers can be replaced with linear attention without measurable loss on most benchmarks.
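The placement search can be illustrated with a small beam search. The sketch below is not the authors' code. It assumes the compute budget is simplified to a cap on the number of full-attention layers, `score_fn` is a stand-in for evaluating a sub-network drawn from the trained super-network, and the per-layer importance values are invented purely for illustration.

```python
from typing import Callable, FrozenSet, List

Placement = FrozenSet[int]  # indices of layers kept as full attention


def beam_search_placement(
    num_layers: int,
    max_full_layers: int,
    score_fn: Callable[[Placement], float],
    beam_width: int = 3,
) -> List[int]:
    """Pick which layers keep full attention; all others become linear attention."""
    beam: List[Placement] = [frozenset()]
    for _ in range(max_full_layers):
        # Extend every placement in the beam by one more full-attention layer.
        candidates = {
            placement | {layer}
            for placement in beam
            for layer in range(num_layers)
            if layer not in placement
        }
        # Keep only the highest-scoring partial placements.
        beam = sorted(candidates, key=score_fn, reverse=True)[:beam_width]
    return sorted(max(beam, key=score_fn))


# Hypothetical per-layer importance; a real run would score each placement
# by evaluating the corresponding sub-network on a validation task.
TOY_IMPORTANCE = [0.10, 0.90, 0.20, 0.05, 0.70, 0.30]


def toy_score(placement: Placement) -> float:
    return sum(TOY_IMPORTANCE[i] for i in placement)


best = beam_search_placement(len(TOY_IMPORTANCE), 2, toy_score)
print(best)  # [1, 4]
```

With an additive toy score the beam is not strictly needed, but the same loop applies when the score of a placement depends on interactions between layers, which is the case the beam is meant to handle.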
The second stage searches over what type of linear attention to use in the replaced layers. The authors evaluate six candidates from the recent literature: Mamba 2, RWKV-7, Gated Linear Attention (GLA), RetNet, DeltaNet, and Gated DeltaNet. Each candidate is plugged into the same super-network skeleton in turn and evaluated on a common downstream benchmark suite. Gated DeltaNet wins this round of the search, which the paper attributes to its combination of a delta rule update and a learned gating mechanism that together preserve more useful information in the recurrent state than the alternatives.
The third stage takes Gated DeltaNet as a starting point and modifies it. The output of this stage is the JetBlock, a new linear attention block that incorporates dynamic causal convolution into the value path. In standard Gated DeltaNet and similar designs, a static depthwise convolution is applied to the queries, keys, and values before the recurrent kernel. JetBlock removes the static convolutions from the query and key paths, on the grounds that they add latency without adding capacity, and replaces the static value convolution with a dynamic one. The kernel weights for the value convolution are produced by a small kernel generator network conditioned on the input itself, with an 8x reduction ratio and a SiLU activation. The generator adds only a small parameter overhead but lets the block adapt its local mixing pattern to the content of each sequence.
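A rough NumPy sketch of the idea follows: a bottleneck generator produces one causal convolution kernel per token and channel, and that kernel is applied to the value path. It is a reading of the description above, not the released kernel. The tensor shapes, the two-layer generator, the placement of SiLU between the projections, and all variable names are assumptions; only the 8x reduction ratio and the SiLU activation come from the text.

```python
import numpy as np


def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))


def dynamic_causal_conv(x, values, w_down, w_up, kernel_size):
    """Apply an input-conditioned causal depthwise convolution to `values`.

    x:      (seq_len, d_model)  block input that conditions the kernels
    values: (seq_len, d_value)  value vectors to be mixed along time
    w_down: (d_model, d_model // reduction)                generator down-projection
    w_up:   (d_model // reduction, d_value * kernel_size)  generator up-projection
    """
    seq_len, d_value = values.shape
    # Kernel generator: bottleneck MLP with SiLU, one kernel per token and channel.
    kernels = (silu(x @ w_down) @ w_up).reshape(seq_len, d_value, kernel_size)
    # Left-pad with zeros so position t only mixes positions <= t.
    padded = np.concatenate([np.zeros((kernel_size - 1, d_value)), values], axis=0)
    out = np.empty_like(values)
    for t in range(seq_len):
        window = padded[t : t + kernel_size]  # (kernel_size, d_value), oldest first
        out[t] = np.einsum("kd,dk->d", window, kernels[t])
    return out


rng = np.random.default_rng(0)
seq_len, d_model, d_value, kernel_size, reduction = 6, 16, 16, 4, 8
x = rng.standard_normal((seq_len, d_model))
values = rng.standard_normal((seq_len, d_value))
w_down = rng.standard_normal((d_model, d_model // reduction))
w_up = rng.standard_normal((d_model // reduction, d_value * kernel_size))
out = dynamic_causal_conv(x, values, w_down, w_up, kernel_size)
print(out.shape)  # (6, 16)
```

A static convolution would use the same `kernels` array at every position. Here the kernel at position `t` depends on `x[t]`, which is what lets the local mixing pattern vary with content.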
The authors report that, in a like-for-like comparison, JetBlock is more accurate than Gated DeltaNet and the other linear attention baselines at essentially the same throughput. They credit the dynamic kernel for the gain on tasks that involve in-context lookup, such as retrieval-augmented question answering.
The final stage is a hardware-aware search over the remaining hyperparameters: number of heads, head dimension, convolution kernel size, value expansion factor, and where exactly in the network to place the few remaining full-attention layers. Candidate configurations are scored using a combined objective that mixes accuracy on a held-out validation set with measured throughput on H100 GPUs at long context. Because each candidate inherits frozen MLP weights and only retrains the attention parameters, the search is cheap enough to evaluate hundreds of options.
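The exact form of the combined objective is not given here, so the following is only one plausible shape for it: a weighted sum of validation accuracy and throughput normalized by a reference value. The `Candidate` fields, the `weight` parameter, the normalization, and every number in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float    # validation accuracy in [0, 1]
    throughput: float  # measured tokens per second on the target GPU


def combined_score(c: Candidate, ref_throughput: float, weight: float = 0.1) -> float:
    """Assumed objective: accuracy plus a weighted, normalized throughput term."""
    return c.accuracy + weight * (c.throughput / ref_throughput)


def pick_best(candidates: List[Candidate], ref_throughput: float, weight: float = 0.1) -> Candidate:
    return max(candidates, key=lambda c: combined_score(c, ref_throughput, weight))


# Made-up configurations, only to exercise the scoring function.
candidates = [
    Candidate("wide-heads", accuracy=0.62, throughput=2400.0),
    Candidate("balanced", accuracy=0.61, throughput=2900.0),
    Candidate("small-kernel", accuracy=0.58, throughput=3000.0),
]
best = pick_best(candidates, ref_throughput=3000.0)
print(best.name)  # balanced
```

Setting `weight` to zero reduces the search to picking the most accurate candidate, while larger weights favor faster configurations. The throughput term has to come from real measurements on the target hardware, since parameter count alone is a poor proxy for decoding speed.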
The total compute budget for the full pipeline, including the post-search fine-tuning of the chosen architecture on 400 billion tokens (50 billion in the first training stage and 350 billion in the second), is around 18,328 GPU hours on 32 H100 nodes. That is roughly one to two orders of magnitude less than what a full pre-training run of a comparable model would cost, which is the central practical claim of the method.
Jet-Nemotron is a decoder-only language model with a hybrid attention stack. Most of its layers use JetBlock; a small minority retain full softmax attention at depths chosen by the PostNAS search. The MLP weights, vocabulary, and tokenizer are inherited unchanged from the Qwen2.5 base checkpoints, which means the model uses the Qwen2.5 BPE tokenizer and has the same hidden dimension and feed-forward width as the corresponding Qwen2.5 base. Context length is supported up to 256,000 tokens during inference, although most accuracy benchmarks are evaluated at shorter context.
The full-attention layers in the hybrid stack are standard grouped-query attention modules carried over from the Qwen2.5 base. They consume a KV cache that grows with sequence length, but because there are only a handful of them, the total KV memory at long context is a small fraction of what a fully full-attention model would need. The JetBlock layers carry no KV cache in the usual sense; they maintain a fixed-size recurrent state that does not grow with sequence length. This is the source of most of the inference speedup at long context.
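The scaling argument can be checked with a few lines of arithmetic. The sketch below assumes a 16-bit cache and a Qwen3-1.7B shape of 28 layers, 8 KV heads, and head dimension 128. Those configuration values are not stated in this article, but with them the formula reproduces the 7,168 MB figure reported for Qwen3-1.7B-Base at 64K context. The hybrid case is purely hypothetical and does not attempt to reproduce the 154 MB reported for Jet-Nemotron-2B, which inherits a different attention shape from Qwen2.5.

```python
def kv_cache_bytes(seq_len, num_attn_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one sequence: keys and values for every attention layer."""
    return 2 * num_attn_layers * num_kv_heads * head_dim * bytes_per_value * seq_len


MIB = 1024 ** 2
SEQ_64K = 64 * 1024

# Assumed Qwen3-1.7B shape: 28 layers, 8 KV heads, head dimension 128, 16-bit cache.
full = kv_cache_bytes(SEQ_64K, num_attn_layers=28, num_kv_heads=8, head_dim=128)
print(full / MIB)  # 7168.0

# Same shape, but hypothetically only 2 of the 28 layers keep a KV cache.
hybrid = kv_cache_bytes(SEQ_64K, num_attn_layers=2, num_kv_heads=8, head_dim=128)
print(hybrid / MIB)  # 512.0
```

Because the cache is linear in the number of attention layers, keeping 2 of 28 layers cuts it by a factor of 14 before any other change. The fixed-size recurrent state of the linear attention layers adds a constant that does not depend on `seq_len`.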
The model supports flash_attention_2 for the residual full-attention layers and a custom CUDA kernel for the JetBlock layers, both of which are required to reproduce the throughput numbers. The Hugging Face model card notes that loading the model requires trust_remote_code=True because the JetBlock implementation is not yet part of the upstream transformers library.
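A minimal loading sketch along those lines is shown below. It uses the standard `transformers` entry points, with the repository name taken from the variants table. The bfloat16 dtype, the helper names, and the prompt are assumptions rather than values taken from the model card, and actually running the loader needs a CUDA GPU with `torch`, `transformers`, and `flash-attn` installed.

```python
REPO_ID = "jet-ai/Jet-Nemotron-2B"


def loading_kwargs(use_flash_attention: bool = True) -> dict:
    """Keyword arguments for from_pretrained that the model card calls for."""
    kwargs = {"trust_remote_code": True}  # JetBlock code ships with the model repo
    if use_flash_attention:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def load_jet_nemotron(repo_id: str = REPO_ID):
    """Download and load tokenizer and model (heavy: needs GPU and network)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,  # assumed dtype; adjust to the model card
        **loading_kwargs(),
    )
    return tokenizer, model.eval().cuda()


def generate_demo(prompt: str = "Linear attention is", max_new_tokens: int = 32) -> str:
    """Hypothetical end-to-end usage; not called automatically."""
    tokenizer, model = load_jet_nemotron()
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Since `trust_remote_code=True` executes Python shipped in the model repository, it is worth pinning a specific revision and reviewing that code before using it outside a sandbox.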
The Jet-Nemotron family ships in two sizes, both released as base models without an instruction-tuned variant.
| Variant | Parameters | Base checkpoint | Context length | Hugging Face repository |
|---|---|---|---|---|
| Jet-Nemotron-2B | ~2 billion | Qwen2.5-1.5B | 256K | jet-ai/Jet-Nemotron-2B |
| Jet-Nemotron-4B | ~4 billion | Qwen2.5-3B | 256K | jet-ai/Jet-Nemotron-4B |
The two variants share the same PostNAS-derived architecture template but use different absolute layer counts and head dimensions, inherited from the corresponding Qwen2.5 base. The 2B model is the headline variant for the speed comparisons in the paper, while the 4B model is the headline variant for accuracy.
NVIDIA did not release a 1B, 7B, or larger Jet-Nemotron variant alongside the initial drop. The authors note in the paper that the PostNAS recipe should generalize to larger base checkpoints, but they did not test it on models above 3B parameters at the time of submission.
The paper reports benchmark numbers for both Jet-Nemotron variants against the Qwen2.5 and Qwen3 base models, and against several pure linear attention baselines. The 2B variant is built on Qwen2.5-1.5B and is compared most directly to Qwen3-1.7B-Base, which has a similar parameter budget. The 4B variant is built on Qwen2.5-3B and is compared to Qwen3-4B-Base. All numbers below are taken from the arXiv preprint and the model card on Hugging Face.
| Benchmark | Jet-Nemotron-2B | Qwen3-1.7B-Base | Qwen2.5-1.5B | Mamba2-2.7B | RWKV7-1.5B |
|---|---|---|---|---|---|
| MMLU | 60.8 | 60.3 | 59.5 | 25.6 | 41.0 |
| MMLU-Pro | 39.0 | 37.8 | 28.9 | 8.6 | 13.4 |
| BBH | 58.3 | 54.2 | 44.1 | 32.6 | 35.5 |
| GSM8K | 76.2 | 62.8 | 62.4 | 36.0 | 39.5 |
| MATH | 23.3 | 16.7 | 13.1 | 7.4 | 6.4 |
| ARC-C | 48.6 | 44.9 | 45.4 | 41.6 | 43.3 |
| MMLU-Stem | 62.7 | not reported | not reported | not reported | not reported |
| EvalPlus | 60.8 | not reported | not reported | not reported | not reported |
| CruXEval-I-CoT | 61.1 | not reported | not reported | not reported | not reported |
| CruXEval-O-CoT | 56.7 | not reported | not reported | not reported | not reported |
| LongBench | 41.1 | not reported | not reported | not reported | not reported |
For the 4B variant, the headline numbers are 65.2 on MMLU, 44.2 on MMLU-Pro, and 65.0 on BBH. The authors note that on MMLU and MMLU-Pro, Jet-Nemotron-2B actually scores higher than two recent MoE full-attention models, DeepSeek-V3-Small and Moonlight, despite using a fraction of the activated parameters. The gap is largest on math and reasoning benchmarks, where Jet-Nemotron-2B gains about 4 points on BBH, 7 on MATH, and 13 on GSM8K over the comparable Qwen3-1.7B-Base baseline. Some of that gap likely reflects better post-training rather than the architecture itself, since the Qwen3 base checkpoints are not heavily tuned for math.
The comparison to pure linear attention baselines is more lopsided. Mamba2-2.7B and RWKV7-1.5B trail Jet-Nemotron-2B by 20 or more points on MMLU-Pro, BBH, and GSM8K, and by about 16 points on MATH. The authors argue that this is the cost of refusing to keep any full-attention layers at all, and that the small number of full-attention layers preserved by PostNAS recovers most of the lost accuracy at a marginal cost in throughput.
The inference benchmarks are run on a single NVIDIA H100 80GB GPU using the authors' custom CUDA kernels for JetBlock and flash_attention_2 for the residual full-attention layers. The headline speedups assume long context, where the KV cache savings of the hybrid architecture matter most.
| Setting | Jet-Nemotron-2B | Qwen3-1.7B-Base | Speedup |
|---|---|---|---|
| Tokens per second at 64K context, max batch size | 2,885 | ~61 | 47x |
| Tokens per second at 256K context, max batch size | 2,885 | ~54 | 53.6x |
| Prefill speedup at 256K context | not reported as ratio | baseline | 6.1x |
| KV cache size at 64K context | 154 MB | 7,168 MB | 47x reduction |
The 4B variant is slower in absolute terms (around 1,271 tokens per second on the same setup) but still about 21 times faster than the much smaller Qwen3-1.7B-Base, which manages roughly 61 tokens per second. The cache reductions are roughly proportional. Pure linear attention models such as Mamba2-2.7B (2,507 tokens per second) and RWKV7-1.5B (3,050 tokens per second) are competitive on throughput, but their accuracy gap on knowledge benchmarks is large enough that the comparison is rarely cited.
The paper makes a derived cost claim from these numbers. A 53.6x decoding speedup at 256K context implies roughly 98% lower inference cost per million tokens of generation at that context length, if the rest of the deployment is held constant. The MarkTechPost coverage from August 26, 2025 led with that figure under the headline that Jet-Nemotron offers a 98% cost reduction for inference at scale. The 98% number is not a measured deployment cost but a direct algebraic consequence of the throughput speedup, and it applies most cleanly to long-context workloads. At short context the speedup falls off quickly.
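The arithmetic behind the 98% figure is short enough to spell out. The sketch assumes cost per generated token is inversely proportional to throughput with everything else held constant, which is the simplification the claim itself relies on. The function name is illustrative.

```python
def cost_reduction(speedup: float) -> float:
    """Fractional cost saving per generated token if cost scales as 1 / throughput."""
    return 1.0 - 1.0 / speedup


for label, speedup in [("64K context", 47.0), ("256K context", 53.6)]:
    print(f"{label}: {speedup}x faster -> {cost_reduction(speedup):.1%} lower cost")
# 64K context: 47.0x faster -> 97.9% lower cost
# 256K context: 53.6x faster -> 98.1% lower cost
```

The saving saturates quickly: a 10x speedup already gives 90%, so the difference between 47x and 53.6x moves the headline percentage by only a fraction of a point.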
A secondary practical claim is that the 2B model's KV cache is small enough to deploy on commodity hardware. NVIDIA reports running Jet-Nemotron-2B inference on a Jetson Orin and on a single RTX 3090, both of which would struggle with a full-attention model of similar accuracy at 64K context.
The Jet-Nemotron release uses two different licenses, one for the code and one for the model weights, in line with NVIDIA's standard practice for research model releases. The inference code on GitHub is released under a permissive open-source license, while the model weights on Hugging Face are released under the NVIDIA Open Model License (referred to in the model card as ncslv1 for the noncommercial variant on the original drop). The model license permits research use and modification but restricts commercial use, and it requires users to comply with NVIDIA's responsible AI guidelines.
This is a more restrictive arrangement than Qwen2.5, which Jet-Nemotron is built on top of. Qwen2.5-1.5B and Qwen2.5-3B are released under Apache 2.0, which would permit commercial use. The downstream Jet-Nemotron license tightens the terms because the resulting weights include NVIDIA's own contribution from PostNAS fine-tuning. Users who want a permissively licensed efficient model in this size range still have alternatives from the open-source community, including the original Qwen2.5 bases.
The authors have not, as of the initial release, published the PostNAS search code itself in a form that would let third parties run the procedure on their own base checkpoints. The released code is sufficient to load and run the published Jet-Nemotron weights but does not include the super-network training and beam search components in turn-key form.
The most natural peers for Jet-Nemotron are other small efficient language models in the 1 to 4 billion parameter range that prioritize inference cost. The comparison below uses numbers from the Jet-Nemotron paper and from each peer's own technical report, evaluated at short context where most accuracy benchmarks live.
| Model | Parameters | Architecture | MMLU | MMLU-Pro | Tokens/sec at long context (H100) | License |
|---|---|---|---|---|---|---|
| Jet-Nemotron-2B | ~2B | Hybrid (JetBlock plus residual full attention) | 60.8 | 39.0 | 2,885 | NVIDIA Open Model License |
| Qwen3-1.7B-Base | 1.7B | Full attention transformer | 60.3 | 37.8 | ~54 | Apache 2.0 |
| Mamba 2 2.7B | 2.7B | State space (pure linear) | 25.6 | 8.6 | 2,507 | Apache 2.0 |
| RWKV-7 1.5B | 1.5B | RNN-style linear attention | 41.0 | 13.4 | 3,050 | Apache 2.0 |
| SmolLM3 | 3B | Full attention transformer | ~59 | ~30 | not directly comparable | Apache 2.0 |
| Phi-4-mini | 3.8B | Full attention transformer | ~67 | ~45 | not directly comparable | MIT |
The table makes clear what Jet-Nemotron trades. Against a full-attention peer of similar size such as Qwen3-1.7B-Base, it matches accuracy on knowledge benchmarks and improves on math and reasoning, while running 47 to 53 times faster at long context. Against pure linear attention peers such as Mamba 2 or RWKV-7 at similar parameter counts, its throughput is in the same range and it is dramatically more accurate on knowledge-heavy benchmarks. Against more recent small full-attention models with stronger post-training such as Phi-4-mini, it still trails on raw MMLU and MMLU-Pro accuracy; the value proposition there is purely about throughput and KV cache footprint, not absolute leaderboard position.
Jet-Nemotron drew immediate attention in the technical press in late August 2025, driven by the headline 53x throughput claim. MarkTechPost, the Chinese technology outlet 36Kr, and several research-oriented Substacks ran detailed write-ups within a week of the arXiv posting. The 36Kr coverage framed the work as a direct answer to Mamba 2 and RWKV-7, emphasizing that NVIDIA had produced a hybrid that beats both on accuracy without losing much on throughput.
Reception in the research community has been more measured. The paper was accepted to NeurIPS 2025 in September after a relatively short review cycle, which the authors point to as validation of the search methodology. Critical commentary, mostly in independent technical newsletters, focused on three points. First, the headline 53x throughput number is a long-context decoding number on a single GPU, and the speedup at short context is much smaller. Second, the comparison to Mamba 2 and RWKV-7 uses smaller versions of those models than are typically run in production, and the gap narrows somewhat against the larger variants. Third, the PostNAS recipe is presented as general, but the published artifact is a specific Jet-Nemotron architecture rather than a turn-key search tool that other groups can run on their own checkpoints.
The practical impact within NVIDIA has been to put hybrid linear attention designs more squarely on the roadmap. The Nemotron 3 family announced later in 2025 also combines multiple attention types in a single stack, although it is not a direct descendant of the Jet-Nemotron architecture. As a public research result, Jet-Nemotron is most often cited as a proof point that architecture search on top of a frozen MLP backbone can produce useful efficient models without paying the full cost of a new pre-training run.