Hymba
Last reviewed: Jun 8, 2026
Sources: 7 citations
Review status: Source-backed
Revision: v1 · 1,555 words
Hymba is a hybrid-head neural-network architecture for small language models introduced by NVIDIA researchers in November 2024. Its defining idea is to run softmax attention heads and state space model (SSM) heads in parallel inside every layer, rather than alternating attention blocks and SSM blocks across separate layers as earlier hybrids do. Within each Hymba layer the same input is processed simultaneously by both head types, and the two results are fused, so every layer combines the high-resolution, content-addressable recall of attention with the efficient, constant-cost context summarization of an SSM [1][2].
The work was published as "Hymba: A Hybrid-head Architecture for Small Language Models" by Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Jan Kautz, Pavlo Molchanov and colleagues at NVIDIA. It was posted to arXiv on November 20, 2024 and accepted as a Spotlight paper at the International Conference on Learning Representations (ICLR) 2025 [1][6]. The reference model, Hymba-1.5B, has about 1.5 billion parameters and was released openly on Hugging Face and GitHub under the NVIDIA Open Model License Agreement, in both a base model (Hymba-1.5B-Base) and an instruction-tuned model (Hymba-1.5B-Instruct) [4][5]. NVIDIA reported that Hymba-1.5B-Base outperforms every published sub-2B model available at the time and even surpasses the larger Llama-3.2-3B while using a far smaller cache and running several times faster [1][2].
Standard Transformer language models rely on softmax attention, which scores each token against every token it is allowed to attend to (in a causal language model, all preceding tokens). This delivers precise, high-resolution recall of specific earlier tokens, but its compute cost grows quadratically with sequence length, and the model must store a key-value (KV) cache that grows linearly with context, making long-context inference memory-bound [1][2].
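The linear growth of the KV cache can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `kv_cache_bytes` and the layer, head, and dimension values are hypothetical and are not Hymba's or any other specific model's configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per KV head, per layer. bytes_per_value=2 assumes
    16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=64)

for seq_len in (1_024, 8_192):
    size_mib = kv_cache_bytes(seq_len=seq_len, **config) / 2**20
    print(f"{seq_len:>6} tokens -> {size_mib:.0f} MiB")
# 1,024 tokens -> 64 MiB and 8,192 tokens -> 512 MiB: 8x the context, 8x the cache.
```

Under these assumed values the cache scales exactly in proportion to the context length, which is the cost that the cache-reduction techniques described later in the article target.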
State space models such as Mamba take the opposite approach. They compress the past into a fixed-size recurrent state, so compute scales linearly and the memory footprint stays constant regardless of sequence length. This efficiency comes at a cost: because the state is a lossy summary, SSMs are weaker at tasks that demand exact recall of a specific earlier token, such as copying a string or retrieving a value from somewhere in the prompt [1][2].
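The fixed-size state can be illustrated with a toy linear recurrence. This is a generic diagonal SSM with constant parameters, written as a sketch; it is not Mamba's selective scan, whose parameters depend on the input, and the names `ssm_scan`, `decay`, `in_proj`, and `out_proj` are hypothetical.

```python
def ssm_scan(inputs, decay, in_proj, out_proj):
    """Run a toy diagonal linear SSM over a sequence of scalar inputs.

    Per step:  state[i] = decay[i] * state[i] + in_proj[i] * x
               y        = sum_i out_proj[i] * state[i]
    """
    state = [0.0] * len(decay)  # fixed-size memory, independent of sequence length
    outputs = []
    for x in inputs:
        state = [a * h + b * x for a, h, b in zip(decay, state, in_proj)]
        outputs.append(sum(c * h for c, h in zip(out_proj, state)))
    return outputs, state


# A single impulse followed by zeros: its trace in the state fades over time.
outputs, final_state = ssm_scan(
    inputs=[1.0, 0.0, 0.0, 0.0],
    decay=[0.5, 0.9],
    in_proj=[1.0, 1.0],
    out_proj=[1.0, 1.0],
)
print(outputs)           # each output is smaller than the one before
print(len(final_state))  # 2, no matter how long the input is
```

The state never grows, but the first input is only recoverable as a decaying blend with everything that followed, which is the lossy-summary behavior that makes exact recall hard.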
Earlier hybrids tried to capture both strengths by interleaving the two mechanisms across depth, placing some attention layers and some SSM layers in sequence. NVIDIA argued that this sequential stacking creates bottlenecks: information is forced through whichever block sits at a given depth, so a token that needs precise recall still has to pass through SSM-only layers, and a token that only needs cheap summarization still pays for attention layers. Hymba's response is to make both mechanisms available within every layer at once [1][3].
In a Hymba layer the projected input is fed in parallel to two groups of heads: attention heads that perform softmax attention, and SSM heads built on Mamba-2. Both groups read the same input, so the layer simultaneously produces a precise attention-based representation and an efficient recurrent summary [1][3].
The two streams are combined inside a single hybrid-head module. The authors observed that raw SSM outputs consistently carry larger magnitudes than attention outputs, which would let them dominate a naive sum. Each stream is therefore normalized first; the normalized attention and SSM outputs are then scaled by separate learnable per-channel vectors, averaged, and passed through a final output projection [3]. The learnable scaling factors let the model decide, channel by channel, how much weight to give precise recall versus summarized context. NVIDIA frames the two pathways as analogous to human memory: attention heads act like detailed snapshot memories, while SSM heads behave like fading, gist-level memories [1][2].
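The fusion step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not NVIDIA's implementation: RMS normalization is assumed as the normalizer, the function and parameter names are hypothetical, and the scaling vectors and output projection, which would be learned in a real model, are set to ones and the identity matrix.

```python
import math


def rms_normalize(vec, eps=1e-6):
    """Rescale a vector to roughly unit root-mean-square (assumed normalizer)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


def fuse_heads(attn_out, ssm_out, attn_scale, ssm_scale, out_proj):
    """Normalize both streams, scale each per channel, average, then project."""
    attn_n = rms_normalize(attn_out)
    ssm_n = rms_normalize(ssm_out)
    mixed = [
        0.5 * (ba * a + bs * s)
        for a, s, ba, bs in zip(attn_n, ssm_n, attn_scale, ssm_scale)
    ]
    # out_proj is a list of rows; each output channel is a row-vector dot product.
    return [sum(w * m for w, m in zip(row, mixed)) for row in out_proj]


# Toy outputs for one token: the SSM stream has much larger magnitudes.
attn_out = [0.1, -0.2, 0.05, 0.1]
ssm_out = [10.0, -30.0, 20.0, 5.0]
ones = [1.0] * 4  # placeholder for learnable per-channel scales
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

fused = fuse_heads(attn_out, ssm_out, ones, ones, identity)
print(fused)
```

Because each stream is normalized before mixing, multiplying the raw SSM output by a constant leaves the fused result essentially unchanged, so the balance between the two pathways is set by the per-channel scales rather than by raw output magnitude.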
Hymba allocates roughly a 5:1 parameter ratio between the SSM heads and the attention heads inside the module, so most of the layer's capacity sits in the cheap SSM path while a thin slice of attention supplies exact recall [2]. To isolate the benefit of fusing rather than stacking, the authors ran a controlled ablation at the 300M-parameter scale: the parallel hybrid-head design reached an average commonsense-reasoning accuracy of 45.19, versus 44.07 for an otherwise-matched architecture that stacked the same attention and SSM components sequentially, a gain of 1.12 points [3].
Hymba adds three further elements that improve both quality and efficiency.
The first is a set of 128 learnable meta tokens prepended to every input sequence. These embeddings are trained jointly with the model weights and act as a learned initialization for the KV cache and the SSM state. The authors show that they mitigate the "attention sink" effect, in which a model dumps large amounts of attention onto the beginning-of-sequence token, and they relieve the "forced-to-attend" burden that arises when no real token is relevant to a query. In this respect the meta tokens function much like the register tokens used in vision transformers, providing a place to park attention; analysis indicates that different meta tokens specialize and activate for different tasks and domains [1][2][3].
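Mechanically, prepending meta tokens amounts to concatenating a fixed block of embeddings in front of the prompt's token embeddings. The sketch below assumes a tiny hidden size and uses random placeholders where a real model would hold trained parameters; `prepend_meta_tokens` and `HIDDEN_DIM` are hypothetical names and values.

```python
import random

NUM_META_TOKENS = 128  # count stated in the article
HIDDEN_DIM = 8         # hypothetical, kept tiny for illustration

rng = random.Random(0)
# Placeholder values: in a trained model these rows are learned parameters.
meta_tokens = [
    [rng.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)]
    for _ in range(NUM_META_TOKENS)
]


def prepend_meta_tokens(token_embeddings):
    """Return the sequence with the same meta-token block placed in front."""
    return meta_tokens + token_embeddings


prompt = [[0.0] * HIDDEN_DIM for _ in range(5)]  # stand-in for 5 embedded tokens
augmented = prepend_meta_tokens(prompt)
print(len(augmented))  # 133 = 128 meta tokens + 5 prompt tokens
```

Since the meta tokens are identical for every input, the attention keys and values and the SSM state they produce are the same across prompts, which is consistent with the article's description of them as a learned initialization.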
The second element is a mix of global and local attention. Hymba aggressively replaces full global attention with sliding-window (local) attention in most layers: only three layers, the first, the middle, and the last, retain full global attention, while the remaining roughly 90% of layers use sliding-window attention. Because local attention caches only a fixed window of recent tokens, this change alone shrinks the KV cache by about 3.8x [2][3].
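The layer schedule and its effect on cached tokens can be sketched as follows. The layer count of 32, the choice of `num_layers // 2` as the "middle" index, and the sequence length and window size are all assumptions made for illustration, so the resulting ratio is not an attempt to reproduce the reported 3.8x figure.

```python
def attention_kinds(num_layers):
    """Label the first, middle, and last layers global; all others local."""
    global_layers = {0, num_layers // 2, num_layers - 1}
    return ["global" if i in global_layers else "local" for i in range(num_layers)]


def cached_tokens(kind, seq_len, window):
    """Tokens a layer must keep in its KV cache."""
    return seq_len if kind == "global" else min(seq_len, window)


kinds = attention_kinds(32)  # assumed layer count
local_share = kinds.count("local") / len(kinds)

seq_len, window = 8192, 1024  # hypothetical sequence length and window size
full = len(kinds) * seq_len   # every layer caching the whole context
mixed = sum(cached_tokens(kind, seq_len, window) for kind in kinds)

print(f"local layers: {local_share:.1%}")
print(f"cached tokens: {mixed} with the mix vs {full} with all-global attention")
```

With 3 of 32 layers global, the local share is 29/32, matching the article's "roughly 90%". The cache saving itself depends on how the context length compares with the window, so it grows as sequences get longer.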
The third element is cross-layer KV-cache sharing. Hymba shares the KV cache between every two consecutive attention layers, so adjacent layers reuse the same keys and values rather than each storing its own. This is layered on top of grouped-query attention, which already shares keys and values across query heads within a layer [2][3]. Together, the global-versus-local mix and the cross-layer sharing are what let a model with attention in every layer keep a cache small enough to compete with pure SSMs.
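Both kinds of sharing reduce to simple index mappings. The sketch below pairs consecutive layers onto one cache and maps groups of query heads onto one KV head; the function names, the layer count, and the head counts are hypothetical, and the real model's grouping may differ in detail.

```python
def kv_owner(layer_index, group_size=2):
    """Index of the layer whose KV cache this layer reads (cross-layer sharing)."""
    return layer_index - layer_index % group_size


def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """KV head used by a query head under grouped-query attention."""
    return q_head // (num_q_heads // num_kv_heads)


num_layers = 8  # hypothetical
owners = [kv_owner(i) for i in range(num_layers)]
print(owners)            # [0, 0, 2, 2, 4, 4, 6, 6]
print(len(set(owners)))  # 4 distinct caches instead of 8

gqa_map = [kv_head_for_query_head(q, num_q_heads=8, num_kv_heads=2) for q in range(8)]
print(gqa_map)           # [0, 0, 0, 0, 1, 1, 1, 1]
```

The two mappings compose: grouped-query attention cuts the number of KV heads stored within a layer, and cross-layer sharing cuts the number of layers that store a cache at all.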
Hymba-1.5B was pretrained on 1.5 trillion tokens drawn from public corpora such as DCLM-Baseline-1.0 and SmolLM-Corpus together with proprietary data, using a two-stage schedule that begins with a large general corpus and then anneals on higher-quality data with a continuous learning-rate decay. The base model was trained between September 1, 2024 and November 10, 2024 [3][4]. Hymba-1.5B-Base reached an average of 61.06 across NVIDIA's reported evaluation suite, including 51.19 on 5-shot MMLU, 45.90 on ARC-Challenge, and 77.31 on PIQA [2][3].
The headline efficiency comparisons reported by NVIDIA are summarized below. Each row expresses the advantage of Hymba-1.5B-Base over the named baseline [1][2].
| Comparison | Average accuracy gain | KV cache size | Throughput |
|---|---|---|---|
| Hymba-1.5B-Base vs Llama-3.2-3B | +1.32% | 11.67x smaller | 3.49x higher |
| Hymba-1.5B-Base vs Qwen2.5-1.5B | +1.55% | 2.90x smaller | 1.41x higher |
The comparison against Llama-3.2-3B is notable because Hymba's 1.5B model is roughly half the size yet still wins on average accuracy. Restricted to commonsense-reasoning benchmarks, NVIDIA reported that Hymba-1.5B matches Llama-3.2-3B accuracy while using a 14.72x smaller cache and running 3.49x faster. The win over Qwen2.5-1.5B is significant because that model was trained on roughly 13 times as many tokens as Hymba [1][2].
The instruction-tuned model, Hymba-1.5B-Instruct, was produced by full fine-tuning of the base model followed by direct preference optimization, using learning rates of 5e-5 and 3e-6 respectively. NVIDIA reported that it achieved best-in-class results among sub-2B instruction-tuned models on GSM8K (58.76), GPQA, and the Berkeley Function-Calling Leaderboard [3][5]. The combination of small cache and high throughput makes Hymba a candidate for on-device and edge deployment, where memory and latency budgets are tight.
Hymba sits in a broader line of work that mixes attention with state space models, but it is distinguished by fusing them inside the layer rather than across layers. Mamba and Mamba-2, introduced by Albert Gu and Tri Dao, are the pure SSM ancestors whose efficiency Hymba borrows for its SSM heads. Sequential hybrids such as Jamba from AI21 Labs and Zamba from Zyphra instead build a network by stacking blocks of different types: Jamba interleaves Transformer layers, Mamba layers, and mixture-of-experts modules, while Zamba alternates Mamba blocks with a shared attention block. Hymba's authors position their parallel hybrid-head module as a way to avoid the information bottleneck that such depth-wise stacking can impose, since in Hymba every layer offers both exact recall and cheap summarization simultaneously [1][3].
NVIDIA's small-language-model work continued in this direction. The hybrid Mamba-Transformer ideas behind Hymba are related to the company's later Nemotron-H and Nemotron Nano model families, which also blend SSM and attention layers for efficient inference, reflecting a sustained interest in architectures that reduce the KV-cache and quadratic-attention costs of pure Transformers while preserving their recall ability [2].