Mamba-3
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,794 words
Mamba-3 is a sequence-modeling architecture in the state space model (SSM) family, introduced in March 2026 by researchers at Carnegie Mellon University and Princeton. It is the third major iteration of the Mamba line, following Mamba (2023) and Mamba-2 (2024). The paper, titled "Mamba-3: Improved Sequence Modeling using State Space Principles," was authored by Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu, and was accepted to ICLR 2026.[1][2] Where earlier Mamba versions were designed largely around training efficiency, Mamba-3 takes an inference-first view: it tries to answer what an SSM would look like if you optimized it for the way models are actually deployed and served. The result combines three changes: a more expressive recurrence derived from SSM discretization, a complex-valued state update that improves state tracking, and a multi-input multi-output (MIMO) formulation that raises throughput without adding decode latency.[1]
The motivation for Mamba comes from a practical problem with the Transformer. Attention is quadratic in sequence length and its memory footprint (the KV cache) grows linearly as you generate, which makes long-context inference expensive. State space models offer a different trade. They carry a fixed-size hidden state forward through the sequence, so compute scales linearly and the memory cost stays constant no matter how long the context gets. The catch is that a fixed state is a bottleneck. You cannot remember everything, so SSMs have historically lagged Transformers on tasks that need precise recall.
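The memory side of that trade can be made concrete with a back-of-the-envelope sketch. The model shape below (layer count, head count, head and state dimensions, 2-byte elements) is hypothetical and chosen only for illustration; the formulas are the usual element counts for a key/value cache and for a per-head recurrent state, not figures from the Mamba-3 paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for keys and values cached across all layers at a given length."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_layers, n_heads, head_dim, state_dim, bytes_per_elem=2):
    """Bytes for the recurrent state; note there is no seq_len argument."""
    return n_layers * n_heads * head_dim * state_dim * bytes_per_elem


# Hypothetical model shape, for illustration only.
LAYERS, HEADS, HEAD_DIM, STATE_DIM = 24, 16, 64, 128

for seq_len in (2_048, 32_768, 524_288):
    kv = kv_cache_bytes(seq_len, LAYERS, HEADS, HEAD_DIM)
    ssm = ssm_state_bytes(LAYERS, HEADS, HEAD_DIM, STATE_DIM)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:9.1f} MiB, "
          f"SSM state {ssm / 2**20:6.1f} MiB")
```

The cache column grows in proportion to the context while the state column does not move, which is the whole appeal and also the source of the recall bottleneck.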
The original Mamba, released by Albert Gu and Tri Dao in late 2023, made the SSM "selective." Its state transitions became input-dependent, letting the model decide what to keep and what to forget based on the content of each token rather than applying the same dynamics everywhere.[1] Mamba-2 followed in May 2024 and reframed the whole thing through what the authors called structured state space duality (SSD), showing that SSMs and attention are two views of the same underlying structured-matrix computation.[3] That duality let Mamba-2 use matrix multiplications instead of a custom scan, which made it considerably faster to train and allowed a much larger state (the internal state size jumped from 16 in Mamba-1 to 256).[3]
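The duality is easiest to see on a scalar toy. The sketch below is not the Mamba-2 kernel; it is a minimal illustration, with hypothetical function names, that a step-by-step recurrence and a single lower-triangular matrix applied to the whole input produce the same outputs. The per-step values a, b, and c stand in for the input-dependent decay and projections.

```python
def ssm_recurrent(a, b, c, x):
    """Run h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t step by step."""
    h, ys = 0.0, []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys


def ssm_matrix(a, b, c, x):
    """Same map written as y = M x with a lower-triangular M.

    M[t][s] = c_t * (a_{s+1} * ... * a_t) * b_s for s <= t, and 0 otherwise.
    """
    n = len(x)
    M = [[0.0] * n for _ in range(n)]
    for t in range(n):
        decay = 1.0  # running product a_{s+1} ... a_t as s moves left
        for s in range(t, -1, -1):
            M[t][s] = c[t] * decay * b[s]
            decay *= a[s]
    return [sum(M[t][s] * x[s] for s in range(n)) for t in range(n)]


a = [0.9, 0.5, 0.8, 0.7]
b = [1.0, 0.3, -0.6, 2.0]
c = [0.5, 1.5, -1.0, 0.25]
x = [1.0, -2.0, 0.5, 3.0]
print(ssm_recurrent(a, b, c, x))
print(ssm_matrix(a, b, c, x))
```

The recurrent form is what you run at decode time; the matrix form is what lets training lean on matrix-multiplication hardware.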
That speedup came at a cost the Mamba-3 authors are candid about. Mamba-2 simplified the SSM mechanism to make training efficient, but the simplification left the recurrence "too simple" and the resulting inference path memory-bound.[2] Mamba-3 is the attempt to fix that without giving up the linear-time, constant-memory properties that make SSMs attractive in the first place.
The first change is the most mathematical. An SSM is fundamentally a continuous-time linear system that gets discretized into a recurrence you can run step by step. Mamba and Mamba-2 used an exponential-Euler scheme, which is a first-order approximation of the integral that maps the input into the state.[2] Mamba-3 replaces it with an exponential-trapezoidal scheme, a second-order approximation that is more accurate over each step.[1][2]
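A small numerical sketch shows what "first-order versus second-order" buys. It assumes a scalar system with decay rate A = -1 and a hypothetical smooth input u(s) = 1 + sin(s), and it reads the Euler scheme as a right-endpoint rule and the trapezoidal scheme as an average of the two endpoints of the integrand exp(A(Δ - s)) u(s). Those choices are illustrative, not taken from the paper.

```python
import math

A = -1.0  # hypothetical scalar decay rate


def u(s):
    """Hypothetical smooth input signal entering the state."""
    return 1.0 + math.sin(s)


def exact_input_integral(dt):
    """Closed form of the integral of exp(A*(dt - s)) * u(s) over [0, dt] for A = -1."""
    return 1.0 - 0.5 * math.exp(-dt) + 0.5 * (math.sin(dt) - math.cos(dt))


def euler_input_term(dt):
    """Right-endpoint rule: only the current input contributes."""
    return dt * u(dt)


def trapezoid_input_term(dt):
    """Average of both endpoints; the previous input is decayed by exp(A*dt)."""
    return 0.5 * dt * (math.exp(A * dt) * u(0.0) + u(dt))


for dt in (0.2, 0.1, 0.05):
    exact = exact_input_integral(dt)
    err_euler = abs(euler_input_term(dt) - exact)
    err_trap = abs(trapezoid_input_term(dt) - exact)
    print(f"dt={dt:<5} euler error={err_euler:.2e}  trapezoid error={err_trap:.2e}")
```

In this toy setup the trapezoidal error is smaller at every step size and shrinks faster as the step is halved, which is the behavior the higher-order scheme is meant to deliver.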
In concrete terms, the Euler rule updates the state from two terms (the previous state and the current input). The trapezoidal rule uses three: it blends the current input with the previous one. The paper writes the update as
h_t = α_t h_{t-1} + β_t B_{t-1} x_{t-1} + γ_t B_t x_t,
where α_t is the decay applied to the prior state and β_t, γ_t weight the previous and current inputs.[1] A data-dependent scalar λ_t in [0, 1] interpolates between the two: setting λ_t = 1 recovers Mamba-2's Euler update, while λ_t = 1/2 gives the classical trapezoidal rule, so Mamba-2 is a special case of the new recurrence.[1] One side effect the authors highlight is that this richer update implicitly applies a short convolution to the input on its way into the state. That turns out to make the explicit short causal convolution from Mamba-1 and Mamba-2 redundant. They tried adding it back and found it did not help and slightly hurt, so Mamba-3 drops it.[2]
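The special-case relationship can be checked on a scalar toy. The sketch below assumes one parameterization that is consistent with the two cases just described but is not quoted from the paper: α_t = exp(Δ_t A), β_t = (1 - λ_t) Δ_t α_t, and γ_t = λ_t Δ_t. The function names, the scalar state, and the fixed A = -1 are all hypothetical simplifications.

```python
import math


def trapezoidal_scan(x, b, dt, lam, a=-1.0):
    """Scalar sketch of h_t = alpha_t*h_{t-1} + beta_t*b_{t-1}*x_{t-1} + gamma_t*b_t*x_t.

    Assumed parameterization (an illustration, not the paper's definition):
      alpha_t = exp(dt_t * a)
      beta_t  = (1 - lam_t) * dt_t * alpha_t
      gamma_t = lam_t * dt_t
    """
    h, prev_bx, states = 0.0, 0.0, []
    for x_t, b_t, dt_t, lam_t in zip(x, b, dt, lam):
        alpha = math.exp(dt_t * a)
        beta = (1.0 - lam_t) * dt_t * alpha
        gamma = lam_t * dt_t
        h = alpha * h + beta * prev_bx + gamma * b_t * x_t
        prev_bx = b_t * x_t  # remembered for the next step's beta term
        states.append(h)
    return states


def euler_scan(x, b, dt, a=-1.0):
    """Two-term update h_t = alpha_t*h_{t-1} + dt_t*b_t*x_t."""
    h, states = 0.0, []
    for x_t, b_t, dt_t in zip(x, b, dt):
        h = math.exp(dt_t * a) * h + dt_t * b_t * x_t
        states.append(h)
    return states


x = [1.0, -0.5, 2.0, 0.25]
b = [0.5, 1.0, -1.0, 2.0]
dt = [0.1, 0.2, 0.05, 0.3]
print(trapezoidal_scan(x, b, dt, lam=[1.0] * 4))  # matches euler_scan
print(euler_scan(x, b, dt))
print(trapezoidal_scan(x, b, dt, lam=[0.5] * 4))  # classical trapezoidal weights
```

With λ_t = 1 the β term vanishes and the two scans agree; with any smaller λ_t each state update mixes the current and previous inputs, which is the width-two implicit convolution mentioned above.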
The second change targets a known weakness of linear-attention and SSM models: state tracking. These are tasks where the model has to maintain and update an internal status across a long sequence, such as tracking parity over a stream of bits or following nested modular arithmetic. Real-valued state transitions struggle here because a real-valued decay can only shrink or grow a value; it cannot rotate. Many state-tracking problems are essentially about periodicity and rotation, the kind of thing you need oscillatory dynamics to represent.[1]
Mamba-3 addresses this by treating the underlying SSM as complex-valued. A complex eigenvalue has both a magnitude and a phase, and the phase lets the state rotate as it evolves, which is exactly what periodic and group-structured tasks require.[1][2] The clever part is the implementation. Rather than rewrite the kernels for complex arithmetic, the authors express the complex transition as a sequence of 2x2 rotations applied through a block-diagonal rotation matrix, with rotation angles that depend on the data.[1] This is mathematically equivalent to applying rotary position embeddings (RoPE) to the model's B and C projections, so the complex dynamics ride on machinery that already exists and is cheap to compute.[2] On the formal-language benchmarks from the Chomsky hierarchy, the payoff is visible: Mamba-3 solves Parity and modular arithmetic without brackets (reported at about 98.5% accuracy) and nearly closes the gap on modular arithmetic with brackets (about 87.8%), tasks where the earlier real-valued SSMs fail.[1]
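Two facts in that paragraph are easy to verify in a few lines: multiplying a complex number by a unit-modulus phase is the same operation as applying a 2x2 rotation matrix to its real and imaginary parts, and a rotating state can track parity. The sketch below is a hand-built toy, not the Mamba-3 layer. The angle rule (rotate by π on a 1-bit, by 0 on a 0-bit) is set by hand here, whereas the model would have to learn its data-dependent angles.

```python
import cmath
import math


def rotate_2x2(re, im, theta):
    """Apply the rotation matrix [[cos, -sin], [sin, cos]] to the pair (re, im)."""
    c, s = math.cos(theta), math.sin(theta)
    return c * re - s * im, s * re + c * im


def parity_by_rotation(bits):
    """Track parity with a unit-modulus state and a data-dependent angle.

    Each 1-bit rotates the state by pi (a sign flip); each 0-bit leaves it alone.
    """
    h = complex(1.0, 0.0)
    for bit in bits:
        h *= cmath.exp(1j * math.pi * bit)
    return 0 if h.real > 0 else 1


z, theta = complex(0.3, -1.2), 0.7
print(z * cmath.exp(1j * theta))          # complex multiplication
print(rotate_2x2(z.real, z.imag, theta))  # same numbers via the 2x2 rotation
print(parity_by_rotation([1, 0, 1, 1]))   # three 1-bits, so odd parity
```

A state that can only be scaled by a positive decay has no way to return to where it started after two flips, which is why the phase matters for this class of task.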
The third change is about hardware, not expressivity. Earlier Mamba layers were single-input single-output (SISO): each channel got its own scalar recurrence. Mamba-3 introduces a multi-input multi-output (MIMO) formulation that processes several input and output dimensions together through a shared state, using ideas from the SSD framework.[1][2]
What makes this worthwhile is a detail about how inference actually behaves on a GPU. Training an SSM is compute-bound, so adding work makes it slower. Decoding one token at a time is memory-bound: the bottleneck is moving the state in and out of memory, and the arithmetic units sit mostly idle.[2] MIMO spends those idle FLOPs. It increases the compute done per step (the paper notes up to roughly 4x more decoding FLOPs at rank R = 4) but, because decode was memory-limited anyway, the extra arithmetic overlaps with the memory traffic and decode latency barely moves.[2][4] In effect you get a more capable model for free at inference time. The authors keep the parameter count comparable by shrinking the MLP inner dimension only slightly (for the 1.5B model, from 4096 to 3824, about 6.6%).[1] MIMO also improves retrieval without enlarging the state, which matters because retrieval is precisely where fixed-state models tend to fall behind attention.[2]
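The "idle FLOPs" argument can be illustrated with a toy roofline model, in which a decode step takes as long as the slower of its compute and its memory traffic. Every number below is hypothetical (accelerator peak rate, bandwidth, state size, per-token FLOPs), and the model assumes perfect overlap of compute with memory movement and FLOPs that scale linearly with the rank R. It is a sketch of the reasoning, not a measurement.

```python
def decode_step_seconds(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Toy roofline: with perfect overlap, a step takes the slower of the two."""
    return max(flops / peak_flops, bytes_moved / mem_bandwidth)


# Hypothetical accelerator and workload numbers, for illustration only.
PEAK_FLOPS = 1.0e15     # FLOP/s
MEM_BANDWIDTH = 3.0e12  # bytes/s
STATE_BYTES = 1.0e8     # state traffic per decoded token (unchanged by rank)
SISO_FLOPS = 2.0e8      # arithmetic per decoded token at rank 1

for rank in (1, 2, 4):
    flops = SISO_FLOPS * rank  # assumption: decode FLOPs grow linearly with R
    seconds = decode_step_seconds(flops, STATE_BYTES, PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"R={rank}: {flops:.1e} FLOPs per step, {seconds * 1e6:.1f} us per step")
```

Under these assumptions the step time is set entirely by memory traffic at every rank shown, so quadrupling the arithmetic leaves the modeled latency unchanged. On real hardware the overlap is imperfect, which is consistent with the text's "barely moves" rather than "does not move".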
Beyond the three headline ideas, Mamba-3 adopts several smaller changes that align it with modern Transformer and Gated DeltaNet designs. It adds normalization on the B and C projections (a QK-norm-style stabilizer the authors call BCNorm) that empirically steadies training.[2] RMSNorm becomes optional but is kept in hybrid models for length extrapolation, and MLP blocks are interleaved in the standard Transformer arrangement.[2] The short convolution is gone, replaced by the implicit convolution baked into the trapezoidal recurrence. The kernels are split across three toolchains: Triton for prefill, TileLang for the MIMO prefill path, and a CuTe DSL implementation for decode on Hopper-class GPUs.[2]
The headline comparisons come from models trained on 100B tokens of FineWeb-Edu with the Llama-3.1 tokenizer at a 2K context length, evaluated at sizes from 180M up to 1.5B parameters.[1] At the 1.5B scale, Mamba-3's SISO variant improves average downstream accuracy by 0.6 points over the next best model, Gated DeltaNet, and the MIMO variant adds another 1.2 points for a total gain of 1.8 points.[1] On state-size experiments, Mamba-3 matches Mamba-2's perplexity while using half the state (state size 64 versus 128), which is the "2x smaller state" claim that headlined much of the coverage.[1][4] The architecture also posts the fastest reported prefill-plus-decode latency among the models tested, ahead of Mamba-2, Gated DeltaNet, and a vLLM Transformer baseline, with the MIMO variant matching Mamba-2's speed despite its higher capacity.[2]
| Model (1.5B) | FineWeb-Edu perplexity | Avg. downstream accuracy |
|---|---|---|
| Transformer | 10.51 | 55.4% |
| Mamba-2 | 10.47 | 55.7% |
| Gated DeltaNet | 10.45 | 55.8% |
| Mamba-3 (SISO) | 10.35 | 56.4% |
| Mamba-3 (MIMO, R=4) | 10.24 | 57.6% |
The lineage looks like this in summary:
| | Mamba (2023) | Mamba-2 (2024) | Mamba-3 (2026) |
|---|---|---|---|
| Core idea | Selective (input-dependent) SSM | State space duality (SSD) | Inference-first SSM |
| Discretization | Exponential-Euler | Exponential-Euler | Exponential-trapezoidal |
| State values | Real | Real | Complex (via RoPE-style rotations) |
| Channel structure | SISO | SISO | SISO and MIMO |
| Main optimization target | Recall via selection | Training speed, larger state | Inference efficiency, state tracking |
What I find genuinely interesting about Mamba-3 is the framing. A lot of the linear-attention literature has been a race for training throughput, sometimes trading away the capabilities that made Transformers worth replacing. Mamba-3 inverts the priority and asks what matters when a model is being served, not trained, which is where most of the compute now goes given how much the field has shifted toward inference-time compute and heavy post-training.[2] The three pieces fit that goal cleanly: trapezoidal discretization buys expressivity at almost no inference cost, complex states fix a long-standing failure mode without new kernels, and MIMO turns a quirk of GPU memory behavior into extra capability you do not pay for at decode time.
None of this makes the fixed-state bottleneck disappear. Linear models still trail Transformers on the hardest retrieval tasks, a limit the paper acknowledges rather than papers over.[2] But matching Mamba-2 quality at half the state, beating Gated DeltaNet on downstream accuracy, and closing state-tracking gaps that real-valued SSMs simply could not reach is a meaningful step. The authors frame it as advancing the performance-efficiency Pareto frontier, and on the evidence reported, that is a fair description.[1]