Albert Gu
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,270 words
Albert Gu is an American computer scientist, Assistant Professor of Machine Learning at Carnegie Mellon University, and co-founder and Chief Scientist of Cartesia AI.[^1][^2] He is best known as the principal originator of the modern line of structured state-space models (SSMs) for deep learning, beginning with the HiPPO theory of online memory (2020), continuing through the S4 architecture (2021–2022) and its diagonal variants, and culminating in the Mamba family of selective state-space models that he developed jointly with Tri Dao.[^3][^4][^5][^6] These architectures have become the most widely studied non-attentional alternative to the Transformer.[^7]
Gu completed his PhD in Computer Science at Stanford University under Christopher Ré in 2023, defending a dissertation titled Modeling Sequences with Structured State Spaces.[^8] He joined the faculty of the Machine Learning Department at Carnegie Mellon in 2024, where he leads the Goomba Lab.[^9][^2] In parallel, he is one of five co-founders of Cartesia AI, a San Francisco-based company applying state-space models to real-time generative audio and other streaming modalities.[^10][^11]
In 2024 Gu was named to TIME magazine's TIME100 AI list of the most influential people in artificial intelligence, in recognition of his contributions to non-attentional sequence modeling.[^12] His paper Mamba: Linear-Time Sequence Modeling with Selective State Spaces, co-authored with Tri Dao, received an Outstanding Paper Award at the inaugural Conference on Language Modeling (COLM) in 2024.[^13][^7]
Beyond its specific empirical results, the structured state-space line associated with Gu is often credited with reopening a substantive theoretical question that had been largely dormant since the rise of self-attention: whether efficient recurrent architectures, when properly parameterized, can match Transformers at scale. The success of Mamba and Mamba-2 — together with the broader family of selective SSMs, hybrid SSM/attention models, and linear-attention variants whose connections to SSMs were formalized by the State Space Duality framework — has made the question of "attention versus recurrence" once again a central one in the sequence-modeling literature.[^6][^7][^25]
| Field | Details |
|---|---|
| Born | Year of birth not publicly reported; active research career from c. 2015 |
| Nationality | American |
| Education | B.S. Mathematical Sciences and Computer Science, Carnegie Mellon University (2015); PhD Computer Science, Stanford University (2023)[^14][^8] |
| Doctoral advisor | Christopher Ré[^8] |
| Doctoral committee | Christopher Ré, Percy Liang, Scott Linderman[^8] |
| Current positions | Assistant Professor, Machine Learning Department, Carnegie Mellon University (2024–present); Co-founder and Chief Scientist, Cartesia AI (2023–present)[^2][^10] |
| Notable areas | Structured state-space models, long-sequence modeling, efficient deep learning architectures, generative audio |
| Lab | Goomba Lab, CMU Machine Learning Department[^9] |
| Best-known works | HiPPO (2020); S4 (2021/2022); Mamba (2023); Mamba-2 / State Space Duality (2024)[^3][^4][^5][^6] |
| Recognition | TIME100 AI 2024; COLM 2024 Outstanding Paper Award (Mamba); ICLR 2022 Outstanding Paper Honorable Mention (S4)[^12][^13][^4] |
Albert Gu attended Saratoga High School in Saratoga, California, where he graduated as class valedictorian in 2012.[^15] During high school he was a prominent competitor in mathematics and informatics olympiads, attending the Mathematical Olympiad Summer Program (MOSP) and winning medals at the Asian Pacific Mathematics Olympiad and the International Olympiad in Informatics.[^15]
Gu pursued his undergraduate studies at Carnegie Mellon University, where he double-majored in Mathematical Sciences and Computer Science.[^14][^16] He was admitted as a Knaster–McWilliams Scholar, a member of the honors program of CMU's Mellon College of Science.[^17] As an undergraduate, he was a leading member of CMU's competitive mathematics team: he was named a Putnam Fellow for placing in the top ten of the William Lowell Putnam Mathematical Competition as a first-year student, and as a senior he ranked among the top sixteen in the 2014 Putnam, contributing to CMU's fifth-place team finish that year.[^17][^16]
After graduating from CMU in 2015, Gu entered the doctoral program in Computer Science at Stanford University, where he joined Christopher Ré's Hazy Research group at the Stanford AI Lab.[^8] He later credited Hazy Research's broader collaborative environment, including frequent interactions during his PhD with collaborators such as Tri Dao, Karan Goel, Atri Rudra, and Stefano Ermon, with shaping the line of research that would become structured state spaces.[^18][^3] Hazy Research, run by Ré, was concurrently developing complementary work on efficient attention (most prominently FlashAttention, led by Dao), data-centric machine learning, and weakly supervised systems; Gu's structured state-space line emerged as one of the lab's principal architectural contributions of the 2020–2024 period.[^18]
Gu's doctoral research at Stanford focused on developing a principled, theoretically grounded approach to sequence modeling. The line of work began in 2020 with the HiPPO framework (High-order Polynomial Projection Operators), introduced in the NeurIPS 2020 paper HiPPO: Recurrent Memory with Optimal Polynomial Projections, co-authored with Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré.[^3] HiPPO formalized the problem of online sequence memorization as the optimal projection of a function's history onto a basis of orthogonal polynomials with respect to a chosen measure. From this single principle the authors derived classical and new memory cells (including the Legendre Memory Unit and the novel HiPPO-LegS unit), unifying the design of recurrent memory with continuous-time approximation theory.[^3]
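The HiPPO-LegS operator mentioned above can be written down explicitly. The sketch below builds it in the convention usually quoted from the HiPPO paper, in which the coefficient vector evolves as dc/dt = -(1/t) A c + (1/t) B f. The function name `hippo_legs` and the choice of four coefficients are illustrative assumptions, not part of any published code.

```python
import math

def hippo_legs(n_states):
    """Return (A, B) for HiPPO-LegS, assuming dc/dt = -(1/t) A c + (1/t) B f."""
    A = [[0.0] * n_states for _ in range(n_states)]
    B = [0.0] * n_states
    for n in range(n_states):
        B[n] = math.sqrt(2 * n + 1)
        for k in range(n_states):
            if n > k:
                # Below the diagonal: sqrt(2n+1) * sqrt(2k+1)
                A[n][k] = math.sqrt(2 * n + 1) * math.sqrt(2 * k + 1)
            elif n == k:
                # On the diagonal: n + 1
                A[n][k] = float(n + 1)
            # Entries above the diagonal stay zero (A is lower triangular).
    return A, B

A, B = hippo_legs(4)
```

Each row `n` of `A` corresponds to the coefficient of the degree-`n` Legendre polynomial, so the lower-triangular shape means each coefficient is updated only from coefficients of equal or lower degree.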
In 2021 Gu and collaborators extended HiPPO into a general deep-learning primitive in the NeurIPS 2021 paper Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers (LSSL), which showed that a single linear state-space layer could simulate recurrent, convolutional, and continuous-time models — but was computationally impractical at scale.[^19] The decisive engineering breakthrough came with the 2022 paper Efficiently Modeling Long Sequences with Structured State Spaces, in which Gu, Karan Goel, and Christopher Ré introduced S4, a structured state-space layer whose state transition matrix is parameterized as a low-rank correction of a normal (diagonalizable) matrix.[^4] This decomposition allowed the model's convolution kernel to be evaluated in near-linear time, enabling the architecture to learn on sequences with tens of thousands of tokens. S4 achieved state-of-the-art results on the Long Range Arena benchmark, in particular solving the previously intractable Path-X task at length 16,384, and received an Outstanding Paper Honorable Mention at ICLR 2022.[^4]
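The claim that one linear state-space layer can act as both a recurrent and a convolutional model can be illustrated with a toy system. The sketch below assumes a discrete-time, time-invariant SSM with a one-dimensional state, x_k = a·x_{k-1} + b·u_k and y_k = c·x_k; all names and numeric values are hypothetical and chosen only for illustration.

```python
def ssm_recurrent(a, b, c, u):
    """Run x_k = a*x_{k-1} + b*u_k, y_k = c*x_k one step at a time."""
    x, ys = 0.0, []
    for u_k in u:
        x = a * x + b * u_k
        ys.append(c * x)
    return ys

def ssm_kernel(a, b, c, length):
    """Convolution kernel K_j = c * a**j * b of the same system."""
    return [c * (a ** j) * b for j in range(length)]

def ssm_convolutional(a, b, c, u):
    """Compute y_k = sum_j K_j * u_{k-j}, a causal convolution with the kernel."""
    K = ssm_kernel(a, b, c, len(u))
    return [sum(K[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]

u = [1.0, 0.5, -2.0, 3.0, 0.0]
y_rec = ssm_recurrent(0.9, 0.5, 2.0, u)
y_conv = ssm_convolutional(0.9, 0.5, 2.0, u)
```

Both functions produce the same outputs up to floating-point error. The recurrent form needs only the current state at each step, while the convolutional form exposes the whole sequence to parallel computation; the naive convolution here is quadratic in length and stands in for the fast kernel evaluation that S4 made practical.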
Gu's dissertation, Modeling Sequences with Structured State Spaces, was filed at Stanford in 2023, with Christopher Ré as advisor and Percy Liang and Scott Linderman on the reading committee.[^8] It synthesized the HiPPO theory, the LSSL framework, and the S4 architecture into a unified mathematical and empirical account of structured state-space sequence modeling. The thesis argued that existing approaches to deep sequence modeling — recurrent networks, convolutional networks, and Transformers — each suffer from a mixture of efficiency limitations, theoretical opacity, and difficulty handling very long dependencies, and proposed structured state spaces as a single primitive that addresses all three concerns.[^8]
A recurring theme in Gu's work, traceable from HiPPO through Mamba-2, is the use of classical mathematical structure — orthogonal polynomial families, continuous-time linear dynamical systems, semiseparable matrices — to design deep learning primitives whose computational and statistical properties can be analyzed rather than treated as empirical phenomena.[^3][^4][^6] This emphasis distinguishes the structured state-space line from the more empirically driven evolution of attention-based architectures.
Following the success of S4, multiple research groups, including Gu's collaborators, pursued simplifications and variants of the architecture. In 2022 Ankit Gupta and co-authors at IBM Research proposed DSS ("Diagonal State Spaces are as Effective as Structured State Spaces"), demonstrating that constraining S4's state matrix to be diagonal, with a specific initialization, recovered most of S4's performance with a far simpler kernel.[^20] Gu and Karan Goel built on this with S4D, presented at NeurIPS 2022 in the paper On the Parameterization and Initialization of Diagonal State Space Models, which systematically analyzed how to parameterize and initialize fully diagonal SSMs and provided theoretical justification for why a diagonal restriction of S4's diagonal-plus-low-rank state matrix remains effective.[^21]
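Why a diagonal state matrix gives a "far simpler kernel" can be seen in a few lines. The sketch below assumes a zero-order-hold discretization (Ā = exp(ΔA), B̄ = (Ā − 1)/A · B) and diagonal values in the style of the S4D-Lin initialization (A_n = −1/2 + iπn); the function name, step size, and the `B` and `C` values are illustrative assumptions.

```python
import cmath
import math

def diagonal_ssm_kernel(A_diag, B, C, step, length):
    """Kernel K_j = sum_n C_n * Abar_n**j * Bbar_n for a diagonal SSM."""
    # Zero-order-hold discretization, applied independently to each diagonal entry.
    A_bar = [cmath.exp(step * a) for a in A_diag]
    B_bar = [(a_bar - 1) / a * b for a_bar, a, b in zip(A_bar, A_diag, B)]
    # With a diagonal state matrix, powers of A_bar are just elementwise powers.
    return [sum(c * a_bar ** j * b_bar for c, a_bar, b_bar in zip(C, A_bar, B_bar))
            for j in range(length)]

A_diag = [complex(-0.5, math.pi * n) for n in range(4)]
B = [1.0] * 4
C = [0.5, -0.25, 0.125, 1.0]
kernel = diagonal_ssm_kernel(A_diag, B, C, step=0.1, length=8)
```

The kernel is a sum of geometric sequences, one per diagonal entry, which amounts to a Vandermonde matrix product. Practical implementations typically keep only the real part of such a complex kernel; that step is omitted here.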
A separate group at Stanford — Jimmy T. H. Smith, Andrew Warrington, and Scott Linderman — published the closely related S5 layer (Simplified State Space Layers for Sequence Modeling) at ICLR 2023, replacing S4's bank of single-input/single-output SSMs with a single multi-input/multi-output SSM evaluated by parallel scan.[^22] S5 was independent of Gu's group but built directly on S4 and shared its theoretical foundations.[^22]
A further influential paper in this line was H3 (Hungry Hungry Hippos: Towards Language Modeling with State Space Models), published at ICLR 2023 by Daniel Y. Fu, Tri Dao, Khaled Saab and collaborators; while Gu was not a primary author, H3 used the S4 backbone and pointed to the associative-recall weaknesses that would motivate his next line of work.[^23] Empirically, H3 was the first state-space-based model to come close to Transformer perplexity on standard language-modeling benchmarks, suggesting that the gap between SSMs and attention on natural language could be closed if a way were found to make state-space dynamics input-dependent.[^23]
Across the S4 → DSS → S4D → S5 → H3 sequence and its successors, the architectural simplifications progressively reduced the complexity of the structured state-space layer, from S4's diagonal-plus-low-rank parameterization to fully diagonal SSMs and finally, in Mamba-2, to scalar-times-identity transition matrices, without sacrificing expressivity once the parameters were made input-dependent and the kernels were paired with hardware-aware scans.[^4][^21][^22][^6] This trajectory is sometimes summarized as the demonstration that the expressive power of structured SSMs comes primarily from selectivity and from depth, not from a complex per-layer transition matrix.[^6][^25]
By 2023, the principal limitation of S4 and its diagonal variants was their reliance on linear time-invariant dynamics: the same recurrence was applied to every input, which prevented the model from selectively focusing on or ignoring particular tokens — a basic capability of attention. In December 2023 Gu and Tri Dao posted Mamba: Linear-Time Sequence Modeling with Selective State Spaces on arXiv.[^5][^24] The paper introduced selective SSMs, in which the state-transition, input-projection, and output-projection parameters become input-dependent functions, allowing the model to gate information flow based on the content of each token.[^5]
Because input-dependent parameters break the convolutional form that made S4 efficient, the authors paired the new recurrence with a hardware-aware parallel-scan implementation that uses kernel fusion and recomputation to keep memory traffic low on modern GPUs. Combined with a simplified architectural block that interleaves selective SSM layers with gating and projection rather than attention or MLP blocks, Mamba achieved roughly 5× higher inference throughput than Transformers of comparable size while matching or exceeding their quality on language, audio, and genomics benchmarks.[^5][^25] Mamba was presented at the first Conference on Language Modeling (COLM) in 2024, where it received an Outstanding Paper Award.[^13][^25]
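The reason a time-varying linear recurrence can still be parallelized is that its steps compose associatively. The sketch below illustrates only that idea, with a scalar state and a simple log-depth scan written in plain Python; it is not the fused GPU kernel described above, and the function names are hypothetical.

```python
def combine(left, right):
    """Compose two steps of the form h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_states(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t computed one step at a time."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def scan_states(a, b):
    """Inclusive scan in about log2(T) rounds; each round could run in parallel."""
    elems = list(zip(a, b))
    shift = 1
    while shift < len(elems):
        # Every position combines with the partial result `shift` places earlier.
        elems = [elems[i] if i < shift else combine(elems[i - shift], elems[i])
                 for i in range(len(elems))]
        shift *= 2
    # With a zero initial state, the offset term of each prefix is the state itself.
    return [b_t for _, b_t in elems]

a = [0.9, 0.5, 1.0, 0.2, 0.7]
b = [1.0, -1.0, 0.5, 2.0, 0.0]
h_seq = sequential_states(a, b)
h_scan = scan_states(a, b)
```

The two functions return the same states up to floating-point error. The coefficients `a` may differ at every step, which is exactly the situation created by input-dependent parameters.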
In May 2024 Dao and Gu followed Mamba with Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality, presented at ICML 2024.[^6] The paper introduced the State Space Duality (SSD) framework, showing that a broad class of selective SSMs and a broad class of attention variants are equivalent computations on structured semiseparable matrices. From this duality the authors derived Mamba-2, a refinement of Mamba whose state-transition matrix is restricted to scalar-times-identity form. This simplification allowed Mamba-2 to be expressed simultaneously as a recurrence and as a matrix product, yielding training-time speedups of 2–8× over Mamba while preserving its inference-time advantages.[^6] The SSD framework also clarified the precise sense in which Mamba and attention are dual representations of the same underlying computation, providing a unifying view of two of the dominant sequence-modeling paradigms.[^6][^25]
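The dual view can be checked numerically on a small case. The sketch below assumes a single input channel, a scalar decay a_t per step (the scalar-times-identity case), and small hand-picked vectors B_t and C_t; all names and values are illustrative. It computes the same outputs once as a recurrence and once as multiplication by a lower-triangular matrix with entries M[i][j] = (C_i · B_j) · a_{j+1} ⋯ a_i.

```python
def ssd_recurrent(a, B, C, u):
    """h_t = a_t * h_{t-1} + B_t * u_t ;  y_t = C_t . h_t"""
    h, ys = [0.0] * len(B[0]), []
    for a_t, B_t, C_t, u_t in zip(a, B, C, u):
        h = [a_t * h_n + b * u_t for h_n, b in zip(h, B_t)]
        ys.append(sum(c * h_n for c, h_n in zip(C_t, h)))
    return ys

def ssd_matrix(a, B, C):
    """Lower-triangular M with M[i][j] = (C_i . B_j) * a_{j+1} * ... * a_i."""
    T = len(a)
    M = [[0.0] * T for _ in range(T)]
    for i in range(T):
        decay = 1.0  # empty product when j == i
        for j in range(i, -1, -1):
            M[i][j] = decay * sum(c * b for c, b in zip(C[i], B[j]))
            decay *= a[j]
    return M

a = [0.9, 0.8, 0.5, 1.0]
B = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, -1.0]]
C = [[1.0, 1.0], [0.0, 2.0], [1.0, -1.0], [0.5, 0.5]]
u = [1.0, -2.0, 3.0, 0.5]

y_rec = ssd_recurrent(a, B, C, u)
M = ssd_matrix(a, B, C)
y_mat = [sum(M[i][j] * u[j] for j in range(len(u))) for i in range(len(u))]
```

The matrix `M` has the shape of a causally masked attention matrix, with the products C_i · B_j playing the role of query-key scores and the cumulative decays acting as the mask. The recurrence and the matrix product agree up to floating-point error.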
Gu and Dao's collaboration on Mamba reunited the pair after several years of largely separate work: their previous joint publication had been the original HiPPO paper in 2020.[^26] Whereas Gu is generally identified with the theoretical and architectural side of the structured state-space line, Dao is identified with the systems-and-kernels side (he is the lead author of FlashAttention and a principal contributor to the hardware-aware implementations of Mamba).[^5][^26] The Mamba and Mamba-2 papers credit Gu with leading the architectural design of selective SSMs and the formulation of the SSD framework, and credit Dao with leading the hardware-aware implementations and many of the systems-level engineering decisions that made selective SSMs practical at scale on contemporary accelerators.[^5][^6]
Mamba's release in late 2023 attracted unusually broad attention for an academic architecture paper: within weeks of the preprint's appearance, third-party reimplementations, scaling experiments, and surveys began to appear, and Mamba blocks became a baseline against which subsequent efficient-sequence-model proposals were measured.[^7][^25] By 2024 the architecture had been ported to domains including computer vision, computational biology, time-series forecasting, and reinforcement learning, often via hybrid models that interleave Mamba blocks with attention.[^7][^25] Gu has consistently characterized the goal of the line as architectural diversification rather than wholesale replacement of attention, noting that hybrid models combining selective SSMs with small amounts of attention often outperform either pure approach.[^2][^18]
In 2023 Gu co-founded Cartesia AI together with Karan Goel (CEO), Arjun Desai, Brandon Yang, and his doctoral advisor Christopher Ré.[^10][^27] The founding team had overlapped at the Stanford AI Lab during the development of HiPPO, S4, and their successors, and the company was incorporated to commercialize the line of structured state-space models the team had developed.[^10] Gu holds the title of Chief Scientist.[^11][^10]
Cartesia's central technical bet is that selective state-space models are unusually well suited to real-time, streaming applications because they admit a constant-memory recurrent inference mode in addition to their parallel training mode. This duality — also formalized in Gu and Dao's SSD paper — means that an SSM can be trained in parallel like a Transformer but deployed as a streaming RNN, processing each new input in constant time and memory regardless of the length of the conversation so far.[^6][^28] For voice interfaces, where end-to-end latency is the dominant user-experience constraint, that property is a substantial advantage over autoregressive Transformer decoders whose per-token cost grows linearly with context.[^28]
The company's flagship product line is the Sonic family of text-to-speech and conversational voice models. The first version, Sonic, was launched on 31 May 2024 and was advertised as the fastest voice generation model of its class, with a time-to-first-audio in the 100-millisecond range.[^28] Successive versions — Sonic 2 and Sonic-3 (the latter current as of 2026) — further reduced latency, expanded the model to over forty languages, and added features such as expressive emotion, AI-generated laughter, and enterprise-grade accuracy on acronyms and identifiers.[^29][^30] In March 2025 Cartesia announced a $64 million Series A round led by Kleiner Perkins, with participation from Index Ventures, Lightspeed, A*, Factory, Greycroft, Dell Technologies Capital, and Samsung Ventures.[^31] By 2026 the company reported tens of thousands of customers and engineering integrations with major enterprise and consumer-electronics buyers.[^10][^31]
Gu joined the Machine Learning Department at Carnegie Mellon University in 2024 as an Assistant Professor.[^2][^9] At CMU he leads the Goomba Lab, a research group focused on the foundations of deep learning, sequence modeling, and structured architectures.[^9] The lab's stated agenda covers theoretical and empirical aspects of deep learning, with current emphasis on deep sequence models — a continuation of the program begun during Gu's PhD.[^9] His CMU appointment is unusual in that it coincides with his role as Chief Scientist of a venture-backed startup, with the two activities sharing scientific subject matter (efficient sequence modeling) but separate institutional contexts.[^11][^2]
Gu also continues to publish actively with collaborators outside CMU. In 2026, the Mamba line was extended in Mamba-3: Improved Sequence Modeling using State Space Principles, an architectural redesign aimed at faster inference and improved language-modeling perplexity at a fraction of the decoding cost of strong Transformer baselines.[^32] The Goomba Lab has also published a series of expository blog posts on the SSD framework and on the practical design of selective SSMs, intended to make the underlying mathematics accessible to a broader machine-learning audience.[^24]