Albert Gu

Updated Jul 16, 2026

Albert Gu is an American computer scientist, Assistant Professor of Machine Learning at Carnegie Mellon University, and co-founder and Chief Scientist of Cartesia AI.^[1]^[2] He is best known as the principal originator of the modern line of structured state-space models (SSMs) for deep learning, beginning with the HiPPO theory of online memory (2020), continuing through the S4 architecture (2021-2022) and its diagonal variants, and culminating in the Mamba family of selective state-space models that he developed jointly with Tri Dao.^[3]^[4]^[5]^[6] These architectures have become the most widely studied non-attentional alternative to the Transformer.^[7]

Gu completed his PhD in Computer Science at Stanford University under Christopher Ré in 2023, defending a dissertation titled Modeling Sequences with Structured State Spaces.^[8] He joined the faculty of the Machine Learning Department at Carnegie Mellon in 2024, where he leads the Goomba Lab.^[9]^[2] In parallel, he is one of five co-founders of Cartesia AI, a San Francisco-based company applying state-space models to real-time generative audio and other streaming modalities.^[10]^[11]

In 2024 Gu was named to TIME magazine's TIME100 AI list of the most influential people in artificial intelligence, in recognition of his contributions to non-attentional sequence modeling.^[12] His paper Mamba: Linear-Time Sequence Modeling with Selective State Spaces, co-authored with Tri Dao, received an Outstanding Paper Award at the inaugural Conference on Language Modeling (COLM) in 2024.^[13]^[7]

Beyond its specific empirical results, the structured state-space line associated with Gu is often credited with reopening a substantive theoretical question that had been largely dormant since the rise of self-attention: whether efficient recurrent architectures, when properly parameterized, can match Transformers at scale. The success of Mamba and Mamba-2, together with the broader family of selective SSMs, hybrid SSM/attention models, and linear-attention variants whose connections to SSMs were formalized by the State Space Duality framework, has made the question of "attention versus recurrence" once again a central one in the sequence-modeling literature.^[6]^[7]^[25]

Key facts

Field	Details
Born	Year of birth not publicly reported (research career active since c. 2015)
Nationality	American
Education	B.S. Mathematical Sciences and Computer Science, Carnegie Mellon University (2015); PhD Computer Science, Stanford University (2023)^[14]^[8]
Doctoral advisor	Christopher Ré^[8]
Doctoral committee	Christopher Ré, Percy Liang, Scott Linderman^[8]
Current positions	Assistant Professor, Machine Learning Department, Carnegie Mellon University (2024-present); Co-founder and Chief Scientist, Cartesia AI (2023-present)^[2]^[10]
Notable areas	Structured state-space models, long-sequence modeling, efficient deep learning architectures, generative audio
Lab	Goomba Lab, CMU Machine Learning Department^[9]
Best-known works	HiPPO (2020); S4 (2021/2022); Mamba (2023); Mamba-2 / State Space Duality (2024)^[3]^[4]^[5]^[6]
Recognition	TIME100 AI 2024; COLM 2024 Outstanding Paper Award (Mamba); ICLR 2022 Outstanding Paper Honorable Mention (S4)^[12]^[13]^[4]

Early life and education

Albert Gu attended Saratoga High School in Saratoga, California, where he graduated as class valedictorian in 2012.^[15] During high school he was a prominent competitor in mathematics and informatics olympiads, attending the Mathematical Olympiad Summer Program (MOSP) and winning medals at the Asian Pacific Mathematics Olympiad and the International Olympiad in Informatics.^[15]

Gu pursued his undergraduate studies at Carnegie Mellon University, where he double-majored in Mathematical Sciences and Computer Science.^[14]^[16] He was admitted to the Knaster-McWilliams Scholars program, an honors program of CMU's Mellon College of Science.^[17] As an undergraduate, he was a leading member of CMU's competitive mathematics team: he placed in the top ten of the William Lowell Putnam Mathematical Competition as a first-year student, and as a senior he ranked among the top sixteen in the 2014 Putnam, contributing to CMU's fifth-place team finish that year.^[17]^[16]

After graduating from CMU in 2015, Gu entered the doctoral program in Computer Science at Stanford University, where he joined Christopher Ré's Hazy Research group at the Stanford AI Lab.^[8] He later credited Hazy Research's broader collaborative environment (including frequent interactions with fellow PhD students such as Tri Dao and Karan Goel and with faculty collaborators such as Atri Rudra and Stefano Ermon) with shaping the line of research that would become structured state spaces.^[18]^[3] Hazy Research, run by Ré, was concurrently developing complementary work on efficient attention (most prominently FlashAttention, led by Dao), data-centric machine learning, and weakly supervised systems; Gu's structured state-space line emerged as one of the lab's principal architectural contributions of the 2020-2024 period.^[18]

Stanford PhD and the HiPPO/S4 origins

Gu's doctoral research at Stanford focused on developing a principled, theoretically grounded approach to sequence modeling. The line of work began in 2020 with the HiPPO framework (High-order Polynomial Projection Operators), introduced in the NeurIPS 2020 paper HiPPO: Recurrent Memory with Optimal Polynomial Projections, co-authored with Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré.^[3] HiPPO formalized the problem of online sequence memorization as the optimal projection of a function's history onto a basis of orthogonal polynomials with respect to a chosen measure. From this single principle the authors derived classical and new memory cells (including the Legendre Memory Unit and the novel HiPPO-LegS unit), unifying the design of recurrent memory with continuous-time approximation theory.^[3]
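As a rough illustration of what such a memory cell looks like in practice, the sketch below builds the HiPPO-LegS state matrix and input vector for a small state size. It assumes the sign convention used in later S4 work, x'(t) = A x(t) + B u(t), and omits the 1/t scaling that appears in the original LegS dynamics. The function name `hippo_legs` is illustrative and is not taken from any released code.

```python
import math

def hippo_legs(n_state):
    """Return (A, B) for HiPPO-LegS, assuming the convention x' = A x + B u."""
    A = [[0.0] * n_state for _ in range(n_state)]
    B = [math.sqrt(2 * n + 1) for n in range(n_state)]
    for n in range(n_state):
        for k in range(n_state):
            if n > k:
                # Below the diagonal: -sqrt(2n+1) * sqrt(2k+1)
                A[n][k] = -math.sqrt(2 * n + 1) * math.sqrt(2 * k + 1)
            elif n == k:
                # On the diagonal: -(n + 1)
                A[n][k] = -(n + 1.0)
            # Above the diagonal the entries stay zero.
    return A, B

A, B = hippo_legs(4)
for row in A:
    print(["%.3f" % v for v in row])
```

The matrix is lower triangular, so each coefficient of the polynomial projection is updated only from coefficients of equal or lower order.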

In 2021 Gu and collaborators extended HiPPO into a general deep-learning primitive in the NeurIPS 2021 paper Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers (LSSL), which showed that a single linear state-space layer could simulate recurrent, convolutional, and continuous-time models; the layer itself, however, was computationally impractical at scale.^[19] The decisive engineering breakthrough came with the 2022 paper Efficiently Modeling Long Sequences with Structured State Spaces, in which Gu, Karan Goel, and Christopher Ré introduced S4, a structured state-space layer whose state transition matrix is parameterized as a low-rank correction of a normal (diagonalizable) matrix.^[4] This decomposition allowed the model's convolution kernel to be evaluated in near-linear time, enabling the architecture to learn on sequences with tens of thousands of tokens. S4 achieved state-of-the-art results on the Long Range Arena benchmark, in particular solving the previously intractable Path-X task at length 16,384, and received an Outstanding Paper Honorable Mention at ICLR 2022.^[4]
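The equivalence between the recurrent and convolutional views can be checked on a toy example. The sketch below assumes an already-discretized linear time-invariant SSM, x_t = A x_{t-1} + B u_t and y_t = C x_t, and compares the step-by-step recurrence with a causal convolution against the kernel K_k = C A^k B. The matrices and helper names are made up for illustration, and the kernel is computed naively; S4's fast kernel algorithm is not reproduced here.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def ssm_recurrent(A, B, C, u):
    """Run x_t = A x_{t-1} + B u_t, y_t = C x_t one step at a time."""
    x = [0.0] * len(B)
    ys = []
    for u_t in u:
        x = [ax + b * u_t for ax, b in zip(matvec(A, x), B)]
        ys.append(sum(c * xi for c, xi in zip(C, x)))
    return ys

def ssm_kernel(A, B, C, length):
    """Materialize K_k = C A^k B for k = 0 .. length-1 (naive powers)."""
    kernel, v = [], list(B)
    for _ in range(length):
        kernel.append(sum(c * vi for c, vi in zip(C, v)))
        v = matvec(A, v)
    return kernel

def causal_conv(kernel, u):
    """y_t = sum_{j<=t} K_j u_{t-j}; quadratic time, for checking only."""
    return [sum(kernel[j] * u[t - j] for j in range(t + 1)) for t in range(len(u))]

# Hypothetical toy parameters.
A = [[0.9, 0.0], [0.1, 0.8]]
B = [1.0, 0.5]
C = [0.3, -0.2]
u = [1.0, 2.0, 0.0, -1.0]

y_rec = ssm_recurrent(A, B, C, u)
y_conv = causal_conv(ssm_kernel(A, B, C, len(u)), u)
print(max(abs(r - c) for r, c in zip(y_rec, y_conv)))  # difference is at rounding level
```

The recurrence is the natural form for step-by-step inference, while the convolution lets the whole sequence be processed at once during training.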

Gu's dissertation, Modeling Sequences with Structured State Spaces, was filed at Stanford in 2023, with Christopher Ré as advisor and Percy Liang and Scott Linderman on the reading committee.^[8] It synthesized the HiPPO theory, the LSSL framework, and the S4 architecture into a unified mathematical and empirical account of structured state-space sequence modeling. The thesis argued that existing approaches to deep sequence modeling (recurrent networks, convolutional networks, and Transformers) each suffer from a mixture of efficiency limitations, theoretical opacity, and difficulty handling very long dependencies, and proposed structured state spaces as a single primitive that addresses all three concerns.^[8]

A recurring theme in Gu's work, traceable from HiPPO through Mamba-2, is the use of classical mathematical structure (orthogonal polynomial families, continuous-time linear dynamical systems, semiseparable matrices) to design deep learning primitives whose computational and statistical properties can be analyzed rather than treated as empirical phenomena.^[3]^[4]^[6] This emphasis distinguishes the structured state-space line from the more empirically driven evolution of attention-based architectures.

The structured state space line: S4 → DSS → S4D and beyond

Following the success of S4, multiple research groups, including Gu's collaborators, pursued simplifications and variants of the architecture. In 2022 Ankit Gupta of IBM Research, with Gu and Jonathan Berant as co-authors, proposed DSS ("Diagonal State Spaces are as Effective as Structured State Spaces"), demonstrating that constraining S4's state matrix to be diagonal, with a specific initialization, recovered most of S4's performance with a far simpler kernel.^[20] Gu, Goel, Gupta, and Ré built on this with S4D, presented at NeurIPS 2022 in the paper On the Parameterization and Initialization of Diagonal State Space Models, which systematically analyzed how to parameterize and initialize fully diagonal SSMs and gave a theoretical explanation of why the diagonal approximation of S4's diagonal-plus-low-rank matrix works.^[21]
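To see why a diagonal state matrix gives a simpler kernel, note that with diagonal A the kernel reduces to a sum of geometric sequences, one per state dimension. The sketch below assumes zero-order-hold discretization and the S4D-Lin initialization A_n = -1/2 + iπn. Taking the real part of the complex sum is a simplification made here, and the B, C, and step-size values are invented for the example.

```python
import cmath
import math

def s4d_lin(n_state):
    """S4D-Lin style initialization: A_n = -1/2 + i*pi*n."""
    return [complex(-0.5, math.pi * n) for n in range(n_state)]

def diagonal_kernel(A, B, C, dt, length):
    """K_k = Re(sum_n C_n * B_bar_n * A_bar_n**k) for a diagonal SSM."""
    # Zero-order hold: A_bar = exp(dt*A), B_bar = (A_bar - 1) / A * B
    A_bar = [cmath.exp(dt * a) for a in A]
    B_bar = [(ab - 1) / a * b for ab, a, b in zip(A_bar, A, B)]
    return [
        sum(c * bb * ab ** k for c, bb, ab in zip(C, B_bar, A_bar)).real
        for k in range(length)
    ]

# Hypothetical toy parameters.
A = s4d_lin(4)
B = [1.0, 1.0, 1.0, 1.0]
C = [0.5, -0.25, 0.1, 0.3]
K = diagonal_kernel(A, B, C, dt=0.1, length=8)
print(["%.4f" % k for k in K])
```

Each kernel entry costs one multiply-add per state dimension, with no matrix powers or low-rank corrections to handle.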

A separate group at Stanford (Jimmy T. H. Smith, Andrew Warrington, and Scott Linderman) published the closely related S5 layer (Simplified State Space Layers for Sequence Modeling) at ICLR 2023, replacing S4's bank of single-input/single-output SSMs with a single multi-input/multi-output SSM evaluated by parallel scan.^[22] S5 was independent of Gu's group but built directly on S4 and shared its theoretical foundations.^[22]
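The parallel scan relies on the fact that steps of a linear recurrence x_t = a_t x_{t-1} + b_t compose associatively. The sketch below is a scalar illustration of that idea, not S5's implementation: it defines the combine operator on (a, b) pairs and runs a Hillis-Steele style inclusive scan, where the updates inside each pass are independent and could run in parallel on suitable hardware. Here they are simply looped over.

```python
def combine(left, right):
    """Compose two affine steps: apply `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def parallel_scan(elems):
    """Inclusive scan in about log2(L) passes; each pass is parallelizable."""
    out = list(elems)
    step = 1
    while step < len(out):
        prev = list(out)  # read from the previous pass only
        for i in range(step, len(out)):
            out[i] = combine(prev[i - step], prev[i])
        step *= 2
    return out

def sequential(a, b):
    """Reference: x_t = a_t * x_{t-1} + b_t with x_{-1} = 0."""
    x, xs = 0.0, []
    for a_t, b_t in zip(a, b):
        x = a_t * x + b_t
        xs.append(x)
    return xs

# Hypothetical toy coefficients.
a = [0.9, 0.5, 1.1, 0.7, 0.2]
b = [1.0, -2.0, 0.5, 3.0, 0.0]
scanned = [x for _, x in parallel_scan(list(zip(a, b)))]
print(scanned)
print(sequential(a, b))
```

With a zero initial state, the second component of each scanned pair is the state x_t, so both printed lists agree up to rounding.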

A further influential paper in this line was H3 (Hungry Hungry Hippos: Towards Language Modeling with State Space Models), published at ICLR 2023 by Daniel Y. Fu, Tri Dao, Khaled Saab and collaborators; while Gu was not among its authors, H3 used the S4 backbone and pointed to the associative-recall weaknesses that would motivate his next line of work.^[23] Empirically, H3 was the first state-space-based model to come close to Transformer perplexity on standard language-modeling benchmarks, suggesting that the gap between SSMs and attention on natural language could be closed if a way were found to make state-space dynamics input-dependent.^[23]

Across the S4 → DSS → S4D → S5 → H3 sequence, the architectural simplifications progressively reduced the complexity of the structured state-space layer, from S4's diagonal-plus-low-rank parameterization to fully diagonal SSMs and finally, in Mamba-2, to scalar-times-identity transition matrices, without sacrificing expressivity once the parameters were made input-dependent and the kernels were paired with hardware-aware scans.^[4]^[21]^[22]^[6] This trajectory is sometimes summarized as the demonstration that the expressive power of structured SSMs comes primarily from selectivity and from depth, not from a complex per-layer transition matrix.^[6]^[25]

Mamba and Mamba-2

By 2023, the principal limitation of S4 and its diagonal variants was their reliance on linear time-invariant dynamics: the same recurrence was applied to every input, which prevented the model from selectively focusing on or ignoring particular tokens, a basic capability of attention. In December 2023 Gu and Tri Dao posted Mamba: Linear-Time Sequence Modeling with Selective State Spaces on arXiv.^[5]^[24] The paper introduced selective SSMs, in which the state-transition, input-projection, and output-projection parameters become input-dependent functions, allowing the model to gate information flow based on the content of each token.^[5]
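A toy version of selectivity can be sketched as follows. It assumes a single input channel, a diagonal state matrix with negative real entries, a step size Δ_t produced by a softplus of a linear function of the input, and the simplified discretization B̄_t = Δ_t B_t. The projection weights (`w_delta`, `b_delta`, `w_B`, `w_C`) are hypothetical scalars and vectors chosen for the example, not the parameterization of the released Mamba code.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def selective_ssm(u, A, w_delta, b_delta, w_B, w_C):
    """Toy single-channel selective SSM; returns (outputs, per-step states)."""
    h = [0.0] * len(A)
    ys, states = [], []
    for u_t in u:
        delta = softplus(w_delta * u_t + b_delta)   # input-dependent step size
        B_t = [w * u_t for w in w_B]                # input-dependent input projection
        C_t = [w * u_t for w in w_C]                # input-dependent output projection
        A_bar = [math.exp(delta * a) for a in A]    # per-token decay of the state
        h = [ab * hi + delta * bt * u_t for ab, hi, bt in zip(A_bar, h, B_t)]
        ys.append(sum(c * hi for c, hi in zip(C_t, h)))
        states.append(list(h))
    return ys, states

# Hypothetical toy parameters.
ys, states = selective_ssm(
    u=[1.0, 0.0, 2.0],
    A=[-1.0, -2.0],
    w_delta=1.0, b_delta=0.0,
    w_B=[1.0, 0.5], w_C=[1.0, 1.0],
)
print(ys)
print(states)
```

Because the decay and the projections depend on the current input, the model can write a token strongly into the state, let the state decay, or ignore a token, which a time-invariant recurrence cannot do.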

Because input-dependent parameters break the convolutional form that made S4 efficient, the authors paired the new recurrence with a hardware-aware parallel-scan implementation that uses kernel fusion and recomputation to keep memory traffic low on modern GPUs. Combined with a simplified architectural block that interleaves selective SSM layers with gating and projection rather than attention or MLP blocks, Mamba achieved roughly 5× higher inference throughput than Transformers of comparable size while matching or exceeding their quality on language, audio, and genomics benchmarks.^[5]^[25] Mamba was presented at the first Conference on Language Modeling (COLM) in 2024, where it received an Outstanding Paper Award.^[13]^[25]

In May 2024 Dao and Gu followed Mamba with Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality, presented at ICML 2024.^[6] The paper introduced the State Space Duality (SSD) framework, showing that a broad class of selective SSMs and a broad class of attention variants are equivalent computations on structured semiseparable matrices. From this duality the authors derived Mamba-2, a refinement of Mamba whose state-transition matrix is restricted to scalar-times-identity form. This simplification allowed Mamba-2 to be expressed simultaneously as a recurrence and as a matrix product, yielding training-time speedups of 2-8× over Mamba while preserving its inference-time advantages.^[6] The SSD framework also clarified the precise sense in which Mamba and attention are dual representations of the same underlying computation, providing a unifying view of two of the dominant sequence-modeling paradigms.^[6]^[25]
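The duality can be illustrated on a small example. With a scalar transition a_t, the recurrence h_t = a_t h_{t-1} + B_t x_t, y_t = C_t · h_t produces the same outputs as multiplying the input sequence by a lower-triangular matrix with entries M[i][j] = (C_i · B_j) · a_{j+1} ··· a_i, which resembles causally masked attention scores weighted by a decay mask. The sketch below checks this numerically for a single channel; all values and function names are invented for illustration, and it does not reproduce the block-decomposition algorithm of Mamba-2.

```python
def ssd_recurrent(a, B, C, x):
    """Recurrent form: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    h = [0.0] * len(B[0])
    ys = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * hi + b * x_t for hi, b in zip(h, B_t)]
        ys.append(sum(c * hi for c, hi in zip(C_t, h)))
    return ys

def ssd_matrix(a, B, C):
    """Matrix form: M[i][j] = (C_i . B_j) * prod_{k=j+1..i} a_k for j <= i."""
    length = len(a)
    M = [[0.0] * length for _ in range(length)]
    for i in range(length):
        decay = 1.0  # empty product when j == i
        for j in range(i, -1, -1):
            score = sum(c * b for c, b in zip(C[i], B[j]))
            M[i][j] = score * decay
            decay *= a[j]
    return M

# Hypothetical toy parameters (sequence length 4, state size 2).
a = [0.9, 0.8, 0.5, 1.0]
B = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0], [1.0, 1.0]]
C = [[1.0, 1.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
x = [1.0, -1.0, 2.0, 0.5]

M = ssd_matrix(a, B, C)
y_mat = [sum(m * x_j for m, x_j in zip(row, x)) for row in M]
print(y_mat)
print(ssd_recurrent(a, B, C, x))
```

The recurrent form costs time linear in sequence length, while the matrix form is quadratic but maps onto dense matrix multiplications.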

Gu and Dao's collaboration on Mamba reunited the pair after a period of largely separate work: their earlier joint publications were the HiPPO paper (2020) and the LSSL paper (2021).^[26] Whereas Gu is generally identified with the theoretical and architectural side of the structured state-space line, Dao is identified with the systems-and-kernels side (he is the lead author of FlashAttention and a principal contributor to the hardware-aware implementations of Mamba).^[5]^[26] The Mamba and Mamba-2 papers credit Gu with leading the architectural design of selective SSMs and the formulation of the SSD framework, and credit Dao with leading the hardware-aware implementations and many of the systems-level engineering decisions that made selective SSMs practical at scale on contemporary accelerators.^[5]^[6]

Mamba's release in late 2023 attracted unusually broad attention for an academic architecture paper: within weeks of the preprint's appearance, third-party reimplementations, scaling experiments, and surveys began to appear, and the Mamba block became a baseline against which subsequent efficient-sequence-model proposals were measured.^[7]^[25] By 2024 the architecture had been ported to domains including computer vision, computational biology, time-series forecasting, and reinforcement learning, often via hybrid models that interleave Mamba blocks with attention.^[7]^[25] Gu has consistently characterized the goal of the line as architectural diversification rather than wholesale replacement of attention, noting that hybrid models combining selective SSMs with small amounts of attention often outperform either pure approach.^[2]^[18]

Cartesia AI and audio applications

In 2023 Gu co-founded Cartesia AI together with Karan Goel (CEO), Arjun Desai, Brandon Yang, and his doctoral advisor Christopher Ré.^[10]^[27] The founding team had overlapped at the Stanford AI Lab during the development of HiPPO, S4, and their successors, and the company was incorporated to commercialize the line of structured state-space models the team had developed.^[10] Gu holds the title of Chief Scientist.^[11]^[10]

Cartesia's central technical bet is that selective state-space models are unusually well suited to real-time, streaming applications because they admit a constant-memory recurrent inference mode in addition to their parallel training mode. This duality, also formalized in Gu and Dao's SSD paper, means that an SSM can be trained in parallel like a Transformer but deployed as a streaming RNN, processing each new input in constant time and memory regardless of the length of the conversation so far.^[6]^[28] For voice interfaces, where end-to-end latency is the dominant user-experience constraint, that property is a substantial advantage over autoregressive Transformer decoders whose per-token cost grows linearly with context.^[28]
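The memory argument can be made concrete with a schematic comparison. The sketch below is a toy model, not Cartesia's system: `StreamingSSM` keeps a fixed-size state that is overwritten at every step, while `GrowingCache` stands in for a Transformer decoder's key/value cache, which stores one entry per past token. The averaging in the cache is only a placeholder for attention over the history. Both class names and all parameter values are hypothetical.

```python
class StreamingSSM:
    """Diagonal SSM in recurrent mode: constant memory per stream."""

    def __init__(self, a_bar, b_bar, c):
        self.a_bar, self.b_bar, self.c = a_bar, b_bar, c
        self.state = [0.0] * len(a_bar)

    def step(self, u_t):
        # Overwrite the state in place of appending to a history.
        self.state = [
            a * s + b * u_t for a, s, b in zip(self.a_bar, self.state, self.b_bar)
        ]
        return sum(c * s for c, s in zip(self.c, self.state))


class GrowingCache:
    """Placeholder for a key/value cache: one stored entry per token."""

    def __init__(self):
        self.entries = []

    def step(self, u_t):
        self.entries.append(u_t)
        # Stand-in for attention: touches every stored entry.
        return sum(self.entries) / len(self.entries)


ssm = StreamingSSM(a_bar=[0.9, 0.5], b_bar=[1.0, 1.0], c=[0.3, 0.7])
cache = GrowingCache()
for t in range(1000):
    ssm.step(float(t % 3))
    cache.step(float(t % 3))

print(len(ssm.state), len(cache.entries))
```

After any number of steps the SSM still holds two numbers of state, whereas the cache holds one entry per token processed and its per-step work grows with it.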

The company's flagship product line is the Sonic family of text-to-speech and conversational voice models. The first version, Sonic, was launched on 31 May 2024 and was advertised as the fastest voice generation model of its class, with a time-to-first-audio in the 100-millisecond range.^[28] Successive versions (Sonic 2 and Sonic-3, the latter current as of 2026) further reduced latency, expanded the model to over forty languages, and added features such as expressive emotion, AI-generated laughter, and enterprise-grade accuracy on acronyms and identifiers.^[29]^[30] In March 2025 Cartesia announced a $64 million Series A round led by Kleiner Perkins, with participation from Index Ventures, Lightspeed, A*, Factory, Greycroft, Dell Technologies Capital, and Samsung Ventures.^[31] By 2026 the company reported tens of thousands of customers and engineering integrations with major enterprise and consumer-electronics buyers.^[10]^[31]

CMU faculty role

Gu joined the Machine Learning Department at Carnegie Mellon University in 2024 as an Assistant Professor.^[2]^[9] At CMU he leads the Goomba Lab, a research group focused on the foundations of deep learning, sequence modeling, and structured architectures.^[9] The lab's stated agenda covers theoretical and empirical aspects of deep learning, with current emphasis on deep sequence models, a continuation of the program begun during Gu's PhD.^[9] His CMU appointment is unusual in that it coincides with his role as Chief Scientist of a venture-backed startup, with the two activities sharing scientific subject matter (efficient sequence modeling) but separate institutional contexts.^[11]^[2]

Gu also continues to publish actively with collaborators outside CMU. In 2026, the Mamba line was extended in Mamba-3: Improved Sequence Modeling using State Space Principles, an architectural redesign aimed at faster inference and improved language-modeling perplexity at a fraction of the decoding cost of strong Transformer baselines.^[32] The Goomba Lab has also published a series of expository blog posts on the SSD framework and on the practical design of selective SSMs, intended to make the underlying mathematics accessible to a broader machine-learning audience.^[24]

Selected publications

A. Gu, T. Dao, S. Ermon, A. Rudra, C. Ré. HiPPO: Recurrent Memory with Optimal Polynomial Projections. NeurIPS 2020.^[3]
A. Gu, I. Johnson, K. Goel, K. Saab, T. Dao, A. Rudra, C. Ré. Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers. NeurIPS 2021.^[19]
A. Gu, K. Goel, C. Ré. Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022 (Outstanding Paper Honorable Mention).^[4]
A. Gu, K. Goel, A. Gupta, C. Ré. On the Parameterization and Initialization of Diagonal State Space Models. NeurIPS 2022.^[21]
A. Gu, T. Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. COLM 2024 (Outstanding Paper Award).^[5]^[13]
T. Dao, A. Gu. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024.^[6]
A. Gu. Modeling Sequences with Structured State Spaces. PhD dissertation, Stanford University, 2023.^[8]

References

[1] Carnegie Mellon University, Machine Learning Department, Albert Gu faculty profile. https://www.ml.cmu.edu/people/core-faculty-people/agu
[2] Cognitive Revolution, "The State Space Model Revolution, with Albert Gu". https://www.cognitiverevolution.ai/the-state-space-model-revolution-with-albert-gu/
[3] A. Gu, T. Dao, S. Ermon, A. Rudra, C. Ré, "HiPPO: Recurrent Memory with Optimal Polynomial Projections," NeurIPS 2020 (arXiv:2008.07669). https://arxiv.org/abs/2008.07669
[4] A. Gu, K. Goel, C. Ré, "Efficiently Modeling Long Sequences with Structured State Spaces," ICLR 2022 (arXiv:2111.00396). https://arxiv.org/abs/2111.00396
[5] A. Gu, T. Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," arXiv:2312.00752, December 2023. https://arxiv.org/abs/2312.00752
[6] T. Dao, A. Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality," ICML 2024 (arXiv:2405.21060). https://arxiv.org/abs/2405.21060
[7] Wikipedia, "Mamba (deep learning architecture)". https://en.wikipedia.org/wiki/Mamba_(deep_learning_architecture)
[8] A. Gu, "Modeling sequences with structured state spaces," PhD dissertation, Stanford University, 2023. https://stacks.stanford.edu/file/druid:mb976vf9362/gu_dissertation-augmented.pdf
[9] Goomba Lab, group website led by Albert Gu, Carnegie Mellon University. https://goombalab.github.io/
[10] Cartesia AI, Company page. https://cartesia.ai/company
[11] Crunchbase, "Albert Gu: Chief Scientist & Co-Founder @ Cartesia". https://www.crunchbase.com/person/albert-gu-da9e
[12] TIME, "Albert Gu: The 100 Most Influential People in AI 2024". https://time.com/collections/time100-ai-2024/7012853/albert-gu/
[13] Conference on Language Modeling (COLM) 2024, Accepted Papers and awards. https://2024.colmweb.org/AcceptedPapers.html
[14] Carnegie Mellon University, "Carnegie Mellon Places Fifth in 2014 Putnam Mathematics Competition," April 2015. https://www.cmu.edu/news/stories/archives/2015/april/fifth-in-putnam.html
[15] The Saratoga Falcon, "From the archives: 'Champ Gu' conquers at the IOI". https://saratogafalcon.org/2778/news/archives-champ-gu-conquers-ioi/
[16] Carnegie Mellon School of Computer Science, "Carnegie Mellon Places Fifth in 2014 Putnam Mathematics Competition". https://www.scs.cmu.edu/news/2015/carnegie-mellon-places-fifth-2014-putnam-mathematics-competition
[17] Carnegie Mellon University, "Mathletes," Spring 2012. https://www.cmu.edu/homepage/society/2012/spring/mathletes.shtml
[18] Tower Research Capital, "Exploring State Space Models with Dr. Albert Gu". https://tower-research.com/exploring-state-space-models-with-dr-albert-gu/
[19] A. Gu, I. Johnson, K. Goel, K. Saab, T. Dao, A. Rudra, C. Ré, "Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers," NeurIPS 2021 (arXiv:2110.13985). https://arxiv.org/abs/2110.13985
[20] A. Gupta, A. Gu, J. Berant, "Diagonal State Spaces are as Effective as Structured State Spaces," NeurIPS 2022 (arXiv:2203.14343). https://arxiv.org/abs/2203.14343
[21] A. Gu, K. Goel, A. Gupta, C. Ré, "On the Parameterization and Initialization of Diagonal State Space Models," NeurIPS 2022 (arXiv:2206.11893). https://arxiv.org/abs/2206.11893
[22] J. T. H. Smith, A. Warrington, S. W. Linderman, "Simplified State Space Layers for Sequence Modeling," ICLR 2023 (arXiv:2208.04933). https://arxiv.org/abs/2208.04933
[23] D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, C. Ré, "Hungry Hungry Hippos: Towards Language Modeling with State Space Models," ICLR 2023 (arXiv:2212.14052). https://arxiv.org/abs/2212.14052
[24] Goomba Lab blog, "State Space Duality (Mamba-2) Part I: The Model". https://goombalab.github.io/blog/2024/mamba2-part1-model/
[25] IBM Research, "What Is A Mamba Model?". https://www.ibm.com/think/topics/mamba-model
[26] A. Gu (@_albertgu), X post on resuming joint work with Tri Dao, December 2023. https://x.com/_albertgu/status/1731728789502140909
[27] Index Ventures, "Building the Next Generation of Real-Time AI Models: Our Investment in Cartesia". https://www.indexventures.com/perspectives/building-the-next-generation-of-real-time-ai-models-our-investment-in-cartesia/
[28] Cartesia AI, "Announcing Sonic: a low-latency voice model for lifelike speech," 31 May 2024. https://cartesia.ai/blog/sonic
[29] Cartesia AI, "Real-time TTS API with AI laughter and emotion: Cartesia Sonic-3". https://cartesia.ai/sonic
[30] Cartesia Docs, "Sonic 3". https://docs.cartesia.ai/build-with-cartesia/tts-models/latest
[31] Fortune, "Exclusive: Cartesia, voice AI startup, raises $64 million Series A," 11 March 2025. https://fortune.com/2025/03/11/exclusive-cartesia-voice-ai-startup-raises-64-million-series-a/
[32] Princeton Language and Intelligence, "Mamba-3: Improved Sequence Modeling using State Space Principles," 2026. https://pli.princeton.edu/blog/2026/mamba-3-improved-sequence-modeling-using-state-space-principles
