Tri Dao

Updated Jul 23, 2026

Tri Dao is a computer scientist who created the FlashAttention family of GPU attention algorithms and co-created the Mamba selective state-space architecture, two of the most widely adopted efficiency techniques in modern deep learning. Born in Vietnam, he is an assistant professor of computer science at Princeton University, a position he took up in September 2024, and is concurrently co-founder and chief scientist of the AI cloud and research company Together AI.^[1]^[2]^[3] He directs the Dao AI Lab at Princeton, which he describes as working on "machine learning and systems, with a focus on efficient training and inference," "hardware-aware algorithms," and "sequence models with long-range memory."^[1]

Dao is best known for two lines of work. The first is the FlashAttention family of algorithms, comprising FlashAttention (2022), FlashAttention-2 (2023), FlashAttention-3 (2024), and FlashAttention-4 (2026), a sequence of exact, IO-aware GPU kernels for the attention operation that became a de facto standard in modern transformer training and inference stacks.^[4]^[5]^[6]^[7] The second is the Mamba family of selective state space model architectures, developed with collaborator Albert Gu, beginning with Mamba in late 2023 and continuing with Mamba-2 (the structured state space duality framework) in 2024.^[8]^[9]

His work has been recognized by an Outstanding Paper runner-up at the International Conference on Machine Learning (ICML) in 2022 for the Monarch paper, an Outstanding Paper award at the inaugural Conference on Language Modeling (COLM) in 2024 for Mamba, an Outstanding Paper Honorable Mention at MLSys 2025, a Best Paper Honorable Mention at MLSys 2026, the inaugural Stanford Open Source Software Prize (2024), the Schmidt Sciences AI2050 Early Career Fellowship (2025), a Google Research Scholar award (2025), and a Google ML and Systems Junior Faculty Award (2025).^[1]^[10]

Key facts

Field	Value
Born	Vietnam (year not publicly disclosed)
Nationality	Vietnamese; based in the United States
Education	B.S., Stanford University; Ph.D. in Computer Science, Stanford University (2023)
Doctoral advisors	Christopher Re and Stefano Ermon
Current positions	Assistant Professor of Computer Science, Princeton University (since September 2024); Co-founder and Chief Scientist, Together AI (since July 2023)
Lab	Dao AI Lab, Princeton University
Notable work	FlashAttention 1/2/3/4; Mamba; Mamba-2 (structured state space duality); Monarch matrices
Selected honors	ICML Outstanding Paper runner-up (2022); Stanford Open Source Software Prize (2024); COLM Outstanding Paper (2024); MLSys Outstanding Paper Honorable Mention (2025); MLSys Best Paper Honorable Mention (2026); Schmidt Sciences AI2050 Fellow (2025); Google Research Scholar (2025); Google ML and Systems Junior Faculty Award (2025)
Citations (Google Scholar, May 2026)	approximately 33,000; h-index near 38
Website	tridao.me

Early life and education

Tri Dao was born in Vietnam and moved to the United States for his university studies.^[11] He matriculated at Stanford University as an undergraduate around 2012-2013 and, by his own account on the Latent Space podcast, initially intended to major in economics before switching toward mathematics after taking introductory math classes in his first weeks of college; this redirection ultimately led him into computer science.^[11] He earned a Bachelor of Science from Stanford, as confirmed by Princeton University's official announcement of his faculty appointment, which lists both a "Ph.D. and a B.S. from Stanford University".^[3]

He remained at Stanford for graduate study, entering the Computer Science Ph.D. program in 2016.^[12] He completed his doctorate in 2023, co-advised by Christopher Re of the Hazy Research lab and Stefano Ermon.^[1]^[13] His doctoral dissertation, titled Hardware-aware Algorithms for Efficient Machine Learning, gathers together his Stanford work on structured-matrix methods (butterfly and Monarch matrices), sparsity-aware training, and IO-aware attention algorithms, all themed around co-designing algorithms with the memory hierarchy and parallelism of modern accelerators.^[14]

What did Tri Dao work on during his Stanford Ph.D.?

Dao's doctoral research at Stanford, conducted within Christopher Re's Hazy Research group, focused on co-designing machine learning algorithms with the hardware on which they actually run. A recurring theme is exploiting structure (sparsity, low rank, block-diagonal factorizations, or fixed access patterns) to reduce wall-clock cost while preserving model quality.^[14]

Two early threads from this period are particularly important context for his later work:

Butterfly and Monarch matrices. With Beidi Chen, Atri Rudra, Christopher Re and others, Dao developed expressive structured matrix classes that can replace dense linear layers in neural networks while admitting hardware-efficient implementations. The Monarch paper, Monarch: Expressive Structured Matrices for Efficient and Accurate Training, appeared at ICML 2022 and was recognized as an Outstanding Paper runner-up at that conference.^[15]^[10] A related 2022 paper, Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models (ICLR 2022), refined earlier butterfly-matrix constructions for better hardware utilization.^[16]
State space models and the S4 line. Albert Gu was the primary architect of the S4 family of structured state space sequence models during his own Stanford Ph.D., and Dao was a frequent collaborator on the Hazy Research SSM work, which formed the technical and intellectual basis for the later Mamba architecture.^[17]

This combination of structured-matrix thinking and SSM thinking, fused with a deep concern for what GPUs are actually fast at, motivated the central contribution of his thesis: FlashAttention.

What is FlashAttention?

FlashAttention is, in Dao and his co-authors' own words, "an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM."^[4] It is the centerpiece of Dao's attention-efficiency line of work and is exact rather than approximate, meaning it computes the same output as standard softmax attention.

FlashAttention (2022)

In May 2022, Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra and Christopher Re published FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness on arXiv (2205.14135).^[4] Its key insight is that, on modern GPUs, the cost of attention is dominated not by floating-point operations but by reads and writes between high-bandwidth memory (HBM) and on-chip SRAM. FlashAttention tiles the computation so that small blocks of queries, keys and values are loaded into SRAM, and it computes the softmax block by block in a fused, numerically stable way using the online-softmax technique, so the full attention matrix is never written to HBM. This asymptotically reduces the number of HBM accesses, and the algorithm trains substantially faster than standard implementations while using less memory.^[4] The paper reports a 15% end-to-end wall-clock speedup on BERT-large at sequence length 512 versus the MLPerf 1.1 training record, a 3 times speedup on GPT-2 at sequence length 1K, and a 2.4 times speedup on the Long Range Arena benchmark at sequence lengths 1K-4K.^[4] It also enabled the first Transformers to solve the Path-X task at sequence length 16K (61.4% accuracy) and Path-256 at sequence length 64K (63.1% accuracy).^[4]
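The tiling and online-softmax idea can be illustrated with a minimal NumPy sketch. This is not the FlashAttention kernel: it runs on the CPU, tiles only over keys and values, and says nothing about HBM or SRAM traffic. The function names and the `block_size` parameter are hypothetical, chosen for illustration. What it does show is that attention can be accumulated one block at a time, with a running maximum and running denominator, and still match standard softmax attention.

```python
import numpy as np

def reference_attention(q, k, v):
    """Standard softmax attention that materializes the full score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block_size=4):
    """Exact attention accumulated one key/value block at a time."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))      # unnormalized output accumulator
    row_max = np.full((n, 1), -np.inf)    # running max score per query
    row_sum = np.zeros((n, 1))            # running softmax denominator
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale    # only an (n, block) tile of scores
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)   # rescales earlier blocks
        p = np.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 8))
k = rng.standard_normal((10, 8))
v = rng.standard_normal((10, 5))
print(np.allclose(tiled_attention(q, k, v), reference_attention(q, k, v)))
```

The rescaling by `correction` is what keeps the result exact: whenever a later block raises the running maximum, the contributions already accumulated are scaled down to the new reference point.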

FlashAttention appeared at NeurIPS 2022 and won the Best Paper award at the 2022 ICML workshop on Hardware-Aware Efficient Training.^[1] It was rapidly adopted in production training and inference stacks: by 2023 it was integrated into mainline PyTorch, the Hugging Face Transformers library, and inference engines such as vLLM, and it came to be treated as a default building block for large language models trained from late 2022 onward.^[18]

FlashAttention-2 (2023)

In July 2023 Dao released FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv:2307.08691) as a sole-author technical report, his first major release since joining Together AI as chief scientist.^[5]^[2] FlashAttention-2 retains the IO-aware tiling of the original but rewrites the work partitioning across thread blocks and warps to better match GPU execution units, reducing non-matmul work and improving parallelism along the sequence dimension. The paper reports roughly 2 times speedup over FlashAttention, reaching 50-73% of theoretical peak FLOPs/s on NVIDIA A100, with end-to-end training throughputs around 225 TFLOPs/s per GPU (approximately 72% model FLOPs utilization).^[5] It was published at ICLR 2024.^[19]

FlashAttention-3 (2024)

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (arXiv:2407.08608, July 2024) was written by Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Dao, a collaboration between Dao and engineers from NVIDIA, Together AI, Colfax Research, and Meta.^[6] FlashAttention-3 targets NVIDIA's Hopper architecture (H100). It exploits warp specialization and the asynchrony of Tensor Cores and the Tensor Memory Accelerator (TMA), interleaves block-wise matmul and softmax, and uses block quantization and incoherent processing to support FP8. The paper reports a 1.5-2.0 times speedup over FlashAttention-2 on H100, reaching around 740 TFLOPs/s in FP16 (about 75% of theoretical max) and close to 1.2 PFLOPs/s in FP8, while reducing FP8 numerical error roughly 2.6 times relative to a baseline FP8 attention implementation.^[6] It was published at NeurIPS 2024.^[19]
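The utilization percentages quoted for FlashAttention-2 and FlashAttention-3 are simply achieved throughput divided by the GPU's peak. The short check below reproduces them under one assumption that is not stated in the article: peak dense FP16/BF16 Tensor Core throughput of about 312 TFLOPs/s for an A100 and about 989 TFLOPs/s for an H100 SXM, taken from NVIDIA's published specifications. The helper name is hypothetical.

```python
# Assumed peak dense FP16/BF16 Tensor Core throughput in TFLOPs/s.
# These figures come from NVIDIA's public specifications, not from the article.
PEAK_TFLOPS = {"A100": 312.0, "H100": 989.0}

def utilization(achieved_tflops, gpu):
    """Fraction of the assumed hardware peak that a throughput represents."""
    return achieved_tflops / PEAK_TFLOPS[gpu]

fa2_training = utilization(225.0, "A100")  # FlashAttention-2 end-to-end training
fa3_fp16 = utilization(740.0, "H100")      # FlashAttention-3 FP16 throughput

print(f"FlashAttention-2 on A100: {fa2_training:.0%}")  # prints 72%
print(f"FlashAttention-3 on H100: {fa3_fp16:.0%}")      # prints 75%
```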

FlashAttention-4 (2026)

A fourth generation, FlashAttention-4: Algorithm and Kernel Pipelining Co-design for Asymmetric Hardware Scaling, was published at MLSys 2026 with co-authors Ted Zadouri, Jay Shah, Markus Hohnerbach and others, with Dao as senior author. It received a Best Paper Honorable Mention at that venue.^[19]^[1]

Across the series, the open-source flash-attention library (hosted on GitHub under the Dao-AILab organization) became one of the most widely deployed pieces of AI infrastructure of the early 2020s, and was a primary citation for Dao's inaugural Stanford Open Source Software Prize in 2024.^[1]^[20]

What is Mamba and how does it differ from a Transformer?

In parallel with the FlashAttention line, Dao co-developed a family of recurrent, attention-free architectures based on structured state space models, in close collaboration with Albert Gu (then a fellow Stanford Ph.D. student and later assistant professor at Carnegie Mellon University).^[17] The defining difference from a transformer is that Mamba contains no softmax-attention layers: it scales linearly rather than quadratically in sequence length, yet was the first SSM-based architecture to match or exceed Transformer perplexity on standard language-modeling benchmarks at the multi-billion-parameter scale.^[8]

Mamba (2023)

Mamba: Linear-Time Sequence Modeling with Selective State Spaces, by Albert Gu and Tri Dao, was posted to arXiv on 1 December 2023 (arXiv:2312.00752).^[8] Mamba builds on the S4 line of state space models but introduces an input-dependent selection mechanism: the SSM parameters (the discrete time step and the matrices controlling state evolution) are made functions of the current token, so the model can selectively propagate or forget information along the sequence. This breaks linear time invariance and substantially improves the language-modeling capability of pure SSMs, which had previously underperformed Transformers at scale.^[8]
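The selection mechanism can be sketched as a plain sequential recurrence. The code below is a simplified illustration, not the Mamba implementation: it omits the gating, convolution and fused scan of the real layer, and the function name, projection matrices (`W_B`, `W_C`, `W_dt`) and shapes are assumptions made for the example. The point it shows is that the step size and the input and output matrices are computed from the current token, so the state update differs at every position.

```python
import numpy as np

def softplus(z):
    """Smooth positive function used to keep the step size above zero."""
    return np.log1p(np.exp(z))

def selective_ssm(x, A, W_B, W_C, W_dt, b_dt):
    """Sequential selective state space recurrence (illustrative only).

    x: (L, D) token features; A: (D, N) negative diagonal state matrix.
    """
    L, D = x.shape
    N = A.shape[1]
    delta = softplus(x @ W_dt + b_dt)   # (L, D) input-dependent step size
    B = x @ W_B                         # (L, N) input-dependent input matrix
    C = x @ W_C                         # (L, N) input-dependent output matrix
    h = np.zeros((D, N))                # one N-dimensional state per channel
    y = np.zeros((L, D))
    for t in range(L):
        decay = np.exp(delta[t][:, None] * A)   # how much state to keep
        h = decay * h + (delta[t] * x[t])[:, None] * B[t][None, :]
        y[t] = h @ C[t]
    return y

rng = np.random.default_rng(0)
L, D, N = 5, 3, 4
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))        # negative entries give decay
W_B = rng.standard_normal((D, N))
W_C = rng.standard_normal((D, N))
W_dt = rng.standard_normal((D, D))
b_dt = np.zeros(D)
y = selective_ssm(x, A, W_B, W_C, W_dt, b_dt)
print(y.shape)
```

Because `decay` depends on the token through `delta`, a large step size lets a token overwrite the state while a small one lets the state pass through almost unchanged, which is the "selectively propagate or forget" behavior described above.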

The authors also designed a hardware-aware parallel scan algorithm that lets Mamba reach throughput competitive with FlashAttention-equipped Transformers while preserving linear-time scaling in sequence length, and that delivers a roughly five times improvement in inference throughput at long context.^[8]
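A recurrence of the form h_t = a_t * h_(t-1) + b_t looks inherently sequential, but it can be evaluated with a parallel scan because composing two such steps is associative. The sketch below uses a generic textbook doubling scan in NumPy to show this. It is not Mamba's GPU kernel, and the function names are hypothetical.

```python
import numpy as np

def sequential_scan(a, b):
    """Reference: h[t] = a[t] * h[t-1] + b[t], starting from h = 0."""
    h = 0.0
    out = np.empty_like(b)
    for t in range(len(b)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

def doubling_scan(a, b):
    """Same recurrence in about log2(n) vectorized steps.

    Each position holds a pair (a, b) summarizing a segment of steps.
    Combining an earlier segment (a1, b1) with a later one (a2, b2)
    gives (a1 * a2, a2 * b1 + b2), which is an associative operation.
    """
    a = a.astype(float)
    b = b.astype(float)
    shift = 1
    while shift < len(b):
        new_a, new_b = a.copy(), b.copy()
        # Position i absorbs the segment that ends at position i - shift.
        new_b[shift:] = a[shift:] * b[:-shift] + b[shift:]
        new_a[shift:] = a[shift:] * a[:-shift]
        a, b = new_a, new_b
        shift *= 2
    return b

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=7)
b = rng.standard_normal(7)
print(np.allclose(doubling_scan(a, b), sequential_scan(a, b)))
```

Every update inside one loop iteration is independent of the others, which is what makes the scan suitable for parallel hardware.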

Mamba received an Outstanding Paper award at the inaugural Conference on Language Modeling (COLM) in 2024.^[1] It sparked a large follow-up literature in 2024-2025 on selective SSMs and hybrid Transformer/SSM architectures.

Mamba-2 and structured state space duality (2024)

In Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (arXiv:2405.21060, ICML 2024), Dao and Gu show that a broad class of selective SSMs and a particular family of attention variants are two views of the same underlying object, products of structured semiseparable matrices, and that algorithms designed for one side translate to the other.^[9] The paper introduces the structured state space duality (SSD) framework and the Mamba-2 model, whose core layer (the SSD layer) refines Mamba's selective SSM and is 2-8 times faster on modern GPUs while preserving language-modeling quality. SSD makes pure SSMs matmul-friendly in a way that more directly leverages Tensor Cores, narrowing the gap between SSM training efficiency and Transformer training efficiency.^[9]
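The duality can be checked numerically in a small case. The sketch below assumes a single input channel and a scalar decay per time step, and it uses hypothetical function names. It computes the same outputs twice: once as a recurrence over a hidden state, and once as a single lower-triangular matrix applied to the input, where the matrix is an attention-like product of C and B masked by cumulative decays. It is an illustration of the idea, not the Mamba-2 algorithm.

```python
import numpy as np

def ssm_recurrent(x, a, B, C):
    """Linear-time view: h[t] = a[t] * h[t-1] + B[t] * x[t], y[t] = C[t] . h[t]."""
    L, N = B.shape
    h = np.zeros(N)
    y = np.zeros(L)
    for t in range(L):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssm_as_masked_matmul(x, a, B, C):
    """Quadratic view: y = M @ x with M[i, j] = (C[i] . B[j]) * a[j+1] * ... * a[i]."""
    cum = np.cumsum(np.log(a))
    # decay[i, j] is the product of a[j+1..i] for i >= j, and zero above the diagonal.
    decay = np.tril(np.exp(cum[:, None] - cum[None, :]))
    M = decay * (C @ B.T)   # C plays the role of queries, B of keys
    return M @ x

rng = np.random.default_rng(0)
L, N = 6, 4
x = rng.standard_normal(L)
a = rng.uniform(0.5, 1.0, size=L)   # positive decays so the logarithm is defined
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
print(np.allclose(ssm_recurrent(x, a, B, C), ssm_as_masked_matmul(x, a, B, C)))
```

The matrix form is dominated by matrix multiplications, which is the property that lets this family of models make more direct use of Tensor Cores.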

Subsequent work

Dao and collaborators have continued to publish on the SSM/hybrid architecture line, including work on hybrid Transformer-SSM systems for long-context language modeling and, in 2025-2026, Mamba-3: Improved Sequence Modeling Using State Space Principles (ICLR 2026), as well as systems work such as Marconi: Prefix Caching for the Era of Hybrid LLMs (MLSys 2025, Outstanding Paper Honorable Mention).^[1]^[10]

What is Tri Dao's role at Together AI?

In July 2023, immediately after completing his Stanford Ph.D., Dao joined Together AI as Chief Scientist; he is described by the company as a co-founder and "founding chief scientist". Together AI was founded in June 2022 by Vipul Ved Prakash (CEO), Ce Zhang, Christopher Re and Percy Liang, and Dao came on board roughly a year later, alongside the public release of FlashAttention-2.^[2]^[21]

At Together AI, Dao leads research on model architecture and training and inference algorithms. The company brands itself as an "AI Acceleration Cloud" focused on open-source models and high-throughput training and inference, and Dao's work (FlashAttention, Mamba and follow-ons) supplies a substantial fraction of the underlying algorithmic stack. Together AI raised a US$102.5 million funding round in 2023 and subsequent rounds in 2024, with Dao's chief-scientist appointment prominently featured in its press materials and partner communications.^[21]^[2]

Dao has continued to publish actively from Together AI in parallel with his Princeton role; the FlashAttention-3 paper, for instance, was a multi-institution collaboration that included Together AI.^[6]

When did Tri Dao join Princeton?

Princeton University's Board of Trustees approved Dao's appointment as Assistant Professor in the Department of Computer Science at its January 31, 2024 meeting, with a start date of September 2024.^[3] His Princeton CS faculty profile lists him as Assistant Professor and notes that he holds a Ph.D. in Computer Science from Stanford (2023).^[22]

At Princeton he leads the Dao AI Lab, which works on efficient training and inference, hardware-aware algorithms, and sequence models with long-range memory.^[1] He has built a sizable group; his personal site as of 2025-2026 lists approximately seven doctoral students, several jointly advised with collaborators at other institutions.^[1] He is also affiliated with Princeton Language and Intelligence (PLI), the university's AI initiative, and contributes to its public-facing technical writing on topics including Mamba-2.^[23]

In November 2025, the Princeton Office of the Dean of the Faculty announced that Dao had been named an Early Career Fellow in the Schmidt Sciences AI2050 program, alongside structural biologist Ellen Zhong, with funding to pursue AI systems that combine novel architectures and tool use to attempt expert-level scientific discovery in domains where labeled data is scarce.^[24]^[10]

Public profile

Dao maintains an active public profile in the AI research community. He posts research updates on X (Twitter) at the handle @tri_dao and publishes open-source software through the tridao and Dao-AILab GitHub accounts, where the flash-attention and mamba code lives.^[25]^[20]

He has appeared as a guest on prominent AI podcasts, including Latent Space: The AI Engineer Podcast (July 2023, on FlashAttention-2 and Together AI) and Imbue's Generally Intelligent podcast (February 2024, on FlashAttention, sparsity, quantization, and efficient inference).^[26]^[11] He has been a speaker at TEDAI San Francisco, MLSys, COLM, NeurIPS, the Stanford MLSys Seminar, the Kempner Institute at Harvard, and MIT CSAIL, typically on hardware-aware algorithms and modern AI architectures.^[27]^[28]

Within the field he is often grouped with a cohort of "ML systems" researchers, alongside collaborators such as Albert Gu, Daniel Fu, Michael Poli, and his former advisor Christopher Re, who emphasize co-design of architectures and accelerators rather than treating GPUs as black-box compute.

Selected publications

The following list, in chronological order, gives a representative selection of Dao's most-cited and most-discussed papers.

Tri Dao, Beidi Chen, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, Christopher Re. Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models. International Conference on Learning Representations (ICLR), 2022.^[16]
Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Re. Monarch: Expressive Structured Matrices for Efficient and Accurate Training. ICML 2022. Outstanding Paper runner-up.^[15]^[10]
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Re. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. arXiv:2205.14135.^[4]
Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691, 2023; ICLR 2024.^[5]
Albert Gu, Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752, 2023; COLM 2024, Outstanding Paper.^[8]^[1]
Tri Dao, Albert Gu. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2). ICML 2024. arXiv:2405.21060.^[9]
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. NeurIPS 2024. arXiv:2407.08608.^[6]
Rui Pan, Zhuang Wang, Zhen Jia and co-authors including Tri Dao. Marconi: Prefix Caching for the Era of Hybrid LLMs. MLSys 2025. Outstanding Paper Honorable Mention.^[1]
Ted Zadouri, Jay Shah, Markus Hohnerbach and co-authors including Tri Dao. FlashAttention-4: Algorithm and Kernel Pipelining Co-design for Asymmetric Hardware Scaling. MLSys 2026. Best Paper Honorable Mention.^[19]^[1]
Aakash Lahoti, Kevin Li, Berlin Chen and co-authors including Tri Dao. Mamba-3: Improved Sequence Modeling Using State Space Principles. ICLR 2026.^[1]

As of May 2026 his Google Scholar profile reports approximately 33,000 citations, an h-index near 38, and an i10-index near 52, dominated by the Mamba and FlashAttention papers.^[29]

References

^[1] Tri Dao, personal website. tridao.me
^[2] "Introducing Together AI Chief Scientist Tri Dao, as he releases FlashAttention-2 to speed up model training and inference." Together AI blog, 17 July 2023. together.ai/...tri-dao-flash-attention
^[3] "Board approves six faculty appointments." Princeton University news, 31 January 2024. princeton.edu/...approves-six-faculty-appointments
^[4] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Re. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." arXiv:2205.14135, 27 May 2022. arxiv.org/...2205.14135
^[5] Tri Dao. "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." arXiv:2307.08691, 17 July 2023. arxiv.org/...2307.08691
^[6] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision." arXiv:2407.08608, 11 July 2024. arxiv.org/...2407.08608
^[7] "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision." PyTorch blog, 2024. pytorch.org/...flashattention-3
^[8] Albert Gu, Tri Dao. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752, 1 December 2023. arxiv.org/...2312.00752
^[9] Tri Dao, Albert Gu. "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." arXiv:2405.21060, 31 May 2024; ICML 2024. arxiv.org/...2405.21060
^[10] "Tri Dao." AI2050 Fellows, Schmidt Sciences. ai2050.schmidtsciences.org/...tri-dao
^[11] "FlashAttention 2: making Transformers 800% faster w/o approximation - with Tri Dao of Together AI." Latent Space podcast, 26 July 2023. latent.space/...flashattention
^[12] Tri Dao, OpenReview profile. openreview.net/profile
^[13] "Stefano Ermon Group People." Stanford Computer Science. cs.stanford.edu/...people
^[14] Tri Dao Phuc Quang. Hardware-aware Algorithms for Efficient Machine Learning. Ph.D. dissertation, Stanford University, 2023. books.google.com/...aware_Algorithms_for_Efficient
^[15] Tri Dao, Beidi Chen, Nimit Sohoni and co-authors. "Monarch: Expressive Structured Matrices for Efficient and Accurate Training." ICML 2022. proceedings.mlr.press/...dao22a
^[16] Tri Dao, Beidi Chen and co-authors. "Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models." ICLR 2022.
^[17] Albert Gu. Modeling Sequences with Structured State Spaces. Ph.D. dissertation, Stanford University, 2023. stacks.stanford.edu/...gu_dissertation-augmented.pdf
^[18] "FlashAttention." Hugging Face Transformers documentation. huggingface.co/...attention_interface
^[19] "Publications." Tri Dao personal website. tridao.me/publications
^[20] Dao-AILab/flash-attention GitHub repository. github.com/...flash-attention
^[21] "About Us." Together AI. together.ai/about-us
^[22] "Tri Dao." Princeton Computer Science faculty page. cs.princeton.edu/...td8762
^[23] "Mamba-2: Algorithms and Systems." Princeton Language and Intelligence blog. pli.princeton.edu/...mamba-2-algorithms-and-systems
^[24] "Tri Dao and Ellen Zhong named AI2050 Fellows by Schmidt Sciences." Princeton Office of the Dean of the Faculty news, 2025. dof.princeton.edu/...2050-fellows-schmidt-sciences
^[25] Tri Dao on X (Twitter), @tri_dao. x.com/tri_dao
^[26] "Tri Dao, Stanford: On FlashAttention and sparsity, quantization, and efficient inference." Imbue Generally Intelligent podcast, 8 February 2024. imbue.com/...2024-02-08-podcast-episode-33-tri-dao
^[27] "Tri Dao." TEDAI San Francisco speakers. tedai-sanfrancisco.ted.com/...tri-dao
^[28] "EECS Special Seminar: Tri Dao, 'Hardware-Aware Algorithms and Architectures for Modern AI'." MIT CSAIL. csail.mit.edu/...ithms-and-architectures-modern-ai
^[29] Tri Dao, Google Scholar profile. scholar.google.com/citations
