Tri Dao
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,027 words
Tri Dao (born in Vietnam) is a computer scientist whose research lies at the intersection of machine learning and computer systems. He is an assistant professor of computer science at Princeton University, a position he took up in September 2024, and is concurrently co-founder and chief scientist of the AI cloud and research company Together AI.[^1][^2][^3] He directs the Dao AI Lab at Princeton.[^1]
Dao is best known for two lines of work. The first is the FlashAttention family of algorithms — FlashAttention (2022), FlashAttention-2 (2023), FlashAttention-3 (2024), and FlashAttention-4 (2026) — a sequence of exact, IO-aware GPU kernels for the attention operation that became a de facto standard in modern transformer training and inference stacks.[^4][^5][^6][^7] The second is the Mamba family of selective state space model architectures, developed with collaborator Albert Gu, beginning with Mamba in late 2023 and continuing with Mamba-2 (the structured state space duality framework) in 2024.[^8][^9]
His work has been recognized by an Outstanding Paper runner-up at the International Conference on Machine Learning (ICML) in 2022 for the Monarch paper, an Outstanding Paper award at the inaugural Conference on Language Modeling (COLM) in 2024 for Mamba, an Outstanding Paper Honorable Mention at MLSys 2025, a Best Paper Honorable Mention at MLSys 2026, the inaugural Stanford Open Source Software Prize (2024), the Schmidt Sciences AI2050 Early Career Fellowship (2025), a Google Research Scholar award (2025), and a Google ML and Systems Junior Faculty Award (2025).[^1][^10]
| Field | Value |
|---|---|
| Born | Vietnam (year not publicly disclosed) |
| Nationality | Vietnamese; based in the United States |
| Education | B.S., Stanford University; Ph.D. in Computer Science, Stanford University (2023) |
| Doctoral advisors | Christopher Ré and Stefano Ermon |
| Current positions | Assistant Professor of Computer Science, Princeton University (since September 2024); Co-founder and Chief Scientist, Together AI (since July 2023) |
| Lab | Dao AI Lab, Princeton University |
| Notable work | FlashAttention 1/2/3/4; Mamba; Mamba-2 (structured state space duality); Monarch matrices |
| Selected honors | ICML Outstanding Paper runner-up (2022); Stanford Open Source Software Prize (2024); COLM Outstanding Paper (2024); MLSys Outstanding Paper Honorable Mention (2025); MLSys Best Paper Honorable Mention (2026); Schmidt Sciences AI2050 Fellow (2025); Google Research Scholar (2025); Google ML and Systems Junior Faculty Award (2025) |
| Website | tridao.me |
Tri Dao was born in Vietnam and moved to the United States for his university studies.[^11] He matriculated at Stanford University as an undergraduate around 2012-2013 and, by his own account on the Latent Space podcast, initially intended to major in economics before switching toward mathematics after taking introductory math classes in his first weeks of college; this redirection ultimately led him into computer science.[^11] He earned a Bachelor of Science from Stanford, as confirmed by Princeton University's official announcement of his faculty appointment, which lists both a "Ph.D. and a B.S. from Stanford University".[^3]
He remained at Stanford for graduate study, entering the Computer Science Ph.D. program in 2016.[^12] He completed his doctorate in 2023, co-advised by Christopher Ré of the Hazy Research lab and Stefano Ermon.[^1][^13] His doctoral dissertation, titled Hardware-aware Algorithms for Efficient Machine Learning, gathers his Stanford work on structured-matrix methods (butterfly and Monarch matrices), sparsity-aware training, and IO-aware attention algorithms, all built around co-designing algorithms with the memory hierarchy and parallelism of modern accelerators.[^14]
Dao's doctoral research at Stanford, conducted within Christopher Ré's Hazy Research group, focused on co-designing machine learning algorithms with the hardware on which they actually run. A recurring theme is exploiting structure (sparsity, low rank, block-diagonal factorizations, or fixed access patterns) to reduce wall-clock cost while preserving model quality.[^14]
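Two early threads from this period are particularly important context for his later work: structured-matrix methods (butterfly and Monarch matrices) and structured state space models (SSMs).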
This combination of structured-matrix thinking and SSM thinking, fused with a deep concern for what GPUs are actually fast at, motivated the central contribution of his thesis: FlashAttention.
In May 2022, Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra and Christopher Ré published FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness on arXiv (2205.14135).[^4] FlashAttention is an exact (not approximate) algorithm for the self-attention layer used in Transformers. Its key insight is that, on modern GPUs, attention is dominated not by floating-point operations but by reads and writes between high-bandwidth memory (HBM) and on-chip SRAM. FlashAttention tiles the computation so that small blocks of queries, keys and values are loaded into SRAM once, and it computes the softmax block by block in a fused, numerically stable way using the online softmax recurrence. This asymptotically reduces the number of HBM accesses, so the kernel trains substantially faster than standard implementations while using less memory.[^4] The paper reports a 15% end-to-end wall-clock speedup on BERT-large at sequence length 512 versus the MLPerf 1.1 training record, a roughly 3× speedup on GPT-2 at sequence length 1K, and a 2.4× speedup on the Long Range Arena benchmark at sequence lengths 1K-4K.[^4]
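The online softmax idea described above can be illustrated with a minimal pure-Python sketch for a single query vector. This is not the FlashAttention kernel: the function names, the block size, and the toy data are assumptions made for illustration, and nothing here models SRAM or HBM. It only shows that visiting keys and values one block at a time, while carrying a running maximum, a running normalizer, and an unnormalized output, yields the same result as materializing every score first.

```python
import math

def reference_attention(q, keys, values):
    """Standard softmax attention for one query; materializes all scores."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def tiled_attention(q, keys, values, block_size=2):
    """Same output, computed one key/value block at a time.

    Between blocks only three things are kept: the running max of the
    scores, the running softmax normalizer, and an unnormalized output.
    """
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data (hypothetical): one 2-d query, five keys and values.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

expected = reference_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
```

Because the rescaling step corrects earlier partial sums whenever a larger score appears, the block size affects only how much data is held at once, not the result.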
FlashAttention appeared at NeurIPS 2022 and won the Best Paper award at the 2022 ICML workshop on Hardware-Aware Efficient Training.[^1] It was rapidly adopted in production training and inference stacks; by 2023 it was integrated into mainline PyTorch, the Hugging Face Transformers library, and inference engines such as vLLM, and was widely credited as a default building block in essentially every large language model trained from late 2022 onward.[^18]
In July 2023 Dao released FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv:2307.08691) as a sole-author technical report, his first major release since joining Together AI as chief scientist.[^5][^2] FlashAttention-2 retains the IO-aware tiling of the original but rewrites the work partitioning across thread blocks and warps to better match GPU execution units, reducing non-matmul work and improving parallelism along the sequence dimension. The paper reports a roughly 2× speedup over FlashAttention, reaching 50-73% of theoretical peak FLOPs/s on NVIDIA A100, with end-to-end training throughput around 225 TFLOPs/s per GPU (approximately 72% model FLOPs utilization).[^5]
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (arXiv:2407.08608, July 2024) was a collaboration among researchers and engineers at NVIDIA, Together AI, Colfax Research, and Meta; its authors are Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Dao.[^6] FlashAttention-3 targets NVIDIA's Hopper architecture (H100). It exploits warp specialization, the asynchrony of Tensor Cores and the Tensor Memory Accelerator (TMA), and interleaved block-wise matmul and softmax, and it uses block quantization and incoherent processing to support FP8. It reaches a 1.5-2.0× speedup over FlashAttention-2 on H100, around 740 TFLOPs/s in FP16 (about 75% of theoretical max) and close to 1.2 PFLOPs/s in FP8, while reducing FP8 numerical error roughly 2.6× relative to a baseline FP8 attention implementation.[^6]
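The utilization percentages quoted for FlashAttention-2 and FlashAttention-3 can be sanity-checked with simple arithmetic. The peak figures below are assumptions taken from NVIDIA's published dense FP16/BF16 Tensor Core specifications for the A100 and the H100 SXM, not numbers stated in this article. The achieved throughputs are the ones reported above.

```python
# Assumed datasheet peaks for dense FP16/BF16 Tensor Core math, in TFLOPs/s.
A100_PEAK_TFLOPS = 312.0
H100_SXM_PEAK_TFLOPS = 989.0

def utilization(achieved_tflops, peak_tflops):
    """Fraction of theoretical peak throughput actually achieved."""
    return achieved_tflops / peak_tflops

# FlashAttention-2: ~225 TFLOPs/s end-to-end training throughput on A100.
fa2_utilization = utilization(225.0, A100_PEAK_TFLOPS)

# FlashAttention-3: ~740 TFLOPs/s FP16 attention throughput on H100.
fa3_utilization = utilization(740.0, H100_SXM_PEAK_TFLOPS)

print(f"{fa2_utilization:.0%} {fa3_utilization:.0%}")  # prints: 72% 75%
```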
A fourth generation, FlashAttention-4: Algorithm and Kernel Pipelining Co-design for Asymmetric Hardware Scaling, was published at MLSys 2026 with co-authors Ted Zadouri, Jay Shah, Markus Hohnerbach and others, with Dao as senior author. It received a Best Paper Honorable Mention at that venue.[^19][^1]
Across the series, the open-source flash-attention library (hosted on GitHub under the Dao-AILab organization) became one of the most widely deployed pieces of AI infrastructure of the early 2020s, and was a primary citation for Dao's inaugural Stanford Open Source Software Prize in 2024.[^1][^20]
In parallel with the FlashAttention line, Dao co-developed a family of recurrent, attention-free architectures based on structured state space models, in close collaboration with Albert Gu (then a fellow Stanford Ph.D. student and later assistant professor at Carnegie Mellon University).[^17]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces, by Albert Gu and Tri Dao, was posted to arXiv on 1 December 2023 (arXiv:2312.00752).[^8] Mamba builds on the S4 line of state space models but introduces an input-dependent selection mechanism: the SSM parameters (the discrete time step and the matrices controlling state evolution) are made functions of the current token, so the model can selectively propagate or forget information along the sequence. This breaks linear time invariance and substantially improves the language-modeling capability of pure SSMs, which had previously underperformed Transformers at scale.[^8]
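As a rough illustration of the selection mechanism, the sketch below runs a toy one-channel state space recurrence in which the step size and the B and C vectors are computed from the current input. It is a simplified sketch under stated assumptions, not the Mamba layer: the scalar input, the diagonal decay rates `a`, the softplus step size, and the projection weights `w_b`, `w_c`, `w_delta`, and `b_delta` are all hypothetical choices made for the example.

```python
import math

def softplus(z):
    """Smooth positive function, used here to keep the step size > 0."""
    return math.log1p(math.exp(z))

def selective_scan(xs, a, w_b, w_c, w_delta, b_delta):
    """Run a toy selective SSM over a sequence of scalar inputs.

    a: negative decay rates, one per state dimension (diagonal A).
    w_b, w_c, w_delta, b_delta: hypothetical weights that make B_t, C_t
    and the step size delta_t functions of the current input x_t.
    Returns the outputs and the final hidden state.
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)   # input-dependent step size
        b_t = [w * x for w in w_b]                # input-dependent B
        c_t = [w * x for w in w_c]                # input-dependent C
        # Discretized update: decay the old state, then write the new input.
        h = [math.exp(delta * a_n) * h_n + delta * b_n * x
             for a_n, h_n, b_n in zip(a, h, b_t)]
        ys.append(sum(c_n * h_n for c_n, h_n in zip(c_t, h)))
    return ys, h

params = dict(a=[-1.0, -0.5], w_b=[1.0, 1.0], w_c=[1.0, 1.0],
              w_delta=5.0, b_delta=-10.0)

# One informative token, then the same token followed by two "quiet" tokens.
_, h_after_one = selective_scan([2.0], **params)
ys_quiet, h_after_quiet = selective_scan([2.0, 0.0, 0.0], **params)
```

With these toy weights, the quiet tokens produce a step size close to zero, so the state written by the first token is carried forward almost unchanged. A time-invariant SSM would instead apply the same decay at every position regardless of the input.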
The authors also designed a hardware-aware parallel scan algorithm that lets Mamba reach throughput competitive with FlashAttention-equipped Transformers while preserving linear-time scaling in sequence length, and that delivers roughly five times higher inference throughput than Transformers at long context.[^8] Mamba is "attention-free" (it contains no softmax-attention layers) and was the first SSM-based architecture to match or exceed Transformer perplexity on standard language-modeling benchmarks at the multi-billion-parameter scale.
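A parallel scan is possible because each step of a linear recurrence, h -> a*h + b, composes associatively with the next. The sketch below shows only that algebraic property, using a simple doubling (Hillis-Steele) scan in plain Python; it is not the fused GPU kernel from the paper, and the function names and toy coefficients are assumptions for illustration.

```python
def combine(left, right):
    """Compose two steps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(steps):
    """Reference: run the recurrence one step at a time from h = 0."""
    h, out = 0.0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out

def doubling_scan(steps):
    """Inclusive scan in ceil(log2(n)) rounds.

    Within a round every combine is independent of the others, so on
    parallel hardware a whole round could execute at once.
    """
    items = list(steps)
    offset = 1
    while offset < len(items):
        items = [items[i] if i < offset
                 else combine(items[i - offset], items[i])
                 for i in range(len(items))]
        offset *= 2
    # Starting from h = 0, the state after each prefix is its b component.
    return [b for _, b in items]

# Toy per-step coefficients (hypothetical): (decay a_t, input term b_t).
steps = [(0.9, 1.0), (0.5, 2.0), (1.1, -1.0), (0.3, 0.5), (0.8, 0.0)]
seq_states = sequential_scan(steps)
par_states = doubling_scan(steps)
```

The two functions return the same states up to floating-point rounding; the doubling version trades extra arithmetic for a much shorter chain of dependent operations.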
Mamba received an Outstanding Paper award at the inaugural Conference on Language Modeling (COLM) in 2024.[^1] It sparked a large follow-up literature in 2024-2025 on selective SSMs and hybrid Transformer/SSM architectures.
In Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (arXiv:2405.21060, ICML 2024), Dao and Gu show that a broad class of selective SSMs and a particular family of attention variants are two views of the same underlying object — products of structured semiseparable matrices — and that algorithms designed for one side translate to the other.[^9] The paper introduces the structured state space duality (SSD) framework and the Mamba-2 model whose core layer (the SSD layer) is a refinement of Mamba's selective SSM that is 2-8 times faster on modern GPUs while preserving language-modeling quality. SSD makes pure SSMs matmul-friendly in a way that more directly leverages Tensor Cores, narrowing the gap between SSM training efficiency and Transformer training efficiency.[^9]
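The duality can be checked numerically in a simplified setting. The sketch below assumes a scalar decay per time step and small toy vectors for B and C (all names and values are illustrative, not taken from the paper). It computes the same outputs two ways: with a linear-time recurrence over a hidden state, and with a quadratic, attention-like sum in which the score between positions t and s is the dot product of C_t and B_s, weighted by a causal decay mask.

```python
def ssm_recurrent(a, B, C, x):
    """Linear-time view: carry a state vector h through the sequence."""
    h = [0.0] * len(B[0])
    ys = []
    for a_t, b_t, c_t, x_t in zip(a, B, C, x):
        h = [a_t * h_n + b_n * x_t for h_n, b_n in zip(h, b_t)]
        ys.append(sum(c_n * h_n for c_n, h_n in zip(c_t, h)))
    return ys

def ssm_as_masked_attention(a, B, C, x):
    """Quadratic view: y_t = sum over s <= t of L[t][s] * (C_t . B_s) * x_s,
    where L[t][s] is the product of the decays a[s+1] ... a[t]."""
    ys = []
    for t in range(len(x)):
        y_t = 0.0
        for s in range(t + 1):
            decay = 1.0
            for r in range(s + 1, t + 1):
                decay *= a[r]
            score = sum(c * b for c, b in zip(C[t], B[s]))
            y_t += decay * score * x[s]
        ys.append(y_t)
    return ys

# Toy sequence of length 4 with a 2-dimensional state (hypothetical values).
a = [0.9, 0.8, 0.7, 0.6]
B = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
C = [[1.0, 1.0], [0.0, 2.0], [1.0, -1.0], [0.5, 0.5]]
x = [1.0, 2.0, -1.0, 0.5]

y_recurrent = ssm_recurrent(a, B, C, x)
y_attention = ssm_as_masked_attention(a, B, C, x)
```

In the quadratic form, C plays a role analogous to queries and B to keys, which is one way to read the claim that algorithms designed for one side translate to the other.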
Dao and collaborators have continued to publish on the SSM/hybrid architecture line, including work on hybrid Transformer-SSM systems for long-context language modeling and, in 2025-2026, Mamba-3: Improved Sequence Modeling Using State Space Principles (ICLR 2026), as well as systems work such as Marconi: Prefix Caching for the Era of Hybrid LLMs (MLSys 2025, Outstanding Paper Honorable Mention).[^1][^10]
In July 2023, immediately after completing his Stanford Ph.D., Dao joined Together AI as Chief Scientist; he is described by the company as a co-founder and "founding chief scientist". Together AI was founded in June 2022 by Vipul Ved Prakash (CEO), Ce Zhang, Christopher Ré and Percy Liang, and Dao came on board roughly a year later, alongside the public release of FlashAttention-2.[^2][^21]
At Together AI, Dao leads research on model architecture and training and inference algorithms. The company brands itself as an "AI Acceleration Cloud" focused on open-source models and high-throughput training and inference, and Dao's work — FlashAttention, Mamba and follow-ons — supplies a substantial fraction of the underlying algorithmic stack. Together AI raised a US$102.5 million funding round in 2023 and subsequent rounds in 2024, with Dao's chief-scientist appointment prominently featured in its press materials and partner communications.[^21][^2]
Dao has continued to publish actively from Together AI in parallel with his Princeton role; the FlashAttention-3 paper, for instance, was a multi-institution collaboration that included Together AI.[^6]
Princeton University's Board of Trustees approved Dao's appointment as Assistant Professor in the Department of Computer Science at its January 31, 2024 meeting, with a start date of September 2024.[^3] His Princeton CS faculty profile lists him as Assistant Professor and notes that he holds a Ph.D. in Computer Science from Stanford (2023).[^22]
At Princeton he leads the Dao AI Lab, which works on efficient training and inference, hardware-aware algorithms, and sequence models with long-range memory.[^1] He has built a sizable group; his personal site as of 2025-2026 lists approximately seven doctoral students, several jointly advised with collaborators at other institutions.[^1] He is also affiliated with Princeton Language and Intelligence (PLI), the university's AI initiative, and contributes to its public-facing technical writing on topics including Mamba-2.[^23]
In November 2025, the Princeton Office of the Dean of the Faculty announced that Dao had been named an Early Career Fellow in the Schmidt Sciences AI2050 program, alongside structural biologist Ellen Zhong, with funding to pursue AI systems that combine novel architectures and tool use to attempt expert-level scientific discovery in domains where labeled data is scarce.[^24][^10]
Dao maintains an active public profile in the AI research community. He posts research updates on X (Twitter) at the handle @tri_dao and operates an open-source software footprint through the tridao and Dao-AILab GitHub organizations, where the flash-attention and mamba code lives.[^25][^20]
He has appeared as a guest on prominent AI podcasts, including Latent Space: The AI Engineer Podcast (July 2023, on FlashAttention-2 and Together AI) and Imbue's Generally Intelligent podcast (February 2024, on FlashAttention, sparsity, quantization, and efficient inference).[^26][^11] He has been a speaker at TEDAI San Francisco, MLSys, COLM, NeurIPS, the Stanford MLSys Seminar, the Kempner Institute at Harvard, and MIT CSAIL, typically on hardware-aware algorithms and modern AI architectures.[^27][^28]
Within the field he is often grouped with a cohort of "ML systems" researchers, including collaborators such as Albert Gu, Daniel Fu, Michael Poli, and his former advisor Christopher Ré, who emphasize co-design of architectures and accelerators rather than treating GPUs as black-box compute.
The following list, in chronological order, gives a representative selection of Dao's most-cited and most-discussed papers.
As of May 2026 his Google Scholar profile reports approximately 33,000 citations, an h-index near 38, and an i10-index near 52, dominated by the Mamba and FlashAttention papers.[^29]