ThunderKittens
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,303 words
ThunderKittens (often abbreviated TK) is an embedded C++ domain-specific language and header-only library for writing high-performance AI kernels on modern NVIDIA GPUs. It was developed by Benjamin Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, and Christopher Ré at the Hazy Research lab at Stanford University, and was publicly released on 12 May 2024 alongside a blog post titled "GPUs Go Brrr."[^1][^2] The framework introduces a small set of tile-based primitives (register tiles, register vectors, shared tiles, and shared vectors) that sit between raw CUDA C++ and the Triton language, allowing researchers to write kernels that directly invoke warp-group and warp-level hardware features such as asynchronous WGMMA matrix multiplication and the Tensor Memory Accelerator (TMA) on the NVIDIA H100.[^1][^3] Subsequent releases extended TK to Apple Silicon (ThunderMittens), AMD GPUs (HipKittens), NVIDIA Blackwell B200, and multi-GPU clusters (ParallelKittens).[^4][^5][^6][^7]
| Field | Value |
|---|---|
| Initial release | 12 May 2024[^1] |
| Latest major release | ThunderKittens 2.0, 19 February 2026[^8] |
| Developers | Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, Christopher Ré[^2] |
| Lab | Hazy Research, Stanford University |
| Language | C++20 (CUDA, header-only)[^9] |
| License | MIT[^9] |
| Repository | github.com/HazyResearch/ThunderKittens[^9] |
| Supported GPUs (current) | Hopper (H100), Blackwell (B200); Ampere legacy[^9] |
| Build requirements | CUDA 12.8+, GCC 11+ or Clang 11+, PyTorch 2.8+ for bindings[^9] |
By early 2024 the GPU programming landscape for AI workloads was bifurcated. On one end, NVIDIA's CUTLASS/CuTe library offered a rich C++ template system for tensor-core kernels, but its deep template metaprogramming was widely perceived as difficult to read and modify. On the other end, OpenAI's Triton language provided a Python-style JIT compiler that produced competitive kernels for many transformer operations, but contributors at Hazy Research argued it was difficult to map Triton onto newly introduced Hopper-class hardware features such as warp-group MMA (WGMMA) and the Tensor Memory Accelerator (TMA), and that on more exotic architectures (such as state-space models and linear attention) Triton kernels often lagged hand-written CUDA significantly.[^2][^3]
The H100's tensor cores deliver roughly 989 TFLOP/s of half-precision matrix-multiply throughput, versus only on the order of 60 TFLOP/s for non-tensor operations, meaning the cost of leaving tensor cores idle is substantial. The Hazy Research team argued that the most productive path forward was a "small, dumb, simple" set of abstractions tightly mapped to the actual primitives the silicon exposes, rather than a fully general compiler.[^1]
Hazy Research, the Stanford University research group led by Christopher Ré, had been working on long-sequence architectures (including the FlashAttention line through Tri Dao, the H3/Hyena/Based linear attention families through Daniel Y. Fu and Simran Arora, and the Mamba and Mamba 2 state-space models). Each of these directions repeatedly required custom CUDA kernels to land at competitive throughput, and the group's stated goal for TK was to make the cost of writing and rewriting such kernels low enough that algorithmic innovation could keep pace with hardware capability.[^1][^2][^10]
ThunderKittens was first announced on 12 May 2024 in a Hazy Research blog post titled "GPUs Go Brrr: Fooling Around With Image Models" by Benjamin Spector, Aaryan Singhal, Simran Arora, and Chris Ré.[^1] A companion post the same day, "ThunderKittens: A Simple Embedded DSL for AI Kernels," summarized the framework's design goals.[^2] The release shipped reference kernels for FlashAttention on the NVIDIA A100 and RTX 4090, an H100 implementation of FlashAttention-2, and Based linear attention, claiming the H100 attention forward kernel was roughly 30 percent faster than the FlashAttention-2 baseline at the time and used about 100 lines of device code.[^1] The post also reported that a Based linear attention kernel sustained roughly 215 TFLOP/s on H100.[^1]
The release framed itself as a set of lessons learned from earlier kernel work in the lab rather than as a finished compiler. The blog post walks through five practical lessons (keep the tensor cores fed, use shared memory carefully, hide latency with asynchronous loads, treat WGMMA and TMA as first-class abstractions, and write small composable kernels), and then introduces ThunderKittens as the codification of those lessons in a header file. Two complete examples (an RTX 4090 FlashAttention kernel of roughly 60 lines achieving about 75 percent of theoretical utilization, and the 100-line H100 FlashAttention-2 kernel) accompany the announcement.[^1]
On 29 October 2024, the same team released a substantial update under the title "Easier, Better, Faster, Cuter." This release added a fused Mamba 2 kernel (described as "several times faster than the current Triton implementation"), long convolution kernels based on FFTConv at roughly 9 times the throughput of FlashFFTConv at sequence length 4096, additional linear attention variants (Based and LoLCATS Hedgehog), and fused RoPE, LayerNorm, and Linear-layer kernels. The same update introduced a Python setup.py install build flow, automatic shared-memory management, global layout descriptors, and broader FP16/FP32 support; the team also reported that the TK attention kernels exceeded FlashAttention-3 on the backward pass.[^10]
On 27 October 2024 the authors posted the arXiv preprint "ThunderKittens: Simple, Fast, and Adorable AI Kernels" (arXiv:2410.20399), with Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, and Christopher Ré as authors.[^11] The paper formalizes a three-level abstraction (warp, thread-block, grid), describes the load-compute-store-finish (LCSF) template that organizes asynchronous pipelining at the block level, and reports that TK matches cuBLAS and FlashAttention-3 on standard GEMM and attention forward, while delivering 10 to 40 percent improvements on attention backward, speedups of up to roughly 14 times on linear attention versus Flash Linear Attention, and speedups of around 8 times on state-space model kernels.[^3][^11]
On 28 November 2024, Conner Takehana and Aaryan Singhal (with the original TK team) released ThunderMittens, a Metal Shading Language port of the ThunderKittens abstractions for Apple Silicon, initially targeting the M2 Pro.[^4] The post observed that only one major change was required: shrinking the base tile from 16x16 (matched to H100 tensor cores) to 8x8 to align with Metal's simdgroup_matrix<T,8,8> type. ThunderMittens reportedly delivered GEMM kernels roughly 9 percent faster than baseline and attention inference within plus or minus 15 percent of the reference MLX implementation for head dimensions 64 and 128.[^4]
On 15 March 2025, Benjamin Spector, Aaryan Singhal, Dan Fu, and Chris Ré published "ThunderKittens Now on Blackwells!" introducing support for the NVIDIA Blackwell B200 generation. The port added BF16 and FP8 GEMM kernels, plus attention forward and backward implementations. The team reported that the TK BF16 GEMM was up to roughly twice as fast as the cuBLAS H100 implementation and that the attention kernels were also up to twice as fast as FlashAttention-3 on H100, while being near cuDNN performance on B200.[^5]
The Blackwell release exposed three new hardware concepts to TK programmers: fifth-generation tensor cores that behave as 128x128 systolic arrays (requiring M and N dimensions to be multiples of 128 for full utilization); Tensor Memory (TMEM), a 256 KB dedicated register layer used to stage tensor-core operands; and CTA pairs under TCGEN05, which allow two cooperative thread arrays on adjacent SMs to coordinate a single larger GEMM.[^5]
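The 128-multiple requirement can be made concrete with a little arithmetic. The sketch below is plain host-side C++, not TK code; the names `padded_dim` and `mma_utilization_percent` are hypothetical, and it assumes the simplified model that a partly filled 128x128 tile costs as much as a full one.

```cpp
#include <cstdint>

// Round a dimension up to the next multiple of `tile` (128 in the text above).
constexpr std::int64_t padded_dim(std::int64_t dim, std::int64_t tile = 128) {
    return ((dim + tile - 1) / tile) * tile;
}

// Percentage of issued tensor-core work that is useful, under the
// simplifying assumption that a partly filled tile costs a full tile.
constexpr std::int64_t mma_utilization_percent(std::int64_t m, std::int64_t n,
                                               std::int64_t tile = 128) {
    return (100 * m * n) / (padded_dim(m, tile) * padded_dim(n, tile));
}

static_assert(padded_dim(200) == 256, "200 rounds up to two 128-wide tiles");
static_assert(mma_utilization_percent(256, 256) == 100, "aligned shapes waste nothing");
static_assert(mma_utilization_percent(192, 192) == 56, "36864 useful of 65536 issued");
```

Under this model a 192x192 output uses only a little over half of the issued tensor-core work, which is why kernel authors pick tile shapes that divide evenly into 128.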
In a 22 September 2025 post, "One Kernel for All Your GPUs," Stuart Sul, Dylan Lim, Benjamin Spector, and Chris Ré introduced a Parallel Global Layout (PGL) abstraction and a TKParallelTensor PyTorch wrapper for writing multi-GPU kernels with TMA descriptors and multicast addresses, demonstrating all-reduce, all-gather, and all-to-all collectives competitive with (and in some cases up to 2.6 times faster than) NCCL (NVIDIA Collective Communications Library) on 8x B200 systems.[^6]
This effort culminated in ParallelKittens, released on 17 November 2025 by Stuart Sul, Simran Arora, Benjamin Spector, and Chris Ré, which extends ThunderKittens to fused multi-GPU kernels for sequence parallelism, all-gather plus GEMM, ring attention, and other patterns over NVLink.[^7]
On 11 November 2025, William Hu, Drew Wadsworth, Chris Ré, Simran Arora, and collaborators (including Sean Siddens, Stanley Winata, Daniel Fu, Ryan Swann, and Muhammad Osama) released HipKittens, an AMD-targeted sister project hosted at HazyResearch/HipKittens. HipKittens targets the AMD Instinct MI355X (and also references the MI350X) and reports that a roughly 500-line GQA attention kernel outperforms the AITER hand-optimized assembly baseline; its GEMM hot loop fits in under 100 lines and achieves peak performance in BF16 and FP8.[^12]
ThunderKittens 2.0, by Stuart Sul and Chris Ré, was released on 19 February 2026. The release focused on internal refactoring ("subtraction as much as addition"), removing unnecessary fence instructions (worth 20 to 30 TFLOP/s on certain kernels), improving PTX assembler output for up to 10 percent end-to-end gains, adding CLC scheduling, and merging external contributions. It also delivered new GEMM kernels in MXFP8 and NVFP4 precisions matching or surpassing cuBLAS on B200, along with refreshed example kernels.[^8]
ThunderKittens is deliberately small. The authors describe the library as an "opinionated" set of primitives implemented as C++ templates inside a single CUDA header (kittens.cuh), with a published size of under 1 MB compared to CUTLASS at roughly 22 MB and cuBLAS at roughly 689 MB.[^3] The motivating insight is that the vast majority of modern AI compute is matrix multiplications and reductions over relatively small tiles, and that the GPU's tensor cores impose hard alignment requirements that make 16x16 (on Hopper) or 8x8 (on Apple Silicon) tiles a natural unit of abstraction. The TK team summarizes the constraint as a rhetorical question: "if your matrix multiply is smaller than 16x16, are you sure what you're doing is AI?"[^1]
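To illustrate what treating a fixed-size tile as the unit of abstraction can look like in C++, here is a minimal sketch. `TileSketch` and `tiles_to_cover` are hypothetical names invented for this example and are not the ThunderKittens API; the only idea taken from the text is that tile shapes are compile-time multiples of a 16x16 base tile.

```cpp
#include <cstddef>

// Hypothetical tile type: NOT the ThunderKittens API, only an illustration
// of fixing the shape at compile time in units of a 16x16 base tile.
template <typename T, std::size_t Rows, std::size_t Cols, std::size_t Base = 16>
struct TileSketch {
    static_assert(Rows % Base == 0 && Cols % Base == 0,
                  "tile dimensions must be multiples of the base tile");
    static constexpr std::size_t rows = Rows;
    static constexpr std::size_t cols = Cols;
    // Number of Base x Base subtiles that make up this tile.
    static constexpr std::size_t subtiles = (Rows / Base) * (Cols / Base);
    static constexpr std::size_t bytes = Rows * Cols * sizeof(T);
    T data[Rows * Cols];
};

// Number of tiles of width `tile` needed to cover `n` elements.
constexpr std::size_t tiles_to_cover(std::size_t n, std::size_t tile) {
    return (n + tile - 1) / tile;
}
```

With this sketch, `TileSketch<float, 16, 64>` is made of four base tiles, while `TileSketch<float, 20, 64>` is rejected at compile time, so misaligned shapes never reach the hardware.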
The arXiv paper describes a three-level mapping of TK abstractions to the GPU's parallelism hierarchy:[^3]
| Level | TK abstractions | Mapped hardware |
|---|---|---|
| Warp | Register tile, register vector, shared tile, shared vector; PyTorch-style mma, exp, cumsum, pointwise ops[^1][^3] | 32-thread warp on a single SM |
| Block | Load-compute-store-finish (LCSF) template with load workers, compute workers, multi-stage pipelined buffers[^3] | Thread block / cooperative thread array (CTA) |
| Grid | Persistent launch and L2-cache-aware block scheduling[^3] | Entire kernel grid across SMs |
At the warp level, register and shared tiles are templated by element type, shape, and layout, and shared tiles automatically pick a swizzling pattern to avoid bank conflicts. The paper reports that TK attention backward eliminates the 9.6-way bank conflicts observed in profiling FlashAttention-3, contributing to an 85 percent reduction in shared-memory stall cycles.[^3]
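The effect of swizzling on bank conflicts can be shown with a small host-side model. This is a sketch under stated assumptions, not TK's implementation: it assumes 32 four-byte banks, a row-major tile that is 32 words wide, and a simple XOR swizzle of the column index by the row index. TK's actual swizzle patterns depend on element type and tile shape and are not reproduced here; all names are hypothetical.

```cpp
constexpr int kBanks = 32;  // assumed: 32 four-byte shared-memory banks
constexpr int kCols = 32;   // assumed tile width in 4-byte words

// Word offset of element (row, col) in a row-major tile, optionally with
// the column index XOR-swizzled by the row index.
constexpr int word_offset(int row, int col, bool swizzle) {
    return row * kCols + (swizzle ? (col ^ (row % kCols)) : col);
}

// Worst-case number of threads in a 32-thread warp that hit the same bank
// when thread t reads element (t, col), i.e. a column access.
constexpr int column_conflict_degree(int col, bool swizzle) {
    int hits[kBanks] = {};
    int worst = 0;
    for (int t = 0; t < 32; ++t) {
        int bank = word_offset(t, col, swizzle) % kBanks;
        ++hits[bank];
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}

static_assert(column_conflict_degree(5, false) == 32, "unswizzled column read serializes");
static_assert(column_conflict_degree(5, true) == 1, "XOR swizzle spreads the column over all banks");
```

Without the swizzle every thread in the warp lands in the same bank; with it each thread lands in a different one, at the cost of a cheap index transformation on every access.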
At the block level, the LCSF template separates warps into "loaders" that move data from HBM into shared memory using TMA, and "computers" that consume the loaded tiles via WGMMA. Multi-stage circular buffers and named barriers allow the producer and consumer warps to overlap, with the paper reporting a roughly 2.6x throughput improvement (260 to 683 TFLOP/s on a representative kernel) as pipeline depth is increased.[^3]
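Why deeper buffering helps can be seen in a toy timing model. The sketch below is not the LCSF template or any TK API; it simply simulates one loader and one compute worker sharing a circular buffer with a given number of stages, with fixed per-tile load and compute times. The function name and the cap of eight stages are assumptions made for the example.

```cpp
constexpr int kMaxStages = 8;  // arbitrary cap for this sketch

// Finish time of a load/compute pipeline over `tiles` tiles.
// Model: one loader, one compute worker, `stages` buffer slots; each load
// takes `load_t` and each compute takes `compute_t` time units.
constexpr long pipeline_time(int tiles, int stages, long load_t, long compute_t) {
    if (stages < 1) stages = 1;
    if (stages > kMaxStages) stages = kMaxStages;
    long slot_free[kMaxStages] = {};  // when each buffer slot can be refilled
    long loader_free = 0;             // when the loader can start its next load
    long compute_free = 0;            // when the compute worker is next idle
    for (int i = 0; i < tiles; ++i) {
        int slot = i % stages;
        // A load needs both the loader and an empty slot.
        long load_start = loader_free > slot_free[slot] ? loader_free : slot_free[slot];
        long load_done = load_start + load_t;
        loader_free = load_done;
        // Compute needs the loaded tile and an idle compute worker.
        long compute_start = load_done > compute_free ? load_done : compute_free;
        compute_free = compute_start + compute_t;
        slot_free[slot] = compute_free;  // slot is reusable once consumed
    }
    return compute_free;
}

static_assert(pipeline_time(4, 1, 10, 10) == 80, "one stage: loads and computes serialize");
static_assert(pipeline_time(4, 2, 10, 10) == 50, "two stages: loads overlap computes");
```

In this model a second stage cuts four tiles from 80 time units to 50, and a third stage adds nothing because the loader and the compute worker are already balanced. Real kernels have variable latencies, so extra stages can keep paying off for longer than the toy model suggests.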
At the grid level, persistent kernels keep a fixed number of CTAs resident on each SM and let them iterate over work tiles, and the framework includes L2-aware block schedulers that reorder tile assignment to improve cache reuse. The paper notes that for large matrices L2 reuse can change observed performance by more than 50 percent.[^3]
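A common way to improve L2 reuse in tiled GEMMs is to visit output tiles in small groups of rows instead of plain row-major order. The sketch below uses that grouped ordering as a stand-in; the source does not specify TK's exact scheduling policy, so this is an illustration of the idea and all names are hypothetical.

```cpp
struct TileCoord { int row; int col; };

// Plain row-major assignment of work index to output tile.
constexpr TileCoord row_major_order(int idx, int tiles_n) {
    return TileCoord{idx / tiles_n, idx % tiles_n};
}

// Grouped assignment: walk `group` tile-rows at a time, column by column,
// so consecutive work items share operand panels.
constexpr TileCoord grouped_order(int idx, int tiles_m, int tiles_n, int group) {
    int per_group = group * tiles_n;
    int first_row = (idx / per_group) * group;
    int rows_here = tiles_m - first_row < group ? tiles_m - first_row : group;
    int local = idx % per_group;
    return TileCoord{first_row + local % rows_here, local / rows_here};
}

constexpr int kMaxTiles = 64;  // arbitrary cap for this sketch

// Distinct A row-panels plus B column-panels touched by `wave` consecutive
// work items; fewer distinct panels means more reuse from cache.
// Pass group == 0 to use row-major order.
constexpr int panels_touched(int start, int wave, int tiles_m, int tiles_n, int group) {
    bool rows[kMaxTiles] = {};
    bool cols[kMaxTiles] = {};
    int count = 0;
    for (int i = start; i < start + wave; ++i) {
        TileCoord c = group > 0 ? grouped_order(i, tiles_m, tiles_n, group)
                                : row_major_order(i, tiles_n);
        if (!rows[c.row]) { rows[c.row] = true; ++count; }
        if (!cols[c.col]) { cols[c.col] = true; ++count; }
    }
    return count;
}

static_assert(panels_touched(0, 16, 8, 8, 0) == 10, "row-major: 2 row panels + 8 column panels");
static_assert(panels_touched(0, 16, 8, 8, 4) == 8, "grouped: 4 row panels + 4 column panels");
```

For an 8x8 grid of output tiles and a wave of 16 concurrent work items, the grouped order touches 8 operand panels instead of 10, so less data has to be fetched from HBM for the same amount of work.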
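TK is built specifically around hardware features that became available in NVIDIA H100 (Hopper) and were extended in NVIDIA Blackwell, notably asynchronous WGMMA matrix multiplication and the Tensor Memory Accelerator (TMA) on Hopper, and fifth-generation tensor cores, Tensor Memory (TMEM), and CTA pairs on Blackwell.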
A TK kernel is a CUDA __global__ function that declares register_tile and shared_tile variables, invokes warp- or warp-group-level ops such as tk::mma_AB, tk::exp, or tk::sum, and uses TMA primitives for asynchronous I/O. The H100 FlashAttention-2 reference kernel in the original blog post is roughly 100 lines of device code, while the team reports a comparable hand-written CUDA implementation runs to about 1,200 lines.[^1] On the same hardware, a TK matrix-multiply example reaches roughly 855 TFLOP/s, around 86 percent of the H100's theoretical BF16 peak, in under 100 lines.[^9]
The shape of a typical TK attention kernel illustrates the model: a producer warp issues TMA loads of Q, K, and V tiles into a multi-stage shared-memory ring buffer, while consumer warp groups iterate over the tiles, performing WGMMA matrix multiplies into register tiles, applying the softmax rescaling that FlashAttention uses, and finally writing the output tile back via TMA stores. The producer and consumer coordinate through named barriers, and tile shapes and stages are template parameters chosen by the kernel author. The framework provides "PyTorch-like" tile operations (pointwise multiply, exp, mma, cumsum, transpose) so that the algorithmic structure of the kernel reads similarly to the higher-level mathematical description.[^1][^3]
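The softmax rescaling step is the least obvious part of that description, so here is a scalar, host-side sketch of it. It is not TK or FlashAttention code: it processes one row of attention scores block by block, keeps a running maximum, and rescales the accumulated numerator and denominator whenever the maximum grows. Function names are hypothetical, and values are scalars rather than vectors to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Reference: softmax(scores) dotted with values, using the global maximum.
inline double softmax_weighted_sum(const std::vector<double>& scores,
                                   const std::vector<double>& values) {
    assert(!scores.empty() && scores.size() == values.size());
    double m = *std::max_element(scores.begin(), scores.end());
    double denom = 0.0, numer = 0.0;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        double w = std::exp(scores[i] - m);
        denom += w;
        numer += w * values[i];
    }
    return numer / denom;
}

// Blockwise version: only one block of scores is examined at a time, and
// the accumulators are rescaled when a block raises the running maximum.
inline double softmax_weighted_sum_online(const std::vector<double>& scores,
                                          const std::vector<double>& values,
                                          std::size_t block) {
    assert(block > 0 && !scores.empty() && scores.size() == values.size());
    double run_max = -std::numeric_limits<double>::infinity();
    double denom = 0.0, numer = 0.0;
    for (std::size_t start = 0; start < scores.size(); start += block) {
        std::size_t end = std::min(start + block, scores.size());
        double block_max = *std::max_element(scores.begin() + start, scores.begin() + end);
        double new_max = std::max(run_max, block_max);
        double scale = std::exp(run_max - new_max);  // exp(-inf) is 0 on the first block
        denom *= scale;
        numer *= scale;
        for (std::size_t i = start; i < end; ++i) {
            double w = std::exp(scores[i] - new_max);
            denom += w;
            numer += w * values[i];
        }
        run_max = new_max;
    }
    return numer / denom;
}
```

Both functions return the same result up to floating-point rounding, for any block size. In a GPU kernel the same bookkeeping lets each consumer work on one K/V tile at a time without ever materializing the full score matrix.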
The arXiv paper formalizes a simple cost model in which kernel runtime is the maximum of the time to move data and the time to do compute, plus overheads. Achieving good performance reduces to keeping actual runtime close to that maximum, so that the kernel runs at whichever ceiling binds (the HBM bandwidth limit or the tensor-core FLOPS limit), which in turn requires overlapping loads and compute well. The LCSF template is designed to make this overlap explicit, and the paper's Figure 6 reports a trade-off in which higher CTA occupancy increases overlap but constrains per-worker tile size due to register pressure; TK exposes this trade-off through a single occupancy parameter rather than a hand-crafted schedule.[^3]
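The cost model is easy to express directly. The sketch below is a minimal reading of it, assuming perfect overlap between data movement and compute; the function names and the idea of passing bandwidth and peak throughput as plain parameters are choices made for this example, not something taken from the paper.

```cpp
// Estimated kernel time under perfect overlap: the slower of memory
// traffic and arithmetic, plus a fixed overhead. Units are up to the
// caller as long as they are consistent (e.g. bytes, bytes/s, FLOP, FLOP/s).
constexpr double kernel_time(double bytes, double flops,
                             double bandwidth, double peak_flops,
                             double overhead = 0.0) {
    double t_mem = bytes / bandwidth;
    double t_compute = flops / peak_flops;
    return (t_mem > t_compute ? t_mem : t_compute) + overhead;
}

// True when arithmetic, not data movement, sets the lower bound.
constexpr bool is_compute_bound(double bytes, double flops,
                                double bandwidth, double peak_flops) {
    return flops / peak_flops >= bytes / bandwidth;
}

static_assert(kernel_time(100.0, 1000.0, 10.0, 50.0) == 20.0, "max(10, 20)");
static_assert(kernel_time(100.0, 1000.0, 10.0, 50.0, 1.0) == 21.0, "overhead is additive");
static_assert(is_compute_bound(100.0, 1000.0, 10.0, 50.0), "compute takes longer here");
```

Plugging in the 989 TFLOP/s tensor-core figure quoted earlier for `peak_flops`, together with a measured HBM bandwidth for the same GPU, gives a quick estimate of whether a kernel should be tuned for data movement or for tensor-core occupancy.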
The HazyResearch/ThunderKittens repository ships a growing set of reference kernels that double as benchmarks and as starting points for users. As of the ThunderKittens 2.0 release these include:[^3][^9][^10]
| Kernel | Notes |
|---|---|
| GEMM (BF16, FP8, MXFP8, NVFP4) | Matches or surpasses cuBLAS on H100 and B200 in the reported configurations[^3][^5][^8] |
| Attention forward (causal and non-causal) | Matches FlashAttention-3 on H100; on B200, up to ~2x faster than FlashAttention-3 on H100 and near cuDNN[^3][^5] |
| Attention backward | Reported 10 to 40 percent improvement over FlashAttention-3 on H100[^3] |
| Mamba-2 fused kernel | "Several times faster" than the prior Triton implementation; ~8x state-space-model speedup in the paper[^3][^10] |
| Linear attention (Based, LoLCATS, Hedgehog) | Up to ~14x faster than Flash Linear Attention[^3][^10] |
| Long convolution / FFTConv | ~7.9x faster than FlashFFTConv at 1024 tokens, ~9x at 4096 tokens[^3][^10] |
| Rotary positional embedding | Competitive with Triton reference[^10] |
| LayerNorm, Linear, dropout-residual-layernorm fused ops | Production-oriented fused kernels[^10][^12] |
| LoLCATS linearization kernels | Tied to the LoLCATS linear-attention conversion recipe[^9] |
| Multi-GPU collectives (ParallelKittens) | All-reduce, all-gather, reduce-scatter, all-to-all, ring attention[^7] |
The Kittens family has expanded into a small ecosystem of sister projects sharing the tile abstraction:
| Project | Target hardware | Release | Authors (first listed) |
|---|---|---|---|
| ThunderKittens | NVIDIA Hopper, then Blackwell | 12 May 2024[^1] | Spector, Singhal, Arora, Ré |
| ThunderMittens | Apple Silicon (Metal Shading Language) | 28 November 2024[^4] | Takehana, Singhal |
| ThunderKittens Blackwell port | NVIDIA B200 | 15 March 2025[^5] | Spector, Singhal, Fu, Ré |
| One Kernel for All Your GPUs (PGL) | Multi-GPU NVLink/NVSwitch | 22 September 2025[^6] | Sul, Lim, Spector, Ré |
| HipKittens | AMD Instinct MI355X (CDNA4) | 11 November 2025[^12] | Hu, Wadsworth, Ré, Arora |
| ParallelKittens | Multi-GPU sequence/expert parallelism | 17 November 2025[^7] | Sul, Arora, Spector, Ré |
| ThunderKittens 2.0 | Hopper, Blackwell; MXFP8, NVFP4 | 19 February 2026[^8] | Sul, Ré |
| System | Style | Primary user | Hopper feature coverage | Repository size (per TK paper)[^3] |
|---|---|---|---|---|
| Raw CUDA C++ | Imperative C++ with PTX/MMA intrinsics | Library authors | Full but manual | n/a |
| CUTLASS / CuTe | Heavily templated C++ DSL with layout algebra | Library authors | Full | ~22 MB |
| OpenAI Triton | Python-based, JIT-compiled, block-pointer model | Researchers and ML engineers | Partial; explicit WGMMA/TMA exposure is limited | ~12.6 MB |
| cuBLAS | Closed-source optimized binary library | All | Full but opaque | ~689 MB |
| ThunderKittens | Small embedded C++ DSL with tile primitives | Researchers writing custom kernels | Full direct WGMMA/TMA/TCGEN05 | <1 MB |
The TK paper argues that CUTLASS's flexibility "comes at an accessibility cost" because of its nested template style, and that Triton's higher-level abstractions "fail less gracefully" when a workload falls outside its assumed patterns, such as exotic recurrent state-space kernels.[^3] By contrast, TK aims to remain ordinary CUDA C++ that any CUDA programmer can extend, while baking in just enough structure (tiles, LCSF, layout descriptors) to make the asynchronous Hopper and Blackwell hardware tractable.[^1][^3]
ThunderKittens began as an internal project at Hazy Research but has seen growing external use. The Hazy Research team reports that TK powers production training kernels at Cursor (for the Composer code model) and inference kernels at Together AI.[^8] The ThunderKittens 2.0 announcement also mentions that the framework is used at firms including Jump Trading.[^8] Community and academic interest is reflected in the project's public repository under HazyResearch/ThunderKittens (MIT-licensed, header-only) and in discussions on platforms such as Hacker News, where the original "GPUs Go Brrr" post appeared on the front page in May 2024.[^13]
Adoption within the Hazy Research lab itself spans the group's own research on long-context architectures: the Based linear-attention architecture, the Mamba-2 state-space model line, the LoLCATS linear-attention conversion pipeline, the FFTConv long-convolution kernels, and various nanoGPT-TK and PyTorch Lightning training demonstrations all ship as TK kernels in the public repository.[^1][^10] The framework's role as a "research vehicle" is reinforced by the fact that the original release explicitly bundled a nanoGPT integration so that users could see end-to-end training of a small language model on top of TK kernels.[^2][^10]
ThunderKittens is one of several efforts in 2024 to 2026 that argue the practical bottleneck for new AI architectures is no longer raw FLOPS but the difficulty of writing kernels that exploit modern accelerator features such as warp-group MMA, TMA, and tensor memory. By demonstrating that 100-line tile-based kernels can match or beat the hand-tuned FlashAttention-3 and cuBLAS on H100 and B200, and by showing 8x to 14x speedups on state-space model and linear attention kernels, the project provides empirical support for the idea that small, opinionated DSLs can close the gap between research-friendly languages and vendor libraries on a workload-by-workload basis.[^1][^3][^10] The follow-on work (ThunderMittens, HipKittens, ParallelKittens) further suggests that the tile-based programming model generalizes across vendors and across scaling regimes from a single chip to multi-GPU clusters.[^4][^7][^12]
A second contribution of TK is pedagogical. The "GPUs Go Brrr" post is widely cited as an accessible explanation of why Hopper-class GPUs require asynchronous overlap to reach peak throughput, why bank conflicts and shared-memory swizzling matter, and how WGMMA and TMA differ from their predecessors. The deliberately compact codebase (under 1 MB compared to CUTLASS at roughly 22 MB) and the use of plain CUDA C++ rather than an intermediate IR have made TK a frequent entry point for researchers learning to write fused kernels.[^1][^3][^13]
The published numbers for ThunderKittens are scattered across the original blog post, the arXiv paper, and the various follow-ups. The table below collects the most frequently cited measurements for context.
| Workload | Hardware | TK result | Baseline | Source |
|---|---|---|---|---|
| FlashAttention forward (causal) | H100 SXM | ~30% faster than FlashAttention-2 | FlashAttention-2 | "GPUs Go Brrr"[^1] |
| Based linear attention | H100 SXM | ~215 TFLOP/s sustained (300+ counting algorithmic recomputation) | Prior linear attention | "GPUs Go Brrr"[^1] |
| FlashAttention backward | H100 | 10 to 40 percent faster than FlashAttention-3 | FlashAttention-3 | TK paper[^3] |
| Linear attention (Based) | H100 | ~14x faster | Flash Linear Attention | TK paper[^3] |
| Linear attention (learned features) | H100 | ~6.5x faster | Flash Linear Attention | TK paper[^3] |
| State-space models (Mamba-2) | H100 | ~8x faster than Triton baselines | Triton Mamba-2 | TK paper[^3] |
| Long convolution (FFTConv) | H100 | ~7.9x faster at 1024 tokens, ~9x at 4096 tokens | FlashFFTConv | TK paper, "Easier, Better, Faster, Cuter" post[^3][^10] |
| GEMM BF16 | B200 | Up to ~2x faster than cuBLAS on H100, near cuBLAS on B200 | cuBLAS | Blackwell post[^5] |
| Attention forward/backward | B200 | Up to ~2x faster than FlashAttention-3 on H100; near cuDNN on B200 | FlashAttention-3, cuDNN | Blackwell post[^5] |
| GEMM example | H100 | ~855 TFLOP/s (~86 percent of peak BF16) | n/a | TK README[^9] |
| All-to-all collective | 8x B200 | Up to ~2.6x faster than NCCL | NCCL | PGL post[^6] |
These numbers are reported by the authors using their own measurement methodologies and should be read in that context; competing libraries continue to improve, and some baselines (FlashAttention-2, Triton Mamba-2) have themselves been updated since the corresponding TK measurements were published.
ThunderKittens is narrower than competing systems in several respects: