ThunderKittens

Updated Jul 16, 2026

ThunderKittens (often abbreviated TK) is an embedded C++ domain-specific language and header-only library for writing high-performance AI kernels on modern NVIDIA GPUs. It was developed by Benjamin Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, and Christopher Ré at the Hazy Research lab at Stanford University, and was publicly released on 12 May 2024 alongside a blog post titled "GPUs Go Brrr."^[1]^[2] The framework introduces a small set of tile-based primitives (register tiles, register vectors, shared tiles, and shared vectors) that sit between raw CUDA C++ and the Triton language, allowing researchers to write kernels that directly invoke warp-group and warp-level hardware features such as asynchronous WGMMA matrix multiplication and the Tensor Memory Accelerator (TMA) on the NVIDIA H100.^[1]^[3] Subsequent releases extended TK to Apple Silicon (ThunderMittens), AMD GPUs (HipKittens), NVIDIA Blackwell B200, and multi-GPU clusters (ParallelKittens).^[4]^[5]^[6]^[7]

Infobox

Field	Value
Initial release	12 May 2024^[1]
Latest major release	ThunderKittens 2.0, 19 February 2026^[8]
Developers	Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, Christopher Ré^[2]
Lab	Hazy Research, Stanford University
Language	C++20 (CUDA, header-only)^[9]
License	MIT^[9]
Repository	github.com/HazyResearch/ThunderKittens^[9]
Supported GPUs (current)	Hopper (H100), Blackwell (B200); Ampere legacy^[9]
Build requirements	CUDA 12.8+, GCC 11+ or Clang 11+, PyTorch 2.8+ for bindings^[9]

History

Background and motivation

By early 2024 the GPU programming landscape for AI workloads was bifurcated. On one end, NVIDIA's CUTLASS/CuTe library offered a rich C++ template system for tensor-core kernels, but its deep template metaprogramming was widely perceived as difficult to read and modify. On the other end, OpenAI's Triton language provided a Python-style JIT compiler that produced competitive kernels for many transformer operations, but contributors at Hazy Research argued that it was difficult to map Triton onto newly introduced Hopper-class hardware features such as warp-group MMA (WGMMA) and the Tensor Memory Accelerator (TMA), and that on less common architectures (such as state-space models and linear attention) Triton kernels often lagged significantly behind hand-written CUDA.^[2]^[3]

The H100's tensor cores deliver roughly 989 TFLOP/s of half-precision matrix-multiply throughput, versus only on the order of 60 TFLOP/s for non-tensor operations, meaning the cost of leaving tensor cores idle is substantial. The Hazy Research team argued that the most productive path forward was a "small, dumb, simple" set of abstractions tightly mapped to the actual primitives the silicon exposes, rather than a fully general compiler.^[1]
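
To make the gap concrete, the following back-of-the-envelope sketch (not taken from the blog post) estimates effective throughput when a fraction of a kernel's floating-point work runs off the tensor cores. It assumes the two kinds of work run back to back with no overlap, and the helper name `effective_tflops` is hypothetical; the 989 and 60 TFLOP/s figures are the ones quoted above.

```cpp
#include <cstdio>

// Peak rates quoted in the text for an H100 (TFLOP/s).
constexpr double kTensorPeak = 989.0;  // half-precision tensor-core matmul
constexpr double kOtherPeak  = 60.0;   // non-tensor operations (order of magnitude)

// Effective TFLOP/s when `non_tensor_fraction` of all FLOPs run off the
// tensor cores. Assumption: the two phases execute serially, so times add.
constexpr double effective_tflops(double non_tensor_fraction) {
    const double tensor_time = (1.0 - non_tensor_fraction) / kTensorPeak;
    const double other_time  = non_tensor_fraction / kOtherPeak;
    return 1.0 / (tensor_time + other_time);
}

// Print the estimate for a few illustrative fractions.
inline void print_effective_throughput() {
    const double fractions[] = {0.0, 0.01, 0.05, 0.10};
    for (double f : fractions) {
        std::printf("%4.0f%% non-tensor FLOPs -> %4.0f TFLOP/s\n",
                    100.0 * f, effective_tflops(f));
    }
}
```

Under this simplified model, moving 10 percent of the FLOPs off the tensor cores drops effective throughput below 400 TFLOP/s, less than half of peak.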

Hazy Research, the Stanford University research group led by Christopher Ré, had been working on long-sequence architectures (including the FlashAttention line through Tri Dao, the H3/Hyena/Based linear attention families through Daniel Y. Fu and Simran Arora, and the Mamba and Mamba 2 state-space models). Each of these directions repeatedly required custom CUDA kernels to land at competitive throughput, and the group's stated goal for TK was to make the cost of writing and rewriting such kernels low enough that algorithmic innovation could keep pace with hardware capability.^[1]^[2]^[10]

Release: "GPUs Go Brrr" (May 2024)

ThunderKittens was first announced on 12 May 2024 in a Hazy Research blog post titled "GPUs Go Brrr: Fooling Around With Image Models" by Benjamin Spector, Aaryan Singhal, Simran Arora, and Chris Ré.^[1] A companion post the same day, "ThunderKittens: A Simple Embedded DSL for AI Kernels," summarized the framework's design goals.^[2] The release shipped reference kernels for FlashAttention on the NVIDIA A100 and RTX 4090, an H100 implementation of FlashAttention-2, and Based linear attention, claiming the H100 attention forward kernel was roughly 30 percent faster than the FlashAttention-2 baseline at the time and used about 100 lines of device code.^[1] The post also reported that a Based linear attention kernel sustained roughly 215 TFLOP/s on H100.^[1]

The release framed itself as a set of lessons learned from earlier kernel work in the lab rather than as a finished compiler. The blog post walks through five practical lessons (keep the tensor cores fed, use shared memory carefully, hide latency with asynchronous loads, treat WGMMA and TMA as first-class abstractions, and write small composable kernels), and then introduces ThunderKittens as the codification of those lessons in a header file. Two complete examples (an RTX 4090 flash attention kernel of roughly 60 lines achieving about 75 percent of theoretical utilization, and the 100-line H100 FlashAttention-2 kernel) accompany the announcement.^[1]

"Easier, Better, Faster, Cuter" (October 2024)

On 29 October 2024, the same team released a substantial update under the title "Easier, Better, Faster, Cuter." This release added a fused Mamba 2 kernel (described as "several times faster than the current Triton implementation"), long convolution kernels based on FFTConv at roughly 9 times the throughput of FlashFFTConv at sequence length 4096, additional linear attention variants (Based and LoLCATS Hedgehog), and fused RoPE, LayerNorm, and Linear-layer kernels. The same update introduced a Python setup.py install build flow, automatic shared-memory management, global layout descriptors, and broader FP16/FP32 support; the team also reported that the TK attention kernels exceeded FlashAttention-3 on the backward pass.^[10]

Paper: "ThunderKittens: Simple, Fast, and Adorable AI Kernels" (October 2024)

On 27 October 2024 the authors posted the arXiv preprint "ThunderKittens: Simple, Fast, and Adorable AI Kernels" (arXiv:2410.20399), with Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, and Christopher Ré as authors.^[11] The paper formalizes a three-level abstraction (warp, thread-block, grid), describes the load-compute-store-finish (LCSF) template that organizes asynchronous pipelining at the block level, and reports that TK matches cuBLAS and FlashAttention-3 on standard GEMM and attention forward, while delivering 10 to 40 percent improvements on attention backward, up to roughly 14 times speedups on linear attention versus Flash Linear Attention, and around 8 times speedups on state-space model kernels.^[3]^[11]

ThunderMittens (Apple Silicon, November 2024)

On 28 November 2024, Conner Takehana and Aaryan Singhal (with the original TK team) released ThunderMittens, a Metal Shading Language port of the ThunderKittens abstractions for Apple Silicon, initially targeting the M2 Pro.^[4] The post observed that only one major change was required: shrinking the base tile from 16x16 (matched to H100 tensor cores) to 8x8 to align with Metal's simdgroup_matrix<T, 8, 8> type. ThunderMittens reportedly delivered GEMM kernels roughly 9 percent faster than baseline and attention inference within plus or minus 15 percent of the reference MLX implementation for head dimensions 64 and 128.^[4]

Blackwell port (March 2025)

On 15 March 2025, Benjamin Spector, Aaryan Singhal, Dan Fu, and Chris Ré published "ThunderKittens Now on Blackwells!" introducing support for the NVIDIA Blackwell B200 generation. The port added BF16 and FP8 GEMM kernels, plus attention forward and backward implementations. The team reported that the TK BF16 GEMM was up to roughly twice as fast as the cuBLAS H100 implementation and that the attention kernels were also up to twice as fast as FlashAttention-3 on H100, while being near cuDNN performance on B200.^[5]

The Blackwell release exposed three new hardware concepts to TK programmers: fifth-generation tensor cores that behave as 128x128 systolic arrays (requiring M and N dimensions to be multiples of 128 for full utilization), Tensor Memory (TMEM), a 256 KB dedicated register layer used to stage tensor-core operands, and CTA pairs under TCGEN05, allowing two cooperative thread arrays on adjacent SMs to coordinate a single larger GEMM.^[5]
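
The multiple-of-128 constraint can be illustrated with a small sketch. It is a simplified model that treats padding as the only source of lost work; the helper names `round_up` and `tile_utilization` are hypothetical and are not part of ThunderKittens.

```cpp
#include <cstddef>

// Tile edge the text gives for Blackwell's fifth-generation tensor cores.
constexpr std::size_t kMmaTile = 128;

// Round `n` up to the next multiple of `multiple`.
constexpr std::size_t round_up(std::size_t n, std::size_t multiple) {
    return (n + multiple - 1) / multiple * multiple;
}

// Fraction of the padded M x N output that is useful work.
// Assumption: padding is the only inefficiency being modeled.
constexpr double tile_utilization(std::size_t m, std::size_t n) {
    const std::size_t padded_m = round_up(m, kMmaTile);
    const std::size_t padded_n = round_up(n, kMmaTile);
    return static_cast<double>(m * n) /
           static_cast<double>(padded_m * padded_n);
}
```

For example, an output with M = 200 and N = 128 must be padded to 256 x 128, so only about 78 percent of the tensor-core work is useful under this model, while M = 256 and N = 384 needs no padding.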

One Kernel for All Your GPUs and ParallelKittens (September to November 2025)

In a 22 September 2025 post, "One Kernel for All Your GPUs," Stuart Sul, Dylan Lim, Benjamin Spector, and Chris Ré introduced a Parallel Global Layout (PGL) abstraction and a TKParallelTensor PyTorch wrapper for writing multi-GPU kernels with TMA descriptors and multicast addresses. They demonstrated all-reduce, all-gather, and all-to-all collectives that were competitive with, and in some cases up to 2.6 times faster than, the NVIDIA Collective Communications Library (NCCL) on 8x B200 systems.^[6]

This effort culminated in ParallelKittens, released on 17 November 2025 by Stuart Sul, Simran Arora, Benjamin Spector, and Chris Ré, which extends ThunderKittens to fused multi-GPU kernels for sequence parallelism, all-gather plus GEMM, ring attention, and other patterns over NVLink.^[7]

HipKittens (AMD, November 2025)

On 9 November 2025, William Hu, Drew Wadsworth, Chris Ré, Simran Arora, and collaborators (including Sean Siddens, Stanley Winata, Daniel Fu, Ryan Swann, and Muhammad Osama) released HipKittens, an AMD-targeted sister project hosted at HazyResearch/HipKittens. HipKittens targets the AMD Instinct MI355X (with references to MI350X) and reports that a roughly 500-line GQA attention kernel outperforms the AITER hand-optimized assembly baseline; its GEMM hot loop fits in under 100 lines and achieves peak performance in BF16 and FP8.^[12]

ThunderKittens 2.0 (February 2026)

ThunderKittens 2.0, by Stuart Sul and Chris Ré, was released on 19 February 2026. The release focused on internal refactoring ("subtraction as much as addition"), removing unnecessary fence instructions (worth 20 to 30 TFLOP/s on certain kernels), improving PTX assembler output for up to 10 percent end-to-end gains, adding CLC scheduling, and merging external contributions. It also delivered new GEMM kernels in MXFP8 and NVFP4 precisions matching or surpassing cuBLAS on B200, along with refreshed example kernels.^[8]

Technical details

Design philosophy

ThunderKittens is deliberately small. The authors describe the library as an "opinionated" set of primitives implemented as C++ templates inside a single CUDA header (kittens.cuh), with a published size of under 1 MB compared to CUTLASS at roughly 22 MB and cuBLAS at roughly 689 MB.^[3] The motivating insight is that the vast majority of modern AI compute is matrix multiplications and reductions over relatively small tiles, and that the GPU's tensor cores impose hard alignment requirements that make 16x16 (on Hopper) or 8x8 (on Apple Silicon) tiles a natural unit of abstraction. The TK team summarizes the constraint as a rhetorical question: "if your matrix multiply is smaller than 16x16, are you sure what you're doing is AI?"^[1]
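
The idea of making the tensor-core tile the unit of abstraction can be sketched in ordinary C++. The `Tile` template below is a hypothetical illustration, not the ThunderKittens API: it only shows how a shape that is not a multiple of the base tile can be rejected at compile time, with the base tile set to 16 for the NVIDIA case and 8 for the Apple Silicon case described above.

```cpp
#include <array>
#include <cstddef>

// Hypothetical tile type for illustration only; NOT the ThunderKittens API.
template <typename T, std::size_t Rows, std::size_t Cols,
          std::size_t BaseTile = 16>
struct Tile {
    static_assert(Rows % BaseTile == 0, "rows must be a multiple of the base tile");
    static_assert(Cols % BaseTile == 0, "cols must be a multiple of the base tile");

    // Number of base tiles along each dimension.
    static constexpr std::size_t tile_rows = Rows / BaseTile;
    static constexpr std::size_t tile_cols = Cols / BaseTile;

    std::array<T, Rows * Cols> data{};  // row-major storage, zero-initialized

    // Element access by (row, column).
    T& at(std::size_t r, std::size_t c) { return data[r * Cols + c]; }
    const T& at(std::size_t r, std::size_t c) const { return data[r * Cols + c]; }
};

using Tile64x64       = Tile<float, 64, 64>;     // 4 x 4 base tiles of 16 x 16
using MetalTile32x32  = Tile<float, 32, 32, 8>;  // 4 x 4 base tiles of 8 x 8
// Tile<float, 24, 16> would not compile: 24 is not a multiple of 16.
```

Encoding the shape in the type means a mismatched tile is a compile error rather than a silent performance or correctness problem at run time.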

Three-level hierarchy

The arXiv paper describes a three-level mapping of TK abstractions to the GPU's parallelism hierarchy:^[3]

Level	TK abstractions	Mapped hardware
Warp	Register tile, register vector, shared tile, shared vector; PyTorch-style `mma`, `exp`, `cumsum`, pointwise ops^[1]^[3]	32-thread warp on a single SM
Block	Load-compute-store-finish (LCSF) template with load workers, compute workers, multi-stage pipelined buffers^[3]	Thread block / cooperative thread array (CTA)
Grid	Persistent launch and L2-cache-aware block scheduling^[3]	Entire kernel grid across SMs

At the warp level, register and shared tiles are templated by element type, shape, and layout, and shared tiles automatically pick a swizzling pattern to avoid bank conflicts. The paper reports that TK attention backward eliminates the 9.6-way bank conflicts observed in profiling FlashAttention-3, contributing to an 85 percent reduction in shared-memory stall cycles.^[3]
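
Why swizzling removes bank conflicts can be shown with a host-side model. The sketch below assumes the standard NVIDIA shared-memory organization of 32 banks of 4 bytes and uses a generic XOR swizzle on a 32 x 32 tile of 4-byte elements; it is an illustration of the technique, not the specific patterns ThunderKittens selects, and all names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr std::size_t kBanks = 32;  // 4-byte banks in NVIDIA shared memory
constexpr std::size_t kTile  = 32;  // 32 x 32 tile of 4-byte elements

// Word offset of element (row, col) in a plain row-major layout.
constexpr std::size_t linear_offset(std::size_t row, std::size_t col) {
    return row * kTile + col;
}

// Same tile with an XOR swizzle: each row permutes its columns by its row index.
constexpr std::size_t swizzled_offset(std::size_t row, std::size_t col) {
    return row * kTile + (col ^ (row % kTile));
}

// Worst-case number of threads that hit the same bank when 32 threads read
// one column of the tile (thread t reads row t).
template <typename OffsetFn>
std::size_t column_read_conflict(OffsetFn offset, std::size_t col) {
    std::array<std::size_t, kBanks> hits{};
    for (std::size_t row = 0; row < kTile; ++row) {
        ++hits[offset(row, col) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In the row-major layout every element of a column falls in the same bank, so a column read serializes 32 ways; with the XOR swizzle the 32 rows land in 32 distinct banks and the read is conflict-free.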

At the block level, the LCSF template separates warps into "loaders" that move data from HBM into shared memory using TMA, and "computers" that consume the loaded tiles via WGMMA. Multi-stage circular buffers and named barriers allow the producer and consumer warps to overlap, with the paper reporting a roughly 2.6x throughput improvement (260 to 683 TFLOP/s on a representative kernel) as pipeline depth is increased.^[3]
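
The benefit of a deeper buffer can be seen in a toy timing model of one loader and one compute worker sharing a ring buffer. This is a sketch under simplifying assumptions (fixed per-tile load and compute times, a slot is reusable as soon as its tile has been consumed, `stages` is at least 1); the function name is hypothetical and nothing here is ThunderKittens code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Finish time of the last tile when a loader and a compute worker share a
// ring buffer with `stages` slots. Precondition: stages >= 1.
inline double pipelined_time(std::size_t num_tiles, std::size_t stages,
                             double load_time, double compute_time) {
    std::vector<double> load_done(num_tiles, 0.0);
    std::vector<double> compute_done(num_tiles, 0.0);
    for (std::size_t i = 0; i < num_tiles; ++i) {
        // The loader is free once its previous load has finished...
        double start_load = (i > 0) ? load_done[i - 1] : 0.0;
        // ...and it needs the slot last used by tile i - stages to be drained.
        if (i >= stages) {
            start_load = std::max(start_load, compute_done[i - stages]);
        }
        load_done[i] = start_load + load_time;

        // The compute worker starts tile i after finishing tile i - 1 and
        // after the data for tile i has landed.
        const double prev_compute = (i > 0) ? compute_done[i - 1] : 0.0;
        compute_done[i] = std::max(prev_compute, load_done[i]) + compute_time;
    }
    return num_tiles == 0 ? 0.0 : compute_done.back();
}
```

With four tiles, a load time of 2 and a compute time of 3, a single-stage buffer takes 20 time units because every load waits for the previous compute, while two stages take 14 because loads hide behind compute.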

At the grid level, persistent kernels keep a fixed number of CTAs resident on each SM and let them iterate over work tiles, and the framework includes L2-aware block schedulers that reorder tile assignment to improve cache reuse. The paper notes that for large matrices L2 reuse can change observed performance by more than 50 percent.^[3]
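
One common way to reorder tile assignment for cache reuse is the "grouped" ordering used in many GEMM tutorials, in which consecutive block indices stay within a narrow band of output rows before moving on. The sketch below shows that generic technique; it is not claimed to be the scheduler ThunderKittens ships, and `grouped_tile` and `TileCoord` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>

struct TileCoord {
    std::size_t m;  // tile row index
    std::size_t n;  // tile column index
};

// Map a linear block index to an output tile so that consecutive blocks cover
// `group_m` tile rows at a time, column by column.
// Preconditions: group_m >= 1 and block_id < tiles_m * tiles_n.
constexpr TileCoord grouped_tile(std::size_t block_id, std::size_t tiles_m,
                                 std::size_t tiles_n, std::size_t group_m) {
    const std::size_t blocks_per_group = group_m * tiles_n;
    const std::size_t group = block_id / blocks_per_group;
    const std::size_t first_m = group * group_m;
    // The last group may be shorter if tiles_m is not a multiple of group_m.
    const std::size_t rows_in_group = std::min(tiles_m - first_m, group_m);
    const std::size_t local = block_id % blocks_per_group;
    return TileCoord{first_m + local % rows_in_group, local / rows_in_group};
}
```

On a 4 x 4 tile grid with `group_m = 2`, blocks 0 to 3 map to tiles (0,0), (1,0), (0,1), (1,1), so blocks that run at the same time share rows of one input and columns of the other instead of sweeping an entire row of the output.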

Hardware features used

TK is built specifically around hardware features that became available in NVIDIA H100 (Hopper) and were extended in NVIDIA Blackwell:

WGMMA (Warp-Group Matrix Multiply Accumulate): an asynchronous tensor-core instruction operating on a 128-thread warp group that produces tiles in registers. TK directly emits WGMMA, allowing the compute workers to fire matrix multiplies while loaders refill shared memory.^[1]^[3]
TMA (Tensor Memory Accelerator): hardware-managed bulk copy that generates HBM addresses from a descriptor instead of per-thread address arithmetic. TK builds TMA descriptors from its global-layout objects and uses them for asynchronous loads and stores.^[1]^[3]
Tensor Memory (TMEM) and TCGEN05 (Blackwell): a dedicated 256 KB register layer for tensor-core operands, plus the ability for two CTAs on adjacent SMs to coordinate a single larger MMA. TK exposes both through new template types in the Blackwell port.^[5]
Shared-memory swizzling: TK selects one of three swizzling layouts (32, 64, or 128 bytes) automatically for each tile to avoid bank conflicts.^[3]

Programming model

A TK kernel is a CUDA __global__ function that declares register-tile and shared-tile variables (types such as rt_bf and st_bf in the library's kittens namespace), invokes warp- or warp-group-level operations such as mma_AB, exp, or row_sum, and uses TMA primitives for asynchronous I/O. The H100 FlashAttention-2 reference kernel in the original blog post is roughly 100 lines of device code, while the team reports that a comparable hand-written CUDA implementation runs to about 1,200 lines.^[1] On the same hardware, a TK matrix-multiply example reaches roughly 855 TFLOP/s, around 86 percent of the H100's theoretical BF16 peak, in under 100 lines.^[9]

The shape of a typical TK attention kernel illustrates the model: a producer warp issues TMA loads of Q, K, and V tiles into a multi-stage shared-memory ring buffer, while consumer warp groups iterate over the tiles, performing WGMMA matrix multiplies into register tiles, applying the softmax rescaling that FlashAttention uses, and finally writing the output tile back via TMA stores. The producer and consumer coordinate through named barriers, and tile shapes and stages are template parameters chosen by the kernel author. The framework provides "PyTorch-like" tile operations (pointwise multiply, exp, mma, cumsum, transpose) so that the algorithmic structure of the kernel reads similarly to the higher-level mathematical description.^[1]^[3]
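
The softmax rescaling step can be shown on scalars. The following sketch computes a softmax-weighted sum by streaming over blocks of scores while keeping only a running maximum, a running normalizer, and an unnormalized accumulator, which is the online-softmax idea that FlashAttention applies per tile. It is plain host C++ for illustration, with a hypothetical function name, and assumes a non-empty input with finite scores and a block size of at least 1.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Returns sum_i softmax(scores)_i * values[i], processing `block` scores at a
// time. Preconditions: scores is non-empty, same length as values, block >= 1.
inline double streaming_softmax_dot(const std::vector<double>& scores,
                                    const std::vector<double>& values,
                                    std::size_t block) {
    double running_max = -std::numeric_limits<double>::infinity();
    double running_sum = 0.0;  // sum of exp(score - running_max) so far
    double acc = 0.0;          // sum of exp(score - running_max) * value so far

    for (std::size_t start = 0; start < scores.size(); start += block) {
        const std::size_t end = std::min(start + block, scores.size());

        // New maximum after seeing this block.
        double new_max = running_max;
        for (std::size_t i = start; i < end; ++i) {
            new_max = std::max(new_max, scores[i]);
        }

        // Rescale everything accumulated under the old maximum.
        const double rescale = std::exp(running_max - new_max);
        running_sum *= rescale;
        acc *= rescale;
        running_max = new_max;

        // Accumulate this block relative to the new maximum.
        for (std::size_t i = start; i < end; ++i) {
            const double p = std::exp(scores[i] - running_max);
            running_sum += p;
            acc += p * values[i];
        }
    }
    return acc / running_sum;
}
```

The result does not depend on the block size, which is what allows an attention kernel to consume K and V one tile at a time without materializing the full score matrix.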

Occupancy and the cost model

The arXiv paper formalizes a simple cost model in which kernel runtime is the maximum of the time to move data and the time to do compute, plus overheads. Achieving good performance reduces to keeping the overheads small, so that runtime approaches whichever of the two bounds is binding (the HBM bandwidth limit or the tensor-core FLOPS limit), which in turn requires overlapping loads and compute well. The LCSF template is designed to make this overlap explicit, and the paper's Figure 6 reports a trade-off in which higher CTA occupancy increases overlap but constrains per-worker tile size due to register pressure; TK exposes this trade-off through a single occupancy parameter rather than a hand-crafted schedule.^[3]
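
A minimal version of such a cost model can be written in a few lines. The sketch below collapses the model to two ceilings and a single overhead term; the struct and function names are hypothetical, and any bandwidth figure plugged in (for example roughly 3.35 TB/s for an H100 SXM) is an assumption rather than a number from the paper.

```cpp
#include <algorithm>

struct Hardware {
    double hbm_bytes_per_s;  // memory bandwidth ceiling
    double flops_per_s;      // tensor-core throughput ceiling
};

// Runtime estimate: the slower of data movement and compute, plus overhead.
// Assumption: loads and compute overlap perfectly apart from `overhead_s`.
constexpr double kernel_time_s(const Hardware& hw, double bytes_moved,
                               double flops, double overhead_s) {
    const double memory_time  = bytes_moved / hw.hbm_bytes_per_s;
    const double compute_time = flops / hw.flops_per_s;
    return std::max(memory_time, compute_time) + overhead_s;
}

// True when compute, not memory traffic, is the binding bound.
constexpr bool is_compute_bound(const Hardware& hw, double bytes_moved,
                                double flops) {
    return flops / hw.flops_per_s >= bytes_moved / hw.hbm_bytes_per_s;
}
```

As a usage example, a square BF16 GEMM with n = 8192 performs about 1.1e12 FLOPs and, if each matrix is touched once, moves about 4.0e8 bytes. With `Hardware{3.35e12, 989e12}` the model gives roughly 1.1 ms of compute against roughly 0.12 ms of memory traffic, so the kernel is compute-bound and any time not spent issuing tensor-core work shows up directly as lost throughput.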

Reference kernels

The HazyResearch/ThunderKittens repository ships a growing set of reference kernels that double as benchmarks and as starting points for users. As of the ThunderKittens 2.0 release these include:^[3]^[9]^[10]

Kernel	Notes
GEMM (BF16, FP8, MXFP8, NVFP4)	Matches or surpasses cuBLAS on H100 and B200 in the reported configurations^[3]^[5]^[8]
Attention forward (causal and non-causal)	Matches FlashAttention-3 on H100; on B200, up to ~2x faster than FlashAttention-3 on H100 and near cuDNN^[3]^[5]
Attention backward	Reported 10 to 40 percent improvement over FlashAttention-3 on H100^[3]
Mamba-2 fused kernel	"Several times faster" than the prior Triton implementation; ~8x state-space-model speedup in the paper^[3]^[10]
Linear attention (Based, LoLCATS, Hedgehog)	Up to ~14x faster than Flash Linear Attention^[3]^[10]
Long convolution / FFTConv	~7.9x faster than FlashFFTConv at 1024 tokens, ~9x at 4096 tokens^[3]^[10]
Rotary positional embedding	Competitive with Triton reference^[10]
LayerNorm, Linear, dropout-residual-layernorm fused ops	Production-oriented fused kernels^[10]^[12]
LoLCATS linearization kernels	Tied to the LoLCATS linear-attention conversion recipe^[9]
Multi-GPU collectives (ParallelKittens)	All-reduce, all-gather, reduce-scatter, all-to-all, ring attention^[7]

Variants and ports

The Kittens family has expanded into a small ecosystem of sister projects sharing the tile abstraction:

Project	Target hardware	Release	Authors (first listed)
ThunderKittens	NVIDIA Hopper, then Blackwell	12 May 2024^[1]	Spector, Singhal, Arora, Ré
ThunderMittens	Apple Silicon (Metal Shading Language)	28 November 2024^[4]	Takehana, Singhal
ThunderKittens Blackwell port	NVIDIA B200	15 March 2025^[5]	Spector, Singhal, Fu, Ré
One Kernel for All Your GPUs (PGL)	Multi-GPU NVLink/NVSwitch	22 September 2025^[6]	Sul, Lim, Spector, Ré
HipKittens	AMD Instinct MI355X (CDNA4)	9 November 2025^[12]	Hu, Wadsworth, Ré, Arora
ParallelKittens	Multi-GPU sequence/expert parallelism	17 November 2025^[7]	Sul, Arora, Spector, Ré
ThunderKittens 2.0	Hopper, Blackwell; MXFP8, NVFP4	19 February 2026^[8]	Sul, Ré

Comparison to other GPU programming approaches

System	Style	Primary user	Hopper feature coverage	Repository size (per TK paper)^[3]
Raw CUDA C++	Imperative C++ with PTX/MMA intrinsics	Library authors	Full but manual	n/a
CUTLASS / CuTe	Heavily templated C++ DSL with layout algebra	Library authors	Full	~22 MB
OpenAI Triton	Python-based, JIT-compiled, block-pointer model	Researchers and ML engineers	Partial; explicit WGMMA/TMA exposure is limited	~12.6 MB
cuBLAS	Closed-source optimized binary library	All	Full but opaque	~689 MB
ThunderKittens	Small embedded C++ DSL with tile primitives	Researchers writing custom kernels	Full direct WGMMA/TMA/TCGEN05	<1 MB

The TK paper argues that CUTLASS's flexibility "comes at an accessibility cost" because of its nested template style, and that Triton's higher-level abstractions "fail less gracefully" when a workload falls outside its assumed patterns, such as exotic recurrent state-space kernels.^[3] By contrast, TK aims to remain ordinary CUDA C++ that any CUDA programmer can extend, while baking in just enough structure (tiles, LCSF, layout descriptors) to make the asynchronous Hopper and Blackwell hardware tractable.^[1]^[3]

Adoption

ThunderKittens began as an internal project at Hazy Research but has seen growing external use. The Hazy Research team reports that TK powers production training kernels at Cursor (for the Composer code model) and inference kernels at Together AI.^[8] The ThunderKittens 2.0 announcement also mentions that the framework is used at firms including Jump Trading.^[8] Community and academic interest is reflected in the project's public repository under HazyResearch/ThunderKittens (MIT-licensed, header-only) and in discussions on platforms such as Hacker News, where the original "GPUs Go Brrr" post appeared on the front page in May 2024.^[13]

Adoption within the Hazy Research lab itself spans the group's own research on long-context architectures: the Based linear-attention architecture, the Mamba-2 state-space model line, the LoLCATS linear-attention conversion pipeline, the FFTConv long-convolution kernels, and various nanoGPT-TK and PyTorch Lightning training demonstrations all ship as TK kernels in the public repository.^[1]^[10] The framework's role as a "research vehicle" is reinforced by the fact that the original release explicitly bundled a nanoGPT integration so that users could see end-to-end training of a small language model on top of TK kernels.^[2]^[10]

Significance

ThunderKittens is one of several efforts in 2024 to 2026 that argue the practical bottleneck for new AI architectures is no longer raw FLOPS but the difficulty of writing kernels that exploit modern accelerator features such as warp-group MMA, TMA, and tensor memory. By demonstrating that 100-line tile-based kernels can match or beat the hand-tuned FlashAttention-3 and cuBLAS on H100 and B200, and by showing 8x to 14x speedups on state-space model and linear attention kernels, the project provides empirical support for the idea that small, opinionated DSLs can close the gap between research-friendly languages and vendor libraries on a workload-by-workload basis.^[1]^[3]^[10] The follow-on work (ThunderMittens, HipKittens, ParallelKittens) further suggests that the tile-based programming model generalizes across vendors and across scaling regimes from a single chip to multi-GPU clusters.^[4]^[7]^[12]

A second contribution of TK is pedagogical. The "GPUs Go Brrr" post is widely cited as an accessible explanation of why Hopper-class GPUs require asynchronous overlap to reach peak throughput, why bank conflicts and shared-memory swizzling matter, and how WGMMA and TMA differ from their predecessors. The deliberately compact codebase (under 1 MB compared to CUTLASS at roughly 22 MB) and the use of plain CUDA C++ rather than an intermediate IR have made TK a frequent entry point for researchers learning to write fused kernels.^[1]^[3]^[13]

Performance results (selected)

The published numbers for ThunderKittens are scattered across the original blog post, the arXiv paper, and the various follow-ups. The table below collects the most frequently cited measurements for context.

Workload	Hardware	TK result	Baseline	Source
FlashAttention forward (causal)	H100 SXM	~30% faster than FlashAttention-2	FlashAttention-2	"GPUs Go Brrr"^[1]
Based linear attention	H100 SXM	~215 TFLOP/s sustained (300+ counting algorithmic recomputation)	Prior linear attention	"GPUs Go Brrr"^[1]
FlashAttention backward	H100	10 to 40 percent faster than FlashAttention-3	FlashAttention-3	TK paper^[3]
Linear attention (Based)	H100	~14x faster	Flash Linear Attention	TK paper^[3]
Linear attention (learned features)	H100	~6.5x faster	Flash Linear Attention	TK paper^[3]
State-space models (Mamba-2)	H100	~8x faster than Triton baselines	Triton Mamba-2	TK paper^[3]
Long convolution (FFTConv)	H100	~7.9x faster at 1024 tokens, ~9x at 4096 tokens	FlashFFTConv	TK paper, "Easier, Better, Faster, Cuter" post^[3]^[10]
GEMM BF16	B200	Up to ~2x faster than cuBLAS on H100, near cuBLAS on B200	cuBLAS	Blackwell post^[5]
Attention forward/backward	B200	Up to ~2x faster than FlashAttention-3 on H100; near cuDNN on B200	FlashAttention-3, cuDNN	Blackwell post^[5]
GEMM example	H100	~855 TFLOP/s (~86 percent of peak BF16)	n/a	TK README^[9]
All-to-all collective	8x B200	Up to ~2.6x faster than NCCL	NCCL	PGL post^[6]

These numbers are reported by the authors using their own measurement methodologies and should be read in that context; competing libraries continue to improve, and some baselines (FlashAttention-2, Triton Mamba-2) have themselves been updated since the corresponding TK measurements were published.

Limitations

ThunderKittens is narrower than competing systems in several respects:

Hardware scope. Current development focuses on NVIDIA Hopper and Blackwell; Ampere is no longer actively maintained, and the public repository directs AMD users to HipKittens rather than supporting both backends in one tree.^[9]^[12]
Tile size restrictions. The 16x16 minimum tile (8x8 on Apple Silicon) is convenient for tensor-core workloads but inappropriate for problems that do not factor cleanly into such tiles.^[1]^[4]
C++ template complexity. Although the library is much smaller than CUTLASS, it is still C++20 with heavy templates, and users must be comfortable reading CUDA error messages and PTX. The team's own ThunderKittens 2.0 post notes substantial time spent on "subtraction" of unnecessary instructions and on chasing assembler quirks.^[8]
Not a general compiler. Unlike Triton or torch.compile, TK does not autotune or auto-schedule a kernel from a high-level Python expression; the developer writes the kernel directly, and the abstractions are intentionally thin.^[3]
Build prerequisites. As of the current release the framework requires CUDA 12.8+, a C++20 compiler (GCC 11+ or Clang 11+), and PyTorch 2.8+ for the supplied Python bindings.^[9]

Related work

FlashAttention and FlashAttention-2/3 by Tri Dao and collaborators, which TK explicitly benchmarks against and which inspired many of TK's tiling and pipelining strategies.^[1]^[3]
Triton (compiler), OpenAI's Python-embedded tile DSL with a JIT compiler.^[3]
CUDA and NVIDIA's CUTLASS template library, the underlying primitives TK is implemented on top of.^[3]
Mamba and Mamba 2, the state-space-model architectures that motivated the early TK Mamba-2 kernel.^[10]
MLX and Apple Silicon, the platform targeted by the ThunderMittens port.^[4]
NVLink and NCCL (NVIDIA Collective Communications Library), the substrate ParallelKittens compares against for multi-GPU collectives.^[6]^[7]
AMD Instinct MI355X and the HipKittens AMD port.^[12]

References

1. Benjamin Spector, Aaryan Singhal, Simran Arora and Chris Ré, "GPUs Go Brrr", Hazy Research blog, Stanford University, 2024-05-12. https://hazyresearch.stanford.edu/blog/2024-05-12-tk. Accessed 2026-05-21.
2. Benjamin Spector, Aaryan Singhal, Simran Arora and Chris Ré, "ThunderKittens: A Simple Embedded DSL for AI kernels", Hazy Research blog, Stanford University, 2024-05-12. https://hazyresearch.stanford.edu/blog/2024-05-12-quick-tk. Accessed 2026-05-21.
3. Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu and Christopher Ré, "ThunderKittens: Simple, Fast, and Adorable AI Kernels (HTML)", arXiv:2410.20399, 2024-10-27. https://arxiv.org/html/2410.20399v1. Accessed 2026-05-21.
4. Conner Takehana, Aaryan Singhal, Benjamin Spector, Simran Arora, Daniel Y. Fu and Chris Ré, "ThunderMittens For Your ThunderKittens", Hazy Research blog, Stanford University, 2024-11-28. https://hazyresearch.stanford.edu/blog/2024-11-28-tk-mlx. Accessed 2026-05-21.
5. Benjamin Spector, Aaryan Singhal, Daniel Y. Fu and Christopher Ré, "ThunderKittens Now on Blackwells!", Hazy Research blog, Stanford University, 2025-03-15. https://hazyresearch.stanford.edu/blog/2025-03-15-tk-blackwell. Accessed 2026-05-21.
6. Stuart Sul, Dylan Lim, Benjamin Spector and Christopher Ré, "One Kernel for All Your GPUs", Hazy Research blog, Stanford University, 2025-09-22. https://hazyresearch.stanford.edu/blog/2025-09-22-pgl. Accessed 2026-05-21.
7. Stuart Sul, Simran Arora, Benjamin Spector and Christopher Ré, "ParallelKittens: Simple and Fast Multi-GPU AI Kernels", Hazy Research blog, Stanford University, 2025-11-17. https://hazyresearch.stanford.edu/blog/2025-11-17-pk. Accessed 2026-05-21.
8. Stuart Sul and Christopher Ré, "ThunderKittens 2.0: Even Faster Kernels for Your GPUs", Hazy Research blog, Stanford University, 2026-02-19. https://hazyresearch.stanford.edu/blog/2026-02-19-tk-2. Accessed 2026-05-21.
9. HazyResearch, "ThunderKittens: Tile primitives for speedy kernels (README)", GitHub repository, 2026-02-19. https://github.com/HazyResearch/ThunderKittens. Accessed 2026-05-21.
10. Benjamin Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu and Christopher Ré, "Easier, Better, Faster, Cuter", Hazy Research blog, Stanford University, 2024-10-29. https://hazyresearch.stanford.edu/blog/2024-10-29-tk2. Accessed 2026-05-21.
11. Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu and Christopher Ré, "ThunderKittens: Simple, Fast, and Adorable AI Kernels", arXiv preprint 2410.20399, 2024-10-27. https://arxiv.org/abs/2410.20399. Accessed 2026-05-21.
12. William Hu, Drew Wadsworth, Chris Ré, Simran Arora, Sean Siddens, Stanley Winata, Daniel Fu, Ryan Swann and Muhammad Osama, "HipKittens: Fast and Furious AMD Kernels", Hazy Research blog, Stanford University, 2025-11-09. https://hazyresearch.stanford.edu/blog/2025-11-09-hk. Accessed 2026-05-21.
13. "GPUs Go Brrr", Hacker News discussion, 2024-05-12. https://news.ycombinator.com/item?id=40337936. Accessed 2026-05-21.
