CUTLASS
Last reviewed
May 31, 2026
Sources
14 citations
Review status
Source-backed
Revision
v2 · 2,105 words
CUTLASS is an open-source library of reusable building blocks for writing high-performance matrix kernels on NVIDIA GPUs. The name expands to CUDA Templates for Linear Algebra Subroutines, and the project is published by NVIDIA on GitHub. [1][2] At its core, CUTLASS gives developers a set of C++ templates (and, more recently, a Python interface) for building General Matrix Multiply (GEMM) kernels and closely related operations, with enough control over how work maps onto the hardware that a kernel assembled from these templates can come within a few percent of the performance of NVIDIA's own closed-source cuBLAS library. [1][3]
The library has been around since 2017 and has tracked every major NVIDIA architecture since Volta, through Turing, Ampere, Ada, Hopper, and Blackwell. [1][4] It sits underneath a surprising amount of modern AI infrastructure. Parts of FlashAttention, kernels inside PyTorch, and the FP8 and sparse GEMM paths in inference engines like vLLM are all built on CUTLASS primitives. [4][5][6]
Almost every heavy computation in deep learning is a matrix multiply in disguise. A linear layer is a GEMM. The attention scores in a transformer are a batch of GEMMs. A convolution can be rewritten as a GEMM through the implicit GEMM algorithm. When people talk about a model spending most of its time "in the matmuls," they mean that GEMM throughput sets the practical ceiling on both training cost and inference latency. [1][7]
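The claim that a convolution can be rewritten as a GEMM can be made concrete with a small CPU-side sketch. It is plain Python, not CUTLASS code, and the function names and the 1-D, stride-1, no-padding setup are assumptions chosen for brevity. The point it illustrates is the "implicit" part: the GEMM's A operand is never built in memory, and each element is fetched from the input by decoding the K index.

```python
def conv1d_direct(x, w):
    """Reference cross-correlation. x[c][i], w[f][c][r] -> y[f][p]."""
    C, W = len(x), len(x[0])
    F, R = len(w), len(w[0][0])
    P = W - R + 1  # output positions (stride 1, no padding)
    return [[sum(x[c][p + r] * w[f][c][r] for c in range(C) for r in range(R))
             for p in range(P)] for f in range(F)]


def conv1d_implicit_gemm(x, w):
    """Same convolution as a GEMM with M = P, N = F, K = C * R.

    A[m][k] is never materialized: k is decoded into (channel, tap)
    and the value is read directly from x.
    """
    C, W = len(x), len(x[0])
    F, R = len(w), len(w[0][0])
    P = W - R + 1
    M, N, K = P, F, C * R
    out = [[0] * M for _ in range(N)]
    for m in range(M):
        for n in range(N):
            acc = 0
            for k in range(K):
                c, r = divmod(k, R)              # decode the GEMM K index
                acc += x[c][m + r] * w[n][c][r]  # A[m][k] * B[k][n]
            out[n][m] = acc
    return out


x = [[1, 2, 3, 4, 5], [0, -1, 2, -3, 4]]                # 2 channels, width 5
w = [[[1, 0, -1], [2, 1, 0]], [[0, 1, 0], [1, 1, 1]]]   # 2 filters, 3 taps
assert conv1d_implicit_gemm(x, w) == conv1d_direct(x, w)
```

A real implicit GEMM kernel applies the same index decoding inside its tile loaders, in 2-D or 3-D and with padding and strides.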
Modern NVIDIA GPUs do this work on Tensor Cores, specialized units that compute small matrix multiply-accumulate operations far faster than the general-purpose CUDA cores. [3][8] Getting close to peak Tensor Core throughput is hard. The arithmetic itself is the easy part. The difficulty is keeping the cores fed, which means moving data through the memory hierarchy (global memory, then shared memory, then registers) fast enough that the cores never stall waiting for operands. A naive GEMM kernel can leave most of a GPU's theoretical performance on the table. cuBLAS solves this with hand-tuned assembly, but it is a fixed-function library: you call it with standard shapes and data types and get a fast result, yet you cannot reach inside and change how it works. CUTLASS exists to give that control back. [1][6]
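A back-of-the-envelope calculation shows why keeping the cores fed depends on reusing data once it has been loaded. The sketch below is an illustrative model, not a measurement: it assumes 2-byte operands (as with FP16 or BF16), counts only operand loads, and ignores caches and the output write. The function name is hypothetical.

```python
def tile_arithmetic_intensity(tile_m, tile_n, bytes_per_element):
    """FLOPs per byte loaded for one output tile, per step along K.

    Each unit of K reads tile_m elements of A and tile_n elements of B
    and performs tile_m * tile_n multiply-adds (2 FLOPs each).
    """
    flops = 2 * tile_m * tile_n
    bytes_loaded = (tile_m + tile_n) * bytes_per_element
    return flops / bytes_loaded


for tile in (1, 16, 128):
    print(tile, tile_arithmetic_intensity(tile, tile, bytes_per_element=2))
```

Under these assumptions a 1x1 "tile" performs 0.5 FLOPs per byte loaded, while a 128x128 tile performs 64. Larger tiles held in faster memory give the same arithmetic far more work per byte moved.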
The central idea in CUTLASS is to take the same hierarchical decomposition that cuBLAS and cuDNN use internally and expose it as composable, modular software components. [1][3] A large GEMM gets broken into a nested hierarchy of tiles, and each level of that hierarchy maps to a level of the CUDA execution and memory model. [3][7]
The levels, from coarse to fine, look like this.
| Tiling level | Maps to | What happens here |
|---|---|---|
| Threadblock tile | One CUDA threadblock | Loads tiles of the input matrices from global memory into shared memory and iterates over the K dimension, accumulating a partial product |
| Warp tile | A warp inside the threadblock | Fetches operands from shared memory into registers so the warp can issue matrix instructions |
| Thread / instruction tile | An individual thread and its Tensor Core or CUDA-core instruction | Issues the actual multiply-accumulate, using mma, wgmma, or wmma instructions on Tensor Cores |
Each of these levels is a C++ template parameter. The tile shapes, the data types, the number of pipeline stages, and the instruction selection are all things a developer sets when they instantiate the template, and the compiler then generates a specialized kernel for that exact configuration. [1][3] This is the difference that matters in practice. With cuBLAS you take what you are given. With CUTLASS you write the kernel once as a template and pick the tiling that fits your problem.
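The tile hierarchy can be sketched on the CPU as nested loops, with one pair of loops per level. This is a teaching model in plain Python rather than CUTLASS code. The tile sizes are arbitrary illustrative values, the function names are hypothetical, and on a GPU the outer levels run in parallel across threadblocks and warps instead of sequentially.

```python
def gemm_reference(A, B):
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def gemm_tiled(A, B, block=(4, 4, 2), warp=(2, 2), inst=(1, 1)):
    """Nested-tile GEMM. Tile sizes are illustrative, not CUTLASS defaults.

    block = (block_m, block_n, block_k): "threadblock" tile and its K step
    warp  = (warp_m, warp_n): "warp" tile inside the block tile
    inst  = (inst_m, inst_n): "instruction" tile inside the warp tile
    All dimensions are assumed to divide evenly.
    """
    M, K, N = len(A), len(B), len(B[0])
    bm, bn, bk = block
    wm, wn = warp
    im, in_ = inst
    assert M % bm == 0 and N % bn == 0 and K % bk == 0
    assert bm % wm == 0 and bn % wn == 0 and wm % im == 0 and wn % in_ == 0
    C = [[0] * N for _ in range(M)]
    for m0 in range(0, M, bm):                         # threadblock tiles
        for n0 in range(0, N, bn):
            for k0 in range(0, K, bk):                 # mainloop over K
                for m1 in range(m0, m0 + bm, wm):      # warp tiles
                    for n1 in range(n0, n0 + bn, wn):
                        for m2 in range(m1, m1 + wm, im):       # instruction tiles
                            for n2 in range(n1, n1 + wn, in_):
                                for i in range(m2, m2 + im):
                                    for j in range(n2, n2 + in_):
                                        for k in range(k0, k0 + bk):
                                            C[i][j] += A[i][k] * B[k][j]
    return C


A = [[i + k for k in range(4)] for i in range(8)]
B = [[k * j - 1 for j in range(8)] for k in range(4)]
assert gemm_tiled(A, B) == gemm_reference(A, B)
```

Changing `block`, `warp`, or `inst` changes the order in which work is visited but not the result. That is the property that lets a template pick the tiling to suit the problem.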
Two more pieces make the design work. The first is software pipelining, sometimes called multistage or double buffering. Memory loads are slow, so CUTLASS overlaps them with computation by keeping more than one buffer in flight. While the Tensor Cores chew on the current tile, the hardware is already loading the next tile from global memory into a second shared-memory buffer, and a second register fragment is being filled for the next warp-level operation. [3][7] On newer architectures this uses asynchronous copy instructions so the data movement does not block the math at all. The second piece is the epilogue. After the main multiply-accumulate loop finishes, the epilogue stage writes the result back out, and it is itself a template parameter. That is what lets CUTLASS fuse extra work into the same kernel. If your projection layer needs a GEMM followed by a bias add, a SiLU activation, and a residual add, you can fold all of that into the epilogue and emit a single kernel, rather than launching the GEMM and then paying for separate elementwise kernels that re-read the result from global memory. [6][7]
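Both pieces can be modelled in a few lines of plain Python. In this sketch the names `load_k_tile`, `gemm_pipelined`, and `make_bias_silu_residual` are hypothetical, the two-entry `stages` list stands in for two shared-memory buffers, and the epilogue is an ordinary callable rather than a template parameter. Python runs sequentially, so the overlap of load and compute is shown only structurally.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def load_k_tile(A, B, k0, tile_k):
    """Stand-in for a global-to-shared-memory copy of one K slice."""
    return [row[k0:k0 + tile_k] for row in A], B[k0:k0 + tile_k]


def gemm_pipelined(A, B, tile_k, epilogue):
    M, K, N = len(A), len(B), len(B[0])
    num_tiles = (K + tile_k - 1) // tile_k
    acc = [[0.0] * N for _ in range(M)]
    stages = [None, None]                       # two buffers
    stages[0] = load_k_tile(A, B, 0, tile_k)    # prologue: fill the first stage
    for t in range(num_tiles):
        cur, nxt = t % 2, (t + 1) % 2
        if t + 1 < num_tiles:
            # On a GPU this copy is asynchronous and overlaps the math below;
            # plain Python simply runs it first.
            stages[nxt] = load_k_tile(A, B, (t + 1) * tile_k, tile_k)
        a, b = stages[cur]
        for i in range(M):
            for j in range(N):
                acc[i][j] += sum(a[i][k] * b[k][j] for k in range(len(b)))
    # Fused epilogue: applied to the accumulators before the single write-out.
    return [[epilogue(acc[i][j], i, j) for j in range(N)] for i in range(M)]


def make_bias_silu_residual(bias, residual):
    """Epilogue: out = silu(acc + bias[j]) + residual[i][j]."""
    return lambda acc, i, j: silu(acc + bias[j]) + residual[i][j]


A = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
bias = [0.5, -0.5]
residual = [[1.0, 1.0], [2.0, 2.0]]
out = gemm_pipelined(A, B, tile_k=2,
                     epilogue=make_bias_silu_residual(bias, residual))
```

The unfused equivalent would write the raw GEMM result to memory and then read it back for each of the three elementwise steps. The fused version touches each output element once.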
CUTLASS 3.0 introduced a new core library called CuTe, short for CUDA Templates. [1][9] Before CuTe, the bookkeeping for how threads and data were laid out in memory was scattered across many separate types and conventions. CuTe replaced that with a single vocabulary type, cute::Layout, that compactly describes the shape and stride of any multidimensional arrangement of threads or data. A Layout paired with a pointer gives a cute::Tensor. [9][10]
The value of this is a small, consistent algebra for layouts. You can compose layouts, partition a tensor across a set of threads, or tile it, all with the same handful of operations. CUTLASS 3.x rebuilt its GEMM hierarchy on top of CuTe, which made the components more composable and made it easier to write fast kernels for hardware like Hopper that has its own quirks around warpgroup-level matrix instructions and the Tensor Memory Accelerator. [4][9]
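The shape-and-stride idea is easy to model outside of C++. The toy classes below borrow the names `Layout` and `Tensor` for readability only. They are a simplified flat model, not the `cute::Layout` API, which additionally supports nested (hierarchical) shapes, compile-time integers, and a full composition algebra.

```python
class Layout:
    """Toy layout: maps a coordinate to a linear offset via shape and stride."""

    def __init__(self, shape, stride):
        assert len(shape) == len(stride)
        self.shape, self.stride = tuple(shape), tuple(stride)

    def __call__(self, *coord):
        assert len(coord) == len(self.shape)
        assert all(0 <= c < s for c, s in zip(coord, self.shape))
        return sum(c * d for c, d in zip(coord, self.stride))


class Tensor:
    """A layout paired with a "pointer" (here: a flat list plus a base offset)."""

    def __init__(self, data, layout, offset=0):
        self.data, self.layout, self.offset = data, layout, offset

    def __getitem__(self, coord):
        return self.data[self.offset + self.layout(*coord)]

    def tile(self, tile_shape, tile_coord):
        """View of one tile: same strides, smaller shape, shifted base."""
        origin = [t * s for t, s in zip(tile_coord, tile_shape)]
        return Tensor(self.data, Layout(tile_shape, self.layout.stride),
                      self.offset + self.layout(*origin))


data = list(range(12))
row_major = Tensor(data, Layout((3, 4), (4, 1)))
col_major = Tensor(data, Layout((3, 4), (1, 3)))
print(row_major[1, 2], col_major[1, 2])      # same coordinate, different offsets
print(row_major.tile((1, 2), (2, 1))[0, 1])  # element (0, 1) of tile (2, 1)
```

Row-major and column-major storage differ only in the stride tuple, and a tile is just another layout over the same buffer. No data is copied in either case.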
The more recent change is the CuTe DSL, added in the CUTLASS 4.x series in 2025. [2][11] This is a Python-native interface for writing the same kind of kernels without touching C++ template metaprogramming. It exposes the identical concepts (layouts, tensors, hardware atoms, and full control over the thread and data hierarchy) but the programming model lives in Python. [2][11] Under the hood it is not just a wrapper that calls precompiled kernels. The DSL parses the Python source and lowers it through a compiler stack so the result is a genuine GPU kernel.
What people care about with the CuTe DSL is that it does not give up much speed. NVIDIA reports that on operations like dense GEMM, grouped GEMM, and fused multi-head attention, the Python kernels run on par with the C++ versions across multiple GPU generations, with some gaps still being worked out on specific shapes. [11] The other win is compile time. C++ template instantiation in CUTLASS is famously slow, and the DSL cuts that dramatically: NVIDIA cites around a 100x compilation speedup for GEMM on Blackwell and 30x to 50x for flash attention, because the DSL uses just-in-time compilation and skips the heavy template machinery. [11] The DSL ships on PyPI as nvidia-cutlass-dsl and was released in public beta, so the interfaces are still changing. [2][12]
CUTLASS targets Tensor Cores from Volta onward, and each new architecture brings new instructions that the library wraps. On Hopper that meant warpgroup matrix instructions (wgmma) and the Tensor Memory Accelerator for bulk asynchronous copies. On Blackwell the library added support for the new fifth-generation Tensor Core instructions (the tcgen05 family) and Cluster Launch Control for dynamic scheduling of work across the GPU. [4][13]
The other axis is numeric precision, and this is where a lot of recent CUTLASS work has gone. The library handles a wide range of types: FP64, FP32, TF32, FP16, BF16, the two 8-bit floating-point formats (e5m2 and e4m3), narrow integers (4-bit and 8-bit, signed and unsigned), and on Blackwell a set of block-scaled 4-bit, 6-bit, and 8-bit formats including NVFP4, MXFP4, MXFP6, and MXFP8. [1][13] Block-scaled FP4 is the headline feature for Blackwell inference, and CUTLASS implements it with the block-scaled Tensor Core instructions that run at roughly twice the throughput of the FP8 path. [13]
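"Block-scaled" means a group of low-precision values shares one scale factor. The sketch below illustrates the idea for a 4-bit E2M1 element type with a power-of-two shared scale, in the style of MXFP4. The scale-selection rule, the nearest-value rounding (ties are not handled as round-to-nearest-even), and the function names are simplifying assumptions, and nothing here reflects how CUTLASS or the hardware packs the data.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_E2M1 = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Return (scale, codes): one power-of-two scale plus one FP4 value per element."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # The largest E2M1 exponent is 2 (4.0 == 2**2), so align the block maximum to it.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    codes = []
    for v in block:
        # Nearest representable magnitude; anything above 6 saturates to 6.
        mag = min(FP4_E2M1, key=lambda q: abs(abs(v) / scale - q))
        codes.append(math.copysign(mag, v))
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]


scale, codes = quantize_block([12.0, -3.0, 1.0, 0.0])
print(scale, codes, dequantize_block(scale, codes))
```

Because the scale is stored once per block, each element costs only four bits, while the block as a whole can still cover a wide dynamic range.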
Mixed precision is a recurring theme. A kernel might multiply FP8 inputs while accumulating in higher precision, or mix operand types so that an 8-bit value multiplies a 4-bit one. [13] CUTLASS also supports structured 2:4 sparsity, where two of every four consecutive weights are pruned to zero, which combines with low-precision GEMM for further speedups. [5]
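The 2:4 pattern can be sketched as a compression step plus a dot product that skips the zeros. In this plain-Python illustration, magnitude-based pruning is one common choice rather than something the format requires, and the function names and the list-based storage are hypothetical. Real kernels keep the kept-value positions as packed 2-bit metadata.

```python
def prune_2_of_4(row):
    """Keep the two largest-magnitude values in every group of four."""
    assert len(row) % 4 == 0
    values, indices = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        keep = sorted(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        values.extend(group[i] for i in keep)
        indices.extend(keep)        # position (0..3) of each kept value
    return values, indices


def sparse_dot(values, indices, dense):
    """Dot product of a compressed 2:4 row with a dense vector."""
    total = 0
    for n, (v, i) in enumerate(zip(values, indices)):
        group_start = (n // 2) * 4  # two kept values per group of four
        total += v * dense[group_start + i]
    return total


row = [1, -5, 2, 0, 3, 3, -4, 1]
values, indices = prune_2_of_4(row)
print(values, indices)
print(sparse_dot(values, indices, [1, 2, 3, 4, 5, 6, 7, 8]))
```

The compressed row stores half as many values and performs half as many multiplies, which is where the speedup over the dense path comes from.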
Beyond a single dense GEMM, the library covers convolution (through implicit GEMM), batched GEMM, and grouped GEMM. Grouped GEMM matters for Mixture of Experts models, where each expert is a separate small matrix multiply with its own shape. Instead of launching many tiny kernels, grouped GEMM packs the whole collection into one launch. [1][5]
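The idea behind grouped GEMM can be sketched as a single flat work list that covers every problem in the group. This is a sequential plain-Python model with hypothetical names. In a real kernel the entries of the work list are distributed across threadblocks by a scheduler, and the single loop below stands in for the single launch.

```python
def grouped_gemm(problems, tile_m=2, tile_n=2):
    """problems: list of (A, B) pairs, each with its own shape.

    Builds one flat list of output tiles across all problems, then
    processes it in a single loop (standing in for one kernel launch).
    """
    outputs, work = [], []
    for p, (A, B) in enumerate(problems):
        M, N = len(A), len(B[0])
        outputs.append([[0] * N for _ in range(M)])
        for m0 in range(0, M, tile_m):
            for n0 in range(0, N, tile_n):
                work.append((p, m0, n0))
    for p, m0, n0 in work:                  # one "launch" covers every problem
        A, B = problems[p]
        C = outputs[p]
        for i in range(m0, min(m0 + tile_m, len(A))):
            for j in range(n0, min(n0 + tile_n, len(B[0]))):
                C[i][j] = sum(A[i][k] * B[k][j] for k in range(len(B)))
    return outputs


expert_0 = ([[1, 2], [3, 4], [5, 6]], [[1], [1]])           # 3x2 times 2x1
expert_1 = ([[1, 0, 2]], [[1, 2, 3, 4], [0, 0, 0, 0], [1, 1, 1, 1]])  # 1x3 times 3x4
print(grouped_gemm([expert_0, expert_1]))
```

Each problem keeps its own shape. Only the scheduling is shared, which is the part that removes the per-launch overhead for many small expert matrices.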
The simplest way to place CUTLASS is between cuBLAS and a from-scratch CUDA kernel. cuBLAS is the fast default for ordinary GEMM with standard shapes and types. You cannot see its source, and you cannot change its behavior. A hand-written kernel gives total control but you build everything yourself and rarely match cuBLAS speed. CUTLASS is the middle path: you get the building blocks and the tiling structure that make cuBLAS fast, in source form, and you assemble them into the kernel you actually need. [1][6] In NVIDIA's own measurements, CUTLASS C++ lands within a few percent of cuBLAS for most data types. [3] So people reach for CUTLASS when the stock library does not fit: unusual data types like FP8, custom fused epilogues, sparse or grouped GEMMs, and the oddly shaped matrix products that show up in attention. [6]
FlashAttention is the clearest example. FlashAttention-2 was rewritten from scratch on top of CUTLASS 3.x and CuTe, which gave it clean access to the hardware features it needed for speed. [4] A 2023 study from Colfax Research walked through implementing FlashAttention-2 on Hopper specifically using the CUTLASS library, exercising the wgmma and TMA instructions through CuTe. [14] The newest version, FlashAttention-4, is written entirely in the CuTe DSL, NVIDIA's Python kernel language. [11][14] The relationship runs both ways: attention is a demanding customer that pushes CUTLASS, and CUTLASS is the substrate that makes the fastest attention kernels possible.
The direct users of CUTLASS are kernel engineers and framework developers rather than people training models day to day. PyTorch ships several internal kernels built on CUTLASS, and inference engines such as vLLM use CUTLASS-based kernels for FP8 quantized and 2:4 sparse matrix multiplies. [5][6] Research groups writing custom attention or quantization kernels lean on it heavily. Most application developers benefit from CUTLASS indirectly, through the frameworks and libraries that sit on top of it.
There are real limits. CUTLASS only targets NVIDIA GPUs, so it does nothing for other accelerators. The C++ template approach has a steep learning curve, the error messages from deep template instantiation can be brutal, and compile times for the C++ path are long (the main reason the CuTe DSL exists). Performance depends on choosing tiling and pipeline parameters that suit both the problem shape and the specific GPU, which takes expertise and tuning rather than coming for free. And because the library tracks new hardware so closely, the leading-edge features (Blackwell FP4, the tcgen05 instructions, the CuTe DSL itself) move fast and parts remain in beta, so interfaces shift between releases. For a developer who only needs ordinary GEMM at good speed, plain cuBLAS is usually the simpler choice. CUTLASS earns its keep when the standard library cannot express what the kernel needs. [2][6]