Flash Attention 3 (FA3) is the third generation of the FlashAttention algorithm, designed to exploit hardware features specific to NVIDIA's Hopper GPU architecture, particularly the H100. Published in July 2024 by Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao (arXiv:2407.08608), it achieves 1.5 to 2.0 times the throughput of Flash Attention 2 on FP16 workloads, reaching up to 740 TFLOPS/s (75% of the H100's theoretical maximum) and nearly 1.2 PFLOPS/s in FP8 precision. There are three core innovations: warp-specialized producer-consumer asynchrony using the Tensor Memory Accelerator (TMA) and warpgroup matrix multiply-accumulate (WGMMA) instructions, a pipelining scheme that overlaps GEMM operations with softmax computation, and block quantization with incoherent processing to preserve FP8 accuracy.
The paper was released alongside an open-source implementation in the Dao-AILab/flash-attention repository. FA3 is a Hopper-only optimization: it uses TMA and WGMMA instructions that do not exist on earlier NVIDIA architectures, including the A100 (Ampere). It represented the largest single-generation leap in attention throughput in the FlashAttention line, more than doubling the GPU utilization achieved by its predecessor on the same hardware.
The attention mechanism in Transformer models computes scaled dot-product attention over queries, keys, and values. For a sequence of length N and head dimension d, naively computing attention requires O(N²) time and O(N²) memory, which becomes prohibitive for long-context workloads.
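For concreteness, the following minimal single-head reference implementation (sketched in PyTorch purely for illustration; shapes are arbitrary) shows where the quadratic cost comes from: the full N×N score matrix is materialized.

```python
import torch

def naive_attention(q, k, v):
    """Single-head scaled dot-product attention with the full score matrix."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (N, N) matrix: quadratic in sequence length
    probs = torch.softmax(scores, dim=-1)
    return probs @ v

N, d = 4096, 128
q, k, v = (torch.randn(N, d) for _ in range(3))
out = naive_attention(q, k, v)  # the 4096 x 4096 score matrix is materialized in full
```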
The original Flash Attention algorithm, published in May 2022 by Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré at Stanford and the University at Buffalo (arXiv:2205.14135), addressed this by making attention IO-aware. Rather than materializing the full N×N attention matrix in high-bandwidth memory (HBM), it uses tiling to load blocks of queries, keys, and values into the much faster on-chip SRAM, computes partial attention scores there, and accumulates results using an online softmax algorithm. This reduced memory usage from quadratic to linear in sequence length and yielded 2 to 4 times wall-clock speedup over standard attention implementations on A100 GPUs.
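The online-softmax recurrence can be sketched in a few lines of PyTorch. The block size and shapes below are illustrative, and the loop runs on the host rather than inside a fused kernel, but the arithmetic is the same update: keep a running row-wise maximum and denominator, and rescale previously accumulated output whenever the maximum grows.

```python
import torch

def tiled_attention(q, k, v, block_n=128):
    """Single-head tiled attention with an online-softmax accumulator."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)                    # running (unnormalized) output
    row_max = torch.full((n, 1), float("-inf"))  # running row-wise score maximum
    row_sum = torch.zeros(n, 1)                  # running softmax denominator
    for start in range(0, n, block_n):
        k_blk, v_blk = k[start:start + block_n], v[start:start + block_n]
        s = (q @ k_blk.T) * scale                # scores for this K/V tile only
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)  # rescale previous partial results
        p = torch.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1024, 64) for _ in range(3))
reference = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), reference, atol=1e-4))  # True
```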
Flash Attention 2, also by Tri Dao (arXiv:2307.08691), improved work partitioning across GPU threads. The key changes were reducing non-matmul operations in the inner loop, restructuring the parallelism to also span the sequence length dimension, and better distributing work between thread blocks to reduce idle warps. On A100 GPUs, FA2 achieved around 50 to 73% of peak FLOP utilization. It became the de facto attention backend in PyTorch, Hugging Face Transformers, vLLM, and most major language model training frameworks.
When NVIDIA released the H100 in 2022 based on the Hopper microarchitecture, it introduced hardware features that FA2 was not designed to use. Running FA2 on H100 achieved only about 35% of theoretical peak FLOPs, leaving nearly two-thirds of the card's potential idle. The H100 brought two key new units: the Tensor Memory Accelerator (TMA) for asynchronous memory transfers, and warpgroup matrix multiply-accumulate (WGMMA) instructions running at substantially higher throughput than the older mma.sync instruction available on Ampere. Both required restructuring the kernel at a level FA2 had not attempted.
The utilization gap had a straightforward explanation. FA2 was written to target Ampere hardware, where memory movement was handled with cp.async instructions and matrix multiplication with mma.sync. On Hopper, NVIDIA replaced these with TMA and WGMMA respectively. Using the old instructions on new hardware still worked (backward compatibility was preserved), but left the new hardware's throughput advantage inaccessible. Writing kernels that actually used TMA and WGMMA required substantially more complex code: TMA introduces a new descriptor-based API, WGMMA operates at a 4-warp granularity rather than a single-warp granularity, and both require careful synchronization patterns that differ from the Ampere model. The FA3 project's access to NVIDIA's CUTLASS team members was essential for navigating these interfaces correctly.
The FA3 paper lists six authors:
| Author | Affiliation |
|---|---|
| Jay Shah | Colfax Research |
| Ganesh Bikshandi | Colfax Research |
| Ying Zhang | Meta |
| Vijay Thakkar | NVIDIA |
| Pradeep Ramani | NVIDIA |
| Tri Dao | Princeton University, Together AI |
The presence of two NVIDIA researchers (Thakkar and Ramani, both members of the CUTLASS team) was significant: it gave the team direct access to CUTLASS primitives and internal documentation for the WGMMA and TMA interfaces, which are not fully documented in publicly available CUDA guides.
The paper was submitted to arXiv on July 11, 2024, with a minor revision (v2) the following day. A blog post from Together AI and a PyTorch blog post were published simultaneously. The GPU MODE conference talk by Jay Shah provides a detailed walkthrough of the CUTLASS integration, available from Colfax Research's website.
Tri Dao is an assistant professor at Princeton University and a co-founder and chief scientist at Together AI. He is also known as a co-author, with Albert Gu, of Mamba, the state-space model architecture that competes with Transformers on certain long-sequence tasks.
The TMA is a dedicated hardware unit on Hopper GPUs that transfers multidimensional tensors between global memory (HBM) and shared memory (SMEM) asynchronously. Unlike earlier CUDA cp.async instructions, TMA handles all index calculation and out-of-bounds predication in hardware. This frees the issuing warp from performing address arithmetic, reducing register pressure substantially and allowing the software pipeline to use larger tile sizes.
In FA3, the producer warp issues TMA load instructions for query, key, and value blocks and then immediately yields its registers to the pipeline rather than waiting for transfers to complete. The SMEM receives these tiles while the consumer warpgroup proceeds with matmul on previously loaded tiles.
TMA descriptors specify the tensor layout, data type, tiling dimensions, and memory address in a compact structure passed to the instruction. This abstraction lets the hardware handle multi-dimensional strided access patterns that would otherwise require complex software-generated address sequences. For attention, where Q is accessed in row-major order and K in a transposed layout, TMA can perform the necessary layout transformation in hardware during the load.
WGMMA is the Hopper-specific instruction for asynchronous matrix multiplication at the warpgroup level (a warpgroup is four contiguous warps, 128 threads). On H100 SXM5, WGMMA achieves 989 TFLOPS for FP16 and 1978 TFLOPS for FP8, roughly twice the throughput of the mma.sync instruction used on Ampere.
The asynchronous nature of WGMMA matters: the instruction is issued and the GPU continues executing subsequent instructions; the result in registers is not guaranteed to be available until an explicit synchronization barrier. FA3 exploits this by issuing future-iteration GEMMs before waiting for current-iteration results, enabling overlapped execution.
WGMMA instructions can source operand A from registers or shared memory, require operand B to reside in shared memory, and accumulate the output (the C/D accumulator) in register files. This layout is compatible with TMA's output: data arrives in shared memory via TMA, then WGMMA reads it from there. The combination of TMA feeding shared memory and WGMMA consuming from it is the foundation of the FA3 data pipeline.
Hopper GPUs support FP8 (8-bit floating point, both E4M3 and E5M2 formats) as native WGMMA operand types, doubling Tensor Core throughput relative to FP16 operations. The catch is precision loss from the narrow dynamic range of FP8, especially in the presence of outlier activations. FA3 addresses this with the incoherent processing technique described below.
FP8 support in Hopper is primarily an inference optimization. The attention forward pass is the primary target because most inference pipelines run the attention computation in lower precision to maximize throughput. The backward pass for training still benefits from FP16 or BF16 for stability reasons.
FA3's tiling and pipelining decisions are shaped by the H100's memory hierarchy:

| Level | Capacity | Bandwidth |
|---|---|---|
| Global memory (HBM3) | 80 GiB | 3.35 TB/s |
| L2 cache | 50 MiB | 12 TB/s |
| Shared memory per SM | 228 KiB | 31 TB/s (aggregate across the GPU) |
| Register file per SM | 256 KiB | -- |
The central structural change in FA3 is warp specialization with a producer-consumer split. Within each thread block, warps are divided into two roles:
Producer warps are assigned exclusively to data movement. They call setmaxnreg to deallocate most of their registers (freeing those resources for consumer warps on the same SM), then issue TMA loads for Q, K, and V tiles into a circular shared-memory buffer. Each completed load signals a barrier object in SMEM that the consumer waits on.
Consumer warpgroups reclaim those freed registers (via setmaxnreg in the other direction) and perform computation. After waiting on the barrier indicating a tile has arrived in SMEM, they issue WGMMA instructions against it and proceed to the next tile without stalling the producer.
This split means memory transfers and compute operations proceed simultaneously rather than sequentially. Earlier GPU architectures had limited hardware support for this pattern; on Hopper, TMA's asynchronous interface makes it practical without complex synchronization overhead.
The s-stage circular SMEM buffer (typically two or three stages) allows the producer to be s iterations ahead of the consumer, providing enough slack to hide global-memory latency even when the GPU's memory system is under load.
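The scheme can be pictured with an ordinary bounded queue. The host-side Python sketch below is only an analogy for the device-side pipeline (threads stand in for warps, a queue of depth s stands in for the circular SMEM buffer, and the tile names are made up), but it shows why the producer can run ahead without overrunning the consumer.

```python
import threading
import queue

NUM_TILES = 8   # K/V tiles the consumer iterates over (illustrative)
STAGES = 3      # s: producer may run up to s tiles ahead of the consumer

smem = queue.Queue(maxsize=STAGES)  # bounded queue stands in for the s-stage circular SMEM buffer

def producer():
    for j in range(NUM_TILES):
        smem.put(f"KV tile {j}")    # stands in for an asynchronous TMA load landing in SMEM
        print(f"producer: loaded tile {j}")

def consumer():
    for _ in range(NUM_TILES):
        tile = smem.get()           # stands in for waiting on the tile-arrival barrier
        print(f"consumer: WGMMA + softmax on {tile}")

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```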
Register management is central to the design. Each SM on H100 has 256 KiB of register file shared among all resident warps. In a naive kernel that assigns both data movement and computation to the same warps, each warp needs registers for both roles, limiting the tile size and pipeline depth. With warp specialization, the producer warp calls setmaxnreg 40 to reduce its register allocation to a minimum, freeing those registers for the consumer warpgroup. The consumer warpgroup can then call setmaxnreg 232 to claim a large register budget, supporting the large WGMMA tile sizes and the 2-stage pipeline's intermediate buffers. This register reallocation happens once, near the start of kernel execution, and relies on the warpgroup-level register-redistribution capability introduced with Hopper.
For the backward pass, FA3 applies the same warp-specialization strategy to the three GEMMs required to compute gradients with respect to Q, K, and V. The backward pass achieves 1.5 to 1.75 times the throughput of FA2 backward on H100.
The paper measures the contribution of each technique on a representative configuration (batch size 4, sequence length 8,448, 16 heads, head dimension 128) on H100 SXM5:
| Configuration | Latency (ms) | TFLOPS/s |
|---|---|---|
| No warp specialization | 4.105 | 570 |
| Warp specialization only | 4.021 | 582 |
| Full FA3 (warp spec + GEMM-softmax pipelining) | 3.538 | 661 |
The combined gain from both techniques is roughly 16% over the no-specialization baseline, and both contribute meaningfully.
A second bottleneck is the softmax computation itself. On H100, FP16 matmul runs at 989 TFLOPS while the special-function unit (SFU) responsible for the exponential in softmax runs at only 3.9 TFLOPS, a 256-fold throughput ratio. In a sequential implementation, softmax on a block consumes roughly 50% of total kernel time despite performing very little arithmetic work. The fix is to overlap softmax with the next GEMM iteration so the SFU and Tensor Cores run concurrently.
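A back-of-the-envelope calculation with the throughput figures above makes the imbalance concrete: for a single 128×128 score tile at head dimension 128, the exponentials take roughly as long as the tile's QKᵀ GEMM, which is why an un-overlapped softmax eats about half the kernel time.

```python
BLOCK_M, BLOCK_N, HEAD_DIM = 128, 128, 128
GEMM_FLOPS_PER_S = 989e12  # H100 FP16 Tensor Core throughput
SFU_OPS_PER_S = 3.9e12     # H100 special-function (exp) throughput

gemm_flops = 2 * BLOCK_M * BLOCK_N * HEAD_DIM  # multiply-adds in the QK^T tile GEMM
exp_ops = BLOCK_M * BLOCK_N                    # one exponential per score entry

print(f"QK^T GEMM time:    {gemm_flops / GEMM_FLOPS_PER_S * 1e9:.1f} ns")  # ~4.2 ns
print(f"exponentials time: {exp_ops / SFU_OPS_PER_S * 1e9:.1f} ns")        # ~4.2 ns
```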
FA3 implements two complementary strategies:
When two consumer warpgroups share a thread block, they alternate roles: while warpgroup 1 performs GEMM operations, warpgroup 2 runs softmax on its current scores, and vice versa. Explicit bar.sync barriers enforce this alternating schedule. The two warpgroups share the same HBM tiles (loaded once by the producer), so this doubles the compute work done per memory fetch.
Pingpong scheduling improves forward-pass throughput from approximately 570 TFLOPS (no specialization baseline) to around 620 TFLOPS.
Even within a single warpgroup, the algorithm can overlap operations across loop iterations. The key observation is that WGMMA is asynchronous: issuing WGMMA(Q, K_next) does not stall; the hardware executes it while the software proceeds. The 2-stage pipeline exploits this in each main-loop iteration:

1. Issue the asynchronous WGMMA computing the next iteration's score block S = QK_nextᵀ.
2. Run softmax (running maximum, exponentials, rescaling) on the current iteration's scores, which already sit in registers.
3. Issue the WGMMA for the output update O += PV with the current iteration's probabilities.
Step 3 overlaps with step 1 in hardware: the Tensor Core executes the O-GEMM while the WGMMA unit for QK is still in flight. This intra-warpgroup pipelining raises throughput to 640 to 660 TFLOPS, though at the cost of higher register pressure from buffering intermediate matrices.
A 3-stage variant is described in the paper's appendix and overlaps further, but register constraints make it impractical for typical tile sizes.
The two strategies are complementary and can be combined. In the full FA3 configuration with both warp specialization and 2-stage intra-warpgroup pipelining, the ablation study shows a final throughput of 661 TFLOPS/s on the tested configuration, compared to 570 TFLOPS/s without either technique.
It is worth noting that the 256-fold throughput gap between matrix operations and special functions is not a bug but a consequence of hardware design priorities. Tensor Cores occupy large silicon area and deliver enormous throughput because GEMMs dominate transformer training and inference. The SFU is sized to handle transcendental functions (exp, log, sin, cos) used sparingly throughout most workloads. The attention mechanism is unusual in that its exponential is on the critical path and occurs inside a tight loop. This makes attention exceptionally sensitive to SFU latency, and the FA3 overlap approach directly targets this sensitivity.
Quantizing activations to FP8 requires scaling factors, applied either per tensor or at finer granularity. FA3 uses per-block quantization, maintaining one scaling factor for each tile of K, V, and Q rather than a single global value. This aligns naturally with FlashAttention's block-based iteration: each tile is quantized independently before being loaded into SMEM. The score matrix S also receives per-block scaling at essentially zero cost, since the scale can be fused with the softmax rescaling step.
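A per-block quantizer is only a few lines. The sketch below uses PyTorch's float8_e4m3fn dtype and a 128×128 tile purely for illustration (it is not FA3's kernel code, and assumes a recent PyTorch build with FP8 dtypes): each tile gets one scale derived from its own maximum magnitude.

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def quantize_block(x):
    """Quantize one tile with a single per-block scale (illustrative sketch)."""
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def dequantize_block(x_fp8, scale):
    return x_fp8.to(torch.float32) * scale

tile = torch.randn(128, 128) * 5.0  # one Q/K/V tile with a made-up value range
tile_fp8, scale = quantize_block(tile)
rmse = (dequantize_block(tile_fp8, scale) - tile).pow(2).mean().sqrt()
print(f"per-block scale {scale.item():.4f}, RMSE {rmse.item():.2e}")
```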
Activation outliers in large language models concentrate high values in a few dimensions, causing disproportionate quantization error when the FP8 dynamic range is allocated to represent those extremes. FA3 addresses this by multiplying Q and K by a random orthogonal matrix M before quantization:
(QM)(KM)ᵀ = QKᵀ
Because M is orthogonal, attention output is unchanged. But the multiplication distributes the outlier energy across all dimensions, making the resulting activations more uniform and thus better suited to per-block FP8 quantization.
In practice, M is the product of a diagonal matrix with random ±1 entries and a Hadamard matrix, giving a complexity of O(d log d) rather than O(d²) for a general orthogonal matrix. The Hadamard transform can also be fused with rotary position embeddings at no extra cost, since both apply a dimension-wise linear transform before the attention operation.
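The algebra is easy to check numerically. The sketch below (plain PyTorch, illustrative sizes, not the fused kernel) builds M as a random sign diagonal times a Hadamard matrix, plants an outlier dimension in Q, and verifies that QKᵀ is unchanged while the outlier's magnitude is spread across dimensions; the tolerance only absorbs float32 rounding.

```python
import torch

def hadamard_matrix(d):
    """Sylvester construction; d must be a power of two."""
    h = torch.ones(1, 1)
    while h.shape[0] < d:
        h = torch.cat([torch.cat([h, h], dim=1), torch.cat([h, -h], dim=1)], dim=0)
    return h

d = 128
signs = torch.randint(0, 2, (d,)).float() * 2 - 1       # random +/-1 diagonal entries
M = torch.diag(signs) @ hadamard_matrix(d) / d ** 0.5   # orthogonal: M @ M.T == I

Q, K = torch.randn(64, d), torch.randn(64, d)
Q[:, 0] *= 50.0  # plant an outlier feature dimension, as seen in LLM activations

print(torch.allclose(Q @ K.T, (Q @ M) @ (K @ M).T, atol=1e-3))  # True: scores unchanged
print(Q.abs().max().item(), (Q @ M).abs().max().item())         # outlier magnitude spread out
```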
FP8 WGMMA instructions require inputs in k-major (column-major) layout, but V tensors are typically stored in head-dimension contiguous format. FA3 performs an in-kernel transposition of V in the producer warpgroup using LDSM/STSM (shared memory load/store matrix) instructions, overlapping this with consumer GEMMs so no latency is added to the critical path.
Register layout mismatches between FP32 accumulators and FP8 operands are resolved with byte permute instructions that rearrange accumulator data before it is fed as the next WGMMA operand.
The paper reports FP8 accuracy as root-mean-square error (RMSE) against a higher-precision reference:

| Method | RMSE |
|---|---|
| Baseline FP8 (per-tensor quantization) | 2.4 × 10⁻² |
| FA3 block quantization only | 9.3 × 10⁻³ |
| FA3 incoherent processing only | 2.4 × 10⁻² |
| FA3 block quantization + incoherent processing | 9.1 × 10⁻³ |
The combined technique achieves a 2.6 times reduction in RMSE versus the baseline. Block quantization does most of the work; incoherent processing adds a small additional benefit on distributions with heavy outliers.
For FP16, FA3 numerical accuracy is essentially identical to FA2:
| Method | RMSE (outlier distribution) |
|---|---|
| Standard attention | 3.2 × 10⁻⁴ |
| Flash Attention 2 | 1.9 × 10⁻⁴ |
| Flash Attention 3 | 1.9 × 10⁻⁴ |
FP16 forward-pass throughput on H100 scales with sequence length (approximate values):

| Sequence length | FA2 TFLOPS/s | FA3 TFLOPS/s | Speedup |
|---|---|---|---|
| 512 | ~180 | ~270 | 1.5x |
| 1,024 | ~240 | ~400 | 1.7x |
| 2,048 | ~290 | ~510 | 1.75x |
| 4,096 | ~330 | ~600 | 1.8x |
| 8,192 | ~350 | ~660 | 1.9x |
| 16,384 | ~360 | ~700 | 1.95x |
Peak FP16 throughput reaches 740 TFLOPS/s, representing 75% of the H100's theoretical FP16 maximum of 989 TFLOPS/s. The speedup is larger at longer sequence lengths because the tile-based computation becomes a larger fraction of total work at long contexts, amortizing per-kernel launch overhead.
FA3 also improves the backward pass by 1.5 to 1.75 times compared to FA2, using the same warp-specialization and asynchrony techniques applied to the backward-pass GEMMs.
The FP8 kernel reaches close to 1.2 PFLOPS/s (1,200 TFLOPS/s), which is approximately 60% of the H100's peak FP8 throughput of 1,978 TFLOPS/s. For large sequences without causal masking, FA3's FP8 matches or exceeds cuDNN. For small sequences with causal masking, FA3 FP8 currently falls behind cuDNN due to the absence of a persistent kernel with load balancing (a noted limitation acknowledged in the paper).
| Metric | Flash Attention 2 | Flash Attention 3 |
|---|---|---|
| H100 utilization (FP16) | ~35% | ~75% |
| Peak FP16 throughput | ~350 TFLOPS/s | ~740 TFLOPS/s |
| Peak FP8 throughput | N/A | ~1,200 TFLOPS/s |
| FP16 forward speedup | 1x (baseline) | 1.5 to 2.0x |
| FP16 backward speedup | 1x (baseline) | 1.5 to 1.75x |
| FP8 support | No | Yes (forward pass) |
| Native Hopper TMA/WGMMA | No | Yes |
| Minimum GPU | Ampere | Hopper (H100/H800) |
FA3 supports the same attention variants as FA2, implemented within the same tiled kernel structure:
| Variant | Description | FA3 Support |
|---|---|---|
| Multi-head attention (MHA) | Full separate Q, K, V per head | Yes |
| Multi-query attention (MQA) | Single K/V shared across all Q heads | Yes |
| Grouped-query attention (GQA) | Multiple Q heads per K/V group | Yes |
| Causal (autoregressive) masking | Upper triangular mask | Yes |
| Sliding window attention | Windowed local attention mask | Partial |
| Variable-length sequences | Packed batches of varying lengths | Yes |
| FP16 forward | Full precision forward pass | Yes |
| BF16 forward | Bfloat16 forward pass | Yes |
| FP8 forward | 8-bit forward pass | Yes |
| FP16/BF16 backward | Backward pass for training | Yes |
| FP8 backward | 8-bit backward pass for training | Described in paper; not fully optimized in initial release |
GQA is handled by adjusting the tensor indexing: when loading K and V blocks, the kernel maps each Q head to its corresponding K/V group without duplicating the K/V tensors in memory. This allows FA3 to support modern architectures like LLaMA 3, Mistral, and Qwen that use GQA for inference efficiency.
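The head mapping amounts to an integer division. The sketch below uses a hypothetical 32-query-head, 8-KV-head configuration (LLaMA-3-8B-like) to show the indexing; the kernel applies the same mapping when choosing which K/V blocks to load.

```python
# Illustrative GQA head mapping; configuration numbers are hypothetical.
num_q_heads, num_kv_heads = 32, 8
group_size = num_q_heads // num_kv_heads  # 4 query heads share each K/V head

def kv_head_for(q_head):
    return q_head // group_size

print([kv_head_for(h) for h in range(num_q_heads)])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, ...]: K/V blocks are indexed per group, never duplicated
```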
Variable-length sequence support uses a packed-sequence format where all sequences in a batch are concatenated into a single tensor, with a separate array of cumulative lengths tracking boundaries. The tiled attention loop respects these boundaries, applying causal masking only within each sequence and never mixing attention between sequences.
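A small sketch of the packed layout (sequence lengths and tensor shapes are made up) shows the cu_seqlens convention: an int32 array of cumulative lengths whose consecutive entries bracket each sequence.

```python
from itertools import accumulate

import torch

seq_lens = [5, 3, 7]  # three sequences packed into one batch, no padding rows
cu_seqlens = torch.tensor([0] + list(accumulate(seq_lens)), dtype=torch.int32)
# tensor([0, 5, 8, 15]): sequence i occupies rows cu_seqlens[i]:cu_seqlens[i+1]

total_tokens, num_heads, head_dim = sum(seq_lens), 16, 128
q_packed = torch.randn(total_tokens, num_heads, head_dim)  # (total_tokens, heads, head_dim)

for i in range(len(seq_lens)):
    start, end = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
    print(f"sequence {i}: rows {start}..{end - 1}")  # attention never crosses these boundaries
```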
The FA3 code was released as a beta under the existing Dao-AILab/flash-attention GitHub repository, in a dedicated hopper/ subdirectory. Installation requires CUDA 12.3 or later (CUDA 12.8 recommended) and an H100 or H800 GPU. The Python interface is imported as flash_attn_interface.
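A minimal usage sketch follows, assuming an H100 with a CUDA-enabled PyTorch and that the beta interface mirrors FA2's flash_attn_func call signature; some builds return the log-sum-exp alongside the output, so the unpacking below is deliberately defensive.

```python
import torch
from flash_attn_interface import flash_attn_func  # beta hopper build; API may change

batch, seqlen, num_heads, head_dim = 2, 4096, 16, 128
q = torch.randn(batch, seqlen, num_heads, head_dim, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

result = flash_attn_func(q, k, v, causal=True)
out = result[0] if isinstance(result, tuple) else result  # some builds also return the log-sum-exp
print(out.shape)  # torch.Size([2, 4096, 16, 128])
```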
Tri Dao and Together AI maintain a companion repository at togethercomputer/flash-attention-3.
Since FA3's release in July 2024, it has been integrated or is in the process of integration across major inference and training frameworks:
| Framework | Status |
|---|---|
| PyTorch (scaled_dot_product_attention) | FA3 integration planned/in progress |
| SGLang | Supports FA3 as an attention backend |
| vLLM | FA3 support under discussion (GitHub issue #11372) |
| Hugging Face Transformers | Integration under discussion (GitHub issue #33373) |
| NVIDIA Megatron-LM | FA integration present; FA3 extension ongoing |
| Microsoft DeepSpeed | FlashAttention integrated in inference engine |
FA2 remains the more widely deployed version as of mid-2025, given the broader installed base of Ampere GPUs. FA3 adoption accelerates as H100 deployments scale.
NVIDIA's own cuDNN library includes highly optimized attention kernels for Hopper. On large FP16 sequences (length >= 1,024), FA3 matches or exceeds cuDNN throughput. For FP8 at large sequence lengths without causal masking, FA3 is competitive. FA3 has an advantage in openness: it is available as open source under a permissive license, while cuDNN is a closed binary.
PyTorch's torch.nn.functional.scaled_dot_product_attention function (SDPA) was redesigned in PyTorch 2.0 to dispatch to multiple backends, including FlashAttention. The SDPA interface selects the most efficient available kernel based on input dtype, device, and attention mask type at call time. FA2 is available as one such backend. Full integration of FA3 into the PyTorch SDPA dispatch path was listed as a near-term goal in both the FA3 paper and the PyTorch blog post at the time of FA3's release.
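With FA2 as the backend, the dispatch can be pinned explicitly today. The sketch below uses the torch.nn.attention context manager available from PyTorch 2.3 onward (shapes are illustrative); once FA3 lands in the dispatch path, the same call sites would pick it up without code changes.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3 API

q = torch.randn(2, 16, 4096, 128, dtype=torch.float16, device="cuda")  # (batch, heads, seq, head_dim)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Pin SDPA to the FlashAttention backend instead of letting the dispatcher choose.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```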
Long-context language models depend heavily on fast attention kernels because the number of attention operations grows quadratically with context length. FA3's roughly 2x speedup on H100 compared to FA2 translates directly into either faster generation at a given context length or support for longer contexts at the same latency budget. Models supporting 128K or 1M token contexts, such as GPT-4 Turbo and Llama 3 extended-context variants, benefit from any reduction in per-attention-block latency. FA3's FP8 support is particularly relevant here: at 128K context, the attention block consumes a substantial fraction of total inference compute, and halving precision roughly halves that cost while FA3's incoherent processing keeps accuracy degradation small.
FA3 uses WGMMA and TMA instructions that are exclusive to Hopper (compute capability 9.0). It does not run on Ampere (A100, compute capability 8.0) or Ada Lovelace (RTX 4000 series, compute capability 8.9). Organizations running A100 fleets must continue to use FA2.
The FP8 optimization in the paper covers only the forward pass. The backward pass algorithm for FP8 is described in the paper's appendix but was not fully optimized in the initial release. FP8 training therefore requires a separate implementation path or falls back to BF16/FP16 for the backward pass.
For small sequences with causal masking, FA3's FP8 kernel underperforms cuDNN because it lacks a persistent kernel design with dynamic load balancing. This is noted explicitly in the paper as future work.
The intra-warpgroup 2-stage pipeline requires buffering intermediate GEMM results across iterations, consuming extra registers per thread. For large tile sizes, this risks register spilling to local memory (which routes through HBM), partially undoing the pipelining benefit. The 3-stage variant described in the paper appendix is more susceptible to this problem.
Initial releases of FA3 do not support NVIDIA Blackwell GPUs, including the GB200 and B200. Blackwell replaces WGMMA with a new instruction set called TCGEN05, which has a different programming model. Attempts to run FA3 on Blackwell (sm_100) raise CUDA launch errors. This limitation motivated the subsequent development of Flash Attention 4.
Flash Attention 4 (FA4), announced in 2025, is a rewrite in CuTeDSL (a Python domain-specific language built on NVIDIA's CUTLASS 3.x framework for writing GPU kernels). FA4 targets both Hopper (H100) and NVIDIA Blackwell (B200), using NVIDIA's TCGEN05 tensor core instructions on Blackwell in place of WGMMA. The architectural shift is significant: Blackwell removed WGMMA and replaced it with TCGEN05 instructions that operate on the tensor memory (TMEM) subsystem rather than shared memory, requiring a fundamentally different kernel structure.
FA4 is notable for being implemented primarily in Python using CuTeDSL's JIT compilation model, rather than in C++ as FA3 was. This reduces the engineering burden of writing correct kernel code while retaining the ability to generate architecture-specific CUDA at compile time.
Early benchmarks on B200 showed FA4 reaching 1,605 TFLOPS/s on BF16, about 71% of the B200's theoretical maximum. On H100, FA4 also showed competitive performance with FA3. FA3 is expected to remain the deployed standard on H100 clusters until FA4's implementation matures and is integrated into downstream frameworks. The two versions cover complementary hardware generations, with FA3 being the production-grade solution for Hopper and FA4 being the successor for Blackwell and later.