Flash Attention 3
Last reviewed
May 17, 2026
Sources
16 citations
Review status
Source-backed
Revision
v2 · 5,819 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 17, 2026
Sources
16 citations
Review status
Source-backed
Revision
v2 · 5,819 words
Add missing citations, update stale details, or suggest a clearer explanation.
Flash Attention 3 (FA3) is the third generation of the FlashAttention algorithm, designed to exploit hardware features specific to NVIDIA's Hopper GPU architecture, particularly the H100. Published in July 2024 by Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao (arXiv:2407.08608), it achieves 1.5 to 2.0 times faster throughput than Flash Attention 2 on FP16 workloads, reaching up to 740 TFLOPS/s (75% of the H100's theoretical maximum), and nearly 1.2 PFLOPS/s in FP8 precision. The core innovations are three: warp-specialized producer-consumer asynchrony using the Tensor Memory Accelerator (TMA) and warpgroup matrix multiply-accumulate (WGMMA) instructions, a pipelining scheme that overlaps GEMM operations with softmax computation, and block quantization with incoherent processing for FP8 accuracy preservation.
The paper was released alongside an open-source implementation in the Dao-AILab/flash-attention repository. FA3 is a Hopper-only optimization: it uses TMA and WGMMA instructions that do not exist on earlier NVIDIA architectures, including the A100 (Ampere). It represented the largest single-generation leap in attention throughput since FA2, more than doubling the GPU utilization achieved by its predecessor on the same hardware. Updates published through 2025 raised peak BF16 throughput to roughly 840 TFLOPS/s (about 85% of theoretical peak) and FP8 throughput to roughly 1.3 PFLOPS/s, via tighter kernel scheduling and a persistent-kernel path for short sequences.
The attention mechanism in Transformer models computes scaled dot-product attention over queries, keys, and values. For a sequence of length N and head dimension d, naively computing attention requires O(N²) time and O(N²) memory, which becomes prohibitive for long-context workloads.
The original Flash Attention algorithm, published in May 2022 by Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré at Stanford and the University at Buffalo (arXiv:2205.14135), addressed this by making attention IO-aware. Rather than materializing the full N×N attention matrix in high-bandwidth memory (HBM), it uses tiling to load blocks of queries, keys, and values into the much faster on-chip SRAM, computes partial attention scores there, and accumulates results using an online softmax algorithm. This reduced memory usage from quadratic to linear in sequence length and yielded 2 to 4 times wall-clock speedup over standard attention implementations on A100 GPUs.
Flash Attention 2, also by Tri Dao (arXiv:2307.08691), improved work partitioning across GPU threads. The key changes were reducing non-matmul operations in the inner loop, restructuring the parallelism to also span the sequence length dimension, and better distributing work between thread blocks to reduce idle warps. On A100 GPUs, FA2 achieved around 50 to 73% of peak FLOP utilization. It became the de facto attention backend in PyTorch, Hugging Face Transformers, vLLM, and most major language model training frameworks.
When NVIDIA released the H100 in 2022 based on the Hopper microarchitecture, it introduced hardware features that FA2 was not designed to use. Running FA2 on H100 achieved only about 35% of theoretical peak FLOPs, leaving roughly half the card's potential idle. The H100 brought two key new units: the Tensor Memory Accelerator (TMA) for asynchronous memory transfers, and warpgroup matrix multiply-accumulate (WGMMA) instructions running at substantially higher throughput than the older mma.sync instruction available on Ampere. Both required restructuring the kernel at a level FA2 had not attempted.
The utilization gap had a straightforward explanation. FA2 was written to target Ampere hardware, where memory movement was handled with cp.async instructions and matrix multiplication with mma.sync. On Hopper, NVIDIA replaced these with TMA and WGMMA respectively. Using the old instructions on new hardware still worked (backward compatibility was preserved), but left the new hardware's throughput advantage inaccessible. Writing kernels that actually used TMA and WGMMA required substantially more complex code: TMA introduces a new descriptor-based API, WGMMA operates at a 4-warp granularity rather than a single-warp granularity, and both require careful synchronization patterns that differ from the Ampere model. The FA3 project's access to NVIDIA's CUTLASS team members was essential for navigating these interfaces correctly.
The FA3 paper lists six authors from four institutions:
| Author | Affiliation |
|---|---|
| Jay Shah | Colfax Research |
| Ganesh Bikshandi | Colfax Research |
| Ying Zhang | Meta |
| Vijay Thakkar | NVIDIA |
| Pradeep Ramani | NVIDIA |
| Tri Dao | Princeton University, Together AI |
The presence of two NVIDIA researchers (Thakkar and Ramani, both members of the CUTLASS team) was significant: it gave the team direct access to CUTLASS primitives and internal documentation for the WGMMA and TMA interfaces, which are not fully documented in publicly available CUDA guides.
The paper was submitted to arXiv on July 11, 2024, with a minor revision (v2) the following day. A blog post from Together AI and a PyTorch blog post were published simultaneously. The GPU MODE conference talk by Jay Shah provides a detailed walkthrough of the CUTLASS integration, available from Colfax Research's website. The paper was subsequently accepted as a spotlight at NeurIPS 2024, with the camera-ready version incorporating reviewer feedback about FP8 calibration and the persistent-kernel limitation.
Tri Dao is an assistant professor at Princeton University and a co-founder and chief scientist at Together AI. He is also known as the lead author of Mamba, the state-space model architecture that competes with Transformers for certain long-sequence tasks.
The TMA is a dedicated hardware unit on Hopper GPUs that transfers multidimensional tensors between global memory (HBM) and shared memory (SMEM) asynchronously. Unlike earlier CUDA cp.async instructions, TMA handles all index calculation and out-of-bounds predication in hardware. This frees the issuing warp from performing address arithmetic, reducing register pressure substantially and allowing the software pipeline to use larger tile sizes.
In FA3, the producer warp issues TMA load instructions for query, key, and value blocks and then immediately yields its registers to the pipeline rather than waiting for transfers to complete. The SMEM receives these tiles while the consumer warpgroup proceeds with matmul on previously loaded tiles.
TMA descriptors specify the tensor layout, data type, tiling dimensions, and memory address in a compact structure passed to the instruction. This abstraction lets the hardware handle multi-dimensional strided access patterns that would otherwise require complex software-generated address sequences. For attention, where Q is accessed in row-major order and K in a transposed layout, TMA can perform the necessary layout transformation in hardware during the load.
WGMMA is the Hopper-specific instruction for asynchronous matrix multiplication at the warpgroup level (a warpgroup is four contiguous warps, 128 threads). On H100 SXM5, WGMMA achieves 989 TFLOPS for FP16 and 1978 TFLOPS for FP8, roughly twice the throughput of the mma.sync instruction used on Ampere.
The asynchronous nature of WGMMA matters: the instruction is issued and the GPU continues executing subsequent instructions; the result in registers is not guaranteed to be available until an explicit synchronization barrier. FA3 exploits this by issuing future-iteration GEMMs before waiting for current-iteration results, enabling overlapped execution.
WGMMA instructions require that input operands reside in shared memory (for A and B matrices) and that outputs accumulate in register files (for the C/D accumulator). This layout is compatible with TMA's output: data arrives in shared memory via TMA, then WGMMA reads it from there. The combination of TMA feeding shared memory and WGMMA reading from it is the foundation of the FA3 data pipeline.
Hopper GPUs support FP8 (8-bit floating point, both E4M3 and E5M2 formats) as native WGMMA operand types, doubling Tensor Core throughput relative to FP16 operations. The catch is precision loss from the narrow dynamic range of FP8, especially in the presence of outlier activations. FA3 addresses this with the incoherent processing technique described below.
FP8 support in Hopper is primarily an inference optimization. The attention forward pass is the primary target because most inference pipelines run the attention computation in lower precision to maximize throughput. The backward pass for training still benefits from FP16 or BF16 for stability reasons.
| Level | Capacity | Bandwidth |
|---|---|---|
| Global memory (HBM3) | 80 GiB | 3.35 TB/s |
| L2 cache | 50 MiB | 12 TB/s |
| Shared memory per SM | 228 KiB | 31 TB/s (per GPU aggregate) |
| Register file per SM | 256 KB | -- |
The central structural change in FA3 is warp specialization with a producer-consumer split. Within each thread block, warps are divided into two roles:
Producer warps are assigned exclusively to data movement. They call setmaxnreg to deallocate most of their registers (freeing those resources for consumer warps on the same SM), then issue TMA loads for Q, K, and V tiles into a circular shared-memory buffer. Each load writes to a named barrier in SMEM that the consumer can poll.
Consumer warpgroups reclaim those freed registers (via setmaxnreg in the other direction) and perform computation. After waiting on the barrier indicating a tile has arrived in SMEM, they issue WGMMA instructions against it and proceed to the next tile without stalling the producer.
This split means memory transfers and compute operations proceed simultaneously rather than sequentially. Earlier GPU architectures had limited hardware support for this pattern; on Hopper, TMA's asynchronous interface makes it practical without complex synchronization overhead.
The s-stage circular SMEM buffer (typically two or three stages) allows the producer to be s iterations ahead of the consumer, providing enough slack to hide global-memory latency even when the GPU's memory system is under load.
Register management is central to the design. Each SM on H100 has 256 KB of register file shared among all resident warps. In a naive kernel that assigns both data movement and computation to the same warps, each warp needs registers for both roles, limiting the tile size and pipeline depth. With warp specialization, the producer warp calls setmaxnreg 40 to reduce its register allocation to a minimum, freeing those registers for the consumer warpgroup. The consumer warpgroup can then call setmaxnreg 232 to claim a large register budget, supporting the large WGMMA tile sizes and the 2-stage pipeline's intermediate buffers. This register reallocation at kernel launch time (not dynamically during execution) is a feature unique to Hopper's programmable warpgroup interface.
For the backward pass, FA3 applies the same warp-specialization strategy to the three GEMMs required to compute gradients with respect to Q, K, and V. The backward pass achieves 1.5 to 1.75 times the throughput of FA2 backward on H100.
The paper measures the contribution of each technique on a representative configuration (batch size 4, sequence length 8,448, 16 heads, head dimension 128) on H100 SXM5:
| Configuration | Latency (ms) | TFLOPS/s |
|---|---|---|
| No warp specialization | 4.105 | 570 |
| Warp specialization only | 4.021 | 582 |
| Full FA3 (warp spec + GEMM-softmax pipelining) | 3.538 | 661 |
The combined gain from both techniques is roughly 16% over the no-specialization baseline, and both contribute meaningfully.
A second bottleneck is the softmax computation itself. On H100, FP16 matmul runs at 989 TFLOPS while the special-function unit (SFU) responsible for the exponential in softmax runs at only 3.9 TFLOPS, a 256-fold throughput ratio. In a sequential implementation, softmax on a block consumes roughly 50% of total kernel time despite performing very little arithmetic work. The fix is to overlap softmax with the next GEMM iteration so the SFU and Tensor Cores run concurrently.
FA3 implements two complementary strategies:
When two consumer warpgroups share a thread block, they alternate roles: while warpgroup 1 performs GEMM operations, warpgroup 2 runs softmax on its current scores, and vice versa. Explicit bar.sync barriers enforce this alternating schedule. The two warpgroups share the same HBM tiles (loaded once by the producer), so this doubles the compute work done per memory fetch.
Pingpong scheduling improves forward-pass throughput from approximately 570 TFLOPS (no specialization baseline) to around 620 TFLOPS.
Even within a single warpgroup, the algorithm can overlap operations across loop iterations. The key observation is that WGMMA is asynchronous: issuing WGMMA(Q, K_next) does not stall; the hardware executes it while the software proceeds. The 2-stage pipeline exploits this:
Step 3 overlaps with step 1 in hardware: the Tensor Core executes the O-GEMM while the WGMMA unit for QK is still in flight. This intra-warpgroup pipelining raises throughput to 640 to 660 TFLOPS, though at the cost of higher register pressure from buffering intermediate matrices.
A 3-stage variant is described in the paper's appendix and overlaps further, but register constraints make it impractical for typical tile sizes.
The two strategies are complementary and can be combined. In the full FA3 configuration with both warp specialization and 2-stage intra-warpgroup pipelining, the ablation study shows a final throughput of 661 TFLOPS/s on the tested configuration, compared to 570 TFLOPS/s without either technique.
It is worth noting that the 256-fold throughput gap between matrix operations and special functions is not a bug but a consequence of hardware design priorities. Tensor Cores occupy large silicon area and deliver enormous throughput because GEMMs dominate transformer training and inference. The SFU is sized to handle transcendental functions (exp, log, sin, cos) used sparingly throughout most workloads. The attention mechanism is unusual in that its exponential is on the critical path and occurs inside a tight loop. This makes attention exceptionally sensitive to SFU latency, and the FA3 overlap approach directly targets this sensitivity.
FP8 WGMMA requires per-tensor or per-block scaling factors. FA3 uses per-block quantization, maintaining one scaling factor for each tile of K, V, and Q rather than a single global value. This aligns naturally with FlashAttention's block-based iteration: each tile is quantized independently before being loaded into SMEM. The score matrix S also receives per-block scaling at essentially zero cost, since the scale can be fused with the softmax rescaling step.
Activation outliers in large language models concentrate high values in a few dimensions, causing disproportionate quantization error when the FP8 dynamic range is allocated to represent those extremes. FA3 addresses this by multiplying Q and K by a random orthogonal matrix M before quantization:
(QM)(KM)ᵀ = QKᵀ
Because M is orthogonal, attention output is unchanged. But the multiplication distributes the outlier energy across all dimensions, making the resulting activations more uniform and thus better suited to per-block FP8 quantization.
In practice, M is the product of a diagonal matrix with random ±1 entries and a Hadamard matrix, giving a complexity of O(d log d) rather than O(d²) for a general orthogonal matrix. The Hadamard transform can also be fused with rotary position embeddings at no extra cost, since both apply a dimension-wise linear transform before the attention operation.
FP8 WGMMA instructions require inputs in k-major (column-major) layout, but V tensors are typically stored in head-dimension contiguous format. FA3 performs an in-kernel transposition of V in the producer warpgroup using LDSM/STSM (shared memory load/store matrix) instructions, overlapping this with consumer GEMMs so no latency is added to the critical path.
Register layout mismatches between FP32 accumulators and FP8 operands are resolved with byte permute instructions that rearrange accumulator data before it is fed as the next WGMMA operand.
| Method | RMSE |
|---|---|
| Baseline FP8 (per-tensor quantization) | 2.4 × 10⁻² |
| FA3 block quantization only | 9.3 × 10⁻³ |
| FA3 incoherent processing only | 2.4 × 10⁻² |
| FA3 block quantization + incoherent processing | 9.1 × 10⁻³ |
The combined technique achieves a 2.6 times reduction in RMSE versus the baseline. Block quantization does most of the work; incoherent processing adds a small additional benefit on distributions with heavy outliers.
For FP16, FA3 numerical accuracy is essentially identical to FA2:
| Method | RMSE (outlier distribution) |
|---|---|
| Standard attention | 3.2 × 10⁻⁴ |
| Flash Attention 2 | 1.9 × 10⁻⁴ |
| Flash Attention 3 | 1.9 × 10⁻⁴ |
| Sequence length | FA2 TFLOPS/s | FA3 TFLOPS/s | Speedup |
|---|---|---|---|
| 512 | ~180 | ~270 | 1.5x |
| 1,024 | ~240 | ~400 | 1.7x |
| 2,048 | ~290 | ~510 | 1.75x |
| 4,096 | ~330 | ~600 | 1.8x |
| 8,192 | ~350 | ~660 | 1.9x |
| 16,384 | ~360 | ~700 | 1.95x |
Peak FP16 throughput at release reached 740 TFLOPS/s, representing 75% of the H100's theoretical FP16 maximum of 989 TFLOPS/s. The speedup is larger at longer sequence lengths because the tile-based computation becomes a larger fraction of total work at long contexts, amortizing per-kernel launch overhead. Patches landed in late 2024 and through 2025 lifted the peak further: the PyTorch blog updated in November 2025 reports up to 840 TFLOPS/s on BF16 (about 85% utilization) for sequence length 16,384 with head dimension 128, attributable to a more aggressive intra-warpgroup pipeline schedule and tighter management of the L2 residency for the K and V tensors.
FA3 also improves the backward pass by 1.5 to 1.75 times compared to FA2, using the same warp-specialization and asynchrony techniques applied to the backward-pass GEMMs.
The FP8 kernel reaches close to 1.2 PFLOPS/s (1,200 TFLOPS/s) at release, with later builds breaking the 1.3 PFLOPS/s mark on long unmasked sequences. This is approximately 60 to 66% of the H100's peak FP8 throughput of 1,978 TFLOPS/s. For large sequences without causal masking, FA3's FP8 matches or exceeds cuDNN. For small sequences with causal masking, FA3 FP8 initially fell behind cuDNN due to the absence of a persistent kernel with load balancing (a noted limitation acknowledged in the paper); a persistent-kernel variant landed in the hopper/ subdirectory during 2025 and closes most of that gap.
| Metric | Flash Attention 2 | Flash Attention 3 |
|---|---|---|
| H100 utilization (FP16) | ~35% | ~75 to 85% |
| Peak FP16/BF16 throughput | ~350 TFLOPS/s | ~740 to 840 TFLOPS/s |
| Peak FP8 throughput | N/A | ~1,200 to 1,300 TFLOPS/s |
| FP16 forward speedup | 1x (baseline) | 1.5 to 2.0x |
| FP16 backward speedup | 1x (baseline) | 1.5 to 1.75x |
| FP8 support | No | Yes (forward pass) |
| Native Hopper TMA/WGMMA | No | Yes |
| Minimum GPU | Ampere | Hopper (H100/H800) |
FA3 supports the same attention variants as FA2, implemented within the same tiled kernel structure:
| Variant | Description | FA3 Support |
|---|---|---|
| Multi-head attention (MHA) | Full separate Q, K, V per head | Yes |
| Multi-query attention (MQA) | Single K/V shared across all Q heads | Yes |
| Grouped-query attention (GQA) | Multiple Q heads per K/V group | Yes |
| Causal (autoregressive) masking | Upper triangular mask | Yes |
| Sliding window attention | Windowed local attention mask | Partial |
| Variable-length sequences | Packed batches of varying lengths | Yes |
| FP16 forward | Full precision forward pass | Yes |
| BF16 forward | Bfloat16 forward pass | Yes |
| FP8 forward | 8-bit forward pass | Yes |
| FP16/BF16 backward | Backward pass for training | Yes |
| FP8 backward | 8-bit backward pass for training | Described in paper; not fully optimized in initial release |
GQA is handled by adjusting the tensor indexing: when loading K and V blocks, the kernel maps each Q head to its corresponding K/V group without duplicating the K/V tensors in memory. This allows FA3 to support modern architectures like LLaMA 3, Mistral, and Qwen that use GQA for inference efficiency.
Variable-length sequence support uses a packed-sequence format where all sequences in a batch are concatenated into a single tensor, with a separate array of cumulative lengths tracking boundaries. The tiled attention loop respects these boundaries, applying causal masking only within each sequence and never mixing attention between sequences.
The FA3 code was released as a beta under the existing Dao-AILab/flash-attention GitHub repository, in a dedicated hopper/ subdirectory. Installation requires CUDA 12.3 or later (CUDA 12.8 recommended) and an H100 or H800 GPU. The Python interface is imported as flash_attn_interface.
Tri Dao and Together AI maintain a companion repository at togethercomputer/flash-attention-3. A second active fork lives under the vLLM project at vllm-project/flash-attention, which slightly modifies the Python wrapper to fit the vLLM scheduler and exposes the persistent-kernel variants the vLLM team relies on for production deployment.
Since FA3's release in July 2024, it has been integrated or is in the process of integration across major inference and training frameworks. As of mid-2026 the picture is the following:
| Framework | Status (mid-2026) | Notes |
|---|---|---|
| PyTorch (scaled_dot_product_attention) | FA2 remains the default backend; FA3 dispatch path tracked under issue #148891 | Binary-size concerns from adding a second Hopper-only backend; H100-only builds proposed |
| SGLang | FA3 enabled as a selectable attention backend | Used for high-throughput H100 deployments with FP8 |
| vLLM | FA3 supported via the vllm-project fork; FA2 and FA3 both available | FA4 not yet wired in as of early 2026 |
| Hugging Face Transformers | FA3 wrapper available via flash_attn 3.x; integration tracked under GitHub issue #33373 | Models opt in via attn_implementation="flash_attention_3" |
| NVIDIA TensorRT-LLM | Uses FA3-style kernels through its own cuDNN/CUTLASS backends | Reports the highest H100 throughput in third-party benchmarks |
| NVIDIA Megatron-LM | FA integration present; FA3 extension shipped during 2025 | Used for FP8 pretraining on H100 clusters |
| Microsoft DeepSpeed | FlashAttention integrated in inference engine | Hopper kernels selected automatically when available |
| AMD ROCm | FA-style kernels via a separate composable kernel; FA3 itself is NVIDIA-only | ROCm bundles its own FlashAttention build for MI300X |
FA2 remains the more widely deployed version through 2025 given the broader installed base of Ampere GPUs. FA3 adoption accelerated through 2025 and into 2026 as H100 and H200 deployments scaled, and as the vLLM fork stabilised its FP8 attention path. By early 2026 every major open-source serving engine including vLLM, SGLang, TensorRT-LLM, llama.cpp, ExLlamaV2, and MLC-LLM uses FlashAttention or a direct derivative as its attention backend.
Third-party serving benchmarks from 2025 and 2026 use FA3 as the default attention path on H100 for the three leading open-source engines. On a single H100 80 GB SXM5 with a 70B-parameter Llama-3 derivative in FP8 at 50 concurrent requests, TensorRT-LLM reports the highest throughput, with vLLM (using FA3 via the vllm-project fork) at roughly 85 to 90% of that level and SGLang at 88 to 92%. The spread between engines comes mainly from scheduler, KV-cache management, and CUDA Graph capture differences rather than from the attention kernel itself.
NVIDIA's own cuDNN library includes highly optimized attention kernels for Hopper. On large FP16 sequences (length ≥ 1,024), FA3 matches or exceeds cuDNN throughput. For FP8 at large sequence lengths without causal masking, FA3 is competitive. FA3 has an advantage in openness: it is available as open source under a permissive license, while cuDNN is a closed binary. Subsequent cuDNN releases (9.x series) closed some of the gap on causal-masked workloads, but the FA3 reference remains the open-source baseline that downstream projects target when porting to new precisions or fused operations.
PyTorch's torch.nn.functional.scaled_dot_product_attention function (SDPA) was redesigned in PyTorch 2.0 to dispatch to multiple backends, including FlashAttention. The SDPA interface selects the most efficient available kernel based on input dtype, device, and attention mask type at call time. FA2 is available as one such backend. Full integration of FA3 into the PyTorch SDPA dispatch path was listed as a near-term goal in both the FA3 paper and the PyTorch blog post at the time of FA3's release.
As of mid-2026 that integration is still in progress, tracked under PyTorch GitHub issue #148891. The blocker is engineering rather than performance: adding FA3 alongside FA2 inflates the prebuilt PyTorch wheel because the FA3 kernels are large and Hopper-specific. The proposal under discussion is to build the FA3 path only into Hopper-targeted wheels, leaving Ampere and Ada wheels using FA2. Users who want FA3 today install the flash_attn 3.x Python package and call it directly through flash_attn_interface, or rely on serving frameworks (vLLM, SGLang, TensorRT-LLM) that bundle FA3 themselves.
FlexAttention, a PyTorch 2.5 feature that lets users define custom attention score modifications while still using a fused kernel, was extended in late 2025 and early 2026 to dispatch to FA3-style Hopper kernels on H100 and to FA4-style Blackwell kernels on B200. This is the most likely long-term path for FA3 to reach mainstream PyTorch users: not as a new SDPA backend, but as one of the kernels FlexAttention dispatches into.
Long-context language models depend heavily on fast attention kernels because the number of attention operations grows quadratically with context length. FA3's roughly 2x speedup on H100 compared to FA2 translates directly into either faster generation at a given context length or support for longer contexts at the same latency budget. Models supporting 128K or 1M token contexts, such as GPT-4 Turbo, Llama 3 extended-context variants, and Claude 3 extended contexts, benefit from any reduction in per-attention-block latency. FA3's FP8 support is particularly relevant here: at 128K context, the attention block consumes a substantial fraction of total inference compute, and halving precision roughly halves that cost while FA3's incoherent processing keeps accuracy degradation small.
FA3 was designed as a fast kernel for standard scaled dot-product attention, but several attention variants that became popular during 2024 and 2025 require kernel modifications rather than direct reuse. The two most consequential are Multi-head Latent Attention (MLA) and Mixture of Block Attention (MoBA).
Multi-head Latent Attention (MLA), introduced by DeepSeek in its V2 architecture in mid-2024 and used in DeepSeek-V3 and the R1 reasoning model, compresses the keys and values into a low-rank latent vector stored in the KV cache, with full-rank K and V reconstructed on the fly. The technique cuts KV-cache memory by roughly 50 to 90% versus standard multi-head attention while preserving downstream quality.
MLA breaks one of FA3's assumptions: the keys and values change shape between storage and use, so a standard FA3 kernel cannot be called directly. DeepSeek's response was FlashMLA, an open-source kernel released during DeepSeek's Open Source Week in February 2025. FlashMLA is modelled on FlashAttention 2 and 3, reuses the warp-specialization and TMA producer-consumer pattern from FA3, and folds the latent-to-full projection inside the kernel so that uncompressed K and V never reach HBM. FlashMLA reaches around 660 TFLOPS/s in compute-bound regimes on an H800 SXM5 and was later updated for an additional 5 to 15% speedup.
FA3 and FlashMLA coexist in production: standard models (Llama 3, Qwen 2.5, Mistral) use FA3, while DeepSeek's MLA-based models use FlashMLA.
MoBA (Mixture of Block Attention), introduced by Moonshot AI in early 2025 and used in the Kimi K2 long-context models, divides the key/value sequence into blocks and uses a gating router to select a sparse subset of blocks each query token attends to. MoBA is sparse rather than dense, so it cannot reuse FA3 without modification. The reference MoBA kernels nevertheless adopt FA3's warp-specialization template, with the inner loop iterating over selected blocks and the per-block compute borrowing FA3's TMA plus WGMMA pipeline.
Native Sparse Attention, ring attention for sequence parallelism, and hybrid sliding-window plus global-token schemes all build on the FA3 codebase as a starting point. Most modify only the loop bounds and the mask logic, leaving the producer-consumer split intact.
FA3 uses WGMMA and TMA instructions that are exclusive to Hopper (compute capability 9.0). It does not run on Ampere (A100, compute capability 8.0) or Ada Lovelace (RTX 4000 series, compute capability 8.9). Organizations running A100 fleets must continue to use FA2.
The FP8 optimization in the paper covers only the forward pass. The backward pass algorithm for FP8 is described in the paper's appendix but was not fully optimized in the initial release. FP8 training therefore requires a separate implementation path or falls back to BF16/FP16 for the backward pass.
For small sequences with causal masking, FA3's initial FP8 kernel underperformed cuDNN because it lacked a persistent kernel design with dynamic load balancing. This was noted explicitly in the paper as future work. A persistent variant has since landed in the open-source repository, closing most of the gap, but the persistent path is still less mature than the standard kernel and is selected manually rather than by an autotuner.
The intra-warpgroup 2-stage pipeline requires buffering intermediate GEMM results across iterations, consuming extra registers per thread. For large tile sizes, this risks register spilling to local memory (which routes through HBM), partially undoing the pipelining benefit. The 3-stage variant described in the paper appendix is more susceptible to this problem.
Initial releases of FA3 do not support NVIDIA Blackwell GPUs (NVIDIA Blackwell), including the GB200 and B200. Blackwell replaces WGMMA with a new instruction set called TCGEN05, which has a different programming model: TCGEN05 instructions are fully asynchronous, accumulate into a new on-chip storage called Tensor Memory (TMEM) rather than into registers, and require a different producer-consumer split. Attempts to run FA3 on Blackwell (sm_100) raise CUDA launch errors. This limitation directly motivated the subsequent development of Flash Attention 4.
The sm_121 target used in the consumer GB10 chips also lacks FA3 support; most cloud deployments will skip FA3 on Blackwell and move straight to FA4.
Sliding window attention is listed as partially supported. The kernel handles the basic windowed mask, but more complex schemes such as the StreamingLLM attention sink combined with a sliding window require additional kernel paths that are not all in the released FA3 build.
Flash Attention 4 (FA4), previewed at Hot Chips in August 2025 and published in full as a technical report by Together AI and the Dao-AILab group in early 2026, is a rewrite in CuTeDSL (a Python domain-specific language built on NVIDIA's CUTLASS 3.x framework for writing GPU kernels). FA4 targets both Hopper (H100) and NVIDIA Blackwell (B200), using NVIDIA's TCGEN05 tensor core instructions on Blackwell in place of WGMMA. The architectural shift is significant: Blackwell removed WGMMA and replaced it with TCGEN05 instructions that operate on the tensor memory (TMEM) subsystem rather than shared memory, requiring a fundamentally different kernel structure.
FA4 is notable for being implemented primarily in Python using CuTeDSL's JIT compilation model, rather than in C++ as FA3 was. This reduces the engineering burden of writing correct kernel code while retaining the ability to generate architecture-specific CUDA at compile time. CuTeDSL also cuts compile times by roughly 20 to 30 times compared with C++ template-heavy CUTLASS code, a practical benefit for kernel iteration.
The technical report reports up to 1,613 TFLOPS/s on BF16 on B200 (about 71% of the B200's theoretical peak), a 1.1 to 1.3 times speedup over cuDNN 9.13 and 2.1 to 2.7 times over Triton. On H100, FA4 is competitive with FA3 and sometimes slightly faster on long sequences thanks to a tighter intra-warpgroup schedule, but does not yet beat the most aggressive FA3 builds on every shape.
| GPU | Precision | FA3 (mid-2026 build) | FA4 (early 2026 release) |
|---|---|---|---|
| H100 SXM5 | BF16 | ~840 TFLOPS/s | ~850 TFLOPS/s (competitive) |
| H100 SXM5 | FP8 | ~1,300 TFLOPS/s | ~1,350 TFLOPS/s |
| B200 | BF16 | not supported | ~1,613 TFLOPS/s |
| B200 | FP8 | not supported | ~2,800 TFLOPS/s (preliminary) |
FA3 is expected to remain the deployed standard on H100 clusters until FA4's implementation matures and is integrated into downstream frameworks. The two versions cover complementary hardware generations, with FA3 being the production-grade solution for Hopper and FA4 being the successor for Blackwell and later. As of early 2026 vLLM supports FA2 and FA3 but not yet FA4, while PyTorch's FlexAttention dispatches to FA3 on Hopper and FA4 on Blackwell.
The Together AI write-up of FA4 frames the algorithm as co-designed with NVIDIA's Blackwell tensor memory subsystem, and acknowledges that future GPU generations (NVIDIA Rubin and beyond) will require another rewrite. The Dao-AILab roadmap moves the FlashAttention codebase entirely into CuTeDSL to make these forward ports cheaper, treating the algorithm itself as the stable contribution while the kernel implementation tracks each new tensor core ISA.