Triton (OpenAI GPU programming language)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,065 words
Triton is an open-source Python-embedded domain-specific language and compiler for writing high-performance custom GPU kernels, particularly those used in deep-learning workloads. It was created by Philippe Tillet during his PhD at Harvard University and presented at the MAPL 2019 workshop, then re-implemented at OpenAI and released as Triton 1.0 in July 2021 under an MIT license.[^1][^2] Programs in Triton are written as Python functions annotated with @triton.jit; the compiler operates on block-level tiles (statically shaped multi-dimensional sub-arrays) and handles memory coalescing, shared-memory management, and thread scheduling automatically, in contrast to CUDA's explicit single-instruction multiple-thread (SIMT) model.[^2][^3] Since 2022 the compiler has been rebuilt on top of MLIR, and it now lowers Triton-IR through a Triton-GPU IR and LLVM-IR to NVIDIA PTX or AMD AMDGCN device code.[^4] Triton is the kernel-generation backend used by PyTorch's TorchInductor in PyTorch 2.0 (March 2023), and it underpins a large family of community kernel libraries including FlashAttention's Triton reference kernels, LinkedIn's Liger Kernel, FBGEMM_GPU's GenAI kernels, and the vLLM attention backend.[^5][^6][^7][^8][^9]
Writing efficient custom kernels for graphics processing units has long required CUDA C++ expertise: programmers manually tile work across thread blocks, stage data into shared memory, coordinate warp-level scheduling, and tune for specific compute capabilities. The first formal description of Triton, "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations" by Philippe Tillet, H.T. Kung, and David Cox, was published in the Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL '19), held on 22 June 2019 in Phoenix, Arizona alongside PLDI.[^1] Tillet, then a Harvard PhD student under Kung and Cox, had begun the project in 2018 after working on auto-tuners for matrix multiplications in CUDA and on the ViennaCL OpenCL library; he completed his PhD in 2020 with a dissertation on compilers for blocked algorithms on GPUs.[^10]
The MAPL paper argued that "tile-level operations and optimizations" should be first-class concepts inside a compiler intermediate representation, and showed that adding "just a few data- and control-flow extensions to LLVM-IR could enable various tile-level optimization passes which jointly lead to performance on-par with vendor libraries."[^1] The original prototype shipped as a C++/Python library and was hosted at github.com/ptillet/triton and later github.com/openai/triton.[^2]
Tillet joined OpenAI full-time in 2020 to continue Triton's development as part of OpenAI's accelerator research, after holding internships and research positions at NVIDIA and AMD that influenced his thinking about GEMM kernel optimisation.[^10] On 28 July 2021 OpenAI announced Triton 1.0 in the blog post "Introducing Triton: Open-source GPU programming for neural networks," authored by Philippe Tillet.[^2] The post described Triton as "an open-source Python-like programming language which enables researchers with no CUDA experience to write highly efficient GPU code" and highlighted that "it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLAS in under 25 lines of code" on recent NVIDIA hardware.[^2] The announcement positioned Triton as a complement to CUDA rather than a wholesale replacement for it: researchers writing custom fused activations, normalisation kernels, or attention variants would no longer need to manage thread-level shared-memory tiling or warp-synchronous reductions by hand, while still being able to target the same NVIDIA tensor cores used by cuBLAS and cuDNN.[^2]
Between late 2022 and 2023 the Triton compiler underwent a major internal rewrite to use the MLIR infrastructure. The Triton 2.0 release notes describe a "backend rewritten to use MLIR" along with support for kernels that contain back-to-back matmuls (the FlashAttention pattern).[^3] PyTorch 2.0, released on 15 March 2023, made Triton the default GPU code-generator inside the new TorchInductor backend of torch.compile.[^5] The repository was moved from the openai/ GitHub organization to a community triton-lang/ organization, where development continues with contributions from OpenAI, Meta, AMD, Intel, NVIDIA, IBM and Red Hat.[^3][^11]
| Milestone | Date | Notes |
|---|---|---|
| MAPL '19 paper | 22 June 2019 | Tillet, Kung, Cox introduce Triton-IR and tile-level optimization passes[^1] |
| Tillet joins OpenAI | 2020 | Full-time work on Triton begins inside OpenAI; PhD completed at Harvard University[^10] |
| Triton 1.0 release | 28 July 2021 | OpenAI blog post by Phil Tillet; Python-embedded DSL; FP16 GEMM matching cuBLAS in under 25 lines[^2] |
| MLIR-based rewrite | 2022, into Triton 2.x | Backend rewritten on MLIR; supports back-to-back matmuls for FlashAttention[^3][^4] |
| PyTorch 2.0 / TorchInductor | 15 March 2023 | Triton becomes default GPU code-generator for torch.compile on NVIDIA and AMD[^5] |
| ROCm 6.2 AMD backend | 2024 | First-class AMD GPU support with AMDGCN lowering[^11][^15] |
| Liger Kernel release | August 2024 | LinkedIn open-sources Triton kernel library for LLM training[^8][^20] |
| Triton Developer Conference | September 2024 | Talks from Intel, AMD, NVIDIA, Qualcomm, Microsoft and AWS in Silicon Valley[^11] |
| Gluon lower-level dialect | 2025 | Gluon programming model exposing layouts, shared memory and warp specialization as a sibling to Triton[^12] |
Triton flips the conventional CUDA model. CUDA programs are written as scalar functions executed by individual threads inside thread blocks, with the programmer responsible for memory layouts, coalescing, shared-memory staging and warp-level synchronization. A Triton kernel, by contrast, is written from the perspective of an entire program instance, and the unit of data is a block (or tile): a statically shaped multi-dimensional array distributed across threads automatically by the compiler.[^1][^2]
The OpenAI announcement characterised the difference as "blocked program, scalar threads" rather than CUDA's "scalar program, blocked threads": the programmer reasons about whole tiles, while the Triton compiler decides which threads in a warp will own which elements and inserts the necessary loads, stores, and synchronizations.[^2] This abstraction is what the original paper called the "tile" abstraction; it lets the compiler apply optimisations such as automatic shared-memory staging of operands to computationally intensive block operations, automatic vectorisation of memory accesses, and coalescing of contiguous loads and stores.[^1][^2]
Kernels are written as ordinary Python functions decorated with @triton.jit. A canonical example from the Triton tutorial implements element-wise vector addition:[^13]

    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, output_ptr, n_elements,
                   BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one BLOCK_SIZE-wide slice of the input.
        pid = tl.program_id(axis=0)
        block_start = pid * BLOCK_SIZE
        offsets = block_start + tl.arange(0, BLOCK_SIZE)
        # Mask off lanes that fall past the end of the input.
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        output = x + y
        tl.store(output_ptr + offsets, output, mask=mask)

Several language conventions are visible in this listing.[^13] tl.program_id(axis=0) returns the index of the kernel instance along a launch grid axis. BLOCK_SIZE: tl.constexpr marks a compile-time constant, enabling shape-specialised code generation. tl.arange, tl.load, and tl.store operate on tile-shaped tensors; the optional mask argument prevents out-of-bounds memory accesses on the boundary of the input. The kernel launch grid is computed in Python and passed to the JIT-compiled callable.[^13]
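The launch-grid and masking logic described above can be modelled without a GPU. The following is a minimal plain-Python sketch, not Triton code: `add_program` and `launch_add` are hypothetical names, Python lists stand in for device memory, and the loop over `pid` runs sequentially where a GPU would run the program instances in parallel.

```python
def add_program(pid, x, y, out, n_elements, block_size):
    """Emulate one program instance of the vector-add kernel."""
    block_start = pid * block_size
    # Stand-in for block_start + tl.arange(0, BLOCK_SIZE).
    offsets = [block_start + i for i in range(block_size)]
    # Stand-in for the mask: lanes past the end of the input do nothing.
    mask = [off < n_elements for off in offsets]
    for off, active in zip(offsets, mask):
        if active:
            out[off] = x[off] + y[off]


def launch_add(x, y, block_size=4):
    """Compute a 1-D launch grid and run every program instance in turn."""
    n_elements = len(x)
    out = [0] * n_elements
    # Ceiling division: enough program instances to cover every element.
    grid = (n_elements + block_size - 1) // block_size
    for pid in range(grid):
        add_program(pid, x, y, out, n_elements, block_size)
    return out, grid


# Ten elements with a block size of four: the last block is partly masked.
result, grid = launch_add(list(range(10)), [10] * 10, block_size=4)
print(grid, result)
```

The input length is deliberately not a multiple of the block size, so the final program instance exercises the mask; without it, that instance would read and write past the end of the lists.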
The Triton standard library exposes block-level primitives such as tl.dot (which the compiler lowers to architecture-specific tensor-core MMA or WMMA instructions), tl.atomic_* operations, reductions like tl.sum, broadcasting, and pointer arithmetic on tensor-shaped pointers. Performance tuning is done primarily through triton.autotune, which compiles and benchmarks a kernel across a list of configuration dictionaries (block sizes, number of warps, number of pipeline stages) and caches the winner.[^14] The blog post notes that the "purpose of Triton is to fully automate these optimizations, so that developers can better focus on the high-level logic of their parallel code."[^2]
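The benchmark-and-cache behaviour of autotuning can be illustrated with a small plain-Python model. This is a sketch under stated assumptions rather than Triton's implementation: `autotune`, `pick`, and `fake_benchmark` are hypothetical names, and a made-up cost function replaces real kernel timings so that the result is deterministic. The real `triton.autotune` decorator takes a list of `triton.Config` objects and a `key` list of argument names whose values trigger re-tuning when they change.

```python
def autotune(configs, benchmark):
    """Return a picker that benchmarks every config once per key and caches the winner."""
    cache = {}

    def pick(key):
        if key not in cache:
            # Benchmark each candidate and keep the cheapest one.
            cache[key] = min(configs, key=lambda cfg: benchmark(cfg, key))
        return cache[key]

    return pick


CONFIGS = [
    {"BLOCK_SIZE": 128, "num_warps": 4, "num_stages": 2},
    {"BLOCK_SIZE": 256, "num_warps": 4, "num_stages": 3},
    {"BLOCK_SIZE": 1024, "num_warps": 8, "num_stages": 3},
]

calls = []  # records every benchmark run so cache hits are observable


def fake_benchmark(cfg, n_elements):
    """Hypothetical cost model standing in for a timed kernel run."""
    calls.append(cfg["BLOCK_SIZE"])
    n_blocks = -(-n_elements // cfg["BLOCK_SIZE"])  # ceiling division
    padded_work = n_blocks * cfg["BLOCK_SIZE"]      # masked-off lanes still cost time
    return padded_work + 50 * n_blocks              # plus a fixed per-block overhead


pick = autotune(CONFIGS, fake_benchmark)
best_small = pick(300)        # benchmarks all three configs
best_large = pick(1_000_000)  # new key: benchmarks all three again
best_again = pick(300)        # cached: no further benchmark runs
```

Under this toy cost model the small input favours the smallest block and the large input favours the largest, which is the kind of shape-dependent choice that autotuning is meant to discover.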
Two-dimensional tiles are addressed by constructing 1-D index ranges with tl.arange and broadcasting them against each other to form 2-D tile coordinates. A typical matrix multiplication kernel computes a BLOCK_M-by-BLOCK_N output tile by sweeping the K dimension in BLOCK_K-sized chunks, accumulating into a BLOCK_M-by-BLOCK_N accumulator and invoking tl.dot once per iteration; the compiler is responsible for pipelining global-memory loads through shared memory and for issuing the appropriate tensor-core instructions on the target architecture.[^13][^4] Programmers express only the tile-level dataflow; they do not write code per CUDA thread, per warp, or per memory bank.[^2]
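The tile-level dataflow of such a matrix multiplication can be emulated in plain Python. The sketch below is an illustration, not a Triton kernel: `matmul_tile` and `tiled_matmul` are hypothetical names, nested lists stand in for device tensors, and the inner summation stands in for one `tl.dot` call per K chunk.

```python
def matmul_tile(pid_m, pid_n, a, b, c, M, N, K, BLOCK_M, BLOCK_N, BLOCK_K):
    """Emulate one program instance: compute one BLOCK_M x BLOCK_N tile of C."""
    rows = [pid_m * BLOCK_M + i for i in range(BLOCK_M)]
    cols = [pid_n * BLOCK_N + j for j in range(BLOCK_N)]
    # Accumulator tile, initialised to zero before the K sweep.
    acc = [[0.0] * BLOCK_N for _ in range(BLOCK_M)]
    for k0 in range(0, K, BLOCK_K):
        ks = [k for k in range(k0, k0 + BLOCK_K) if k < K]  # masked K chunk
        # One tile-level product per K chunk, standing in for tl.dot.
        for i, r in enumerate(rows):
            for j, col in enumerate(cols):
                if r < M and col < N:
                    acc[i][j] += sum(a[r][k] * b[k][col] for k in ks)
    # Masked store of the finished tile.
    for i, r in enumerate(rows):
        for j, col in enumerate(cols):
            if r < M and col < N:
                c[r][col] = acc[i][j]


def tiled_matmul(a, b, BLOCK_M=2, BLOCK_N=2, BLOCK_K=2):
    """Launch one emulated program instance per output tile."""
    M, K, N = len(a), len(b), len(b[0])
    c = [[0.0] * N for _ in range(M)]
    grid_m = -(-M // BLOCK_M)  # ceiling division
    grid_n = -(-N // BLOCK_N)
    for pid_m in range(grid_m):
        for pid_n in range(grid_n):
            matmul_tile(pid_m, pid_n, a, b, c, M, N, K, BLOCK_M, BLOCK_N, BLOCK_K)
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0], [0, 1], [1, 1]]
c = tiled_matmul(a, b)
```

The 3-by-3 and 3-by-2 inputs are not multiples of the 2-wide blocks, so both the row mask and the K-chunk mask are exercised. Everything the emulation does element by element inside a tile is what the compiler, not the programmer, maps onto threads, shared memory, and tensor-core instructions.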
The Triton compiler is organised as a multi-stage lowering pipeline through several MLIR dialects, before exiting to LLVM-IR and then to vendor-specific assembly.[^4]
The @triton.jit decorator captures the Python function's AST and translates it into Triton-IR, an unoptimised, machine-independent dialect inspired by LLVM IR but with first-class tile-typed values such as tensor<1024x!tt.ptr<f32>>.[^4] Triton-IR has been an MLIR dialect since the 2022 rewrite.[^4]

Triton-IR is then lowered to Triton-GPU IR, which annotates tensors with layout encodings such as #blocked, slice, dot_op, shared, nvidia_mma, amd_mfma, and amd_wmma.[^4] Passes at this level distribute tensors across warps or wavefronts, schedule loads through shared memory, and pipeline matmul-heavy loops.[^4]

Intermediate forms at every stage can be inspected programmatically via the compiled kernel's asm dictionary, which exposes the Triton-IR, Triton-GPU IR, LLVM-IR, PTX, and final binary.[^4]
Internal optimisation passes performed during these lowerings include software pipelining of matmul loops (overlapping global-memory loads, shared-memory stores, and tensor-core operations), allocation of shared-memory buffers, layout conversion to insert minimal data movement between layouts, register pressure analysis, and architecture-specific instruction selection for tl.dot such as Ampere mma.sync.m16n8k16, Hopper wgmma.mma_async, and AMD MFMA/WMMA intrinsics.[^4][^17] These passes are written as MLIR rewrites and are gated on hardware capabilities so that the same Triton source can produce different (and architecturally appropriate) device binaries depending on the target.[^4]
Triton ships first-class backends for NVIDIA and AMD GPUs, and several vendor-maintained out-of-tree backends extend it to other accelerators.[^11]
NVIDIA. The reference backend targets compute capability 8.0 and above (Ampere and newer).[^11] On NVIDIA Hopper H100 hardware Triton uses the asynchronous warp-group wgmma instruction for FP8 and FP16 tensor-core operations, which is required to extract peak throughput from H100's tensor cores.[^17] NVIDIA has also contributed Blackwell tensor-core support to Triton, including a CUDA Tile IR backend that targets newer instructions on Blackwell GPUs.[^18]
AMD. Triton's AMD backend targets ROCm 6.2 and newer; it lowers Triton-GPU IR through LLVM-IR to AMDGCN assembly and produces hsaco binaries through the AMD ROCm toolchain.[^4][^11] AMD describes Triton as enabling kernels written once that run on both NVIDIA and AMD hardware, reducing the need for manual CUDA-to-HIP porting.[^15] AMD has integrated Triton tightly with vLLM, making the Triton attention backend the default on AMD GPUs.[^7]
Intel. Intel maintains an out-of-tree backend at intel/intel-xpu-backend-for-triton that targets Intel GPUs through an Intel-specific GenX MLIR dialect, lowering to LLVM and SPIR-V; it uses 2D block I/O and DPAS tensor-core operations for tl.dot.[^16]
CPU and other targets. Microsoft's microsoft/triton-shared repository integrates MLIR's linalg dialect to enable CPU lowering through standard MLIR passes, and triton-lang/triton-cpu is an experimental CPU backend hosted in the official organization.[^19] Work is also under way on RISC-V CPU lowering as an extension of the CPU backend.[^11]
PyTorch 2.0, released on 15 March 2023, introduced torch.compile and its new TorchInductor backend.[^5] TorchInductor uses what the PyTorch team described as "a pythonic define-by-run loop level IR to automatically map PyTorch models into generated Triton code on GPUs and C++/OpenMP on CPUs."[^5] The release post explicitly states: "As an underpinning technology of torch.compile, TorchInductor with NVIDIA and AMD GPUs will rely on OpenAI Triton deep learning compiler to generate performant code and hide low level hardware details. OpenAI Triton-generated kernels achieve performance that's on par with hand-written kernels and specialized CUDA libraries such as cuBLAS."[^5] The integration was previewed at the PyTorch Conference on 2 December 2022, and the choice of Triton was motivated by the PyTorch team observing that custom-kernel authors were increasingly adopting Triton over hand-written CUDA.[^5]
TorchInductor's Triton code path is implemented in torch/_inductor/codegen/triton.py; PyTorch also exposes torch.library.triton_op and tutorials for using user-defined Triton kernels alongside torch.compile.[^5]
FlashAttention, the IO-aware exact attention algorithm by Tri Dao and collaborators, ships a Triton-based reference implementation alongside its CUDA kernels. Phil Tillet wrote an early FlashAttention-style fused-attention tutorial inside the Triton repository at python/tutorials/06-fused-attention.py, which demonstrates the back-to-back matmul pattern that Triton 2.0 was extended to support.[^3][^6] The Dao-AILab/flash-attention repository also includes its own experimental Triton implementation at flash_attn/flash_attn_triton.py, supporting attention biases such as ALiBi; the documentation notes that the Triton port is often easier to read and modify than the hand-tuned CUDA kernels.[^6] The Triton implementations also serve as portable references that can run on AMD and Intel GPUs once the corresponding vendor backends are installed, whereas the hand-written CUDA kernels target NVIDIA hardware exclusively.[^6][^15]
Liger Kernel is an open-source collection of Triton kernels for large-language-model training released by LinkedIn in August 2024, with an accompanying arXiv technical report "Liger Kernel: Efficient Triton Kernels for LLM Training" by Pin-Lun Hsu and colleagues posted on 14 October 2024.[^8][^20] The kernels cover RMSNorm, LayerNorm, RoPE, SwiGLU, GeGLU, CrossEntropy, FusedLinearCrossEntropy, Softmax, KLDivergence, alignment losses (DPO, ORPO, SimPO, KTO), and several distillation losses.[^8] The paper reports an average 20 percent training-throughput increase and roughly 60 percent GPU-memory reduction for popular LLMs relative to HuggingFace reference implementations.[^20]
Meta's FBGEMM_GPU library, hosted at pytorch/FBGEMM, includes Triton kernels in its experimental GenAI module covering high/low-precision conversions between FP32, BF16 and FP8 with tensor-wise, row-wise and block-wise scaling, as well as Triton matmuls used in recommendation-system and generative AI workloads on both NVIDIA H100 and AMD MI300X GPUs.[^9] Recent FBGEMM_GPU releases have added a Triton "IKBO LCE" kernel and a TLX IKBO flash-attention variant, reflecting Meta's preference for shipping new performance-sensitive kernels in Triton first and only later porting them to hand-written CUDA or HIP when needed.[^9]
vLLM's inference engine uses Triton extensively. A Triton-based attention backend developed jointly by IBM Research, Red Hat and AMD provides a portable replacement for vendor-specific attention kernels; this Triton backend is the default on AMD GPUs and acts as a fallback on Intel and pre-Hopper NVIDIA cards.[^7] vLLM's paged-attention path includes a Triton kernel paged_attention_2d for decode-time attention over the block-table representation of the KV cache.[^7]
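The block-table representation mentioned above is an indirection from logical token positions to physical cache blocks. The following plain-Python sketch shows only that lookup; it is not vLLM's code, and `gather_keys`, `BLOCK_SIZE`, and the sample block ids are hypothetical choices made for illustration.

```python
BLOCK_SIZE = 4  # tokens per physical KV-cache block (hypothetical value)


def gather_keys(kv_blocks, block_table, seq_len):
    """Collect one sequence's keys in logical order via its block table."""
    keys = []
    for pos in range(seq_len):
        logical_block, offset = divmod(pos, BLOCK_SIZE)
        physical_block = block_table[logical_block]  # the indirection step
        keys.append(kv_blocks[physical_block][offset])
    return keys


# Physical pool of blocks; a sequence's blocks need not be contiguous.
kv_blocks = {
    7: ["k0", "k1", "k2", "k3"],
    2: ["k4", "k5", "k6", "k7"],
    5: ["k8", "k9", None, None],  # last block is only partly filled
}
block_table = [7, 2, 5]  # logical block index -> physical block id

keys = gather_keys(kv_blocks, block_table, seq_len=10)
```

A decode-time attention kernel performs the same lookup per tile of key positions rather than per token, and the sequence length plays the role of the mask that keeps the unused slots of the last block out of the computation.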
Triton kernels are also used in DeepSpeed-style training stacks and in numerous research codebases, including kernel libraries for quantised inference such as TurboQuant's 3-bit key / 2-bit value KV-cache kernels integrated with vLLM.[^7] Triton has been called "the great equalizer of GPU programming" because the same kernel source can compile to NVIDIA, AMD, and (via vendor backends) Intel hardware.[^15] Hugging Face's text-generation stack, several quantisation libraries, and a steady stream of research papers (for example, on custom mixture-of-experts gating, low-bit attention, and speculative-decoding kernels) all treat Triton as the first port of call for kernel authors, falling back to hand-written CUDA only when peak performance is required.[^11] AMD's AOTriton project, presented as a PyTorch Conference 2024 lightning talk, provides ahead-of-time compiled Triton kernel libraries on ROCm so that inference servers do not pay JIT compilation overhead at startup.[^11] Red Hat's emerging-tech blog has framed Triton as a "democratisation" layer that lowers the barrier between high-level AI research and accelerator programming, particularly for organisations using a mix of NVIDIA and AMD silicon.[^15]
Triton has become a foundational layer across multiple parts of the deep-learning stack. In training, it powers fused-kernel libraries such as Liger Kernel and the kernels embedded in Hugging Face Transformers' optimised attention paths, where it reduces both memory consumption and step time relative to PyTorch-eager implementations.[^8][^20] In inference, Triton is the kernel-generation backend for torch.compile and is used by vLLM for paged attention, by FBGEMM_GPU for FP8 matrix multiplications, and by quantised-inference libraries for low-bit GEMMs and dequantisation kernels.[^5][^7][^9] In research, Triton is widely used to prototype novel attention variants, mixture-of-experts routing, low-rank adapters, and custom activations, because authors can express algorithmic ideas in roughly Python-like syntax and obtain near-vendor-library performance without writing CUDA.[^1][^2]
A second class of applications is portability. Because the same Triton source compiles to NVIDIA, AMD, and (via Intel's out-of-tree backend) Intel GPUs, organisations can avoid maintaining separate hand-tuned codepaths per vendor; AMD's ROCm team describes Triton as essential to the "single codebase, many backends" model that ROCm-on-PyTorch users rely on.[^7][^15] Triton is also used to author kernels for the experimental CPU backend and for RISC-V CPUs, broadening the scope of "GPU programming language" to include accelerator-style CPU code.[^11][^19]
A third application area is education and rapid prototyping. The official Triton tutorial set walks through vector addition, fused softmax, matrix multiplication with autotuning, low-memory dropout, layer-norm forward and backward, and the original fused-attention example from Phil Tillet that pre-dates FlashAttention's CUDA kernels.[^13] University courses and online lecture series (such as the GPU MODE / CUDA MODE lecture sequence) use Triton as a stepping stone between PyTorch-eager code and CUDA, because the gap from "tensor operation" to "tile-level kernel" is much smaller than from PyTorch to a hand-written CUDA kernel.[^21]
| System | Programming abstraction | Compiler IR | Primary backends |
|---|---|---|---|
| CUDA / CUDA C++ | Scalar threads in blocks; manual shared memory | NVVM/PTX | NVIDIA only[^2] |
| Triton | Block-level tiles; automatic coalescing and shared-memory staging | Triton-IR and Triton-GPU IR on MLIR | NVIDIA, AMD, Intel (out-of-tree), CPU (experimental)[^4][^11] |
| Halide / Apache TVM | Algorithm and schedule split with explicit loop-nest transformations | Halide IR / TIR | CPU, GPU, accelerators[^21] |
| XLA | Graph-level fusion over HLO | HLO and StableHLO | TPU, GPU, CPU |
| Gluon | Lower-level tile DSL exposing layouts, shared memory, warp specialization | Reuses Triton's MLIR stack | NVIDIA, AMD[^12] |
The MAPL paper specifically contrasted Triton with Halide and TVM, observing that schedule-based DSLs "use loop synthesizers" to transform tensor computations into loop nests that are then optimised by user-defined schedules, whereas Triton operates directly on blocks of data and therefore "effectively abstracts away all the issues related to concurrency within CUDA thread blocks (e.g., memory coalescing, shared memory synchronization/conflicts, tensor core scheduling)."[^1] Triton also supports irregular and sparse iteration spaces more readily than polyhedral or schedule-based systems because its block-structured representation lets programmers handle load-balancing manually.[^21]
Several newer projects build on Triton's core ideas. Gluon, mentioned above, is the explicit-performance escape hatch for advanced kernel authors.[^12] ML-Triton is a research extension that adds multi-level compilation layers between the user kernel and the existing Triton-GPU IR. TileLang and CUDA Tile IR are independent industry efforts that target the same "tile-as-IR" abstraction. NVIDIA's CUDA Tile IR backend for Triton, contributed in 2025, lets the Triton compiler emit instructions that take advantage of newer Blackwell features without requiring users to learn a separate language.[^18]
Despite its strong adoption Triton has documented limitations.[^21] Performance, while close to vendor libraries for standard GEMM and attention shapes, can still trail hand-written CUTLASS or hand-tuned CUDA kernels for memory-bound workloads; community analyses report typical gaps of around 20 percent of peak on NVIDIA H100 for the most demanding kernels.[^22] Debugging is constrained because Triton kernels are JIT-compiled into device code; the official documentation recommends setting TRITON_INTERPRET=1 to bypass compilation and execute kernels under a NumPy-based interpreter, and using NVIDIA's compute-sanitizer for race detection.[^23] Triton's block sizes and shapes are largely treated as compile-time constants, which can lead to recompilation overhead when input shapes vary widely. The cross-vendor abstraction also imposes constraints: features that depend on architecture-specific instructions (warp specialization, cluster launch, tensor memory accelerator descriptors) historically required custom intrinsics until the Gluon lower-level dialect was added in 2025 to expose them explicitly.[^12]
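The recompilation overhead noted above follows from keeping one compiled variant per combination of compile-time constants. The toy model below makes that visible in plain Python; `SpecializingJit` and `padded_size` are hypothetical names, a counter stands in for an expensive compile, and the cache key of the real compiler is assumed to contain more than the constexpr values shown here.

```python
class SpecializingJit:
    """Toy JIT wrapper that builds one variant per set of constexpr values."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = {}
        self.compile_count = 0

    def __call__(self, *args, **constexprs):
        key = tuple(sorted(constexprs.items()))
        if key not in self.cache:
            # Stand-in for an expensive compile of a specialised binary.
            self.compile_count += 1
            self.cache[key] = lambda *a, _c=dict(constexprs): self.fn(*a, **_c)
        return self.cache[key](*args)


@SpecializingJit
def padded_size(n_elements, BLOCK_SIZE):
    """Round n_elements up to a whole number of blocks."""
    return -(-n_elements // BLOCK_SIZE) * BLOCK_SIZE


first = padded_size(1000, BLOCK_SIZE=256)   # builds the BLOCK_SIZE=256 variant
second = padded_size(5000, BLOCK_SIZE=256)  # only a runtime argument changed: cache hit
third = padded_size(1000, BLOCK_SIZE=512)   # constexpr changed: builds a second variant
```

Three calls produce two builds, because only a change in the compile-time constant misses the cache. Workloads whose shapes force many distinct block sizes pay that build cost once per variant.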
The Modular team's critique of Python-embedded DSLs notes that Triton "trades performance for productivity, which makes it easier to write GPU code, but it also prevents Triton from achieving peak efficiency," and observes that the same trade-off plays out across other Python eDSLs that abstract away thread-level control.[^22] A separate practical limitation is that autotuning over large configuration spaces can take many minutes for each new input shape, which motivates ahead-of-time compilation projects such as AMD's AOTriton.[^11] Triton's compilation cache, while convenient, has been a source of subtle correctness and version-skew bugs in PyTorch CI pipelines, particularly when the cache outlives a Triton upgrade.[^5]
The project is licensed under the MIT license and was originally hosted at github.com/openai/triton. Around 2024 the repository was moved to the community-style triton-lang/triton GitHub organization; the codebase is roughly 38 percent MLIR, 33 percent Python, and 27 percent C++, with contributions from OpenAI, Meta, AMD, Intel, NVIDIA, IBM, Microsoft, and Red Hat.[^3][^11] An annual Triton Developer Conference began in 2024 in Silicon Valley, with talks from chip vendors (Intel, AMD, NVIDIA, Qualcomm) and cloud providers (Microsoft, AWS).[^11] Vendor-specific backends are maintained in separate repositories: intel/intel-xpu-backend-for-triton, microsoft/triton-shared, and triton-lang/triton-cpu.[^16][^19]
The release cadence has been steady. Triton 2.0 introduced the MLIR rewrite and flash-attention-style back-to-back matmul support. Subsequent 2.x and 3.x releases brought Hopper FP8 tensor-core support via wgmma, the AMD ROCm backend, autotuning improvements, an interpreter mode for debugging, and incremental work on cluster launch and tensor memory accelerator support.[^3] Binary wheels are published on PyPI for CPython 3.10 through 3.14.[^3] The Gluon programming model, introduced in 2025, reuses Triton's MLIR stack but exposes layouts, shared-memory allocations, warp specialization and per-thread operations through Linear Layout, providing a lower-level escape hatch for kernels that need to reach the last few percent of hardware peak.[^12]