Triton (OpenAI GPU programming language)

Updated Jun 23, 2026

Triton is an open-source, Python-embedded programming language and compiler, originally created by Philippe Tillet and released by OpenAI in July 2021, that lets researchers write high-performance custom GPU kernels without writing CUDA C++. It is widely described as a more portable, productivity-oriented complement to CUDA: a Triton kernel is a Python function annotated with @triton.jit in which the programmer reasons about block-level tiles while the compiler handles memory coalescing, shared-memory management, and thread scheduling automatically.^[1]^[2] Tillet developed the first version during his PhD at Harvard University and presented it at the MAPL 2019 workshop; it was then re-implemented at OpenAI and released as Triton 1.0 on 28 July 2021 under the MIT license.^[1]^[2] Since 2022 the compiler has been rebuilt on top of MLIR, and it now lowers Triton-IR through Triton-GPU IR and LLVM-IR to NVIDIA PTX or AMD AMDGCN device code.^[4] Triton is the kernel-generation backend used by PyTorch's TorchInductor, introduced in PyTorch 2.0 (March 2023), and it underpins a large family of community kernel libraries including FlashAttention's Triton reference kernels, LinkedIn's Liger Kernel, FBGEMM_GPU's GenAI kernels, and the vLLM attention backend.^[5]^[6]^[7]^[8]^[9]

Note: This article covers the Triton GPU programming language. It is distinct from NVIDIA Triton Inference Server, an unrelated model-serving product.

What is Triton and where did it come from?

Writing efficient custom kernels for graphics processing units has long required CUDA C++ expertise: programmers manually tile work across thread blocks, stage data into shared memory, coordinate warp-level scheduling, and tune for specific compute capabilities. The first formal description of Triton, "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations" by Philippe Tillet, H.T. Kung, and David Cox, was published in the Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL '19), held on 22 June 2019 in Phoenix, Arizona alongside PLDI.^[1] Tillet, then a Harvard PhD student under Kung and Cox, had begun the project in 2018 after working on auto-tuners for matrix multiplications in CUDA and on the ViennaCL OpenCL library; he completed his PhD in 2020 with a dissertation on compilers for blocked algorithms on GPUs.^[10]

The MAPL paper argued that "tile-level operations and optimizations" should be first-class concepts inside a compiler intermediate representation, and showed that adding "just a few data- and control-flow extensions to LLVM-IR could enable various tile-level optimization passes which jointly lead to performance on-par with vendor libraries."^[1] The original prototype shipped as a C++/Python library and was hosted at github.com/ptillet/triton and later github.com/openai/triton.^[2]

Tillet joined OpenAI full-time in 2020 to continue Triton's development as part of OpenAI's accelerator research, after holding internships and research positions at NVIDIA and AMD that influenced his thinking about GEMM kernel optimisation.^[10] On 28 July 2021 OpenAI announced Triton 1.0 in the blog post "Introducing Triton: Open-source GPU programming for neural networks," authored by Philippe Tillet.^[2] The post described Triton as "an open-source Python-like programming language which enables researchers with no CUDA experience to write highly efficient GPU code" and highlighted that "it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLAS in under 25 lines of code" on recent NVIDIA hardware.^[2] The announcement positioned Triton as a complement rather than a wholesale replacement for CUDA: researchers writing custom fused activations, normalisation kernels, or attention variants would no longer need to manage thread-level shared-memory tiling or warp-synchronous reductions by hand, while still being able to target the same NVIDIA tensor cores used by cuBLAS and cuDNN.^[2]

Between late 2022 and 2023 the Triton compiler underwent a major internal rewrite to use the MLIR infrastructure. The Triton 2.0 release notes describe a "backend rewritten to use MLIR" along with support for kernels that contain back-to-back matmuls (the FlashAttention pattern).^[3] PyTorch 2.0, released on 15 March 2023, made Triton the default GPU code-generator inside the new TorchInductor backend of torch.compile.^[5] The repository was moved from the openai/ GitHub organization to a community triton-lang/ organization, where development continues with contributions from OpenAI, Meta, AMD, Intel, NVIDIA, IBM and Red Hat.^[3]^[11]

Milestone	Date	Notes
MAPL '19 paper	22 June 2019	Tillet, Kung, Cox introduce Triton-IR and tile-level optimization passes^[1]
Tillet joins OpenAI	2020	Full-time work on Triton begins inside OpenAI; PhD completed at Harvard University^[10]
Triton 1.0 release	28 July 2021	OpenAI blog post by Phil Tillet; Python-embedded DSL; FP16 GEMM matching cuBLAS in under 25 lines^[2]
MLIR-based rewrite	2022, into Triton 2.x	Backend rewritten on MLIR; supports back-to-back matmuls for FlashAttention^[3]^[4]
PyTorch 2.0 / TorchInductor	15 March 2023	Triton becomes default GPU code-generator for torch.compile on NVIDIA and AMD^[5]
ROCm 6.2 AMD backend	2024	First-class AMD GPU support with AMDGCN lowering^[11]^[15]
Liger Kernel release	August 2024	LinkedIn open-sources Triton kernel library for LLM training^[8]^[20]
Triton Developer Conference	September 2024	Talks from Intel, AMD, NVIDIA, Qualcomm, Microsoft and AWS in Silicon Valley^[11]
Gluon lower-level dialect	2025	Gluon programming model exposing layouts, shared memory and warp specialization as a sibling to Triton^[12]

How does the Triton programming model differ from CUDA?

Triton flips the conventional CUDA model. CUDA programs are written as scalar functions executed by individual threads inside thread blocks, with the programmer responsible for memory layouts, coalescing, shared-memory staging and warp-level synchronization. A Triton kernel, by contrast, is written from the perspective of an entire program instance, and the unit of data is a block (or tile): a statically shaped multi-dimensional array distributed across threads automatically by the compiler.^[1]^[2]

The OpenAI announcement characterised the difference as "blocked program, scalar threads" rather than CUDA's "scalar program, blocked threads": the programmer reasons about whole tiles, while the Triton compiler decides which threads in a warp will own which elements and inserts the necessary loads, stores, and synchronizations.^[2] This abstraction is what the original paper called the "tile" abstraction; it lets the compiler apply optimisations such as automatic shared-memory staging of operands to computationally intensive block operations, automatic vectorisation of memory accesses, and coalescing of contiguous loads and stores.^[1]^[2]

Kernels are written as ordinary Python functions decorated with @triton.jit. A canonical example from the Triton tutorial implements element-wise vector addition:^[13]

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements,
               BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Mask off lanes that fall past the end of the input.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

Several language conventions are visible in this listing.^[13] tl.program_id(axis=0) returns the index of the kernel instance along a launch grid axis. BLOCK_SIZE: tl.constexpr marks a compile-time constant, enabling shape-specialised code generation. tl.arange, tl.load, and tl.store operate on tile-shaped tensors; the optional mask argument prevents out-of-bounds memory accesses on the boundary of the input. The kernel launch grid is computed in Python and passed to the JIT-compiled callable.^[13]
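To make the launch grid and the mask concrete, the sketch below emulates the vector-add listing in plain Python on the CPU. It is not Triton code: add_program and launch_add are hypothetical names, and the loop over pid stands in for program instances that a GPU would run in parallel.

```python
def cdiv(a, b):
    """Ceiling division, used to size the launch grid."""
    return (a + b - 1) // b


def add_program(pid, x, y, out, n_elements, block_size):
    """CPU stand-in for one program instance of the vector-add kernel."""
    block_start = pid * block_size
    offsets = [block_start + i for i in range(block_size)]  # like tl.arange
    mask = [off < n_elements for off in offsets]
    for off, keep in zip(offsets, mask):
        if keep:  # masked-off lanes neither load nor store
            out[off] = x[off] + y[off]


def launch_add(x, y, block_size=4):
    """Run one program instance per block, sequentially instead of in parallel."""
    n_elements = len(x)
    out = [0] * n_elements
    grid = cdiv(n_elements, block_size)
    for pid in range(grid):
        add_program(pid, x, y, out, n_elements, block_size)
    return out, grid


result, grid = launch_add(list(range(10)), [10] * 10, block_size=4)
```

With 10 elements and a block size of 4, the grid holds three program instances, and the last one masks off two of its four lanes.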

The Triton standard library exposes block-level primitives such as tl.dot (which the compiler lowers to architecture-specific tensor-core MMA or WMMA instructions), tl.atomic_* operations, reductions like tl.sum, broadcasting, and pointer arithmetic on tensor-shaped pointers. Performance tuning is done primarily through triton.autotune, which compiles and benchmarks a kernel across a list of configuration dictionaries (block sizes, number of warps, number of pipeline stages) and caches the winner.^[14] The blog post notes that the "purpose of Triton is to fully automate these optimizations, so that developers can better focus on the high-level logic of their parallel code."^[2]
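The benchmark-and-cache behaviour described for triton.autotune can be modelled in a few lines of plain Python. This is a simplified sketch, not the library's implementation: the autotune function, the cache dictionary, and the fake_benchmark timing function are all hypothetical, and real autotuning times compiled kernels on the GPU.

```python
def autotune(configs, benchmark, cache, key):
    """Return the fastest config for `key`, benchmarking only on a cache miss."""
    if key not in cache:
        timings = [(benchmark(cfg), i) for i, cfg in enumerate(configs)]
        _, best_index = min(timings)
        cache[key] = configs[best_index]
    return cache[key]


# Hypothetical configuration space (block size, warps, pipeline stages).
configs = [
    {"BLOCK_SIZE": 64, "num_warps": 2, "num_stages": 2},
    {"BLOCK_SIZE": 128, "num_warps": 4, "num_stages": 3},
    {"BLOCK_SIZE": 256, "num_warps": 8, "num_stages": 4},
]
calls = []


def fake_benchmark(cfg):
    """Deterministic stand-in for a timing run; pretends 128 is the sweet spot."""
    calls.append(cfg["BLOCK_SIZE"])
    return abs(cfg["BLOCK_SIZE"] - 128) + 1.0


cache = {}
best = autotune(configs, fake_benchmark, cache, key=("n_elements", 4096))
again = autotune(configs, fake_benchmark, cache, key=("n_elements", 4096))
```

The second call hits the cache, so each configuration is benchmarked only once per key; a new key (for example, a different problem size) would trigger a fresh sweep.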

Multi-dimensional kernels follow the same pattern: 1-D index ranges are constructed with tl.arange and then broadcast against each other to form 2-D tile coordinates. A typical matrix multiplication kernel computes a BLOCK_M-by-BLOCK_N output tile by sweeping the K dimension in BLOCK_K-sized chunks, accumulating into a BLOCK_M-by-BLOCK_N accumulator and invoking tl.dot once per iteration; the compiler is responsible for pipelining global-memory loads through shared memory and for issuing the appropriate tensor-core instructions on the target architecture.^[13]^[4] Programmers express only the tile-level dataflow; they do not write code per CUDA thread, per warp, or per memory bank.^[2]
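The K-sweep structure of such a matrix multiplication can be sketched in plain Python over lists of lists. This is a CPU illustration of the tiling scheme only, under the assumption that each (m0, n0) pair corresponds to one program instance; tiled_matmul is a hypothetical name, and the inner double loop stands in for a single tl.dot call.

```python
def tiled_matmul(a, b, block_m=2, block_n=2, block_k=2):
    """Multiply list-of-lists matrices tile by tile, sweeping K in chunks."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, block_m):          # one (m0, n0) pair per "program"
        for n0 in range(0, n, block_n):
            rows = range(m0, min(m0 + block_m, m))   # bounds act as the mask
            cols = range(n0, min(n0 + block_n, n))
            acc = {(i, j): 0.0 for i in rows for j in cols}  # tile accumulator
            for k0 in range(0, k, block_k):  # sweep the K dimension
                ks = range(k0, min(k0 + block_k, k))
                for i in rows:               # stands in for one tl.dot call
                    for j in cols:
                        acc[i, j] += sum(a[i][p] * b[p][j] for p in ks)
            for (i, j), value in acc.items():
                c[i][j] = value              # write the finished tile back
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0], [0, 1], [1, 1]]
c = tiled_matmul(a, b)
```

Because 3 is not a multiple of the block sizes, the edge tiles are clipped by the min(...) bounds, which plays the role that load and store masks play in a real kernel.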

How does the Triton compiler work?

The Triton compiler is organised as a multi-stage lowering pipeline through several MLIR dialects, before exiting to LLVM-IR and then to vendor-specific assembly.^[4]

Python AST to Triton-IR. The @triton.jit decorator captures the Python function's AST and translates it into Triton-IR, an unoptimised, machine-independent dialect inspired by LLVM IR but with first-class tile-typed values such as tensor<1024x!tt.ptr<f32>>.^[4] Triton-IR has been an MLIR dialect since the 2022 rewrite.^[4]
Triton-IR to Triton-GPU IR. The optimiser lowers Triton-IR to Triton-GPU IR (TritonGPU dialect), which annotates tensors with hardware-specific layouts such as #blocked, slice, dot_op, shared, nvidia_mma, amd_mfma, and amd_wmma.^[4] Passes at this level distribute tensors across warps or wavefronts, schedule loads through shared memory, and pipeline matmul-heavy loops.^[4]
Triton-GPU IR to LLVM-IR. Triton-GPU IR is converted to LLVM-IR with explicit pointer arithmetic, vector extracts, and inline assembly for non-portable memory operations.^[4]
LLVM-IR to device code. LLVM compiles to NVIDIA PTX (further assembled into cubin) or to AMD AMDGCN assembly (compiled to a hsaco binary), depending on the active backend.^[4]^[15] Vendor backends contribute additional dialects (for example, the Intel XPU backend adds the GenX dialect) to help with lowering.^[16]

Intermediate forms at every stage can be inspected programmatically via the compiled kernel's asm dictionary, which exposes the Triton-IR, Triton-GPU IR, LLVM-IR, PTX, and final binary.^[4]
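The stage ordering can be summarised as a small lookup. The short stage names below (ttir, ttgir, llir, ptx, cubin, amdgcn, hsaco) are assumed labels chosen to match the artefacts named above, and lowering_stages is a hypothetical helper, not part of Triton; check the keys of the asm dictionary in your installed version before relying on them.

```python
# Stage names are assumptions modelled on the pipeline description, not an API.
COMMON_STAGES = ["ttir", "ttgir", "llir"]   # Triton-IR, Triton-GPU IR, LLVM-IR
BACKEND_STAGES = {
    "nvidia": ["ptx", "cubin"],             # PTX, then the assembled binary
    "amd": ["amdgcn", "hsaco"],             # AMDGCN assembly, then hsaco binary
}


def lowering_stages(backend):
    """Return the ordered artefacts produced when compiling for `backend`."""
    if backend not in BACKEND_STAGES:
        raise ValueError(f"unknown backend: {backend!r}")
    return COMMON_STAGES + BACKEND_STAGES[backend]


nvidia = lowering_stages("nvidia")
amd = lowering_stages("amd")
```

Both backends share the first three stages and diverge only after LLVM-IR, which is why the same Triton source can target either vendor.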

Internal optimisation passes performed during these lowerings include software pipelining of matmul loops (overlapping global-memory loads, shared-memory stores, and tensor-core operations), allocation of shared-memory buffers, layout conversion to insert minimal data movement between layouts, register pressure analysis, and architecture-specific instruction selection for tl.dot such as Ampere mma.sync.m16n8k16, Hopper wgmma.mma_async, and AMD MFMA/WMMA intrinsics.^[4]^[17] These passes are written as MLIR rewrites and are gated on hardware capabilities so that the same Triton source can produce different (and architecturally appropriate) device binaries depending on the target.^[4]

Which hardware backends does Triton support?

Triton ships first-class backends for NVIDIA and AMD GPUs, and several vendor-maintained out-of-tree backends extend it to other accelerators.^[11]

NVIDIA. The reference backend targets compute capability 8.0 and above (Ampere and newer).^[11] On NVIDIA Hopper H100 hardware Triton uses the asynchronous warp-group wgmma instruction for FP8 and FP16 tensor-core operations, which is required to extract peak throughput from H100's tensor cores.^[17] NVIDIA has also contributed Blackwell tensor-core support to Triton, including a CUDA Tile IR backend that targets newer instructions on Blackwell GPUs.^[18]

AMD. Triton's AMD backend targets ROCm 6.2 and newer; it lowers Triton-GPU IR through LLVM-IR to AMDGCN assembly and produces hsaco binaries through the AMD ROCm toolchain.^[4]^[11] AMD describes Triton as enabling kernels written once that run on both NVIDIA and AMD hardware, reducing the need for manual CUDA-to-HIP porting.^[15] AMD has integrated Triton tightly with vLLM, making the Triton attention backend the default on AMD GPUs.^[7]

Intel. Intel maintains an out-of-tree backend at intel/intel-xpu-backend-for-triton that targets Intel GPUs through an Intel-specific GenX MLIR dialect, lowering to LLVM and SPIR-V; it uses 2D block I/O and DPAS tensor-core operations for tl.dot.^[16]

CPU and other targets. Microsoft's microsoft/triton-shared repository integrates MLIR's linalg dialect to enable CPU lowering through standard MLIR passes, and triton-lang/triton-cpu is an experimental CPU backend hosted in the official organization.^[19] Work is also under way on RISC-V CPU lowering as an extension of the CPU backend.^[11]

What is Triton used for?

Triton has become a foundational layer across multiple parts of the deep-learning stack.

PyTorch and TorchInductor

PyTorch 2.0, released on 15 March 2023, introduced torch.compile and its new TorchInductor backend.^[5] TorchInductor uses what the PyTorch team described as "a pythonic define-by-run loop level IR to automatically map PyTorch models into generated Triton code on GPUs and C++/OpenMP on CPUs."^[5] The release post explicitly states: "As an underpinning technology of torch.compile, TorchInductor with NVIDIA and AMD GPUs will rely on OpenAI Triton deep learning compiler to generate performant code and hide low level hardware details. OpenAI Triton-generated kernels achieve performance that's on par with hand-written kernels and specialized CUDA libraries such as cuBLAS."^[5] The integration was previewed at the PyTorch Conference on 2 December 2022, and the choice of Triton was motivated by the PyTorch team observing that custom-kernel authors were increasingly adopting Triton over hand-written CUDA.^[5]

TorchInductor's Triton code path is implemented in torch/_inductor/codegen/triton.py; PyTorch also exposes torch.library.triton_op and tutorials for using user-defined Triton kernels alongside torch.compile.^[5]

FlashAttention reference kernels

FlashAttention, the IO-aware exact attention algorithm by Tri Dao and collaborators, ships a Triton-based reference implementation alongside its CUDA kernels. Phil Tillet wrote an early FlashAttention-style fused-attention tutorial inside the Triton repository at python/tutorials/06-fused-attention.py, which demonstrates the back-to-back matmul pattern that Triton 2.0 was extended to support.^[3]^[6] The Dao-AILab/flash-attention repository also includes its own experimental Triton implementation at flash_attn/flash_attn_triton.py, supporting attention biases such as ALiBi; the documentation notes that the Triton port is often easier to read and modify than the hand-tuned CUDA kernels.^[6] The Triton implementations also serve as portable references that can run on AMD and Intel GPUs once the corresponding vendor backends are installed, whereas the hand-written CUDA kernels target NVIDIA hardware exclusively.^[6]^[15]

Liger Kernel

Liger Kernel is an open-source collection of Triton kernels for large-language-model training released by LinkedIn in August 2024, with an accompanying arXiv technical report "Liger Kernel: Efficient Triton Kernels for LLM Training" by Pin-Lun Hsu and colleagues posted on 14 October 2024.^[8]^[20] The kernels cover RMSNorm, LayerNorm, RoPE, SwiGLU, GeGLU, CrossEntropy, FusedLinearCrossEntropy, Softmax, KLDivergence, alignment losses (DPO, ORPO, SimPO, KTO), and several distillation losses.^[8] The paper reports an average 20 percent training-throughput increase and roughly 60 percent GPU-memory reduction for popular LLMs relative to HuggingFace reference implementations.^[20]

FBGEMM_GPU

Meta's FBGEMM_GPU library, hosted at pytorch/FBGEMM, includes Triton kernels in its experimental GenAI module covering high/low-precision conversions between FP32, BF16 and FP8 with tensor-wise, row-wise and block-wise scaling, as well as Triton matmuls used in recommendation-system and generative AI workloads on both NVIDIA H100 and AMD MI300X GPUs.^[9] Recent FBGEMM_GPU releases have added a Triton "IKBO LCE" kernel and a TLX IKBO flash-attention variant, reflecting Meta's preference for shipping new performance-sensitive kernels in Triton first and only later porting them to hand-written CUDA or HIP when needed.^[9]
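As a rough illustration of row-wise scaling, the sketch below computes one scale factor per row so that every scaled value fits the FP8 E4M3 range, whose largest finite value is 448. It is not FBGEMM_GPU code: rowwise_scale and dequantize are hypothetical names, and the final rounding to 8 bits is deliberately omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format


def rowwise_scale(matrix):
    """Return (scaled_rows, scales) so that each row fits the FP8 E4M3 range."""
    scaled_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0
        scaled_rows.append([v / scale for v in row])
        scales.append(scale)
    return scaled_rows, scales


def dequantize(scaled_rows, scales):
    """Undo the scaling: multiply every row by its own scale factor."""
    return [[v * s for v in row] for row, s in zip(scaled_rows, scales)]


x = [[0.5, -2.0, 1.0], [1000.0, 250.0, -500.0]]
scaled, scales = rowwise_scale(x)
restored = dequantize(scaled, scales)
```

Tensor-wise scaling would use a single scale for the whole matrix, and block-wise scaling one scale per sub-block; per-row scales keep a row of small values from losing precision to a large outlier elsewhere in the tensor.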

vLLM

vLLM's inference engine uses Triton extensively. A Triton-based attention backend developed jointly by IBM Research, Red Hat and AMD provides a portable replacement for vendor-specific attention kernels; this Triton backend is the default on AMD GPUs and acts as a fallback on Intel and pre-Hopper NVIDIA cards.^[7] vLLM's paged-attention path includes a Triton kernel paged_attention_2d for decode-time attention over the block-table representation of the KV cache.^[7]
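The block-table indirection behind paged attention can be illustrated without any GPU code. In this sketch, which is not vLLM's implementation, gather_keys is a hypothetical helper and the cache contents are placeholder strings; it shows only how a logical token position is translated to a physical block and an offset within it.

```python
def gather_keys(block_table, kv_blocks, seq_len, block_size):
    """Collect the first `seq_len` cached keys of one sequence, in order."""
    keys = []
    for pos in range(seq_len):
        logical_block, offset = divmod(pos, block_size)
        physical_block = block_table[logical_block]  # indirection step
        keys.append(kv_blocks[physical_block][offset])
    return keys


# Hypothetical cache: 4 physical blocks of 2 slots each. This sequence owns
# physical blocks 3, 0 and 2 (in that logical order) and holds 5 tokens.
kv_blocks = [["k2", "k3"], ["x", "x"], ["k4", None], ["k0", "k1"]]
block_table = [3, 0, 2]
keys = gather_keys(block_table, kv_blocks, seq_len=5, block_size=2)
```

A decode-time attention kernel performs the same lookup per tile of keys and values, which lets the cache for one sequence live in non-contiguous memory.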

Other downstream users

Triton kernels are also used in DeepSpeed-style training stacks and in numerous research codebases, including kernel libraries for quantised inference such as TurboQuant's 3-bit key / 2-bit value KV-cache kernels integrated with vLLM.^[7] Triton has been called "the great equalizer of GPU programming" because the same kernel source can compile to NVIDIA, AMD, and (via vendor backends) Intel hardware.^[15] Hugging Face's text-generation stack, several quantisation libraries, and a steady stream of research papers (for example, custom mixture-of-experts gating, low-bit attention, and speculative-decoding kernels) all rely on Triton as their kernel author's first port of call before falling back to hand-written CUDA when peak performance is required.^[11] AMD's AOTriton project, presented as a PyTorch Conference 2024 lightning talk, provides ahead-of-time compiled Triton kernel libraries on ROCm so that inference servers do not pay JIT compilation overhead at startup.^[11] Red Hat's emerging-tech blog has framed Triton as a "democratisation" layer that lowers the barrier between high-level AI research and accelerator programming, particularly for organisations using a mix of NVIDIA and AMD silicon.^[15]

What kinds of workloads run on Triton?

In training, Triton powers fused-kernel libraries such as Liger Kernel and the kernels embedded in Hugging Face Transformers' optimised attention paths, where it reduces both memory consumption and step time relative to PyTorch-eager implementations.^[8]^[20] In inference, Triton is the kernel-generation backend for torch.compile and is used by vLLM for paged attention, by FBGEMM_GPU for FP8 matrix multiplications, and by quantised-inference libraries for low-bit GEMMs and dequantisation kernels.^[5]^[7]^[9] In research, Triton is widely used to prototype novel attention variants, mixture-of-experts routing, low-rank adapters, and custom activations, because authors can express algorithmic ideas in roughly Python-like syntax and obtain near-vendor-library performance without writing CUDA.^[1]^[2]

A second class of applications is portability. Because the same Triton source compiles to NVIDIA, AMD, and (via Intel's out-of-tree backend) Intel GPUs, organisations can avoid maintaining separate hand-tuned codepaths per vendor; AMD's ROCm team describes Triton as essential to the "single codebase, many backends" model that ROCm-on-PyTorch users rely on.^[7]^[15] Triton is also used to author kernels for the experimental CPU backend and for RISC-V CPUs, broadening the scope of "GPU programming language" to include accelerator-style CPU code.^[11]^[19]

A third application area is education and rapid prototyping. The official Triton tutorial set walks through vector addition, fused softmax, matrix multiplication with autotuning, low-memory dropout, layer-norm forward and backward, and the original fused-attention example from Phil Tillet that pre-dates FlashAttention's CUDA kernels.^[13] University courses and online lecture series (such as the GPU MODE / CUDA MODE lecture sequence) use Triton as a stepping stone between PyTorch-eager code and CUDA, because the gap from "tensor operation" to "tile-level kernel" is much smaller than from PyTorch to a hand-written CUDA kernel.^[21]

How does Triton compare with CUDA, Halide, TVM, and XLA?

System	Programming abstraction	Compiler IR	Primary backends
CUDA / CUDA C++	Scalar threads in blocks; manual shared memory	NVVM/PTX	NVIDIA only^[2]
Triton	Block-level tiles; automatic coalescing and shared-memory staging	Triton-IR and Triton-GPU IR on MLIR	NVIDIA, AMD, Intel (out-of-tree), CPU (experimental)^[4]^[11]
Halide / Apache TVM	Algorithm and schedule split with explicit loop-nest transformations	Halide IR / TIR	CPU, GPU, accelerators^[21]
XLA	Graph-level fusion over HLO	HLO and StableHLO	TPU, GPU, CPU
Gluon	Lower-level tile DSL exposing layouts, shared memory, warp specialization	Reuses Triton's MLIR stack	NVIDIA, AMD^[12]

The MAPL paper specifically contrasted Triton with Halide and TVM, observing that schedule-based DSLs "use loop synthesizers" to transform tensor computations into loop nests that are then optimised by user-defined schedules, whereas Triton operates directly on blocks of data and therefore "effectively abstracts away all the issues related to concurrency within CUDA thread blocks (e.g., memory coalescing, shared memory synchronization/conflicts, tensor core scheduling)."^[1] Triton also supports irregular and sparse iteration spaces more readily than polyhedral or schedule-based systems because its block-structured representation lets programmers handle load-balancing manually.^[21]

Several newer projects build on Triton's core ideas. Gluon, mentioned above, is the explicit-performance escape hatch for advanced kernel authors.^[12] ML-Triton is a research extension that adds multi-level compilation layers between the user kernel and the existing Triton-GPU IR. TileLang and CUDA Tile IR are independent industry efforts that target the same "tile-as-IR" abstraction. NVIDIA's CUDA Tile IR backend for Triton, contributed in 2025, lets the Triton compiler emit instructions that take advantage of newer Blackwell features without requiring users to learn a separate language.^[18]

What are Triton's limitations?

Despite its strong adoption, Triton has documented limitations.^[21] Performance, while close to vendor libraries for standard GEMM and attention shapes, can still trail hand-written CUTLASS or hand-tuned CUDA kernels for memory-bound workloads; community analyses report typical gaps of around 20 percent of peak on NVIDIA H100 for the most demanding kernels.^[22] Debugging is constrained because Triton kernels are JIT-compiled into device code; the official documentation recommends setting TRITON_INTERPRET=1 to bypass compilation and execute kernels under a NumPy-based interpreter, and using NVIDIA's compute-sanitizer for race detection.^[23] Triton's block sizes and shapes are largely treated as compile-time constants, which can lead to recompilation overhead when input shapes vary widely. The cross-vendor abstraction also imposes constraints: features that depend on architecture-specific instructions (warp specialization, cluster launch, tensor memory accelerator descriptors) historically required custom intrinsics until the Gluon lower-level dialect was added in 2025 to expose them explicitly.^[12]
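The interpreter switch mentioned above is an environment variable, so it can be set from the shell or from Python. The helper below is a minimal sketch: enable_triton_interpreter is a hypothetical name, and calling it before any Triton kernel is defined (most safely, before importing triton) is an assumption about when the variable is read.

```python
import os


def enable_triton_interpreter():
    """Request interpreter mode; call before any Triton kernel is defined."""
    os.environ["TRITON_INTERPRET"] = "1"
    return os.environ["TRITON_INTERPRET"]


mode = enable_triton_interpreter()
```

Setting the variable on the command line when launching a script has the same effect and avoids any ordering concerns inside the script.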

The Modular team's critique of Python-embedded DSLs notes that Triton "trades performance for productivity, which makes it easier to write GPU code, but it also prevents Triton from achieving peak efficiency," and observes that the same trade-off plays out across other Python eDSLs that abstract away thread-level control.^[22] A separate practical limitation is that autotuning over large configuration spaces can take many minutes on each new input shape, motivating the use of ahead-of-time compilation projects such as AMD's AOTriton.^[11] Triton's compilation cache, while convenient, has been a source of subtle correctness and version-skew bugs in PyTorch CI pipelines, particularly when the cache outlives a Triton upgrade.^[5]

Who governs Triton and is it open source?

The project is licensed under the MIT license and was originally hosted at github.com/openai/triton. Around 2024 the repository was moved to the community-run triton-lang GitHub organization, where it lives at triton-lang/triton; the codebase is roughly 38 percent MLIR, 33 percent Python, and 27 percent C++, with contributions from OpenAI, Meta, AMD, Intel, NVIDIA, IBM, Microsoft, and Red Hat.^[3]^[11] An annual Triton Developer Conference began in 2024 in Silicon Valley, with talks from chip vendors (Intel, AMD, NVIDIA, Qualcomm) and cloud providers (Microsoft, AWS).^[11] Vendor-specific backends are maintained in separate repositories: intel/intel-xpu-backend-for-triton, microsoft/triton-shared, and triton-lang/triton-cpu.^[16]^[19]

The release cadence has been steady. Triton 2.0 introduced the MLIR rewrite and flash-attention-style back-to-back matmul support. Subsequent 2.x and 3.x releases brought Hopper FP8 tensor-core support via wgmma, the AMD ROCm backend, autotuning improvements, an interpreter mode for debugging, and incremental work on cluster launch and tensor memory accelerator support.^[3] Binary wheels are published on PyPI for CPython 3.10 through 3.14.^[3] The Gluon programming model, introduced in 2025, reuses Triton's MLIR stack but exposes layouts, shared-memory allocations, warp specialization and per-thread operations through Linear Layout, providing a lower-level escape hatch for kernels that need to reach the last few percent of hardware peak.^[12]

See also

PyTorch integrates Triton through TorchInductor in torch.compile.^[5]
CUDA is the dominant NVIDIA-only alternative; Triton is positioned as a more portable, productivity-oriented complement.^[2]
FlashAttention and FlashAttention 3 ship Triton reference implementations alongside their hand-tuned CUDA kernels.^[6]
XLA and Apache TVM occupy adjacent points in the design space, operating at the graph level rather than the kernel level.^[21]
vLLM's attention backend uses Triton kernels for paged attention and quantised inference.^[7]
DeepSpeed training pipelines interoperate with Triton-based fused kernels such as Liger Kernel.^[8]

References

1. Tillet, Philippe; Kung, H. T.; Cox, David. "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations." Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL '19), 2019-06-22. https://doi.org/10.1145/3315508.3329973. Accessed 2026-05-21.
2. Tillet, Philippe (OpenAI). "Introducing Triton: Open-source GPU programming for neural networks." OpenAI Blog, 2021-07-28. https://openai.com/index/triton/. Accessed 2026-05-21.
3. triton-lang. "README.md, Triton repository." GitHub. https://github.com/triton-lang/triton. Accessed 2026-05-21.
4. PyTorch Foundation. "Triton Kernel Compilation Stages." PyTorch Blog, 2024-10-30. https://pytorch.org/blog/triton-kernel-compilation-stages/. Accessed 2026-05-21.
5. PyTorch Foundation. "PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever." PyTorch Blog, 2023-03-15. https://pytorch.org/blog/pytorch-2-0-release/. Accessed 2026-05-21.
6. Dao-AILab. "flash-attention repository README." GitHub. https://github.com/Dao-AILab/flash-attention. Accessed 2026-05-21.
7. vLLM Project. "vLLM Triton Attention Backend Deep Dive." vLLM Blog, 2026-03-04. https://blog.vllm.ai/2026/03/04/vllm-triton-backend-deep-dive.html. Accessed 2026-05-21.
8. LinkedIn. "Liger-Kernel: Efficient Triton Kernels for LLM Training, repository README." GitHub. https://github.com/linkedin/Liger-Kernel. Accessed 2026-05-21.
9. PyTorch (Meta). "FBGEMM_GPU experimental GenAI README." GitHub. https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/experimental/gen_ai/README.md. Accessed 2026-05-21.
10. MLSys. "Chat with Philippe Tillet, Author of Triton." mlsys.ai. https://www.mlsys.ai/papers/tillet.html. Accessed 2026-05-21.
11. DigitalOcean Community. "Programming AI Accelerators with Triton." DigitalOcean Tutorials. https://www.digitalocean.com/community/tutorials/introduction-to-triton-programming. Accessed 2026-05-21.
12. triton-lang. "Gluon Overview." Triton Documentation. https://triton-lang.org/main/gluon/index.html. Accessed 2026-05-21.
13. triton-lang. "Vector Addition tutorial." Triton Documentation. https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html. Accessed 2026-05-21.
14. triton-lang. "triton.autotune API reference." Triton Documentation. https://triton-lang.org/main/python-api/generated/triton.autotune.html. Accessed 2026-05-21.
15. AMD ROCm. "Unlock Peak Performance on AMD GPUs with Triton Kernel Optimizations." ROCm Blogs. https://rocm.blogs.amd.com/software-tools-optimization/kernel-development-optimizations-with-triton-on-/README.html. Accessed 2026-05-21.
16. Intel. "intel-xpu-backend-for-triton README and ARCHITECTURE." GitHub. https://github.com/intel/intel-xpu-backend-for-triton. Accessed 2026-05-21.
17. PyTorch Foundation. "Accelerating Llama3 FP8 Inference with Triton Kernels." PyTorch Blog. https://pytorch.org/blog/accelerating-llama3/. Accessed 2026-05-21.
18. NVIDIA. "Advancing GPU Programming with the CUDA Tile IR Backend for OpenAI Triton." NVIDIA Technical Blog. https://developer.nvidia.com/blog/advancing-gpu-programming-with-the-cuda-tile-ir-backend-for-openai-triton/. Accessed 2026-05-21.
19. Microsoft. "triton-shared: Shared Middle-Layer for Triton Compilation." GitHub. https://github.com/microsoft/triton-shared. Accessed 2026-05-21.
20. Hsu, Pin-Lun; Dai, Yun; Kothapalli, Vignesh; Song, Qingquan; Tang, Shao; Zhu, Siyu; Shimizu, Steven; Sahni, Shivam; Ning, Haowen; Chen, Yanning. "Liger Kernel: Efficient Triton Kernels for LLM Training." arXiv:2410.10989, 2024-10-14. https://arxiv.org/abs/2410.10989. Accessed 2026-05-21.
21. triton-lang. "Related Work." Triton Documentation. https://triton-lang.org/main/programming-guide/chapter-2/related-work.html. Accessed 2026-05-21.
22. Modular. "What about Triton and Python eDSLs? (Democratizing AI Compute, Part 7)." Modular Blog. https://www.modular.com/blog/democratizing-ai-compute-part-7-what-about-triton-and-python-edsls. Accessed 2026-05-21.
23. triton-lang. "Debugging Triton." Triton Documentation. https://triton-lang.org/main/programming-guide/chapter-3/debugging.html. Accessed 2026-05-21.
