Triton (compiler)
Last reviewed
May 16, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,815 words
Triton is an open source programming language and compiler for writing high performance GPU kernels in a Python like syntax. It was created by Philippe Tillet during his doctoral work at Harvard University and is now maintained primarily by OpenAI. The project pairs a small Python embedded domain specific language with a multi stage MLIR based compiler that lowers a tiled tensor program down to vendor specific machine code, currently targeting NVIDIA, AMD, and Intel GPUs. Triton is best known as the backend that PyTorch uses to generate fused GPU kernels inside torch.compile, and as the language behind several widely used custom kernels including Triton implementations of FlashAttention.
Triton is unusual among GPU programming languages because its abstraction is the tile, not the thread. A Triton kernel is written as if it were operating on whole blocks of data at a time. The compiler then takes responsibility for mapping that block code onto the thousands of threads that an actual GPU uses, handling shared memory layout, memory coalescing, tensor core scheduling, and software pipelining automatically. The result is a tool that lets a Python programmer write fused kernels that come within a small margin of hand tuned CUDA, without having to think about warps, banks, or registers by hand.
The project should not be confused with NVIDIA Triton Inference Server, which is a completely separate piece of software that serves trained models at inference time. The two products share only a name. This article covers the OpenAI Triton language and compiler.
Writing a fast GPU kernel by hand is one of the harder corners of modern systems programming. The standard tool for the job, NVIDIA's CUDA C++, exposes a Single Instruction Multiple Thread (SIMT) execution model where the programmer is responsible for placing data into shared memory, choosing thread block dimensions, coordinating warp level instructions, and avoiding shared memory bank conflicts. Even a competent CUDA developer can spend weeks tuning a new matrix multiplication or attention variant, and the resulting code is brittle: small changes to the input shape, the GPU generation, or the surrounding fused operations often require a full retune.
The deep learning research community ran into this problem repeatedly during the late 2010s. Operations that fit the patterns offered by cuDNN or cuBLAS, such as dense matrix multiplication and standard convolution, could be dispatched to vendor libraries that had years of hand tuned assembly behind them. Anything outside that envelope, such as shift convolutions, sparse attention, or novel normalization layers, had to be written from scratch. The cost was high enough that many promising research ideas stalled simply because nobody wanted to write the kernel.
There were several earlier attempts to fix this. Halide separated the algorithm from the schedule and was used effectively for image processing pipelines. TVM extended that idea to deep learning and added autotuning. CUTLASS gave CUDA programmers a library of reusable template components for GEMM. Each of these helped, but none of them produced the combination of a small Python like surface, near vendor performance, and reasonable portability across GPU vendors. Triton was an attempt to find that combination.
Triton began as a research project from Philippe Tillet at Harvard. The first public paper, titled "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations," was presented by Tillet, H. T. Kung, and David Cox at the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL) at PLDI in June 2019. That paper introduced a C based language and an LLVM based intermediate representation built around tile variables, along with a set of optimization passes specifically designed for tile level programs. The paper demonstrated that the resulting compiler could produce matrix multiplication and convolution kernels on par with cuBLAS and cuDNN, and that it could handle research ideas like shift convolution that vendor libraries did not support.
Tillet later joined OpenAI, where the project was rewritten with Python as the user facing surface instead of the original C dialect. OpenAI announced Triton 1.0 on its blog on 28 July 2021, positioning it as open source GPU programming for neural networks. The announcement framed the goal succinctly: it should be possible for a researcher with no CUDA experience to write a kernel that runs within a small margin of a hand tuned expert implementation. The release included worked examples for vector addition, fused softmax, and matrix multiplication, and it claimed that the fused softmax kernel was several times faster than the equivalent PyTorch operations on standard GPUs.
Triton 2.0 followed in 2023. The most important change was a complete rewrite of the compiler backend on top of MLIR, the multi level intermediate representation framework developed within the LLVM project. The MLIR rewrite introduced three main IR stages: the Triton dialect at the front, a TritonGPU dialect that adds GPU specific layout encodings, and a final lowering to the LLVM dialect for code generation. That change made it much easier to add new hardware backends and to apply optimizations like layout coalescing, software pipelining, and dot product acceleration in a principled way. It also enabled back to back matrix multiplications in a single kernel, which is the pattern at the heart of FlashAttention.
In early 2024 the project moved from its original openai/triton GitHub location to triton-lang/triton, signalling that Triton was being treated more as a community project than an OpenAI controlled artifact. Hardware vendors began upstreaming optimization passes for their own silicon, and the Triton Developer Conference became an annual gathering. The third Triton Developer Conference took place at the Microsoft Silicon Valley Campus in Mountain View, California, on 21 October 2025, bringing together engineers from Intel, AMD, Qualcomm, NVIDIA, Microsoft, and Amazon Web Services to share Triton results on their respective hardware.
The table below lists the major milestones in the project's public history.
| Year | Milestone | Notes |
|---|---|---|
| 2019 | MAPL paper | Tillet, Kung, and Cox publish the original Triton paper at the MAPL workshop at PLDI |
| 2021 | Triton 1.0 release | OpenAI publishes the Python based language and compiler under the MIT license |
| 2022 | Initial AMD GPU work | Community port to ROCm begins |
| 2023 | Triton 2.0 | Backend rewritten on top of MLIR; FlashAttention style kernels become possible |
| 2023 | PyTorch 2.0 ships | TorchInductor adopts Triton as the default GPU code generator |
| 2024 | Move to triton-lang org | Repository moves from openai/triton to triton-lang/triton |
| 2025 | Blackwell support | Native code generation for NVIDIA Blackwell tensor cores and block scaled formats |
| 2026 | Triton 3.7 | Release on 7 May 2026, with continued NVIDIA, AMD, and CPU backend work |
A Triton kernel is a Python function decorated with @triton.jit. Inside the function the programmer writes code that looks like ordinary NumPy style array operations, but the values are blocks of elements rather than individual scalars. The compiler turns each block operation into the appropriate sequence of GPU instructions when the kernel is launched.
A canonical vector addition kernel illustrates the surface. The function takes pointers to the input and output arrays, computes the offsets for the current program instance, loads the corresponding blocks of data, performs the addition, and stores the result. The relevant operations are tl.program_id, tl.arange, tl.load, and tl.store, all of which act on entire tiles rather than single elements. Boundary handling is done with a mask passed to tl.load and tl.store, which avoids the scalar conditional code that a CUDA kernel would need at the edges of the iteration space.
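The masking pattern described above can be emulated on the CPU in plain Python. The sketch below is not Triton code: `load_block`, `store_block`, and `add_program` are hypothetical stand-ins for `tl.load`, `tl.store`, and the body of a jitted kernel, and the block size of 4 is an arbitrary assumption chosen to keep the example small. It only illustrates how a per-block mask replaces scalar boundary checks.

```python
# Plain-Python emulation of a tiled vector addition; not Triton code.
# load_block / store_block are hypothetical stand-ins for tl.load / tl.store.

BLOCK_SIZE = 4  # elements handled by one program instance (assumed value)


def load_block(buf, offsets, mask, other=0.0):
    """Read buf[o] where the mask is set; use `other` for masked-off lanes."""
    return [buf[o] if m else other for o, m in zip(offsets, mask)]


def store_block(buf, offsets, values, mask):
    """Write values[i] to buf[offsets[i]] only where the mask is set."""
    for o, v, m in zip(offsets, values, mask):
        if m:
            buf[o] = v


def add_program(pid, x, y, out, n_elements):
    """Work done by program instance `pid` (the role of tl.program_id(0))."""
    block_start = pid * BLOCK_SIZE
    offsets = [block_start + i for i in range(BLOCK_SIZE)]  # like tl.arange
    mask = [o < n_elements for o in offsets]                # guards the tail
    xs = load_block(x, offsets, mask)
    ys = load_block(y, offsets, mask)
    store_block(out, offsets, [a + b for a, b in zip(xs, ys)], mask)


def vector_add(x, y):
    """Run one program instance per block, as a 1D launch grid would."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    for pid in range(num_programs):
        add_program(pid, x, y, out, n)
    return out
```

With six elements and a block size of four, the second program instance sees offsets 4 through 7 and its mask disables the last two lanes, so no out-of-range element is read or written. In a real kernel the loop over `pid` does not exist; the instances run in parallel on the GPU.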
The execution model is described in the OpenAI announcement and the Triton documentation. A kernel is launched on a grid of program instances, much like a CUDA grid of thread blocks, but each instance operates on a tile and the compiler is free to distribute the work inside the tile across the threads of the underlying block however it likes. The programmer never names a thread index, never allocates shared memory explicitly, and never writes a __syncthreads() barrier. Those decisions are all made by the compiler from the structure of the tile operations.
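The relationship between a launch grid and the tiles it covers comes down to a little integer arithmetic, sketched here in plain Python. The function names are illustrative, and the row-major ordering of tiles is an assumption; real kernels often reorder program ids to improve cache reuse. Clamping the ranges at the tensor edge stands in for the masks a Triton kernel would use.

```python
def cdiv(a, b):
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def launch_grid(M, N, BLOCK_M, BLOCK_N):
    """Number of program instances needed to cover an M x N output."""
    return cdiv(M, BLOCK_M) * cdiv(N, BLOCK_N)


def tile_for_program(pid, M, N, BLOCK_M, BLOCK_N):
    """Map a 1D program id to the row and column ranges of its output tile."""
    tiles_n = cdiv(N, BLOCK_N)
    pid_m, pid_n = divmod(pid, tiles_n)  # row-major tile order (an assumption)
    rows = range(pid_m * BLOCK_M, min((pid_m + 1) * BLOCK_M, M))
    cols = range(pid_n * BLOCK_N, min((pid_n + 1) * BLOCK_N, N))
    return rows, cols
```

For a 100 by 70 output with 32 by 32 tiles this gives a grid of 12 program instances, and the last instance covers only rows 96 to 99 and columns 64 to 69.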
The compiler itself is a multi stage MLIR pipeline. The frontend consumes the decorated Python function and produces the Triton dialect, an IR that captures tile operations in a hardware independent form. The middle end progressively lowers that IR into the TritonGPU dialect, which extends the base dialect with layout attributes that describe how each tensor's elements are distributed across threads, warps, and cooperative thread arrays. Several optimization passes run on the TritonGPU IR, including memory coalescing, redundant layout conversion removal, thread locality optimization, a matrix multiplication acceleration pipeline that targets tensor cores, dot operand optimization, software loop pipelining, and prefetching. The backend then converts the TritonGPU dialect to the LLVM dialect, lowers that to LLVM IR, and emits PTX and cubin for NVIDIA targets or the equivalent assembly for AMD and Intel targets.
The most useful programmer facing feature on top of all this is the autotuner. The @triton.autotune decorator takes a list of candidate configurations, where each configuration sets the block size, the number of warps, the number of pipeline stages, and any kernel specific constants. The first time a kernel runs with a new shape, Triton compiles and benchmarks every configuration in the list and remembers the fastest one. Subsequent launches use the winning configuration directly. This replaces the manual sweep that a CUDA programmer would write by hand and removes a major source of friction in kernel development.
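The benchmark-once-then-cache behaviour can be illustrated with a toy model in plain Python. Everything in this sketch is hypothetical: `TinyAutotuner` is not Triton's implementation, and `fake_cost` is a made-up cost model used in place of real timing so that the example is deterministic. In Triton itself the candidates are `triton.Config` objects, and the `key` argument of `@triton.autotune` names the kernel arguments whose change triggers a new sweep.

```python
class TinyAutotuner:
    """Toy stand-in for the benchmark-and-cache behaviour described above."""

    def __init__(self, configs, benchmark):
        self.configs = configs      # candidate configurations (plain dicts here)
        self.benchmark = benchmark  # callable(config, key) -> cost (hypothetical)
        self.cache = {}             # key -> cheapest config found for that key
        self.benchmark_calls = 0

    def best_config(self, key):
        """Return the cached winner for `key`, benchmarking on first use."""
        if key not in self.cache:
            timings = []
            for cfg in self.configs:
                self.benchmark_calls += 1
                timings.append((self.benchmark(cfg, key), cfg))
            self.cache[key] = min(timings, key=lambda t: t[0])[1]
        return self.cache[key]


CONFIGS = [
    {"BLOCK_SIZE": 64, "num_warps": 2, "num_stages": 2},
    {"BLOCK_SIZE": 128, "num_warps": 4, "num_stages": 3},
    {"BLOCK_SIZE": 256, "num_warps": 8, "num_stages": 4},
]


def fake_cost(cfg, n_elements):
    """Made-up cost: fixed overhead per block plus wasted lanes in the tail."""
    blocks = -(-n_elements // cfg["BLOCK_SIZE"])  # ceiling division
    wasted = blocks * cfg["BLOCK_SIZE"] - n_elements
    return blocks * 10 + wasted
```

Under this invented cost model a problem of 1000 elements selects the 256-wide configuration while a problem of 100 elements selects the 128-wide one, and asking for the same key a second time performs no further benchmarking.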
Block pointers are a related convenience. tl.make_block_ptr constructs a pointer that knows the shape and strides of the underlying tensor, and the compiler can use that information to emit hardware specific instructions like NVIDIA's Tensor Memory Accelerator on Hopper or Blackwell, which moves up to five dimensional tiles asynchronously between global and shared memory in a single transaction. Without block pointers Triton can still produce correct code, but block pointers give the optimizer enough static information to pick the best memory movement primitive available on the target.
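To make concrete what information a block pointer carries, the following plain-Python sketch models one as a small record over a flat buffer. `BlockPtr`, `load_tile`, and `advance` are hypothetical names for illustration only; they mirror the shape, strides, offsets, and block shape arguments of `tl.make_block_ptr` and the idea behind `tl.advance`, but say nothing about how the compiler lowers them to hardware instructions.

```python
from dataclasses import dataclass


@dataclass
class BlockPtr:
    """Toy analogue of a block pointer over a flat buffer (hypothetical type)."""
    base: list          # flat storage
    shape: tuple        # (rows, cols) of the logical tensor
    strides: tuple      # elements to skip per step along each dimension
    offsets: tuple      # top-left corner of the current tile
    block_shape: tuple  # (tile_rows, tile_cols)


def load_tile(ptr, pad=0.0):
    """Gather the tile, padding positions that fall outside `shape`."""
    r0, c0 = ptr.offsets
    tile = []
    for i in range(ptr.block_shape[0]):
        row = []
        for j in range(ptr.block_shape[1]):
            r, c = r0 + i, c0 + j
            if 0 <= r < ptr.shape[0] and 0 <= c < ptr.shape[1]:
                row.append(ptr.base[r * ptr.strides[0] + c * ptr.strides[1]])
            else:
                row.append(pad)
        tile.append(row)
    return tile


def advance(ptr, delta):
    """Return a new pointer shifted by `delta`, similar in spirit to tl.advance."""
    new_offsets = (ptr.offsets[0] + delta[0], ptr.offsets[1] + delta[1])
    return BlockPtr(ptr.base, ptr.shape, ptr.strides, new_offsets, ptr.block_shape)
```

Because shape and strides travel with the pointer, the same `load_tile` routine handles a transposed view (by swapping the strides) and the ragged edge of the tensor without any extra index arithmetic at the call site.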
The largest single user of Triton is PyTorch. PyTorch 2.0, released in March 2023, introduced torch.compile, an opt in just in time compilation path that traces an eager PyTorch model into a graph and then compiles it. The default backend behind torch.compile is TorchInductor, and TorchInductor generates Triton code for NVIDIA and AMD GPUs and C++ with OpenMP for CPUs.
The PyTorch team chose Triton for several reasons that the project's blog posts and design documents make explicit. First, Triton operates at the same level of abstraction that they wanted Inductor to expose: a Python like description of tile operations rather than thread level CUDA. Second, the Triton compiler already handled the parts of code generation that would otherwise have to be reimplemented inside Inductor, including register allocation, shared memory layout, and tensor core selection. Third, Triton was open source and hackable, which gave the PyTorch team room to add features upstream when their use case demanded them. Fourth, the same Triton source could target multiple GPU vendors, which mattered as AMD and Intel hardware became more important in production deployments.
When TorchInductor compiles a model, it traces a fused operator graph and generates a Triton kernel for each fused region. The user does not see this code unless they enable debug output. For most users torch.compile simply makes their model faster on the same hardware. Power users can write their own Triton kernels and call them from a compiled torch.compile graph, with the compiler treating them as opaque custom operators. PyTorch's documentation includes a dedicated recipe for this pattern, and many of the most popular PyTorch extensions in the LLM ecosystem ship Triton kernels alongside their eager mode implementations.
The practical effect of this integration is that almost every modern PyTorch deployment runs Triton code, often without the developers realizing it. A team that calls torch.compile(model) and sees a speedup is benefitting from Triton kernels generated automatically by Inductor, and a team that uses any large library built on top of PyTorch is likely loading additional Triton kernels for fused attention, normalization, or activation functions.
Triton's design separates the hardware independent Triton dialect from the hardware specific lowering, which makes it easier to add new backends. As of 2026 the project supports three main GPU families and is actively developing CPU support.
| Target | Status | Notes |
|---|---|---|
| NVIDIA GPUs | Mature, primary target | Requires Compute Capability 8.0 or newer in current releases; Ampere, Ada Lovelace, Hopper, and Blackwell all supported; native code generation for tensor cores, TMA on Hopper and Blackwell, and Blackwell block scaled floating point formats |
| AMD GPUs | Mature, second tier | Requires ROCm 6.2 or newer; CDNA (MI200, MI300) and RDNA generations supported; FlashAttention Triton kernels run on MI300 with fp16, bf16, fp32, and FP8 paths |
| Intel GPUs | Out of tree backend | Maintained as intel-xpu-backend-for-triton and consumed via the triton-xpu PyPI package; tuned for Intel Data Center GPU Max and Arc lines |
| CPUs | Under development | Backend targeting x86 and ARM CPUs is being added in the main repository; experimental work on RISC-V has also been reported |
The NVIDIA path is the most mature and is treated as the primary target by the project. The Hopper architecture introduced the Tensor Memory Accelerator and a new generation of asynchronous tensor core instructions, both of which Triton learned to emit during the H100 generation. The Blackwell architecture introduced block scaled floating point formats inspired by the Open Compute Project's microscaling formats, and Triton added native code generation for those formats so that FlashAttention style kernels see roughly 1.5x throughput on Blackwell over Hopper at FP16.
The AMD path is maintained jointly by AMD engineers and the upstream community. The flash attention Triton kernel on AMD CDNA hardware supports forward and backward passes, causal masking, multi query and grouped query attention, dropout, rotary embeddings, ALiBi, paged attention, and FP8. AMD's ROCm documentation includes Triton tutorials and a public optimization guide that maps Triton tuning knobs to MI200 and MI300 hardware. AMD's own multi GPU framework, Iris, is built on top of Triton.
The Intel path lives in a separate repository because it is co developed with the Intel oneAPI software stack and pulls in tooling that the upstream project does not. Codeplay and Intel engineers have published performance numbers showing Triton FlashAttention competitive with hand tuned kernels on Intel Data Center GPUs. Users install the Intel backend by replacing the default triton package with triton-xpu, which carries the Intel specific compiler passes.
The CPU backend is the newest piece of work. It targets cases where the same Triton source should be runnable both on a GPU during training and on a CPU during local inference, particularly for development loops where a GPU is not always available.
Triton has become the default language for custom kernels in the open source large language model ecosystem.
FlashAttention, the memory efficient attention kernel introduced by Tri Dao and collaborators in 2022, has a widely used Triton implementation. The Triton documentation ships a worked tutorial on a fused attention kernel with the same recurrence relation that FlashAttention uses. The reference Dao AILab FlashAttention repository includes a Triton path for AMD CDNA and RDNA GPUs that supports the full FlashAttention 2 feature set. FlashAttention-3, the Hopper specific successor, has Triton based implementations that target the Hopper TMA and warp specialization features.
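The recurrence in question is usually called online softmax: the kernel keeps a running maximum and a running normalizer, and rescales its partial results whenever a new block raises the maximum. The plain-Python sketch below shows it for a single query row with scalar values, which is a simplification of the tiled matrix form used in real kernels; the function names and the block size are illustrative assumptions.

```python
import math


def attention_reference(scores, values):
    """Softmax-weighted average computed with all scores available at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_streaming(scores, values, block=2):
    """Same result, visiting the scores one block at a time."""
    m = float("-inf")  # running maximum of the scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)  # rescales old partial sums to the new max
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(p)
        acc = acc * scale + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    return acc / l
```

The streaming version never holds more than one block of scores at a time, which is what lets a fused attention kernel avoid materializing the full attention matrix in global memory.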
Liger Kernel is a production grade collection of fused operator implementations for large language model training. Every kernel in the library is written in Triton, including fused root mean square normalization, rotary positional embedding, cross entropy, and several variants of attention. The library is used inside many open source training stacks because the Triton kernels reduce both memory usage and wall clock time compared to the equivalent eager PyTorch operations.
Mixture of experts (MoE) kernels are another common Triton target. Routing the tokens for an MoE layer involves a scatter operation that is awkward to express in eager PyTorch and slow when expressed naively. Several open source MoE implementations, including the Triton MoE kernels used by inference frameworks like vLLM and SGLang, are written directly in Triton because the language lets the implementer control the data movement between routing and expert computation. OpenAI's own gpt-oss models, which use a sparse MoE design, ship Triton kernels for their grouped GEMM and expert routing paths.
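The scatter step can be pictured as a counting sort of tokens by their assigned expert. The sketch below is a plain-Python illustration under simplifying assumptions (top-1 routing, one expert id per token, a hypothetical function name); it is not taken from vLLM, SGLang, or any other implementation. It produces the permutation and segment boundaries that a grouped GEMM would then consume.

```python
def group_tokens_by_expert(expert_ids, num_experts):
    """Return (order, offsets): token indices grouped by expert, and segment bounds.

    Tokens routed to expert e are order[offsets[e]:offsets[e + 1]].
    """
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)  # running total of the counts
    cursor = offsets[:-1]                # next free slot per expert (a copy)
    order = [0] * len(expert_ids)
    for token, e in enumerate(expert_ids):
        order[cursor[e]] = token         # the scatter step
        cursor[e] += 1
    return order, offsets
```

After grouping, each expert's tokens occupy one contiguous segment, so the expert computation can run as a sequence of dense matrix multiplications over those segments.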
Quantization kernels are a third common pattern. Libraries that need fast int4 or FP8 matrix multiplication for inference, including bitsandbytes style implementations and the Marlin family of kernels, often ship Triton variants alongside their CUDA reference. The Triton variants are easier to maintain because the quantization layout can be changed without rewriting the inner loop.
Beyond these specific libraries, Triton is also widely used as a teaching language for GPU kernel development. Community education efforts such as the GPU MODE lecture series use Triton to teach fused kernels and tensor core programming, because the abstraction lets students reason about performance without first having to learn the CUDA SIMT model.
The name Triton is shared by two completely separate products in the AI infrastructure space, which causes regular confusion. The table below summarizes the difference.
| Property | OpenAI Triton (this article) | NVIDIA Triton Inference Server |
|---|---|---|
| Purpose | Programming language and compiler for writing GPU kernels | Inference serving system for deployed models |
| Created by | Philippe Tillet, originally at Harvard; OpenAI for the public release | NVIDIA |
| Primary repository | triton-lang/triton on GitHub | triton-inference-server/server on GitHub |
| User interaction | Write Python kernels with @triton.jit | Configure model repositories and call an HTTP or gRPC endpoint |
| Position in the stack | Below PyTorch and other frameworks, generating low level GPU code | Above frameworks, serving compiled models to clients |
| Hardware | Compiles for NVIDIA, AMD, and Intel GPUs, with CPU support in progress | Runs on NVIDIA GPUs, x86 and ARM CPUs, and AWS Inferentia |
| License | MIT | BSD-3-Clause |
In 2025 NVIDIA began branding the inference server as Dynamo-Triton in product material, partly to disambiguate the two projects, but the older name is still in common use. When developers discuss writing Triton kernels they almost always mean the OpenAI project; when they discuss deploying Triton servers they almost always mean the NVIDIA project. Confusingly, the NVIDIA Triton Inference Server can serve PyTorch models that contain kernels generated by the OpenAI Triton compiler, so it is possible for both products to appear in the same stack.
Reception of Triton has been broadly positive across academia, the major AI labs, and the hardware vendor community. The original MAPL paper showed near vendor performance on matrix multiplication and convolution, which was already a strong result for a research compiler. The 2021 OpenAI release attracted attention mainly because of the productivity gap with CUDA. A common figure quoted by adopters is that Triton kernels are three to ten times faster to develop than the equivalent CUDA code, while reaching roughly 80 to 100 percent of hand tuned CUDA performance on common deep learning workloads. The exact gap depends heavily on the kernel and the GPU generation; for some matrix multiplication shapes Triton has matched or beaten the corresponding cuBLAS path, while for some quantization or sparse kernels handwritten CUDA still has a measurable edge.
The decision by the PyTorch team to make Triton the default Inductor backend in PyTorch 2.0 was widely seen as the moment that Triton went from a research project to critical infrastructure. By embedding Triton inside torch.compile, the PyTorch team effectively guaranteed that every team using compiled PyTorch on GPU was also a Triton user, which created a strong pull on the surrounding ecosystem. Hardware vendors responded by contributing backends and optimization passes for their own silicon, which in turn improved the experience for PyTorch users on non NVIDIA hardware. NVIDIA itself contributes upstream and has presented Triton work at its GTC conference, including a Blackwell focused session in 2025.
Not all reactions have been uncritical. Some compiler researchers have argued that Python embedded domain specific languages, including Triton, sacrifice a degree of expressiveness in exchange for adoption. The Modular team published a blog post in 2025 examining this trade off, arguing that while Triton's productivity gains are real, certain advanced kernel patterns are still easier to express in CUDA or in their own Mojo language. Other critics have pointed to long autotuning times on first execution, the brittleness of cache invalidation when shapes change, and the difficulty of debugging Triton kernels when the compiler picks an unexpected layout. The Triton team has addressed several of these complaints over time, but they remain a recurring topic on the project's issue tracker.
The project's longer term significance is hard to overstate for deep learning practitioners. Triton lowered the cost of writing a custom kernel from weeks of CUDA work to days of Python work, which made it economically viable to ship custom operators in projects that would never have built a CUDA team. That shift is visible in the open source ecosystem, where fused attention, MoE routing, custom quantization, and a long tail of LLM specific optimizations now ship with their own Triton implementations.