torch.compile
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,092 words
torch.compile is the just-in-time graph capture and compilation feature introduced in PyTorch 2.0, a release first announced at the PyTorch Conference on December 2, 2022 and shipped as a stable version on March 15, 2023.[1][2] Wrapping a model with a single call (torch.compile(model)) routes its Python execution through a stack of new components, TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor, that lift unmodified eager-mode code into optimized Triton kernels on NVIDIA A100-class GPUs or C++/OpenMP code on CPUs.[1][3] Meta and the PyTorch Foundation reported geometric-mean training speedups of 38% on TIMM, 52% on HuggingFace Transformers, and 76% on TorchBench at the December 2022 announcement, with a follow-on ASPLOS 2024 paper measuring 2.27x inference and 1.41x training speedup across 180+ models on an A100.[1][3] The design preserves PyTorch's "eager by default" programming model while introducing optional ahead-of-time compilation, an approach Meta describes as the largest architectural shift in PyTorch since the 1.0 release.[1][4]
| Field | Value |
|---|---|
| Feature name | torch.compile |
| Project | PyTorch 2.x |
| Developer | Meta AI, PyTorch Foundation, broader open-source contributors |
| First announced | December 2, 2022 (PyTorch Conference) |
| Stable release | March 15, 2023 (PyTorch 2.0)[2] |
| License | BSD-3-Clause (PyTorch repository) |
| Default backend | TorchInductor |
| GPU codegen target | OpenAI Triton |
| CPU codegen target | C++ with OpenMP |
| Repository | github.com/pytorch/pytorch |
| Reference paper | Ansel et al., ASPLOS 2024[3] |
PyTorch was originally a "define-by-run" framework: each tensor operation executes immediately in the Python interpreter, which is the eager mode that made the library popular among researchers but historically slower than graph-based frameworks like TensorFlow 1 or XLA-backed systems.[4] Several earlier attempts to add compilation to PyTorch (TorchScript, FX tracing, Lazy Tensor, and various nvFuser/NNC fusers) each captured only part of a model or required users to rewrite code, and none became the default path.[3]
Meta began the work that became torch.compile as TorchDynamo in 2020, prototyped by Jason Ansel; the ASPLOS 2024 paper describes the system as the outcome of a roughly five-year effort to find a graph-capture mechanism robust enough to work on arbitrary PyTorch programs.[3][5] The official PyTorch 2.0 retrospective frames TorchDynamo's bytecode-rewriting approach as "the result of 5 years of R&D into safe graph capture."[1]
The public roll-out followed a deliberate sequence:
- December 2, 2022: PyTorch 2.0 is announced at the PyTorch Conference, with torch.compile available in nightly builds and a stable release promised for early March 2023.[1][6]
- October 2023: PyTorch 2.1 adds automatic dynamic shape support in torch.compile, NumPy API support inside compiled regions, AVX-512 codegen on CPU, and improved Inductor schedulers.[18]
- January 30, 2024: PyTorch 2.2 adds FlashAttention-2 integration in scaled_dot_product_attention, the AOTInductor ahead-of-time deployment path, optimizer compilation improvements, horizontal fusion for torch.cat, and the unified TORCH_LOGS logging system.[19]
PyTorch 2.0 is fully backward-compatible with PyTorch 1.x: torch.compile is an opt-in wrapper, not a new programming model, and code that does not call it continues to execute eagerly with no changes.[1][2] The PyTorch Foundation, the Linux Foundation-hosted body that took stewardship of the project in late 2022, presents torch.compile as the flagship feature of the 2.x line and the central piece around which subsequent releases organize their performance work.[4][19]
torch.compile(fn) returns a callable that, on first invocation, traces and compiles the underlying Python function or nn.Module. The stack underneath is layered, with each component responsible for a different stage of lowering a Python program down to device code.[1][3]
TorchDynamo is a Python-level just-in-time compiler that hooks into the CPython frame evaluation API specified by PEP 523.[7][8] When a wrapped function is called, Dynamo intercepts the frame, walks the function's bytecode, and rewrites it on the fly: PyTorch operations are extracted into an FX graph (torch.fx.GraphModule), while constructs that cannot be safely captured (data-dependent control flow, arbitrary Python side effects, calls to C extensions) cause a "graph break" and fall back to the regular CPython interpreter.[3][9]
Because Dynamo operates on bytecode rather than the Python source, it can capture code paths that traditional symbolic tracing cannot reach, including if/else branches, list comprehensions, and many third-party library calls. The ASPLOS 2024 paper reports that Dynamo captures graphs more robustly than prior PyTorch approaches while adding minimal overhead.[3] On a 7,000-plus repository GitHub corpus, Meta reports a 99% graph-capture rate, with the remaining 1% falling back to eager execution.[1]
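To make "operates on bytecode" concrete, the standard-library dis module can list the instructions CPython executes for a function. The sketch below does not use Dynamo at all; it only shows the kind of input a bytecode-level tracer works from, and the function scale_if_positive is a hypothetical example.

```python
import dis


def scale_if_positive(x, factor):
    """Hypothetical function containing a branch."""
    if factor > 0:
        return x * factor
    return x


def opnames(fn):
    """Return the CPython opcode names for fn, in order."""
    return [ins.opname for ins in dis.get_instructions(fn)]


names = opnames(scale_if_positive)

# The `if` appears as an explicit conditional-jump instruction, so a tool
# reading bytecode has both arms of the branch in front of it.
jumps = [name for name in names if "JUMP" in name]
print(jumps)
```

The exact opcode names differ between CPython versions, which is one reason a bytecode-level tool has to track interpreter releases closely.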
PEP 523, accepted into Python 3.6, added a hook in the CPython interpreter that lets external C code substitute its own frame-evaluation function in place of _PyEval_EvalFrameDefault. Originally motivated as a generic JIT entry point, it became the foundation TorchDynamo uses to intercept Python execution without modifying the user's source code or requiring AST-level analysis.[7] When TorchDynamo is active it installs a custom frame evaluator that, for each Python function call, decides whether to rewrite the bytecode (extracting an FX graph and substituting calls to the compiled artifact) or to defer to the default evaluator. Cached compilations are keyed on the bytecode object plus the guard set, so successive calls with the same shapes and types reuse the existing compiled artifact at near-zero overhead.[7][9]
Each compiled graph is paired with a set of "guards": runtime checks that confirm the assumptions Dynamo made during tracing (tensor shapes, dtypes, attribute values, Python types) still hold for a new call.[9][10] If all guards pass, the cached compiled artifact is reused; if a guard fails, Dynamo retraces and produces a new specialization, optionally widening the assumption (for example, converting a fixed shape into a symbolic one) so the next call is more likely to hit the cache.[9]
By default, dynamic=None enables automatic dynamic shapes: the first call specializes on the observed shapes, and if a subsequent call violates that guard, the function is recompiled with symbolic shape support, using a ShapeEnv and SymPy expressions to reason about size relationships.[10] Setting dynamic=True skips the first specialization and traces symbolically from the start. Setting the environment variable TORCH_LOGS=recompiles makes the compiler log each recompilation along with the guards that triggered it, which is the primary debugging tool for guard failures.[10]
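The specialize-then-widen behaviour can be modelled in a few lines of plain Python. This is a toy sketch, not Dynamo's data structures: the class ToyCompileCache and the helper guard_passes are hypothetical, guards here cover shapes only, and "compiling" is reduced to incrementing a counter.

```python
from typing import Callable, List, Optional, Tuple

Shape = Tuple[int, ...]
# A guard is a shape pattern: an int must match exactly, None accepts any size.
Guard = Tuple[Optional[int], ...]


def guard_passes(guard: Guard, shape: Shape) -> bool:
    """Check a shape against a guard, dimension by dimension."""
    return len(guard) == len(shape) and all(
        g is None or g == s for g, s in zip(guard, shape)
    )


class ToyCompileCache:
    """Toy model of guarded specialization with widening on guard failure."""

    def __init__(self, fn: Callable[[Shape], int]):
        self.fn = fn
        self.entries: List[Guard] = []
        self.compile_count = 0

    def __call__(self, shape: Shape) -> int:
        if not any(guard_passes(g, shape) for g in self.entries):
            self._compile(shape)  # cache miss: add a new specialization
        return self.fn(shape)     # run the (pretend) compiled artifact

    def _compile(self, shape: Shape) -> None:
        self.compile_count += 1
        if self.entries and len(self.entries[-1]) == len(shape):
            # Guard failure: keep dimensions that matched, widen the rest.
            prev = self.entries[-1]
            widened = tuple(p if p == s else None for p, s in zip(prev, shape))
            self.entries.append(widened)
        else:
            # First call (or a new rank): specialize on the observed sizes.
            self.entries.append(tuple(shape))


cache = ToyCompileCache(lambda shape: sum(shape))
for batch in (8, 16, 32, 64):
    cache((batch, 128))
print(cache.compile_count)  # 2
```

The first call specializes on (8, 128); the second call fails that guard and produces a widened entry that accepts any batch size, so the remaining calls hit the cache.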
To accelerate training and not just inference, PyTorch 2.0 introduced AOTAutograd, an "ahead-of-time" autograd engine that traces both the forward and backward graphs by re-using the standard dispatcher.[1][3] AOTAutograd intercepts the captured Dynamo graph, decomposes operators against the PyTorch dispatcher, runs the autograd engine ahead of time on FakeTensors, and emits a joint forward-and-backward FX graph that the downstream compiler can optimize as a single unit.[3] Without AOTAutograd, only the forward pass would be compiled, leaving training largely on the eager path; with it, the entire training step (including activation save/restore) becomes a single compiled artifact.[1]
PyTorch has on the order of two thousand operators (counting overloads), too many for any single backend to implement directly. PrimTorch defines two smaller, stable operator sets that the captured graph is lowered into: the "Prim ops" set of roughly 250 low-level primitives suited for compilers, and the "ATen ops" set of approximately 750 higher-level operators suited for backends that prefer to consume larger building blocks.[1] Decompositions are written in Python so they can be reused across backends and traced through Dynamo.[1][3]
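The idea of a decomposition can be shown without PyTorch: a higher-level operator is rewritten using only members of a small primitive set, so a backend that implements the primitives gets the higher-level operator for free. The sketch below works on plain Python lists; PRIMS, DECOMPOSITIONS, and softmax_decomposition are hypothetical names and do not correspond to PrimTorch's actual operator sets.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]

# Hypothetical primitive set: a few elementwise and reduction operations.
PRIMS: Dict[str, Callable] = {
    "exp": lambda v: [math.exp(x) for x in v],
    "sub_scalar": lambda v, c: [x - c for x in v],
    "div_scalar": lambda v, c: [x / c for x in v],
    "max": lambda v: max(v),
    "sum": lambda v: sum(v),
}


def softmax_decomposition(v: Vector) -> Vector:
    """Express softmax using only entries of PRIMS."""
    # Subtracting the maximum first keeps exp() from overflowing.
    shifted = PRIMS["sub_scalar"](v, PRIMS["max"](v))
    exps = PRIMS["exp"](shifted)
    return PRIMS["div_scalar"](exps, PRIMS["sum"](exps))


# Hypothetical table from a higher-level operator name to its decomposition.
DECOMPOSITIONS: Dict[str, Callable[[Vector], Vector]] = {
    "softmax": softmax_decomposition,
}

probs = DECOMPOSITIONS["softmax"]([1.0, 2.0, 3.0])
print(probs)
```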
TorchInductor is the default backend that consumes the post-AOTAutograd, post-PrimTorch graph and produces device code.[1][3] On NVIDIA GPUs it lowers to OpenAI Triton kernels; on CPUs it generates C++ source files compiled with a normal host compiler and parallelized with OpenMP.[3] Inductor uses a Python loop-level intermediate representation with about 50 operators that is deliberately small and extensible, and it performs scheduling, fusion, and memory-planning passes before emitting kernels.[1] The ASPLOS 2024 paper reports that Inductor outperforms six other compilers tested in the same harness across 180+ models.[3]
Code generation proceeds in three logical phases. First, the captured FX graph is decomposed and normalized using PrimTorch decompositions; second, Inductor lowers the result into its loop-level IR and performs scheduling decisions (op fusion, tiling, memory layout); third, it emits Triton kernels (for CUDA and ROCm targets) or C++ source files (for CPU targets) and compiles them with the corresponding toolchain. The Triton path benefits from Triton's block-level programming model, which lets a single kernel author cover a wide range of tile sizes and broadcasting patterns without writing per-shape variants.[1][3]
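The effect of pointwise fusion, one of the scheduling decisions listed above, can be illustrated in plain Python. This is only an analogy under simplifying assumptions: the function names are hypothetical, and the real compiler fuses operations in its loop-level IR before emitting Triton or C++ rather than rewriting Python loops.

```python
import math
from typing import List


def unfused(xs: List[float]) -> List[float]:
    """Two separate passes; the first writes a full intermediate buffer."""
    tmp = [math.sin(x) for x in xs]      # pass 1: writes len(xs) intermediates
    return [t * 2.0 + 1.0 for t in tmp]  # pass 2: reads them back


def fused(xs: List[float]) -> List[float]:
    """The same arithmetic in a single pass, with no intermediate buffer."""
    return [math.sin(x) * 2.0 + 1.0 for x in xs]


data = [0.0, 0.5, 1.0, 1.5]
assert unfused(data) == fused(data)
```

On a GPU the saving comes from avoiding the round trip of the intermediate through device memory and from launching one kernel instead of two.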
torch.compile accepts a mode argument that selects an optimization preset:
| Mode | Behavior | Typical use |
|---|---|---|
| "default" | Balanced compile time and runtime, suitable for large models. | Most training and inference workloads.[11] |
| "reduce-overhead" | Wraps the compiled graph in CUDA Graphs to eliminate per-launch Python and driver overhead, at the cost of additional memory. | Small models and low-latency inference where launch overhead dominates.[11][12] |
| "max-autotune" | Searches over Triton kernel variants, fusion schedules, and templated GEMMs (with optional Cutlass and cuDNN templates); compile times are much longer but generated code is the fastest available. | Latency-critical deployment or repeated training runs where compile cost amortizes.[13] |
Additional keyword arguments include fullgraph=True (raise on any graph break, useful for export and debugging) and dynamic=True/False/None (control symbolic-shape behavior).[9][10] The reduce-overhead mode is implemented via "CUDAGraph Trees", a refinement that records multiple captured graphs sharing a single memory pool so that different execution paths can be replayed without separate allocations.[12]
CUDA Graph Trees specifically address a problem that arises when chaining several captured CUDA graphs together: a naive approach would force each graph to use a separate memory pool, blowing up activation memory and adding host-side copies to move intermediates between graphs. Trees instead share a single pool across all graphs in the tree and use a tensor-liveness tracker (implemented in torch/_inductor/cudagraph_trees.py) to reuse dead memory across replays, retaining the launch-overhead savings without paying the memory cost.[12][22] Because CUDA Graphs require fixed tensor addresses and shapes, Trees re-record a fresh CUDAGraph for each unique input shape; for workloads with very high shape variability, the re-record cost can outweigh the savings, in which case mode="default" (without CUDA Graphs) is preferred.[12][22]
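One way to keep the choice of mode explicit in application code is a small table of presets keyed by workload. In the sketch below the workload names, PRESETS, compile_kwargs, and compile_for are all hypothetical; only the mode, fullgraph, and dynamic keyword arguments come from torch.compile itself. Pairing fullgraph=True with max-autotune is an illustrative choice, not a requirement. PyTorch is imported lazily so the table can be inspected without it.

```python
from typing import Any, Dict

# Hypothetical workload names mapped to torch.compile keyword arguments.
PRESETS: Dict[str, Dict[str, Any]] = {
    "training": {"mode": "default"},
    "small_model_inference": {"mode": "reduce-overhead"},
    "latency_critical": {"mode": "max-autotune", "fullgraph": True},
    # Highly variable shapes: stay off CUDA Graphs and trace symbolically.
    "variable_shapes": {"mode": "default", "dynamic": True},
}


def compile_kwargs(workload: str) -> Dict[str, Any]:
    """Return a copy of the keyword arguments for a workload."""
    if workload not in PRESETS:
        raise KeyError(f"unknown workload {workload!r}; choose from {sorted(PRESETS)}")
    return dict(PRESETS[workload])


def compile_for(model, workload: str):
    """Wrap model with torch.compile using the preset for workload."""
    import torch  # lazy import: requires PyTorch 2.0 or later

    return torch.compile(model, **compile_kwargs(workload))
```

A call such as compile_for(model, "small_model_inference") returns the wrapped module immediately; the compilation work happens on the first invocation of the result.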
torch.export and AOTInductor
torch.export is a sibling API that reuses Dynamo to capture a "complete" graph (one with no graph breaks) of an nn.Module plus example inputs, then serializes it as an ExportedProgram. PyTorch 2.2 added AOTInductor, an ahead-of-time variant of TorchInductor that consumes an exported program and emits a compiled shared library suitable for loading in non-Python server-side environments, sharing the same Triton/C++ codegen as the JIT path.[19] This lets a single backend serve both interactive development (via torch.compile) and production deployment (via torch.export and AOTInductor).
Although TorchInductor is the default, torch.compile exposes a backend= argument that selects an alternative compiler; the canonical list can be retrieved with torch.compiler.list_backends() and includes inductor, cudagraphs, onnxrt, openxla, openxla_eval, and tvm, with other backends moved out of core into companion packages.[14] Custom backends can be registered through the torch._dynamo extension API; the developer documentation walks through the contract a backend must satisfy (it receives an FX GraphModule and a list of example inputs, and returns a callable).[15]
Notable backends, both in-tree and third-party, include:
| Backend | Maintainer | Target | Notes |
|---|---|---|---|
| TorchInductor | Meta / PyTorch | NVIDIA GPUs (Triton), AMD GPUs (Triton on ROCm), CPUs (C++/OpenMP) | Default backend; the focus of the ASPLOS 2024 paper.[3] |
| cudagraphs | PyTorch | NVIDIA GPUs | Wraps captured graph in CUDA Graphs without Inductor codegen.[12] |
| onnxrt | Microsoft / PyTorch | ONNX Runtime targets | Lowers FX graph to ONNX and dispatches via ONNX Runtime.[14] |
| openxla | Google / PyTorch | XLA-supported hardware (TPUs, GPUs) | Routes through XLA via the PyTorch/XLA integration.[14] |
| tvm | Apache TVM | Multiple | Lowers through Apache TVM's compiler stack.[14] |
| Hidet | CentML, University of Toronto, AWS | NVIDIA GPUs | Selected via backend="hidet"; aimed at inference with operator-schedule search and Tensor Core utilization.[16] |
| Intel Extension for PyTorch (IPEX) | Intel (OpenVINO ecosystem) | Intel CPUs and Intel GPUs | Provides Intel-optimized fusion patterns and dispatches to Intel libraries.[14] |
The PyTorch project moved most of these backends out of torch core into separate packages over the course of 2023 and 2024, leaving the in-tree set focused on Inductor, CUDA Graphs, and a small number of cross-vendor adapters.[14]
Two distinct sets of headline numbers are commonly cited.
The PyTorch Conference 2022 keynote (December 2, 2022) and the official PyTorch 2.0 announcement reported, across 163 diverse open-source models tested on an A100:[1]
| Benchmark suite | Models tested | Training speedup (weighted) |
|---|---|---|
| HuggingFace Transformers | 46 | 52% |
| TIMM (PyTorch Image Models) | 61 | 38% |
| TorchBench | 56 | 76% |
| Overall | 163 | 43% (21% FP32, 51% AMP) |
torch.compile was reported to work on 93% of the 163 models tested, with the weighted average defined as 0.75 times the AMP speedup plus 0.25 times the FP32 speedup to reflect that automatic mixed precision is the more common training configuration.[1]
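The weighting can be checked directly against the table. The short sketch below applies the stated formula to the published FP32 and AMP figures; the function name weighted_speedup is hypothetical, and because the inputs are themselves rounded percentages the result is only a sanity check.

```python
def weighted_speedup(amp: float, fp32: float, amp_weight: float = 0.75) -> float:
    """Weighted average as defined in the text: 0.75 * AMP + 0.25 * FP32."""
    return amp_weight * amp + (1.0 - amp_weight) * fp32


overall = weighted_speedup(amp=51.0, fp32=21.0)
print(f"{overall:.1f}%")  # 43.5%
```

The result of 43.5% is consistent with the reported overall figure of 43% once rounding of the published inputs is taken into account.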
The ASPLOS 2024 paper, which uses a refreshed harness covering 180+ models on the same A100 hardware, reports a geometric-mean inference speedup of 2.27x and a geometric-mean training speedup of 1.41x for TorchInductor over eager mode, and finds that TorchInductor outperforms each of six other compared compilers across the harness.[3] The paper also evaluates Dynamo's graph-capture coverage and overhead and presents detailed ablations of guard handling, dynamic shapes, and operator-decomposition policies.[3]
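A geometric mean is the usual way to summarize speedup ratios across many models, because a single very large ratio pulls it up less than it would an arithmetic mean. The sketch below uses made-up ratios purely to show the calculation; none of the values are measurements from the paper.

```python
import math
from statistics import geometric_mean

# Hypothetical per-model speedup ratios (eager time divided by compiled time).
speedups = [1.1, 1.3, 2.0, 4.0]

geo = geometric_mean(speedups)
arith = sum(speedups) / len(speedups)

# Equivalent definition: exponentiate the mean of the logarithms.
geo_from_logs = math.exp(sum(math.log(s) for s in speedups) / len(speedups))

print(f"geometric mean: {geo:.3f}, arithmetic mean: {arith:.3f}")
```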
PyTorch maintains a public nightly performance dashboard that runs the same HuggingFace, TIMM, and TorchBench harnesses on 12 GCP A100 nodes (each with a 40 GB A100) and publishes speedup, memory, and pass-rate metrics for every nightly build.[17]
A second category of results comes from end-application studies. The November 2023 PyTorch blog "PyTorch compile to speed up inference on Llama 2" reports the following inference latencies, measured on A100 80 GB hardware and with torch.compile plus SDPA (and tensor parallelism for the 70B configuration) compared against unoptimized eager execution:[20]
| Model | GPUs | Latency with torch.compile (ms/token) | Speedup vs. unoptimized |
|---|---|---|---|
| Llama 2 7B | 1 x A100 | reported sub-linear scaling vs. sequence length | reported "significant" speedup[20] |
| Llama 2 13B | 1 x A100 | reported sub-linear scaling vs. batch size | reported "significant" speedup[20] |
| Llama 2 70B | 8 x A100 | 29 | 2.4x[20] |
For inference of compute-bound diffusion and transformer pipelines, the Hugging Face Diffusers PyTorch 2.0 documentation reports that torch.compile can provide "an additional speed-up of 5-300x on top of SDPA" depending on GPU architecture, with Ampere (A100, 3090), Ada (4090), and Hopper (H100) showing the largest gains; the upper end of that range corresponds to small batch sizes where kernel-launch and Python overheads dominate.[21]
torch.compile is now the recommended path for accelerating training and inference of standard PyTorch models without rewriting them. Concrete deployment patterns include:
- Training scripts typically need only a single model = torch.compile(model) line after model construction. The PyTorch 2.0 announcement cites Sylvain Gugger reporting a 1.5x to 2x speedup on Hugging Face training scripts with no other changes.[1]
- TIMM models run under torch.compile largely without modification, per Ross Wightman's quoted endorsement in the PyTorch 2.0 announcement.[1]
- Low-latency inference can use mode="reduce-overhead" with CUDA Graphs to amortize kernel-launch and Python-interpreter overhead, which dominates eager-mode execution for short generation steps in language models.[11][12]
- Other runtimes can be targeted with backend="onnxrt" for ONNX Runtime or backend="openxla" for XLA-targeted hardware.[14]
- torch.compile is integrated with both DistributedDataParallel and FullyShardedDataParallel; the announcement reports up to 15% (FP32) and 80% (AMP) gains for compiled DDP.[1]
- Combining torch.compile, SDPA (with FlashAttention), and tensor parallelism yields 29 ms/token latency on a 70-billion-parameter Llama 2 model running on 8 A100 80 GB GPUs, a 2.4x improvement over unoptimized inference under the same conditions (512-token input, 50-token generation).[20]
- scaled_dot_product_attention automatically dispatches to FlashAttention-2 on supported hardware, achieving 50 to 73 percent of theoretical maximum FLOPs on A100; combining SDPA with torch.compile is the standard recipe in current PyTorch examples.[19]
torch.compile also unlocks ahead-of-time export. The torch.export API leverages Dynamo's graph capture to produce a serializable program representation, and AOTInductor (introduced in PyTorch 2.2) takes that exported graph and compiles it into a shared library suitable for non-Python server-side deployments, sharing backend code with the JIT path.[19]
Despite the speedups, the system has known constraints and rough edges:
- Compile time is highest with mode="max-autotune", which searches over Triton kernel variants and can take significantly longer to complete its first call than mode="default". GitHub issues report that max-autotune can lead to overall execution times longer than eager when models are run few enough times that compilation does not amortize.[13]
- Inputs with varying shapes trigger recompilation unless dynamic=True is enabled or shape ranges are constrained. Each recompilation adds latency and produces a distinct cache entry.[9][10]
- Debugging performance problems relies on logging flags (TORCH_LOGS=recompiles,graph_breaks,dynamo) and tools (torch._dynamo.explain) to localize regressions.[9][10]
- max-autotune on CPU is less mature than on CUDA, with only a subset of operations using templated kernels at launch and ongoing work to extend the GEMM template coverage.[13]
Subsequent PyTorch 2.x releases have continued to expand torch.compile rather than supersede it. Notable per-release changes include:
| Version | Released | Key torch.compile additions |
|---|---|---|
| 2.0 | 2023-03-15 | Initial public release of torch.compile with TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor; modes default, reduce-overhead, max-autotune; dynamic, fullgraph, backend parameters; SDPA front-end with FlashAttention v1 / Memory-Efficient Attention.[2] |
| 2.1 | October 2023 | Automatic dynamic shape support; NumPy API support inside compiled regions; CPU Inductor AVX-512 codegen; improved SDPA pattern matching.[18] |
| 2.2 | 2024-01-30 | FlashAttention-2 integration; AOTInductor (ahead-of-time deployment); compiled optimizer improvements; horizontal fusion for torch.cat; TORCH_LOGS standardization.[19] |
The PyTorch Foundation publishes per-release notes via the project blog and the GitHub release tags, and the developer forum dev-discuss.pytorch.org tracks ongoing torch.compile RFCs.[4][19]
torch.compile is one of several framework-level graph compilers active in the deep-learning ecosystem.
| System | Capture mechanism | Default codegen | Eager support | Sponsor |
|---|---|---|---|---|
| torch.compile (PyTorch 2) | Python bytecode rewrite via PEP 523 | TorchInductor → Triton / C++ | Yes, opt-in JIT | Meta / PyTorch Foundation[1][3] |
| TorchScript (PyTorch 1) | Source tracing / scripting | TorchScript IR + custom fusers | Yes (separate path) | Meta |
| XLA (TensorFlow, JAX) | Lazy tensor tracing or jit decorator | XLA HLO → backend | TF eager, JAX trace-on-call | Google |
| Apache TVM (tvm backend) | Relay/Relax IR import | TVM tensor IR | Standalone compiler | Apache, OctoML |
| ONNX Runtime (onnxrt backend) | ONNX graph import | ORT execution providers | Standalone runtime | Microsoft, ONNX consortium |
Compared to TorchScript, torch.compile distinguishes itself by not requiring users to make their models scriptable: Dynamo's bytecode-level capture handles native Python control flow that TorchScript could not. Compared to JAX's jit, PyTorch keeps eager mode as the default and treats compile as an optional wrapper rather than the dominant API.[1][3]
A second axis of comparison is the codegen target. Where XLA and TVM use their own internal intermediate representations and target a wide range of hardware through compiler back-ends, TorchInductor stands out by emitting source code in two well-known languages: Triton (a Python-embedded DSL designed at OpenAI for GPU kernel programming) and standard C++ with OpenMP. The Python-implemented Inductor IR is intentionally small (around 50 operators), which the ASPLOS 2024 paper argues lets compiler engineers prototype optimizations in Python and avoids the upfront engineering cost of a custom IR per backend.[1][3]
- … torch.compile.
- … jit).
- … onnxrt backend for cross-framework deployment.