torch.compile

Updated Jun 7, 2026

torch.compile is the just-in-time graph capture and compilation feature introduced in PyTorch 2.0, a release first announced at the PyTorch Conference on December 2, 2022 and shipped as a stable version on March 15, 2023.^[1]^[2] Wrapping a model with a single call (torch.compile(model)) routes its Python execution through a stack of new components (TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor) that lift unmodified eager-mode code into optimized Triton kernels on NVIDIA GPUs or C++/OpenMP code on CPUs.^[1]^[3] Meta and the PyTorch Foundation reported weighted-average training speedups of 38% on TIMM, 52% on HuggingFace Transformers, and 76% on TorchBench at the December 2022 announcement, and a follow-on ASPLOS 2024 paper measured geometric-mean speedups of 2.27x for inference and 1.41x for training across 180+ models on an A100.^[1]^[3] The design preserves PyTorch's "eager by default" programming model while introducing optional just-in-time compilation, an approach Meta describes as the largest architectural shift in PyTorch since the 1.0 release.^[1]^[4]

Infobox

Field	Value
Feature name	`torch.compile`
Project	PyTorch 2.x
Developer	Meta AI, PyTorch Foundation, broader open-source contributors
First announced	December 2, 2022 (PyTorch Conference)
Stable release	March 15, 2023 (PyTorch 2.0)^[2]
License	BSD-3-Clause (PyTorch repository)
Default backend	TorchInductor
GPU codegen target	OpenAI Triton
CPU codegen target	C++ with OpenMP
Repository	github.com/pytorch/pytorch
Reference paper	Ansel et al., ASPLOS 2024^[3]

Background

PyTorch was originally a "define-by-run" framework: each tensor operation executes immediately in the Python interpreter, which is the eager mode that made the library popular among researchers but historically slower than graph-based frameworks like TensorFlow 1 or XLA-backed systems.^[4] Several earlier attempts to add compilation to PyTorch (TorchScript, FX tracing, Lazy Tensor, and various nvFuser/NNC fusers) each captured only part of a model or required users to rewrite code, and none became the default path.^[3]

Meta began the work that became torch.compile as TorchDynamo in 2020, prototyped by Jason Ansel; the ASPLOS 2024 paper describes the system as the outcome of a roughly five-year effort to find a graph-capture mechanism robust enough to work on arbitrary PyTorch programs.^[3]^[5] The official PyTorch 2.0 retrospective frames TorchDynamo's bytecode-rewriting approach as "the result of 5 years of R&D into safe graph capture."^[1]

The public roll-out followed a deliberate sequence:

December 2, 2022. PyTorch 2.0 was unveiled at the PyTorch Conference in New Orleans, with torch.compile available in nightly builds and a stable release promised for early March 2023.^[1]^[6]
March 15, 2023. PyTorch 2.0 was released as a stable version, accompanied by the official blog post "PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever."^[2]
October 2023. PyTorch 2.1 shipped at the PyTorch Conference 2023, adding automatic dynamic shape support to torch.compile, NumPy API support inside compiled regions, AVX-512 codegen on CPU, and improved Inductor schedulers.^[18]
January 30, 2024. PyTorch 2.2 shipped with FlashAttention-2 integration in scaled_dot_product_attention, the AOTInductor ahead-of-time deployment path, optimizer compilation improvements, horizontal fusion for torch.cat, and the unified TORCH_LOGS logging system.^[19]
April 27 to May 1, 2024. The paper "PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation", which has more than 50 authors, was presented at the 29th ACM ASPLOS conference in San Diego, and the proceedings were published in early 2024.^[3]^[5]

PyTorch 2.0 is fully backward-compatible with PyTorch 1.x: torch.compile is an opt-in wrapper, not a new programming model, and code that does not call it continues to execute eagerly with no changes.^[1]^[2] The PyTorch Foundation, the Linux Foundation-hosted body that took stewardship of the project in late 2022, presents torch.compile as the flagship feature of the 2.x line and the central piece around which subsequent releases organize their performance work.^[4]^[19]

How It Works

torch.compile(fn) returns a callable that, on first invocation, traces and compiles the underlying Python function or nn.Module. The stack underneath is layered, with each component responsible for a different stage of lowering a Python program down to device code.^[1]^[3]
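The call pattern can be sketched as follows. This is a minimal illustration, not a benchmark: the `TinyMLP` module and its sizes are hypothetical, and the example passes backend="eager" (a debugging backend that exercises Dynamo's graph capture but skips Inductor code generation) so that it should run without a GPU or host compiler. Omitting backend= selects the default TorchInductor path.

```python
import torch
import torch.nn as nn


class TinyMLP(nn.Module):
    """Hypothetical toy model used only to illustrate the call pattern."""

    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


torch.manual_seed(0)
model = TinyMLP()

# Wrapping is cheap: nothing is traced until the first call.
compiled_model = torch.compile(model, backend="eager")

x = torch.randn(2, 8)
eager_out = model(x)
compiled_out = compiled_model(x)  # first call triggers graph capture

# The compiled callable shares parameters with the original module,
# so both paths should produce the same values.
print(torch.allclose(eager_out, compiled_out))
```

Later calls with inputs that satisfy the same assumptions reuse the artifact produced on the first call.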

TorchDynamo (graph capture)

TorchDynamo is a Python-level just-in-time compiler that hooks into the CPython frame evaluation API specified by PEP 523.^[7]^[8] When a wrapped function is called, Dynamo intercepts the frame, walks the function's bytecode, and rewrites it on the fly: PyTorch operations are extracted into an FX graph (torch.fx.GraphModule), while constructs that cannot be safely captured (data-dependent control flow, arbitrary Python side effects, calls to C extensions) cause a "graph break" and fall back to the regular CPython interpreter.^[3]^[9]
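A graph break can be observed with a small custom backend that records every FX graph Dynamo hands over. The sketch below is illustrative: `counting_backend` and `branchy` are hypothetical names, and the exact number of graphs and the exception class raised under fullgraph=True may differ between PyTorch releases.

```python
import torch
import torch._dynamo

captured_graphs = []


def counting_backend(gm: torch.fx.GraphModule, example_inputs):
    """Record each captured FX graph, then run it unchanged."""
    captured_graphs.append(gm)
    return gm.forward


def branchy(x: torch.Tensor) -> torch.Tensor:
    y = x * 2
    # Branching on a tensor's value is data-dependent control flow,
    # so Dynamo splits the function around this point.
    if y.sum() > 0:
        return y + 1
    return y - 1


x = torch.ones(4)
compiled = torch.compile(branchy, backend=counting_backend)
result = compiled(x)
graphs_seen = len(captured_graphs)
print(f"graphs captured: {graphs_seen}")

# Clear Dynamo's caches so the strict variant is traced from scratch.
torch._dynamo.reset()

# fullgraph=True turns the same graph break into an error instead of a fallback.
strict = torch.compile(branchy, backend=counting_backend, fullgraph=True)
try:
    strict(x)
    raised = False
except Exception as exc:  # the exact exception class varies across releases
    raised = True
    print(f"fullgraph=True raised {type(exc).__name__}")
```

More than one captured graph for a single Python function is the visible symptom of a graph break: the code before and after the branch is compiled separately, and the branch itself runs in the regular interpreter.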

Because Dynamo operates on bytecode rather than the Python source, it can capture code paths that traditional symbolic tracing cannot reach, including if/else branches, list comprehensions, and many third-party library calls. The ASPLOS 2024 paper reports that Dynamo captures graphs more robustly than prior PyTorch approaches while adding minimal overhead.^[3] On a 7,000-plus repository GitHub corpus, Meta reports a 99% graph-capture rate, with the remaining 1% falling back to eager execution.^[1]

PEP 523, accepted into Python 3.6, added a hook in the CPython interpreter that lets external C code substitute its own frame-evaluation function in place of _PyEval_EvalFrameDefault. Originally motivated as a generic JIT entry point, it became the foundation TorchDynamo uses to intercept Python execution without modifying the user's source code or requiring AST-level analysis.^[7] When TorchDynamo is active it installs a custom frame evaluator that, for each Python function call, decides whether to rewrite the bytecode (extracting an FX graph and substituting calls to the compiled artifact) or to defer to the default evaluator. Cached compilations are keyed on the bytecode object plus the guard set, so successive calls with the same shapes and types reuse the existing compiled artifact at near-zero overhead.^[7]^[9]

Guard-based recompilation

Each compiled graph is paired with a set of "guards", runtime checks that confirm the assumptions Dynamo made during tracing (tensor shapes, dtypes, attribute values, Python types) still hold for a new call.^[9]^[10] If all guards pass, the cached compiled artifact is reused; if a guard fails, Dynamo retraces and produces a new specialization, optionally widening the assumption (for example, converting a fixed shape into a symbolic one) so the next call is more likely to hit the cache.^[9]

By default, dynamic=None selects automatic dynamic shapes: the first call specializes on the observed shapes, and if a subsequent call violates that guard the function is recompiled with symbolic shape support, using a ShapeEnv and SymPy expressions to reason about size relationships.^[10] Setting dynamic=True skips the first specialization and traces symbolically from the start. Setting TORCH_LOGS=recompiles causes the compiler to log each recompilation along with the guards that triggered it, which makes it the primary debugging tool for guard failures.^[10]
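The interaction between guards, cache hits, and the three dynamic settings can be modelled in plain Python. The sketch below is a toy model under simplifying assumptions: it guards only on tensor shape, and the names `CacheEntry`, `ToyCompileCache`, and `count_compiles` are hypothetical. It does not reflect Dynamo's actual data structures, which also guard on dtypes, attribute values, and Python types.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Shape = Tuple[int, ...]
GuardedShape = Tuple[Optional[int], ...]  # None marks a symbolic dimension


@dataclass
class CacheEntry:
    guarded_shape: GuardedShape

    def guards_pass(self, shape: Shape) -> bool:
        """Guards pass when every non-symbolic dimension matches exactly."""
        return len(shape) == len(self.guarded_shape) and all(
            g is None or g == s for g, s in zip(self.guarded_shape, shape)
        )


class ToyCompileCache:
    """Toy model of guard-checked caching; not Dynamo's real implementation."""

    def __init__(self, dynamic: Optional[bool] = None) -> None:
        self.dynamic = dynamic
        self.entries: List[CacheEntry] = []
        self.compile_count = 0

    def lookup(self, shape: Shape) -> CacheEntry:
        for entry in self.entries:
            if entry.guards_pass(shape):
                return entry  # all guards hold: reuse the cached artifact
        return self._recompile(shape)

    def _recompile(self, shape: Shape) -> CacheEntry:
        previous = self.entries[-1].guarded_shape if self.entries else None
        if self.dynamic is True:
            # Symbolic from the start: no dimension is specialized.
            guarded: GuardedShape = (None,) * len(shape)
        elif self.dynamic is None and previous is not None and len(previous) == len(shape):
            # Automatic dynamic: widen only the dimensions that changed.
            guarded = tuple(s if p == s else None for p, s in zip(previous, shape))
        else:
            # First call, or dynamic=False: specialize on the exact shape.
            guarded = shape
        self.compile_count += 1
        entry = CacheEntry(guarded)
        self.entries.append(entry)
        return entry


def count_compiles(dynamic: Optional[bool], calls: List[Shape]) -> int:
    cache = ToyCompileCache(dynamic=dynamic)
    for shape in calls:
        cache.lookup(shape)
    return cache.compile_count


calls = [(32, 128), (32, 128), (32, 256), (32, 512)]
for setting in (None, True, False):
    print(f"dynamic={setting}: {count_compiles(setting, calls)} compilation(s)")
# dynamic=None: 2 compilation(s)
# dynamic=True: 1 compilation(s)
# dynamic=False: 3 compilation(s)
```

In this model the automatic setting pays one extra compilation compared with dynamic=True, but keeps the first artifact fully specialized for workloads whose shapes never change.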

AOTAutograd (joint forward/backward capture)

To accelerate training and not just inference, PyTorch 2.0 introduced AOTAutograd, an "ahead-of-time" autograd engine that traces both the forward and backward graphs by re-using the standard dispatcher.^[1]^[3] AOTAutograd intercepts the captured Dynamo graph, decomposes operators against the PyTorch dispatcher, runs the autograd engine ahead of time on FakeTensors, and emits a joint forward-and-backward FX graph that the downstream compiler can optimize as a single unit.^[3] Without AOTAutograd, only the forward pass would be compiled, leaving training largely on the eager path; with it, the entire training step (including activation save/restore) becomes a single compiled artifact.^[1]
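The forward and backward graphs that AOTAutograd produces can be inspected by calling it directly. The sketch below assumes the `aot_function` entry point exposed by `functorch.compile`, which accepts separate forward and backward "compilers"; here both are pass-through recorders. The names `make_recorder` and `loss_fn` are hypothetical, and this standalone use bypasses Dynamo, whereas inside torch.compile the input would be a Dynamo-captured graph.

```python
import torch
from functorch.compile import aot_function

captured = {}


def make_recorder(label: str):
    """Return a pass-through 'compiler' that records the FX graph it receives."""

    def compiler(fx_module: torch.fx.GraphModule, example_inputs):
        captured[label] = fx_module
        return fx_module  # run the traced graph as-is

    return compiler


def loss_fn(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x).sum()


aot_loss = aot_function(
    loss_fn,
    fw_compiler=make_recorder("forward"),
    bw_compiler=make_recorder("backward"),
)

x = torch.randn(4, requires_grad=True)
aot_loss(x).backward()

# Both graphs are ordinary FX GraphModules whose generated code can be printed.
print(captured["forward"].code)
print(captured["backward"].code)
```

Because the backward pass is available as a graph before it runs, a downstream compiler can fuse and schedule it in the same way as the forward pass.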

PrimTorch (operator decomposition)

PyTorch has on the order of two thousand operators (counting overloads), too many for any single backend to implement directly. PrimTorch defines two smaller, stable operator sets that the captured graph is lowered into: the "Prim ops" set of roughly 250 low-level primitives suited for compilers, and the "ATen ops" set of approximately 750 higher-level operators suited for backends that prefer to consume larger building blocks.^[1] Decompositions are written in Python so they can be reused across backends and traced through Dynamo.^[1]^[3]
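The effect of a Python decomposition can be seen by tracing a function with and without a decomposition table. The sketch below uses `make_fx` and `torch._decomp.get_decompositions`, which are internal PyTorch utilities whose behavior may change between releases. It shows an ATen-to-ATen decomposition of `aten.silu` as an illustration of the mechanism, not the full lowering to the Prim ops set.

```python
import torch
import torch.nn.functional as F
from torch._decomp import get_decompositions
from torch.fx.experimental.proxy_tensor import make_fx


def f(x: torch.Tensor) -> torch.Tensor:
    return F.silu(x)


def op_names(gm: torch.fx.GraphModule) -> list:
    """Names of the operators called in a traced graph."""
    return [str(n.target) for n in gm.graph.nodes if n.op == "call_function"]


x = torch.randn(4)

# Without a decomposition table the trace keeps the composite silu operator.
plain = make_fx(f)(x)

# With a table, tracing replaces silu by the ops in its Python decomposition.
table = get_decompositions([torch.ops.aten.silu])
decomposed = make_fx(f, decomposition_table=table)(x)

print(op_names(plain))
print(op_names(decomposed))
```

A backend that consumes the decomposed graph needs to implement only the simpler operators that remain, which is the point of defining a smaller operator set.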

TorchInductor (default code generator)

TorchInductor is the default backend that consumes the post-AOTAutograd, post-PrimTorch graph and produces device code.^[1]^[3] On NVIDIA GPUs it lowers to OpenAI Triton kernels; on CPUs it generates C++ source files compiled with a normal host compiler and parallelized with OpenMP.^[3] Inductor uses a Python loop-level intermediate representation with about 50 operators that is deliberately small and extensible, and it performs scheduling, fusion, and memory-planning passes before emitting kernels.^[1] The ASPLOS 2024 paper reports that Inductor outperforms six other compilers tested in the same harness across 180+ models.^[3]

Code generation proceeds in three logical phases. First, the captured FX graph is decomposed and normalized using PrimTorch decompositions; second, Inductor lowers the result into its loop-level IR and performs scheduling decisions (op fusion, tiling, memory layout); third, it emits Triton kernels (for CUDA and ROCm targets) or C++ source files (for CPU targets) and compiles them with the corresponding toolchain. The Triton path benefits from Triton's block-level programming model, which lets a single kernel author cover a wide range of tile sizes and broadcasting patterns without writing per-shape variants.^[1]^[3]

Compile modes

torch.compile accepts a mode argument that selects an optimization preset:

Mode	Behavior	Typical use
`"default"`	Balanced compile time and runtime, suitable for large models.	Most training and inference workloads.^[11]
`"reduce-overhead"`	Wraps the compiled graph in CUDA Graphs to eliminate per-launch Python and driver overhead, at the cost of additional memory.	Small models and low-latency inference where launch overhead dominates.^[11]^[12]
`"max-autotune"`	Searches over Triton kernel variants, fusion schedules, and templated GEMMs (with optional Cutlass and cuDNN templates); compile times are much longer but generated code is the fastest available.	Latency-critical deployment or repeated training runs where compile cost amortizes.^[13]

Additional keyword arguments include fullgraph=True (raise on any graph break, useful for export and debugging) and dynamic=True/False/None (control symbolic-shape behavior).^[9]^[10] The reduce-overhead mode is implemented via "CUDAGraph Trees", a refinement that records multiple captured graphs sharing a single memory pool so that different execution paths can be replayed without separate allocations.^[12]

CUDAGraph Trees address a problem that arises when chaining several captured CUDA graphs together: a naive approach would force each graph to use a separate memory pool, blowing up activation memory and adding host-side copies to move intermediates between graphs. Trees instead share a single pool across all graphs in the tree and use a tensor-liveness tracker (implemented in torch/_inductor/cudagraph_trees.py) to reuse dead memory across replays, retaining the launch-overhead savings without paying the memory cost.^[12]^[22] Because CUDA Graphs require fixed tensor addresses and shapes, Trees re-record a fresh CUDAGraph for each unique input shape; for workloads with very high shape variability, the re-record cost can outweigh the savings, in which case mode="default" (without CUDA Graphs) is preferred.^[12]^[22]
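The mode presets and keyword arguments described in this section can be combined as in the sketch below. The goal names and the `compile_kwargs` helper are hypothetical conveniences for illustration; only the `mode`, `fullgraph`, and `dynamic` arguments are part of torch.compile. The wrapper is created lazily, so no code generation happens until the compiled module is first called.

```python
import torch
import torch.nn as nn


def compile_kwargs(goal: str) -> dict:
    """Map a hypothetical deployment goal to torch.compile keyword arguments."""
    presets = {
        "general": {"mode": "default"},
        "low_latency_small_model": {"mode": "reduce-overhead"},
        "peak_throughput": {"mode": "max-autotune"},
        "must_be_single_graph": {"mode": "default", "fullgraph": True},
        "varying_shapes": {"mode": "default", "dynamic": True},
    }
    return presets[goal]


model = nn.Linear(8, 8)
kwargs = compile_kwargs("varying_shapes")

# Creating the wrapper is cheap; the first call would trigger compilation
# with the default TorchInductor backend.
compiled = torch.compile(model, **kwargs)
print(kwargs)
```

Keeping the choice of preset in one place makes it straightforward to fall back from "reduce-overhead" to "default" for workloads whose input shapes vary too much for CUDA Graph replay to pay off.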

`torch.export` and AOTInductor

torch.export is a sibling API that reuses Dynamo to capture a "complete" graph (one with no graph breaks) of an nn.Module plus example inputs, then serializes it as an ExportedProgram. PyTorch 2.2 added AOTInductor, an ahead-of-time variant of TorchInductor that consumes an exported program and emits a compiled shared library suitable for loading in non-Python server-side environments, sharing the same Triton/C++ codegen as the JIT path.^[19] This lets a single backend serve both interactive development (via torch.compile) and production deployment (via torch.export and AOTInductor).

Implementations and Backends

Although TorchInductor is the default, torch.compile exposes a backend= argument that selects an alternative compiler; the canonical list can be retrieved with torch.compiler.list_backends() and includes inductor, cudagraphs, onnxrt, openxla, openxla_eval, and tvm, with other backends moved out of core into companion packages.^[14] Custom backends can be registered through the torch._dynamo extension API; the developer documentation walks through the contract a backend must satisfy (it receives an FX GraphModule and a list of example inputs, and returns a callable).^[15]

Notable backends, both in-tree and third-party, include:

Backend	Maintainer	Target	Notes
TorchInductor	Meta / PyTorch	NVIDIA GPUs (Triton), AMD GPUs (Triton on ROCm), CPUs (C++/OpenMP)	Default backend; the focus of the ASPLOS 2024 paper.^[3]
`cudagraphs`	PyTorch	NVIDIA GPUs	Wraps captured graph in CUDA Graphs without Inductor codegen.^[12]
`onnxrt`	Microsoft / PyTorch	ONNX Runtime targets	Lowers FX graph to ONNX and dispatches via ONNX Runtime.^[14]
`openxla`	Google / PyTorch	XLA-supported hardware (TPUs, GPUs)	Routes through XLA via the PyTorch/XLA integration.^[14]
`tvm`	Apache TVM	Multiple	Lowers through Apache TVM's compiler stack.^[14]
Hidet	CentML, University of Toronto, AWS	NVIDIA GPUs	Selected via `backend="hidet"`; aimed at inference with operator-schedule search and Tensor Core utilization.^[16]
Intel Extension for PyTorch (IPEX)	Intel (OpenVINO ecosystem)	Intel CPUs and Intel GPUs	Provides Intel-optimized fusion patterns and dispatches to Intel libraries.^[14]

The PyTorch project moved most of these backends out of torch core into separate packages over the course of 2023 and 2024, leaving the in-tree set focused on Inductor, CUDA Graphs, and a small number of cross-vendor adapters.^[14]

Performance Numbers

Two distinct sets of headline numbers are commonly cited.

The PyTorch Conference 2022 keynote (December 2, 2022) and the official PyTorch 2.0 announcement reported, across 163 diverse open-source models tested on an A100:^[1]

Benchmark suite	Models tested	Training speedup (weighted)
HuggingFace Transformers	46	52%
TIMM (PyTorch Image Models)	61	38%
TorchBench	56	76%
Overall	163	43% (21% FP32, 51% AMP)

torch.compile was reported to work on 93% of the 163 models tested, with the weighted average defined as 0.75 times the AMP speedup plus 0.25 times the FP32 speedup to reflect that automatic mixed precision is the more common training configuration.^[1]
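The weighting can be checked against the rounded figures in the table. The helper name `weighted_speedup` is hypothetical; the weights and percentages are the ones quoted above.

```python
def weighted_speedup(amp_pct: float, fp32_pct: float) -> float:
    """Weighted average used in the announcement: 0.75 * AMP + 0.25 * FP32."""
    return 0.75 * amp_pct + 0.25 * fp32_pct


models_per_suite = {"HuggingFace Transformers": 46, "TIMM": 61, "TorchBench": 56}
total_models = sum(models_per_suite.values())

overall = weighted_speedup(amp_pct=51.0, fp32_pct=21.0)
print(f"{total_models} models, weighted speedup {overall:.1f}%")
# 163 models, weighted speedup 43.5%
```

The rounded inputs give 43.5%, which is consistent with the reported overall figure of 43% if the published AMP and FP32 percentages were themselves rounded.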

The ASPLOS 2024 paper, which uses a refreshed harness covering 180+ models on the same A100 hardware, reports a geometric-mean inference speedup of 2.27x and a geometric-mean training speedup of 1.41x for TorchInductor over eager mode, and finds that TorchInductor outperforms each of six other compared compilers across the harness.^[3] The paper also evaluates Dynamo's graph-capture coverage and overhead and presents detailed ablations of guard handling, dynamic shapes, and operator-decomposition policies.^[3]

PyTorch maintains a public nightly performance dashboard that runs the same HuggingFace, TIMM, and TorchBench harnesses on 12 GCP A100 nodes (each with a 40 GB A100) and publishes speedup, memory, and pass-rate metrics for every nightly build.^[17]

A second category of results comes from end-application studies. The November 2023 PyTorch blog post "PyTorch compile to speed up inference on Llama 2" reports the following inference latencies, measured on A100 80 GB hardware with torch.compile plus SDPA (and tensor parallelism for the 70B configuration) and compared against unoptimized eager execution:^[20]

Model	GPUs	Latency with `torch.compile` (ms/token)	Speedup vs. unoptimized
Llama 2 7B	1 x A100	reported sub-linear scaling vs. sequence length	reported "significant" speedup^[20]
Llama 2 13B	1 x A100	reported sub-linear scaling vs. batch size	reported "significant" speedup^[20]
Llama 2 70B	8 x A100	29	2.4x^[20]

For inference of compute-bound diffusion and transformer pipelines, the Hugging Face Diffusers PyTorch 2.0 documentation reports that torch.compile can provide "an additional speed-up of 5-300x on top of SDPA" depending on GPU architecture, with Ampere (A100, 3090), Ada (4090), and Hopper (H100) showing the largest gains; the upper end of that range corresponds to small batch sizes where kernel-launch and Python overheads dominate.^[21]

Applications

torch.compile is now the recommended path for accelerating training and inference of standard PyTorch models without rewriting them. Concrete deployment patterns include:

Speeding up Transformer training in Hugging Face Transformers and similar libraries by adding a single model = torch.compile(model) line after model construction. The PyTorch 2.0 announcement cites Sylvain Gugger reporting a 1.5x to 2x speedup on Hugging Face training scripts with no other changes.^[1]
Accelerating computer-vision training: TIMM models pass through torch.compile largely without modification, per Ross Wightman's quoted endorsement in the PyTorch 2.0 announcement.^[1]
Reducing inference latency on small models by combining mode="reduce-overhead" with CUDA Graphs to amortize kernel-launch and Python-interpreter overhead, which dominates eager-mode execution for short generation steps in language models.^[11]^[12]
Exporting models to other runtimes via alternate backends, for example backend="onnxrt" for ONNX Runtime or backend="openxla" for XLA-targeted hardware.^[14]
Distributed training: torch.compile is integrated with both DistributedDataParallel and FullyShardedDataParallel; the announcement reports up to 15% (FP32) and 80% (AMP) gains for compiled DDP.^[1]
Large-language-model inference: a November 2023 PyTorch blog post from an IBM Research and Meta team reports that combining torch.compile, SDPA (with FlashAttention), and tensor parallelism yields 29 ms/token latency on a 70-billion-parameter Llama 2 model running on 8 A100 80 GB GPUs, a 2.4x improvement over unoptimized inference under the same conditions (512-token input, 50-token generation).^[20]
High-throughput attention: from PyTorch 2.2 onward, scaled_dot_product_attention automatically dispatches to FlashAttention-2 on supported hardware, achieving 50 to 73 percent of theoretical maximum FLOPs on A100; combining SDPA with torch.compile is the standard recipe in current PyTorch examples.^[19]

It also unlocks ahead-of-time export. The torch.export API leverages Dynamo's graph capture to produce a serializable program representation, and AOTInductor (introduced in PyTorch 2.2) takes that exported graph and compiles it into a shared library suitable for non-Python server-side deployments, sharing backend code with the JIT path.^[19]

Limitations and Criticisms

Despite the speedups, the system has known constraints and rough edges:

Coverage is high but not 100%. The original announcement places coverage at 93% of 163 tested models, and graph breaks (where Dynamo cannot proceed and falls back to the interpreter) reduce the size of the compiled region; in extreme cases, performance can match or trail eager mode.^[1]^[9]
Cold-start compile times are non-trivial, particularly for mode="max-autotune", which searches over Triton kernel variants and can take significantly longer on the first call than mode="default". GitHub issues report that max-autotune can lead to overall execution times longer than eager when a model is run too few times for compilation to amortize.^[13]
Guard misses cause recompilation. Programs that pass tensors of many different shapes (variable-length sequences, ragged batches) may recompile repeatedly unless dynamic=True is enabled or shape ranges are constrained. Each recompilation adds latency and produces a distinct cache entry.^[9]^[10]
Hardware coverage at launch was narrow. The PyTorch 2.0 announcement explicitly noted that only NVIDIA Volta and Ampere GPUs and CPUs were supported out of the box, with desktop GPUs (NVIDIA 3090 and older) showing lower speedups than A100; broader GPU and accelerator coverage arrived in subsequent 2.x releases.^[1]
Debugging is harder than eager. Stack traces span Dynamo-rewritten bytecode, AOTAutograd-generated graphs, and Triton-emitted code, requiring developers to learn new logs (TORCH_LOGS=recompiles,graph_breaks,dynamo) and tools (torch._dynamo.explain) to localize regressions.^[9]^[10]
max-autotune on CPU is less mature than on CUDA, with only a subset of operations using templated kernels at launch and ongoing work to extend the GEMM template coverage.^[13]
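Whether compilation amortizes can be estimated with a simple break-even calculation. The sketch below is a back-of-the-envelope model that ignores recompilation, and all timings in it are hypothetical placeholders rather than measurements.

```python
import math
from typing import Optional


def break_even_calls(compile_ms: float, eager_ms: float, compiled_ms: float) -> Optional[int]:
    """Smallest call count at which total compiled time no longer exceeds eager time.

    Returns None when the compiled step is not faster, so compilation never pays off.
    """
    saving_per_call = eager_ms - compiled_ms
    if saving_per_call <= 0:
        return None
    return math.ceil(compile_ms / saving_per_call)


# Hypothetical numbers: a long autotuning compile versus a short default compile.
print(break_even_calls(compile_ms=120_000, eager_ms=50, compiled_ms=30))  # 6000
print(break_even_calls(compile_ms=5_000, eager_ms=50, compiled_ms=30))    # 250
print(break_even_calls(compile_ms=120_000, eager_ms=50, compiled_ms=55))  # None
```

Under these assumed timings, a two-minute compile needs thousands of calls to pay for itself, which is why longer-compiling modes suit long training runs or long-lived inference services better than short scripts.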

Version Timeline

Subsequent PyTorch 2.x releases have continued to expand torch.compile rather than supersede it. Notable per-release changes include:

Version	Released	Key `torch.compile` additions
2.0	2023-03-15	Initial public release of `torch.compile` with TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor; modes `default`, `reduce-overhead`, `max-autotune`; `dynamic`, `fullgraph`, `backend` parameters; SDPA front-end with FlashAttention v1 / Memory-Efficient Attention.^[2]
2.1	October 2023	Automatic dynamic shape support; NumPy API support inside compiled regions; CPU Inductor AVX-512 codegen; improved SDPA pattern matching.^[18]
2.2	2024-01-30	FlashAttention-2 integration; AOTInductor (ahead-of-time deployment); compiled optimizer improvements; horizontal fusion for `torch.cat`; `TORCH_LOGS` standardization.^[19]

The PyTorch Foundation publishes per-release notes via the project blog and the GitHub release tags, and the developer forum dev-discuss.pytorch.org tracks ongoing torch.compile RFCs.^[4]^[19]

Comparison

torch.compile is one of several framework-level graph compilers active in the deep-learning ecosystem.

System	Capture mechanism	Default codegen	Eager support	Sponsor
`torch.compile` (PyTorch 2)	Python bytecode rewrite via PEP 523	TorchInductor → Triton / C++	Yes, opt-in JIT	Meta / PyTorch Foundation^[1]^[3]
TorchScript (PyTorch 1)	Source tracing / scripting	TorchScript IR + custom fusers	Yes (separate path)	Meta
XLA (TensorFlow, JAX)	Lazy tensor tracing or `jit` decorator	XLA HLO → backend	TF eager, JAX trace-on-call	Google
Apache TVM (`tvm` backend)	Relay/Relax IR import	TVM tensor IR	Standalone compiler	Apache, OctoML
ONNX Runtime (`onnxrt` backend)	ONNX graph import	ORT execution providers	Standalone runtime	Microsoft, ONNX consortium

Compared to TorchScript, torch.compile distinguishes itself by not requiring users to make their models scriptable: Dynamo's bytecode-level capture handles native Python control flow that TorchScript could not. Compared to JAX's jit, PyTorch keeps eager mode as the default and treats compile as an optional wrapper rather than the dominant API.^[1]^[3]

A second axis of comparison is the codegen target. Where XLA and TVM use their own internal intermediate representations and target a wide range of hardware through compiler back-ends, TorchInductor stands out by emitting source code in two well-known languages: Triton (a Python-embedded DSL designed at OpenAI for GPU kernel programming) and standard C++ with OpenMP. The Python-implemented Inductor IR is intentionally small (around 50 operators), which the ASPLOS 2024 paper argues lets compiler engineers prototype optimizations in Python and avoids the upfront engineering cost of a custom IR per backend.^[1]^[3]

See also

PyTorch, the framework that hosts torch.compile.
Triton (compiler), the GPU kernel language TorchInductor emits.
JAX, a contemporary framework with a different graph-capture model (jit).
TensorFlow and its XLA backend, the major prior-generation graph compiler.
ONNX and the onnxrt backend for cross-framework deployment.
OpenVINO and Intel Extension for PyTorch, an Intel-targeted backend lineage.
NVIDIA A100, the GPU on which the headline benchmarks were collected.
Python PEP 523, the CPython frame evaluation API TorchDynamo uses.
FX, the PyTorch IR that TorchDynamo emits and TorchInductor consumes.

References

1. PyTorch Team, "PyTorch 2.x: faster, more pythonic and dynamic as ever", PyTorch, 2022-12-02 (updated for the 2.0 release). https://pytorch.org/get-started/pytorch-2-x/. Accessed 2026-05-21.
2. PyTorch Foundation, "PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever", PyTorch Blog, 2023-03-15. https://docs.pytorch.org/blog/pytorch-2.0-release/. Accessed 2026-05-21.
3. Jason Ansel, Edward Yang, Horace He, et al., "PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '24), 2024-04-27. https://docs.pytorch.org/assets/pytorch2-2.pdf. Accessed 2026-05-21.
4. PyTorch Team, "PyTorch 2 paper and tutorial @ ASPLOS 2024", PyTorch Blog, 2024-04. https://pytorch.org/blog/pytorch-pytorch-2-paper-tutorial/. Accessed 2026-05-21.
5. Jason Ansel, "About", jasonansel.com. https://jasonansel.com/. Accessed 2026-05-21.
6. PyTorch (official account), "We just introduced PyTorch 2.0 at the #PyTorchConference, introducing torch.compile! Available in the nightlies today, stable release Early March 2023", X (Twitter), 2022-12-02. https://x.com/PyTorch/status/1598708792598069249. Accessed 2026-05-21.
7. Brett Cannon, "PEP 523: Adding a frame evaluation API to CPython", Python Software Foundation, 2016. https://peps.python.org/pep-0523/. Accessed 2026-05-21.
8. PyTorch contributors, "torchdynamo (project README)", PyPI, 2022. https://pypi.org/project/torchdynamo/. Accessed 2026-05-21.
9. PyTorch Team, "Dynamo Overview", PyTorch documentation. https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html. Accessed 2026-05-21.
10. PyTorch Team, "Dynamic Shapes Core Concepts", PyTorch documentation. https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/dynamic_shapes_core_concepts.html. Accessed 2026-05-21.
11. PyTorch Team, "Introduction to torch.compile", PyTorch Tutorials. https://docs.pytorch.org/tutorials/intermediate/torch_compile_tutorial.html. Accessed 2026-05-21.
12. PyTorch Team, "CUDAGraph Trees", PyTorch documentation. https://docs.pytorch.org/docs/stable/user_guide/torch_compiler/torch.compiler_cudagraph_trees.html. Accessed 2026-05-21.
13. PyTorch Team, "Using Max-Autotune Compilation on CPU for Better Performance", PyTorch Tutorials. https://docs.pytorch.org/tutorials/unstable/max_autotune_on_CPU_tutorial.html. Accessed 2026-05-21.
14. PyTorch contributors, "[RFC]: Moving most torch.compile backends out of core", pytorch/pytorch issue #109687, GitHub, 2023-09. https://github.com/pytorch/pytorch/issues/109687. Accessed 2026-05-21.
15. PyTorch Forums, "Where to begin developing custom backend for torch.compiler?", discuss.pytorch.org. https://discuss.pytorch.org/t/where-to-begin-developing-custom-backend-for-torch-compiler/191182. Accessed 2026-05-21.
16. Yaoyao Ding, Bojian Zheng, Allan Lin, et al., "Introducing Hidet: A Deep Learning Compiler for Efficient Model Serving", PyTorch Blog, 2023-04-28. https://pytorch.org/blog/introducing-hidet/. Accessed 2026-05-21.
17. PyTorch Team, "PyTorch 2.0 Performance Dashboard", PyTorch documentation. https://docs.pytorch.org/docs/stable/torch.compiler_performance_dashboard.html. Accessed 2026-05-21.
18. Anthony Alford, "PyTorch 2.1 Release Supports Automatic Dynamic Shape Support and Distributed Training Enhancements", InfoQ, 2023-10. https://www.infoq.com/news/2023/10/pytorch21-at-pytorch-con-2023/. Accessed 2026-05-21.
19. PyTorch Team, "PyTorch 2.2: FlashAttention-v2 integration, AOTInductor", PyTorch Blog, 2024-01-30. https://pytorch.org/blog/pytorch2-2/. Accessed 2026-05-21.
20. Antoni Viros i Martin, Brian Vaughan, Davis Wertheimer, Joshua Rosenkranz, Mudhakar Srivatsa, Nelson Mimura Gonzalez, Raghu Ganti, Supriyo Chakraborty, Zhuoran Liu, Geeta Chauhan, Hamid Shojanazeri, "PyTorch compile to speed up inference on Llama 2", PyTorch Blog, 2023-11-07. https://pytorch.org/blog/pytorch-compile-to-speed-up-inference/. Accessed 2026-05-21.
21. Hugging Face, "Accelerated PyTorch 2.0 support in Diffusers", Diffusers documentation, 2023. https://huggingface.co/docs/diffusers/v0.14.0/optimization/torch2.0. Accessed 2026-05-21.
22. PyTorch contributors, "torch/_inductor/cudagraph_trees.py", pytorch/pytorch source tree, GitHub. https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cudagraph_trees.py. Accessed 2026-05-21.
