# torch.compile

> Source: https://aiwiki.ai/wiki/torch_compile
> Updated: 2026-06-07
> Categories: Deep Learning, Developer Tools, Training & Optimization

`torch.compile` is the just-in-time graph capture and compilation feature introduced in [PyTorch](/wiki/pytorch) 2.0, a release first announced at the PyTorch Conference on December 2, 2022 and shipped as a stable version on March 15, 2023.[^1][^2] Wrapping a model with a single call (`torch.compile(model)`) routes its [Python](/wiki/python) execution through a stack of new components (TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor) that lower unmodified eager-mode code into optimized [Triton](/wiki/triton) kernels on NVIDIA GPUs such as the [NVIDIA A100](/wiki/nvidia_a100), or into C++/OpenMP code on CPUs.[^1][^3] At the December 2022 announcement, Meta and the PyTorch Foundation reported weighted-average training speedups of 38% on TIMM, 52% on HuggingFace Transformers, and 76% on TorchBench; the follow-on ASPLOS 2024 paper measured geometric-mean speedups of 2.27x for inference and 1.41x for training across 180+ models on an A100.[^1][^3] The design preserves PyTorch's "eager by default" programming model while introducing optional just-in-time compilation, an approach Meta describes as the largest architectural shift in PyTorch since the 1.0 release.[^1][^4]

## Infobox

| Field | Value |
|---|---|
| Feature name | `torch.compile` |
| Project | PyTorch 2.x |
| Developer | Meta AI, PyTorch Foundation, broader open-source contributors |
| First announced | December 2, 2022 (PyTorch Conference) |
| Stable release | March 15, 2023 (PyTorch 2.0)[^2] |
| License | BSD-3-Clause (PyTorch repository) |
| Default backend | TorchInductor |
| GPU codegen target | OpenAI Triton |
| CPU codegen target | C++ with OpenMP |
| Repository | github.com/pytorch/pytorch |
| Reference paper | Ansel et al., ASPLOS 2024[^3] |

## Background

PyTorch was originally a "define-by-run" framework: each tensor operation executes immediately in the Python interpreter. This eager mode made the library popular among researchers but was historically slower than graph-based frameworks like [TensorFlow](/wiki/tensorflow) 1 or [XLA](/wiki/xla)-backed systems.[^4] Several earlier attempts to add compilation to PyTorch (TorchScript, FX tracing, Lazy Tensor, and various nvFuser/NNC fusers) each captured only part of a model or required users to rewrite code, and none became the default path.[^3]

Meta began the work that became `torch.compile` as TorchDynamo in 2020, prototyped by Jason Ansel; the ASPLOS 2024 paper describes the system as the outcome of a roughly five-year effort to find a graph-capture mechanism robust enough to work on arbitrary PyTorch programs.[^3][^5] The official PyTorch 2.0 retrospective frames TorchDynamo's bytecode-rewriting approach as "the result of 5 years of R&D into safe graph capture."[^1]

The public roll-out followed a deliberate sequence:

* December 2, 2022. PyTorch 2.0 unveiled at the PyTorch Conference in New Orleans, with `torch.compile` available in nightly builds and a stable release promised for early March 2023.[^1][^6]
* March 15, 2023. PyTorch 2.0 released as a stable version, accompanied by the official blog post "PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever."[^2]
* October 2023. PyTorch 2.1 ships at the PyTorch Conference 2023, adding automatic dynamic shape support to `torch.compile`, NumPy API support inside compiled regions, AVX-512 codegen on CPU, and improved Inductor schedulers.[^18]
* January 30, 2024. PyTorch 2.2 ships with FlashAttention-2 integration in `scaled_dot_product_attention`, the AOTInductor ahead-of-time deployment path, optimizer compilation improvements, horizontal fusion for `torch.cat`, and the unified `TORCH_LOGS` logging system.[^19]
* April 27 to May 1, 2024. The 50+-author paper "PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation" presented at the 29th ACM ASPLOS conference in San Diego, with the proceedings published in early 2024.[^3][^5]

PyTorch 2.0 is fully backward-compatible with PyTorch 1.x: `torch.compile` is an opt-in wrapper, not a new programming model, and code that does not call it continues to execute eagerly with no changes.[^1][^2] The PyTorch Foundation, the Linux Foundation-hosted body that took stewardship of the project in late 2022, presents `torch.compile` as the flagship feature of the 2.x line and the central piece around which subsequent releases organize their performance work.[^4][^19]

## How It Works

`torch.compile(fn)` returns a callable that, on first invocation, traces and compiles the underlying Python function or `nn.Module`. The stack underneath is layered, with each component responsible for a different stage of lowering a Python program down to device code.[^1][^3]
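As a minimal sketch of the wrapping step described above, the example below compiles a small function and compares its output with eager execution. The function name `gelu_mlp` and the tensor sizes are illustrative assumptions, and `backend="eager"` is chosen only so the sketch runs without a C++ or Triton toolchain; omitting it selects the default TorchInductor backend.

```python
import torch

def gelu_mlp(x, w1, w2):
    # Plain eager-style PyTorch code; nothing here is compile-specific.
    return torch.nn.functional.gelu(x @ w1) @ w2

# Assumption for portability: the "eager" backend runs the captured graph
# without code generation. Drop the argument to use TorchInductor.
compiled_mlp = torch.compile(gelu_mlp, backend="eager")

torch.manual_seed(0)
x, w1, w2 = torch.randn(8, 16), torch.randn(16, 32), torch.randn(32, 4)

eager_out = gelu_mlp(x, w1, w2)
compiled_out = compiled_mlp(x, w1, w2)  # first call triggers tracing
```

Later calls with the same input shapes and dtypes reuse the cached artifact rather than tracing again.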

### TorchDynamo (graph capture)

TorchDynamo is a Python-level just-in-time compiler that hooks into the CPython frame evaluation API specified by PEP 523.[^7][^8] When a wrapped function is called, Dynamo intercepts the frame, walks the function's bytecode, and rewrites it on the fly: PyTorch operations are extracted into an [FX](/wiki/graph) graph (`torch.fx.GraphModule`), while constructs that cannot be safely captured (data-dependent control flow, arbitrary Python side effects, calls to C extensions) cause a "graph break" and fall back to the regular CPython interpreter.[^3][^9]

Because Dynamo operates on bytecode rather than the Python source, it can capture code paths that traditional symbolic tracing cannot reach, including `if`/`else` branches, list comprehensions, and many third-party library calls. The ASPLOS 2024 paper reports that Dynamo captures graphs more robustly than prior PyTorch approaches while adding minimal overhead.[^3] On a 7,000-plus repository GitHub corpus, Meta reports a 99% graph-capture rate, with the remaining 1% falling back to eager execution.[^1]
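One way to observe graph breaks is to pass a custom backend that simply records each FX graph Dynamo hands over. The sketch below is illustrative only: `recording_backend` and `branchy` are hypothetical names, and the exact number of graphs produced may differ between PyTorch releases.

```python
import torch

captured_graphs = []

def recording_backend(gm, example_inputs):
    # Dynamo calls the backend once per captured FX graph.
    captured_graphs.append(gm)
    return gm.forward  # run the graph as-is, with no further compilation

def branchy(x):
    y = x * 2
    if y.sum() > 0:  # branch depends on tensor data: expected graph break
        return y + 1
    return y - 1

compiled_branchy = torch.compile(branchy, backend=recording_backend)
x = torch.ones(4)
result = compiled_branchy(x)
num_graphs = len(captured_graphs)
print("graphs captured:", num_graphs)
```

The data-dependent `if` is expected to split the function into more than one graph, with the CPython interpreter evaluating the branch between them.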

PEP 523, accepted into [Python](/wiki/python) 3.6, added a hook in the CPython interpreter that lets external C code substitute its own frame-evaluation function in place of `_PyEval_EvalFrameDefault`. Originally motivated as a generic JIT entry point, it became the foundation TorchDynamo uses to intercept Python execution without modifying the user's source code or requiring AST-level analysis.[^7] When TorchDynamo is active it installs a custom frame evaluator that, for each Python function call, decides whether to rewrite the bytecode (extracting an FX graph and substituting calls to the compiled artifact) or to defer to the default evaluator. Cached compilations are keyed on the bytecode object plus the guard set, so successive calls with the same shapes and types reuse the existing compiled artifact at near-zero overhead.[^7][^9]
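To make the notion of "walking the bytecode" concrete, the standard-library `dis` module can list the instructions of an ordinary function. This sketch does not use Dynamo at all; it only shows the kind of code object a PEP 523 frame evaluator receives, and `scale_and_shift` is a hypothetical example function. Opcode names vary between CPython versions.

```python
import dis

def scale_and_shift(x, scale, shift):
    return x * scale + shift

# The code object and its bytecode are what a frame evaluator gets to inspect.
instructions = list(dis.get_instructions(scale_and_shift))
opnames = [ins.opname for ins in instructions]

for ins in instructions:
    print(ins.offset, ins.opname, ins.argrepr)
```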

### Guard-based recompilation

Each compiled graph is paired with a set of "guards", runtime checks that confirm the assumptions Dynamo made during tracing (tensor shapes, dtypes, attribute values, Python types) still hold for a new call.[^9][^10] If all guards pass, the cached compiled artifact is reused; if a guard fails, Dynamo retraces and produces a new specialization, optionally widening the assumption (for example, converting a fixed shape into a symbolic one) so the next call is more likely to hit the cache.[^9]
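The guard-and-cache idea can be illustrated with a toy model in plain Python. This is not Dynamo's implementation: `ToyGuardedFunction` and `CacheEntry` are hypothetical, the "guard" is just the element type and length of a list (stand-ins for dtype and shape), and "compilation" is simulated by counting cache misses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class CacheEntry:
    guard: Tuple[type, int]            # assumptions recorded at "trace" time
    compiled: Callable[[list], list]   # artifact valid under those assumptions

@dataclass
class ToyGuardedFunction:
    fn: Callable[[list], list]
    entries: List[CacheEntry] = field(default_factory=list)
    compile_count: int = 0

    def _make_guard(self, arg):
        # Toy guard: element type and length of a non-empty list.
        return (type(arg[0]), len(arg))

    def __call__(self, arg):
        guard = self._make_guard(arg)
        for entry in self.entries:
            if entry.guard == guard:   # guards pass: reuse cached artifact
                return entry.compiled(arg)
        # Guard miss: "recompile" and store a new specialization.
        self.compile_count += 1
        self.entries.append(CacheEntry(guard, self.fn))
        return self.fn(arg)

double = ToyGuardedFunction(lambda xs: [2 * v for v in xs])
double([1, 2, 3])        # miss: first specialization
double([4, 5, 6])        # hit: same type and length
double([1.0, 2.0, 3.0])  # miss: element type changed
double([1, 2])           # miss: length changed
```

A real implementation would also widen a failed guard (for example, to a symbolic length) rather than adding a new exact entry each time.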

The default is `dynamic=None`, which enables automatic dynamic shapes: the first call specializes on the observed shapes, and if a subsequent call violates that guard, the function is recompiled with symbolic shape support, using a `ShapeEnv` and SymPy expressions to reason about size relationships.[^10] Setting `dynamic=True` skips the first specialization and traces symbolically from the start, while `dynamic=False` always specializes. Setting `TORCH_LOGS=recompiles` causes the compiler to log each recompilation along with the guards that triggered it, which is the primary debugging tool for guard failures.[^10]
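The difference between the two settings can be observed by counting how often a backend is invoked for inputs of varying size. The sketch below assumes a hypothetical helper `count_compiles` and a counting backend; exact counts depend on the PyTorch release, so it only relies on the symbolic run compiling once.

```python
import torch
import torch._dynamo

def count_compiles(dynamic):
    torch._dynamo.reset()  # start from an empty compile cache
    compiles = 0

    def counting_backend(gm, example_inputs):
        nonlocal compiles
        compiles += 1      # one call per (re)compilation
        return gm.forward

    def fn(x):
        return x * 2 + 1

    compiled = torch.compile(fn, backend=counting_backend, dynamic=dynamic)
    for n in (4, 8, 16):   # three different input lengths
        compiled(torch.randn(n))
    return compiles

auto_compiles = count_compiles(dynamic=None)     # specialize, then generalize
dynamic_compiles = count_compiles(dynamic=True)  # symbolic from the start
print(auto_compiles, dynamic_compiles)
```

On releases with automatic dynamic shapes, the first run is expected to compile more often than the second, because it specializes on the first observed length before generalizing.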

### AOTAutograd (joint forward/backward capture)

To accelerate training and not just inference, PyTorch 2.0 introduced AOTAutograd, an "ahead-of-time" autograd engine that traces both the forward and backward graphs by re-using the standard dispatcher.[^1][^3] AOTAutograd intercepts the captured Dynamo graph, decomposes operators against the PyTorch dispatcher, runs the autograd engine ahead of time on FakeTensors, and emits a joint forward-and-backward FX graph that the downstream compiler can optimize as a single unit.[^3] Without AOTAutograd, only the forward pass would be compiled, leaving training largely on the eager path; with it, the entire training step (including activation save/restore) becomes a single compiled artifact.[^1]
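A small way to exercise this path without generating device code is the `aot_eager` debugging backend, which runs Dynamo capture and AOTAutograd tracing and then executes the resulting graphs eagerly. The sketch below (with a hypothetical `loss_fn`) checks that gradients from the compiled function match eager autograd; it is a sanity check under these assumptions, not a performance demonstration.

```python
import torch

def loss_fn(x, w):
    return torch.tanh(x @ w).pow(2).mean()

torch.manual_seed(0)
x = torch.randn(8, 4)
w_eager = torch.randn(4, 3, requires_grad=True)
w_aot = w_eager.detach().clone().requires_grad_(True)

# Eager reference: forward and backward run op by op.
loss_fn(x, w_eager).backward()

# "aot_eager": forward and backward graphs are traced ahead of time,
# then run without Inductor code generation.
compiled_loss = torch.compile(loss_fn, backend="aot_eager")
compiled_loss(x, w_aot).backward()

max_grad_diff = (w_eager.grad - w_aot.grad).abs().max().item()
print("max gradient difference:", max_grad_diff)
```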

### PrimTorch (operator decomposition)

PyTorch has on the order of two thousand operators (counting overloads), too many for any single backend to implement directly. PrimTorch defines two smaller, stable operator sets that the captured graph is lowered into: the "Prim ops" set of roughly 250 low-level primitives suited for compilers, and the "ATen ops" set of approximately 750 higher-level operators suited for backends that prefer to consume larger building blocks.[^1] Decompositions are written in Python so they can be reused across backends and traced through Dynamo.[^1][^3]
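The effect of a Python decomposition can be sketched by tracing the same function with and without a decomposition table. The example relies on `torch._decomp.get_decompositions` and `make_fx`, which are internal or experimental interfaces that may change, and it assumes a decomposition for `aten.hardswish` is registered in the installed release; `activation` and `op_targets` are hypothetical names.

```python
import torch
from torch._decomp import get_decompositions
from torch.fx.experimental.proxy_tensor import make_fx

aten = torch.ops.aten

def activation(x):
    return torch.nn.functional.hardswish(x)

x = torch.randn(6)

# Trace once as-is, and once with a Python decomposition for hardswish.
plain_graph = make_fx(activation)(x)
decomposed_graph = make_fx(
    activation, decomposition_table=get_decompositions([aten.hardswish])
)(x)

def op_targets(gm):
    # Collect the operator called by each call_function node.
    return [n.target for n in gm.graph.nodes if n.op == "call_function"]

plain_ops = op_targets(plain_graph)
decomposed_ops = op_targets(decomposed_graph)
print(plain_ops)
print(decomposed_ops)
```

In the decomposed graph the single `hardswish` call is expected to be replaced by a handful of simpler elementwise operators, which is what lets a backend support fewer operators.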

### TorchInductor (default code generator)

TorchInductor is the default backend that consumes the post-AOTAutograd, post-PrimTorch graph and produces device code.[^1][^3] On NVIDIA GPUs it lowers to [OpenAI Triton](/wiki/triton) kernels; on CPUs it generates C++ source files compiled with a normal host compiler and parallelized with OpenMP.[^3] Inductor uses a Python loop-level intermediate representation with about 50 operators that is deliberately small and extensible, and it performs scheduling, fusion, and memory-planning passes before emitting kernels.[^1] The ASPLOS 2024 paper reports that Inductor outperforms six other compilers tested in the same harness across 180+ models.[^3]

Code generation proceeds in three logical phases. First, the captured FX graph is decomposed and normalized using PrimTorch decompositions; second, Inductor lowers the result into its loop-level IR and performs scheduling decisions (op fusion, tiling, memory layout); third, it emits Triton kernels (for CUDA and ROCm targets) or C++ source files (for CPU targets) and compiles them with the corresponding toolchain. The Triton path benefits from Triton's block-level programming model, which lets a single kernel author cover a wide range of tile sizes and broadcasting patterns without writing per-shape variants.[^1][^3]

### Compile modes

`torch.compile` accepts a `mode` argument that selects an optimization preset:

| Mode | Behavior | Typical use |
|---|---|---|
| `"default"` | Balanced compile time and runtime, suitable for large models. | Most training and inference workloads.[^11] |
| `"reduce-overhead"` | Wraps the compiled graph in [CUDA](/wiki/cuda) Graphs to eliminate per-launch Python and driver overhead, at the cost of additional memory. | Small models and low-latency inference where launch overhead dominates.[^11][^12] |
| `"max-autotune"` | Searches over Triton kernel variants, fusion schedules, and templated GEMMs (with optional Cutlass and cuDNN templates); compile times are much longer but generated code is the fastest available. | Latency-critical deployment or repeated training runs where compile cost amortizes.[^13] |

Additional keyword arguments include `fullgraph=True` (raise on any graph break, useful for export and debugging) and `dynamic=True/False/None` (control symbolic-shape behavior).[^9][^10] The reduce-overhead mode is implemented via "CUDAGraph Trees", a refinement that records multiple captured graphs sharing a single memory pool so that different execution paths can be replayed without separate allocations.[^12]

CUDAGraph Trees specifically address a problem that arises when chaining several captured CUDA graphs together: a naive approach would force each graph to use a separate memory pool, blowing up activation memory and adding host-side copies to move intermediates between graphs. Trees instead share a single pool across all graphs in the tree and use a tensor-liveness tracker (implemented in `torch/_inductor/cudagraph_trees.py`) to reuse dead memory across replays, retaining the launch-overhead savings without paying the memory cost.[^12][^22] Because CUDA Graphs require fixed tensor addresses and shapes, Trees record a fresh CUDA graph for each unique input shape; for workloads with very high shape variability, the re-record cost can outweigh the savings, in which case `mode="default"` (without CUDA Graphs) is preferred.[^12][^22]

### `torch.export` and AOTInductor

`torch.export` is a sibling API that reuses Dynamo to capture a "complete" graph (one with no graph breaks) of an `nn.Module` plus example inputs, then serializes it as an `ExportedProgram`. PyTorch 2.2 added AOTInductor, an ahead-of-time variant of TorchInductor that consumes an exported program and emits a compiled shared library suitable for loading in non-Python server-side environments, sharing the same Triton/C++ codegen as the JIT path.[^19] This lets a single backend serve both interactive development (via `torch.compile`) and production deployment (via `torch.export` and AOTInductor).

## Implementations and Backends

Although TorchInductor is the default, `torch.compile` exposes a `backend=` argument that selects an alternative compiler; the canonical list can be retrieved with `torch.compiler.list_backends()` and includes `inductor`, `cudagraphs`, `onnxrt`, `openxla`, `openxla_eval`, and `tvm`, with other backends moved out of core into companion packages.[^14] Custom backends can be registered through the `torch._dynamo` extension API; the developer documentation walks through the contract a backend must satisfy (it receives an FX `GraphModule` and a list of example inputs, and returns a callable).[^15]

Notable backends, both in-tree and third-party, include:

| Backend | Maintainer | Target | Notes |
|---|---|---|---|
| TorchInductor | Meta / PyTorch | NVIDIA GPUs (Triton), AMD GPUs (Triton on ROCm), CPUs (C++/OpenMP) | Default backend; the focus of the ASPLOS 2024 paper.[^3] |
| `cudagraphs` | PyTorch | NVIDIA GPUs | Wraps captured graph in CUDA Graphs without Inductor codegen.[^12] |
| `onnxrt` | Microsoft / PyTorch | ONNX Runtime targets | Lowers FX graph to [ONNX](/wiki/onnx) and dispatches via ONNX Runtime.[^14] |
| `openxla` | Google / PyTorch | XLA-supported hardware (TPUs, GPUs) | Routes through [XLA](/wiki/xla) via the PyTorch/XLA integration.[^14] |
| `tvm` | Apache TVM | Multiple | Lowers through Apache TVM's compiler stack.[^14] |
| Hidet | CentML, University of Toronto, AWS | NVIDIA GPUs | Selected via `backend="hidet"`; aimed at inference with operator-schedule search and Tensor Core utilization.[^16] |
| Intel Extension for PyTorch (IPEX) | Intel ([OpenVINO](/wiki/openvino) ecosystem) | Intel CPUs and Intel GPUs | Provides Intel-optimized fusion patterns and dispatches to Intel libraries.[^14] |

The PyTorch project moved most of these backends out of `torch` core into separate packages over the course of 2023 and 2024, leaving the in-tree set focused on Inductor, CUDA Graphs, and a small number of cross-vendor adapters.[^14]

## Performance Numbers

Two distinct sets of headline numbers are commonly cited.

The PyTorch Conference 2022 keynote (December 2, 2022) and the official PyTorch 2.0 announcement reported, across 163 diverse open-source models tested on an A100:[^1]

| Benchmark suite | Models tested | Training speedup (weighted) |
|---|---|---|
| HuggingFace Transformers | 46 | 52% |
| TIMM (PyTorch Image Models) | 61 | 38% |
| TorchBench | 56 | 76% |
| Overall | 163 | 43% (21% FP32, 51% AMP) |

`torch.compile` was reported to work on 93% of the 163 models tested, with the weighted average defined as 0.75 times the AMP speedup plus 0.25 times the FP32 speedup to reflect that automatic mixed precision is the more common training configuration.[^1]
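As a quick arithmetic check, the weighting rule can be applied to the overall FP32 and AMP figures from the table. The variable names are illustrative; the inputs are the rounded percentages quoted above, so the result is only approximate.

```python
AMP_WEIGHT, FP32_WEIGHT = 0.75, 0.25

amp_speedup_pct = 51.0   # overall AMP speedup from the table
fp32_speedup_pct = 21.0  # overall FP32 speedup from the table

# Weighted average as defined in the announcement.
weighted_pct = AMP_WEIGHT * amp_speedup_pct + FP32_WEIGHT * fp32_speedup_pct

# The three suites should add up to the overall model count.
total_models = 46 + 61 + 56

print(weighted_pct, total_models)
```

The computed 43.5% is consistent with the reported overall figure of 43% once rounding of the inputs is taken into account.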

The ASPLOS 2024 paper, which uses a refreshed harness covering 180+ models on the same A100 hardware, reports a geometric-mean inference speedup of 2.27x and a geometric-mean training speedup of 1.41x for TorchInductor over eager mode, and finds that TorchInductor outperforms each of six other compared compilers across the harness.[^3] The paper also evaluates Dynamo's graph-capture coverage and overhead and presents detailed ablations of guard handling, dynamic shapes, and operator-decomposition policies.[^3]

PyTorch maintains a public nightly performance dashboard that runs the same HuggingFace, TIMM, and TorchBench harnesses on 12 GCP A100 nodes (each with a 40 GB A100) and publishes speedup, memory, and pass-rate metrics for every nightly build.[^17]

A second category of results comes from end-application studies. The November 2023 PyTorch blog "PyTorch compile to speed up inference on Llama 2" reports the following inference latencies, measured on A100 80 GB hardware with `torch.compile` plus SDPA (and tensor parallelism for the 70B configuration) and compared against unoptimized eager execution:[^20]

| Model | GPUs | Latency with `torch.compile` (ms/token) | Speedup vs. unoptimized |
|---|---|---|---|
| Llama 2 7B | 1 x A100 | reported sub-linear scaling vs. sequence length | reported "significant" speedup[^20] |
| Llama 2 13B | 1 x A100 | reported sub-linear scaling vs. batch size | reported "significant" speedup[^20] |
| Llama 2 70B | 8 x A100 | 29 | 2.4x[^20] |

For inference of compute-bound diffusion and transformer pipelines, the Hugging Face Diffusers PyTorch 2.0 documentation reports that `torch.compile` can provide "an additional speed-up of 5-300x on top of SDPA" depending on GPU architecture, with Ampere (A100, 3090), Ada (4090), and Hopper (H100) showing the largest gains; the upper end of that range corresponds to small batch sizes where kernel-launch and Python overheads dominate.[^21]

## Applications

`torch.compile` is now the recommended path for accelerating training and inference of standard PyTorch models without rewriting them. Concrete deployment patterns include:

* Speeding up [Transformer](/wiki/transformers) training in Hugging Face Transformers and similar libraries by adding a single `model = torch.compile(model)` line after model construction. The PyTorch 2.0 announcement cites Sylvain Gugger reporting a 1.5x to 2x speedup on Hugging Face training scripts with no other changes.[^1]
* Accelerating computer-vision training: TIMM models pass through `torch.compile` largely without modification, per Ross Wightman's quoted endorsement in the PyTorch 2.0 announcement.[^1]
* Reducing inference latency on small models with `mode="reduce-overhead"`, which uses CUDA Graphs to amortize the kernel-launch and Python-interpreter overhead that dominates eager-mode execution for short generation steps in language models.[^11][^12]
* Exporting models to other runtimes via alternate backends, for example `backend="onnxrt"` for ONNX Runtime or `backend="openxla"` for XLA-targeted hardware.[^14]
* Distributed training: `torch.compile` is integrated with both DistributedDataParallel and FullyShardedDataParallel; the announcement reports up to 15% (FP32) and 80% (AMP) gains for compiled DDP.[^1]
* Large-language-model inference: a November 2023 PyTorch blog post from an IBM Research and Meta team reports that combining `torch.compile`, SDPA (with FlashAttention), and tensor parallelism yields 29 ms/token latency on a 70-billion-parameter Llama 2 model running on 8 A100 80 GB GPUs, a 2.4x improvement over unoptimized inference under the same conditions (512-token input, 50-token generation).[^20]
* High-throughput attention: from PyTorch 2.2 onward, `scaled_dot_product_attention` automatically dispatches to FlashAttention-2 on supported hardware, achieving 50 to 73 percent of theoretical maximum FLOPs on A100; combining SDPA with `torch.compile` is the standard recipe in current PyTorch examples.[^19]

It also unlocks ahead-of-time export. The `torch.export` API leverages Dynamo's graph capture to produce a serializable program representation, and AOTInductor (introduced in PyTorch 2.2) takes that exported graph and compiles it into a shared library suitable for non-Python server-side deployments, sharing backend code with the JIT path.[^19]
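The SDPA-plus-compile recipe listed above can be sketched as follows. The tensor sizes and the names `attention` and `reference_attention` are illustrative assumptions, and `backend="eager"` is used only so the example runs on a CPU without a compiler toolchain; on a GPU the default backend would be the usual choice.

```python
import math

import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Fused attention entry point; kernel selection is left to PyTorch.
    return F.scaled_dot_product_attention(q, k, v)

def reference_attention(q, k, v):
    # Unfused reference: softmax(Q K^T / sqrt(d)) V
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
# (batch, heads, sequence length, head dimension)
q, k, v = (torch.randn(2, 4, 16, 8) for _ in range(3))

compiled_attention = torch.compile(attention, backend="eager")
out = compiled_attention(q, k, v)
```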

## Limitations and Criticisms

Despite the speedups, the system has known constraints and rough edges:

* Coverage is high but not 100%. The original announcement places coverage at 93% of 163 tested models, and graph breaks (where Dynamo cannot proceed and falls back to the interpreter) reduce the size of the compiled region; in extreme cases, performance can match or trail eager mode.[^1][^9]
* Cold-start compile times are non-trivial, particularly for `mode="max-autotune"`, which searches over Triton kernel variants and makes the first call significantly slower than under `mode="default"`. GitHub issues report that `max-autotune` can lead to overall execution times longer than eager when a model is run too few times for the compilation cost to amortize.[^13]
* Guard misses cause recompilation. Programs that pass tensors of many different shapes (variable-length sequences, ragged batches) may recompile repeatedly unless `dynamic=True` is enabled or shape ranges are constrained. Each recompilation adds latency and produces a distinct cache entry.[^9][^10]
* Hardware coverage at launch was narrow. The PyTorch 2.0 announcement explicitly noted that only NVIDIA Volta and Ampere GPUs and CPUs were supported out of the box, with desktop GPUs (NVIDIA 3090 and older) showing lower speedups than A100; broader GPU and accelerator coverage arrived in subsequent 2.x releases.[^1]
* Debugging is harder than eager. Stack traces span Dynamo-rewritten bytecode, AOTAutograd-generated graphs, and Triton-emitted code, requiring developers to learn new logs (`TORCH_LOGS=recompiles,graph_breaks,dynamo`) and tools (`torch._dynamo.explain`) to localize regressions.[^9][^10]
* `max-autotune` on CPU is less mature than on CUDA, with only a subset of operations using templated kernels at launch and ongoing work to extend the GEMM template coverage.[^13]
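For locating graph breaks, one hedged approach is to compile with `fullgraph=True` so that a break surfaces as an exception instead of a silent fallback. In the sketch below `branchy` is a hypothetical function; the exception class raised differs between PyTorch releases, so the example catches `Exception` broadly and only prints the type name.

```python
import torch

def branchy(x):
    if x.sum() > 0:  # data-dependent branch that cannot be captured
        return x + 1
    return x - 1

# fullgraph=True asks Dynamo to raise instead of breaking the graph.
strict = torch.compile(branchy, backend="eager", fullgraph=True)

try:
    strict(torch.ones(3))
    raised = False
except Exception as exc:  # exact exception class is version-dependent
    raised = True
    print("graph break reported as:", type(exc).__name__)
```

Running the same script with `TORCH_LOGS=graph_breaks` set in the environment logs the break location without raising.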

## Version Timeline

Subsequent PyTorch 2.x releases have continued to expand `torch.compile` rather than supersede it. Notable per-release changes include:

| Version | Released | Key `torch.compile` additions |
|---|---|---|
| 2.0 | 2023-03-15 | Initial public release of `torch.compile` with TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor; modes `default`, `reduce-overhead`, `max-autotune`; `dynamic`, `fullgraph`, `backend` parameters; SDPA front-end with FlashAttention v1 / Memory-Efficient Attention.[^2] |
| 2.1 | October 2023 | Automatic dynamic shape support; NumPy API support inside compiled regions; CPU Inductor AVX-512 codegen; improved SDPA pattern matching.[^18] |
| 2.2 | 2024-01-30 | FlashAttention-2 integration; AOTInductor (ahead-of-time deployment); compiled optimizer improvements; horizontal fusion for `torch.cat`; `TORCH_LOGS` standardization.[^19] |

The PyTorch Foundation publishes per-release notes via the project blog and the GitHub release tags, and the developer forum at `dev-discuss.pytorch.org` tracks ongoing `torch.compile` RFCs.[^4][^19]

## Comparison

`torch.compile` is one of several framework-level graph compilers active in the deep-learning ecosystem.

| System | Capture mechanism | Default codegen | Eager support | Sponsor |
|---|---|---|---|---|
| `torch.compile` (PyTorch 2) | Python bytecode rewrite via PEP 523 | TorchInductor → Triton / C++ | Yes, opt-in JIT | Meta / PyTorch Foundation[^1][^3] |
| TorchScript (PyTorch 1) | Source tracing / scripting | TorchScript IR + custom fusers | Yes (separate path) | Meta |
| [XLA](/wiki/xla) (TensorFlow, JAX) | Lazy tensor tracing or `jit` decorator | XLA HLO → backend | TF eager, [JAX](/wiki/jax) trace-on-call | Google |
| Apache TVM (`tvm` backend) | Relay/Relax IR import | TVM tensor IR | Standalone compiler | Apache, OctoML |
| ONNX Runtime (`onnxrt` backend) | ONNX graph import | ORT execution providers | Standalone runtime | Microsoft, ONNX consortium |

Compared to TorchScript, `torch.compile` distinguishes itself by not requiring users to make their models scriptable: Dynamo's bytecode-level capture handles native Python control flow that TorchScript could not. Compared to JAX's `jit`, PyTorch keeps eager mode as the default and treats `compile` as an optional wrapper rather than the dominant API.[^1][^3]

A second axis of comparison is the codegen target. Where XLA and TVM use their own internal intermediate representations and target a wide range of hardware through compiler back-ends, TorchInductor stands out by emitting source code in two well-known languages: Triton (a Python-embedded DSL designed at OpenAI for GPU kernel programming) and standard C++ with OpenMP. The Python-implemented Inductor IR is intentionally small (around 50 operators), which the ASPLOS 2024 paper argues lets compiler engineers prototype optimizations in Python and avoids the upfront engineering cost of a custom IR per backend.[^1][^3]

## Related Work

* [PyTorch](/wiki/pytorch), the framework that hosts `torch.compile`.
* [Triton (compiler)](/wiki/triton), the GPU kernel language TorchInductor emits.
* [JAX](/wiki/jax), a contemporary framework with a different graph-capture model (`jit`).
* [TensorFlow](/wiki/tensorflow) and its [XLA](/wiki/xla) backend, the major prior-generation graph compiler.
* [ONNX](/wiki/onnx) and the `onnxrt` backend for cross-framework deployment.
* [OpenVINO](/wiki/openvino) and Intel Extension for PyTorch, an Intel-targeted backend lineage.
* [NVIDIA A100](/wiki/nvidia_a100), the GPU on which the headline benchmarks were collected.
* [Python](/wiki/python) PEP 523, the CPython frame evaluation API TorchDynamo uses.
* [FX](/wiki/graph), the PyTorch IR that TorchDynamo emits and TorchInductor consumes.

## See also

* [PyTorch](/wiki/pytorch)
* [PyTorch Lightning](/wiki/pytorch_lightning)
* [Triton (compiler)](/wiki/triton)
* [XLA (Accelerated Linear Algebra)](/wiki/xla)
* [JAX](/wiki/jax)
* [TensorFlow](/wiki/tensorflow)
* [ONNX](/wiki/onnx)
* [OpenVINO](/wiki/openvino)
* [NVIDIA A100](/wiki/nvidia_a100)
* [CUDA](/wiki/cuda)
* [Hugging Face Transformers](/wiki/transformers_library)
* [Python (programming language)](/wiki/python)
* [Deep Learning](/wiki/deep_learning)
* [Meta AI (Company)](/wiki/meta_ai_company)

## References

[^1]: PyTorch Team, "PyTorch 2.x: faster, more pythonic and dynamic as ever", PyTorch, 2022-12-02 (updated for the 2.0 release). https://pytorch.org/get-started/pytorch-2-x/. Accessed 2026-05-21.
[^2]: PyTorch Foundation, "PyTorch 2.0: Our next generation release that is faster, more Pythonic and Dynamic as ever", PyTorch Blog, 2023-03-15. https://docs.pytorch.org/blog/pytorch-2.0-release/. Accessed 2026-05-21.
[^3]: Jason Ansel, Edward Yang, Horace He, et al., "PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation", Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '24), 2024-04-27. https://docs.pytorch.org/assets/pytorch2-2.pdf. Accessed 2026-05-21.
[^4]: PyTorch Team, "PyTorch 2 paper and tutorial @ ASPLOS 2024", PyTorch Blog, 2024-04. https://pytorch.org/blog/pytorch-pytorch-2-paper-tutorial/. Accessed 2026-05-21.
[^5]: Jason Ansel, "About", jasonansel.com, accessed 2026. https://jasonansel.com/. Accessed 2026-05-21.
[^6]: PyTorch (official account), "We just introduced PyTorch 2.0 at the #PyTorchConference, introducing torch.compile! Available in the nightlies today, stable release Early March 2023", X (Twitter), 2022-12-02. https://x.com/PyTorch/status/1598708792598069249. Accessed 2026-05-21.
[^7]: Brett Cannon, "PEP 523: Adding a frame evaluation API to CPython", Python Software Foundation, 2016. https://peps.python.org/pep-0523/. Accessed 2026-05-21.
[^8]: PyTorch contributors, "torchdynamo (project README)", PyPI, 2022. https://pypi.org/project/torchdynamo/. Accessed 2026-05-21.
[^9]: PyTorch Team, "Dynamo Overview", PyTorch documentation. https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html. Accessed 2026-05-21.
[^10]: PyTorch Team, "Dynamic Shapes Core Concepts", PyTorch documentation. https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/dynamic_shapes_core_concepts.html. Accessed 2026-05-21.
[^11]: PyTorch Team, "Introduction to torch.compile", PyTorch Tutorials. https://docs.pytorch.org/tutorials/intermediate/torch_compile_tutorial.html. Accessed 2026-05-21.
[^12]: PyTorch Team, "CUDAGraph Trees", PyTorch documentation. https://docs.pytorch.org/docs/stable/user_guide/torch_compiler/torch.compiler_cudagraph_trees.html. Accessed 2026-05-21.
[^13]: PyTorch Team, "Using Max-Autotune Compilation on CPU for Better Performance", PyTorch Tutorials. https://docs.pytorch.org/tutorials/unstable/max_autotune_on_CPU_tutorial.html. Accessed 2026-05-21.
[^14]: PyTorch contributors, "[RFC]: Moving most torch.compile backends out of core", pytorch/pytorch issue #109687, GitHub, 2023-09. https://github.com/pytorch/pytorch/issues/109687. Accessed 2026-05-21.
[^15]: PyTorch Forums, "Where to begin developing custom backend for torch.compiler?", discuss.pytorch.org. https://discuss.pytorch.org/t/where-to-begin-developing-custom-backend-for-torch-compiler/191182. Accessed 2026-05-21.
[^16]: Yaoyao Ding, Bojian Zheng, Allan Lin, et al., "Introducing Hidet: A Deep Learning Compiler for Efficient Model Serving", PyTorch Blog, 2023-04-28. https://pytorch.org/blog/introducing-hidet/. Accessed 2026-05-21.
[^17]: PyTorch Team, "PyTorch 2.0 Performance Dashboard", PyTorch documentation. https://docs.pytorch.org/docs/stable/torch.compiler_performance_dashboard.html. Accessed 2026-05-21.
[^18]: Anthony Alford, "PyTorch 2.1 Release Supports Automatic Dynamic Shape Support and Distributed Training Enhancements", InfoQ, 2023-10. https://www.infoq.com/news/2023/10/pytorch21-at-pytorch-con-2023/. Accessed 2026-05-21.
[^19]: PyTorch Team, "PyTorch 2.2: FlashAttention-v2 integration, AOTInductor", PyTorch Blog, 2024-01-30. https://pytorch.org/blog/pytorch2-2/. Accessed 2026-05-21.
[^20]: Antoni Viros i Martin, Brian Vaughan, Davis Wertheimer, Joshua Rosenkranz, Mudhakar Srivatsa, Nelson Mimura Gonzalez, Raghu Ganti, Supriyo Chakraborty, Zhuoran Liu, Geeta Chauhan, Hamid Shojanazeri, "PyTorch compile to speed up inference on Llama 2", PyTorch Blog, 2023-11-07. https://pytorch.org/blog/pytorch-compile-to-speed-up-inference/. Accessed 2026-05-21.
[^21]: Hugging Face, "Accelerated PyTorch 2.0 support in Diffusers", Diffusers documentation, 2023. https://huggingface.co/docs/diffusers/v0.14.0/optimization/torch2.0. Accessed 2026-05-21.
[^22]: PyTorch contributors, "torch/_inductor/cudagraph_trees.py", pytorch/pytorch source tree, GitHub. https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cudagraph_trees.py. Accessed 2026-05-21.

