PyTorch is an open-source machine learning framework primarily developed by Meta AI. It provides a flexible platform for building and training deep learning models, with particular strengths in dynamic computation graphs, an intuitive Pythonic API, and seamless GPU acceleration. Originally released in September 2016 and publicly launched in January 2017, PyTorch has grown into the dominant framework for AI research and is increasingly adopted in production settings. As of early 2026, the project has amassed over 98,000 stars on GitHub, and 75% of papers at NeurIPS 2024 were powered by PyTorch [1].
PyTorch traces its lineage to Torch, a scientific computing framework written in the Lua programming language that originated around 2002. Torch (often called Torch7 in its later iterations) provided a mature set of tensor operations and neural network modules, and it was widely used in academic research during the early 2010s. However, Lua's relatively niche status as a programming language limited Torch's adoption among the broader machine learning community, which was increasingly gravitating toward Python [2].
The groundwork for PyTorch started in early 2016 among a group of Torch7 contributors. Adam Paszke, then a student at the University of Warsaw, reached out to Soumith Chintala at Meta AI (then Facebook AI Research, or FAIR) looking for an internship. Chintala invited Paszke to build the next generation of the Torch framework with a modern design centered on Python. The project drew significant inspiration from several existing systems: Lua Torch for its C/CUDA backend libraries (TH, THC, THNN, THCUNN), the Chainer framework for its define-by-run approach to computation graphs, and the HIPS Autograd library by Dougal Maclaurin for its approach to automatic differentiation in Python [3].
In mid-2016, developers refactored the codebase to decouple the frontend from the backend, producing a Python-first framework that retained Torch's battle-tested C and CUDA kernels underneath. The initial public release came on January 19, 2017, on GitHub, and the framework quickly attracted attention for its developer-friendly design and flexibility [2].
Beyond Adam Paszke and Soumith Chintala, the early PyTorch team included Sam Gross, Gregory Chanan, and several other researchers at FAIR. The original research paper, "Automatic differentiation in PyTorch," was presented at the NIPS 2017 Autodiff Workshop by Paszke, Gross, Chintala, Chanan, and colleagues. Over time, the contributor base expanded dramatically; today the project lists thousands of individual contributors from hundreds of organizations worldwide [4].
The defining technical choice in PyTorch's design is its use of dynamic computation graphs, also known as eager execution or define-by-run. In this paradigm, the computation graph is constructed on the fly as operations execute, rather than being defined statically before execution. This means developers can use standard Python control flow (if statements, for loops, print statements for debugging) directly within their model code, and the graph will adapt accordingly at each forward pass [4].
This was a significant departure from TensorFlow's original approach, which required users to define a static computation graph upfront before running any computations. PyTorch's eager execution made debugging substantially easier, since developers could use standard Python debuggers, inspect intermediate tensor values at any point, and write models that behaved differently depending on their inputs.
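The effect of define-by-run can be seen in a small sketch (the module, shapes, and depth rule below are invented for illustration): a model whose very depth depends on its input runs naturally in eager mode, with the graph rebuilt on every forward pass.

```python
import torch

class DynamicNet(torch.nn.Module):
    """A toy network whose depth depends on the input itself."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 4)

    def forward(self, x):
        # Ordinary Python control flow participates directly:
        # the number of applied layers is data-dependent.
        n_steps = 1 + int(x.abs().sum().item()) % 3
        for _ in range(n_steps):
            x = torch.relu(self.layer(x))
        return x

net = DynamicNet()
out = net(torch.randn(2, 4))
print(out.shape)  # torch.Size([2, 4])
```

A static-graph framework would need special graph-level control-flow operators to express the same loop; here it is just Python.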
PyTorch's autograd engine is a tape-based automatic differentiation system that records operations performed on tensors and constructs a directed acyclic graph (DAG) for computing gradients during the backward pass. When a forward computation is performed on tensors with requires_grad=True, autograd records every operation. Calling .backward() on the output then traverses this graph in reverse to compute gradients for all participating parameters [4].
The system supports both forward-mode and reverse-mode differentiation, higher-order gradients, and gradient computation for arbitrary Python functions. This flexibility has made PyTorch particularly popular for research involving novel training procedures, custom loss functions, and non-standard optimization techniques.
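A minimal sketch of the recording-and-replay cycle (values chosen arbitrarily): setting requires_grad=True opts a tensor into the tape, and .backward() walks the recorded DAG in reverse.

```python
import torch

# requires_grad=True tells autograd to record operations on x.
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # forward pass builds the DAG: y = x0^2 + x1^2

y.backward()         # reverse traversal computes dy/dx = 2x
print(x.grad)        # tensor([4., 6.])
```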
PyTorch was designed to feel like a natural extension of Python and NumPy. Tensors in PyTorch behave similarly to NumPy arrays but with GPU support and automatic differentiation. The torch.nn module provides a high-level API for defining neural network layers, and the torch.optim module supplies standard optimization algorithms. The overall design philosophy prioritizes usability and transparency over abstraction, allowing researchers to understand and modify every aspect of their training pipeline [4].
PyTorch provides first-class support for NVIDIA CUDA GPUs. Tensors can be moved to GPU memory with a simple .to('cuda') or .cuda() call, and all standard operations have CUDA implementations. The framework also supports multi-GPU training through torch.nn.DataParallel and the more scalable torch.nn.parallel.DistributedDataParallel (DDP).
As of PyTorch 2.10 (January 2026), hardware support has expanded considerably beyond NVIDIA GPUs:
| Hardware Platform | Backend | Status (PyTorch 2.10) |
|---|---|---|
| NVIDIA GPUs (CUDA) | CUDA | Stable, first-class support |
| AMD GPUs (ROCm) | ROCm | Stable; pre-built wheels available |
| Intel GPUs (Arc, Data Center Max) | XPU (SYCL) | Stable since PyTorch 2.6 |
| Apple Silicon (M1/M2/M3/M4) | MPS (Metal Performance Shaders) | Beta; eager mode stable, torch.compile limited |
| Google TPUs | PyTorch/XLA | Experimental; maintained by Google |
| Intel CPUs (AMX, AVX-512) | CPU | Stable; FP16 and BF16 support |
| Arm CPUs (Neoverse, Graviton) | CPU | Stable; optimized kernels via KleidiAI |
The MPS backend, introduced in PyTorch 1.12 for Apple Silicon, allows GPU-accelerated training and inference on Mac devices using Apple's Metal Performance Shaders framework. While it has matured considerably, torch.compile support for MPS remains limited compared to the CUDA backend [5].
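A common device-agnostic pattern picks the best available backend at runtime (the preference order below is a choice made for this sketch, not a PyTorch-mandated policy):

```python
import torch

# Prefer CUDA, then Apple's MPS, then CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.randn(8, 8, device=device)
y = x @ x.T  # runs on whichever backend was selected
print(y.device.type)
```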
TorchScript is a way to create serializable and optimizable models from PyTorch code. It provides two mechanisms: tracing (which records operations executed during a sample forward pass) and scripting (which directly analyzes the Python source code). TorchScript models can be saved and loaded in environments that do not require Python, such as C++ applications, enabling deployment in production settings. While TorchScript was important in PyTorch's evolution toward production readiness, the torch.compile approach introduced in PyTorch 2.0 has increasingly become the preferred path for optimization [6].
PyTorch 2.0, released in March 2023, represented the most significant technical evolution of the framework since its inception. The headline feature was torch.compile(), a single function call that can accelerate existing PyTorch models without requiring code changes. Under the hood, torch.compile is powered by a suite of new compiler technologies [7].
TorchDynamo is a Python-level JIT compiler that captures PyTorch operations using Python's frame evaluation hooks (PEP 523). Unlike previous graph capture approaches that struggled with Python's dynamic nature, TorchDynamo can capture computational graphs from arbitrary Python code with a 99% success rate. When it encounters Python constructs it cannot handle, it falls back gracefully to regular Python execution for those portions, a technique called "graph breaks." This design was the result of five years of research and development into safe graph capture [7].
TorchInductor is the default compiler backend that takes the captured graph and generates optimized code. For NVIDIA GPUs, it produces Triton kernels; for CPUs, it generates C++/OpenMP code. TorchInductor applies a range of optimizations including operator fusion (combining multiple operations into a single kernel to reduce memory traffic), memory planning, and automatic tuning of kernel configurations. The backend uses a Pythonic define-by-run loop-level intermediate representation (IR) that makes it accessible and extensible [7].
As of PyTorch 2.8, the Inductor CUTLASS backend is also available for both torch.compile and AOTInductor, supporting GEMMs such as mm, FP8 mm, addmm, and bmm. Generated CUTLASS kernels have achieved up to 10-16% speedups over Triton and cuBLAS on certain production workloads [15].
AOTAutograd (Ahead-of-Time Autograd) traces the backward pass at compile time rather than at runtime, enabling the compiler to optimize both the forward and backward computations together. PrimTorch canonicalizes PyTorch's roughly 2,000 operators down to a closed set of approximately 250 primitive operators, providing a standardized target for backend developers and simplifying the compiler stack [7].
torch.compile delivers significant speedups across a wide range of models. At launch, the PyTorch team demonstrated a 43% average speedup on 163 open-source models spanning computer vision, natural language processing, and recommendation systems. For large language model workloads, the speedups are often more pronounced due to the opportunities for operator fusion and memory optimization. Complex models can see speedups as high as 5x, while simpler models may see more modest gains. The compiler offers multiple modes: default for a balance of compile time and performance, reduce-overhead for minimizing framework overhead, and max-autotune for maximum runtime performance at the cost of longer compilation [7].
| Component | Role | Output |
|---|---|---|
| TorchDynamo | Python-level graph capture via frame evaluation hooks | FX graph of PyTorch operations |
| AOTAutograd | Ahead-of-time backward pass tracing | Joint forward/backward graph |
| PrimTorch | Operator canonicalization (~2000 to ~250 ops) | Simplified primitive operations |
| TorchInductor | Code generation and optimization | Triton kernels (GPU) or C++ (CPU) |
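Usage is a single wrapper call, as sketched below. The backend="eager" argument is used here only so the sketch exercises TorchDynamo's graph capture without requiring a native toolchain; omitting it selects TorchInductor, and the mode argument ("reduce-overhead", "max-autotune") tunes the compile-time/performance trade-off described above.

```python
import torch

def fn(x):
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

# backend="eager" runs Dynamo capture without Inductor codegen;
# drop the argument (or set mode="max-autotune") for optimized kernels.
compiled = torch.compile(fn, backend="eager")

out = compiled(torch.linspace(0.0, 3.0, 8))
print(torch.allclose(out, torch.ones(8)))  # True: sin^2 + cos^2 = 1
```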
Introduced as a prototype in PyTorch 2.1 and progressively stabilized, torch.export provides a sound full-graph capture mechanism that produces clean, portable graph representations of PyTorch programs. Unlike TorchDynamo's graph capture (which allows graph breaks and fallbacks), torch.export aims for complete graph capture with no Python dependencies, making it suitable for deployment to non-Python environments. torch.export serves as the entry point for ExecuTorch on-device deployment and AOTInductor server-side compilation [8].
FlexAttention is a PyTorch API introduced as a prototype in PyTorch 2.5 (October 2024) that provides a programmable interface for implementing custom attention mechanisms. It addresses a key tension in the deep learning ecosystem: while fused attention implementations like FlashAttention have substantially improved performance and enabled long context windows, their monolithic nature made it difficult for researchers to experiment with new attention variants without writing custom CUDA kernels [16].
FlexAttention works by allowing users to define an arbitrary score_mod function in idiomatic PyTorch code that modifies attention scores after they have been computed between query and key tensors. The compiler then lowers this into a fused FlashAttention-style kernel via torch.compile, generating a kernel that does not materialize extra memory and achieves performance competitive with handwritten implementations. The backward pass is generated automatically.
Many existing attention variants can be expressed through FlexAttention, including ALiBi (attention with linear biases), document masking, PagedAttention for KV cache management, sliding window attention, and causal masking. Performance benchmarks show FlexAttention achieves 0.68x to 1.43x the performance of FlashAttention v2, with end-to-end improvements of up to 2.04x for inference in gpt-fast (16k context) and 2.4x for training in torchtune [16].
PyTorch 2.6 extended FlexAttention to x86 CPUs through the TorchInductor C++ backend, supporting attention variants like PagedAttention critical for LLM inference. PyTorch 2.7 further improved FlexAttention for LLM first-token processing and throughput mode inference. PyTorch 2.10 added varlen_attn(), a new attention operation for ragged and packed sequences that supports both forward and backward passes and is torch.compile-compatible [17].
Following the 2.0 release, PyTorch has maintained a rapid release cadence with significant improvements in each version.
PyTorch 2.1 introduced automatic dynamic shape support in torch.compile, which tracks and generates code based on symbolic tensor shapes rather than static shapes, allowing a single compiled kernel to handle many input sizes at only a modest cost to efficiency. This was particularly important for LLM workloads where sequence lengths vary. The release also added torch.distributed.checkpoint for saving and loading distributed models across multiple ranks in parallel, torch.compile support for the NumPy API, and a prototype of torch.export for sound full-graph capture. This release comprised 6,682 commits from 784 contributors [8].
PyTorch 2.2 integrated FlashAttention-2 as the default backend for scaled dot-product attention (SDPA), delivering approximately 2x performance improvements for attention computations. The release also introduced AOTInductor, a new ahead-of-time compilation and deployment tool built for non-Python server-side deployments, along with improved torch.compile support for optimizers and a new TORCH_LOGS logging mechanism for debugging compilation [9].
PyTorch 2.3 added support for user-defined Triton kernels in torch.compile, allowing users to integrate custom Triton kernels without performance complications or graph breaks. Tensor Parallelism support was validated on 100-billion-parameter model training runs using native PyTorch functions. The release also introduced the DeviceMesh abstraction for managing multi-dimensional device topologies and distributed checkpointing improvements [10].
PyTorch 2.4 expanded Python 3.12 support for torch.compile (previously limited to Python 3.8-3.11), introduced AOTInductor freezing for CPU deployments, and added a new default TCPStore server backend utilizing libuv that significantly reduces initialization times for large-scale distributed jobs. A new Python Custom Operator API simplified integration of custom kernels into torch.compile. This release comprised 3,661 commits from 475 contributors [11].
PyTorch 2.5 introduced a cuDNN backend for SDPA that provides up to 75% speedup over FlashAttention v2 on NVIDIA H100 and newer GPUs. Regional compilation was added to reduce torch.compile cold startup time, particularly useful for LLMs with repeated transformer layers. The release also brought the FlexAttention API for programmable attention mechanisms, enhanced FP16 support in the TorchInductor CPU backend, and expanded Intel GPU support for both Data Center GPU Max Series and Intel Arc client GPUs. This release comprised 4,095 commits from 504 contributors [12].
PyTorch 2.6 added torch.compile support for Python 3.13 and introduced torch.compiler.set_stance, a feature that allows users to specify different compilation behaviors between invocations (for example, running eagerly when recompilation would be needed). FlexAttention was extended to x86 CPUs. Intel GPU support reached stable status with simplified one-click installation of torch-xpu PIP wheels and expanded coverage including Intel Arc B-Series discrete graphics. FP16 on x86 CPUs was promoted to beta status. As a security improvement, the default value of torch.load's weights_only parameter was flipped to True [13].
PyTorch 2.7 brought support for the NVIDIA Blackwell GPU architecture and pre-built wheels for CUDA 12.8 across Linux x86 and arm64 architectures. torch.compile gained support for Torch Function Modes, enabling users to override any torch operation with custom behavior. The Mega Cache feature enabled end-to-end portable caching for torch.compile. FlexAttention received further optimizations for LLM inference throughput on x86 CPUs. This release comprised 3,262 commits from 457 contributors [14].
PyTorch 2.8 introduced five control flow operators (cond, while_loop, scan, associative_scan, and map) for compiling and exporting models with data-dependent control flow. The release added support for saving, loading, and re-sharding checkpoints in the SafeTensors format for interoperability with the Hugging Face ecosystem. The Inductor CUTLASS backend became available for both torch.compile and AOTInductor. This release comprised 4,164 commits from 585 contributors [15].
PyTorch 2.9 raised the minimum Python version to 3.10 and added preview support for Python 3.14 and Python 3.14t (the free-threaded build). The release introduced the symmetric memory programming model for ultra-low latency direct GPU-to-GPU communication within kernels (put/get operations), expanded the hardware support matrix with ROCm, XPU, and CUDA 13 wheel variants, and refined the stable ABI for C++ and CUDA extensions to improve cross-version compatibility. Arm platform support was broadened with optimized operators on AArch64 and new Arm Neoverse V2-based CI coverage on AWS Graviton 4 instances [18].
PyTorch 2.10 is the latest stable release as of March 2026. It added Python 3.14 support for torch.compile and experimental support for the Python 3.14t free-threaded build. Combo-kernel horizontal fusion in TorchInductor reduces kernel launch overhead by fusing multiple independent operations with no data dependencies into a single GPU kernel. FP8 support was added for Intel GPUs with commonly used basic operators and scaled matrix multiplication. torch.compile now respects use_deterministic_mode, making reproducible training easier. A new varlen_attn() operation supports ragged and packed sequences for attention. This release comprised 4,160 commits from 536 contributors. The project has also increased its release cadence from quarterly to bimonthly for 2026 [19].
| Version | Release Date | Key Features |
|---|---|---|
| 1.0 | December 2018 | TorchScript, C++ frontend, distributed training |
| 1.5 | April 2020 | Stable C++ frontend, updated autograd |
| 1.8 | March 2021 | AMD ROCm support, PyTorch Profiler |
| 1.12 | June 2022 | Apple MPS backend, Functorch |
| 2.0 | March 2023 | torch.compile, TorchDynamo, TorchInductor, Accelerated Transformers |
| 2.1 | October 2023 | Automatic dynamic shapes, torch.export prototype, distributed checkpointing |
| 2.2 | January 2024 | FlashAttention-2 in SDPA, AOTInductor |
| 2.3 | April 2024 | User-defined Triton kernels, Tensor Parallelism, DeviceMesh |
| 2.4 | July 2024 | Python 3.12 support, AOTInductor freezing, Custom Operator API |
| 2.5 | October 2024 | cuDNN SDPA backend, regional compilation, FlexAttention, Intel GPU support |
| 2.6 | January 2025 | Python 3.13 support, compiler stances, Intel GPU stable, FlexAttention on CPU |
| 2.7 | April 2025 | NVIDIA Blackwell support, CUDA 12.8, Torch Function Modes, Mega Cache |
| 2.8 | July 2025 | Control flow operators, SafeTensors checkpointing, CUTLASS backend |
| 2.9 | October 2025 | Python 3.14 preview, symmetric memory, CUDA 13, stable ABI |
| 2.10 | January 2026 | Combo-kernel fusion, FP8 on Intel GPUs, Python 3.14 for torch.compile |
PyTorch provides a comprehensive suite of tools for distributed training across multiple GPUs and machines, organized under the torch.distributed module.
DDP is the standard approach for data-parallel training, where the model is replicated across each worker and each replica processes a different subset of the training data. DDP uses collective communication (all-reduce) to synchronize gradients after the backward pass, ensuring all replicas maintain identical model parameters. DDP is the most widely used distributed training strategy for models that fit within a single GPU's memory [20].
FSDP, inspired by Microsoft's ZeRO optimizer, shards model parameters, gradients, and optimizer states across workers to enable training models larger than a single GPU's memory. The original FSDP (now called FSDP1) flattens, concatenates, and chunks a group of tensors together for sharding.
FSDP2, the next-generation implementation, uses per-parameter sharding (chunking each parameter individually on dim-0 across data parallel workers) for improved usability and composability. FSDP2 offers several advantages over FSDP1: it avoids record_stream usage for deterministic memory release, requires approximately 7% lower GPU memory on average (benchmarked on Llama 2 7B), and provides roughly 1.5% faster throughput. Per-parameter sharding relaxes constraints around frozen parameters and enables communication-free sharded state dicts without the all-gathers required in FSDP1. FSDP2 also supports both implicit prefetching (works out of the box) and explicit prefetching for advanced users who want to control all-gather schedules [21].
Tensor Parallelism (TP) splits individual layers across multiple devices, allowing single operations (such as large matrix multiplications) to be distributed across GPUs. PyTorch's TP implementation leverages DTensor (Distributed Tensor) and the DeviceMesh abstraction for device management. Pipeline Parallelism (PP) splits the model into stages, with each stage assigned to a different device, and micro-batches flowing through the pipeline to maximize hardware utilization.
These parallelism strategies can be composed hierarchically. In a typical 3D parallelism configuration, TP shards within nodes, FSDP shards across nodes, and PP divides the model across pipeline stages, all managed through different dimensions of a DeviceMesh. This composability was validated at scale through the TorchTitan framework, which demonstrated stackable FSDP2, TP, and PP implementations for production LLM pre-training [22].
Introduced in PyTorch 2.9, the symmetric memory programming model supports direct communication within GPU kernels using put/get operations. This enables ultra-low latency remote memory access, including one-way operations that do not require remote GPU coordination, opening new possibilities for custom communication patterns in distributed training [18].
ExecuTorch is PyTorch's unified solution for deploying AI models on-device, from smartphones and wearables to microcontrollers and embedded systems. It succeeded PyTorch Mobile, which was deprecated in favor of this more comprehensive approach. ExecuTorch maintains a minimal 50KB base runtime footprint, making it suitable for severely resource-constrained environments [23].
The framework works by taking a PyTorch model exported via torch.export, optimizing it for the target hardware, and running it through a lightweight runtime. ExecuTorch supports over 12 hardware backends with acceleration for Apple (Core ML, MPS), Qualcomm (Hexagon NPU), Arm (Ethos-U NPU, CPU via KleidiAI), MediaTek, Samsung (Exynos NPU and GPU), Intel (OpenVINO), NXP Semiconductors, and Vulkan for cross-platform GPU inference.
ExecuTorch 1.0 was released on October 22, 2025, marking the framework's production-ready status. Key features of the 1.0 release include new hardware backends (Arm VGF, NXP eIQ Neutron NPU, Samsung Exynos), several backends promoted from beta to production-ready status, and support for native C++ desktop and laptop applications. Meta has deployed ExecuTorch across its family of apps, with on-device AI features serving billions of users on Instagram, WhatsApp, Messenger, and Facebook [23].
PyTorch's ecosystem extends well beyond the core framework, encompassing a rich set of domain-specific libraries and third-party integrations.
| Library | Domain | Key Features |
|---|---|---|
| torchvision | Computer vision | Pre-trained models (ResNet, EfficientNet, ViT), datasets (ImageNet, COCO), image transforms |
| torchaudio | Audio processing | Audio I/O, feature extraction (spectrograms, MFCCs), pre-trained models (wav2vec 2.0, HuBERT) |
| torchtext | Natural language processing | Text preprocessing, vocabulary management, dataset loaders |
| TorchRec | Recommendation systems | Distributed embeddings, sharding strategies for large embedding tables |
| TorchServe | Model serving | REST/gRPC APIs, model versioning, batching, multi-model serving |
| PyTorch Lightning | Training framework | Simplified training loops, multi-GPU/TPU support, experiment tracking integration |
| torchtune | LLM fine-tuning | Native PyTorch recipes for fine-tuning LLMs, LoRA/QLoRA support |
| TorchTitan | Distributed pre-training | Stackable FSDP2, TP, PP implementations for production LLM pre-training |
| ExecuTorch | Edge deployment | On-device inference for mobile, embedded, and edge devices |
The Hugging Face Transformers library is perhaps the most significant third-party integration in the PyTorch ecosystem. Hugging Face's model hub hosts hundreds of thousands of pre-trained models, the vast majority of which are PyTorch-native. The Transformers library provides a unified API for loading, fine-tuning, and deploying these models. Integration features include native support for torch.compile, FlashAttention, and automatic mixed precision training. The Hugging Face Accelerate library further simplifies distributed training across multiple GPUs and machines using PyTorch's distributed primitives. PyTorch 2.8's SafeTensors checkpoint support further improved interoperability with the Hugging Face ecosystem [24].
In September 2022, Meta transitioned PyTorch's governance to the newly formed PyTorch Foundation, hosted under the Linux Foundation. The founding premier members included AMD, Amazon Web Services, Google Cloud, Meta, Microsoft Azure, and NVIDIA. The Foundation's formation was motivated by a desire to ensure neutral governance, separating business interests from technical decision-making [25].
The Foundation adheres to four core principles: remaining open, maintaining neutral branding, staying fair, and forging a strong technical identity. The governing board includes representatives from the founding members, while technical governance follows a hierarchical maintainer structure with clear processes for day-to-day development and escalations. The Technical Advisory Council (TAC) serves as a bridge between the industry (including Foundation members), the community, and the core development team [25].
Since its formation, the Foundation has expanded its membership significantly. In February 2026, the Foundation announced nine new members, including Silver members Clockwork.io, Emmi AI, and the National IT Industry Promotion Agency (NIPA), as well as Associate members Carnegie Mellon University and Monash University. Ray, the open-source distributed computing framework for AI workloads, joined as a Foundation-hosted project in October 2025. The annual PyTorch Conference tripled registrations from 2023 to 2024, and the PyTorch Tools ecosystem grew by over 25% in 2024. The Foundation offers tiered membership levels (Premier, General, Silver, Associate) with different governance participation rights [25] [26].
PyTorch's adoption in AI research has been remarkable and continues to grow. At NeurIPS 2024, 75% of papers were powered by PyTorch. Papers With Code tracking shows that PyTorch was used in approximately 60% of papers with linked code in 2024, compared to approximately 15% for TensorFlow. Over 20,000 research papers and 140,000 GitHub repositories utilized PyTorch in 2024 alone. Contributions increased by 133% year over year, coming from double the number of organizations compared to the previous year [1].
| Metric | Value (as of early 2026) |
|---|---|
| GitHub stars | ~98,400 |
| GitHub forks | ~26,000+ |
| Contributors | 3,500+ |
| PyPI monthly downloads | Tens of millions |
| PyTorch Conference 2024 registrations | 3x increase over 2023 |
| Research papers using PyTorch (2024) | 20,000+ |
| GitHub repositories using PyTorch (2024) | 140,000+ |
| NeurIPS 2024 papers powered by PyTorch | 75% |
Beyond research, PyTorch powers production AI systems at many major technology companies and AI labs. Meta uses PyTorch extensively for its recommendation systems, content moderation, generative AI products, and on-device inference via ExecuTorch. Microsoft uses PyTorch as the primary framework for many of its AI services and it is the default framework for Azure Machine Learning. Tesla, OpenAI, and numerous other companies rely on PyTorch for training and deploying models at scale. Google DeepMind, while historically associated with TensorFlow and JAX, has researchers who use PyTorch as well. The framework's dominance in research means that most frontier AI models, including large language models from various labs, are initially developed and trained in PyTorch before any framework conversion for deployment.
The relationship between PyTorch and TensorFlow has shaped the evolution of both frameworks. While they have converged in many areas (TensorFlow adopted eager execution in TF 2.0; PyTorch added compilation with torch.compile), key differences remain.
| Feature | PyTorch | TensorFlow |
|---|---|---|
| Default execution mode | Eager (dynamic graphs) | Eager (since TF 2.0; originally static graphs) |
| Graph compilation | torch.compile (TorchDynamo + TorchInductor) | tf.function with XLA |
| Primary API style | Pythonic, imperative | Keras high-level API |
| Research adoption (2024) | ~60% of papers with code | ~15% of papers with code |
| Production deployment | TorchServe, AOTInductor, ExecuTorch | TF Serving, TF Lite, TensorFlow.js |
| Mobile/edge deployment | ExecuTorch | TensorFlow Lite, TensorFlow.js |
| TPU support | Via PyTorch/XLA (experimental) | Native, first-class |
| Distributed training | DDP, FSDP/FSDP2, DeviceMesh | tf.distribute.Strategy |
| Primary backer | Meta (via PyTorch Foundation) | Google |
| License | BSD 3-Clause | Apache 2.0 |
TensorFlow's advantages include its mature production deployment ecosystem (particularly TF Serving and TF Lite for mobile), native TPU support, and the TensorFlow.js ecosystem for browser-based ML. TensorFlow still leads in overall industry market share (roughly 38% vs. 26% for PyTorch in enterprise surveys), primarily due to its head start in production deployments. PyTorch's advantages include its dominant research community, more intuitive debugging experience, and the rapidly maturing torch.compile compiler stack [27].
JAX, developed by Google, has emerged as a significant alternative framework, particularly for performance-critical research. JAX takes a functional programming approach, providing composable transformations (jit, grad, vmap, pmap) over Python and NumPy code. JAX compiles to XLA (Accelerated Linear Algebra), which provides strong performance on TPUs and GPUs.
JAX's strengths include its functional purity (which makes programs easier to reason about mathematically), excellent built-in support for parallelism across multiple devices, and strong TPU performance. However, JAX has a steeper learning curve than PyTorch, a smaller ecosystem, and less industry adoption. Google DeepMind has been a major user of JAX, and some research groups prefer it for specific workloads involving heavy parallelism or TPU usage. As of 2025, JAX had approximately 33,000 GitHub stars compared to PyTorch's 98,000+, reflecting the difference in community size [28].
As of early 2026, PyTorch continues to solidify its position as the leading ML framework. The 2.x series has successfully addressed many of PyTorch's historical limitations around performance and deployment, with torch.compile offering competitive or superior performance to static graph frameworks on most workloads.
Key trends and developments include:
Compiler maturity. torch.compile is now stable and integrated into most major model libraries, including Hugging Face Transformers. Regional compilation and dynamic shapes support have made it practical for LLM workloads with variable sequence lengths. As of August 2025, TorchBench, HuggingFace, and TIMM test suites in torch.compile mode run faster than eager mode across the board.
Hardware diversification. PyTorch is expanding well beyond its NVIDIA-centric roots. Intel GPU support reached stable status in PyTorch 2.6, AMD ROCm support has matured with pre-built wheels, and NVIDIA Blackwell architecture is supported as of PyTorch 2.7 with CUDA 12.8. The MPS backend for Apple Silicon continues to improve, though it lags behind the CUDA backend in torch.compile coverage.
Bimonthly release cadence. Starting in 2026, PyTorch has shifted from quarterly to bimonthly releases, with versions 2.11 through 2.16 planned throughout 2026. This accelerated pace reflects the rapid evolution of the AI hardware and software landscape.
On-device inference. With ExecuTorch 1.0 reaching general availability in October 2025 and being deployed at scale across Meta's apps, PyTorch now has a competitive story for edge deployment, an area where TensorFlow Lite had historically led.
LLM and generative AI focus. The PyTorch team has prioritized making torch.compile work seamlessly across all stages of LLM workflows: pre-training, fine-tuning, and inference optimization. Integration with FlexAttention, mixed precision training, quantization libraries, and the torchtune fine-tuning framework reflects this focus.
Growing Foundation ecosystem. The PyTorch Foundation continues to expand under the Linux Foundation, with new members joining regularly and Ray joining as a hosted project. The Foundation's vendor-neutral governance has helped attract contributions from companies beyond Meta, strengthening the project's long-term sustainability.
PyTorch's trajectory from a research-focused alternative to Torch into the most widely used deep learning framework is one of the notable success stories in open-source AI infrastructure. Its combination of usability, flexibility, and an increasingly competitive performance story positions it well for continued dominance as AI development accelerates.