PyTorch Lightning
Last reviewed
Apr 30, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,533 words
PyTorch Lightning is an open-source deep-learning framework that wraps [[pytorch|PyTorch]] to abstract away the engineering boilerplate of training loops, distributed training, mixed precision, checkpointing, and logging, while keeping the research code (model definition, loss, optimization) explicit and modular. The framework is published under the Apache License 2.0 by Lightning AI, the company formerly known as Grid AI. Its tagline, "pretrain and finetune any AI model of any size on 1 or 10,000+ GPUs with zero code changes," captures the design philosophy: the user writes a LightningModule describing what the model does, and a Trainer decides how it runs across hardware.
The project was created by William Falcon in 2019 while he was a PhD student at NYU's CILVR Lab, advised by Kyunghyun Cho and Yann LeCun, and concurrently interning at Facebook AI Research. It quickly gained traction in academic labs, became part of the official PyTorch ecosystem in 2020, and now anchors a broader product family that includes Lightning Fabric, Lightning Studios, and the metrics library TorchMetrics. As of early 2026 the GitHub repository sits at roughly 31,000 stars, with monthly downloads in the tens of millions and a release cadence of roughly one minor version every two to three months.
Raw PyTorch gives the user a programming interface for tensors, autograd, and nn.Module, but it does not prescribe a training loop. Every project tends to reinvent the same scaffolding: a for epoch in ... outer loop, a manual optimizer.zero_grad() / loss.backward() / optimizer.step() sequence, calls to .to(device), manual gradient accumulation, mixed precision with torch.cuda.amp.autocast, distributed wrappers such as DistributedDataParallel, checkpoint saving, learning-rate scheduling, validation hooks, and integration with logging tools like TensorBoard or Weights & Biases. In a small research script this is annoying. In a production training run on 64 GPUs across 8 nodes with bf16 precision and a sharded optimizer, it is a serious source of bugs.
PyTorch Lightning splits that scaffolding from the model. The user defines a LightningModule that contains the model layers, the forward pass, and a few small methods that describe a single training step, validation step, test step, and optimizer configuration. Everything else, including device placement, distributed launch, mixed precision casting, gradient accumulation, checkpoint serialization, and metric logging, is the responsibility of the Trainer. Lightning's own marketing claims this can cut typical training-script length by around 70 percent, and most researchers who have used both interfaces agree the reduction is real, even if the exact ratio depends on how organized the original code was.
Falcon began experimenting with the abstractions that would become Lightning around 2018, and the first public release on GitHub appeared in 2019. The library spread among PhD students looking for a way to share reproducible training code, then was adopted as the recommended submission format for the NeurIPS 2019 Reproducibility Challenge. By 2020 Facebook AI had partnered with the Lightning team and the project was admitted to the official PyTorch ecosystem. The 1.0 release in October 2020 marked the first stable API and arrived alongside the founding of Grid AI, the commercial entity behind the framework.
| Year | Event |
|---|---|
| 2018 | Falcon begins prototyping Lightning at NYU CILVR Lab and Facebook AI Research. |
| 2019 | Initial public commits to GitHub; first PyPI release in mid-2019. |
| 2019 | Adopted by the NeurIPS 2019 Reproducibility Challenge as a recommended submission format. |
| 2020 | Joins the official PyTorch ecosystem; Facebook AI partners with the Lightning team. |
| 2020 | Version 1.0 released on October 13, 2020; Grid AI founded by William Falcon and Luis Capelo with $18.6M Series A led by Index Ventures. |
| 2022 | Grid AI rebrands to Lightning AI; closes a $40M Series B led by Coatue and Index Ventures. |
| 2023 | PyTorch Lightning 2.0 released on March 15, 2023, alongside Lightning Fabric, the lightweight opt-in alternative. |
| 2023 | Lightning AI Studios launched in December 2023 as the company's enterprise cloud platform. |
| 2024 | Lightning Studio reaches AWS Marketplace; company reports 240,000 users across 2,000 organizations and raises a further $50M from Cisco Investments, J.P. Morgan, K5 Global, and NVIDIA. |
| 2024 | Lightning AI joins the PyTorch Foundation as a Premier Member. |
| 2025 | Continued 2.x releases adding FP8 support, improved FSDP integration, and tighter integration with Lightning Studios. |
| 2026 | Version 2.6.1 released January 30, 2026; repository at roughly 31,000 GitHub stars. |
The rebrand from Grid AI to Lightning AI in June 2022 reflected a strategic shift. Grid had focused on a managed cloud product for distributed training; Lightning AI broadened the scope to a general AI development platform, with the open-source framework, Fabric, the deprecated Lightning Apps experiment, and Studios all sitting under one umbrella. The 2.0 release in March 2023 was the most significant inflection point on the open-source side. It cleaned up the Trainer's internal architecture, removed a long backlog of 1.x deprecations, declared the API stable, and introduced Lightning Fabric as a sibling library rather than a hidden internal layer.
PyTorch Lightning is built around a small set of abstractions that the user composes to describe a training run. The two most important are the LightningModule and the Trainer.
A LightningModule is a subclass of torch.nn.Module with extra hooks. It defines:
- __init__ and forward, which hold the model layers and the forward pass.
- A training_step(batch, batch_idx) method returning the loss for one batch.
- Optional validation_step, test_step, and predict_step methods.
- A configure_optimizers method that returns optimizers and learning-rate schedulers.

Logging is done through self.log("name", value) inside any step. The log call is dispatched to whichever logger backend is configured on the Trainer, so the same code works whether you are sending metrics to TensorBoard, Weights & Biases, MLflow, Comet, or several at once. The module knows nothing about devices, ranks, or precision; those are decided by the Trainer.
The Trainer is the orchestration engine. It takes a LightningModule and runs training, validation, testing, or prediction loops on the requested hardware. A typical instantiation looks like Trainer(accelerator="gpu", devices=8, strategy="ddp", precision="bf16-mixed", max_epochs=50). The Trainer is responsible for backpropagation, optimizer stepping, gradient accumulation, gradient clipping, mixed precision casting, checkpointing, callback invocation, logging, and the launch of distributed processes.
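The division of labor between the two abstractions can be illustrated without any dependencies. The sketch below is a toy, pure-Python stand-in and not Lightning's implementation: ToyModule, ToySGD, and ToyTrainer are hypothetical names, and a one-parameter model with a hand-derived gradient takes the place of autograd. Only the method names training_step and configure_optimizers mirror the hooks named in the text.

```python
class ToyModule:
    """Describes WHAT is computed: a one-parameter model y = w * x."""

    def __init__(self):
        self.w = 0.0
        self.grad = 0.0

    def training_step(self, batch, batch_idx):
        # Mean squared error for one batch. The gradient is accumulated by
        # hand here, standing in for what autograd would do on backward().
        n = len(batch)
        loss = sum((self.w * x - y) ** 2 for x, y in batch) / n
        self.grad += sum(2 * (self.w * x - y) * x for x, y in batch) / n
        return loss

    def configure_optimizers(self):
        return ToySGD(self, lr=0.05)


class ToySGD:
    def __init__(self, module, lr):
        self.module, self.lr = module, lr

    def step(self):
        self.module.w -= self.lr * self.module.grad

    def zero_grad(self):
        self.module.grad = 0.0


class ToyTrainer:
    """Decides HOW the loop runs: epochs, accumulation, optimizer stepping."""

    def __init__(self, max_epochs, accumulate_grad_batches=1):
        self.max_epochs = max_epochs
        self.accumulate_grad_batches = accumulate_grad_batches

    def fit(self, module, batches):
        optimizer = module.configure_optimizers()
        loss = None
        for _ in range(self.max_epochs):
            for batch_idx, batch in enumerate(batches):
                loss = module.training_step(batch, batch_idx)
                if (batch_idx + 1) % self.accumulate_grad_batches == 0:
                    # Average the accumulated gradients, then step once.
                    module.grad /= self.accumulate_grad_batches
                    optimizer.step()
                    optimizer.zero_grad()
        return loss


batches = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]  # y = 3x

module = ToyModule()
ToyTrainer(max_epochs=20).fit(module, batches)

# Same module code, different loop behavior: only the trainer argument changes.
accumulated = ToyModule()
ToyTrainer(max_epochs=20, accumulate_grad_batches=2).fit(accumulated, batches)
```

Both runs recover a weight close to 3. The point of the sketch is that switching on gradient accumulation required no change to the module class, which is the property the real Trainer aims for at much larger scale.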
A strategy controls how the model is distributed across workers. Lightning ships strategies for single-device training, single-node multi-GPU via DDP (Distributed Data Parallel), multi-node DDP, [[deepspeed|DeepSpeed]] ZeRO stages 1, 2, and 3, [[fsdp|FSDP]] (Fully Sharded Data Parallel, developed in collaboration with Meta), DDP Spawn, TPU strategies for Google Cloud TPU pods, and SLURM-aware launchers. The strategy interface handles process launch, NCCL or Gloo communication setup, parameter sharding, and gradient reduction. Switching from single-GPU to 64-GPU multi-node training is typically a matter of changing two arguments.
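The gradient reduction step that data-parallel strategies perform comes down to simple arithmetic: each worker computes a gradient on its own shard of the batch, and an all-reduce leaves every worker holding the average. The sketch below simulates two workers in a single process. It is an illustration under stated assumptions, with hypothetical function names, no real communication backend, and a one-parameter model standing in for a network.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
world_size = 2

# Each simulated rank takes an equal, disjoint slice of the global batch.
shards = [data[rank::world_size] for rank in range(world_size)]

w = 0.5
per_rank = [local_gradient(w, shard) for shard in shards]
synced = all_reduce_mean(per_rank)

# With equal shard sizes, the averaged gradient matches the gradient a
# single device would compute on the full batch.
full_batch = local_gradient(w, data)
```

The equality only holds when shards are the same size, which is one reason distributed samplers pad or drop samples to keep ranks balanced.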
Callbacks are objects that hook into well-defined points of the training loop, such as on_train_start, on_train_batch_end, on_validation_epoch_end, and on_save_checkpoint. Built-in callbacks include ModelCheckpoint, EarlyStopping, LearningRateMonitor, GradientAccumulationScheduler, StochasticWeightAveraging, BatchSizeFinder, LearningRateFinder, and RichProgressBar. Users write their own callbacks for custom behavior such as exporting ONNX after training, sending Slack notifications on failure, or saving sample predictions to disk. Because callbacks are first-class, advanced training tricks tend to ship as small reusable callback packages.
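The hook mechanism can be sketched in a few lines of plain Python. This is a toy model of the pattern, not Lightning's code: HookTrainer and ToyEarlyStopping are hypothetical names, the list of validation losses stands in for a real validation loop, and the hooks here receive only the trainer, whereas Lightning's hooks also receive the module.

```python
class Callback:
    """Base class with no-op hooks; subclasses override what they need."""

    def on_train_start(self, trainer):
        pass

    def on_validation_epoch_end(self, trainer):
        pass


class ToyEarlyStopping(Callback):
    """Requests a stop after `patience` epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def on_validation_epoch_end(self, trainer):
        if trainer.val_loss < self.best:
            self.best = trainer.val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                trainer.should_stop = True


class HookTrainer:
    def __init__(self, callbacks):
        self.callbacks = callbacks
        self.should_stop = False
        self.val_loss = None
        self.epochs_run = 0

    def _call_hook(self, name):
        # Every registered callback gets a chance to react at this point.
        for callback in self.callbacks:
            getattr(callback, name)(self)

    def fit(self, val_losses):
        self._call_hook("on_train_start")
        for loss in val_losses:
            self.val_loss = loss
            self.epochs_run += 1
            self._call_hook("on_validation_epoch_end")
            if self.should_stop:
                break


trainer = HookTrainer(callbacks=[ToyEarlyStopping(patience=2)])
trainer.fit([1.0, 0.8, 0.9, 0.85, 0.7])
```

Here the loss stops improving after the second epoch, so the callback requests a stop at the end of the fourth and the fifth epoch never runs. The loop itself contains no early-stopping logic, which is what makes such behavior reusable across projects.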
Lightning supports more than ten logger backends out of the box: TensorBoard, [[wandb|Weights & Biases]], [[mlflow|MLflow]], Comet, Neptune, CSV files, and several more. Multiple loggers can be passed to a single Trainer and self.log(...) is broadcast to all of them. The logger interface is small, so community-maintained backends for systems like ClearML or AIM exist as third-party packages.
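The broadcast behavior can be pictured as a simple fan-out. The sketch below is a dependency-free illustration with hypothetical classes (InMemoryLogger, LoggingModule); it mimics the idea of one log call reaching several backends and makes no claim about how Lightning buffers or reduces metrics internally.

```python
class InMemoryLogger:
    """Minimal backend: stores every record instead of writing it anywhere."""

    def __init__(self, name):
        self.name = name
        self.records = []

    def log_metrics(self, metrics, step):
        self.records.append((step, dict(metrics)))


class LoggingModule:
    def __init__(self, loggers):
        self.loggers = loggers
        self.global_step = 0

    def log(self, name, value):
        # One call in user code, fanned out to every configured backend.
        for logger in self.loggers:
            logger.log_metrics({name: value}, step=self.global_step)


tensorboard_like = InMemoryLogger("tensorboard")
csv_like = InMemoryLogger("csv")
module = LoggingModule(loggers=[tensorboard_like, csv_like])

module.log("train_loss", 0.5)
module.global_step += 1
module.log("train_loss", 0.4)
```

Because a backend in this sketch only needs a single log_metrics method, adding another destination means writing one small class, which matches the text's point about third-party logger packages.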
PyTorch Lightning supports FP32, FP16 mixed, BF16 mixed, FP8 (on supported hardware such as NVIDIA H100), and 8-bit and 4-bit quantized inference. Gradient clipping by norm or value is enabled with one Trainer flag, gradient accumulation is a single argument, and gradient checkpointing is exposed through PyTorch's standard mechanism with helpers for FSDP-style activation checkpointing. For reproducibility, pl.seed_everything(42) seeds Python, NumPy, and PyTorch, and the Trainer's deterministic flag turns on deterministic algorithms.
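The difference between clipping by norm and clipping by value is easiest to see on a small vector. The functions below are a plain-Python sketch of the two rules with hypothetical names; in Lightning the choice is made through the Trainer arguments gradient_clip_val and gradient_clip_algorithm, and the actual clipping is delegated to PyTorch utilities.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


def clip_by_value(grads, clip_value):
    """Clamp each component independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]


grads = [3.0, 4.0]                    # L2 norm is 5.0
by_norm = clip_by_norm(grads, 1.0)    # direction kept, norm becomes 1.0
by_value = clip_by_value(grads, 1.0)  # each component clamped, direction changes
```

Norm clipping shrinks the vector to roughly [0.6, 0.8] and preserves its direction, while value clipping yields [1.0, 1.0] and changes the direction. That trade-off is usually why norm clipping is the more common default.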
A LightningDataModule packages dataset download, preparation, and DataLoader construction into a single class that can be shared across projects. It defines prepare_data, setup, train_dataloader, val_dataloader, and test_dataloader. The pattern is optional; a Trainer can also accept raw DataLoaders.
The deep-learning training-framework space is crowded. Each option below makes different trade-offs along the spectrum from "rewrite the training loop yourself" to "call .fit() and trust the defaults."
| Framework | Organization | Base | First release | Focus | Philosophy |
|---|---|---|---|---|---|
| PyTorch Lightning | Lightning AI | [[pytorch\|PyTorch]] | 2019 | General research and production training | |
| Plain [[pytorch\|PyTorch]] | Meta / PyTorch Foundation | C++/CUDA | 2016 | Research and production tensor library | |
| Hugging Face Trainer | Hugging Face | [[pytorch\|PyTorch]] (and JAX) | 2020 | Transformer fine-tuning, LLMs | |
| FastAI | fast.ai | [[pytorch\|PyTorch]] | 2018 | Education, rapid prototyping | |
| [[keras\|Keras]] | Google / Keras team | TensorFlow / [[jax\|JAX]] / [[pytorch\|PyTorch]] | | | |
| [[tf_keras\|tf.keras]] | | TensorFlow | 2017 | TensorFlow-native Keras | |
| PyTorch Ignite | PyTorch Foundation | [[pytorch\|PyTorch]] | 2018 | Research training loops | |
| Composer | MosaicML / Databricks | [[pytorch\|PyTorch]] | 2021 | Performance-tuned training algorithms | |
| [[jax\|JAX]] + Flax + Optax | | XLA | 2020 | Functional research, TPU pods | |
| [[deepspeed\|DeepSpeed]] | Microsoft | [[pytorch\|PyTorch]] | 2020 | | |
| [[ray\|Ray]] Train | Anyscale | Multiple | 2021 | Distributed training orchestration | |
Against raw PyTorch, Lightning is a productivity and reliability win for almost any training run that exceeds a single GPU, at the cost of a learning curve and some abstraction overhead. Against FastAI, Lightning is less opinionated about data pipelines and architecture choices but exposes more of the underlying PyTorch primitives. Against the Hugging Face Trainer, Lightning is broader; the HF Trainer is excellent for fine-tuning models from the transformers library but less natural for non-Transformer architectures like graph networks, diffusion models, or reinforcement-learning agents. Against PyTorch Ignite, Lightning is heavier and more prescriptive; Ignite gives you an Engine and asks you to compose handlers yourself. Against Composer, the question is whether you want algorithmic optimizations baked in. Against [[jax|JAX]] plus Flax, the question is whether you want PyTorch's eager-by-default ergonomics or JAX's functional purity and TPU performance.
The feature surface is broad enough that most teams discover capabilities they did not know existed for months. Some of the most load-bearing:
- Multi-node launch: start the job with srun or torchrun, done.
- Mixed precision: precision="16-mixed", "bf16-mixed", or FP8 through "transformer-engine" on H100-class hardware.
- Checkpoint resume: Trainer(...).fit(model, ckpt_path="path/to/ckpt") restores model weights, optimizer state, scheduler state, RNG state, and epoch counter.
- SLURM integration: SLURM_PROCID, automatic re-submission on preemption, signal-based checkpointing.
- Reproducibility: pl.seed_everything, deterministic flag, dataloader worker seeding.
- Hyperparameter capture: self.save_hyperparameters(), which serializes the constructor arguments into checkpoints.
- Export: to_torchscript(), to_onnx().

The open-source framework is one piece of a larger product family.
Lightning Fabric, introduced with the 2.0 release in March 2023, is a lightweight alternative to the full Trainer for users who want the distributed-training and precision plumbing but want to keep their own training loop. The pitch is that you can take a raw PyTorch script, replace a handful of lines (model wrapping, optimizer wrapping, the loss.backward() call) with Fabric equivalents, and gain multi-GPU, multi-node, FSDP, DeepSpeed, and SLURM support without giving up control over the loop. This matters for reinforcement learning, GAN training, and other settings where the loop logic is itself the research contribution. Fabric is opt-in; users do not have to migrate from the Trainer to use it.
Lightning Studios, announced in December 2023 and now sold through cloud marketplaces including AWS, is the company's hosted cloud workspace. A Studio is a cloud VM with persistent storage that runs in a browser-based VS Code or terminal session. Users can switch the underlying GPU type without losing state, share Studios as templates, and run multi-node training jobs from inside a single workspace. The platform integrates the open-source frameworks but is not required to use them.
Lightning Apps was a Python-based application framework for ML pipelines that the company invested in heavily during 2021 and 2022. The idea was to express ML workflows as graphs of components that could run locally or in the cloud. Adoption was modest, and after the 2.0 launch the company deemphasized Apps in favor of Studios and Fabric. The library still exists in maintenance mode but is no longer the centerpiece of the platform.
TorchMetrics started as the metrics module inside PyTorch Lightning and was spun out into a standalone library in 2021. It now contains more than 100 metric implementations across classification, regression, image, audio, and text domains, with correct behavior for distributed training (per-rank state, sync on epoch end). TorchMetrics is usable from raw PyTorch, Fabric, or the full Trainer.
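The phrase "per-rank state, sync on epoch end" describes a specific design: each process accumulates sufficient statistics rather than a finished value, and the statistics are combined before the metric is computed. The sketch below shows why that matters, using a hypothetical ToyAccuracy class and a simulated sync. It does not use the TorchMetrics API.

```python
class ToyAccuracy:
    """Keeps sufficient statistics (correct, total) instead of a running mean."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, preds, targets):
        self.correct += sum(p == t for p, t in zip(preds, targets))
        self.total += len(targets)

    def compute(self):
        return self.correct / self.total


def sync_and_compute(rank_metrics):
    """Stand-in for the cross-rank sync: sum the states, then compute once."""
    merged = ToyAccuracy()
    merged.correct = sum(m.correct for m in rank_metrics)
    merged.total = sum(m.total for m in rank_metrics)
    return merged.compute()


rank0, rank1 = ToyAccuracy(), ToyAccuracy()
rank0.update([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct
rank1.update([0, 1], [1, 1])              # 1 of 2 correct

synced = sync_and_compute([rank0, rank1])        # 4 correct out of 6
naive = (rank0.compute() + rank1.compute()) / 2  # average of per-rank accuracies
```

When ranks see different numbers of samples, averaging the per-rank accuracies gives 0.625 while the true accuracy over all six samples is about 0.667. Syncing the underlying counts avoids that bias.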
Lightning Bolts was a library of pre-built model components (encoders, GANs, self-supervised baselines) and Lightning Flash was a higher-level task-oriented API. Both were popular in the early ecosystem but have been deprecated; the company now points users at Hugging Face, the main Lightning library, and Lightning Studio templates instead.
PyTorch Lightning has spread well beyond academic research. NVIDIA's NeMo framework for conversational AI and large language models was built on top of PyTorch Lightning and used the Trainer as its default training loop through NeMo 2.0. The Stable Diffusion training reference implementations released by Stability AI and CompVis used Lightning. Stanford's HyenaDNA project, OpenFold for protein structure prediction, and many MLPerf training submissions are built on Lightning. Hugging Face's TRL library for RLHF and DPO training has integrated Lightning for some recipes.
In industry, the framework is used at companies including NVIDIA, Meta, Microsoft, AWS, Stripe, Hugging Face, Stability AI, Cohere, and many others, often in combination with their internal experiment tracking and infrastructure. Kaggle solutions that involve PyTorch frequently use Lightning, especially for competitions where training stability and checkpoint management matter. The framework reports roughly 4 million monthly downloads on PyPI as of 2023, and Lightning AI claims more than 91 million cumulative downloads of the framework as of late 2024.
The most common reasons teams choose Lightning:
No abstraction is free, and Lightning has its share of detractors:
- An error in training_step that would be obvious in a flat script can be obscured by the Trainer's machinery.
- The Trainer class accepts dozens of arguments and the LightningModule has many optional hooks. Reading the documentation for the first time is overwhelming.
- Alternative tooling: the transformers Trainer, TRL, and Axolotl.

In the NeMo example noted earlier, NVIDIA itself moved its newer Megatron-Bridge, AutoModel, and RL projects off the Lightning Trainer to a custom PyTorch loop in 2024 and 2025, citing flexibility and ease of use; the older NeMo 2.0 still uses Lightning. This is a useful illustration of when Lightning is and is not the right tool: for typical supervised training and for fine-tuning at moderate scale, Lightning is hard to beat. For frontier-scale custom training stacks, several teams have ended up writing their own loop on top of Fabric or even raw PyTorch.
As of early 2026, the GitHub repository at Lightning-AI/pytorch-lightning has roughly 31,000 stars and 3,700 forks. The project is licensed under Apache 2.0 and accepts contributions through a Contributor License Agreement. There are dozens of maintainers and hundreds of contributors. Releases follow a roughly two-month cadence; the most recent release at the time of writing is 2.6.1, dated January 30, 2026. The package is published on PyPI as both pytorch-lightning (the original name, still maintained) and lightning (the unified package introduced in 2022, which bundles PyTorch Lightning, Fabric, and supporting libraries).
The Lightning AI organization maintains a number of companion repositories beyond the core training framework:
| Project | Purpose | Status |
|---|---|---|
| pytorch-lightning / lightning | Core framework | Actively developed |
| lightning-fabric | Lightweight Trainer alternative | Actively developed (since 2.0, March 2023) |
| torchmetrics | Distributed-aware metrics library | Actively developed |
| lightning-bolts | Reusable research components | Deprecated |
| lightning-flash | Task-oriented high-level API | Deprecated |
| lightning-apps | Python ML pipeline framework | Deemphasized (2023+) |
| litgpt | Hackable LLM training reference | Actively developed |
| litserve | Lightning-native model serving | Actively developed |
| lit-llama | Reference LLaMA training implementation | Maintenance |
Lightning AI also maintains the commercial Lightning Studios cloud product, which integrates with all of these libraries but is not required to use them.
The open-source framework is licensed under the Apache License 2.0, which permits commercial use, modification, and redistribution with attribution and patent grant. Lightning AI is the corporate sponsor and primary maintainer; the company is headquartered in New York City and has raised more than $90 million across its Seed, Series A, and Series B rounds plus the late-2024 strategic round from Cisco Investments, J.P. Morgan, K5 Global, and NVIDIA. Lightning AI joined the PyTorch Foundation as a Premier Member, alongside companies like AMD, AWS, Google, Hugging Face, Intel, Meta, Microsoft, and NVIDIA, which gives the project a formal governance role in the broader PyTorch ecosystem.