Lion (optimizer)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,411 words
Lion (EvoLved Sign Momentum) is a stochastic optimization algorithm for training deep neural networks, introduced by researchers at Google in the 2023 paper "Symbolic Discovery of Optimization Algorithms" by Xiangning Chen, Chen Liang, Da Huang, Esteban Real, and colleagues including Quoc V. Le.[^1] Unlike adaptive optimizers such as Adam or AdamW that maintain both first and second moment estimates, Lion tracks only a single exponential moving average of the gradient and produces parameter updates via the sign of an interpolation between gradient and momentum, scaled by the learning rate. The result is an optimizer whose state memory is approximately half that of AdamW while delivering competitive or superior empirical performance across vision transformers, ResNets, language models, and diffusion models.[^2] Lion was discovered through an automated symbolic program search rather than being hand-designed, making it one of the most prominent examples of machine learning algorithms surfaced by program synthesis. The paper was published as a poster at NeurIPS 2023.[^3]
Lion stands for EvoLved Sign Momentum, a name that captures both its evolutionary origin and its computational signature. The optimizer was created by formulating algorithm discovery as a program search problem and applying evolutionary search over an infinite, sparse program space defined by primitive operators on parameter, gradient, and momentum tensors.[^1] Out of the many candidate programs surfaced and distilled by the search, Lion emerged as simple, memory-light, and consistently strong across the proxy tasks used during selection. The defining feature of its update is the elementwise sign function, which strips magnitude information from a momentum-weighted gradient and keeps only its direction (-1, 0, or +1 per coordinate). As a result, every nonzero coordinate of the update has the same magnitude, equal to the learning rate (before weight decay), which produces noticeably different optimization dynamics from those of Adam or AdamW.[^2][^4]
Compared to AdamW, Lion stores only one tensor of optimizer state per parameter (the momentum buffer) rather than two (first and second moments). For very large models this halves the additional memory required for optimizer state, which can be a substantial fraction of overall accelerator memory during pre-training. Lion has been deployed in production at Google, including in a search advertising click-through-rate (CTR) model.[^1] It has also seen widespread adoption in open source through community implementations in PyTorch, JAX, and quantized form via the bitsandbytes library.[^2][^5]
| Attribute | Value |
|---|---|
| Full name | EvoLved Sign Momentum (Lion) |
| Introduced | February 13, 2023 (arXiv v1)[^1] |
| Venue | NeurIPS 2023 (poster)[^3] |
| Authors | Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le[^1] |
| Affiliation | Google (including Google Brain/DeepMind), UCLA[^1] |
| Optimizer-state size | 1 tensor per parameter (momentum only) |
| Default beta1 | 0.9[^2] |
| Default beta2 | 0.99[^2] |
| Recommended learning rate | 3 to 10 times smaller than AdamW[^2] |
| Recommended weight decay | 3 to 10 times larger than AdamW[^2] |
| Memory savings vs AdamW | About half the optimizer state[^1] |
| Reference implementation | google/automl/lion (Apache 2.0)[^2] |
The Lion paper is part of a long-running thread of work on automated discovery of machine learning algorithms. AutoML research has produced learned architectures via neural architecture search, learned loss functions, learned activations, and learned data augmentation policies. Optimization algorithms had been a particularly difficult target because they involve compositional control flow (running for many steps with internal state), discrete operator choices, and a delicate interaction with the deep learning training dynamics that they are supposed to drive.[^4]
The Lion authors framed the discovery of an optimizer as a search over symbolic programs. Each candidate optimizer is expressed as a short program that, given a parameter tensor, its gradient, and any internal state tensors, returns an updated parameter tensor and an updated state. The available primitives include arithmetic operators, the sign and absolute-value functions, exponential moving averages, and scalar multiplications by hyperparameters such as the learning rate and weight decay. Programs are constrained in length and in the number of state tensors, which biases the search toward simple, memory-efficient candidates.[^1][^4]
To navigate the resulting program space, the team used regularized evolutionary search combined with warm-starting, restart strategies, and program selection by performance on small proxy tasks (chiefly image classification with Vision Transformer (ViT) on subsets of ImageNet). Promising programs were then progressively scaled up: candidates that survived small proxy training runs were re-evaluated on larger models, more data, longer schedules, and broader tasks. To reduce overfitting to the proxy and improve generalization to target tasks, the authors applied program simplification, removing redundant operations and constants, and a funnel-style selection pipeline.[^1][^4]
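The regularized (aging) evolution loop described above can be sketched on a toy problem. This is only an illustration of the search pattern, not the authors' infrastructure: the genome (a short list of integers), the `fitness` function, the `mutate` rule, and all parameter values are hypothetical stand-ins for real optimizer programs and proxy-task accuracy.

```python
import random
from collections import deque


def fitness(genome):
    # Hypothetical stand-in for proxy-task performance: higher is better.
    return -sum((x - 3) ** 2 for x in genome)


def mutate(genome, rng):
    # Change one randomly chosen position by +/-1 (a toy "program edit").
    child = list(genome)
    i = rng.randrange(len(child))
    child[i] += rng.choice((-1, 1))
    return child


def regularized_evolution(cycles=500, population_size=20, sample_size=5, seed=0):
    rng = random.Random(seed)
    population = deque(
        [rng.randint(-10, 10) for _ in range(4)] for _ in range(population_size)
    )
    initial_best = max(population, key=fitness)
    best = initial_best
    for _ in range(cycles):
        # Tournament selection: the best of a random sample becomes the parent.
        sample = rng.sample(list(population), sample_size)
        parent = max(sample, key=fitness)
        child = mutate(parent, rng)
        population.append(child)
        # Aging: always discard the oldest individual, not the worst one.
        population.popleft()
        if fitness(child) > fitness(best):
            best = child
    return initial_best, best


initial_best, best = regularized_evolution()
print(fitness(initial_best), fitness(best))
```

Removing the oldest individual rather than the worst one is what makes the scheme "regularized": a program has to keep producing good descendants to stay represented in the population.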
After many rounds of search and distillation, the surviving algorithm was simplified to four lines of code with two hyperparameters beyond the learning rate. The authors named it Lion. They then evaluated it extensively on tasks well outside the proxy distribution, including large language model pre-training, vision-language contrastive learning, and diffusion model training, and reported competitive or improved performance with reduced compute in many cases.[^1][^4]
The first arXiv version of the paper was posted on February 13, 2023.[^1] An official implementation was released by Google in the google/automl repository, including a PyTorch reference (lion_pytorch.py) and a JAX/Flax variant.[^2] Within days, third-party implementations appeared. The most widely used is lion-pytorch maintained by Phil Wang (lucidrains), which closely mirrors the official update rule and adds an optional fused Triton kernel for GPU efficiency.[^5] The optimizer was added to the timm library used in Image Classification Models research, and to the bitsandbytes optimizer library distributed alongside HuggingFace PEFT, which also provides 8-bit and paged variants.[^6]
The paper was subsequently published as a poster at NeurIPS 2023, with proceedings available through the conference and via OpenReview.[^3] The official Google AutoML Lion repository was eventually archived in 2026, marking the codebase as read-only but leaving the reference implementations accessible.[^2]
Lion maintains a single momentum buffer m, initialized to zero, and is parameterized by a learning rate eta, two coefficients beta1 and beta2 in the interval (0, 1), and a decoupled weight decay coefficient lambda. At step t, given parameter p, gradient g, and previous momentum m, Lion performs the following operations:
1. Interpolate between the momentum and the current gradient: c = beta1 * m + (1 - beta1) * g
2. Take the elementwise sign of c and apply it together with decoupled weight decay: update_direction = sign(c), then p = p - eta * (sign(c) + lambda * p). In code this is typically written as a multiplicative shrink p = p * (1 - eta * lambda) followed by p = p - eta * sign(c).[^2]
3. Update the momentum buffer using beta2: m = beta2 * m + (1 - beta2) * g[^2]

The two distinct coefficients beta1 and beta2 are a notable feature: beta1 controls how aggressively the current gradient is mixed into the update direction, while beta2 controls how slowly the persistent momentum buffer absorbs new gradient information.[^1][^7] In Adam, by contrast, both moments use their own decays but only one of them (the first moment) participates in the direction of the update; the second moment provides per-coordinate scaling.
Because every coordinate of update_direction lies in {-1, 0, +1}, the parameter step has uniform magnitude eta across all coordinates of a given tensor (ignoring weight decay). This is fundamentally different from Adam-style updates, which produce updates whose magnitudes vary by coordinate due to the per-coordinate second-moment denominator.[^1][^4]
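To make the uniform step size concrete, here is a minimal pure-Python sketch of one Lion step on three coordinates whose gradients differ by orders of magnitude. The helper names `sign` and `lion_step` and the example values are illustrative assumptions, not taken from any reference implementation.

```python
def sign(x):
    # Scalar sign: returns -1, 0, or +1.
    return (x > 0) - (x < 0)


def lion_step(p, g, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion step on plain Python lists; returns (new_p, new_m)."""
    new_p, new_m = [], []
    for p_i, g_i, m_i in zip(p, g, m):
        c_i = beta1 * m_i + (1.0 - beta1) * g_i          # interpolation
        step_i = sign(c_i) + weight_decay * p_i          # sign plus decoupled decay
        new_p.append(p_i - lr * step_i)
        new_m.append(beta2 * m_i + (1.0 - beta2) * g_i)  # momentum uses beta2
    return new_p, new_m


p = [1.0, -2.0, 0.5]
g = [0.003, -40.0, 0.0]   # gradients of very different magnitude
m = [0.0, 0.0, 0.0]       # momentum starts at zero
new_p, new_m = lion_step(p, g, m, lr=0.01)

# Parameter change per coordinate: the same size for both nonzero gradients.
steps = [old - new for old, new in zip(p, new_p)]
print(steps)
```

With weight decay disabled, the first two coordinates move by the learning rate in opposite directions even though their gradients differ by four orders of magnitude, and the coordinate with a zero gradient and zero momentum does not move.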
Lion's defaults across both the Google reference implementation and the bitsandbytes integration are lr = 1e-4, betas = (0.9, 0.99), and weight_decay = 0.[^2][^6] In contrast to AdamW, where weight decay values around 0.01 to 0.1 are typical, Lion users are advised to combine a smaller learning rate with a larger weight decay coefficient. The official guidance states that a suitable learning rate for Lion is typically 3 to 10 times smaller than that for AdamW, and that the corresponding lambda value used for Lion is 3 to 10 times larger than that for AdamW so that the effective decay strength (lr * lambda) is comparable.[^2]
The Lion authors provide concrete examples of these hyperparameter pairings:
| Setting | Lion hyperparameters | AdamW (or Adafactor) hyperparameters |
|---|---|---|
| ViT-B/16 on ImageNet | lr=1e-4, lambda=10.0 | lr=1e-3, lambda=1.0[^2] |
| Diffusion models | lr=3e-5, lambda=0.1 | lr=3e-4, lambda=0.01[^2] |
| 7.5B language model | lr=1e-4, lambda=0.01 | Adafactor lr=1e-3, lambda=0.001[^2] |
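The pairings in the table above can be checked with simple arithmetic: in each row the product lr * lambda is the same for Lion and for the baseline optimizer. The sketch below copies the values from the table; the function name `effective_decay` is an illustrative assumption.

```python
import math

# (setting, Lion (lr, lambda), baseline (lr, lambda)), copied from the table.
pairings = [
    ("ViT-B/16 on ImageNet", (1e-4, 10.0), (1e-3, 1.0)),
    ("Diffusion models", (3e-5, 0.1), (3e-4, 0.01)),
    ("7.5B language model", (1e-4, 0.01), (1e-3, 0.001)),
]


def effective_decay(lr, weight_decay):
    # With decoupled weight decay, each step shrinks the weights by lr * lambda.
    return lr * weight_decay


results = []
for name, lion, baseline in pairings:
    lion_eff = effective_decay(*lion)
    base_eff = effective_decay(*baseline)
    results.append((name, lion_eff, base_eff))
    print(f"{name}: Lion {lion_eff:.1e}, baseline {base_eff:.1e}")
```

In all three rows a tenfold smaller learning rate is paired with a tenfold larger lambda, so the per-step shrinkage of the weights is unchanged.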
For very large or unstable settings, the paper recommends raising beta1 from 0.9 toward 0.95 and reducing beta2 from 0.99 toward 0.98 to soften the update direction and stabilize training.[^5]
The sign-of-momentum update has several intertwined consequences that distinguish Lion from Adam and from plain stochastic gradient descent (SGD) with momentum.
- Every coordinate of the update has magnitude eta. This is similar to sign SGD, but with a momentum-smoothed direction rather than the raw gradient sign.[^1]
- Pairing a smaller lr with a larger lambda is necessary to keep the effective decay (lr * lambda) on the same scale as in AdamW.[^2]

A compact PyTorch-style description of the Lion update is shown below.
```python
import torch


# State per parameter: exp_avg (momentum buffer m), initialized to zero
@torch.no_grad()  # in-place updates on parameters must not be tracked by autograd
def lion_step(p, grad, exp_avg, lr, beta1, beta2, weight_decay):
    # Decoupled weight decay
    p.mul_(1.0 - lr * weight_decay)
    # Compute update direction from interpolated momentum and gradient
    update = exp_avg.mul(beta1).add(grad, alpha=1.0 - beta1).sign_()
    p.add_(update, alpha=-lr)
    # Update momentum buffer
    exp_avg.mul_(beta2).add_(grad, alpha=1.0 - beta2)
```
The first statement in the function body implements decoupled weight decay as in AdamW. The second computes the interpolated direction beta1 * m + (1 - beta1) * g and applies the sign function elementwise. The third is the actual parameter update. The fourth updates the persistent momentum buffer using beta2. This structure is consistent across the Google AutoML reference implementation, the lucidrains lion-pytorch package, and the bitsandbytes Lion variants.[^2][^5][^6]
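The update rule is usually stated in additive form, p = p - eta * (sign(c) + lambda * p), while implementations apply the decay as a multiplicative shrink first. A quick numerical check, using hypothetical helper names and arbitrary test values, suggests the two forms agree up to floating-point rounding:

```python
import math


def sign(x):
    # Scalar sign: returns -1, 0, or +1.
    return (x > 0) - (x < 0)


def step_additive(p, c, lr, weight_decay):
    # Form used when stating the rule: p - eta * (sign(c) + lambda * p)
    return p - lr * (sign(c) + weight_decay * p)


def step_multiplicative(p, c, lr, weight_decay):
    # Form used in code: shrink the weight first, then apply the sign step.
    p = p * (1.0 - lr * weight_decay)
    return p - lr * sign(c)


lr, weight_decay = 1e-4, 10.0
cases = [(0.8, 0.3), (-1.5, -2.0), (2.0, 0.0), (-0.25, 5.0)]  # (p, c) pairs
agree = [
    math.isclose(
        step_additive(p, c, lr, weight_decay),
        step_multiplicative(p, c, lr, weight_decay),
        rel_tol=1e-12,
        abs_tol=1e-15,
    )
    for p, c in cases
]
print(agree)
```

Both forms decay the current weight by a factor of (1 - eta * lambda) and subtract eta * sign(c); they differ only in the order of the floating-point operations.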
For a model with N trainable parameters in fp32, AdamW maintains approximately 2N floats of optimizer state (first and second moments). Lion maintains approximately N floats. In mixed-precision training the precise overhead depends on master copies and optimizer-state dtype, but the qualitative halving holds in most configurations. For very large models, optimizer state can rival or exceed model parameter memory itself, so the savings translate into meaningful gains in trainable model size or batch size on a given accelerator.[^1][^4] The bitsandbytes Lion8bit and PagedLion8bit variants reduce this further by quantizing the momentum buffer to 8 bits, with block-wise quantization and percentile clipping to manage outliers.[^6]
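As a rough back-of-the-envelope illustration of these figures, the sketch below counts optimizer-state bytes only. It assumes a 7.5-billion-parameter model and deliberately ignores weights, gradients, master copies, and the small metadata overhead of block-wise quantization; the function name is hypothetical.

```python
def optimizer_state_bytes(num_params, tensors_per_param, bytes_per_value):
    # Optimizer state only: no weights, gradients, or quantization metadata.
    return num_params * tensors_per_param * bytes_per_value


N = 7_500_000_000  # assumed model size for illustration

adamw_fp32 = optimizer_state_bytes(N, tensors_per_param=2, bytes_per_value=4)
lion_fp32 = optimizer_state_bytes(N, tensors_per_param=1, bytes_per_value=4)
lion_8bit = optimizer_state_bytes(N, tensors_per_param=1, bytes_per_value=1)

for name, size in [("AdamW fp32", adamw_fp32),
                   ("Lion fp32", lion_fp32),
                   ("Lion 8-bit", lion_8bit)]:
    print(f"{name}: {size / 1e9:.1f} GB")
# AdamW fp32: 60.0 GB
# Lion fp32: 30.0 GB
# Lion 8-bit: 7.5 GB
```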
The Lion paper reports experiments across multiple modalities, model scales, and tasks. The numbers cited below are drawn from the paper's abstract, blog summaries, and the official Google README; they are presented as reported by the authors and have been corroborated by subsequent third-party evaluations to varying degrees.
On supervised image classification with Vision Transformer (ViT) models trained on ImageNet, Lion improves top-1 accuracy by up to 2 percentage points compared with AdamW under matched compute budgets.[^1][^4] On pre-training with the proprietary JFT dataset, Lion is reported to save up to 5x in pre-training compute while reaching the same downstream accuracy.[^1][^4] Improvements over AdamW are also reported on ResNet image classification, though the gap is smaller than on transformers.[^1]
For vision-language Contrastive Learning in the LiT and BASIC-L training recipes, which resemble CLIP (Contrastive Language-Image Pre-training) style learning at scale, Lion reaches 88.3% zero-shot accuracy and 91.1% fine-tuned accuracy on ImageNet, surpassing the previous best reported by 2 and 0.1 percentage points respectively.[^1][^4]
For Diffusion model training, the paper reports that Lion achieves better Frechet Inception Distance (FID) than AdamW while reducing training compute by up to 2.3x.[^1][^4] These results contributed to community interest in using Lion for Stable Diffusion and related text-to-image fine-tuning workflows.[^5]
On autoregressive and masked language modeling tasks, including a 7.5-billion-parameter dense language model, Lion is reported to match or exceed AdamW (and Adafactor) on validation perplexity, with up to roughly 2x reductions in compute to reach a target perplexity in some configurations.[^1][^2] On language model fine-tuning, results are reported as comparable or slightly improved over AdamW.[^1]
Beyond accuracy and FID, the authors note that Lion delivers a 2 to 15 percent step-time speedup over AdamW and Adafactor at the same batch size and model size, because the sign-based update is simpler than the per-coordinate adaptive computation of Adam.[^2] Combined with the halved optimizer-state memory, this can either reduce wall-clock training time at fixed hardware or unlock larger effective batch sizes.[^1][^4]
The original Google reference implementation lives in the google/automl/lion directory and contains both a PyTorch implementation (lion_pytorch.py) and a JAX-based optimizer module compatible with Flax models.[^2] The code is released under the Apache 2.0 license. The official README documents recommended hyperparameter ranges, model-specific settings used in the paper, and known limitations. The repository was archived in May 2026 and remains accessible read-only.[^2]
The lion-pytorch package, maintained by Phil Wang (lucidrains), is the most widely used third-party PyTorch implementation. It mirrors the official update rule and exposes the same hyperparameters, with an optional use_triton=True flag that selects a fused Triton GPU kernel for higher throughput. The package is distributed on PyPI under MIT licensing and has accumulated thousands of GitHub stars since its release in early 2023.[^5]
The bitsandbytes library, often paired with HuggingFace PEFT for memory-efficient fine-tuning, includes Lion as a first-class optimizer. The classes bitsandbytes.optim.Lion, Lion8bit, Lion32bit, PagedLion, PagedLion8bit, and PagedLion32bit provide variants ranging from a 32-bit reference implementation to 8-bit quantized momentum buffers and paged variants that offload optimizer state to host memory. The default hyperparameters match the original paper (lr=1e-4, betas=(0.9, 0.99), weight_decay=0), and 8-bit variants apply block-wise quantization plus percentile clipping to manage outlier values in the momentum tensor.[^6]
Lion is implemented in timm/optim/lion.py within the timm (PyTorch image models) library, making it a built-in option for training Image Classification Models and ViTs in that ecosystem.[^9] The fastxtend project, which extends fastai with additional optimizers, also provides a Lion implementation along with documentation on hyperparameter migration from Adam.[^10]
Beyond the official JAX/Flax reference inside google/automl, community ports include a TensorFlow implementation distributed via the Lion-tensorflow GitHub repository, and various ports into other JAX optimizer libraries.[^11] These ports faithfully reproduce the published update rule and use the same default hyperparameters.
In the months following its release, Lion was integrated into multiple training stacks. Practitioners on the Transformers community forum reported using Lion for fine-tuning BERT-style and decoder-only language models, though early attempts surfaced API mismatches between Lion implementations and the HuggingFace Trainer; these were resolved via the bitsandbytes integration.[^6][^12]
Beyond research code, Lion saw adoption in stable diffusion fine-tuning communities, where memory savings translate directly into the ability to train at higher resolutions on consumer GPUs. The lucidrains lion-pytorch README enumerates community-reported successes on text-to-image fine-tuning, language model pre-training at moderate scale, and ViT classification, alongside negative results in Recurrent Neural Network-based, Feedforward Neural Network (FFN)-only, and certain hybrid architectures.[^5]
A 2025 study by Caglar et al. evaluated Lion against AdamW for fine-tuning cross-encoder rerankers built on MiniLM, GTE, and ModernBERT, and reported GPU utilization efficiency gains of 2.67% to 10.33% while maintaining competitive retrieval performance, illustrating Lion's continued relevance for production-leaning workloads.[^13] A 2025 budget LLM pre-training comparison of AdamW, Lion, and Sophia found Lion to be the fastest in training GPU-hours while AdamW delivered the best downstream evaluation scores, with Sophia exhibiting the lowest training and validation losses, suggesting that the right choice remains task and budget dependent.[^14]
Because Lion was discovered by search rather than derived from optimization theory, its convergence behavior was initially poorly characterized. Subsequent work has begun to close this gap.
- The use of distinct beta1 and beta2 coefficients yields meaningfully different dynamics from prior sign-momentum methods.[^4][^7]

These analyses do not yet fully match empirical observations, particularly around the larger batch sizes that the authors recommend, but they have removed the original objection that Lion was a purely heuristic discovery with no convergence guarantees.[^7][^15]
The Lion paper and subsequent empirical reports document several caveats.
- Reports in the lion-pytorch discussion thread indicate that Lion can underperform on RL workloads, on purely feedforward networks, and on architectures that mix LSTM and convolution layers. Most of the strong reported results are on Transformer and diffusion model architectures, which dominated the search proxy.[^5]

| Optimizer | State per parameter | Update direction | Magnitude scaling | First introduced |
|---|---|---|---|---|
| SGD with momentum | 1 (momentum) | momentum buffer | learning rate (uniform) | classical |
| Sign SGD | 0 | sign of gradient | learning rate (uniform) | classical |
| RMSProp | 1 (second moment) | scaled gradient | per-coordinate (root mean square) | Hinton, 2012 |
| Adam / AdamW | 2 (m, v) | scaled first moment | per-coordinate (Adam denominator) | Kingma & Ba 2014; Loshchilov & Hutter 2017 |
| Lion | 1 (momentum) | sign of interpolated momentum | learning rate (uniform) | Chen et al. 2023[^1] |
Lion shares its uniform-magnitude update with sign-based methods such as sign SGD rather than with RMSProp or Adam. It differs from sign SGD by taking the sign of a momentum-weighted interpolation rather than of the raw gradient, and by maintaining a separate momentum buffer updated with beta2. Compared to AdamW, Lion halves the optimizer state while keeping AdamW-style decoupled weight decay.[^1][^2]
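The difference in magnitude scaling can be seen by feeding the same gradient sequence for a single coordinate to an Adam-style rule and to Lion. The following is a minimal sketch with weight decay omitted; the function names and the gradient values are illustrative assumptions, and the Adam rule is the standard bias-corrected form rather than any particular library's implementation.

```python
import math


def sign(x):
    # Scalar sign: returns -1, 0, or +1.
    return (x > 0) - (x < 0)


def adam_steps(grads, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # Signed parameter change applied by bias-corrected Adam at each step.
    m = v = 0.0
    out = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        out.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return out


def lion_steps(grads, lr, beta1=0.9, beta2=0.99):
    # Signed parameter change applied by Lion at each step.
    m = 0.0
    out = []
    for g in grads:
        c = beta1 * m + (1.0 - beta1) * g
        out.append(-lr * sign(c))
        m = beta2 * m + (1.0 - beta2) * g
    return out


grads = [1.0, -0.5, 0.2, 0.05]  # one coordinate's gradient over four steps
lr = 1e-3
adam = adam_steps(grads, lr)
lion = lion_steps(grads, lr)
print([abs(s) / lr for s in adam])  # varies from step to step
print([abs(s) / lr for s in lion])  # always 1.0
```

Adam's step shrinks when the recent gradients disagree in sign or are small relative to their running second moment, whereas Lion's step keeps the full learning-rate magnitude and only its direction changes.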
Second-order alternatives such as Sophia, which use diagonal Hessian estimates, target similar gains on large language models but with a different mechanism. In some benchmarks Sophia achieves 2x speedups over AdamW and Lion on LLM pre-training, while in others, particularly at small to medium scales, simpler sign-based momentum optimizers match or exceed Sophia, suggesting that the right answer depends on the regime.[^14]
Lion's significance lies along three axes. First, as an algorithm, it is a memory-efficient, easy-to-implement optimizer that improves training in a notable range of modern deep learning settings, particularly with Transformer and diffusion model architectures. Its halved optimizer-state footprint matters directly for large-scale pre-training, and it can be combined with parameter-efficient fine-tuning methods such as LoRA (Low-Rank Adaptation) and QLoRA in the HuggingFace PEFT ecosystem.[^1][^6]
Second, as a demonstration of automated algorithm discovery, the Lion paper is one of the clearest existence proofs that program search can produce useful, deployed components of the deep learning stack. Earlier work demonstrated learned activations, learned architectures via Neural architecture search, and learned loss functions, but the discovery of a widely adopted optimizer pushes AutoML into a region of the design space (gradient-based training algorithms) that had been considered particularly mature and resistant to automation.[^1][^4]
Third, Lion has spurred a theoretical research thread around sign-based optimizers, constrained optimization perspectives via Lyapunov analysis, and convergence rates for adaptive sign methods. This has produced new vocabulary and tools that apply beyond Lion itself, including a clearer connection between sign-based updates and implicit constraints on parameter norms.[^7][^15]
Lion's tunable hyperparameters are lr, beta1, beta2, and weight_decay.