# Lion (optimizer)

> Source: https://aiwiki.ai/wiki/lion_optimizer
> Updated: 2026-06-25
> Categories: Algorithms, Google, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Lion** (EvoLved Sign Momentum) is a stochastic [optimizer](/wiki/optimizer) for training deep neural networks, introduced by researchers at Google in the February 2023 paper "Symbolic Discovery of Optimization Algorithms" by Xiangning Chen, Chen Liang, Da Huang, Esteban Real, and colleagues including Quoc V. Le.[^1] Unlike adaptive optimizers such as [Adam](/wiki/adam_optimizer) or AdamW that maintain both first and second moment estimates, Lion tracks only a single exponential moving average of the gradient (the momentum) and produces parameter updates from the sign of an interpolation between gradient and momentum, scaled by the learning rate. As the abstract states, Lion "is more memory-efficient than Adam as it only keeps track of the momentum", and "different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation."[^1] The result is an optimizer whose state memory is about half that of AdamW while delivering competitive or superior performance across vision transformers, ResNets, language models, and diffusion models: on ImageNet it boosts Vision Transformer top-1 accuracy by up to 2% and saves up to 5x the pre-training compute on JFT.[^1][^2] Lion was discovered by an automated symbolic program search rather than hand-designed, making it one of the most prominent machine learning algorithms surfaced by program synthesis. The paper was published as a poster at NeurIPS 2023.[^3]

## What is the Lion optimizer?

Lion stands for **EvoLved Sign Momentum**, a name that captures both its evolutionary origin and its computational signature. The optimizer was created by formulating algorithm discovery as a program search problem and applying evolutionary search over an infinite, sparse program space defined by primitive operators on parameter, gradient, and momentum tensors.[^1] Out of the many candidate programs surfaced and distilled by the search, Lion emerged as simple, memory-light, and consistently strong across the proxy tasks used during selection. The defining feature of its update is the elementwise sign function, which strips magnitude information from a momentum-weighted gradient and keeps only a per-coordinate direction of -1, 0, or +1. Before weight decay, every nonzero coordinate of the update therefore has the same magnitude, equal to the learning rate, which produces noticeably different optimization dynamics from those of [Adam](/wiki/adam_optimizer) or [AdamW](/wiki/adamw).[^2][^4]

Compared to AdamW, Lion stores only one tensor of optimizer state per parameter (the momentum buffer) rather than two (first and second moments). For very large models this halves the additional memory required for optimizer state, which can be a substantial fraction of overall accelerator memory during pre-training. Lion has been deployed in production at Google: the paper reports that "Lion is also successfully deployed in production systems such as Google's search ads CTR model."[^1] It has also seen widespread adoption in open source through community implementations in [PyTorch](/wiki/pytorch) and [JAX](/wiki/jax), and in quantized form via the bitsandbytes library.[^2][^5]

| Attribute | Value |
| --- | --- |
| Full name | EvoLved Sign Momentum (Lion) |
| Introduced | February 13, 2023 (arXiv v1); v4 dated May 8, 2023[^1] |
| Venue | NeurIPS 2023 (poster)[^3] |
| Authors | Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le[^1] |
| Affiliation | Google (including Google Brain/DeepMind), UCLA[^1] |
| Optimizer-state size | 1 tensor per parameter (momentum only), about half of AdamW[^1] |
| Default beta1 | 0.9[^2] |
| Default beta2 | 0.99[^2] |
| Default learning rate | 1e-4[^2] |
| Recommended learning rate | 3 to 10 times smaller than AdamW[^2] |
| Recommended weight decay | 3 to 10 times larger than AdamW[^2] |
| Step-time speedup vs AdamW/Adafactor | 2 to 15%[^2] |
| Reference implementation | google/automl/lion (Apache 2.0)[^2] |

## How was Lion discovered?

### Symbolic program search context

The Lion paper is part of a long-running thread of work on automated discovery of machine learning algorithms. [AutoML](/wiki/automl) research has produced learned architectures via [Neural architecture search](/wiki/neural_architecture_search), learned loss functions, learned activations, and learned data augmentation policies. Optimization algorithms had been a particularly difficult target because they involve compositional control flow (running for many steps with internal state), discrete operator choices, and a delicate interaction with the underlying [Deep Learning](/wiki/deep_learning) training dynamics that they are supposed to drive.[^4]

The Lion authors framed the discovery of an optimizer as a search over symbolic programs. Each candidate optimizer is expressed as a short program that, given a parameter tensor, its gradient, and any internal state tensors, returns an updated parameter tensor and an updated state. The available primitives include arithmetic operators, the sign and absolute-value functions, exponential moving averages, and scalar multiplications by hyperparameters such as the learning rate and weight decay. Programs are constrained in length and in the number of state tensors, which biases the search toward simple, memory-efficient candidates.[^1][^4]

### Discovery method

To navigate the resulting program space, the team used regularized evolutionary search combined with warm-starting, restart strategies, and program selection by performance on small proxy tasks (chiefly image classification with [Vision Transformer (ViT)](/wiki/vision_transformer_vit) on subsets of [ImageNet](/wiki/imagenet)). Each search experiment generated on the order of 200,000 to 300,000 candidate programs, of which roughly 20,000 to 30,000 were actually trained and evaluated.[^1] Promising programs were then progressively scaled up: candidates that survived small proxy training runs were re-evaluated on larger models, more data, longer schedules, and broader tasks. To reduce overfitting to the proxy and improve generalization to target tasks, the authors applied program simplification, removing redundant operations and constants, and a funnel-style selection pipeline.[^1][^4]
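
The regularized (aging) evolution loop at the core of such a search can be sketched in a few lines. The example below is a toy illustration only, not the authors' search code: the "programs" are hypothetical tuples of integer opcodes, `toy_fitness` stands in for accuracy on a proxy task, and the population and tournament sizes are arbitrary assumptions. Warm-starting, restarts, and the funnel selection described above are omitted.

```python
import random
from collections import deque

def regularized_evolution(fitness, mutate, init, cycles=200,
                          population_size=20, tournament_size=5, seed=0):
    """Toy aging-evolution loop: tournament selection, mutation, drop the oldest."""
    rng = random.Random(seed)
    population = deque()
    for _ in range(population_size):
        candidate = init(rng)
        population.append((candidate, fitness(candidate)))
    best = max(population, key=lambda item: item[1])
    for _ in range(cycles):
        # Tournament: sample a few members and use the fittest as the parent.
        tournament = rng.sample(list(population), tournament_size)
        parent, _ = max(tournament, key=lambda item: item[1])
        child = mutate(parent, rng)
        scored = (child, fitness(child))
        population.append(scored)
        population.popleft()  # remove the oldest member, not the worst one
        if scored[1] > best[1]:
            best = scored
    return best

# Hypothetical stand-in for a program: a tuple of six integer opcodes.
TARGET = (3, 1, 4, 1, 5, 2)

def toy_init(rng):
    return tuple(rng.randrange(8) for _ in TARGET)

def toy_mutate(program, rng):
    # Replace one randomly chosen opcode.
    i = rng.randrange(len(program))
    return program[:i] + (rng.randrange(8),) + program[i + 1:]

def toy_fitness(program):
    # Stand-in for proxy-task accuracy: positions that match TARGET.
    return sum(a == b for a, b in zip(program, TARGET))

best_program, best_score = regularized_evolution(toy_fitness, toy_mutate, toy_init)
print(best_program, best_score)
```

Removing the oldest member rather than the weakest is what makes the evolution "regularized": a candidate survives only if its descendants keep being selected, which limits the influence of a single lucky evaluation.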

After many rounds of search and distillation, the surviving algorithm was simplified to a few lines of code with two hyperparameters beyond the learning rate. The authors named it Lion. They then evaluated it extensively on tasks well outside the proxy distribution, including large [Language Model](/wiki/language_model) pre-training, vision-language [Contrastive Learning](/wiki/contrastive_learning), and [Diffusion model](/wiki/diffusion_model) training, and reported competitive or improved performance with reduced compute in many cases.[^1][^4]

### When was Lion released and how widely was it adopted?

The first arXiv version of the paper was posted on February 13, 2023; the latest revision (v4) is dated May 8, 2023.[^1] An official implementation was released by Google in the `google/automl` repository, including a PyTorch reference (`lion_pytorch.py`) and a JAX/Flax variant.[^2] Within days, third-party implementations appeared. The most widely used is `lion-pytorch` maintained by Phil Wang (lucidrains), which closely mirrors the official update rule and adds an optional fused Triton kernel for GPU efficiency.[^5] The optimizer was added to the `timm` library used in [Image Classification Models](/wiki/image_classification_models) research, and to the bitsandbytes optimizer library, often paired with [HuggingFace PEFT](/wiki/huggingface_peft), where 8-bit and paged Lion variants are also available.[^6]

The paper was subsequently published as a poster at NeurIPS 2023, with proceedings available through the conference and via OpenReview.[^3] The official Google AutoML Lion repository was eventually archived in 2026, marking the codebase as read-only but leaving the reference implementations accessible.[^2]

## How does the Lion update rule work?

### The update rule

Lion maintains a single momentum buffer `m`, initialized to zero, and is parameterized by a learning rate `eta`, two coefficients `beta1` and `beta2` in the interval (0, 1), and a decoupled weight decay coefficient `lambda`. At step `t`, given parameter `p`, gradient `g`, and previous momentum `m`, Lion performs the following operations:

1. Compute the update direction by interpolating the momentum and the current gradient, then taking its elementwise sign:
   `c = beta1 * m + (1 - beta1) * g`
   `update_direction = sign(c)`
2. Apply decoupled weight decay and the parameter step:
   `p = p - eta * (sign(c) + lambda * p)`
   In code this is typically written as a multiplicative shrink `p = p * (1 - eta * lambda)` followed by `p = p - eta * sign(c)`.[^2]
3. Update the momentum buffer using a separate coefficient `beta2`:
   `m = beta2 * m + (1 - beta2) * g`[^2]

The two distinct coefficients `beta1` and `beta2` are a notable feature: `beta1` controls how aggressively the current gradient is mixed into the *update direction*, while `beta2` controls how slowly the persistent momentum buffer absorbs new gradient information.[^1][^7] In Adam, by contrast, both moments use their own decays but only one of them (the first moment) participates in the direction of the update; the second moment provides per-coordinate scaling.

Because every coordinate of `update_direction` lies in `{-1, 0, +1}`, the parameter step has uniform magnitude `eta` across all coordinates of a given tensor (ignoring weight decay). This is fundamentally different from Adam-style updates, which produce updates whose magnitudes vary by coordinate due to the per-coordinate second-moment denominator.[^1][^4]
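
A minimal pure-Python sketch of one Lion step makes the uniform step size concrete. It follows the three operations listed above on plain lists rather than tensors; the function name `lion_step`, the helper `sign`, and the example values are illustrative choices, not part of any library.

```python
def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)

def lion_step(p, g, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion step on plain Python lists; returns (new_p, new_m)."""
    # Step 1: interpolate momentum and gradient, then take the sign.
    c = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    direction = [sign(ci) for ci in c]
    # Step 2: decoupled weight decay followed by the signed step.
    new_p = [pi * (1 - lr * weight_decay) - lr * di
             for pi, di in zip(p, direction)]
    # Step 3: update the momentum buffer with the separate coefficient beta2.
    new_m = [beta2 * mi + (1 - beta2) * gi for mi, gi in zip(m, g)]
    return new_p, new_m

p = [1.0, -2.0, 0.5]
g = [1e-6, -250.0, 0.0]  # tiny, huge, and zero gradient components
m = [0.0, 0.0, 0.0]

new_p, new_m = lion_step(p, g, m, lr=0.01)
steps = [old - new for old, new in zip(p, new_p)]
print(steps)
```

The tiny and the huge gradient components both move their parameter by `lr` (up to floating-point rounding), in opposite directions, while the zero component does not move at all.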

### Pseudocode

A compact PyTorch-style description of the Lion update is shown below.

```python
import torch

# State per parameter: exp_avg (momentum buffer m), initialized to zero
@torch.no_grad()  # in-place updates on parameters must not be tracked by autograd
def lion_step(p, grad, exp_avg, lr, beta1, beta2, weight_decay):
    # Decoupled weight decay: shrink the parameter in place
    p.mul_(1.0 - lr * weight_decay)
    # Update direction: sign of the interpolation beta1 * m + (1 - beta1) * g
    update = exp_avg.mul(beta1).add(grad, alpha=1.0 - beta1).sign_()
    # Parameter step of uniform magnitude lr
    p.add_(update, alpha=-lr)
    # Update momentum buffer with beta2, using the same gradient
    exp_avg.mul_(beta2).add_(grad, alpha=1.0 - beta2)
```

The first statement in the function body implements decoupled weight decay as in AdamW. The second computes the interpolated direction `beta1 * m + (1 - beta1) * g` and applies the sign function elementwise. The third applies the parameter update. The fourth updates the persistent momentum buffer using `beta2`. This structure is consistent across the Google AutoML reference implementation, the lucidrains `lion-pytorch` package, and the bitsandbytes Lion variants.[^2][^5][^6]

### Why does the sign update behave differently?

The sign-of-momentum update has several intertwined consequences that distinguish Lion from Adam and from plain [Stochastic Gradient Descent (SGD)](/wiki/stochastic_gradient_descent_sgd) with [Momentum](/wiki/momentum).

- **Uniform per-coordinate step size.** Lion gives every coordinate the same per-step displacement `eta`. This is similar to sign SGD, but with a momentum-smoothed direction rather than the raw gradient sign.[^1]
- **Implicitly larger effective updates.** Because the sign operation discards magnitude information and moves every nonzero coordinate by the full learning rate, the typical norm of Lion's update is larger than that of Adam's update on the same model. To compensate, the recommended Lion learning rate is 3 to 10 times smaller than the AdamW learning rate.[^2]
- **Decoupled weight decay couples differently.** Since the update direction is bounded but weight decay is proportional to the parameter, the regularization-to-step-size balance shifts. The authors observe that pairing a smaller `lr` with a larger `lambda` is necessary to keep the effective decay (`lr * lambda`) on the same scale as in AdamW.[^2]
- **Robustness to gradient outliers.** Sign-based methods are insensitive to a small number of very large gradient components, which can otherwise dominate adaptive updates. Empirically, this can help on tasks with heavy-tailed gradient distributions.[^4][^7]
- **Smoother minima.** Recent theoretical analyses argue that Lion tends to converge into smoother regions of the loss landscape, which has been associated with better [Regularization](/wiki/regularization) and generalization. These analyses also document conditions under which Lion provably converges, addressing a gap left open by the original discovery-driven paper.[^7][^8]
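
Two of the points above, the larger update norm and the insensitivity to outliers, can be checked with a small numeric sketch. The dimension, gradient values, and helper names below are arbitrary assumptions chosen for illustration; the momentum buffer is taken to be zero, as at the first step.

```python
import math

def sign(x):
    return (x > 0) - (x < 0)

def lion_direction(g, m, beta1=0.9):
    """Sign of the interpolation between momentum m and gradient g."""
    return [sign(beta1 * mi + (1 - beta1) * gi) for mi, gi in zip(m, g)]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

d = 10_000
m = [0.0] * d
g = [1e-3 if i % 2 == 0 else -1e-3 for i in range(d)]

direction = lion_direction(g, m)
update_norm = l2_norm(direction)  # sqrt(d) when no coordinate is zero

# Inflate a single gradient component by a factor of one million.
g_outlier = list(g)
g_outlier[0] *= 1e6
direction_outlier = lion_direction(g_outlier, m)

print(l2_norm(g), update_norm)
print(direction == direction_outlier)
```

Before scaling by the learning rate, the sign direction has norm `sqrt(d)` regardless of how small the gradient is, which is why the learning rate has to shrink. The outlier changes neither the direction nor the step taken by any coordinate.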

### How much memory does Lion save?

For a model with `N` trainable parameters in fp32, AdamW maintains approximately `2N` floats of optimizer state (first and second moments). Lion maintains approximately `N` floats. In mixed-precision training the precise overhead depends on master copies and optimizer-state dtype, but the qualitative halving holds in most configurations. For very large models, optimizer state can rival or exceed model parameter memory itself, so the savings translate into meaningful gains in trainable model size or batch size on a given accelerator.[^1][^4] The bitsandbytes Lion8bit and PagedLion8bit variants reduce this further by quantizing the momentum buffer to 8 bits, with block-wise quantization and percentile clipping to manage outliers.[^6]
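
The arithmetic can be written out as a short sketch. The 7.5-billion parameter count is only an example scale, and the 8-bit line ignores the small per-block quantization constants, so the figures are rough estimates rather than measurements.

```python
def optimizer_state_bytes(num_params, state_tensors, bytes_per_value=4):
    """Bytes of optimizer state, ignoring parameters, gradients, and activations."""
    return num_params * state_tensors * bytes_per_value

GIB = 1024 ** 3
num_params = 7_500_000_000  # hypothetical 7.5B-parameter model

adamw = optimizer_state_bytes(num_params, state_tensors=2)  # first and second moments
lion = optimizer_state_bytes(num_params, state_tensors=1)   # momentum only
lion_8bit = optimizer_state_bytes(num_params, state_tensors=1, bytes_per_value=1)

for name, size in [("AdamW fp32", adamw), ("Lion fp32", lion), ("Lion 8-bit", lion_8bit)]:
    print(f"{name}: {size / GIB:.1f} GiB")
```

Under these assumptions the state shrinks from roughly 56 GiB for AdamW to roughly 28 GiB for Lion, and to about 7 GiB with an 8-bit momentum buffer.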

## How do you tune Lion?

### Hyperparameters and defaults

Lion's defaults across both the Google reference implementation and the bitsandbytes integration are `lr = 1e-4`, `betas = (0.9, 0.99)`, and `weight_decay = 0`.[^2][^6] In contrast to AdamW, where weight decay values around 0.01 to 0.1 are typical, Lion users are advised to combine a smaller [Learning Rate](/wiki/learning_rate) with a larger [Weight Decay](/wiki/weight_decay) coefficient. The official guidance states that "a suitable learning rate for Lion is typically 3-10x smaller than that for AdamW", and that "the value of lambda used for Lion is 3-10x larger than that for AdamW in order to maintain a similar strength", so that the effective decay (`lr * lambda`) is comparable.[^2] The README adds that when changing the learning rate, "the initial value, peak value, and end value of the learning rate should be changed simultaneously with the same ratio."[^2]

The Lion authors provide concrete examples of these hyperparameter pairings:

| Setting | Lion hyperparameters | AdamW (or Adafactor) hyperparameters |
| --- | --- | --- |
| ViT-B/16 on ImageNet | lr=1e-4, lambda=10.0 | lr=1e-3, lambda=1.0[^2] |
| Diffusion models | lr=3e-5, lambda=0.1 | lr=3e-4, lambda=0.01[^2] |
| 7.5B language model | lr=1e-4, lambda=0.01 | Adafactor lr=1e-3, lambda=0.001[^2] |
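
The scaling rule behind these pairings can be expressed as a small helper. This is a sketch of a starting-point heuristic under the 3-10x guidance quoted above, not an official utility; the function names and the default ratio of 10 are assumptions, and the result still needs tuning.

```python
def adamw_to_lion(adamw_lr, adamw_weight_decay, ratio=10.0):
    """Divide the learning rate and multiply the weight decay by the same ratio,
    so that the effective decay lr * lambda stays the same."""
    return adamw_lr / ratio, adamw_weight_decay * ratio

def scale_schedule(initial, peak, end, ratio):
    """Scale the initial, peak, and end learning rates by one common ratio."""
    return initial / ratio, peak / ratio, end / ratio

# ViT-B/16 on ImageNet row from the table: AdamW lr=1e-3, lambda=1.0
lion_lr, lion_wd = adamw_to_lion(1e-3, 1.0, ratio=10.0)
print(lion_lr, lion_wd)
print(scale_schedule(0.0, 1e-3, 1e-5, ratio=10.0))
```

With a ratio of 10 the helper reproduces the ViT-B/16 and diffusion rows of the table; a ratio closer to 3 is the more conservative end of the recommended range.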

For very large or unstable settings, the authors suggest raising `beta1` from 0.9 toward 0.95 and reducing `beta2` from 0.99 toward 0.98 to mitigate training instability. A higher `beta1` gives the current gradient less weight in the update direction, and as the README puts it, "reducing beta2 results in shorter memorization of historical information and enhanced training stability."[^2][^5]
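
How much shorter the memory becomes can be estimated from the exponential moving average itself. The sketch below uses the standard EMA identities for `m = beta2 * m + (1 - beta2) * g`; the helper names are illustrative.

```python
import math

def ema_half_life(beta):
    """Number of steps k after which a gradient's weight halves: beta**k = 0.5."""
    return math.log(0.5) / math.log(beta)

def ema_weight(beta, k):
    """Weight of the gradient seen k steps ago in m = beta * m + (1 - beta) * g."""
    return (1 - beta) * beta ** k

for beta2 in (0.99, 0.98):
    half_life = round(ema_half_life(beta2), 1)
    horizon = round(1 / (1 - beta2))  # common rule-of-thumb averaging window
    print(beta2, half_life, horizon)
```

Moving `beta2` from 0.99 to 0.98 cuts the half-life of a gradient's influence from about 69 steps to about 34, and the rule-of-thumb averaging window from 100 steps to 50.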

## What are Lion's benchmark results?

The Lion paper reports experiments across multiple modalities, model scales, and tasks. The numbers cited below are drawn from the paper, blog summaries, and the official Google README; they are presented as reported by the authors and have been corroborated by subsequent third-party evaluations to varying degrees.

### Image classification

On supervised image classification with [Vision Transformer (ViT)](/wiki/vision_transformer_vit) models trained on [ImageNet](/wiki/imagenet), Lion improves top-1 accuracy by up to 2 percentage points compared with AdamW under matched compute budgets.[^1][^4] On pre-training with the proprietary JFT dataset, Lion is reported to save up to 5x in pre-training compute while reaching the same downstream accuracy.[^1][^4] After scaling up to JFT-3B, the ViT-G/14 model trained by Lion reaches 90.71% top-1 ImageNet accuracy, and the paper reports that the smaller ViT-g/14 trained by Lion already outperforms the previous ViT-G/14 result with 1.8x fewer parameters.[^1] Improvements over AdamW are also reported on [ResNet](/wiki/resnet) image classification, though the gap is smaller than on transformers.[^1]

### Vision-language contrastive learning

For vision-language [Contrastive Learning](/wiki/contrastive_learning) in the LiT and BASIC-L training recipes, which resemble [CLIP (Contrastive Language-Image Pre-training)](/wiki/clip) style learning at scale, replacing Adafactor with Lion delivers a 2.6% gain over the baseline to reach 88.3% zero-shot accuracy on ImageNet. After fine-tuning the CoAtNet-7 image encoder, Lion further reaches 91.1% top-1 ImageNet accuracy, which the paper notes is 0.1% better than the previous state of the art.[^1][^4]

### Diffusion models

For [Diffusion model](/wiki/diffusion_model) training, the paper reports that "Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x."[^1] These results contributed to community interest in using Lion for [Stable Diffusion](/wiki/stable_diffusion) and related text-to-image fine-tuning workflows.[^5]

### Language modeling

On autoregressive and masked language modeling, including a 7.5-billion-parameter dense language model, Lion is reported to match or exceed AdamW (and Adafactor) on validation perplexity.[^1][^2] On the Wiki-40B and PG-19 corpora, Lion achieves 1.6x and 1.5x speedups respectively when training a medium-size (336M) Transformer, and the PG-19 speedup rises to 2x at the large (731M) scale to reach the same validation perplexity.[^1] On language model fine-tuning, results are reported as comparable or slightly improved over AdamW.[^1]

### Runtime and memory

Beyond accuracy and FID, the authors note that Lion delivers a 2 to 15 percent step-time speedup over AdamW and Adafactor at the same batch size and model size, because the sign-based update is simpler than the per-coordinate adaptive computation of Adam.[^2] Combined with the halved optimizer-state memory, this can either reduce wall-clock training time at fixed hardware or unlock larger effective batch sizes.[^1][^4]

## Which implementations and variants exist?

### Official Google implementation

The original Google reference implementation lives in the `google/automl/lion` directory and contains both a PyTorch implementation (`lion_pytorch.py`) and a JAX-based optimizer module compatible with Flax models.[^2] The code is released under the Apache 2.0 license. The official README documents recommended hyperparameter ranges, model-specific settings used in the paper, and known limitations. The repository was archived in May 2026 and remains accessible read-only.[^2]

### lucidrains/lion-pytorch

The `lion-pytorch` package, maintained by Phil Wang (lucidrains), is the most widely used third-party PyTorch implementation. It mirrors the official update rule and exposes the same hyperparameters, with an optional `use_triton=True` flag that selects a fused Triton GPU kernel for higher throughput. The package is distributed on PyPI under MIT licensing and has accumulated thousands of GitHub stars since its release in early 2023.[^5]

### bitsandbytes Lion and Lion8bit

The bitsandbytes library, often paired with [HuggingFace PEFT](/wiki/huggingface_peft) for memory-efficient fine-tuning, includes Lion as a first-class optimizer. The classes `bitsandbytes.optim.Lion`, `Lion8bit`, `Lion32bit`, `PagedLion`, `PagedLion8bit`, and `PagedLion32bit` provide variants ranging from a 32-bit reference implementation to 8-bit quantized momentum buffers and paged variants that offload optimizer state to host memory. The default hyperparameters match the original paper (`lr=1e-4`, `betas=(0.9, 0.99)`, `weight_decay=0`), and 8-bit variants apply block-wise quantization plus percentile clipping to manage outlier values in the momentum tensor.[^6]
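
The idea behind block-wise quantization of the momentum buffer can be illustrated with a simplified sketch. This is not the bitsandbytes implementation: that library uses its own quantization maps and much larger blocks, whereas the code below assumes plain linear absmax scaling to signed 8-bit codes and a block size of 4 purely for readability.

```python
def quantize_blockwise(values, block_size=4):
    """Per-block absmax quantization to signed 8-bit integer codes."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales

def dequantize_blockwise(codes, scales, block_size=4):
    """Map integer codes back to floats using each block's scale."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]

# Second block contains an outlier (5.0); the first block is unaffected by it.
momentum = [0.001, -0.002, 0.0015, 0.0005, 5.0, -0.25, 0.1, 0.0]
codes, scales = quantize_blockwise(momentum)
restored = dequantize_blockwise(codes, scales)
max_error = max(abs(a - b) for a, b in zip(momentum, restored))
print(codes)
print(scales, max_error)
```

Because each block carries its own scale, the outlier only coarsens the resolution of the block it sits in; the small values in the first block are still recovered to within `0.002 / 254`.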

### timm and fastxtend integrations

Lion is implemented in `timm/optim/lion.py` within the timm (PyTorch image models) library, making it a built-in option for training [Image Classification Models](/wiki/image_classification_models) and ViTs in that ecosystem.[^9] The fastxtend project, which extends fastai with additional optimizers, also provides a Lion implementation along with documentation on hyperparameter migration from Adam.[^10]

### Community JAX and TensorFlow ports

Beyond the official JAX/Flax reference inside `google/automl`, community ports include a TensorFlow implementation distributed via the `Lion-tensorflow` GitHub repository, and various ports into other JAX optimizer libraries.[^11] These ports faithfully reproduce the published update rule and use the same default hyperparameters.

## How has Lion been adopted in practice?

In the months following its release, Lion was integrated into multiple training stacks. Practitioners on the [Transformers](/wiki/transformers) community forum reported using Lion for fine-tuning BERT-style and decoder-only language models, though early attempts surfaced API mismatches between Lion implementations and the HuggingFace Trainer; these were resolved via the bitsandbytes integration.[^6][^12]

Beyond research code, Lion saw adoption in [Stable Diffusion](/wiki/stable_diffusion) fine-tuning communities, where memory savings translate directly into the ability to train at higher resolutions on consumer GPUs. The lucidrains `lion-pytorch` README enumerates community-reported successes on text-to-image fine-tuning, language model pre-training at moderate scale, and ViT classification, alongside negative results in [Recurrent Neural Network](/wiki/recurrent_neural_network)-based, [Feedforward Neural Network (FFN)](/wiki/feedforward_neural_network_ffn)-only, and certain hybrid architectures.[^5]

A 2025 study by Caglar et al. evaluated Lion against AdamW for fine-tuning cross-encoder rerankers built on MiniLM, GTE, and ModernBERT, and reported GPU utilization efficiency gains of 2.67% to 10.33% while maintaining competitive retrieval performance, illustrating Lion's continued relevance for production-leaning workloads.[^13] A 2025 budget LLM pre-training comparison of AdamW, Lion, and Sophia found Lion to be the fastest in training GPU-hours while AdamW delivered the best downstream evaluation scores, with Sophia exhibiting the lowest training and validation losses, suggesting that the right choice remains task and budget dependent.[^14]

## What does theory say about why Lion works?

Because Lion was discovered by search rather than derived from optimization theory, its convergence behavior was initially poorly characterized. Subsequent work has begun to close this gap.

- **Lyapunov-function analysis.** Chen et al. (2024) showed that Lion can be interpreted as solving a constrained optimization problem with an associated Lyapunov function, providing a principled motivation for the sign operator and weight decay coupling. This paper, published at ICLR 2024, argues that "Lion secretly solves constrained optimization," with the constraint being the L-infinity norm of parameters bounded by a constant determined by the weight decay.[^15]
- **Convergence rate analysis.** Dong, Li, and Lin (2024) provided convergence rate analyses of Lion in both constrained and unconstrained settings, identifying conditions on the learning rate, momentum coefficients, and gradient noise under which Lion converges to a stationary point of a smooth nonconvex objective.[^7]
- **Centralized and distributed convergence.** A 2025 paper (arXiv 2508.12327) extended convergence analyses to distributed settings, showing that Lion retains its convergence properties under standard assumptions on bounded gradients and stochastic noise, and characterized its behavior across worker counts.[^8]
- **Comparisons with sign SGD.** Several follow-up works connect Lion to sign-SGD with momentum and other sign-based methods, observing that the elementwise sign operator is a unifying primitive but that Lion's choice of independent `beta1` and `beta2` coefficients yields meaningfully different dynamics from prior sign-momentum methods.[^4][^7]

These analyses do not yet fully explain the empirical observations, particularly the preference for larger batch sizes that the authors report, but they have removed the original objection that Lion was a purely heuristic discovery with no convergence guarantees.[^7][^15]
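
The constrained-optimization view can be illustrated numerically on a single coordinate. The sketch below takes the bound to be `1 / lambda`, which is the fixed point of the update `p = p * (1 - lr * lambda) - lr * s` when the sign `s` is held constant; the learning rate, weight decay, and starting points are arbitrary illustrative values. If `|p| <= 1 / lambda`, then `|p| * (1 - lr * lambda) + lr <= 1 / lambda`, so no sequence of signs can push the parameter out of the box.

```python
import random

def lion_param_step(p, direction, lr, weight_decay):
    """Scalar Lion parameter step for a sign direction in {-1, 0, +1}."""
    return p * (1 - lr * weight_decay) - lr * direction

lr, weight_decay = 0.01, 0.5
bound = 1.0 / weight_decay  # L-infinity radius implied by the weight decay
rng = random.Random(0)

# Start inside the box and apply arbitrary sign directions.
p = 1.5
inside = []
for _ in range(5000):
    p = lion_param_step(p, rng.choice((-1, 0, 1)), lr, weight_decay)
    inside.append(abs(p) <= bound + 1e-12)

# Start outside the box with the direction that always pushes outward.
q = 10.0
for _ in range(5000):
    q = lion_param_step(q, -1, lr, weight_decay)

print(all(inside), q)
```

The first loop never leaves the box, and the second decays onto its boundary even though every signed step points away from the origin, which matches the picture of weight decay acting as an implicit norm constraint.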

## What are Lion's limitations?

The Lion paper and subsequent empirical reports document several caveats.

- **Batch size sensitivity.** Lion is reported to perform best with a larger [Batch Size](/wiki/batch_size). The official README states that Lion performs robustly from batch size 64 to 32K but generally prefers larger batches, with the optimal batch size for Lion in the ImageNet ViT experiments being 4096 versus 256 for AdamW. At a 32K batch size, Lion achieved a 2.5% accuracy gain over AdamW (77.9% versus 75.4%).[^2] In small-batch regimes, Lion can underperform AdamW.[^5]
- **Hyperparameter retuning is mandatory.** Reusing AdamW defaults with Lion (without scaling down the learning rate and scaling up the weight decay) yields poor results. Practitioners migrating models to Lion must explicitly retune these knobs, often with a 3 to 10 times learning-rate reduction and a proportional weight-decay increase.[^2][^5]
- **Data and augmentation sensitivity.** Lion's effectiveness varies with the amount of training data and the strength of [Regularization](/wiki/regularization) applied through augmentation. The paper notes that improvements are small or not statistically significant in certain settings, particularly with weak augmentation or strong overfitting regimes.[^1][^5]
- **Architecture dependence.** Community reports collected on the lucidrains `lion-pytorch` discussion thread indicate that Lion can underperform on reinforcement learning (RL) workloads, on purely feedforward networks, and on architectures that mix LSTM and convolution layers. Most of the strong reported results are on [Transformer](/wiki/transformer) and [Diffusion model](/wiki/diffusion_model) architectures; the search proxy itself consisted chiefly of ViT image classification, so other architecture families are less well covered.[^5]
- **Limited theoretical foundation.** Although recent work has provided Lyapunov and convergence rate analyses, Lion's original derivation was empirical. This creates challenges for adoption in safety-critical or regulated domains where provable guarantees are required.[^7][^8][^15]
- **Mixed reproducibility on LLMs at scale.** A 2025 budget LLM pre-training comparison found that AdamW remained the strongest by downstream evaluation despite Lion being fastest in wall-clock time, suggesting that some of the original speedup claims on large language models do not translate cleanly to all setups and that careful protocol matching is essential.[^14]

## How does Lion differ from Adam and other optimizers?

| Optimizer | State per parameter | Update direction | Magnitude scaling | First introduced |
| --- | --- | --- | --- | --- |
| SGD with momentum | 1 (momentum) | momentum buffer | learning rate (uniform) | classical |
| Sign SGD | 0 | sign of gradient | learning rate (uniform) | classical |
| RMSProp | 1 (second moment) | scaled gradient | per-coordinate (root mean square) | Hinton, 2012 |
| Adam / AdamW | 2 (m, v) | scaled first moment | per-coordinate (Adam denominator) | Kingma & Ba 2014; Loshchilov & Hutter 2017 |
| Lion | 1 (momentum) | sign of interpolated momentum | learning rate (uniform) | Chen et al. 2023[^1] |

Lion shares its uniform-magnitude update with sign-based variants of [Stochastic Gradient Descent (SGD)](/wiki/stochastic_gradient_descent_sgd) such as sign SGD, not with per-coordinate adaptive methods such as [RMSProp](/wiki/rmsprop) or [Adam](/wiki/adam_optimizer). It differs from sign SGD by taking the sign of an interpolation between momentum and gradient rather than of the raw gradient, and by maintaining a momentum buffer updated with a separate coefficient `beta2`. Compared to [AdamW](/wiki/adamw), Lion halves the optimizer state while keeping AdamW's decoupled weight decay.[^1][^2]

Second-order alternatives such as Sophia, which use diagonal Hessian estimates, target similar gains on large language models but with a different mechanism. In some benchmarks Sophia achieves 2x speedups over AdamW and Lion on LLM pre-training, while in others, particularly at small to medium scales, simpler sign-based momentum optimizers match or exceed Sophia, suggesting that the right answer depends on the regime.[^14]

## Why is Lion significant?

Lion's significance lies along three axes. First, as an algorithm, it is a memory-efficient, easy-to-implement optimizer that improves training in a wide range of modern [Deep Learning](/wiki/deep_learning) settings, particularly with [Transformer](/wiki/transformer) and [Diffusion model](/wiki/diffusion_model) architectures. Its halved optimizer-state footprint matters most for large-scale pre-training, and its quantized variants fit into memory-constrained fine-tuning workflows built on [LoRA (Low-Rank Adaptation)](/wiki/lora), [QLoRA](/wiki/qlora), and the [HuggingFace PEFT](/wiki/huggingface_peft) ecosystem.[^1][^6]

Second, as a demonstration of automated algorithm discovery, the Lion paper is one of the clearest existence proofs that program search can produce useful, deployed components of the deep learning stack. Earlier work demonstrated learned activations, learned architectures via [Neural architecture search](/wiki/neural_architecture_search), and learned loss functions, but the discovery of a widely adopted optimizer pushes [AutoML](/wiki/automl) into a region of the design space (gradient-based training algorithms) that had been considered particularly mature and resistant to automation.[^1][^4]

Third, Lion has spurred a theoretical research thread around sign-based optimizers, constrained optimization perspectives via Lyapunov analysis, and convergence rates for adaptive sign methods. This has produced new vocabulary and tools that apply beyond Lion itself, including a clearer connection between sign-based updates and implicit constraints on parameter norms.[^7][^15]

## ELI5: what is Lion in simple terms?

Imagine you are walking downhill in fog and want to reach the lowest point. [Gradient descent](/wiki/gradient_descent) reads the slope under your feet and steps in the steepest downhill direction. Adam is a careful hiker who remembers both which way it has been going (momentum) and how bumpy each direction has been, then takes bigger steps in calm directions and smaller ones in jittery directions. That bookkeeping takes two notebooks. Lion keeps only one notebook (the running direction it has been heading) and, instead of measuring how steep each direction is, it just decides plus or minus for every direction and takes one fixed-size step there. Carrying one notebook instead of two is why Lion uses about half the extra memory, and the fixed-size step is why you have to use a smaller learning rate so the steps do not get too big.[^1][^2]

## Related work

- [Adam optimizer](/wiki/adam_optimizer) - the dominant adaptive optimizer prior to Lion's release.
- [AdamW](/wiki/adamw) - decoupled weight decay variant of Adam, the standard baseline against which Lion is most often compared.
- [Stochastic Gradient Descent (SGD)](/wiki/stochastic_gradient_descent_sgd) - the foundational first-order optimizer.
- [RMSProp](/wiki/rmsprop) - earlier adaptive optimizer based on second-moment estimates.
- [Gradient Descent](/wiki/gradient_descent) - the broader algorithmic family Lion belongs to.
- [Momentum](/wiki/momentum) - the conceptual mechanism Lion modifies via the sign operator.
- [Weight Decay](/wiki/weight_decay) - regularization technique whose coupling with Lion's update differs from AdamW.
- [Learning Rate](/wiki/learning_rate) - the hyperparameter most sensitive to Lion's sign-based magnitude scaling.
- [Batch Size](/wiki/batch_size) - a hyperparameter Lion is particularly sensitive to.
- [Hyperparameter](/wiki/hyperparameter) - general concept; Lion exposes only `lr`, `beta1`, `beta2`, and `weight_decay`.
- [AutoML (Automated Machine Learning)](/wiki/automl) - the broader area Lion advances.
- [Neural architecture search](/wiki/neural_architecture_search) - the closest sibling area of automated discovery.
- [Google Brain](/wiki/google_brain) - the laboratory (alongside DeepMind and academic collaborators) that produced Lion.
- [NeurIPS](/wiki/neurips) - venue at which the Lion paper was published.

## See also

- [AdamW](/wiki/adamw)
- [Adam optimizer](/wiki/adam_optimizer)
- [RMSProp](/wiki/rmsprop)
- [Stochastic Gradient Descent (SGD)](/wiki/stochastic_gradient_descent_sgd)
- [Momentum](/wiki/momentum)
- [Gradient Descent](/wiki/gradient_descent)
- [Weight Decay](/wiki/weight_decay)
- [Learning Rate](/wiki/learning_rate)
- [Batch Size](/wiki/batch_size)
- [AutoML (Automated Machine Learning)](/wiki/automl)
- [Neural architecture search](/wiki/neural_architecture_search)
- [Vision Transformer (ViT)](/wiki/vision_transformer_vit)
- [Diffusion model](/wiki/diffusion_model)
- [Stable Diffusion](/wiki/stable_diffusion)
- [LoRA (Low-Rank Adaptation)](/wiki/lora)
- [QLoRA](/wiki/qlora)
- [HuggingFace PEFT](/wiki/huggingface_peft)
- [PyTorch](/wiki/pytorch)
- [JAX](/wiki/jax)
- [Google Brain](/wiki/google_brain)
- [NeurIPS](/wiki/neurips)
- [International Conference on Learning Representations](/wiki/iclr)

## References

[^1]: Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le, "Symbolic Discovery of Optimization Algorithms", arXiv preprint 2302.06675, 2023-02-13 (v1; v4 dated 2023-05-08). https://arxiv.org/abs/2302.06675. Accessed 2026-06-25.
[^2]: Google AutoML team, "Lion: EvoLved Sign Momentum (official implementation README)", GitHub google/automl repository, 2023. https://github.com/google/automl/blob/master/lion/README.md. Accessed 2026-06-25.
[^3]: NeurIPS 2023 Program Committee, "Symbolic Discovery of Optimization Algorithms (NeurIPS 2023 poster)", OpenReview, 2023-09-21. https://openreview.net/forum?id=ne6zeqLFCZ. Accessed 2026-06-25.
[^4]: Emergent Mind editors, "Symbolic Discovery of Optimization Algorithms (paper summary)", Emergent Mind, 2023. https://www.emergentmind.com/papers/2302.06675. Accessed 2026-06-25.
[^5]: Phil Wang (lucidrains), "lion-pytorch: implementation of Lion (EvoLved Sign Momentum) in PyTorch", GitHub lucidrains/lion-pytorch repository, 2023. https://github.com/lucidrains/lion-pytorch. Accessed 2026-06-25.
[^6]: HuggingFace and bitsandbytes contributors, "Lion optimizer reference (bitsandbytes.optim.Lion, Lion8bit, PagedLion)", HuggingFace bitsandbytes documentation, 2024. https://huggingface.co/docs/bitsandbytes/en/reference/optim/lion. Accessed 2026-06-25.
[^7]: Yiming Dong, Huan Li, Zhouchen Lin, "Convergence Rate Analysis of LION", arXiv preprint 2411.07724, 2024-11-12. https://arxiv.org/abs/2411.07724. Accessed 2026-06-25.
[^8]: "Convergence Analysis of the Lion Optimizer in Centralized and Distributed Settings", arXiv preprint 2508.12327, 2025. https://arxiv.org/abs/2508.12327. Accessed 2026-06-25.
[^9]: Ross Wightman and timm contributors, "pytorch-image-models: timm/optim/lion.py", GitHub huggingface/pytorch-image-models repository, 2023. https://github.com/huggingface/pytorch-image-models/blob/main/timm/optim/lion.py. Accessed 2026-06-25.
[^10]: Benjamin Warner, "Lion: EvoLved Sign Momentum Optimizer (fastxtend documentation)", fastxtend, 2023. https://fastxtend.benjaminwarner.dev/optimizer.lion.html. Accessed 2026-06-25.
[^11]: G. Lambard, "Lion-tensorflow: implementation of the Lion optimizer in TensorFlow", GitHub GLambard/Lion-tensorflow repository, 2023. https://github.com/GLambard/Lion-tensorflow. Accessed 2026-06-25.
[^12]: HuggingFace community, "How to use LION optimizer? (Transformers forum thread)", HuggingFace Discuss, 2023. https://discuss.huggingface.co/t/how-to-using-lion-optimizer/42270. Accessed 2026-06-25.
[^13]: "Comparative Analysis of Lion and AdamW Optimizers for Cross-Encoder Reranking with MiniLM, GTE, and ModernBERT", arXiv preprint 2506.18297, 2025. https://arxiv.org/abs/2506.18297. Accessed 2026-06-25.
[^14]: "Pre-Training LLMs on a budget: A comparison of three optimizers", arXiv preprint 2507.08472, 2025. https://arxiv.org/abs/2507.08472. Accessed 2026-06-25.
[^15]: Lizhang Chen, Bo Liu, Kaizhao Liang, Qiang Liu, "Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts", ICLR 2024 conference paper, 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/986e0caad271b59417287737416d8594-Paper-Conference.pdf. Accessed 2026-06-25.

