# Mini-batch stochastic gradient descent

> Source: https://aiwiki.ai/wiki/mini-batch_stochastic_gradient_descent
> Updated: 2026-06-24
> Categories: Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Mini-batch stochastic gradient descent** (often shortened to **mini-batch SGD** or **MB-SGD**) is the optimization algorithm used to train almost every modern neural network: it updates a model's parameters by estimating the gradient of a [loss function](/wiki/loss_function) on a small random subset of the training data, called a [mini-batch](/wiki/mini-batch), and then taking a step opposite to that gradient. Practical batch sizes range from 1 to several million examples, with 32 a long-standing default for small models and millions of tokens common for [LLM](/wiki/llm) pretraining [4][9][10]. The mini-batch gradient is an unbiased estimate of the true gradient whose variance scales as 1/B, where B is the batch size, so the method trades a little accuracy per step for a large gain in steps per second on parallel hardware.

The method sits between two extremes. Full-batch [gradient descent](/wiki/gradient_descent) computes an exact gradient over the entire dataset before each step, which is expensive and requires the whole dataset to fit in memory. Pure [SGD](/wiki/stochastic_gradient_descent), in the strict sense of using a single example per step, gives very noisy updates that bounce around the loss surface. Mini-batch SGD picks a batch size B somewhere between 1 and the dataset size N, averaging gradients over B examples per step. This middle ground is what makes the method practical: it produces gradient estimates with manageable variance, makes good use of vectorized hardware like GPUs and TPUs, and converges much faster in wall-clock time than either alternative. Almost every neural network trained today, from a simple convolutional classifier to a frontier [LLM](/wiki/llm) with hundreds of billions of parameters, is fit with some flavor of mini-batch SGD or one of its adaptive variants such as [Adam](/wiki/adam_optimizer) or [AdamW](/wiki/adamw).

## When was stochastic gradient descent invented?

The statistical foundations of stochastic optimization predate machine learning by decades. In 1951 Herbert Robbins and Sutton Monro published "A Stochastic Approximation Method" in the *Annals of Mathematical Statistics*, introducing what is now called the Robbins-Monro algorithm for finding the root of a function known only through noisy measurements [1]. Their convergence conditions require that the step sizes sum to infinity while their squares sum to a finite value (Σ ηₜ = ∞ and Σ ηₜ² < ∞), and they are still cited today as classical sufficient conditions for SGD to converge.

The ideas filtered into pattern recognition through the perceptron rule (Rosenblatt 1958) and the LMS algorithm (Widrow and Hoff 1960), both of which are early examples of stochastic gradient methods. The connection to neural network training was made explicit once [backpropagation](/wiki/backpropagation) was popularized in the 1980s. The mini-batch variant became the standard recipe in deep learning during the 2000s and 2010s, when GPUs made it efficient to compute gradients on dozens or hundreds of examples in parallel using matrix-matrix multiplications instead of slower matrix-vector operations.

## How does the mini-batch SGD algorithm work?

Given a model with parameters θ, a per-example loss function ℓ, and a training set of N examples, mini-batch SGD repeats the following loop:

1. Shuffle the training data at the start of each [epoch](/wiki/epoch).
2. Partition it into mini-batches of size B.
3. For each mini-batch (x, y):
   a. Compute the average gradient g = (1/B) Σᵢ ∇θ ℓ(fθ(xᵢ), yᵢ) using backpropagation.
   b. Update the parameters: θ ← θ − η g, where η is the [learning rate](/wiki/learning_rate).
4. Stop after a fixed number of epochs, when the validation loss stops improving, or when some other criterion is met.

In each epoch the algorithm processes every training example exactly once, distributed across N/B mini-batch updates. A typical training run lasts anywhere from a single epoch (common for very large language model pretraining) to hundreds of epochs (common for vision tasks).
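The loop above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the one-dimensional linear model, the squared-error loss, the function name `minibatch_sgd`, and all hyperparameter values are assumptions chosen only to keep the example small.

```python
import random

def minibatch_sgd(xs, ys, batch_size=4, lr=0.05, epochs=200, seed=0):
    """Fit y = w*x + b by mini-batch SGD on mean squared error (toy sketch)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                               # step 1: shuffle each epoch
        for start in range(0, len(indices), batch_size):   # step 2: partition into batches
            batch = indices[start:start + batch_size]
            # step 3a: average gradient of (w*x + b - y)^2 over the mini-batch
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # step 3b: move against the gradient
            w -= lr * gw
            b -= lr * gb
    return w, b                                            # step 4: fixed epoch budget

# Toy data generated from y = 3x + 1 without noise.
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

With 20 examples and a batch size of 4, each epoch performs 20 / 4 = 5 updates, matching the N/B count in the text.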

The gradient computed on a mini-batch is an unbiased estimate of the true gradient over the data distribution, with variance that scales as 1/B. Doubling the [batch size](/wiki/batch_size) halves the variance of the gradient estimate but also doubles the compute per step, so there is a tradeoff between the quality of each step and the number of steps you can afford.
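The 1/B scaling can be checked empirically. The sketch below treats a list of random numbers as stand-in per-example gradients (an assumption made for illustration) and samples batches with replacement, the setting in which the variance of the batch mean is exactly the per-example variance divided by B.

```python
import random
import statistics

def batch_mean_variance(per_example_grads, batch_size, trials=20000, seed=0):
    """Variance of the mini-batch mean, estimated over many random batches."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(per_example_grads, k=batch_size))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)

# Hypothetical scalar "gradients" for 1000 training examples.
data_rng = random.Random(1)
grads = [data_rng.gauss(0.5, 2.0) for _ in range(1000)]

var_b8 = batch_mean_variance(grads, batch_size=8)
var_b16 = batch_mean_variance(grads, batch_size=16)
ratio = var_b8 / var_b16   # doubling B should roughly halve the variance
```

Sampling without replacement, as in an epoch-based shuffle, reduces the variance slightly further when B is a sizable fraction of N.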

## How does it differ from batch and pure SGD?

Algorithms in the gradient descent family are usually grouped by how much data they touch per update.

| regime | batch size | gradient quality | steps per epoch | typical use |
|---|---|---|---|---|
| Full-batch gradient descent | B = N | exact | 1 | small problems, convex optimization, theoretical analysis |
| Mini-batch SGD | 1 < B << N | unbiased estimate, moderate noise | N / B | the standard for deep learning |
| Stochastic gradient descent (strict sense) | B = 1 | unbiased but very noisy | N | online learning, streaming data |

In practice the term "SGD" is used loosely. When a deep learning paper says it trains a model "with SGD," it almost always means mini-batch SGD with some batch size B greater than 1, anywhere from a handful of examples to several million.

## Why are mini-batches used?

Three reasons explain why the mini-batch regime dominates.

First, hardware. GPUs and TPUs are designed for dense linear algebra. A forward and backward pass over a batch of 256 images is not 256 times slower than a single image; it is often only 5 to 10 times slower, because the matrix multiplications inside the network keep the accelerator's compute units busy. Larger batches amortize the fixed overhead of kernel launches, memory transfers, and pipeline bubbles. Yoshua Bengio's widely cited 2012 guide made the point directly, recommending that "B = 32 is a good default value, with values above 10 taking advantage of the speedup of matrix-matrix products over matrix-vector products" [4].

Second, variance reduction. The variance of the mini-batch gradient is the per-example gradient variance divided by B. Smaller batches give noisier updates, which can help the optimizer escape saddle points and shallow minima but make convergence less stable. Larger batches give cleaner updates but, beyond a certain point, the extra noise reduction stops helping.

Third, generalization. There is a long-running observation, formalized by Nitish Keskar and colleagues in 2017, that small-batch training tends to find flatter minima of the loss surface that generalize better to held-out data. Their paper "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" gave numerical evidence that large batches converge to sharp minima, while small batches converge to flat ones; the authors attribute the gap to the fact that "large-batch methods tend to converge to sharp minimizers of the training and testing functions," whereas the noisier small-batch gradients push the iterates toward flatter regions [8]. The picture is not the whole story (later work showed the gap can often be closed with the right learning rate schedule), but the implicit regularization effect of mini-batch noise is real and is part of why neural networks generalize as well as they do.

## What variants and improvements exist?

The basic update rule θ ← θ − ηg has been extended in many ways. The most influential variants are summarized below.

| optimizer | year | author | core idea | typical use |
|---|---|---|---|---|
| Vanilla SGD | classical | Robbins & Monro (1951) | θ ← θ − ηg | baseline; image classification with momentum |
| Heavy ball [momentum](/wiki/momentum) | 1964 | Polyak | v ← μv + g; θ ← θ − ηv | computer vision, ResNets |
| Nesterov accelerated gradient | 1983 | Nesterov | look-ahead momentum with provably better convex rate | convex problems, some CV models |
| [AdaGrad](/wiki/adagrad) | 2011 | Duchi, Hazan, Singer | per-parameter learning rate scaled by 1/√(Σ g²) | sparse features, NLP |
| [RMSProp](/wiki/rmsprop) | 2012 | Hinton (Coursera lecture) | exponentially decaying average of g² | RNNs, early deep learning |
| Adam | 2015 | Kingma & Ba | combines momentum and RMSProp with bias correction | the de facto default for most tasks |
| AdamW | 2019 | Loshchilov & Hutter | Adam with decoupled weight decay | LLM and large-model training |
| Adafactor | 2018 | Shazeer & Stern | factorizes Adam's second moment to save memory | T5, PaLM, very large models |
| LARS | 2017 | You, Gitman, Ginsburg | layer-wise learning rate for large-batch CNN training | ResNet at large batch |
| LAMB | 2019 | You et al. | layer-wise variant of Adam for large batches | BERT pretraining in 76 minutes |
| Lion | 2023 | Chen et al. | sign-of-momentum updates discovered by symbolic search | competitive with AdamW, less memory |

Momentum, introduced by Boris Polyak in his 1964 paper "Some methods of speeding up the convergence of iteration methods," maintains a velocity vector v that accumulates past gradients with decay coefficient μ (typically 0.9) [2]. The update becomes vₜ = μ vₜ₋₁ + gₜ and θₜ = θₜ₋₁ − η vₜ. This damps oscillation across narrow valleys and accelerates progress along consistent gradient directions.
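The two momentum equations translate directly into code. The sketch below applies them to an assumed toy objective, f(θ) = 0.5 (θ₀² + 25 θ₁²), whose curvature differs by a factor of 25 between the two coordinates; the function name and step sizes are illustrative choices.

```python
def momentum_step(theta, velocity, grad, lr=0.02, mu=0.9):
    """Heavy-ball update: v <- mu*v + g, then theta <- theta - lr*v."""
    velocity = [mu * v + g for v, g in zip(velocity, grad)]
    theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta, velocity

# Narrow-valley quadratic: gradient is (theta0, 25 * theta1).
theta, velocity = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    grad = [theta[0], 25.0 * theta[1]]
    theta, velocity = momentum_step(theta, velocity, grad)
```

Both coordinates shrink toward zero at a similar rate even though their curvatures differ widely, which is the damping-and-acceleration effect described above.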

Adam, proposed by Diederik Kingma and Jimmy Ba at ICLR 2015, keeps an exponential moving average of both the gradient (first moment, like momentum) and the squared gradient (second moment, like RMSProp), then divides one by the square root of the other to get a per-parameter adaptive step size [6]. It is the most widely used optimizer in deep learning practice and is the dominant optimizer for training large language models such as GPT, OPT, and Llama. Its successor AdamW, from a 2019 ICLR paper by Ilya Loshchilov and Frank Hutter, fixes a subtle bug: in standard Adam, applying L2 [regularization](/wiki/regularization) by adding λθ to the gradient does not behave like true weight decay because the adaptive denominator scales the regularization term too [16]. AdamW decouples weight decay from the gradient update, applying θ ← (1 − ηλ) θ directly. The change is small in code but materially improves generalization, which is why AdamW has become the default for large-language-model pretraining.
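A single AdamW step on a list of scalars can be sketched as follows. The function name is hypothetical and the default hyperparameters are common illustrative values, not a recommendation; real frameworks apply the same arithmetic to tensors.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first moment (momentum-like)
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second moment (RMSProp-like)
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for zero init
        v_hat = v_i / (1 - beta2 ** t)
        p = p * (1 - lr * weight_decay)            # decoupled weight decay
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_theta.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# The first step has nearly the same size whether the gradient is 2 or 200.
small, _, _ = adamw_step([1.0], [2.0], [0.0], [0.0], t=1)
large, _, _ = adamw_step([1.0], [200.0], [0.0], [0.0], t=1)
```

Because the weight decay multiplies the parameter directly instead of being added to `g`, it is not rescaled by the adaptive denominator, which is the decoupling the paragraph describes.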

Lion, introduced by Xiangning Chen and colleagues at Google Brain in their 2023 paper "Symbolic Discovery of Optimization Algorithms," was discovered by an evolutionary program search rather than designed by hand [18]. Its update uses only the sign of a momentum-smoothed gradient, which keeps memory usage low (no second moment to store) and gives every parameter the same update magnitude. Reported gains include training compute reductions of up to 2.3x on diffusion models and competitive results on language models, though it requires roughly an order of magnitude smaller learning rate than AdamW.
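The sign-based update can be sketched as below. The exact form (two interpolation coefficients `beta1` and `beta2`, with weight decay added to the sign term) and the default values are assumptions that follow the commonly cited description of the algorithm, so treat this as an illustration rather than the authors' reference code.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion-style update on lists of floats (illustrative sketch)."""
    new_theta, new_m = [], []
    for p, g, m_i in zip(theta, grad, m):
        c = beta1 * m_i + (1 - beta1) * g           # momentum-smoothed gradient
        p = p - lr * (sign(c) + weight_decay * p)   # only the sign sets the step
        m_i = beta2 * m_i + (1 - beta2) * g         # single state tensor per parameter
        new_theta.append(p)
        new_m.append(m_i)
    return new_theta, new_m

# Gradients of very different magnitude move each parameter by exactly lr.
theta, m = lion_step([1.0, -2.0], [0.003, -50.0], [0.0, 0.0])
```

Only `m` is carried between steps, which is where the memory saving relative to Adam's two buffers comes from.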

## What learning rate schedule should you use?

The learning rate η is the single most important hyperparameter in mini-batch SGD. Most modern training runs vary it over time according to a schedule.

| schedule | shape | typical use |
|---|---|---|
| Constant | flat | small experiments, debugging |
| Step decay | drop by factor (e.g. 10x) at fixed epochs | classical CNN training |
| Exponential decay | ηₜ = η₀ · γᵗ | older recipes |
| Cosine annealing | half-cosine from η₀ to η_min | modern CV, LLM pretraining |
| Linear warmup + cosine | ramp up over first k steps, then cosine decay | the standard LLM recipe |
| One-cycle | ramp up to a peak, anneal back down, then decay well below the starting rate | super-convergence (Smith 2018) [11] |
| Inverse square root | ηₜ = η₀ / √t after a linear warmup | original Transformer paper |

Cosine annealing comes from "SGDR: Stochastic Gradient Descent with Warm Restarts" by Loshchilov and Hutter (ICLR 2017) [7]. The schedule decreases the learning rate from η_max to η_min following the curve ηₜ = η_min + 0.5 (η_max − η_min) (1 + cos(π T_cur / T_i)), with optional warm restarts that snap the rate back to its peak value. Combined with a short linear warmup, this is the schedule used by GPT-3 and most subsequent large-scale language models [17].

Linear warmup is important when training starts from a random initialization. A high learning rate applied to noisy early gradients can blow up the optimization. Warming up over a few hundred to a few thousand steps lets the gradient statistics stabilize before the optimizer takes large steps.
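A linear warmup followed by a single cosine decay (no warm restarts) can be written as a small function of the step index. The function name and the convention that warmup reaches the peak on its last step are assumptions made for this sketch.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Learning rate at a 0-based step: linear warmup, then half-cosine decay."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps          # linear ramp up to lr_max
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Example: 100 warmup steps, 1000 steps in total, decay to 10% of the peak.
schedule = [warmup_cosine_lr(s, 1000, 100, lr_max=3e-4, lr_min=3e-5)
            for s in range(1001)]
```

Here `progress` plays the role of T_cur / T_i in the SGDR formula, restricted to a single cycle.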

## What is the linear scaling rule?

Learning rate and batch size are coupled. If you change one, you usually need to change the other.

The most-cited rule of thumb is the **linear scaling rule** from "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" by Priya Goyal and colleagues at Facebook AI Research (2017), stated plainly as: "When the minibatch size is multiplied by k, multiply the learning rate by k" [9]. The intuition is that a k-times-larger batch produces a gradient with roughly the same direction but lower variance, so taking a k-times-larger step is safe and keeps the total per-epoch progress comparable. Combined with a gradual warmup over the first few epochs, this rule allowed the team to train ResNet-50 on ImageNet to 76.3% top-1 accuracy in one hour using a batch of 8,192 images on 256 GPUs, with no loss of accuracy versus a small-batch baseline [9].
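As arithmetic, the rule is a single multiplication. The reference point used below (learning rate 0.1 at batch size 256) is an illustrative assumption, and the helper name is hypothetical.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: multiply the learning rate by k = batch / base_batch."""
    return base_lr * batch / base_batch

# Scaling an assumed reference of lr = 0.1 at batch 256 up to batch 8,192 (k = 32).
lr_8k = scaled_lr(0.1, 256, 8192)
```

A rate this much larger than the reference is usually reached through a warmup rather than applied from the first step.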

The linear scaling rule has practical limits. Sam McCandlish and colleagues at OpenAI made these limits precise in their 2018 paper "An Empirical Model of Large-Batch Training," which introduced the **gradient noise scale** [13]. The authors found that "a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications," spanning supervised learning, reinforcement learning, and generative models [13]. The noise scale predicts the **critical batch size**, the point beyond which doubling the batch stops giving a corresponding speedup in wall-clock time. Below the critical batch size, larger batches mean fewer steps to convergence; above it, you get diminishing returns and eventually waste compute. The critical batch size grows during training as the loss decreases, and it varies enormously by task: tens of thousands for ImageNet, millions of tokens for language models, and even larger for some reinforcement learning tasks. This framework was used to plan the training of GPT-3 and remains a standard reference for deciding how much data parallelism is worth.
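One simple form of the noise scale divides the total per-example gradient variance by the squared norm of the mean gradient. The sketch below computes that ratio directly from a small list of per-example gradients; the direct per-example computation, the function name, and the toy numbers are assumptions for illustration, and practical estimators typically work from batch-level statistics instead.

```python
import statistics

def simple_noise_scale(per_example_grads):
    """Trace of the per-example gradient covariance over the squared mean-gradient norm."""
    dims = list(zip(*per_example_grads))                    # one tuple per parameter
    mean_grad = [statistics.fmean(d) for d in dims]
    trace_cov = sum(statistics.variance(d) for d in dims)   # sum of per-dimension variances
    grad_norm_sq = sum(g * g for g in mean_grad)
    return trace_cov / grad_norm_sq

# Four hypothetical per-example gradients for a two-parameter model.
grads = [[1.0, 0.0], [3.0, 0.0], [2.0, 1.0], [2.0, -1.0]]
noise_scale = simple_noise_scale(grads)
```

A large ratio means individual examples disagree strongly relative to the signal, so larger batches keep paying off; a small ratio means even small batches already give a reliable direction.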

For very large effective batches that exceed available accelerator memory, **gradient accumulation** is the standard trick. Instead of computing the full batch in one forward and backward pass, you split it into k micro-batches, accumulate the gradients across them, and only call the optimizer once per k micro-batches. The result is mathematically equivalent (modulo numerical effects) to training with a batch k times larger. This is how teams routinely simulate batch sizes in the millions of tokens on hardware that can only fit thousands per device.
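The equivalence can be checked on a toy problem. The sketch below assumes a one-parameter model y ≈ w·x with squared-error loss and equal-sized micro-batches; the function names are hypothetical.

```python
def mean_grad(w, xs, ys):
    """Gradient of the mean squared error of y = w*x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average the micro-batch gradients instead of processing the batch at once."""
    total, count = 0.0, 0
    for start in range(0, len(xs), micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        total += mean_grad(w, mx, my)      # accumulate; no parameter update yet
        count += 1
    return total / count                   # a single optimizer step would use this

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2]
full = mean_grad(0.5, xs, ys)                              # one batch of 8
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)   # 4 micro-batches of 2
```

The two values agree up to floating-point rounding. The match relies on equal-sized micro-batches and on a loss that is a plain average over examples; layers that depend on batch statistics, such as batch normalization, break the exact equivalence.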

## What batch size should you use?

There is no single best batch size; the right choice depends on the model, the hardware, and the learning rate schedule, and the literature genuinely disagrees about the ideal range. Two influential findings bracket the debate.

On the small-batch side, Dominic Masters and Carlo Luschi of Graphcore Research argued in their 2018 paper "Revisiting Small Batch Training for Deep Neural Networks" that tiny batches train best. Across CIFAR-10, CIFAR-100, and ImageNet they reported that "the best performance has been consistently obtained for mini-batch sizes between m = 2 and m = 32," and that "increasing the mini-batch size progressively reduces the range of learning rates that provide stable convergence and acceptable test performance" [19]. This echoes Bengio's earlier default of 32 [4] and Keskar's sharp-minima generalization gap [8].

On the large-batch side, the ImageNet-in-an-hour line of work (Goyal 2017 at batch 8,192 [9], You et al. at batch 32,768 for BERT [15]) showed that very large batches can match small-batch accuracy if the learning rate is scaled and warmed up correctly. The reconciling idea is the critical batch size [13]: below it, scaling up saves wall-clock time at no accuracy cost; above it, returns diminish and small-batch noise advantages reappear.

| scenario | typical batch size | notes |
|---|---|---|
| Memory-constrained fine-tuning | 1 to 8 | gradient accumulation often used |
| Vision fine-tuning, small CNNs | 32 to 256 | the classical sweet spot |
| Standard ImageNet training | 256 to 1,024 | works on a single 8-GPU node |
| Large-batch ImageNet (Goyal 2017) | 8,192 | with linear scaling and warmup |
| BERT pretraining (LAMB) | 32,768 | Yang You et al. 2019 |
| GPT-3 pretraining | ~3.2 million tokens | with linear warmup and cosine decay |
| RL agents (e.g. OpenAI Five Dota 2) | tens of millions | high noise scale environment |

## How well does mini-batch SGD converge?

Under a few standard assumptions (smooth loss, bounded gradient variance, suitable step sizes) SGD provably converges to a stationary point of the expected risk. For convex objectives the expected suboptimality after T steps decreases as O(1/√T) for a fixed step size chosen on the order of 1/√T, or O(log T / √T) for averaged iterates with a decreasing step size ηₜ ∝ 1/√t. For strongly convex objectives, Robbins-Monro-style step sizes ηₜ ∝ 1/t [1] improve the rate to O(1/T).

Deep learning loss surfaces are non-convex, and the classical theory does not directly apply. In practice SGD on overparameterized neural networks reliably finds solutions with low training loss, often even when the network can fit random labels (Zhang et al. 2017, "Understanding deep learning requires rethinking generalization") [10]. The implicit regularization of small-batch SGD, combined with explicit techniques such as weight decay, dropout, and data augmentation, makes these solutions generalize despite the network's capacity to memorize.

## How is mini-batch SGD implemented?

Every major deep learning framework ships with mini-batch SGD as a built-in optimizer.

| framework | API | notes |
|---|---|---|
| PyTorch | `torch.optim.SGD`, `torch.optim.Adam`, `torch.optim.AdamW` | momentum, weight decay, Nesterov supported as flags |
| TensorFlow / Keras | `tf.keras.optimizers.SGD`, `tf.keras.optimizers.Adam` | similar surface, also includes Adafactor and Lion |
| JAX / Optax | `optax.sgd`, `optax.adam`, `optax.adamw`, `optax.lion` | composable transformations for chaining schedules |
| Hugging Face Transformers | wraps the framework optimizer | exposes a `Trainer` with warmup and weight decay defaults |

A minimal PyTorch training loop looks like this:

```python
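import torch

# Assumes `model`, `criterion`, `dataloader`, and `num_epochs` are defined elsewhere.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
for epoch in range(num_epochs):
    for x, y in dataloader:                    # dataloader yields mini-batches
        optimizer.zero_grad()                  # clear gradients from the previous step
        loss = criterion(model(x), y)          # forward pass and mini-batch loss
        loss.backward()                        # backprop fills .grad on every parameter
        optimizer.step()                       # apply the update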
```

The `DataLoader` handles batching, parallel data loading, and (when created with `shuffle=True`) shuffling, while `loss.backward()` and `optimizer.step()` implement the gradient computation and parameter update.

## How is mini-batch SGD used to train modern LLMs?

Large-scale model training has changed what "mini-batch SGD" looks like in practice.

[LLM](/wiki/llm) pretraining today almost universally uses AdamW with a linear warmup followed by a cosine decay to roughly 10% of the peak learning rate. Effective batch sizes are measured in millions of tokens, achieved through a combination of [distributed training](/wiki/distributed_training) across many accelerators and gradient accumulation. The Chinchilla and other [scaling laws](/wiki/scaling_laws) papers have shaped how teams allocate the compute budget between model size and the number of training tokens, but the underlying optimizer remains a mini-batch method.

Mixed-precision training is now standard: a master copy of the weights is kept in 32-bit, while the forward and backward passes, and therefore the gradients, are computed in BF16 or FP16 on the accelerator. Optimizer states (the momentum and variance buffers in Adam) are typically kept in 32-bit to preserve numerical accuracy, although memory-saving variants like 8-bit Adam (Tim Dettmers et al., 2022) are common when memory is tight [20]. Adafactor and Lion go further: Adafactor factorizes the second-moment state, and Lion keeps only a single momentum tensor per parameter [12][18].

For very large models, optimizer state itself becomes a bottleneck: standard Adam stores two extra full-precision tensors per parameter, so its state takes about twice the memory of a 32-bit copy of the weights, which for models with hundreds of billions of parameters is far more than a single accelerator can hold. Sharding the optimizer state across data-parallel ranks, as in DeepSpeed ZeRO and PyTorch FSDP, has become a routine part of the training stack.
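The memory cost is easy to estimate. The sketch below assumes 32-bit (4-byte) state values, two state tensors per parameter as in Adam, and perfectly even sharding; the 175-billion-parameter model size and the rank count are illustrative numbers, and the function names are hypothetical.

```python
def adam_state_bytes(num_params, bytes_per_value=4, states_per_param=2):
    """Memory taken by Adam's momentum and variance buffers."""
    return num_params * bytes_per_value * states_per_param

def sharded_state_bytes(num_params, num_ranks):
    """Per-rank share when the state is split evenly across data-parallel ranks."""
    return adam_state_bytes(num_params) // num_ranks

total = adam_state_bytes(175_000_000_000)             # 1.4e12 bytes, about 1.4 TB
per_rank = sharded_state_bytes(175_000_000_000, 1024) # about 1.37 GB per rank
```

The estimate covers optimizer state only; weights, gradients, and activations add to the per-device footprint.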

## What are the limitations of mini-batch SGD?

Mini-batch SGD is not magic. It has well-known weak points.

It is sensitive to the learning rate. Set it too high and the loss diverges; set it too low and training stalls. Tuning the schedule, especially the peak learning rate and the warmup length, is one of the most important parts of getting a training run to work.

It requires gradients, which means it cannot be applied directly to non-differentiable objectives. Reinforcement learning, discrete optimization, and many combinatorial problems require gradient estimators, surrogate losses, or evolutionary methods to fit into the SGD framework.

It is path-dependent. Two runs with the same data and the same hyperparameters but different random seeds can land at noticeably different solutions, with different generalization properties. Reproducibility requires careful seeding of the data shuffler, parameter initialization, and any stochastic layers like dropout.

It does not give principled uncertainty estimates. The point estimate produced by SGD is at best a single mode of the posterior over parameters, and turning it into calibrated predictive uncertainty requires extra machinery such as Monte Carlo dropout, deep ensembles, or Gaussian approximations built on stochastic weight averaging (SWAG).

## Related concepts

- [Gradient descent](/wiki/gradient_descent) - the deterministic full-batch ancestor
- [SGD](/wiki/stochastic_gradient_descent) - the broader family of stochastic gradient methods
- [Mini-batch](/wiki/mini-batch) - the random subset of data used per update
- [Backpropagation](/wiki/backpropagation) - the algorithm that computes the gradient
- [Loss function](/wiki/loss_function) - the objective being minimized
- [Learning rate](/wiki/learning_rate) - the most important hyperparameter
- [Batch size](/wiki/batch_size) - controls noise and compute per step
- [Epoch](/wiki/epoch) - one full pass through the training data
- [Distributed training](/wiki/distributed_training) - how very large effective batches are achieved

## Explain like I'm 5

You are trying to find the lowest point in a hilly field while blindfolded. You can feel the slope of the ground under your feet and step downhill. If you take a step after feeling just one square inch of dirt, you will take lots of quick steps, but many of them will go the wrong way, because that one square inch might be a bump sloping in a misleading direction. If you stop and survey the entire field before each step, you will always go the right way, but you will get tired and slow. The smart thing to do is to feel the slope across a small patch of ground (a mini-batch), average it, and step that way. You move efficiently, you avoid being misled by tiny bumps, and you eventually reach the lowest spot. That is what mini-batch SGD does for a neural network.

## References

1. Robbins, H., & Monro, S. (1951). "A Stochastic Approximation Method." *The Annals of Mathematical Statistics*, 22(3), 400-407. https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-22/issue-3/A-Stochastic-Approximation-Method/10.1214/aoms/1177729586.full
2. Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods." *USSR Computational Mathematics and Mathematical Physics*, 4(5), 1-17.
3. Nesterov, Y. (1983). "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)." *Doklady AN USSR*, 269, 543-547.
4. Bengio, Y. (2012). "Practical Recommendations for Gradient-Based Training of Deep Architectures." In *Neural Networks: Tricks of the Trade*, Springer, 437-478. arXiv:1206.5533. https://arxiv.org/abs/1206.5533
5. Hinton, G. (2012). "Lecture 6e: RMSProp." Coursera, Neural Networks for Machine Learning.
6. Kingma, D. P., & Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR 2015. https://arxiv.org/abs/1412.6980
7. Loshchilov, I., & Hutter, F. (2017). "SGDR: Stochastic Gradient Descent with Warm Restarts." ICLR 2017. https://arxiv.org/abs/1608.03983
8. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." ICLR 2017. https://arxiv.org/abs/1609.04836
9. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., & He, K. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv:1706.02677. https://arxiv.org/abs/1706.02677
10. Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). "Understanding deep learning requires rethinking generalization." ICLR 2017. https://arxiv.org/abs/1611.03530
11. Smith, L. N. (2018). "A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay." arXiv:1803.09820. https://arxiv.org/abs/1803.09820
12. Shazeer, N., & Stern, M. (2018). "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost." ICML 2018. https://arxiv.org/abs/1804.04235
13. McCandlish, S., Kaplan, J., Amodei, D., & OpenAI Dota Team (2018). "An Empirical Model of Large-Batch Training." arXiv:1812.06162. https://arxiv.org/abs/1812.06162
14. You, Y., Gitman, I., & Ginsburg, B. (2017). "Large Batch Training of Convolutional Networks." arXiv:1708.03888. https://arxiv.org/abs/1708.03888
15. You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., & Hsieh, C. J. (2019). "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes." ICLR 2020. https://arxiv.org/abs/1904.00962
16. Loshchilov, I., & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. https://arxiv.org/abs/1711.05101
17. Brown, T. B., et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020 (GPT-3 paper). https://arxiv.org/abs/2005.14165
18. Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., Pham, H., Dong, X., Luong, T., Hsieh, C. J., Lu, Y., & Le, Q. V. (2023). "Symbolic Discovery of Optimization Algorithms." NeurIPS 2023. https://arxiv.org/abs/2302.06675
19. Masters, D., & Luschi, C. (2018). "Revisiting Small Batch Training for Deep Neural Networks." arXiv:1804.07612. https://arxiv.org/abs/1804.07612
20. Dettmers, T., Lewis, M., Shleifer, S., & Zettlemoyer, L. (2022). "8-bit Optimizers via Block-wise Quantization." ICLR 2022. https://arxiv.org/abs/2110.02861

