Quantization-Aware Training (QAT)

Updated Jul 23, 2026

Quantization-aware training (QAT) is a model compression technique in which the effects of quantization are simulated during the training or fine-tuning of a neural network, so that the model learns parameter values that stay accurate once they are stored and computed at low numerical precision. QAT inserts "fake" quantize-and-dequantize operations into the forward pass and lets gradients flow through the otherwise non-differentiable rounding using the straight-through estimator (STE). It is the principal alternative to post-training quantization (PTQ), which compresses an already-trained model with little or no further optimization. QAT is generally more accurate at very low bit-widths, at the cost of substantially more compute. ^[1]^[2]^[6]

Overview

Neural networks are usually trained in 32-bit or 16-bit floating point, but inference can be made much cheaper by storing the weights, and sometimes the activations and the KV cache, as low-bit integers such as 8-bit (int8), 4-bit (int4), or even ternary and binary values. Lower precision shrinks the memory footprint roughly in proportion to the bit-width and, because much inference is memory-bandwidth bound, can also speed it up. The difficulty is that rounding to a coarse grid introduces error that compounds across a deep network. ^[1]
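As a rough illustration of the memory claim above, weight storage scales linearly with bit-width. The sketch below assumes a hypothetical 7-billion-parameter model and counts the weights only; it ignores scales, zero-points, activations, and the KV cache, so real footprints are somewhat larger.

```python
def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


# Hypothetical model size; quantization metadata is deliberately ignored.
NUM_PARAMS = 7_000_000_000

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    gigabytes = weight_bytes(NUM_PARAMS, bits) / 1e9
    print(f"{name}: {gigabytes:.1f} GB")
# prints:
# fp16: 14.0 GB
# int8: 7.0 GB
# int4: 3.5 GB
```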

QAT addresses this by exposing the network to quantization error while it is still being optimized: the forward pass rounds weights (and optionally activations) to the target grid, the loss is computed on the quantized values, and the parameters are nudged to compensate. Because the network can spend its remaining capacity absorbing rounding error, QAT typically preserves accuracy at bit-widths where PTQ degrades sharply, at the cost of a full training loop and data. ^[1]^[6]

QAT versus post-training quantization

The two approaches sit at opposite ends of a cost-accuracy trade-off. Post-training quantization takes a finished model and computes the quantization parameters (scales and zero-points) from a small calibration set, often without any backpropagation. Methods such as GPTQ, AWQ, SmoothQuant, and SpinQuant are PTQ techniques: they are fast, need little or no data, and do not require the original training pipeline, which makes them convenient for compressing third-party weights. Their accuracy is excellent at 8-bit and usually good at 4-bit, but it tends to collapse toward 2-bit and below. ^[1] Quantization-aware training instead folds quantization into the optimization itself; it needs a training loop and data (or a teacher model) and is much slower, but reaches far lower bit-widths with little accuracy loss. The boundary is not sharp, since some PTQ methods perform block-wise reconstruction and some QAT recipes train only a subset of parameters to stay cheap.

Aspect	Post-training quantization (PTQ)	Quantization-aware training (QAT)
Data required	none or a small calibration set	training or fine-tuning data, or a distillation signal
Optimization	calibration, often no backpropagation	full gradient-based training
Compute cost	minutes to a few hours	hours to a full training run
Practical bit-width	8-bit easy, 4-bit with care	down to 4, 3, 2, ternary, and 1-bit
Accuracy at very low bit-width	degrades, can collapse	best available
Needs training pipeline	no	yes
Representative methods	GPTQ, AWQ, SmoothQuant, SpinQuant	Jacob et al. 2018, LSQ, LLM-QAT, BitNet
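The calibration step that PTQ relies on can be sketched in a few lines. This is the simplest possible variant, min/max calibration of an unsigned affine quantizer; the function name is illustrative, and methods such as GPTQ or AWQ do considerably more than this.

```python
def calibrate_affine(samples, num_bits=8):
    """Derive (scale, zero_point) from calibration samples using their min/max."""
    q_min, q_max = 0, 2 ** num_bits - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(min(samples), 0.0)
    hi = max(max(samples), 0.0)
    # Fall back to 1.0 if every sample is zero, to avoid a zero scale.
    scale = (hi - lo) / (q_max - q_min) or 1.0
    zero_point = round(q_min - lo / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


# A tiny made-up "calibration set" of activation values.
scale, zero_point = calibrate_affine([-1.0, 0.5, 3.0])
print(scale, zero_point)
```

No gradients are involved: the quantization parameters are read directly off observed statistics, which is why this style of PTQ is fast and needs very little data.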

How QAT works

Fake quantization

The central mechanism is the fake-quantize node, also called simulated quantization. A uniform affine quantizer maps a real value r to an integer q and back using a scale S and a zero-point Z: the quantized integer is q = clamp(round(r / S) + Z, q_min, q_max), and the dequantized approximation is r_hat = S * (q - Z). During QAT each fake-quantize node computes r_hat in floating point and passes that rounded value downstream. The rest of the network therefore sees exactly the rounding and clamping error it will face at deployment, while the tensors remain ordinary floats the training framework can handle. ^[1]
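The quantize-then-dequantize round trip described above can be written directly from the two formulas. This is a minimal scalar sketch; the function name and the signed 8-bit default range are illustrative choices, not part of any particular framework.

```python
def fake_quantize(r, scale, zero_point, q_min=-128, q_max=127):
    """Return r_hat = S * (q - Z), where q = clamp(round(r / S) + Z, q_min, q_max)."""
    # Note: Python's round() rounds exact ties to the nearest even integer.
    q = round(r / scale) + zero_point
    q = max(q_min, min(q_max, q))
    return scale * (q - zero_point)  # still a float, but restricted to the grid


print(fake_quantize(1.3, scale=0.5, zero_point=0))    # 1.5  (rounding error)
print(fake_quantize(100.0, scale=0.5, zero_point=0))  # 63.5 (clamping error)
```

The output is an ordinary float, so downstream layers need no changes, yet it can take only one of the values an integer kernel could represent.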

Practical implementations attach fake-quantize nodes to the weights of each linear or convolutional layer and, for full quantization, to the activations too. Weights are commonly quantized per output channel, while activations use a per-tensor or per-token scale. Operations that are fused into neighboring layers at inference, most importantly batch normalization, are folded during training as well, so that the simulated graph matches the deployed integer graph. The scheme introduced by Benoit Jacob and colleagues at Google in 2018 made this concrete with an integer-arithmetic-only path, using 8-bit weights and activations with int32 accumulators, which underpins the quantization tooling in TensorFlow Lite. ^[1]
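Batch-norm folding is plain algebra, and a small sketch may make it concrete. The helper names below are hypothetical, and the layer is a toy fully connected layer stored as nested lists with one row per output channel.

```python
import math


def linear(weights, bias, x):
    """y[i] = sum_j weights[i][j] * x[j] + bias[i]."""
    return [sum(w * xj for w, xj in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-output-channel batch-norm parameters into the preceding layer."""
    folded_w, folded_b = [], []
    for row, b, g, bt, mu, v in zip(weights, bias, gamma, beta, mean, var):
        factor = g / math.sqrt(v + eps)            # per-channel rescaling
        folded_w.append([w * factor for w in row])
        folded_b.append((b - mu) * factor + bt)
    return folded_w, folded_b
```

Applying `linear` with the folded parameters gives the same output as the original layer followed by batch normalization. In a QAT graph, the weight fake-quantizer would then act on the folded weights, since those are the values the deployed integer kernel actually stores.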

The straight-through estimator

The rounding inside a fake-quantize node has a derivative that is zero almost everywhere and undefined at the step boundaries, so naive backpropagation would deliver no usable gradient to the weights. QAT resolves this with the straight-through estimator, introduced by Yoshua Bengio, Nicholas Leonard, and Aaron Courville in 2013 and prefigured in Geoffrey Hinton's 2012 lectures. The STE simply treats the rounding as the identity function on the backward pass: the gradient of r_hat with respect to r is taken to be 1 inside the representable range and 0 outside it (where the value is clamped). ^[2]
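Written out by hand, the estimator amounts to a forward function that rounds and a backward function that does not. The sketch below is scalar, uses a symmetric quantizer with the zero-point fixed at 0 for brevity, and assumes a signed 4-bit range; all names are illustrative.

```python
def ste_forward(r, scale, q_min=-8, q_max=7):
    """Forward pass: real rounding and clamping."""
    q = max(q_min, min(q_max, round(r / scale)))
    return scale * q


def ste_backward(r, scale, upstream_grad, q_min=-8, q_max=7):
    """Backward pass: pretend the rounding was the identity.

    The incoming gradient is passed through unchanged when r lies inside
    the representable range and is zeroed where the forward pass clamped.
    """
    inside = q_min * scale <= r <= q_max * scale
    return upstream_grad if inside else 0.0


# With scale 0.1 the representable range is roughly [-0.8, 0.7].
print(ste_backward(0.3, 0.1, upstream_grad=2.5))  # 2.5 (passed through)
print(ste_backward(5.0, 0.1, upstream_grad=2.5))  # 0.0 (value was clamped)
```

Automatic-differentiation frameworks usually express the same idea more compactly, but the behavior is the one shown here.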

To make this work, QAT keeps a full-precision "latent" or "master" copy of the weights. The forward pass quantizes these latent weights to produce r_hat, the STE routes the loss gradient back to them as if no rounding had occurred, and the optimizer updates the latent weights in floating point; the quantized copy is recomputed at every step. This pattern, first used for binary networks, is what allows discrete, non-differentiable quantization to be optimized with ordinary gradient descent and backpropagation. The STE is a biased approximation rather than a true gradient, however, and this is the source of several practical difficulties discussed below.
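The latent-weight loop can be illustrated with a one-parameter toy problem: fit y = w * x while only a quantized w is allowed in the forward pass. Everything here (grid spacing, data, learning rate) is a made-up example, and the target weight 0.75 was chosen to lie exactly on the grid so that the run settles; off-grid targets can behave less tidily.

```python
SCALE = 0.25           # grid spacing (hypothetical)
Q_MIN, Q_MAX = -8, 7   # signed 4-bit integer range


def quantize(w):
    q = max(Q_MIN, min(Q_MAX, round(w / SCALE)))
    return SCALE * q


def train(latent_w, xs, ys, lr=0.1, steps=50):
    for _ in range(steps):
        w_q = quantize(latent_w)  # the forward pass sees only the quantized weight
        # Gradient of 0.5 * mean((w_q * x - y)^2) with respect to w_q.
        grad_wq = sum((w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # STE: treat d(w_q)/d(latent_w) as 1 inside the clamp range, 0 outside.
        inside = Q_MIN * SCALE <= latent_w <= Q_MAX * SCALE
        latent_w -= lr * (grad_wq if inside else 0.0)  # update in floating point
    return latent_w


xs = [-1.0, 0.5, 2.0]
ys = [0.75 * x for x in xs]
latent = train(0.0, xs, ys)
print(latent, quantize(latent))
```

The latent weight stops somewhere inside the rounding bin of 0.75 rather than at 0.75 itself: once the quantized weight is correct the gradient vanishes, and only the quantized value matters at deployment.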

Learnable quantization parameters

Early QAT fixed the quantization grid by hand, but later methods make the grid itself trainable. PACT (Choi et al., 2018) learns a per-layer clipping threshold for activations by gradient descent, letting the network decide how much of the activation range to keep. Learned Step Size Quantization (LSQ, Esser et al., ICLR 2020) treats the step size S as a learnable parameter with its own STE-style gradient, and was the first method to show 3-bit networks matching full-precision accuracy on ImageNet. ^[10]^[11]
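A minimal sketch of the step-size gradient used by LSQ follows. It applies the straight-through estimator to the rounding inside s * round(clip(v / s)), which yields round(v / s) - v / s inside the range and the clip bound outside it. The function names are hypothetical, and the additional gradient scaling factor described in the LSQ paper is omitted here.

```python
def lsq_quantize(v, s, q_n, q_p):
    """Quantize v with step size s onto the integer range [-q_n, q_p]."""
    return s * round(max(-q_n, min(q_p, v / s)))


def lsq_step_size_grad(v, s, q_n, q_p):
    """STE-style derivative of lsq_quantize(v, s, ...) with respect to s."""
    ratio = v / s
    if ratio <= -q_n:
        return float(-q_n)       # clipped at the lower bound
    if ratio >= q_p:
        return float(q_p)        # clipped at the upper bound
    return round(ratio) - ratio  # inside the range: signed rounding residual


# Signed 3-bit example: integers in [-4, 3], step size 0.5.
print(lsq_quantize(0.6, 0.5, q_n=4, q_p=3))        # 0.5
print(lsq_step_size_grad(5.0, 0.5, q_n=4, q_p=3))  # 3.0
```

Values that are clipped push the step size up or down by a full bound, while in-range values contribute only their rounding residual, which is what lets gradient descent trade clipping error against rounding error.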

History

The lineage of QAT runs through binary and low-bit networks. BinaryConnect (Courbariaux, Bengio, and David, 2015) trained networks whose forward pass used binary weights while a full-precision shadow copy accumulated gradient updates, establishing the latent-weights idea. Binarized Neural Networks (Courbariaux, Hubara, and colleagues, 2016) constrained both weights and activations to +1 or -1 and used the STE to backpropagate through the sign function. ^[3]^[4] XNOR-Net (Rastegari et al., 2016) binarized both the weights and the inputs of a convolutional neural network, reporting roughly 32-fold memory savings and replacing multiply-accumulate with XNOR and bit-counting operations. ^[5]

The work of Jacob et al. (2018), "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference," generalized these ideas into a practical 8-bit recipe for mobile inference and popularized the term quantization-aware training through its TensorFlow tooling. ^[1] The learnable-parameter methods PACT and LSQ then carried QAT into the 4-bit, 3-bit, and 2-bit regime. ^[10]^[11] Throughout this period QAT was applied mainly to convolutional networks for image classification and detection, where it routinely recovered most of the accuracy lost to aggressive quantization.

QAT for large language models

When large language models became the dominant target for compression, PTQ led at first because a full training pass over billions of parameters is expensive. QAT for LLMs then grew in two directions: cheaper fine-tuning recipes, and training low-precision models from scratch.

LLM-QAT (Liu et al., Meta, May 29, 2023) was an influential early example. Its key idea is data-free knowledge distillation: rather than relying on the original, often proprietary, training corpus, it generates text from the pretrained model and uses that model as a teacher, training the quantized student to match the teacher's output distribution. LLM-QAT quantizes weights, activations, and the KV cache down to 4-bit on LLaMA 7B, 13B, and 30B models, and reported clear gains over training-free methods in the low-bit settings where PTQ breaks down. ^[6] EfficientQAT (Chen, Shao, et al., July 2024; ACL 2025) attacked the cost problem with a two-phase scheme: a block-wise phase (Block-AP) that trains all parameters of one transformer block at a time, then an end-to-end phase (E2E-QP) that tunes only the quantization parameters. By never holding the whole model in the optimizer state, it makes QAT feasible for 7B to 70B models at 2-bit, 3-bit, and 4-bit precision. ^[7]
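The distillation objective, in which the student matches the teacher's output distribution, can be sketched as a cross-entropy between the two softmax distributions over the vocabulary. This is an illustrative single-token version with hypothetical function names, not the training code of either paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def distillation_loss(teacher_logits, student_logits):
    """Cross-entropy of the student's distribution against the teacher's soft labels."""
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    student_log_probs = log_softmax(student_logits)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


teacher = [2.0, 0.5, -1.0]  # made-up logits from the full-precision model
student = [0.0, 0.0, 0.0]   # made-up logits from the quantized model
print(distillation_loss(teacher, student))  # about 1.0986, i.e. log(3)
```

The loss is smallest when the student reproduces the teacher's distribution exactly, so no ground-truth labels are needed, only text for the teacher to score.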

A more radical branch trains low-precision LLMs from scratch, which is QAT applied during pretraining rather than fine-tuning. BitNet (Wang, Ma, and colleagues at Microsoft Research, October 2023) introduced a BitLinear layer, a drop-in replacement for the standard linear layer that binarizes weights in the forward pass and relies on the STE in the backward pass, while keeping higher-precision activations and latent weights for stability. ^[8] Its successor, BitNet b1.58 ("The Era of 1-bit LLMs," Ma et al., February 27, 2024), used ternary weights {-1, 0, +1}; because a three-valued weight carries at most log2(3) ≈ 1.58 bits of information, the model is called 1.58-bit. Trained from scratch with an absolute-mean (absmean) scaling scheme, at the 3B-parameter scale it was reported to match an FP16 LLaMA baseline in both perplexity and zero-shot accuracy while using about 3.55 times less GPU memory and running about 2.71 times faster. Ternary weights also let the dominant matrix multiplications be carried out with additions instead of multiplications. An open model trained natively this way, BitNet b1.58 2B4T, followed in 2025. ^[9]
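The absmean scheme and the addition-only arithmetic can be sketched together: scale the weights by their mean absolute value, round and clip to {-1, 0, +1}, then compute a dot product without any weight multiplications. The sketch is scalar and list-based, the function names are hypothetical, and applying the scale once to the accumulated sum is an assumption about how the result is rescaled.

```python
import math


def absmean_ternary(weights, eps=1e-8):
    """Quantize weights to {-1, 0, +1} using the mean absolute value as the scale."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    ternary = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return ternary, gamma


def ternary_dot(ternary, gamma, activations):
    """Dot product with ternary weights: only additions and subtractions per weight."""
    acc = 0.0
    for t, a in zip(ternary, activations):
        if t == 1:
            acc += a
        elif t == -1:
            acc -= a
        # t == 0 contributes nothing
    return gamma * acc  # a single multiplication by the scale at the end


ternary, gamma = absmean_ternary([0.9, -0.05, -1.2, 0.3, 0.0])
print(ternary)       # [1, 0, -1, 1, 0]
print(math.log2(3))  # about 1.585 bits per ternary weight
```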

QAT has since reached production on-device releases. In October 2024 Meta shipped quantized Llama 3.2 1B and 3B with two recipes: SpinQuant, a PTQ method, and a QAT-plus-LoRA pipeline that prioritizes accuracy by combining quantization-aware training of the backbone with low-rank adapters. ^[13] Google released int4 QAT checkpoints for its Gemma 3 1B, 4B, 12B, and 27B models in 2025, fine-tuning each for a few thousand steps against the non-quantized checkpoint's probabilities to keep close to bfloat16 quality. ^[14]

QAT should be distinguished from QLoRA (Dettmers et al., 2023), with which it is sometimes grouped. QLoRA freezes a base model quantized to 4-bit (the NormalFloat NF4 data type) and trains only higher-precision LoRA adapters on top; the frozen weights are never updated, so QLoRA is PTQ combined with parameter-efficient fine-tuning rather than quantization-aware training of the weights. The QAT-plus-LoRA recipe used for Llama 3.2 differs because there the backbone itself is updated under simulated quantization. ^[12]^[13]

Tradeoffs and limitations

The defining drawback of QAT is cost. It requires backpropagation, an optimizer state, and either training data or a distillation teacher, so a run can take orders of magnitude more compute than a PTQ calibration that finishes in minutes. For 8-bit and many 4-bit deployments PTQ is accurate enough that the expense is hard to justify, so QAT is most valuable in the hardest regimes: 2-bit, ternary, and 1-bit, or when activations and the KV cache must also be quantized. ^[1]^[6]

A second issue is that the straight-through estimator provides only an approximate gradient. Because the true rounding gradient is replaced by the identity, the latent weights can drift and oscillate around a quantization boundary instead of settling, an instability analyzed directly by Nagel et al. (ICML 2022); the oscillations worsen at very low bit-widths and motivate techniques like learnable step sizes and regularizers. ^[15] QAT accuracy also depends on the representativeness of its data or distillation signal.
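The oscillation is easy to reproduce in a toy setting. In the sketch below a single latent weight is trained so that its quantized value approaches a target of 0.3 on a grid with spacing 1; neither neighboring grid point (0 or 1) equals the target, so the STE gradient never vanishes. All constants and names are illustrative.

```python
import math


def nearest_grid(w, scale=1.0):
    # Round half up, so ties do not depend on Python's ties-to-even round().
    return scale * math.floor(w / scale + 0.5)


def simulate(target=0.3, w=0.2, lr=0.1, steps=200):
    """Return the final latent weight and how often its quantized value flipped."""
    flips, prev_q = 0, nearest_grid(w)
    for _ in range(steps):
        q = nearest_grid(w)
        flips += q != prev_q
        prev_q = q
        # STE gradient of 0.5 * (q - target)^2 with respect to the latent weight.
        w -= lr * (q - target)
    return w, flips


final_w, flips = simulate()
print(final_w, flips)
```

The latent weight climbs to the decision boundary at 0.5 and then hovers there, with the quantized value flipping between 0 and 1 every few steps instead of settling on 0, the grid point closer to the target.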

Finally, simulated quantization does not itself make anything faster. The speed and memory benefits are realized only at inference, and only if the deployment hardware and kernels actually support the chosen format, such as int8 or int4 matrix multiplication or ternary-weight kernels, so QAT is normally co-designed with the inference stack it targets. Even so, QAT remains the most reliable way to push neural networks to the lowest bit-widths while retaining accuracy, and it is the method of record for the binary and ternary LLMs that PTQ cannot reach. ^[1]^[8]^[9]

References

[1] Jacob, Benoit; Kligys, Skirmantas; Chen, Bo; Zhu, Menglong; Tang, Matthew; Howard, Andrew; Adam, Hartwig; Kalenichenko, Dmitry. "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference." arXiv:1712.05877 (December 2017); CVPR 2018. arxiv.org/...1712.05877
[2] Bengio, Yoshua; Leonard, Nicholas; Courville, Aaron. "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation." arXiv:1308.3432 (2013). arxiv.org/...1308.3432
[3] Courbariaux, Matthieu; Bengio, Yoshua; David, Jean-Pierre. "BinaryConnect: Training Deep Neural Networks with binary weights during propagations." arXiv:1511.00363 (2015); NeurIPS 2015. arxiv.org/...1511.00363
[4] Courbariaux, Matthieu; Hubara, Itay; Soudry, Daniel; El-Yaniv, Ran; Bengio, Yoshua. "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1." arXiv:1602.02830 (2016); NeurIPS 2016. arxiv.org/...1602.02830
[5] Rastegari, Mohammad; Ordonez, Vicente; Redmon, Joseph; Farhadi, Ali. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." arXiv:1603.05279 (2016); ECCV 2016. arxiv.org/...1603.05279
[6] Liu, Zechun; Oguz, Barlas; Zhao, Changsheng; Chang, Ernie; Stock, Pierre; Mehdad, Yashar; Shi, Yangyang; Krishnamoorthi, Raghuraman; Chandra, Vikas. "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models." arXiv:2305.17888 (May 29, 2023). arxiv.org/...2305.17888
[7] Chen, Mengzhao; Shao, Wenqi; et al. "EfficientQAT: Efficient Quantization-Aware Training for Large Language Models." arXiv:2407.11062 (July 2024); ACL 2025. arxiv.org/...2407.11062
[8] Wang, Hongyu; Ma, Shuming; Dong, Li; Huang, Shaohan; et al. "BitNet: Scaling 1-bit Transformers for Large Language Models." arXiv:2310.11453 (October 2023). arxiv.org/...2310.11453
[9] Ma, Shuming; Wang, Hongyu; Ma, Lingxiao; Wang, Lei; Wang, Wenhui; Huang, Shaohan; Dong, Li; Wang, Ruiping; Xue, Jilong; Wei, Furu. "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits." arXiv:2402.17764 (February 27, 2024). arxiv.org/...2402.17764
[10] Esser, Steven K.; McKinstry, Jeffrey L.; Bablani, Deepika; Appuswamy, Rathinakumar; Modha, Dharmendra S. "Learned Step Size Quantization." arXiv:1902.08153 (2019); ICLR 2020. arxiv.org/...1902.08153
[11] Choi, Jungwook; Wang, Zhuo; Venkataramani, Swagath; Chuang, Pierce I-Jen; Srinivasan, Vijayalakshmi; Gopalakrishnan, Kailash. "PACT: Parameterized Clipping Activation for Quantized Neural Networks." arXiv:1805.06085 (2018). arxiv.org/...1805.06085
[12] Dettmers, Tim; Pagnoni, Artidoro; Holtzman, Ari; Zettlemoyer, Luke. "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314 (May 2023); NeurIPS 2023. arxiv.org/...2305.14314
[13] Meta AI. "Introducing quantized Llama models with increased speed and a reduced memory footprint." October 24, 2024. ai.meta.com/...-llama-quantized-lightweight-models
[14] Google Developers Blog. "Gemma 3 QAT Models: Bringing state-of-the-art AI to consumer GPUs." 2025. developers.googleblog.com/...t-ai-to-consumer-gpus
[15] Nagel, Markus; Fournarakis, Marios; Bondarenko, Yelysei; Blankevoort, Tijmen. "Overcoming Oscillations in Quantization-Aware Training." Proceedings of the 39th International Conference on Machine Learning (ICML 2022), PMLR 162. proceedings.mlr.press/...nagel22a.pdf
