Quantization-Aware Training (QAT)
Last reviewed
Jun 9, 2026
Sources
15 citations
Review status
Source-backed
Revision
v2 · 2,385 words
Quantization-aware training (QAT) is a model compression technique in which the effects of quantization are simulated during the training or fine-tuning of a neural network, so that the model learns parameter values that stay accurate once they are stored and computed at low numerical precision. QAT inserts "fake" quantize-and-dequantize operations into the forward pass and lets gradients flow through the otherwise non-differentiable rounding using the straight-through estimator (STE). It is the principal alternative to post-training quantization (PTQ), which compresses an already-trained model with little or no further optimization. QAT is generally more accurate at very low bit-widths, at the cost of substantially more compute. [1][2][6]
Neural networks are usually trained in 32-bit or 16-bit floating point, but inference can be made much cheaper by storing the weights, and sometimes the activations and the KV cache, as low-bit integers such as 8-bit (int8), 4-bit (int4), or even ternary and binary values. Lower precision shrinks the memory footprint roughly in proportion to the bit-width and, because much inference is memory-bandwidth bound, can also speed it up. The difficulty is that rounding to a coarse grid introduces error that compounds across a deep network. [1]
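As a rough illustration of how storage scales with bit-width, the sketch below estimates the weight footprint of a hypothetical 7-billion-parameter model at several precisions. The model size and the helper name `weight_bytes` are assumptions made for the example. It counts weights only and ignores scales, zero-points, activations, and the KV cache, so the figures are lower bounds, not measured sizes.

```python
def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Bytes needed for the weights alone (no scales, zero-points, or metadata)."""
    return num_params * bits_per_weight / 8


if __name__ == "__main__":
    num_params = 7_000_000_000  # hypothetical 7B-parameter model (assumption)
    for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
        gib = weight_bytes(num_params, bits) / 2**30
        print(f"{name:>4}: {gib:6.2f} GiB")
```

Real checkpoints are somewhat larger than this estimate, because per-channel or per-group scales must be stored alongside the integers.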
QAT addresses this by exposing the network to quantization error while it is still being optimized: the forward pass rounds weights (and optionally activations) to the target grid, the loss is computed on the quantized values, and the parameters are nudged to compensate. Because the network can spend its remaining capacity absorbing rounding error, QAT typically preserves accuracy at bit-widths where PTQ degrades sharply, at the cost of a full training loop and data. [1][6]
The two approaches sit at opposite ends of a cost-accuracy trade-off. Post-training quantization takes a finished model and computes the quantization parameters (scales and zero-points) from a small calibration set, often without any backpropagation. Methods such as GPTQ, AWQ, SmoothQuant, and SpinQuant are PTQ techniques: they are fast, need little or no data, and do not require the original training pipeline, which makes them convenient for compressing third-party weights. Their accuracy is excellent at 8-bit and usually good at 4-bit, but it tends to collapse toward 2-bit and below. [1] Quantization-aware training instead folds quantization into the optimization itself; it needs a training loop and data (or a teacher model) and is much slower, but reaches far lower bit-widths with little accuracy loss. The boundary is not sharp, since some PTQ methods perform block-wise reconstruction and some QAT recipes train only a subset of parameters to stay cheap.
| Aspect | Post-training quantization (PTQ) | Quantization-aware training (QAT) |
|---|---|---|
| Data required | none or a small calibration set | training or fine-tuning data, or a distillation signal |
| Optimization | calibration, often no backpropagation | full gradient-based training |
| Compute cost | minutes to a few hours | hours to a full training run |
| Practical bit-width | 8-bit easy, 4-bit with care | down to 4, 3, 2, ternary, and 1-bit |
| Accuracy at very low bit-width | degrades, can collapse | best available |
| Needs training pipeline | no | yes |
| Representative methods | GPTQ, AWQ, SmoothQuant, SpinQuant | Jacob et al. 2018, LSQ, LLM-QAT, BitNet |
The central mechanism is the fake-quantize node, also called simulated quantization. A uniform affine quantizer maps a real value r to an integer q and back using a scale S and a zero-point Z: the quantized integer is q = clamp(round(r / S) + Z, q_min, q_max), and the dequantized approximation is r_hat = S * (q - Z). During QAT each fake-quantize node computes r_hat in floating point and passes that rounded value downstream. The rest of the network therefore sees exactly the rounding and clamping error it will face at deployment, while the tensors remain ordinary floats the training framework can handle. [1]
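The quantize and dequantize formulas above can be sketched for a single scalar as follows. This is a minimal, dependency-free illustration, not a framework implementation: the function names and the example values of S, Z, q_min, and q_max are assumptions, and Python's built-in `round` rounds halves to the nearest even integer, which may differ from the rounding mode of a given deployment kernel.

```python
def quantize(r: float, scale: float, zero_point: int, q_min: int, q_max: int) -> int:
    """q = clamp(round(r / S) + Z, q_min, q_max)."""
    q = round(r / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """r_hat = S * (q - Z)."""
    return scale * (q - zero_point)


def fake_quantize(r: float, scale: float, zero_point: int, q_min: int, q_max: int) -> float:
    """Quantize then immediately dequantize, so the result is a float on the grid."""
    return dequantize(quantize(r, scale, zero_point, q_min, q_max), scale, zero_point)


if __name__ == "__main__":
    # Illustrative unsigned 4-bit grid: S = 0.5, Z = 3, integers in [0, 15].
    for r in (1.2, -2.0, 10.0):
        print(r, "->", fake_quantize(r, scale=0.5, zero_point=3, q_min=0, q_max=15))
```

With these example parameters the representable range is [-1.5, 6.0]; inputs outside it are clamped to the nearest end, and inputs inside it are rounded to the nearest multiple of 0.5.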
Practical implementations attach fake-quantize nodes to the weights of each linear or convolutional layer and, for full quantization, to the activations too. Weights are commonly quantized per output channel, while activations use a per-tensor or per-token scale, and operations that are folded at inference, most importantly batch normalization, are folded during training so that the simulated graph matches the deployed integer graph. The scheme introduced by Benoit Jacob and colleagues at Google in 2018 made this concrete with an integer-arithmetic-only path: 8-bit weights and activations with int32 accumulators, which underpins the quantization tooling in TensorFlow Lite. [1]
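To see why weights are commonly given one scale per output channel, the sketch below compares per-channel and per-tensor fake quantization on a small matrix whose two rows differ greatly in magnitude. The symmetric max-absolute-value scale and all function names are assumptions chosen for illustration; the text does not prescribe how scales are calibrated.

```python
def absmax_scale(values, q_max):
    """Symmetric scale S = max|w| / q_max (falls back to 1.0 for an all-zero group)."""
    max_abs = max(abs(v) for v in values)
    return max_abs / q_max if max_abs > 0 else 1.0


def fake_quantize_row(row, scale, q_max):
    """Round each weight to the signed grid [-q_max, q_max] and map back to floats."""
    return [scale * max(-q_max, min(q_max, round(w / scale))) for w in row]


def per_channel(weight, q_max=127):
    """One scale per output channel (one row of the weight matrix)."""
    return [fake_quantize_row(row, absmax_scale(row, q_max), q_max) for row in weight]


def per_tensor(weight, q_max=127):
    """A single scale shared by the whole matrix."""
    scale = absmax_scale([w for row in weight for w in row], q_max)
    return [fake_quantize_row(row, scale, q_max) for row in weight]


def max_abs_error(original_row, quantized_row):
    return max(abs(a - b) for a, b in zip(original_row, quantized_row))


if __name__ == "__main__":
    weight = [[0.02, -0.01, 0.005],  # small-magnitude channel
              [4.0, -2.0, 1.0]]      # large-magnitude channel
    pc, pt = per_channel(weight), per_tensor(weight)
    print("row 0 error, per-channel:", max_abs_error(weight[0], pc[0]))
    print("row 0 error, per-tensor: ", max_abs_error(weight[0], pt[0]))
```

With a shared scale, the large channel dictates the step size and the small channel's weights are rounded to zero or to a single step; a per-channel scale keeps the relative error comparable across rows.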
The rounding inside a fake-quantize node has a derivative that is zero almost everywhere and undefined at the step boundaries, so naive backpropagation would deliver no usable gradient to the weights. QAT resolves this with the straight-through estimator, introduced by Yoshua Bengio, Nicholas Léonard, and Aaron Courville in 2013 and prefigured in Geoffrey Hinton's 2012 lectures. The STE simply treats the rounding as the identity function on the backward pass: the gradient of r_hat with respect to r is taken to be 1 inside the representable range and 0 outside it (where the value is clamped). [2]
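The contrast between the true derivative and the STE surrogate can be checked numerically. The sketch below assumes a symmetric quantizer with zero-point 0 and an illustrative signed 4-bit grid; the function names are hypothetical. The finite-difference estimate stands in for the true derivative away from step boundaries.

```python
def fake_quant(r, scale, q_min, q_max):
    """Round to the grid and clamp, returning a float (zero-point assumed to be 0)."""
    q = max(q_min, min(q_max, round(r / scale)))
    return scale * q


def finite_difference_grad(r, scale, q_min, q_max, eps=1e-4):
    """Central-difference estimate of d r_hat / d r (the 'true' local slope)."""
    hi = fake_quant(r + eps, scale, q_min, q_max)
    lo = fake_quant(r - eps, scale, q_min, q_max)
    return (hi - lo) / (2 * eps)


def ste_grad(r, scale, q_min, q_max):
    """STE surrogate: 1 inside the representable range, 0 where the value is clamped."""
    return 1.0 if q_min * scale <= r <= q_max * scale else 0.0


if __name__ == "__main__":
    args = dict(scale=0.25, q_min=-8, q_max=7)  # illustrative signed 4-bit grid
    for r in (0.3, 5.0):
        print(r, "finite difference:", finite_difference_grad(r, **args),
              "STE:", ste_grad(r, **args))
```

For an in-range input that is not on a step boundary, the finite-difference slope is zero while the STE reports 1, which is what lets a loss gradient reach the weights at all.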
To make this work, QAT keeps a full-precision "latent" or "master" copy of the weights. The forward pass quantizes these latent weights to produce r_hat, the STE routes the loss gradient back to them as if no rounding had occurred, and the optimizer updates the latent weights in floating point, recomputing the quantized copy at every step. This pattern, first used for binary networks, is what allows discrete, non-differentiable quantization to be optimized with ordinary gradient descent and backpropagation. The STE is a biased approximation rather than a true gradient, which is the source of several practical difficulties discussed below.
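The latent-weight loop can be sketched on a toy problem with one scalar weight. Everything here is an assumption made for illustration: the squared-error loss, the learning rate, the grid, and a target (0.75) that lies exactly on the grid so the run settles. A real QAT setup would use a framework's autograd and optimizer instead of hand-written gradients.

```python
def fake_quant(w, scale, q_min, q_max):
    """Round to the grid and clamp, returning a float (zero-point assumed to be 0)."""
    return scale * max(q_min, min(q_max, round(w / scale)))


def train(target, steps=50, lr=0.1, scale=0.25, q_min=-8, q_max=7):
    """Minimize 0.5 * (w_q - target)^2 while updating only the latent weight."""
    latent_w = 0.0  # full-precision master copy
    for _ in range(steps):
        w_q = fake_quant(latent_w, scale, q_min, q_max)  # forward pass uses the quantized copy
        grad_wq = w_q - target                           # gradient of the loss w.r.t. w_q
        inside = q_min * scale <= latent_w <= q_max * scale
        grad_latent = grad_wq if inside else 0.0         # STE: identity in range, 0 if clamped
        latent_w -= lr * grad_latent                     # plain SGD step in floating point
    return latent_w, fake_quant(latent_w, scale, q_min, q_max)


if __name__ == "__main__":
    latent, quantized = train(target=0.75)
    print("latent weight:", latent, "quantized weight:", quantized)
```

The latent weight accumulates many small updates that individually do not change the quantized value; the quantized copy only moves when the latent weight crosses a rounding boundary.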
Early QAT fixed the quantization grid by hand, but later methods make the grid itself trainable. PACT (Choi et al., 2018) learns a per-layer clipping threshold for activations by gradient descent, letting the network decide how much of the activation range to keep. Learned Step Size Quantization (LSQ, Esser et al., ICLR 2020) treats the step size S as a learnable parameter with its own STE-style gradient, and was the first method to show 3-bit networks matching full-precision accuracy on ImageNet. [10][11]
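The step-size gradient used by LSQ can be sketched for a scalar as below. The expression follows from applying the STE to the rounding in v_hat = s * round(clip(v / s, -Q_N, Q_P)) and matches the form given in the LSQ paper as understood here; the function names are hypothetical, and the paper's additional gradient scaling factor is omitted for brevity.

```python
def lsq_quantize(v, s, q_n, q_p):
    """v_hat = s * round(clip(v / s, -q_n, q_p))."""
    return s * round(max(-q_n, min(q_p, v / s)))


def lsq_step_size_grad(v, s, q_n, q_p):
    """d v_hat / d s, treating round() as the identity on the backward pass."""
    x = v / s
    if x <= -q_n:
        return float(-q_n)   # clamped at the negative end
    if x >= q_p:
        return float(q_p)    # clamped at the positive end
    return round(x) - x      # inside the range: signed rounding residual


if __name__ == "__main__":
    # Illustrative signed 4-bit grid: q_n = 8, q_p = 7, step size s = 0.25.
    for v in (0.3, 5.0, -5.0):
        print(v, lsq_quantize(v, 0.25, 8, 7), lsq_step_size_grad(v, 0.25, 8, 7))
```

Inside the range the gradient is the rounding residual, so values just below a grid point push the step size one way and values just above push it the other; clamped values push the step size to grow so that the range covers them.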
The lineage of QAT runs through binary and low-bit networks. BinaryConnect (Courbariaux, Bengio, and David, 2015) trained networks whose forward pass used binary weights while a full-precision shadow copy accumulated gradient updates, establishing the latent-weights idea. Binarized Neural Networks (Courbariaux, Hubara, and colleagues, 2016) pushed both weights and activations to plus or minus one and used the STE to backpropagate through the sign function. [3][4] XNOR-Net (Rastegari et al., 2016) binarized both weights and inputs of a convolutional neural network, reporting roughly 32 times memory savings and replacing multiply-accumulate with XNOR and bit-counting operations. [5]
The work of Jacob et al. (2018), "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference," generalized these ideas into a practical 8-bit recipe for mobile inference and popularized the term quantization-aware training through its TensorFlow tooling. [1] The learnable-parameter methods PACT and LSQ then carried QAT into the 4-bit, 3-bit, and 2-bit regime. [10][11] Throughout this period QAT was applied mainly to convolutional networks for image classification and detection, where it routinely recovered most of the accuracy lost to aggressive quantization.
When large language models became the dominant target for compression, PTQ led at first because a full training pass over billions of parameters is expensive. QAT for LLMs then grew in two directions: cheaper fine-tuning recipes, and training low-precision models from scratch.
LLM-QAT (Liu et al., Meta, May 29, 2023) was an influential early example. Its key idea is data-free knowledge distillation: rather than relying on the original, often proprietary, training corpus, it generates text from the pretrained model and uses that model as a teacher, training the quantized student to match the teacher's output distribution. LLM-QAT quantizes weights, activations, and the KV cache down to 4-bit on LLaMA 7B, 13B, and 30B models, and reported clear gains over training-free methods in the low-bit settings where PTQ breaks down. [6] EfficientQAT (Chen, Shao, et al., July 2024; ACL 2025) attacked the cost problem with a two-phase scheme: a block-wise phase (Block-AP) that trains all parameters of one transformer block at a time, then an end-to-end phase (E2E-QP) that tunes only the quantization parameters. By never holding the whole model in the optimizer state, it makes QAT feasible for 7B to 70B models at 2-bit, 3-bit, and 4-bit precision. [7]
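The distillation objective, in which a quantized student is trained to match the teacher's output distribution, can be sketched as a cross-entropy between the two softmax distributions at one token position. This is a generic logit-distillation loss written without a framework; the function names and example logits are assumptions, and the exact loss and data-generation procedure of LLM-QAT are described in the cited paper, not reproduced here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def distillation_loss(teacher_logits, student_logits):
    """Cross-entropy H(p_teacher, p_student) for a single token position."""
    teacher_logp = log_softmax(teacher_logits)
    student_logp = log_softmax(student_logits)
    return -sum(math.exp(t) * s for t, s in zip(teacher_logp, student_logp))


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]        # full-precision model's logits (made-up values)
    student_close = [1.9, 0.6, -1.1]  # quantized student, nearly matching
    student_far = [-1.0, 0.5, 2.0]    # quantized student, badly off
    print("close:", distillation_loss(teacher, student_close))
    print("far:  ", distillation_loss(teacher, student_far))
```

The loss is smallest when the student's distribution equals the teacher's, in which case it reduces to the teacher's entropy, so no ground-truth labels are needed.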
A more radical branch trains low-precision LLMs from scratch, which is QAT applied during pretraining rather than fine-tuning. BitNet (Wang, Ma, and colleagues at Microsoft Research, October 2023) introduced a BitLinear layer, a drop-in replacement for the standard linear layer that binarizes weights in the forward pass with the STE while keeping higher-precision activations and latent weights for stability. [8] Its successor, BitNet b1.58 ("The Era of 1-bit LLMs," Ma et al., February 27, 2024), used ternary weights {-1, 0, 1}; because each weight carries log2(3) ≈ 1.58 bits of information, the model is called 1.58-bit. Trained from scratch with an absolute-mean (absmean) scaling scheme, at the 3B-parameter scale it was reported to match a full-precision FP16 LLaMA baseline in both perplexity and zero-shot accuracy while using about 3.55 times less GPU memory and running about 2.71 times faster. Ternary weights also let the dominant matrix multiplications be carried out with additions instead of multiplications. An open model trained natively this way, BitNet b1.58 2B4T, followed in 2025. [9]
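The absmean ternary scheme and the addition-only matrix product can be sketched as follows. The quantizer divides the weights by their mean absolute value and rounds to the nearest value in {-1, 0, 1}, which is the form described for BitNet b1.58 as understood here. The function names, the epsilon, and the example matrix are assumptions, and the sketch covers only the forward-pass weight quantizer, not the full BitLinear layer.

```python
import math


def absmean_ternary(weight, eps=1e-5):
    """Scale by the mean absolute weight, then round and clip to {-1, 0, 1}."""
    flat = [abs(w) for row in weight for w in row]
    gamma = sum(flat) / len(flat)
    ternary = [[max(-1, min(1, round(w / (gamma + eps)))) for w in row] for row in weight]
    return ternary, gamma


def ternary_matvec(ternary, gamma, x):
    """y = gamma * (T @ x) using only additions and subtractions inside each row."""
    out = []
    for row in ternary:
        acc = 0.0
        for t, xi in zip(row, x):
            if t == 1:
                acc += xi
            elif t == -1:
                acc -= xi
        out.append(gamma * acc)  # one multiplication per output, for the scale
    return out


if __name__ == "__main__":
    weight = [[0.9, -0.05, 0.4], [-1.2, 0.1, -0.35]]  # made-up latent weights
    ternary, gamma = absmean_ternary(weight)
    print("ternary weights:", ternary, "scale:", gamma)
    print("output:", ternary_matvec(ternary, gamma, [1.0, 2.0, 3.0]))
    print("bits per ternary weight:", math.log2(3))
```

Weights much smaller than the mean magnitude become 0 and drop out of the sum entirely, which is the source of the sparsity that distinguishes ternary from strictly binary weights.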
QAT has since reached production on-device releases. In October 2024 Meta shipped quantized Llama 3.2 1B and 3B with two recipes: SpinQuant, a PTQ method, and a QAT-plus-LoRA pipeline that prioritizes accuracy by combining quantization-aware training of the backbone with low-rank adapters. [13] Google released int4 QAT checkpoints for its Gemma 3 1B, 4B, 12B, and 27B models in 2025, fine-tuning each for a few thousand steps against the non-quantized checkpoint's probabilities to keep close to bfloat16 quality. [14]
QAT should be distinguished from QLoRA (Dettmers et al., 2023), with which it is sometimes grouped. QLoRA freezes a base model quantized to 4-bit (the NormalFloat NF4 data type) and trains only higher-precision LoRA adapters on top; the frozen weights are never updated, so QLoRA is PTQ combined with parameter-efficient fine-tuning rather than quantization-aware training of the weights. The QAT-plus-LoRA recipe used for Llama 3.2 differs because there the backbone itself is updated under simulated quantization. [12][13]
The defining drawback of QAT is cost. It requires backpropagation, an optimizer state, and either training data or a distillation teacher, so a run can take orders of magnitude more compute than a PTQ calibration that finishes in minutes. For 8-bit and many 4-bit deployments PTQ is accurate enough that the expense is hard to justify, so QAT is most valuable in the hardest regimes: 2-bit, ternary, and 1-bit, or when activations and the KV cache must also be quantized. [1][6]
A second issue is that the straight-through estimator provides only an approximate gradient. Because the true rounding gradient is replaced by the identity, the latent weights can drift and oscillate around a quantization boundary instead of settling, an instability analyzed directly by Nagel et al. (ICML 2022); the oscillations worsen at very low bit-widths and motivate techniques like learnable step sizes and regularizers. [15] QAT accuracy also depends on the representativeness of its data or distillation signal.
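The oscillation can be reproduced on a toy problem. The sketch below is an illustrative construction, not the experiment from the cited paper: one scalar latent weight is trained with the STE toward a target (0.8) that lies between two grid points (0.75 and 1.0), and the quantized value is recorded at every step. The grid, learning rate, and function names are assumptions.

```python
def fake_quant(w, scale, q_min, q_max):
    """Round to the grid and clamp, returning a float (zero-point assumed to be 0)."""
    return scale * max(q_min, min(q_max, round(w / scale)))


def run(target=0.8, steps=300, lr=0.1, scale=0.25, q_min=-8, q_max=7):
    """Train a latent weight with the STE and return the quantized value at each step."""
    latent_w = 0.0
    history = []
    for _ in range(steps):
        w_q = fake_quant(latent_w, scale, q_min, q_max)
        history.append(w_q)
        # STE: the gradient of 0.5 * (w_q - target)^2 is applied directly to the
        # latent weight (the weight stays in range here, so no clamp mask is needed).
        latent_w -= lr * (w_q - target)
    return history


def count_flips(values):
    """Number of steps at which the quantized value changed."""
    return sum(1 for a, b in zip(values, values[1:]) if a != b)


if __name__ == "__main__":
    tail = run()[-100:]
    print("values visited late in training:", sorted(set(tail)))
    print("flips in the last 100 steps:", count_flips(tail))
```

Because the gradient always points toward the off-grid target, the latent weight is pushed up while the quantized value is 0.75 and pushed back down as soon as it rounds to 1.0, so it hovers at the rounding boundary and never settles.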
Finally, simulated quantization does not itself make anything faster. The speed and memory benefits are realized only at inference, and only if the deployment hardware and kernels actually support the chosen format, such as int8 or int4 matrix multiplication or ternary-weight kernels, so QAT is normally co-designed with the inference stack it targets. Even so, QAT remains the most reliable way to push neural networks to the lowest bit-widths while retaining accuracy, and it is the method of record for the binary and ternary LLMs that PTQ cannot reach. [1][8][9]