Quantization

Updated Jun 20, 2026

Quantization in machine learning and artificial intelligence is the process of reducing the numerical precision of a neural network's values (weights, activations, and sometimes gradients) from high-precision formats such as 32-bit floating-point (FP32) to lower-precision representations like 8-bit integers (INT8) or 4-bit integers (INT4). For the common FP32-to-INT8 case, this compression technique typically achieves a 4x model size reduction and 2-4x inference speedup with minimal accuracy loss.^[1]^[2] Quantization has become essential for deploying large language models and other large-scale models on resource-constrained devices, reducing computational costs, and enabling real-time inference.

In short, quantization trades a small, often imperceptible loss of numerical precision for large gains in model size, speed, memory bandwidth, and energy use. The landmark results are concrete: GPTQ can quantize a 175 billion parameter model in roughly four GPU hours and run it inside a single GPU for the first time;^[18] LLM.int8() performs inference in models up to 175B parameters "without any performance degradation" while keeping more than 99.9% of values in 8-bit;^[17] and BitNet b1.58 trains models in which "every single parameter (or weight) of the LLM is ternary {-1, 0, 1}" yet still matches a full-precision Transformer of the same size.^[22]

Originally explored in the 1990s for early neural networks, quantization experienced explosive growth after 2022 with the emergence of large language models, evolving from simple 8-bit post-training methods to sophisticated 1-bit architectures trained natively at low precision. Modern quantization enables running 70 billion parameter language models on consumer GPUs, deploying computer vision models on smartphones, and executing AI inference on edge devices with 40-80% energy savings.^[3]^[4] Recent breakthroughs like Microsoft's BitNet demonstrate that models can be trained from scratch with ternary weights (values constrained to -1, 0, or +1) while matching full-precision performance, fundamentally challenging assumptions about the precision requirements of deep learning. As models continue scaling to trillions of parameters, quantization has transformed from an optimization technique into a necessity for practical AI deployment.

What is quantization used for?

Quantization is used to make trained neural networks smaller, faster, and cheaper to run without retraining them from scratch. Its three primary payoffs are memory reduction (an FP32 to INT8 conversion shrinks the weights by 4x, so a 7 billion parameter model drops from 28GB to 7GB), inference acceleration (1.5-4x on modern hardware), and energy savings (commonly 40-80% on edge devices).^[1]^[28] In practice this is what brings a 70B-parameter LLM from about 140GB of FP16 weights down to roughly 35GB at 4 bits (within reach of a single 48GB GPU or a pair of 24GB consumer GPUs), lets computer vision models run on a phone's neural accelerator, and lets battery-powered robots and vehicles perform real-time inference within a fixed power budget. For large language models specifically, weight-only 4-bit quantization (via methods like GPTQ and AWQ) and 8-bit weight-and-activation quantization (via SmoothQuant) are the dominant deployment recipes, and KV-cache quantization extends usable context length within the same memory.
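The memory arithmetic can be checked with a few lines of Python. This is a rough sketch: the helper name `weight_memory_gb` is illustrative, and it counts only the weights, ignoring scales, zero-points, activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


# A 7-billion-parameter model at several precisions.
for bits in (32, 16, 8, 4):
    print(f"7B parameters at {bits:>2} bits: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under the same assumptions, the weights of a 70B-parameter model take about 35 GB at 4 bits per weight.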

Explain like I'm 5 (ELI5)

Imagine you have a box of 64 crayons with very specific colors like "cerulean blue" and "burnt sienna." You can draw detailed pictures, but the box is big and heavy. Now imagine replacing it with a box of just 8 basic crayons: red, blue, green, yellow, and so on. Your drawings will look almost the same to most people, but your crayon box is now much smaller and lighter, and you can color faster because you spend less time picking between similar shades.

Quantization does something similar with AI models. A model stores millions or billions of numbers (called weights) that tell it how to make decisions. Normally these numbers are stored with extreme precision, like writing a measurement as "3.14159265." Quantization rounds those numbers to simpler values, like "3.14" or even just "3." The model gets smaller, runs faster, and uses less energy, while still giving answers that are nearly as good as before.

To put it in everyday terms: if someone asks you what time it is, you might glance at your watch and say "10:20" instead of "10:21:37.4 seconds." That small loss of precision almost never matters in practice. Quantization applies this same principle to AI models.^[5]

Core principles and mathematical foundations

Quantization maps continuous floating-point values to a discrete set of integers through an affine transformation defined by two parameters: scale and zero-point. The fundamental quantization equation relates a floating-point value x to its quantized integer representation x_q through the formula:^[1]^[6]

x = S × (x_q - Z)

where S represents the scale factor (a positive floating-point number) and Z represents the zero-point (an integer ensuring exact representation of zero).

The forward quantization process applies the inverse mapping with clipping:

x_q = clip(round(x / S + Z), alpha_q, beta_q)

where alpha_q and beta_q define the quantization range. For b-bit quantization, typical ranges include [-128, 127] for signed 8-bit integers or [0, 255] for unsigned representations. The scale and zero-point parameters derive from the floating-point range [alpha, beta] through:

S = (beta - alpha) / (beta_q - alpha_q)

Z = round((beta × alpha_q - alpha × beta_q) / (beta - alpha))

This affine scheme generalizes to symmetric quantization when the floating-point range centers around zero. Symmetric quantization enforces Z = 0, simplifying computation by eliminating zero-point adjustments.^[7]
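A minimal Python sketch of the affine mapping follows. The function names are illustrative, not taken from any library, and the zero-point is computed from the requirement that alpha maps to alpha_q.

```python
def affine_params(alpha, beta, alpha_q=-128, beta_q=127):
    """Scale S and zero-point Z that map [alpha, beta] onto [alpha_q, beta_q]."""
    scale = (beta - alpha) / (beta_q - alpha_q)
    # Z is chosen so that x = alpha lands on x_q = alpha_q.
    zero_point = round(alpha_q - alpha / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, alpha_q=-128, beta_q=127):
    """x_q = clip(round(x / S + Z), alpha_q, beta_q)."""
    return max(alpha_q, min(beta_q, round(x / scale + zero_point)))


def dequantize(x_q, scale, zero_point):
    """x = S * (x_q - Z)."""
    return scale * (x_q - zero_point)


scale, zero_point = affine_params(alpha=-1.0, beta=3.0)
x_q = quantize(0.5, scale, zero_point)
x_hat = dequantize(x_q, scale, zero_point)   # close to 0.5, within half a step
```

Because Z is an integer, the real value 0.0 round-trips exactly, and inputs outside [alpha, beta] saturate at the ends of the integer range.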

Symmetric vs. asymmetric quantization

Feature	Symmetric quantization	Asymmetric (affine) quantization
Real value range	[-alpha, alpha] (centered at 0)	[min, max] (not necessarily centered)
Zero-point (Z)	Fixed at 0	Calculated integer value
Mapping formula	x = S × x_q	x = S × (x_q - Z)
Pros	Computationally faster, simpler	More flexible, better represents skewed data
Cons	May waste integer range if data not zero-centered	Slightly more computational overhead
Typical use	Weights (usually zero-centered)	Activations (often skewed, e.g. post-ReLU)

Symmetric quantization assumes a symmetric range around zero (for example [-127, 127] for INT8), setting Z = 0. For INT8 symmetric quantization, the range becomes [-alpha, alpha] mapped to [-127, 127], deliberately excluding -128 to maintain perfect symmetry. This choice sacrifices one quantization level but enables computational speedups by removing addition operations from the dequantization formula, reducing it to x = S × x_q.^[1]^[2]

Asymmetric (affine) quantization uses a non-zero Z to shift the range, better handling skewed distributions like activations after ReLU or similar functions, which have ranges [0, +max).^[7]
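To see why skewed data favors the asymmetric scheme, the sketch below compares the step size (scale) each scheme would pick for the same non-negative values. The sample values and function names are made up for illustration.

```python
def symmetric_scale(values, q_max=127):
    """Symmetric INT8: [-max|x|, max|x|] -> [-127, 127], zero-point fixed at 0."""
    return max(abs(v) for v in values) / q_max


def asymmetric_scale(values, q_min=-128, q_max=127):
    """Asymmetric INT8: [min, max] -> [-128, 127], zero-point shifts the range."""
    return (max(values) - min(values)) / (q_max - q_min)


post_relu = [0.0, 0.3, 1.2, 2.5, 6.0]    # non-negative, like ReLU outputs
sym = symmetric_scale(post_relu)         # half of the integer range goes unused
asym = asymmetric_scale(post_relu)       # all 256 levels cover [0, 6]

zero_centered = [-0.5, -0.1, 0.0, 0.2, 0.5]
sym_w = symmetric_scale(zero_centered)
asym_w = asymmetric_scale(zero_centered)
```

For the non-negative data the symmetric step is about twice as coarse as the asymmetric one, while for the zero-centered values the two steps are nearly identical, which is why the cheaper symmetric scheme is the usual choice for weights.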

Quantization granularity

The granularity at which quantization parameters are computed significantly affects both accuracy and computational overhead. Three main levels of granularity exist:

Granularity	Description	Accuracy	Overhead
Per-tensor	One scale and zero-point for the entire tensor	Lowest	Lowest
Per-channel	Separate parameters for each output channel	Higher	Moderate
Per-group	Separate parameters for groups of values within a channel	Highest	Highest

Per-tensor quantization uses one scale and zero-point for an entire tensor (for example, all weights in a layer share the same quantization parameters). This is the simplest and fastest approach but can lose accuracy when value distributions vary across channels.^[8]^[9]

Per-channel quantization (also called per-axis) allows each output channel (or each filter) in a layer to have its own scale and zero-point. Per-channel quantization is commonly used for convolutional and fully-connected layer weights because the distribution of weights can differ substantially between channels. Using a separate scale for each channel often yields better accuracy, since it adapts to each set of values more closely.^[10]

Per-group quantization divides each channel into smaller groups (commonly 32, 64, or 128 values) and computes separate quantization parameters for each group. This approach has become especially important for LLM quantization methods like GPTQ, AWQ, and bitsandbytes. It offers the best accuracy preservation because it captures local variations within a channel, though it incurs additional memory overhead from storing more scale and zero-point values. The GGUF format, for example, uses super-blocks of 256 values subdivided into 8 sub-blocks of 32 values each, with quantization parameters at both levels.^[11]

Activations are typically quantized per-tensor because their statistics can change with every input batch, making it less practical to have distinct parameters per channel.
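The effect of granularity can be illustrated on a toy weight matrix whose two output channels differ in magnitude by orders of magnitude. This is a sketch in plain Python with made-up values. The error ordering it produces holds for this data and is not guaranteed for every tensor.

```python
def fake_quantize(values, q_max=127):
    """Symmetric quantize-dequantize of values that share a single scale."""
    scale = max(abs(v) for v in values) / q_max or 1.0   # 1.0 guards all-zero groups
    return [round(v / scale) * scale for v in values]


def quantization_error(weights, group_size=None):
    """Mean absolute error; group_size=None means one scale for the whole tensor."""
    flat = [v for row in weights for v in row]
    if group_size is None:
        approx = fake_quantize(flat)
    else:
        approx = []
        for row in weights:                               # each row is one output channel
            for start in range(0, len(row), group_size):
                approx.extend(fake_quantize(row[start:start + group_size]))
    return sum(abs(a - b) for a, b in zip(flat, approx)) / len(flat)


weights = [
    [8.0, -6.5, 3.2, -1.1],          # channel with large weights
    [0.02, -0.015, 0.007, -0.003],   # channel with tiny weights
]
per_tensor = quantization_error(weights)
per_channel = quantization_error(weights, group_size=4)   # one group per row
per_group = quantization_error(weights, group_size=2)
```

With a single per-tensor scale, every weight in the small channel rounds to zero. Giving each channel or group its own scale avoids this at the cost of storing more scales.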

Weight quantization vs. activation quantization

Quantization can be applied to different parts of a neural network:

Weight quantization reduces the precision of the learned parameters stored in the model. Weights are static after training and their distributions are known ahead of time, making them relatively straightforward to quantize. Weight-only quantization is the most common approach for LLMs, since it directly reduces model size and memory bandwidth requirements during inference.
Activation quantization reduces the precision of the intermediate outputs computed during the forward pass. Activations are input-dependent and their ranges can vary significantly across different inputs, making them harder to quantize accurately. Activation quantization requires either calibration (static quantization) or runtime range computation (dynamic quantization).
Weight-and-activation quantization (e.g., W8A8) quantizes both components, enabling fully integer arithmetic during inference. This provides the greatest speedup but demands careful handling of activation distributions.

Calibration techniques

Calibration determines the ranges for activations in static quantization:

Min-Max: Uses observed minimum and maximum values from a calibration dataset
Mean Square Error (MSE): Minimizes the quantization error between original and quantized values
Entropy (KL divergence): Minimizes information loss using Kullback-Leibler divergence between original and quantized distributions
Percentile: Clips outliers using percentiles (typically 99-99.9%), often providing superior results by excluding extreme outliers that would otherwise force wasteful quantization ranges^[12]
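The difference between min-max and percentile calibration shows up clearly when a single extreme outlier is present. The sketch below uses synthetic Gaussian "activations" plus one outlier. The data, the 99.9 percentile, and the helper names are assumptions chosen for illustration.

```python
import random


def fake_quantize(values, threshold, q_max=127):
    """Symmetric INT8 quantize-dequantize with clipping at +/- threshold."""
    scale = threshold / q_max
    return [max(-q_max, min(q_max, round(v / scale))) * scale for v in values]


def percentile_threshold(values, pct=99.9):
    """Magnitude below which roughly `pct` percent of the samples fall."""
    magnitudes = sorted(abs(v) for v in values)
    index = min(len(magnitudes) - 1, int(len(magnitudes) * pct / 100))
    return magnitudes[index]


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


random.seed(0)
activations = [random.gauss(0.0, 1.0) for _ in range(10_000)] + [100.0]  # one outlier

minmax_t = max(abs(v) for v in activations)      # min-max: range set by the outlier
pct_t = percentile_threshold(activations)        # percentile: outlier gets clipped

minmax_err = mean_abs_error(activations, fake_quantize(activations, minmax_t))
pct_err = mean_abs_error(activations, fake_quantize(activations, pct_t))
```

Min-max stretches the 8-bit grid to cover the outlier, so typical values land on only a handful of levels. The percentile threshold clips the outlier heavily but represents everything else far more finely, giving a lower mean absolute error on this data.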

History and evolution

The foundational concepts of neural network quantization emerged in the early 1990s when researchers first explored converting floating-point parameters to low-precision datatypes. Balzer and colleagues published pioneering work on weight quantization for Boltzmann machines in 1991, while Choudry introduced "continuous-discrete learning" that applied quantization during training. These early efforts remained largely academic curiosities until the 2010s, when the success of deep learning on ImageNet reignited interest in compression techniques.^[13]

The 2015 publication of BinaryConnect by Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David marked the breakthrough moment for modern quantization research. Released on November 2, 2015, this paper demonstrated that convolutional neural networks could train with binary weights during forward and backward propagation, achieving near state-of-the-art results on MNIST, CIFAR-10, and SVHN benchmarks. BinaryConnect kept full-precision weights for the parameter update and let gradients computed with the binarized weights flow back to them, the idea behind the Straight-Through Estimator (STE), which Hinton and, later, Bengio and colleagues had described for non-differentiable units. The STE handles non-differentiable quantization functions during backpropagation by approximating their gradient as the identity function, and it became the de facto standard for training quantized networks.^[13]

The momentum continued with BinaryNet in February 2016, extending binarization to both weights and activations. By constraining all values to {-1, +1}, BinaryNet achieved a remarkable 31.3x memory footprint reduction compared to 32-bit floating-point while maintaining acceptable accuracy.^[14] Han and colleagues' Deep Compression work in 2015 combined pruning, quantization, and Huffman coding, demonstrating that AlexNet and VGG could be compressed by 35x and 49x respectively without accuracy loss.^[15]

The field matured significantly between 2017 and 2021 as researchers systematically explored quantization's capabilities and limitations. Two comprehensive surveys published in 2021, one by Gholami and colleagues at UC Berkeley and another white paper by Qualcomm AI Research, consolidated understanding of post-training quantization and quantization-aware training methodologies.^[16]^[12] These works established that 8-bit quantization typically incurs less than 1% accuracy loss for convolutional networks, while lower bit-widths require careful quantization-aware training to maintain performance.

The large language model era (2022-present)

The explosion of large language models in 2022-2023 catalyzed a quantization revolution focused specifically on transformer architectures. LLM.int8() by Tim Dettmers (August 2022) pioneered mixed-precision decomposition for handling outlier features in attention mechanisms, enabling 8-bit inference for models exceeding 30 billion parameters on single GPUs. This work revealed that transformer models develop extreme outlier activations (magnitude 100x larger than typical values) in specific feature dimensions, requiring special treatment for successful quantization. The paper found these outliers are "highly systematic," emerging in only a handful of feature dimensions, and its mixed-precision decomposition isolates them into a 16-bit matrix multiplication while "more than 99.9% of values are multiplied in 8-bit," enabling inference "with up to 175B parameters without any performance degradation."^[17]

GPTQ followed in October 2022, applying layer-wise post-training quantization based on approximate second-order information. GPTQ successfully quantizes language models to 4-bit, 3-bit, and even 2-bit precision using Hessian-based error minimization and redistribution of each quantization error onto the not-yet-quantized weights within the same layer. Its authors report that GPTQ "can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation," allowing "for the first time" the execution of a 175B model inside a single GPU for generative inference.^[18] QLoRA emerged in May 2023, combining 4-bit quantization with Low-Rank Adapters to enable fine-tuning of 65 billion parameter models on single 48GB GPUs while preserving full 16-bit task performance. QLoRA introduced NormalFloat4 (NF4), an information-theoretically optimal quantization format for normally distributed weights, along with double quantization to reduce memory overhead by quantizing the quantization constants themselves; its Guanaco model reached "99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU."^[19]

AWQ (Activation-aware Weight Quantization) appeared in June 2023, earning the MLSys 2024 Best Paper Award for its innovation in protecting salient weights based on activation distributions.^[20] SmoothQuant, published at ICML 2023, introduced a mathematically equivalent transformation to migrate quantization difficulty from activations to weights, enabling efficient W8A8 quantization of LLMs up to 530 billion parameters.^[21]

The year 2024 marked the arrival of 1-bit large language models with Microsoft Research's BitNet series. BitNet b1.58, published February 27, 2024, demonstrated that every parameter in a large language model could be constrained to ternary values {-1, 0, +1} (effectively 1.58 bits per parameter) while matching full-precision performance on perplexity and downstream tasks. The authors argue that "the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective," and that it "enables a new computation paradigm" because ternary weights replace floating-point multiplications with integer additions.^[22]

Landmark quantization methods and milestones

Method / milestone	Year	Key result	Source
BinaryConnect	2015	First practical training of CNNs with binary weights; popularized straight-through gradient estimation	^[13]
Deep Compression	2015/2016	35x (AlexNet) and 49x (VGG-16) compression with no accuracy loss via pruning + quantization + Huffman coding	^[15]
BinaryNet	2016	Binary weights and activations ({-1, +1}); about 31.3x memory reduction	^[14]
LLM.int8()	Aug 2022	8-bit inference up to 175B parameters with no performance degradation; >99.9% of values in 8-bit	^[17]
GPTQ	Oct 2022	Quantizes 175B models in about 4 GPU-hours to 3-4 bits; first 175B model on a single GPU	^[18]
SmoothQuant	2022/2023	Training-free W8A8 for LLMs up to 530B; up to 1.56x speedup, 2x memory reduction	^[21]
QLoRA	May 2023	4-bit NF4 finetuning of 65B models on one 48GB GPU; Guanaco reaches 99.3% of ChatGPT on Vicuna	^[19]
AWQ	Jun 2023	Protects ~1% salient weights; >3x speedup over FP16; MLSys 2024 Best Paper	^[20]
BitNet b1.58	Feb 2024	Ternary weights {-1, 0, +1} (1.58 bits) matching full-precision Transformers	^[22]

Quantization methodologies

Quantization strategies divide into two fundamental approaches based on timing: post-training quantization applies to already-trained models, while quantization-aware training incorporates quantization effects during the training process.

Post-training quantization (PTQ)

Post-Training Quantization converts a model's weights and/or activations to lower precision after the model has been fully trained in high precision.^[2]^[12] PTQ is widely used due to its simplicity and speed; it does not require retraining the model or having access to the original training dataset and pipeline. However, because the model was not trained with quantization in mind, PTQ can lead to a more significant drop in accuracy compared to QAT, especially when quantizing to very low bit-widths (below 8 bits).^[23]

Dynamic quantization

In dynamic quantization (also known as "dynamic range quantization"), the model's weights are quantized offline, but the activations are quantized on the fly during inference.^[2]^[1] For each input fed to the model, the range (min/max values) of the activation tensors is calculated at runtime. These dynamic ranges are then used to compute the quantization parameters for the activations for that specific inference pass.

The main advantage of dynamic quantization is its flexibility and robustness. It does not require a calibration dataset and can adapt to varying input data distributions, which often results in higher accuracy than static quantization, particularly for models like LLMs where the activation ranges can vary dramatically depending on the input prompt.^[24] The trade-off is performance: the runtime computation of activation ranges introduces computational overhead, making inference slower than with static quantization.

Static quantization

In static quantization, both the model's weights and its activations are quantized offline, before inference begins.^[24]^[8] While the range of the weights is known from the trained model, the range of the activations is input-dependent and must be estimated. This is achieved through a calibration step. During calibration, a small but representative dataset (typically a few hundred samples) is passed through the floating-point model, and observers record the statistical distribution of the activations at each layer.^[1]^[10]

The primary advantage of static quantization is its high inference speed. Since all quantization parameters are pre-computed, the entire inference process can be executed using highly efficient integer-only arithmetic, with no runtime overhead for calculating scales or zero-points. This makes it ideal for latency-critical applications on edge devices where the input data distribution is relatively stable and predictable.

Feature	Static PTQ	Dynamic PTQ
Weights quantization	Offline (pre-computed)	Offline (pre-computed)
Activations quantization	Offline (using calibration data)	On-the-fly (at runtime)
Calibration required?	Yes, needs representative dataset	No
Inference speed	Very fast (integer-only arithmetic)	Slower (runtime overhead)
Accuracy	Good, but sensitive to data shifts	Often higher, more robust
Ideal use case	Edge devices, CNNs for vision	Server-side LLMs with varied inputs
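The contrast in the table can be sketched in a few lines of Python. The class and function names are hypothetical, and for brevity the range offset is kept as a float rather than rounded to an integer zero-point.

```python
def fake_quantize(values, lo, hi, levels=255):
    """Quantize-dequantize values onto `levels` uniform steps spanning [lo, hi]."""
    scale = (hi - lo) / levels or 1.0            # 1.0 guards a constant tensor
    out = []
    for v in values:
        code = max(0, min(levels, round((v - lo) / scale)))   # clipped integer code
        out.append(lo + code * scale)
    return out


class StaticQuantizer:
    """Static PTQ: the activation range is frozen after calibration."""

    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")

    def observe(self, activations):
        self.lo = min(self.lo, min(activations))
        self.hi = max(self.hi, max(activations))

    def __call__(self, activations):
        return fake_quantize(activations, self.lo, self.hi)


def dynamic_quantize(activations):
    """Dynamic PTQ: the range is recomputed from each incoming tensor."""
    return fake_quantize(activations, min(activations), max(activations))


static = StaticQuantizer()
for batch in ([-1.0, 0.2, 0.5], [-0.3, 1.0, 0.1]):    # calibration batches
    static.observe(batch)

shifted_input = [-4.0, 0.5, 4.0]                # falls outside the calibrated range
static_out = static(shifted_input)              # extremes are clipped to about -1 and 1
dynamic_out = dynamic_quantize(shifted_input)   # range adapts to [-4, 4]
```

The static path does no range computation at inference time but clips inputs that drift from the calibration data. The dynamic path pays for a min/max pass on every tensor and follows the shift.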

Quantization-aware training (QAT)

Quantization-Aware Training integrates the quantization process directly into the model training or fine-tuning phase.^[25]^[26] While more complex and computationally intensive than PTQ, QAT generally achieves higher accuracy, often recovering nearly all of the performance of the original floating-point model.

Simulating quantization with fake-quantize operators

The core mechanism of QAT is the simulation of low-precision arithmetic during training. This is accomplished by inserting "fake quantization" (or quantize-dequantize) nodes into the model's computation graph, typically after layers that produce weights and activations.^[26]^[9]

In the forward pass of training, these nodes perform a three-step operation:

Take a high-precision (FP32) input tensor
Quantize the tensor to a low-precision integer format (e.g. INT8), which simulates the rounding and clipping errors
Immediately dequantize the tensor back to FP32

The resulting FP32 tensor now carries the "imprint" of quantization error. This error-injected tensor is then passed to the next layer. By doing this, the model's loss function is directly exposed to the effects of quantization throughout the training process. This forces the optimization algorithm (e.g. SGD, Adam) to find a set of weights that is not only good at the task but also robust to the noise and reduced precision of the quantized domain.^[26]

Gradient approximation with the Straight-Through Estimator

A critical challenge in QAT is that the rounding operation inherent in quantization is non-differentiable. Its derivative is zero almost everywhere, which would block the flow of gradients during backpropagation and halt the training process.^[25]^[9]

To overcome this, QAT relies on the Straight-Through Estimator (STE). The STE is an approximation for the gradient of the non-differentiable quantization function. During the backward pass, the STE treats the quantization node as an identity function, passing the gradient from its output directly to its input without modification.^[26]^[27] In essence, while the forward pass sees the effects of quantization, the backward pass "looks through" the problematic rounding operation, allowing gradients to flow and the model's full-precision weights to be updated effectively.
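A toy illustration of fake quantization with a straight-through backward pass, written without an autograd framework, is given below. It trains one latent full-precision weight so that its quantized value fits y = 0.75·x. The grid, learning rate, and the clipped variant of the STE (gradient zeroed outside the representable range) are assumptions made for this sketch.

```python
SCALE, Q_MIN, Q_MAX = 0.25, -8, 7     # a 4-bit grid with step 0.25


def fake_quantize(w):
    """Forward pass: quantize to the grid, then immediately dequantize."""
    return max(Q_MIN, min(Q_MAX, round(w / SCALE))) * SCALE


def ste_backward(w, upstream_grad):
    """Backward pass: treat rounding as identity; block the gradient if w was clipped."""
    inside = Q_MIN * SCALE <= w <= Q_MAX * SCALE
    return upstream_grad if inside else 0.0


xs = [1.0, 2.0, 3.0]
ys = [0.75 * x for x in xs]           # target weight 0.75 lies on the grid

w = 0.0                               # latent full-precision weight
lr = 0.02
for _ in range(50):
    w_q = fake_quantize(w)            # the loss only ever sees the quantized weight
    grad_wq = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * ste_backward(w, grad_wq)   # the update is applied to the latent weight
```

With the true gradient of the rounding step (zero almost everywhere) the latent weight would never move. With the STE it drifts until its quantized value reaches 0.75. When the target lies between grid points, the latent weight can hover near a rounding boundary instead of settling.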

Aspect	PTQ	QAT
Complexity	Low; simple to apply	High; requires retraining/fine-tuning
Data requirement	Small calibration set or none	Requires training dataset
Computational cost	Low; fast conversion	High; significant compute for fine-tuning
Model accuracy	Good, but can degrade at <8 bits	Excellent; near-original accuracy
When to use	Rapid deployment, no training access	When accuracy is paramount, low bit-widths

How does PTQ differ from QAT?

The practical difference is when quantization happens and how much it costs. Post-training quantization (PTQ) is applied after a model is fully trained, needs only a small calibration set or none at all, runs in minutes to a few GPU-hours, and is the default for rapid deployment; its weakness is accuracy loss below 8 bits. Quantization-aware training (QAT) folds simulated low-precision arithmetic into the training or fine-tuning loop using fake-quantize nodes and the Straight-Through Estimator, requires the training dataset and substantial compute, but typically recovers nearly all of the original floating-point accuracy and is preferred when accuracy is paramount or when targeting very low bit-widths. For modern LLMs, PTQ methods (GPTQ, AWQ, SmoothQuant) dominate because retraining a multi-billion-parameter model is expensive, whereas QAT remains common for compact vision and on-device models where the accuracy cliff at low precision is steepest.^[12]^[23]

Precision levels and data formats

INT8 quantization

INT8 quantization represents the production standard, offering 4x model size reduction and 2-4x inference speedup with typically less than 1% accuracy loss.^[28]^[29] Signed 8-bit integers span the range [-128, 127] and unsigned variants span [0, 255]; either way, 256 discrete levels are available to represent continuous floating-point values. Hardware support for INT8 operations is nearly universal across modern processors: NVIDIA Tensor Cores accelerate INT8 matrix multiplication on Turing and newer architectures, Intel CPUs provide optimized VNNI instructions, ARM processors include INT8 NEON extensions, and Google's Edge TPU executes exclusively in INT8.^[7]

Intel's comprehensive study of 69 models on x86 CPUs demonstrates INT8 quantization achieving 2.97x geometric mean speedup compared to FP32, with individual models like MobileNetV2 reaching 3.94x speedup at batch size 64. ResNet-50 demonstrates typical INT8 characteristics: accuracy drops from 70.07% to 69.85% (0.22% loss), while inference accelerates by 1.59-1.65x on CPUs and approximately 2x on GPUs.^[7]

INT4 and lower bit-widths

INT4 quantization occupies the frontier of practical deployment for large language models. 4-bit precision achieves 8x model size reduction relative to FP32, so the weights of a 70 billion parameter model shrink from about 280GB to about 35GB, while models of roughly 30 billion parameters fit within the 24GB of a consumer GPU.^[30] Methods like GPTQ and AWQ demonstrate that 4-bit quantization of LLM weights maintains high quality with proper calibration, typically achieving 98-99% accuracy recovery compared to full-precision baselines.

Pushing below 4 bits presents escalating challenges. INT2 quantization compresses models by 16x but frequently causes substantial accuracy degradation without sophisticated techniques. Research shows 2-bit GPTQ quantization of LLaMA-65B decreases LAMBADA accuracy from 79% to 57%, with mathematical reasoning particularly affected (suffering up to 32.39% accuracy loss).^[18] Recent innovations like Vector Post-Training Quantization (VPTQ) achieve 95% accuracy preservation at 2 bits through vector-wise quantization and advanced codebook optimization.^[31]

Binary and ternary neural networks

Binary neural networks (BNNs) and Ternary neural networks (TNNs) represent the extreme end of the precision spectrum. BinaryNet demonstrated feasibility in 2016 by constraining weights and activations to {-1, +1}, achieving 32x compression and replacing multiply-accumulate operations with XNOR and bit-counting.^[14]

BitNet b1.58 uses ternary weights {-1, 0, +1} with 8-bit activations, achieving approximately 21x compression.^[22] The publicly released BitNet b1.58 2B4T model occupies merely 0.4GB compared to 4-5GB for full-precision 2 billion parameter models. Specialized inference frameworks like bitnet.cpp achieve 1.37-6.17x speedup on CPUs with 55-82% energy reduction compared to full-precision inference.^[32]

Ternary neural networks represent a compromise between binary and higher precision. Weights are constrained to three values, typically {-W, 0, +W}, where W is a learnable, layer-specific scaling factor.^[33] The inclusion of an explicit zero state introduces sparsity into the weight matrices, which can be exploited by hardware to skip computations involving zero, leading to further energy savings.
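A ternary quantizer in the spirit of the absmean scheme described for BitNet b1.58 can be sketched as follows. It scales by the mean absolute weight, rounds, and clips to {-1, 0, +1}. This is an illustrative reading of that scheme rather than the reference implementation, and the weights are made-up values.

```python
import math


def ternarize(weights, eps=1e-8):
    """Scale by the mean |w|, round to the nearest integer, clip to {-1, 0, +1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return codes, gamma


weights = [0.9, -0.05, 0.4, -1.3, 0.1, -0.6]
codes, gamma = ternarize(weights)
dequantized = [gamma * c for c in codes]     # values in {-gamma, 0, +gamma}

bits_per_weight = math.log2(3)               # three states carry about 1.58 bits
```

Small weights collapse to 0, which is the source of the sparsity mentioned above, and the single scale `gamma` plays the role of the shared magnitude W.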

FP16, BF16, and FP8 formats

FP16 (16-bit floating-point) and BF16 (Brain Floating Point 16) provide 2x compression compared to FP32 while maintaining floating-point representation benefits. FP16 uses 1 sign bit, 5 exponent bits, and 10 mantissa bits, offering higher precision than BF16 but a smaller dynamic range. BF16 uses the same 8-bit exponent as FP32 (with 7 mantissa bits), providing wider dynamic range than FP16, which makes it more stable for training large models and less prone to underflow/overflow issues.^[7]

FP8 (8-bit floating-point) formats have emerged as a strong alternative to INT8 for transformer models. The E5M2 (5-bit exponent, 2-bit mantissa) and E4M3 (4-bit exponent, 3-bit mantissa) formats, proposed jointly by NVIDIA, Arm, and Intel in 2022 and later published as an Open Compute Project specification, provide wider dynamic range than fixed-point integers, better handling the heavy-tailed activation distributions characteristic of large language models.^[34] NVIDIA's H100 GPUs with Hopper architecture provide native FP8 support, achieving 1.95x speedup for Stable Diffusion XL compared to FP16.
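The trade-off between exponent and mantissa bits can be made concrete by computing the largest finite value and the relative step size of each format. The sketch assumes IEEE-754-style encoding rules, which hold for FP32, FP16, BF16, and E5M2. E4M3 is left out because it departs from those rules to extend its range.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-754-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = (2 ** exponent_bits - 2) - bias   # all-ones exponent is inf/NaN
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


def relative_step(mantissa_bits):
    """Spacing between neighbouring values just above 1.0 (machine epsilon)."""
    return 2.0 ** -mantissa_bits


formats = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7), "FP8 E5M2": (5, 2)}
for name, (e_bits, m_bits) in formats.items():
    print(f"{name:9s} max={max_finite(e_bits, m_bits):.4g}  step={relative_step(m_bits):.3g}")
```

FP16 tops out at 65504 with a relative step of about 0.001, whereas BF16 reaches roughly the same maximum as FP32 with a step of about 0.008: more range, less precision.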

Format	Bit width	Description	Typical use cases
FP32	32	Full-precision floating-point; training baseline	High-accuracy scenarios, training
FP16	16	Half-precision floating-point; 5-bit exponent, 10-bit mantissa	GPU inference, mixed-precision training
BF16	16	Brain Floating Point; wider exponent for stability	Training large models, TPUs
FP8	8	E5M2 or E4M3; floating-point with reduced precision	LLM inference on Hopper GPUs
INT8	8	8-bit integer; industry standard	Mobile, edge devices, production inference
INT4	4	4-bit integer; aggressive compression	LLMs on consumer GPUs
INT2	2	2-bit integer; extreme compression	Research, specialized applications
NF4	4	NormalFloat4; optimal for normal distributions	QLoRA, LLM fine-tuning
FP4	4	4-bit floating-point; supported on Blackwell GPUs	Next-generation inference
Ternary	~1.58	Values in {-1, 0, +1}	BitNet, ultra-efficient inference
Binary	1	Values in {-1, +1}	Extreme edge deployment

Mixed-precision quantization

Mixed-precision quantization assigns different bit-widths to different layers or components based on sensitivity analysis. Hessian-based methods compute second-order information to identify layers where quantization most impacts the loss function, preserving higher precision for sensitive layers.^[12] Common mixed-precision strategies include:

W4A8: 4-bit weights, 8-bit activations
W6A6: 6-bit for both weights and activations
W4A16: 4-bit weights, 16-bit (FP16 or BF16) activations
W8A8: 8-bit for both (most common balanced approach)

For vision transformers, research demonstrates that multi-head self-attention modules require higher precision than feed-forward networks, with projection layers being most sensitive and fully-connected layers tolerating aggressive quantization.^[35]

Mixed-precision training

Mixed-precision training is a related but distinct concept from mixed-precision quantization. It refers to the practice of using a mix of numerical precisions during the training process itself (not just for inference). The standard approach, introduced by Micikevicius et al. at NVIDIA in 2018, uses FP16 or BF16 arithmetic for most forward and backward pass computations while maintaining FP32 "master weights" for the parameter update step.^[36]

The technique has three core components:

FP32 master weights: A primary copy of the model weights is kept in FP32. This ensures that weight updates, which involve small gradient values accumulated over time, do not suffer from the limited precision of 16-bit formats.
FP16/BF16 forward and backward passes: The forward pass and gradient computation use half-precision arithmetic. On NVIDIA GPUs, Tensor Cores can execute FP16 matrix operations at up to 16x the throughput of FP32 operations (for example, the A100 delivers approximately 312 TFLOPS in FP16/BF16 versus 19.5 TFLOPS in FP32).
Loss scaling: Because FP16 has a significantly smaller dynamic range than FP32, gradient values can underflow to zero during backpropagation. Loss scaling multiplies the computed loss by a large factor before the backward pass, shifting gradient magnitudes into FP16's representable range. After computing gradients, the scale factor is divided out before the weight update. Dynamic loss scaling automatically adjusts this factor during training.
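Loss scaling can be demonstrated with Python's `struct` module, whose `"e"` format code stores a value as IEEE half precision. The gradient value and the scale factor of 1024 are arbitrary choices for illustration.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_grad = 1e-8           # a gradient too small for FP16 to represent
loss_scale = 1024.0

unscaled = to_fp16(tiny_grad)                  # underflows to 0.0
scaled = to_fp16(tiny_grad * loss_scale)       # survives as a small FP16 value
recovered = scaled / loss_scale                # unscale in higher precision
```

Scaling the loss scales every gradient by the same factor, so the value survives the FP16 backward pass and is divided back out before the FP32 master weights are updated.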

BF16 has largely supplanted FP16 for mixed-precision training on modern hardware. Because BF16 shares the same 8-bit exponent as FP32, it has a comparable dynamic range, which means loss scaling is often unnecessary or requires only simple static scaling. Google TPUs and NVIDIA Ampere (and later) GPUs natively support BF16 arithmetic. Mixed-precision training is not quantization in the strict sense (no conversion to integer types), but it represents a closely related form of precision reduction that is standard practice for training virtually all large-scale models today.

Advanced quantization algorithms

GPTQ: Generative Pre-trained Transformer Quantization

GPTQ applies Optimal Brain Quantization (OBQ) principles in a one-shot, layer-wise manner. The algorithm processes each layer independently, using approximate second-order information derived from the Hessian matrix to determine optimal quantization.^[18] For each column of weights, GPTQ picks the quantized value that minimizes the squared difference between the original and quantized layer outputs, using the Hessian H = 2X^T X, where X holds the layer's calibration inputs. The key mechanism is error redistribution: the quantization error from each weight is compensated by adjusting the not-yet-quantized weights in the same row, minimizing accumulated error across the entire layer.

GPTQ's computational efficiency stems from avoiding gradient-based optimization, instead relying on closed-form updates. Quantizing the OPT-175B model takes approximately 4 GPU-hours, making the technique practical for the largest publicly available models. The method supports aggressive quantization to 4-bit, 3-bit, and 2-bit precision, though quality degrades significantly below 3 bits without additional techniques like outlier preservation. Production deployments leverage optimized kernels like ExLlama, achieving 2x inference speedup while reducing memory by 4x for 4-bit quantization.
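The error-redistribution step can be illustrated with a small NumPy sketch. This is a simplified, OBQ-style version under stated assumptions: it uses an explicit inverse Hessian and a single fixed scale, whereas the GPTQ paper uses a blocked, Cholesky-based formulation. The names `round_to_grid` and `obq_style_quantize` and the damping value are hypothetical.

```python
import numpy as np

def round_to_grid(w, scale, qmax=7):
    """Round-to-nearest onto a symmetric integer grid (4-bit when qmax=7)."""
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def obq_style_quantize(W, X, scale, damp=0.01):
    """Quantize W column by column, compensating each rounding error.

    W: (out_features, in_features); X: (n_samples, in_features) calibration inputs.
    """
    W = W.astype(np.float64).copy()
    H = 2.0 * X.T @ X                                      # layer-wise Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # damping keeps H invertible
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = round_to_grid(W[:, i], scale)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        # Push the rounding error onto the columns that are still unquantized.
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
        # Remove column i from the inverse Hessian (one elimination step).
        Hinv[i + 1:, i + 1:] -= np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / Hinv[i, i]
    return Q

# Two perfectly correlated input features: only the sum of the two weights matters.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
W = np.array([[0.3, 0.3]])

Q_rtn = round_to_grid(W, scale=1.0)            # plain rounding gives [0, 0]
Q_obq = obq_style_quantize(W, X, scale=1.0)    # compensation gives [0, 1]

err_rtn = np.sum((X @ (W - Q_rtn).T) ** 2)
err_obq = np.sum((X @ (W - Q_obq).T) ** 2)
print(Q_rtn, err_rtn)
print(Q_obq, err_obq)
```

In this toy case plain rounding sends both weights to zero, while the compensated version rounds the second weight up to absorb the first weight's error, which lowers the output error even though each individual weight is no closer to its original value.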

AWQ: Activation-aware Weight Quantization

AWQ introduces the key insight that not all weights contribute equally to model performance. Analyzing activation distributions reveals that approximately 1% of weights, those corresponding to salient activation channels, disproportionately affect model outputs.^[20] As the AWQ authors put it, "weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error." AWQ protects these salient weights by applying per-channel scaling factors that amplify important weight magnitudes before uniform quantization, and it "does not rely on any backpropagation or reconstruction," which helps preserve the model's generalization across domains and modalities.^[20]

The crucial innovation is that scaling does not require mixed-precision arithmetic during inference; instead, scales can be absorbed into subsequent layers or activation functions. AWQ optimizes scaling factors to minimize quantization error weighted by activation magnitude. This activation-aware approach provides multiple benefits: faster inference than GPTQ (2.7x speedup on RTX 4090), superior accuracy preservation for instruction-tuned models, reduced calibration data requirements (as few as 128 tokens), and better generalization to multi-modal architectures.
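The idea of searching for activation-aware scales can be sketched as follows. This is an illustrative approximation, not the reference AWQ code: it assumes per-row symmetric 4-bit rounding, scales of the form s = mean|X|^alpha, and a small grid over alpha. The helper names and the synthetic data (one input channel made 30x larger) are assumptions.

```python
import numpy as np

def fake_quant_rows(W, bits=4):
    """Symmetric round-to-nearest with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def search_channel_scales(W, X, bits=4, grid=11):
    """Grid-search alpha in s = mean|X|^alpha; alpha = 0 reproduces plain rounding."""
    act_mag = np.abs(X).mean(axis=0)          # per-input-channel activation magnitude
    ref = X @ W.T                             # full-precision layer output
    best_err, best_alpha, best_s = np.inf, None, None
    for alpha in np.linspace(0.0, 1.0, grid):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())    # keep the scales centred around 1
        W_q = fake_quant_rows(W * s, bits)    # scale up salient weight columns, then quantize
        err = np.mean(((X / s) @ W_q.T - ref) ** 2)  # inverse scale folded into activations
        if err < best_err:
            best_err, best_alpha, best_s = err, alpha, s
    return best_err, best_alpha, best_s

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
X[:, 3] *= 30.0                               # one salient activation channel
W = rng.normal(size=(8, 16))

rtn_err = np.mean((X @ fake_quant_rows(W).T - X @ W.T) ** 2)
awq_err, alpha, s = search_channel_scales(W, X)
print(f"plain rounding MSE: {rtn_err:.4f}")
print(f"best scaled MSE:    {awq_err:.4f} at alpha={alpha:.1f}")
```

Because alpha = 0 is part of the grid, the searched result can never be worse than plain rounding on the calibration data. In deployment the division by s is not performed at run time; it is folded into the preceding operation, as described above.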

SmoothQuant

SmoothQuant, published at ICML 2023 by Xiao et al., addresses the fundamental challenge that activations in large language models are much harder to quantize than weights due to systematic outliers.^[21] While weight distributions tend to be relatively uniform and flat, activation distributions develop extreme outliers (approximately 100x larger than typical values) that make per-tensor activation quantization impractical.

The core idea is a mathematically equivalent per-channel scaling transformation that migrates quantization difficulty from activations to weights. For a linear layer computing Y = XW, SmoothQuant introduces a per-channel scaling vector s, applied as the diagonal matrix diag(s), and reformulates the computation as:

Y = (X diag(s)^-1) (diag(s) W)

By dividing each activation channel by its corresponding scale factor and multiplying the corresponding weight channel by the same factor, the transformation smooths out activation outliers while making weights slightly harder to quantize. The net effect is that both weights and activations become easy to quantize, enabling efficient W8A8 (8-bit weight, 8-bit activation) quantization for the entire model.
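The transformation can be checked numerically. The sketch below assumes the scale rule from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), with the migration strength alpha = 0.5; the function name `smooth` and the synthetic outlier channel are illustrative.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Return (X diag(s)^-1, diag(s) W, s) for Y = X @ W."""
    act_max = np.abs(X).max(axis=0)           # per input channel, shape (in,)
    w_max = np.abs(W).max(axis=1)             # per input channel (rows of W)
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    return X / s, W * s[:, None], s

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
X[:, 2] *= 100.0                              # systematic activation outlier channel
W = rng.normal(size=(8, 4))

X_s, W_s, s = smooth(X, W)

print("output unchanged:", np.allclose(X_s @ W_s, X @ W))
print("activation max before/after:", np.abs(X).max(), np.abs(X_s).max())
print("weight max before/after:    ", np.abs(W).max(), np.abs(W_s).max())
```

With alpha = 0.5 the per-channel maxima of the smoothed activations and smoothed weights come out equal, which is the sense in which the difficulty is shared between the two. Larger alpha values move more of the difficulty onto the weights.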

SmoothQuant achieves up to 1.56x speedup and 2x memory reduction for LLMs with negligible accuracy loss, and has been demonstrated on models up to 530 billion parameters (including OPT, BLOOM, GLM, and LLaMA families). It enables serving a 530B LLM within a single node using half the number of GPUs compared to FP16.^[21] The scaling factors are computed offline using a small calibration dataset, making SmoothQuant a simple, training-free technique. It has been widely integrated into inference frameworks including NVIDIA TensorRT-LLM, Intel Neural Compressor, and vLLM.

GGUF and llama.cpp

GGUF (GPT-Generated Unified Format) and the llama.cpp framework target CPU-first inference with optional GPU offloading. GGUF represents a quantized model serialization format that stores weights, metadata, and quantization parameters in an efficient binary representation. The format supports block-wise quantization strategies: super-blocks of 256 values subdivide into 8 sub-blocks of 32 values each, with quantization parameters at both levels enabling fine-grained adaptation to local statistics.^[11]

GGUF quality levels span from Q2_K (2.63 bits per weight) through Q8_0 (8.5 bits), allowing users to select accuracy-efficiency trade-offs based on deployment constraints. The llama.cpp inference engine provides CPU implementations optimized for diverse architectures: x86 with AVX, AVX2, and AVX-512 instructions; Apple Silicon leveraging Metal and Accelerate framework; ARM processors using NEON; and GPU backends for CUDA and ROCm.
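The 8.5 bits-per-weight figure for Q8_0 follows from its block layout. The sketch below assumes that layout is 32 signed 8-bit values plus one FP16 scale per block; it is a NumPy illustration of the idea, not the llama.cpp implementation, and the function names are hypothetical.

```python
import numpy as np

BLOCK = 32  # values per block (assumed Q8_0-style layout)

def quantize_q8_blocks(x):
    """Quantize a 1-D array (length divisible by BLOCK) to int8 with one FP16 scale per block."""
    blocks = x.reshape(-1, BLOCK).astype(np.float32)
    scales = (np.abs(blocks).max(axis=1, keepdims=True) / 127.0).astype(np.float16)
    safe = np.where(scales == 0, 1, scales).astype(np.float32)  # avoid dividing by zero
    q = np.clip(np.round(blocks / safe), -127, 127).astype(np.int8)
    return q, scales

def dequantize_q8_blocks(q, scales):
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

def bits_per_weight(q, scales):
    """Storage cost including the per-block scale."""
    return 8 * (q.nbytes + scales.nbytes) / q.size

x = np.random.default_rng(0).normal(size=256)
q, scales = quantize_q8_blocks(x)
x_hat = dequantize_q8_blocks(q, scales)

print("bits per weight:", bits_per_weight(q, scales))   # (32*8 + 16) / 32
print("max abs error:  ", np.abs(x - x_hat).max())
```

The same accounting explains why K-quant types report fractional bit-widths: the cost of scales and minimums stored alongside the quantized values is amortized over the block.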

Performance benchmarks reported for bitnet.cpp, which builds on llama.cpp, show a 100 billion parameter ternary model executing on a single CPU at human reading speed (5-7 tokens/second), democratizing LLM access for researchers and developers without access to high-end GPUs. GGUF's K-quant variants (the types with a _K suffix) use the super-block layout described above, storing the per-sub-block scales in quantized form so that fine-grained adaptation adds little overhead.

BitNet: native 1-bit training

BitNet and BitNet b1.58 fundamentally differ from post-training methods by training models natively at low precision. The architecture replaces standard Linear layers with BitLinear layers that constrain weights during forward propagation.^[22] For BitNet b1.58, weights are quantized to {-1, 0, +1} using a rounding function normalized by the average absolute weight value. Activations undergo 8-bit quantization with similar absmax-based scaling. The critical innovation enabling training involves maintaining high-precision weights for gradient accumulation while simulating low-precision arithmetic during forward and backward passes.

BitNet's computational model replaces floating-point multiplications with integer additions, dramatically reducing arithmetic complexity. Matrix multiplication W x X with ternary weights decomposes to selective addition and subtraction based on weight masks identifying positive and negative positions. This formulation requires only additions and subtractions, enabling specialized hardware implementations. The bitnet.cpp framework provides optimized kernels achieving 1.37-6.17x CPU speedup and 55-82% energy reduction compared to full-precision inference, with particularly strong performance on ARM architectures.^[32]
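The absmean rounding rule and the multiplication-free product can be sketched in NumPy. This is an illustration of the arithmetic only, not the BitLinear layer or the bitnet.cpp kernels; the function names and the epsilon value are assumptions.

```python
import numpy as np

def absmean_ternary(W, eps=1e-8):
    """Quantize weights to {-1, 0, +1} after dividing by the mean absolute weight."""
    gamma = np.abs(W).mean()
    W_t = np.clip(np.round(W / (gamma + eps)), -1, 1).astype(np.int8)
    return W_t, gamma

def ternary_matvec(W_t, x):
    """Compute W_t @ x using only selection, addition, and subtraction."""
    plus = np.where(W_t == 1, x, 0.0).sum(axis=1)    # add inputs where weight is +1
    minus = np.where(W_t == -1, x, 0.0).sum(axis=1)  # subtract inputs where weight is -1
    return plus - minus

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))
x = rng.normal(size=16)

W_t, gamma = absmean_ternary(W)
y = gamma * ternary_matvec(W_t, x)     # a single rescale per output restores magnitude

print("ternary levels used:", np.unique(W_t))
print("approximate output: ", y)
print("full-precision:     ", W @ x)
```

The only multiplication left is the final rescale by gamma, which is applied once per output rather than once per weight.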

KV-cache quantization

As large language models process longer sequences, the key-value (KV) cache used in transformer attention mechanisms grows to consume significant GPU memory. For a model like LLaMA-2-70B processing a 128K token context, the KV cache requires roughly 40GB per sequence in FP16 precision (assuming its 80 layers and 8 grouped-query KV heads of dimension 128), so a batch of only four such sequences exceeds the 140GB consumed by the FP16 model weights themselves. KV-cache quantization compresses these cached key and value tensors to lower precision, enabling longer context lengths within the same memory budget.^[37]
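The cache footprint follows from simple arithmetic on the model shape. The sketch below computes the per-sequence size under an assumed LLaMA-2-70B-like configuration (80 layers, 8 grouped-query KV heads, head dimension 128); these shape values are assumptions for illustration, and the total scales linearly with batch size. The overhead of quantization scales and zero-points is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2.0, batch_size=1):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value * batch_size

GIB = 2 ** 30
context = 128 * 1024

fp16 = kv_cache_bytes(80, 8, 128, context, bytes_per_value=2.0)
int8 = kv_cache_bytes(80, 8, 128, context, bytes_per_value=1.0)
int4 = kv_cache_bytes(80, 8, 128, context, bytes_per_value=0.5)

for name, size in [("FP16", fp16), ("INT8", int8), ("INT4", int4)]:
    print(f"{name}: {size / GIB:.1f} GiB per sequence")
```

Halving the bytes per cached value either doubles the context length or doubles the number of concurrent sequences that fit in the same memory budget.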

Several approaches have emerged for KV-cache quantization:

Uniform quantization applies standard INT8 or INT4 quantization to all key and value tensors. This straightforward approach can introduce accuracy degradation, particularly for key tensors whose values feed into the softmax attention computation.

Heterogeneous quantization stores keys and values at different precisions. Research has shown that keys are more sensitive to quantization than values because key quantization errors affect the shared softmax denominator, amplifying errors across all attention positions. Consequently, methods like LeanKV store keys at higher precision (e.g. 8 bits) and values at lower precision (e.g. 4 bits).

Per-channel, pre-RoPE key quantization quantizes key tensors before rotary positional encoding (RoPE) is applied. Because RoPE introduces channel-dependent rotations that can create outliers, quantizing before RoPE yields more uniform distributions and better accuracy.

KVQuant (NeurIPS 2024) combines sensitivity-weighted non-uniform quantization with per-channel key quantization to achieve under 0.1 perplexity degradation at 3-bit precision across LLaMA, Llama-2, Llama-3, and Mistral models. It enables serving LLaMA-7B with up to 1 million token context on a single A100-80GB GPU and up to 10 million tokens on an 8-GPU system.^[37]
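The benefit of choosing the quantization axis for keys can be seen on synthetic data. The sketch below assumes, purely for illustration, a key tensor in which one channel is 10x larger than the rest, and compares one scale per token against one scale per channel at 4 bits. It does not model RoPE or the non-uniform codebooks used by KVQuant.

```python
import numpy as np

def fake_quant(x, bits, axis):
    """Symmetric uniform quantization with one scale per slice reduced along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 128))     # (tokens, channels)
keys[:, 5] *= 10.0                     # assumed outlier channel

per_token = fake_quant(keys, bits=4, axis=1)    # scale shared across channels of a token
per_channel = fake_quant(keys, bits=4, axis=0)  # scale shared across tokens of a channel

mse_token = np.mean((keys - per_token) ** 2)
mse_channel = np.mean((keys - per_channel) ** 2)
print(f"per-token MSE:   {mse_token:.4f}")
print(f"per-channel MSE: {mse_channel:.4f}")
```

With per-token scales the outlier channel stretches the range for every other channel in the same token, while per-channel scales confine the large range to the one channel that needs it.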

KV-cache quantization has become a critical optimization for production LLM serving systems. Frameworks like vLLM, TensorRT-LLM, and SGLang support KV-cache quantization as a built-in feature, typically offering INT8 and FP8 options for the KV cache independent of the weight quantization format.

Performance characteristics and benefits

Model size reduction

Model size reduction represents quantization's most immediate benefit, with compression ratios directly proportional to bit-width reduction. FP32 to INT8 quantization achieves exactly 4x compression: a 7 billion parameter model decreases from 28GB to 7GB. INT4 quantization doubles this to 8x compression, fitting the same model in 3.5GB.^[28]
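These figures follow directly from parameter count times bits per weight. A minimal sketch is shown below; the function name is illustrative, and the last line assumes a group size of 128 with one FP16 scale per group to show how quantization metadata adds a small overhead on top of the nominal bit-width.

```python
def model_size_gb(num_params, bits_per_weight):
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params = 7e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B @ {name}: {model_size_gb(params, bits):.1f} GB")

# Assumed example: 4-bit weights plus one 16-bit scale per group of 128 weights.
effective_bits = 4 + 16 / 128
print(f"7B @ INT4 with group scales: {model_size_gb(params, effective_bits):.2f} GB")
```

The same function applies to any model size, so the ratio between two formats depends only on their effective bits per weight.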

The landmark Deep Compression work demonstrated extreme ratios combining quantization with pruning and Huffman coding: AlexNet compressed 35x (240MB to 6.9MB) and VGG-16 compressed 49x (552MB to 11.3MB), both without accuracy loss.^[15] For modern large language models, these reductions transform deployment feasibility. LLaMA 3.1 70B requires 140GB in FP16 but only 70GB in INT8 and 35GB in INT4, bringing it within reach of high-memory workstations or multi-GPU consumer setups.

Inference speed improvements

Inference speed improvements vary substantially based on hardware, model architecture, and bottleneck characteristics. Compute-bound operations, where arithmetic operations dominate execution time, benefit most from quantization's reduced operation complexity. NVIDIA reports 1.8x speedup for W8A8-INT quantization on A100 GPUs, while INT8 operations on Tensor Cores can theoretically reach 4x speedup over FP32.^[38]

TensorRT-optimized Stable Diffusion XL achieves 1.72x speedup with INT8 and 1.95x with FP8 on RTX 6000 Ada GPUs. CPU implementations show equally impressive gains: Intel's optimized INT8 operators deliver 2.97x geometric mean speedup across 69 models on x86 processors.^[7]

Memory bandwidth-bound operations gain from reduced data movement. Modern GPU inference often stalls waiting for data transfer rather than computation, particularly for large models whose weights cannot fit in high-speed cache. A 4x reduction in model size means 4x fewer bytes streamed from DRAM for each pass over the weights, and 4x more weights fit in L2 cache.

Accuracy preservation

Accuracy preservation depends critically on bit-width, quantization method, and model characteristics. Large-scale evaluations report that 8-bit W8A8-INT quantization achieves 99%+ accuracy recovery across all benchmarks, essentially matching full-precision performance, while 4-bit W4A16 quantization maintains 98.9% accuracy recovery for code generation, 98.5% for mathematics, and similarly high retention across diverse tasks.^[39]

Below 4 bits, accuracy degradation accelerates unless sophisticated techniques intervene. Standard 3-bit quantization shows noticeable degradation, with model capacity beginning to deteriorate. Instruction-tuned LLMs generally maintain performance comparable to their non-quantized counterparts at 8-bit and 4-bit precisions, with 4-bit quantized models often showing similar performance to their BFloat16 equivalents. A significant performance drop is observed when LLMs are quantized to 3-bit or lower, particularly with GPTQ.^[40]

Energy efficiency and cost reduction

Energy efficiency gains prove crucial for edge deployment and environmental sustainability. Manufacturing edge AI case studies document dramatic improvements: hardware costs decreased 92% (from $225,000 for 50 GPU cards to $18,000 for 4 cards) while energy consumption dropped 65-80% through INT8 quantization.^[41]

Mobile edge computing research shows up to 40% overall energy reduction, combining computational savings with reduced data transmission. BitNet models demonstrate 55-82% energy savings on CPUs compared to full-precision baselines.^[32] The bitnet.cpp framework running the 2B4T model achieves remarkable edge device performance: 11 tokens per second on Raspberry Pi 5 and 48 tokens per second on Snapdragon X Elite. These metrics transform deployment economics; models that previously required $10,000+ GPU infrastructure now execute on $100 single-board computers.

Hardware support

The effectiveness of quantized inference depends heavily on hardware support for low-precision arithmetic. Different hardware platforms support different precision formats, and this fragmentation influences which quantization methods are practical for a given deployment target.

Hardware	INT8	INT4	FP8	FP4	BF16/FP16	Notes
NVIDIA Volta/Turing (V100, T4)	Yes	No	No	No	FP16	First Tensor Core INT8 support
NVIDIA Ampere (A100)	Yes	Yes	No	No	Both	INT4 Tensor Core operations
NVIDIA Hopper (H100)	Yes	No	Yes	No	Both	Native FP8 support; INT4 removed
NVIDIA Blackwell (B100/B200)	Yes	No	Yes	Yes	Both	FP4 Tensor Core support added
Intel CPUs (VNNI)	Yes	No	No	No	No	Optimized INT8 dot products via VNNI
Apple Neural Engine (A16+)	Yes	Yes (channel-wise)	No	No	FP16	INT8 groupwise, INT4 channel-wise
Qualcomm Hexagon NPU	Yes	Yes	No	No	FP16	Native INT4 doubles tensor throughput
Google TPU v4/v5	Yes	Yes	No	No	BF16	Optimized for BF16 training
ARM CPUs (NEON)	Yes	No	No	No	FP16	INT8 via NEON extensions

NVIDIA's evolution illustrates the shifting landscape of quantization hardware support. The Volta architecture (2017) introduced Tensor Cores for FP16, and Turing (2018) extended them to INT8. Ampere (2020) added fast INT4 Tensor Core operations. However, the Hopper architecture (2022) removed INT4 support in favor of FP8, reflecting the industry's recognition that floating-point formats better handle the heavy-tailed distributions common in transformer models.^[34] Blackwell (2024) added FP4 Tensor Core support, continuing the trend toward low-bit floating-point formats.

Apple's Neural Engine primarily performs INT8 x INT8 and FP16 x FP16 matrix multiplications. Inputs can optionally use INT4 or INT8 linear quantization, with INT8 supporting groupwise quantization and INT4 limited to channel-wise quantization.^[42] Qualcomm's Hexagon NPU includes native INT4 support, doubling tensor throughput compared to INT8 for supported operations.^[43]

The absence of industry-wide quantization standards forces developers to maintain multiple quantized model versions for different deployment targets. Framework interoperability remains imperfect: PyTorch, TensorFlow, and ONNX use different quantization schemes, complicating model conversion. Kernel optimization presents an additional challenge. Standard deep learning frameworks lack specialized implementations for many quantization schemes, particularly exotic formats like 1-bit or mixed-precision. BitNet models demonstrate this gap: running them through the standard Hugging Face Transformers library can be slower than full-precision inference despite the theoretical 32x reduction in arithmetic complexity, because the software falls back to inefficient dequantization before every operation.^[22]

Evaluation of quantized models

Evaluating the quality of a quantized model requires careful measurement of accuracy degradation across multiple dimensions. The two most common evaluation approaches are perplexity measurement and downstream task accuracy.

Perplexity is the standard metric for evaluating quantized language models. It measures how well a model predicts the next token in a sequence, with lower values indicating better performance. Perplexity is typically computed on standard evaluation datasets such as WikiText-2 or C4. A well-quantized model should show minimal perplexity increase; for example, 8-bit quantization of LLaMA models typically increases WikiText-2 perplexity by less than 0.1 points.^[18]
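Perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities, which is assumed to come from running the model over an evaluation set; the function names and the toy values are illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

def perplexity_increase(fp_log_probs, quant_log_probs):
    """Perplexity of the quantized model minus that of the full-precision model."""
    return perplexity(quant_log_probs) - perplexity(fp_log_probs)

# Toy values: a model that assigns probability 0.25 to every correct token.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))

# Toy comparison: the quantized model is slightly less confident on each token.
fp = [math.log(0.50), math.log(0.40), math.log(0.60)]
quant = [math.log(0.49), math.log(0.39), math.log(0.59)]
print(perplexity_increase(fp, quant))
```

Because the metric is an exponential of an average, comparisons are only meaningful when both models are scored on the same tokens with the same tokenizer and context length.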

Downstream task accuracy evaluates quantized models on specific benchmarks such as MMLU, HumanEval, GSM8K, and ARC. This approach captures whether the quantized model retains its capabilities on practical tasks. Research shows that perplexity is a reliable predictor of downstream performance: models with low perplexity degradation after quantization generally maintain their task accuracy as well.^[40]

Calibration dataset sensitivity is an important evaluation consideration. The choice and size of calibration data for PTQ methods can significantly affect results. GPTQ and AWQ typically use 128-256 calibration samples from C4 or WikiText-2, and results can vary based on the overlap between calibration data and evaluation benchmarks.

A practical rule of thumb for acceptable degradation: a 1% drop in accuracy may be acceptable if it leads to 2x speedup and 4x reduction in model size, but a 10% drop is typically prohibitive. The acceptable level depends on the application's requirements and the efficiency benefits achieved.

Applications across domains

Computer vision

Convolutional neural networks demonstrate strong quantization robustness, with deeper architectures generally more tolerant than compact efficient networks. ResNet family models quantize successfully to INT8 with minimal accuracy loss: ResNet-18 and ResNet-50 both maintain performance within 0.3% of full-precision baselines while gaining 1.59-1.65x CPU speedup and 4x size reduction.^[7]

MobileNet architectures present greater challenges due to aggressive efficiency optimizations. MobileNetV1 and V2's depth-wise separable convolutions prove sensitive to quantization, with per-tensor INT8 quantization causing 4-5% accuracy degradation. However, per-channel quantization recovers most losses, and quantization-aware training further improves results.^[8]

Object detection models benefit from hybrid quantization strategies that apply different bit-widths to backbones versus detection heads. YOLOv8 quantization with pruning achieves 64.2% compression ratio with approximately 4% detection accuracy loss, maintaining real-time performance on edge devices.^[44]

Natural language processing and LLMs

Transformer architectures and large language models dominate recent quantization research due to their scale and deployment challenges. BERT models quantize successfully to 8 bits with minimal degradation. Q8BERT achieves 4x compression maintaining accuracy, while FP8-BERT addresses outlier challenges through floating-point representation.^[45]

LLM quantization has evolved into a sophisticated subfield. The LLaMA 3.1 family (8B, 70B, 405B parameters) serves as a benchmark for quantization methods, with comprehensive evaluations showing 8-bit models maintaining 99%+ baseline accuracy and 4-bit models retaining 98.9% on code generation.^[39] Different quantization algorithms exhibit distinct strengths:

GPTQ: Excels for GPU deployment with 2x speedup and 4x memory reduction
AWQ: Provides fastest inference with 3x speedup, best for instruction-tuned models
GGUF: Enables CPU deployment with quality levels from 2 to 8 bits

Generative models

Generative models for images and video present unique quantization challenges. Stable Diffusion XL quantization with TensorRT achieves 1.72x INT8 speedup and 1.95x FP8 speedup while preserving image quality through percentile-based calibration that excludes extreme outliers.^[38] The iterative generation process amplifies quantization error accumulation across diffusion steps, requiring careful validation that perceptual quality remains acceptable.

Edge and mobile deployment

Quantization is the primary enabling technology for deploying AI models on edge devices with limited compute, memory, and power budgets. In autonomous vehicles, perception models that process data from cameras, LiDAR, and radar must run with extremely low latency to make real-time driving decisions. Quantization accelerates these critical models on the vehicle's embedded computing hardware.^[29]

Edge deployment in robotics benefits from quantization's power efficiency. Battery-powered robots require energy-efficient inference to maximize operational time. INT8 quantization of object detection, semantic segmentation, and SLAM models enables real-time processing on embedded GPUs like NVIDIA Jetson with 2-3x speedup and 40-60% power reduction.^[46] Mobile phones leverage the Apple Neural Engine, Qualcomm Hexagon NPU, and similar accelerators to run quantized models for on-device tasks including speech recognition, image processing, and language understanding.

Implementation in deep learning frameworks

PyTorch

PyTorch provides three quantization paradigms reflecting the evolution of the framework's capabilities. Eager Mode Quantization (beta status) offers manual control, requiring explicit fusion and quantization/dequantization module placement. FX Graph Mode Quantization (maintenance mode) automates fusion and quantization through graph-level transformations. PyTorch 2 Export Quantization (prototype) leverages torch.export to capture entire model graphs, enabling more sophisticated optimizations.^[8]

The PyTorch quantization API centers on configuration objects (QConfig) specifying observer modules for calibration and fake-quantization modules for QAT. Backend support spans diverse hardware: x86 CPUs through fbgemm and onednn libraries, ARM processors via qnnpack and xnnpack, and prototype NVIDIA GPU support through TensorRT integration.

TensorFlow and TensorFlow Lite

TensorFlow Model Optimization Toolkit offers comprehensive quantization capabilities through two main pathways. Post-training quantization provides weight-only, dynamic range, and full integer quantization options. Float16 quantization reduces model size by half while maintaining GPU acceleration benefits.^[47]

Full integer quantization converts weights and activations to 8-bit integers, achieving maximum efficiency for Edge TPU, NNAPI, and mobile deployment. The toolkit's representative dataset concept enables static quantization calibration, where users provide a data generator function that yields calibration samples from which quantization parameters are derived.

ONNX Runtime

ONNX Runtime bridges framework boundaries, accepting models from PyTorch, TensorFlow, and other frameworks in the standardized ONNX format. It supports dynamic and static post-training quantization, as well as models produced by quantization-aware training, using two representation formats. The QOperator format represents quantization at the operator level, directly replacing floating-point operators with quantized equivalents. The QDQ (Quantize-DeQuantize) format explicitly inserts quantization and dequantization nodes in the graph, providing finer granularity and better hardware portability.^[48]

NVIDIA TensorRT

NVIDIA TensorRT provides production-grade inference optimization for NVIDIA GPUs, with quantization as a core capability. The framework supports INT8 and INT4 quantization through post-training and quantization-aware training workflows, with recent additions including FP8 and FP4 support via TensorRT Model Optimizer.^[38]

Layer fusion combines operations like convolution, activation, and batch normalization into single optimized kernels, reducing memory traffic and enabling better quantization by preserving higher precision in intermediate computations.

Hugging Face Transformers

Hugging Face Transformers has become the de facto standard for LLM deployment, with native quantization support for multiple methods. The BitsAndBytesConfig class enables 4-bit and 8-bit quantization through simple parameter specifications during model loading.^[1]

NormalFloat4 (NF4) quantization provides information-theoretically optimal 4-bit representation for normally distributed weights, while double quantization reduces memory further by quantizing the quantization constants. GPTQConfig and AwqConfig provide interfaces to advanced post-training quantization methods. Pre-quantized models on Hugging Face Hub load directly with appropriate configuration objects, enabling immediate deployment without local quantization. PEFT (Parameter-Efficient Fine-Tuning) integration allows training LoRA adapters on quantized base models, with QLoRA specifically designed for 4-bit quantization.^[19]
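The saving from double quantization can be worked out as bits of overhead per parameter. The sketch below assumes the block sizes described in the QLoRA paper: one 32-bit constant per block of 64 weights without double quantization, and 8-bit constants with a second-level 32-bit constant per 256 blocks with it. The function names are illustrative.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    """Extra bits per weight from storing one quantization constant per block."""
    return const_bits / block_size

def double_quant_overhead_bits(block_size=64, const_bits=8,
                               second_block=256, second_const_bits=32):
    """Overhead when the constants are themselves quantized in blocks."""
    first_level = const_bits / block_size
    second_level = second_const_bits / (block_size * second_block)
    return first_level + second_level

plain = constant_overhead_bits()
double = double_quant_overhead_bits()
print(f"without double quantization: {plain:.3f} bits/weight")
print(f"with double quantization:    {double:.3f} bits/weight")
print(f"saving on a 70B model:       {(plain - double) * 70e9 / 8 / 1e9:.2f} GB")
```

The saving per weight is small, but at tens of billions of parameters it amounts to several gigabytes, which matters when fine-tuning close to a GPU's memory limit.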

Apple Core ML Tools

Apple's Core ML Tools provide linear quantization with 8-bit and 4-bit support, per-tensor, per-channel, and per-block granularity, and specialized algorithms like GPTQ integration for sequential models.^[42] Per-block quantization (iOS 18+, macOS 15+) proves particularly effective for 4-bit weights, with blocks of 16-64 values sharing quantization parameters. The Neural Engine accelerates per-channel INT8 and FP16 operations, GPUs handle per-block quantization efficiently, and CPUs provide flexible support.

Challenges and research directions

Activation outliers in transformers

Activation outliers in transformer models represent the most significant challenge for aggressive quantization. Models exceeding 6.7 billion parameters develop extreme outlier channels with magnitude 100x larger than typical activations, concentrating in specific feature dimensions corresponding to residual stream layers (query-key-value projections, attention outputs).^[17] These structured outliers emerge early during pretraining and persist throughout model evolution, preventing accurate per-tensor quantization below 8 bits.

Current solutions adopt multiple strategies:

SmoothQuant: Migrates quantization difficulty from activations to weights through mathematically equivalent transformations^[21]
Mixed-precision: Preserves outlier channels at higher precision
Rotation-based methods: Apply orthogonal transformations to eliminate outliers before quantization
Activation decomposition: Separates activations into outlier-free components through singular value decomposition^[49]

Accuracy degradation at ultra-low precision

Accuracy degradation below 4 bits presents fundamental challenges distinct from outlier problems. Quantization to 3 bits and below causes systematic performance deterioration even with advanced methods. The information-theoretic perspective illuminates the fundamental constraint: 2-bit quantization provides only 4 discrete levels per parameter, insufficient to represent the rich parameter distributions learned during training.

Recent innovations like VPTQ achieve 95% accuracy preservation at 2 bits through vector-wise quantization, grouping parameters into vectors and quantizing vectors jointly rather than independently, leveraging inter-parameter correlations to reduce error.^[31]

Rotation-based quantization

Rotation-based quantization methods apply mathematical transformations to eliminate outliers through distributional reshaping rather than outlier preservation. QuaRot uses randomized Hadamard transformations exploiting rotational invariance, rotating weight matrices and activation distributions without changing computed outputs but redistributing outlier magnitude across channels.^[50]

SpinQuant extends this concept with learned rotations optimized during fine-tuning to minimize quantization error. DuQuant combines rotation with zigzag permutation, redistributing outliers across the feature dimension to balance quantization difficulty.^[51] These approaches enable 4-bit quantization of weights, activations, and KV cache simultaneously, something previously unattainable due to activation outliers.
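The effect of an orthogonal rotation on an outlier can be demonstrated with a plain Hadamard matrix. This sketch uses the deterministic Sylvester construction rather than the randomized Hadamard transforms of QuaRot or the learned rotations of SpinQuant, and the single-outlier input is an illustrative assumption.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

n = 8
Q = hadamard(n)

x = np.zeros(n)
x[0] = 100.0                                   # one extreme outlier channel
W = np.random.default_rng(0).normal(size=(n, 4))

x_rot = x @ Q                                  # rotate the activations
W_rot = Q.T @ W                                # counter-rotate the weights

print("output unchanged:", np.allclose(x_rot @ W_rot, x @ W))
print("max |x| before:", np.abs(x).max())
print("max |x| after: ", np.abs(x_rot).max())  # 100 / sqrt(8) in every channel
```

The rotation leaves the layer output unchanged because Q Q^T is the identity, while the outlier's energy is spread evenly over all channels, so a single quantization scale wastes far fewer levels.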

Safety and alignment preservation

Safety and alignment preservation during quantization represents an emerging concern as models deploy broadly. Recent research demonstrates that quantization correlates with increased safety risks; quantized models generate more harmful outputs than full-precision counterparts when tested on adversarial prompts.^[52] The theoretical question of whether alignment and capability exhibit different sensitivity to quantization requires further investigation.

Frequently asked questions

Does quantization reduce accuracy?

Usually only slightly, and often imperceptibly. At 8 bits, weight-and-activation quantization typically recovers 99%+ of full-precision accuracy across benchmarks, and ResNet-50 loses only 0.22% top-1 accuracy when moved from FP32 to INT8.^[7]^[39] At 4 bits, well-engineered methods such as GPTQ and AWQ retain roughly 98-99% of baseline accuracy for LLMs. Accuracy loss becomes pronounced only at 3 bits and below, where the number of discrete levels (8 levels at 3 bits, 4 levels at 2 bits) is too small to capture learned parameter distributions without advanced techniques like vector quantization or rotation.^[18]^[31]

How much smaller does quantization make a model?

The compression ratio is set by the bit-width: FP32 to INT8 is exactly 4x, FP32 to INT4 is 8x, and ternary BitNet weights reach roughly 20x (32 bits versus about 1.58 bits per weight).^[22]^[28] Concretely, a 7B model drops from 28GB (FP32) to 7GB (INT8) or 3.5GB (INT4), and LLaMA 3.1 70B drops from 140GB in FP16 to 70GB in INT8 or about 35GB in INT4, which is what lets large models fit on consumer-class hardware.

Which quantization method should I use for an LLM?

For GPU inference, GPTQ gives strong 4-bit results with about 2x speedup and 4x memory reduction; AWQ often delivers the fastest inference (3x speedup) and is especially strong for instruction-tuned models because it protects salient weights without backpropagation.^[18]^[20] For CPU-first or laptop deployment, GGUF with llama.cpp offers selectable quality levels from about 2.6 to 8.5 bits per weight. For 8-bit weight-and-activation serving at scale, SmoothQuant enables W8A8 across very large models, and for fine-tuning under tight memory, QLoRA combines 4-bit NF4 weights with LoRA adapters.^[19]^[21]

References

1. Hugging Face. "Quantization." Hugging Face Documentation. https://huggingface.co/docs/optimum/en/concept_guides/quantization
2. Gholami, A., Kim, S., Dong, Z., et al. "A Survey of Quantization Methods for Efficient Neural Network Inference." arXiv:2103.13630, 2021.
3. Lin, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys, 2024.
4. Ma, S., et al. "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits." arXiv:2402.17764, 2024.
5. IBM. "What is Quantization?" IBM Think. https://www.ibm.com/think/topics/quantization
6. Jacob, B., et al. "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference." CVPR, 2018.
7. Nagel, M., et al. "A White Paper on Neural Network Quantization." Qualcomm AI Research, 2021.
8. PyTorch. "Quantization." PyTorch Documentation. https://pytorch.org/docs/stable/quantization.html
9. Krishnamoorthi, R. "Quantizing deep convolutional networks for efficient inference: A whitepaper." arXiv:1806.08342, 2018.
10. Nagel, M., et al. "Up or Down? Adaptive Rounding for Post-Training Quantization." ICML, 2020.
11. GGML. "GGUF format specification." https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
12. Nagel, M., et al. "A White Paper on Neural Network Quantization." Qualcomm AI Research, arXiv:2106.08295, 2021.
13. Courbariaux, M., Bengio, Y., and David, J.-P. "BinaryConnect: Training Deep Neural Networks with binary weights during propagations." NeurIPS, 2015.
14. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. "Binarized Neural Networks." NeurIPS, 2016.
15. Han, S., Mao, H., and Dally, W. J. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." ICLR, 2016.
16. Gholami, A., Kim, S., Dong, Z., et al. "A Survey of Quantization Methods for Efficient Neural Network Inference." arXiv:2103.13630, 2021.
17. Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS, 2022. arXiv:2208.07339.
18. Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers." ICLR, 2023. arXiv:2210.17323.
19. Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. "QLoRA: Efficient Finetuning of Quantized Language Models." NeurIPS, 2023. arXiv:2305.14314.
20. Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024 Best Paper. arXiv:2306.00978.
21. Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." ICML, 2023. arXiv:2211.10438.
22. Ma, S., Wang, H., Ma, L., et al. "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits." arXiv:2402.17764, 2024.
23. Banner, R., Nahshan, Y., and Soudry, D. "Post training 4-bit quantization of convolutional networks for rapid-deployment." NeurIPS, 2019.
24. TensorFlow. "Post-training quantization." TensorFlow Lite Documentation. https://www.tensorflow.org/lite/performance/post_training_quantization
25. Bengio, Y., Leonard, N., and Courville, A. "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation." arXiv:1308.3432, 2013.
26. Esser, S. K., McKinstry, J. L., Bablani, D., et al. "Learned Step Size Quantization." ICLR, 2020.
27. Yin, P., Lyu, J., Zhang, S., et al. "Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets." ICLR, 2019.
28. NVIDIA. "Model Quantization: Concepts, Methods, and Why It Matters." NVIDIA Technical Blog, 2025.
29. Wu, H., Judd, P., Zhang, X., Isaev, M., and Micikevicius, P. "Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation." arXiv:2004.09602, 2020.
30. Dettmers, T. and Zettlemoyer, L. "The case for 4-bit precision: k-bit Inference Scaling Laws." ICML, 2023.
31. Liu, Y., et al. "VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models." arXiv:2409.17066, 2024.
32. Microsoft Research. "bitnet.cpp: Efficient Inference Framework for 1-bit LLMs." GitHub, 2024.
33. Li, F., Zhang, B., and Liu, B. "Ternary Weight Networks." arXiv:1605.04711, 2016.
34. Micikevicius, P., et al. "FP8 Formats for Deep Learning." arXiv:2209.05433, 2022.
35. Liu, Z., et al. "Post-Training Quantization for Vision Transformer." NeurIPS, 2021.
36. Micikevicius, P., et al. "Mixed Precision Training." ICLR, 2018.
37. Hooper, C., Kim, S., Mohammadi, H., et al. "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization." NeurIPS, 2024.
38. NVIDIA. "TensorRT Model Optimizer." NVIDIA Documentation. https://docs.nvidia.com/deeplearning/tensorrt/
39. Huang, W., et al. "A Comprehensive Evaluation of Quantization Strategies for Large Language Models." arXiv:2402.16775, 2024.
40. Jin, H., et al. "A Comprehensive Evaluation of Quantization Strategies for Large Language Models." ACL Findings, 2024.
41. Rokh, B., Azarpeyvand, A., and Khanteymoori, A. "A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification." ACM TIST, 2023.
42. Apple. "Quantization." Core ML Tools Documentation. https://apple.github.io/coremltools/docs-guides/source/opt-quantization.html
43. Qualcomm. "Unlocking on-device generative AI with an NPU and heterogeneous computing." Qualcomm White Paper, 2024.
44. Wang, Y., et al. "YOLOv8 Quantization with Pruning for Edge Deployment." 2024.
45. Zafrir, O., Boudoukh, G., Izsak, P., and Wasserblat, M. "Q8BERT: Quantized 8Bit BERT." EMC2 Workshop at NeurIPS, 2019.
46. NVIDIA. "Jetson AI Edge Computing." NVIDIA Developer. https://developer.nvidia.com/embedded-computing
47. TensorFlow. "Model Optimization Toolkit." https://www.tensorflow.org/model_optimization
48. Microsoft. "ONNX Runtime Quantization." https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html
49. Zhang, Y., et al. "QUAD: Quantization with Activation Decomposition for Efficient LLM Inference." 2024.
50. Ashkboos, S., et al. "QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs." arXiv:2404.00456, 2024.
51. Lin, H., et al. "DuQuant: Distributing Outliers via Dual Transformations Makes Stronger Quantized LLMs." arXiv:2406.01721, 2024.
52. Ahmadian, A., et al. "Intriguing Properties of Quantization at Scale." NeurIPS, 2023.
