Quantization
Last reviewed
Jun 1, 2026
Sources
52 citations
Review status
Source-backed
Revision
v4 · 8,568 words
Quantization in machine learning and artificial intelligence is the process of reducing the numerical precision of the values a neural network stores and computes (weights, activations, and sometimes gradients) from high-precision formats such as 32-bit floating-point (FP32) to lower-precision representations like 8-bit integers (INT8) or 4-bit integers (INT4). Going from FP32 to INT8 typically achieves a 4x model size reduction and 2-4x inference speedup with minimal accuracy loss.[1][2] Quantization has become essential for deploying large language models and other large-scale models on resource-constrained devices, reducing computational costs, and enabling real-time inference.
Originally explored in the 1990s for early neural networks, quantization experienced explosive growth after 2022 with the emergence of large language models, evolving from simple 8-bit post-training methods to sophisticated 1-bit architectures trained natively at low precision. Modern quantization enables running 70 billion parameter language models on consumer GPUs, deploying computer vision models on smartphones, and executing AI inference on edge devices with 40-80% energy savings.[3][4] Recent breakthroughs like Microsoft's BitNet demonstrate that models can be trained from scratch with ternary weights (values constrained to -1, 0, or +1) while matching full-precision performance, fundamentally challenging assumptions about the precision requirements of deep learning. As models continue scaling to trillions of parameters, quantization has transformed from an optimization technique into a necessity for practical AI deployment.
Imagine you have a box of 64 crayons with very specific colors like "cerulean blue" and "burnt sienna." You can draw detailed pictures, but the box is big and heavy. Now imagine replacing it with a box of just 8 basic crayons: red, blue, green, yellow, and so on. Your drawings will look almost the same to most people, but your crayon box is now much smaller and lighter, and you can color faster because you spend less time picking between similar shades.
Quantization does something similar with AI models. A model stores millions or billions of numbers (called weights) that tell it how to make decisions. Normally these numbers are stored with extreme precision, like writing a measurement as "3.14159265." Quantization rounds those numbers to simpler values, like "3.14" or even just "3." The model gets smaller, runs faster, and uses less energy, while still giving answers that are nearly as good as before.
To put it in everyday terms: if someone asks you what time it is, you might glance at your watch and say "10:20" instead of "10:21:37.4 seconds." That small loss of precision almost never matters in practice. Quantization applies this same principle to AI models.[5]
Quantization maps continuous floating-point values to a discrete set of integers through an affine transformation defined by two parameters: scale and zero-point. The fundamental quantization equation relates a floating-point value x to its quantized integer representation x_q through the formula:[1][6]
x = S × (x_q - Z)
where S represents the scale factor (a positive floating-point number) and Z represents the zero-point (an integer ensuring exact representation of zero).
The forward quantization process applies the inverse mapping with clipping:
x_q = clip(round(x / S + Z), alpha_q, beta_q)
where alpha_q and beta_q define the quantization range. For b-bit quantization, the range is [-2^(b-1), 2^(b-1) - 1] for signed integers or [0, 2^b - 1] for unsigned ones, which gives [-128, 127] and [0, 255] respectively at 8 bits. The scale and zero-point parameters derive from the floating-point range [alpha, beta] through:
S = (beta - alpha) / (beta_q - alpha_q)
Z = round((beta × alpha_q - alpha × beta_q) / (beta - alpha))
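The mapping above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names and the [0, 6] example range are hypothetical, and the zero-point is computed in the equivalent form Z = alpha_q - alpha / S, which is what results from requiring that x = alpha maps to x_q = alpha_q.

```python
def quant_params(alpha, beta, alpha_q=-128, beta_q=127):
    """Scale S and zero-point Z mapping [alpha, beta] onto [alpha_q, beta_q]."""
    scale = (beta - alpha) / (beta_q - alpha_q)
    # Chosen so that x = alpha lands exactly on alpha_q.
    zero_point = round(alpha_q - alpha / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, alpha_q=-128, beta_q=127):
    """x_q = clip(round(x / S + Z), alpha_q, beta_q)."""
    q = round(x / scale + zero_point)
    return max(alpha_q, min(beta_q, q))


def dequantize(x_q, scale, zero_point):
    """x = S * (x_q - Z)."""
    return scale * (x_q - zero_point)


# Hypothetical activation range [0, 6], chosen only for illustration.
S, Z = quant_params(0.0, 6.0)
for x in (0.0, 1.0, 3.3, 6.0, 7.5):  # 7.5 is outside the range and gets clipped
    x_q = quantize(x, S, Z)
    print(f"x={x:4.1f}  x_q={x_q:4d}  x_hat={dequantize(x_q, S, Z):.4f}")
```

Inside the range, the round-trip error is at most half a step (S / 2); values outside [alpha, beta] saturate at the nearest end of the integer range.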
This affine scheme generalizes to symmetric quantization when the floating-point range centers around zero. Symmetric quantization enforces Z = 0, simplifying computation by eliminating zero-point adjustments.[7]
| Feature | Symmetric quantization | Asymmetric (affine) quantization |
|---|---|---|
| Real value range | [-alpha, alpha] (centered at 0) | [min, max] (not necessarily centered) |
| Zero-point (Z) | Fixed at 0 | Calculated integer value |
| Mapping formula | x = S × x_q | x = S × (x_q - Z) |
| Pros | Computationally faster, simpler | More flexible, better represents skewed data |
| Cons | May waste integer range if data not zero-centered | Slightly more computational overhead |
| Typical use | Weights (usually zero-centered) | Activations (often skewed, e.g. post-ReLU) |
Symmetric quantization assumes a symmetric range around zero (for example [-127, 127] for INT8), setting Z = 0. For INT8 symmetric quantization, the range becomes [-alpha, alpha] mapped to [-127, 127], deliberately excluding -128 to maintain perfect symmetry. This choice sacrifices one quantization level but enables computational speedups by removing the zero-point subtraction from the dequantization formula, reducing it to x = S × x_q.[1][2]
Asymmetric (affine) quantization uses a non-zero Z to shift the range, better handling skewed distributions like activations after ReLU or similar functions, which have ranges [0, +max).[7]
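To make the trade-off concrete, the sketch below quantizes the same non-negative data with both schemes and compares the average round-trip error. The data (evenly spaced values in [0, 6]) and all function names are assumptions made for illustration; real activation distributions are less regular.

```python
Q_MIN, Q_MAX = -128, 127


def symmetric_params(values):
    """Symmetric scheme: [-a, a] maps to [-127, 127] and Z is fixed at 0."""
    a = max(abs(v) for v in values)
    return a / Q_MAX, 0


def asymmetric_params(values):
    """Affine scheme: [min, max] maps to [-128, 127] with a computed Z."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (Q_MAX - Q_MIN)
    return scale, round(Q_MIN - lo / scale)


def mean_abs_error(values, scale, zero_point):
    """Average |x - dequantize(quantize(x))| over the values."""
    total = 0.0
    for x in values:
        q = max(Q_MIN, min(Q_MAX, round(x / scale + zero_point)))
        total += abs(x - scale * (q - zero_point))
    return total / len(values)


# Hypothetical post-ReLU activations: non-negative, spread over [0, 6].
acts = [i * 0.01 for i in range(601)]

s_sym, z_sym = symmetric_params(acts)
s_asym, z_asym = asymmetric_params(acts)
err_sym = mean_abs_error(acts, s_sym, z_sym)
err_asym = mean_abs_error(acts, s_asym, z_asym)

print(f"symmetric : step={s_sym:.5f}  Z={z_sym:4d}  mean error={err_sym:.5f}")
print(f"asymmetric: step={s_asym:.5f}  Z={z_asym:4d}  mean error={err_asym:.5f}")
```

Because the data never goes negative, the symmetric scheme leaves the negative half of its integer range unused, so its step size is roughly twice as large as the affine one.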
The granularity at which quantization parameters are computed significantly affects both accuracy and computational overhead. Three main levels of granularity exist:
| Granularity | Description | Accuracy | Overhead |
|---|---|---|---|
| Per-tensor | One scale and zero-point for the entire tensor | Lowest | Lowest |
| Per-channel | Separate parameters for each output channel | Higher | Moderate |
| Per-group | Separate parameters for groups of values within a channel | Highest | Highest |
Per-tensor quantization uses one scale and zero-point for an entire tensor (for example, all weights in a layer share the same quantization parameters). This is the simplest and fastest approach but can lose accuracy when value distributions vary across channels.[8][9]
Per-channel quantization (also called per-axis) allows each output channel (or each filter) in a layer to have its own scale and zero-point. Per-channel quantization is commonly used for convolutional and fully-connected layer weights because the distribution of weights can differ substantially between channels. Using a separate scale for each channel often yields better accuracy, since it adapts to each set of values more closely.[10]
Per-group quantization divides each channel into smaller groups (commonly 32, 64, or 128 values) and computes separate quantization parameters for each group. This approach has become especially important for LLM quantization methods like GPTQ, AWQ, and bitsandbytes. It offers the best accuracy preservation because it captures local variations within a channel, though it incurs additional memory overhead from storing more scale and zero-point values. The GGUF format, for example, uses super-blocks of 256 values subdivided into 8 sub-blocks of 32 values each, with quantization parameters at both levels.[11]
Activations are typically quantized per-tensor because their statistics can change with every input batch, making it less practical to have distinct parameters per channel.
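The effect of granularity can be sketched on a tiny weight matrix. Everything here is an illustrative assumption: the matrix values, the group size of 4 (far smaller than the 32 to 128 used in practice), symmetric absmax scaling, and FP16 storage for each scale with no zero-point.

```python
Q_MAX = 127  # symmetric INT8


def absmax_scale(values):
    """Symmetric scale so that the largest magnitude maps to Q_MAX."""
    return max(abs(v) for v in values) / Q_MAX


def quant_error(values, scale):
    """Sum of |v - dequantize(quantize(v))| for one shared scale."""
    total = 0.0
    for v in values:
        q = max(-Q_MAX, min(Q_MAX, round(v / scale)))
        total += abs(v - q * scale)
    return total


def effective_bits(weight_bits, group_size, scale_bits=16):
    """Bits per weight once one scale per group is stored alongside."""
    return weight_bits + scale_bits / group_size


# Hypothetical 2-channel weight matrix whose magnitudes vary by channel
# and by position within a channel.
W = [
    [0.01, -0.02, 0.015, -0.005, 0.5, -0.4, 0.45, -0.35],
    [2.0, -1.5, 1.0, -2.5, 0.02, 0.01, -0.03, 0.025],
]
GROUP = 4

flat = [v for row in W for v in row]
err_tensor = quant_error(flat, absmax_scale(flat))
err_channel = sum(quant_error(row, absmax_scale(row)) for row in W)
err_group = sum(
    quant_error(row[i:i + GROUP], absmax_scale(row[i:i + GROUP]))
    for row in W
    for i in range(0, len(row), GROUP)
)

print(f"per-tensor  error: {err_tensor:.4f}")
print(f"per-channel error: {err_channel:.4f}")
print(f"per-group   error: {err_group:.4f}")
print(f"4-bit weights, group of 128: {effective_bits(4, 128)} bits per weight")
```

Finer granularity lets small-magnitude regions use a smaller step, at the cost of storing more scales; `effective_bits` gives a rough feel for that storage overhead.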
Quantization can be applied to different parts of a neural network:
Weight quantization reduces the precision of the learned parameters stored in the model. Weights are static after training and their distributions are known ahead of time, making them relatively straightforward to quantize. Weight-only quantization is the most common approach for LLMs, since it directly reduces model size and memory bandwidth requirements during inference.
Activation quantization reduces the precision of the intermediate outputs computed during the forward pass. Activations are input-dependent and their ranges can vary significantly across different inputs, making them harder to quantize accurately. Activation quantization requires either calibration (static quantization) or runtime range computation (dynamic quantization).
Weight-and-activation quantization (e.g., W8A8) quantizes both components, enabling fully integer arithmetic during inference. This provides the greatest speedup but demands careful handling of activation distributions.
Calibration determines the ranges for activations in static quantization:
The foundational concepts of neural network quantization emerged in the early 1990s when researchers first explored converting floating-point parameters to low-precision datatypes. Balzer and colleagues published pioneering work on weight quantization for Boltzmann machines in 1991, while Choudry introduced "continuous-discrete learning" that applied quantization during training. These early efforts remained largely academic curiosities until the 2010s, when the success of deep learning on ImageNet reignited interest in compression techniques.[13]
The 2015 publication of BinaryConnect by Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David marked the breakthrough moment for modern quantization research. Released on November 2, 2015, this paper demonstrated that convolutional neural networks could train with binary weights during forward and backward propagation, achieving near state-of-the-art results on MNIST, CIFAR-10, and SVHN benchmarks. BinaryConnect helped popularize the Straight-Through Estimator (STE), a technique for handling non-differentiable quantization functions during backpropagation by approximating the gradient as the identity function. This method became the de facto standard for training quantized networks.[13]
The momentum continued with BinaryNet in February 2016, extending binarization to both weights and activations. By constraining all values to {-1, +1}, BinaryNet achieved a remarkable 31.3x memory footprint reduction compared to 32-bit floating-point while maintaining acceptable accuracy.[14] Han and colleagues' Deep Compression work in 2015 combined pruning, quantization, and Huffman coding, demonstrating that AlexNet and VGG could be compressed by 35x and 49x respectively without accuracy loss.[15]
The field matured significantly between 2017 and 2021 as researchers systematically explored quantization's capabilities and limitations. Two comprehensive surveys published in 2021, one by Gholami and colleagues at UC Berkeley and another white paper by Qualcomm AI Research, consolidated understanding of post-training quantization and quantization-aware training methodologies.[16][12] These works established that 8-bit quantization typically incurs less than 1% accuracy loss for convolutional networks, while lower bit-widths require careful quantization-aware training to maintain performance.
The explosion of large language models in 2022-2023 catalyzed a quantization revolution focused specifically on transformer architectures. LLM.int8() by Tim Dettmers (August 2022) pioneered mixed-precision decomposition for handling outlier features in attention mechanisms, enabling 8-bit inference for models exceeding 30 billion parameters on single GPUs. This work revealed that transformer models develop extreme outlier activations (magnitude 100x larger than typical values) in specific feature dimensions, requiring special treatment for successful quantization.[17]
GPTQ followed in October 2022, applying layer-wise post-training quantization based on approximate second-order information. GPTQ successfully quantizes language models to 4-bit, 3-bit, and even 2-bit precision using Hessian-based error minimization, redistributing each weight's quantization error onto the not-yet-quantized weights of the same layer.[18] QLoRA emerged in May 2023, combining 4-bit quantization with Low-Rank Adapters to enable fine-tuning of 65 billion parameter models on single 48GB GPUs while preserving full 16-bit task performance. QLoRA introduced NormalFloat4 (NF4), an information-theoretically optimal quantization format for normally distributed weights, along with double quantization to reduce memory overhead by quantizing the quantization constants themselves.[19]
AWQ (Activation-aware Weight Quantization) appeared in June 2023, earning the MLSys 2024 Best Paper Award for its innovation in protecting salient weights based on activation distributions.[20] SmoothQuant, published at ICML 2023, introduced a mathematically equivalent transformation to migrate quantization difficulty from activations to weights, enabling efficient W8A8 quantization of LLMs up to 530 billion parameters.[21]
The year 2024 marked the arrival of 1-bit large language models with Microsoft Research's BitNet series. BitNet b1.58, published February 27, 2024, demonstrated that every parameter in a large language model could be constrained to ternary values {-1, 0, +1} (effectively 1.58 bits per parameter) while matching full-precision performance on perplexity and downstream tasks.[22]
Quantization strategies divide into two fundamental approaches based on timing: post-training quantization applies to already-trained models, while quantization-aware training incorporates quantization effects during the training process.
Post-Training Quantization converts a model's weights and/or activations to lower precision after the model has been fully trained in high precision.[2][12] PTQ is widely used due to its simplicity and speed; it does not require retraining the model or having access to the original training dataset and pipeline. However, because the model was not trained with quantization in mind, PTQ can lead to a more significant drop in accuracy compared to QAT, especially when quantizing to very low bit-widths (below 8 bits).[23]
In dynamic quantization (also known as "dynamic range quantization"), the model's weights are quantized offline, but the activations are quantized on the fly during inference.[2][1] For each input fed to the model, the range (min/max values) of the activation tensors is calculated at runtime. These dynamic ranges are then used to compute the quantization parameters for the activations for that specific inference pass.
The main advantage of dynamic quantization is its flexibility and robustness. It does not require a calibration dataset and can adapt to varying input data distributions, which often results in higher accuracy than static quantization, particularly for models like LLMs where the activation ranges can vary dramatically depending on the input prompt.[24] The trade-off is performance: the runtime computation of activation ranges introduces computational overhead, making inference slower than with static quantization.
In static quantization, both the model's weights and its activations are quantized offline, before inference begins.[24][8] While the range of the weights is known from the trained model, the range of the activations is input-dependent and must be estimated. This is achieved through a calibration step. During calibration, a small but representative dataset (typically a few hundred samples) is passed through the floating-point model, and observers record the statistical distribution of the activations at each layer.[1][10]
The primary advantage of static quantization is its high inference speed. Since all quantization parameters are pre-computed, the entire inference process can be executed using highly efficient integer-only arithmetic, with no runtime overhead for calculating scales or zero-points. This makes it ideal for latency-critical applications on edge devices where the input data distribution is relatively stable and predictable.
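The difference between the two post-training modes can be sketched with a simple min/max observer. This is a toy illustration under stated assumptions: `MinMaxObserver` and the sample values are hypothetical, real frameworks offer more robust range estimators, and a realistic calibration set would contain hundreds of samples rather than two.

```python
Q_MIN, Q_MAX = -128, 127


class MinMaxObserver:
    """Tracks the running min and max of every value it is shown."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def params(self):
        scale = (self.hi - self.lo) / (Q_MAX - Q_MIN)
        return scale, round(Q_MIN - self.lo / scale)


def quantize_dequantize(batch, scale, zero_point):
    """Round-trip a batch through the integer grid."""
    out = []
    for x in batch:
        q = max(Q_MIN, min(Q_MAX, round(x / scale + zero_point)))
        out.append(scale * (q - zero_point))
    return out


# Static: ranges are fixed offline from a (tiny, hypothetical) calibration set.
observer = MinMaxObserver()
for calib_batch in ([-1.0, 0.5, 2.0], [-0.5, 1.5, 0.0]):
    observer.observe(calib_batch)
static_scale, static_zp = observer.params()

# An inference-time input containing a value the calibration set never saw.
request = [0.2, 5.0, -0.7]
static_out = quantize_dequantize(request, static_scale, static_zp)

# Dynamic: the range is recomputed from the input itself at runtime.
runtime_observer = MinMaxObserver()
runtime_observer.observe(request)
dynamic_out = quantize_dequantize(request, *runtime_observer.params())

print("static :", [round(v, 3) for v in static_out])
print("dynamic:", [round(v, 3) for v in dynamic_out])
```

With the static parameters, the unseen value 5.0 saturates at the calibrated maximum of 2.0, which is the sensitivity to data shift noted above. The dynamic path represents it closely but pays for a min/max pass on every input.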
| Feature | Static PTQ | Dynamic PTQ |
|---|---|---|
| Weights quantization | Offline (pre-computed) | Offline (pre-computed) |
| Activations quantization | Offline (using calibration data) | On-the-fly (at runtime) |
| Calibration required? | Yes, needs representative dataset | No |
| Inference speed | Very fast (integer-only arithmetic) | Slower (runtime overhead) |
| Accuracy | Good, but sensitive to data shifts | Often higher, more robust |
| Ideal use case | Edge devices, CNNs for vision | Server-side LLMs with varied inputs |
Quantization-Aware Training integrates the quantization process directly into the model training or fine-tuning phase.[25][26] While more complex and computationally intensive than PTQ, QAT generally achieves higher accuracy, often recovering nearly all of the performance of the original floating-point model.
The core mechanism of QAT is the simulation of low-precision arithmetic during training. This is accomplished by inserting "fake quantization" (or quantize-dequantize) nodes into the model's computation graph, typically after layers that produce weights and activations.[26][9]
In the forward pass of training, these nodes perform a three-step operation:
The resulting FP32 tensor now carries the "imprint" of quantization error. This error-injected tensor is then passed to the next layer. By doing this, the model's loss function is directly exposed to the effects of quantization throughout the training process. This forces the optimization algorithm (e.g. SGD, Adam) to find a set of weights that is not only good at the task but also robust to the noise and reduced precision of the quantized domain.[26]
A critical challenge in QAT is that the rounding operation inherent in quantization is non-differentiable. Its derivative is zero almost everywhere, which would block the flow of gradients during backpropagation and halt the training process.[25][9]
To overcome this, QAT relies on the Straight-Through Estimator (STE). The STE is an approximation for the gradient of the non-differentiable quantization function. During the backward pass, the STE treats the quantization node as an identity function, passing the gradient from its output directly to its input without modification.[26][27] In essence, while the forward pass sees the effects of quantization, the backward pass "looks through" the problematic rounding operation, allowing gradients to flow and the model's full-precision weights to be updated effectively.
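A one-parameter toy problem shows how fake quantization and the STE interact. The setup is entirely hypothetical (a single weight, a single training sample, a coarse step of 0.25, plain SGD) and is only meant to illustrate the mechanism, not any framework's implementation.

```python
SCALE = 0.25          # hypothetical quantization step
Q_MIN, Q_MAX = -128, 127


def fake_quantize(w, scale=SCALE):
    """Quantize then immediately dequantize: the result is a float on the grid."""
    q = max(Q_MIN, min(Q_MAX, round(w / scale)))
    return q * scale


def train(steps=20, lr=0.1):
    """Fit w so that fake_quantize(w) * x matches y, using the STE."""
    w = 0.0           # latent full-precision weight that the optimizer updates
    x, y = 1.0, 0.5   # single toy training sample
    for _ in range(steps):
        w_q = fake_quantize(w)             # forward pass sees the quantized weight
        pred = w_q * x
        grad_w_q = 2.0 * (pred - y) * x    # d(loss)/d(w_q) for squared error
        grad_w = grad_w_q                  # STE: treat d(w_q)/d(w) as 1
        w -= lr * grad_w
    return w, fake_quantize(w)


latent_w, quantized_w = train()
print(f"latent weight    : {latent_w:.3f}")
print(f"quantized weight : {quantized_w:.3f}")
```

The true derivative of the rounding step is zero almost everywhere, so without the STE line the latent weight would never move from its initial value. With it, the latent weight drifts until its quantized value reaches the target, after which the gradient vanishes.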
| Aspect | PTQ | QAT |
|---|---|---|
| Complexity | Low; simple to apply | High; requires retraining/fine-tuning |
| Data requirement | Small calibration set or none | Requires training dataset |
| Computational cost | Low; fast conversion | High; significant compute for fine-tuning |
| Model accuracy | Good, but can degrade at <8 bits | Excellent; near-original accuracy |
| When to use | Rapid deployment, no training access | When accuracy is paramount, low bit-widths |
INT8 quantization represents the production standard, offering 4x model size reduction and 2-4x inference speedup with typically less than 1% accuracy loss.[28][29] Signed 8-bit integers span the range [-128, 127] and unsigned ones span [0, 255]; either way, 256 discrete levels are available to represent continuous floating-point values. Hardware support for INT8 operations is nearly universal across modern processors: NVIDIA Tensor Cores accelerate INT8 matrix multiplication on Turing and newer architectures, Intel CPUs provide optimized VNNI instructions, ARM processors include INT8 NEON extensions, and Google's Edge TPU executes exclusively in INT8.[7]
Intel's comprehensive study of 69 models on x86 CPUs demonstrates INT8 quantization achieving 2.97x geometric mean speedup compared to FP32, with individual models like MobileNetV2 reaching 3.94x speedup at batch size 64. ResNet-50 demonstrates typical INT8 characteristics: accuracy drops from 70.07% to 69.85% (0.22% loss), while inference accelerates by 1.59-1.65x on CPUs and approximately 2x on GPUs.[7]
INT4 quantization occupies the frontier of practical deployment for large language models. 4-bit precision achieves 8x model size reduction relative to FP32: the weights of a 70 billion parameter model shrink from roughly 280 GB to about 35 GB, which brings such models within reach of a pair of 24 GB consumer GPUs or a single one with partial CPU offloading.[30] Methods like GPTQ and AWQ demonstrate that 4-bit quantization of LLM weights maintains high quality with proper calibration, typically achieving 98-99% accuracy recovery compared to full-precision baselines.
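The size figures follow from simple arithmetic on the parameter count. The helper below is a rough sketch that counts weights only; it deliberately ignores the scales and zero-points stored alongside quantized weights, as well as activations and the KV cache, so real footprints are somewhat larger.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 70e9  # a 70 billion parameter model

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    size = weight_memory_gb(NUM_PARAMS, bits)
    ratio = 32 / bits
    print(f"{name}: {size:6.1f} GB  ({ratio:.0f}x smaller than FP32)")
```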
Pushing below 4 bits presents escalating challenges. INT2 quantization compresses models by 16x but frequently causes substantial accuracy degradation without sophisticated techniques. Research shows 2-bit GPTQ quantization of LLaMA-65B decreases LAMBADA accuracy from 79% to 57%, with mathematical reasoning particularly affected (suffering up to 32.39% accuracy loss).[18] Recent innovations like Vector Post-Training Quantization (VPTQ) achieve 95% accuracy preservation at 2 bits through vector-wise quantization and advanced codebook optimization.[31]
Binary neural networks (BNNs) and Ternary neural networks (TNNs) represent the extreme end of the precision spectrum. BinaryNet demonstrated feasibility in 2016 by constraining weights and activations to {-1, +1}, achieving 32x compression and replacing multiply-accumulate operations with XNOR and bit-counting.[14]
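The XNOR and bit-counting trick can be sketched as follows. The encoding (bit set for +1, clear for -1) and the helper names are assumptions for illustration; production kernels operate on packed machine words with hardware popcount instructions rather than Python integers.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int; bit i is set when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors using XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")     # popcount
    # Each match contributes +1 and each mismatch -1.
    return 2 * matches - n


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * y for x, y in zip(a, b))
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))
print(reference, fast)
```

The identity used is dot = matches - mismatches = 2 × matches - n, so no multiplications are needed.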
BitNet b1.58 uses ternary weights {-1, 0, +1} with 8-bit activations, achieving approximately 21x compression.[22] The publicly released BitNet b1.58 2B4T model occupies merely 0.4GB compared to 4-5GB for full-precision 2 billion parameter models. Specialized inference frameworks like bitnet.cpp achieve 1.37-6.17x speedup on CPUs with 55-82% energy reduction compared to full-precision inference.[32]
Ternary neural networks represent a compromise between binary and higher precision. Weights are constrained to three values, typically {-W, 0, +W}, where W is a learnable, layer-specific scaling factor.[33] The inclusion of an explicit zero state introduces sparsity into the weight matrices, which can be exploited by hardware to skip computations involving zero, leading to further energy savings.
FP16 (16-bit floating-point) and BF16 (Brain Floating Point 16) provide 2x compression compared to FP32 while maintaining floating-point representation benefits. FP16 uses 1 sign bit, 5 exponent bits, and 10 mantissa bits, offering higher precision than BF16 but a smaller dynamic range. BF16 uses 1 sign bit, the same 8-bit exponent as FP32, and 7 mantissa bits, providing wider dynamic range than FP16, which makes it more stable for training large models and less prone to underflow/overflow issues.[7]
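The range and precision trade-off follows directly from the bit layouts. The sketch below assumes an IEEE 754-style encoding (biased exponent, implicit leading 1, all-ones exponent reserved for infinities and NaN), which holds for FP32, FP16, and BF16; the function name is hypothetical.

```python
def float_format_stats(exponent_bits, mantissa_bits):
    """Largest finite value, smallest normal value, and relative step size."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_normal = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -mantissa_bits   # gap between 1.0 and the next value
    return max_normal, min_normal, epsilon


formats = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}
for name, (e_bits, m_bits) in formats.items():
    max_n, min_n, eps = float_format_stats(e_bits, m_bits)
    print(f"{name}: max={max_n:.4g}  min normal={min_n:.4g}  epsilon={eps:.3g}")
```

FP16 tops out at 65504 but resolves steps of about 0.001 near 1.0, whereas BF16 reaches roughly 3.4 × 10^38 with steps of about 0.008.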
FP8 (8-bit floating-point) formats have emerged as strong alternatives to INT8 for transformer models. The two common encodings, E5M2 (5-bit exponent, 2-bit mantissa) and E4M3 (4-bit exponent, 3-bit mantissa), provide wider dynamic range than 8-bit integers, better handling the heavy-tailed activation distributions characteristic of large language models.[34] NVIDIA's H100 GPUs with Hopper architecture provide native FP8 support, achieving 1.95x speedup for Stable Diffusion XL compared to FP16.
| Format | Bit width | Description | Typical use cases |
|---|---|---|---|
| FP32 | 32 | Full-precision floating-point; training baseline | High-accuracy scenarios, training |
| FP16 | 16 | Half-precision floating-point; 5 exponent bits, 10 mantissa bits | GPU inference, mixed-precision training |
| BF16 | 16 | Brain Floating Point; wider exponent for stability | Training large models, TPUs |
| FP8 | 8 | E5M2 or E4M3; floating-point with reduced precision | LLM inference on Hopper GPUs |
| INT8 | 8 | 8-bit integer; industry standard | Mobile, edge devices, production inference |
| INT4 | 4 | 4-bit integer; aggressive compression | LLMs on consumer GPUs |
| INT2 | 2 | 2-bit integer; extreme compression | Research, specialized applications |
| NF4 | 4 | NormalFloat4; optimal for normal distributions | QLoRA, LLM fine-tuning |
| FP4 | 4 | 4-bit floating-point; supported on Blackwell GPUs | Next-generation inference |
| Ternary | ~1.58 | Values in {-1, 0, +1} | BitNet, ultra-efficient inference |
| Binary | 1 | Values in {-1, +1} | Extreme edge deployment |
Mixed-precision quantization assigns different bit-widths to different layers or components based on sensitivity analysis. Hessian-based methods compute second-order information to identify layers where quantization most impacts the loss function, preserving higher precision for sensitive layers.[12] Common mixed-precision strategies include:
For vision transformers, research demonstrates that multi-head self-attention modules require higher precision than feed-forward networks, with projection layers being most sensitive and fully-connected layers tolerating aggressive quantization.[35]
Mixed-precision training is a related but distinct concept from mixed-precision quantization. It refers to the practice of using a mix of numerical precisions during the training process itself (not just for inference). The standard approach, introduced by Micikevicius et al. at NVIDIA in 2018, uses FP16 or BF16 arithmetic for most forward and backward pass computations while maintaining FP32 "master weights" for the parameter update step.[36]
The technique has three core components:
FP32 master weights: A primary copy of the model weights is kept in FP32. This ensures that weight updates, which involve small gradient values accumulated over time, do not suffer from the limited precision of 16-bit formats.
FP16/BF16 forward and backward passes: The forward pass and gradient computation use half-precision arithmetic. On NVIDIA GPUs, Tensor Cores can execute FP16 matrix operations at up to 16x the throughput of FP32 operations (for example, the A100 delivers approximately 312 TFLOPS in FP16/BF16 versus 19.5 TFLOPS in FP32).
Loss scaling: Because FP16 has a significantly smaller dynamic range than FP32, gradient values can underflow to zero during backpropagation. Loss scaling multiplies the computed loss by a large factor before the backward pass, shifting gradient magnitudes into FP16's representable range. After computing gradients, the scale factor is divided out before the weight update. Dynamic loss scaling automatically adjusts this factor during training.
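The underflow problem and its fix can be demonstrated with Python's built-in half-precision conversion. The gradient value and the scale factor of 1024 are hypothetical, and in real training the factor is applied to the loss so that the chain rule scales every gradient; here it is applied to a single gradient value for brevity.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8            # hypothetical gradient, below FP16's smallest subnormal
loss_scale = 1024.0    # hypothetical static loss scale

unscaled = to_fp16(grad)                # stored directly in FP16
scaled = to_fp16(grad * loss_scale)     # stored after scaling
recovered = scaled / loss_scale         # unscaled again in higher precision

print(f"FP16 without scaling : {unscaled}")
print(f"FP16 with scaling    : {scaled:.6e}")
print(f"recovered gradient   : {recovered:.6e}")
```

Without scaling the gradient is flushed to zero and the corresponding weight update is lost. With scaling it survives the FP16 round trip and is recovered to within a fraction of a percent before the FP32 master weights are updated.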
BF16 has largely supplanted FP16 for mixed-precision training on modern hardware. Because BF16 shares the same 8-bit exponent as FP32, it has a comparable dynamic range, which means loss scaling is often unnecessary or requires only simple static scaling. Google TPUs and NVIDIA Ampere (and later) GPUs natively support BF16 arithmetic. Mixed-precision training is not quantization in the strict sense (no conversion to integer types), but it represents a closely related form of precision reduction that is standard practice for training virtually all large-scale models today.
GPTQ applies optimal brain quantization principles in a one-shot, layer-wise manner. The algorithm processes each layer independently, using approximate second-order information derived from the Hessian matrix to determine optimal quantization.[18] For each column of weights, GPTQ minimizes reconstruction error by computing the optimal quantized weight that minimizes the squared difference between original and quantized layer outputs, considering the Hessian H = 2X^T X. The elegant aspect lies in error redistribution: quantization errors from earlier weights inform adjustments to later weights in the same row, minimizing accumulated error across the entire layer.
GPTQ's computational efficiency stems from avoiding gradient-based optimization, instead relying on closed-form updates. Quantizing the OPT-175B model takes approximately 4 GPU-hours, making the technique practical for the largest publicly available models. The method supports aggressive quantization to 4-bit, 3-bit, and 2-bit precision, though quality degrades significantly below 3 bits without additional techniques like outlier preservation. Production deployments leverage optimized kernels like ExLlama, achieving 2x inference speedup while reducing memory by 4x for 4-bit quantization.
AWQ introduces the key insight that not all weights contribute equally to model performance. Analyzing activation distributions reveals that approximately 1% of weights, those corresponding to salient activation channels, disproportionately affect model outputs.[20] AWQ protects these salient weights by applying per-channel scaling factors that amplify important weight magnitudes before uniform quantization.
The crucial innovation is that scaling does not require mixed-precision arithmetic during inference; instead, the inverse scales applied to the activations can be folded into the preceding operator (for example, a normalization layer). AWQ optimizes scaling factors to minimize quantization error weighted by activation magnitude. This activation-aware approach provides multiple benefits: faster inference than GPTQ (2.7x speedup on RTX 4090), superior accuracy preservation for instruction-tuned models, reduced calibration data requirements (as few as 128 tokens), and better generalization to multi-modal architectures.
SmoothQuant, published at ICML 2023 by Xiao et al., addresses the fundamental challenge that activations in large language models are much harder to quantize than weights due to systematic outliers.[21] While weight distributions tend to be relatively uniform and flat, activation distributions develop extreme outliers (approximately 100x larger than typical values) that make per-tensor activation quantization impractical.
The core idea is a mathematically equivalent per-channel scaling transformation that migrates quantization difficulty from activations to weights. For a linear layer computing Y = XW, SmoothQuant introduces a per-channel scaling vector s, applied as the diagonal matrix diag(s), and reformulates the computation as:
Y = (X diag(s)^-1) (diag(s) W)
By dividing each activation channel by its corresponding scale factor and multiplying the corresponding weight channel by the same factor, the transformation smooths out activation outliers while making weights slightly harder to quantize. The net effect is that both weights and activations become easy to quantize, enabling efficient W8A8 (8-bit weight, 8-bit activation) quantization for the entire model.
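The equivalence can be checked numerically on a small example. The matrices below are made up, with one deliberately exaggerated outlier channel. The rule used to pick the scales, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5, is the migration-strength formula from the SmoothQuant paper and is stated here as an assumption, since the text above does not spell it out.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smooth(X, W, alpha=0.5):
    """Return (X diag(s)^-1, diag(s) W, s) for per-input-channel scales s."""
    n_channels = len(W)
    s = []
    for j in range(n_channels):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        s.append(act_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / s[j] for j in range(n_channels)] for row in X]
    W_s = [[w * s[j] for w in W[j]] for j in range(n_channels)]
    return X_s, W_s, s


def abs_max(M):
    return max(abs(v) for row in M for v in row)


# Hypothetical activations (2 tokens x 3 channels); channel 1 is an outlier.
X = [[0.5, 80.0, -0.3],
     [-0.2, -60.0, 0.4]]
# Hypothetical weights (3 input channels x 2 outputs).
W = [[0.2, -0.1],
     [0.05, 0.1],
     [-0.3, 0.25]]

X_s, W_s, s = smooth(X, W)
Y, Y_s = matmul(X, W), matmul(X_s, W_s)

print("activation max before/after:", abs_max(X), round(abs_max(X_s), 3))
print("weight max before/after    :", abs_max(W), round(abs_max(W_s), 3))
print("outputs match:", all(abs(a - b) < 1e-9 for ra, rb in zip(Y, Y_s) for a, b in zip(ra, rb)))
```

The output is unchanged, while the activation range shrinks from 80 to under 3 and the weight range grows by a comparable factor, so both tensors end up with ranges that an 8-bit grid covers well.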
SmoothQuant achieves up to 1.56x speedup and 2x memory reduction for LLMs with negligible accuracy loss, and has been demonstrated on models up to 530 billion parameters (including OPT, BLOOM, GLM, and LLaMA families). The scaling factors are computed offline using a small calibration dataset, making SmoothQuant a simple, training-free technique. It has been widely integrated into inference frameworks including NVIDIA TensorRT-LLM, Intel Neural Compressor, and vLLM.
GGUF (GPT-Generated Unified Format) and the llama.cpp framework target CPU-first inference with optional GPU offloading. GGUF represents a quantized model serialization format that stores weights, metadata, and quantization parameters in an efficient binary representation. The format supports block-wise quantization strategies: super-blocks of 256 values subdivide into 8 sub-blocks of 32 values each, with quantization parameters at both levels enabling fine-grained adaptation to local statistics.[11]
GGUF quality levels span from Q2_K (2.63 bits per weight) through Q8_0 (8.5 bits), allowing users to select accuracy-efficiency trade-offs based on deployment constraints. The llama.cpp inference engine provides CPU implementations optimized for diverse architectures: x86 with AVX, AVX2, and AVX-512 instructions; Apple Silicon leveraging Metal and Accelerate framework; ARM processors using NEON; and GPU backends for CUDA and ROCm.
Performance benchmarks show 100 billion parameter models executing on a single CPU at human reading speed (5-7 tokens/second), democratizing LLM access for researchers and developers without access to high-end GPUs. GGUF's K-quant variants build on the super-block layout described above: the per-block scales and minimums are themselves quantized to a few bits, which keeps the metadata overhead low at a given bit-width.
BitNet and BitNet b1.58 fundamentally differ from post-training methods by training models natively at low precision. The architecture replaces standard Linear layers with BitLinear layers that constrain weights during forward propagation.[22] For BitNet b1.58, weights are quantized to {-1, 0, +1} using a rounding function normalized by the average absolute weight value (absmean). Activations undergo 8-bit quantization with absmax-based scaling, where the largest absolute activation sets the scale. The critical innovation enabling training involves maintaining high-precision weights for gradient accumulation while simulating low-precision arithmetic during forward and backward passes.
BitNet's computational model replaces floating-point multiplications with integer additions, dramatically reducing arithmetic complexity. Matrix multiplication W x X with ternary weights decomposes to selective addition and subtraction based on weight masks identifying positive and negative positions. This formulation requires only additions and subtractions, enabling specialized hardware implementations. The bitnet.cpp framework provides optimized kernels achieving 1.37-6.17x CPU speedup and 55-82% energy reduction compared to full-precision inference, with particularly strong performance on ARM architectures.[32]
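The decomposition into additions and subtractions can be sketched as follows. This is an illustrative pure-Python loop, not the bitnet.cpp kernel, which operates on packed weights with vectorized instructions.

```python
def ternary_matvec(ternary_rows, x):
    """Compute W @ x for W with entries in {-1, 0, +1} without multiplying."""
    out = []
    for row in ternary_rows:
        acc = 0
        for w, value in zip(row, x):
            if w == 1:
                acc += value   # positive mask: add the input
            elif w == -1:
                acc -= value   # negative mask: subtract the input
            # w == 0 contributes nothing and is skipped
        out.append(acc)
    return out


W = [[1, 0, -1, 1],
     [-1, -1, 0, 1]]
x = [3, 5, 2, 7]
y = ternary_matvec(W, x)

# Reference result using ordinary multiply-accumulate, for comparison.
reference = [sum(w * v for w, v in zip(row, x)) for row in W]
```

Both paths produce the same vector; the ternary path simply never issues a multiplication.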
As large language models process longer sequences, the key-value (KV) cache used in transformer attention mechanisms grows to consume significant GPU memory. For a model like LLaMA-2-70B processing a 128K token context, the KV cache can require tens of gigabytes per sequence in FP16 precision, and with several concurrent sequences it can exceed the memory consumed by the model weights themselves. KV-cache quantization compresses these cached key and value tensors to lower precision, enabling longer context lengths within the same memory budget.[37]
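The size of the cache can be estimated from the model shape. The sketch below assumes the published Llama-2-70B configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128); the function name is illustrative, and the estimate ignores the scales and zero-points that a quantized cache would also store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value, batch_size=1):
    """Bytes needed to cache keys and values for every layer and token."""
    # The factor 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


GIB = 1024 ** 3
SEQ_LEN = 128 * 1024  # 128K tokens

# Assumed Llama-2-70B shape: 80 layers, 8 KV heads, head_dim 128.
fp16_gib = kv_cache_bytes(80, 8, 128, SEQ_LEN, bytes_per_value=2) / GIB
int4_gib = kv_cache_bytes(80, 8, 128, SEQ_LEN, bytes_per_value=0.5) / GIB
```

Under these assumptions a single 128K-token sequence needs about 40 GiB in FP16 and about 10 GiB at 4 bits, and the total grows linearly with batch size.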
Several approaches have emerged for KV-cache quantization:
Uniform quantization applies standard INT8 or INT4 quantization to all key and value tensors. This straightforward approach can introduce accuracy degradation, particularly for key tensors whose values feed into the softmax attention computation.
Heterogeneous quantization stores keys and values at different precisions. Research has shown that keys are more sensitive to quantization than values because key quantization errors affect the shared softmax denominator, amplifying errors across all attention positions. Consequently, methods like LeanKV store keys at higher precision (e.g. 8 bits) and values at lower precision (e.g. 4 bits).
Per-channel, pre-RoPE key quantization quantizes key tensors before rotary positional encoding (RoPE) is applied. Because RoPE introduces channel-dependent rotations that can create outliers, quantizing before RoPE yields more uniform distributions and better accuracy.
KVQuant (NeurIPS 2024) combines sensitivity-weighted non-uniform quantization with per-channel key quantization to achieve under 0.1 perplexity degradation at 3-bit precision across LLaMA, Llama-2, Llama-3, and Mistral models. It enables serving LLaMA-7B with up to 1 million token context on a single A100-80GB GPU and up to 10 million tokens on an 8-GPU system.[37]
KV-cache quantization has become a critical optimization for production LLM serving systems. Frameworks like vLLM, TensorRT-LLM, and SGLang support KV-cache quantization as a built-in feature, typically offering INT8 and FP8 options for the KV cache independent of the weight quantization format.
Model size reduction represents quantization's most immediate benefit, with compression ratios directly proportional to bit-width reduction. FP32 to INT8 quantization achieves exactly 4x compression: a 7 billion parameter model decreases from 28GB to 7GB. INT4 quantization doubles this to 8x compression, fitting the same model in 3.5GB.[28]
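The arithmetic behind these figures is a one-line calculation. The helper below is an illustrative sketch that counts only the weights themselves and ignores per-block scales, zero-points, and any layers kept at higher precision, which is why real quantized checkpoints are slightly larger.

```python
def model_size_gb(num_params, bits_per_weight):
    """Weight storage in decimal gigabytes, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7e9  # 7 billion parameters
sizes = {
    name: model_size_gb(params, bits)
    for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]
}
compression_int8 = sizes["FP32"] / sizes["INT8"]
compression_int4 = sizes["FP32"] / sizes["INT4"]
```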
The landmark Deep Compression work demonstrated extreme ratios combining quantization with pruning and Huffman coding: AlexNet compressed 35x (240MB to 6.9MB) and VGG-16 compressed 49x (552MB to 11.3MB), both without accuracy loss.[15] For modern large language models, these reductions transform deployment feasibility. LLaMA 3.1 70B requires 140GB in FP16 but 70GB in INT8 and only 35GB in INT4, moving it from impractical toward feasible on consumer GPU hardware.
Inference speed improvements vary substantially based on hardware, model architecture, and bottleneck characteristics. Compute-bound operations, where arithmetic operations dominate execution time, benefit most from quantization's reduced operation complexity. NVIDIA reports 1.8x speedup for W8A8-INT quantization on A100 GPUs, while INT8 operations on Tensor Cores can theoretically reach 4x speedup over FP32.[38]
TensorRT-optimized Stable Diffusion XL achieves 1.72x speedup with INT8 and 1.95x with FP8 on RTX 6000 Ada GPUs. CPU implementations show equally impressive gains: Intel's optimized INT8 operators deliver 2.97x geometric mean speedup across 69 models on x86 processors.[7]
Memory bandwidth-bound operations gain from reduced data movement. Modern GPU inference often stalls waiting for data transfer rather than computation, particularly for large models where weights cannot fit in high-speed cache. A 4x reduction in model size enables 4x more weights to occupy L2 cache, dramatically reducing expensive DRAM accesses.
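A common back-of-the-envelope bound makes the bandwidth argument concrete. The sketch assumes single-sequence autoregressive decoding in which every generated token streams all weights from memory once, and it ignores KV-cache traffic, activations, and compute time; the 1000 GB/s bandwidth figure is hypothetical.

```python
def max_decode_tokens_per_second(num_params, bits_per_weight, bandwidth_gb_per_s):
    """Upper bound on tokens/s when each token must read all weights once."""
    weight_gb = num_params * bits_per_weight / 8 / 1e9
    return bandwidth_gb_per_s / weight_gb


# Hypothetical accelerator with 1000 GB/s of memory bandwidth, 7B-parameter model.
fp16_bound = max_decode_tokens_per_second(7e9, 16, 1000)
int4_bound = max_decode_tokens_per_second(7e9, 4, 1000)
speedup_bound = int4_bound / fp16_bound
```

Under this simplified model, cutting the weight footprint by 4x raises the attainable decode rate by the same factor, which is why weight-only quantization helps even when arithmetic is still performed in FP16.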
Accuracy preservation depends critically on bit-width, quantization method, and model characteristics. Large-scale evaluations report that 8-bit W8A8-INT quantization achieves over 99% accuracy recovery across the benchmarks tested, essentially matching full-precision performance. 4-bit W4A16 quantization maintains 98.9% accuracy recovery for code generation, 98.5% for mathematics, and similarly high retention across diverse tasks.[39]
Below 4 bits, accuracy degradation accelerates unless sophisticated techniques intervene. Standard 3-bit quantization shows noticeable degradation, with model capacity beginning to deteriorate. Instruction-tuned LLMs generally maintain performance comparable to their non-quantized counterparts at 8-bit and 4-bit precisions, with 4-bit quantized models often showing similar performance to their BFloat16 equivalents. A significant performance drop is observed when LLMs are quantized to 3-bit or lower, particularly with GPTQ.[40]
Energy efficiency gains prove crucial for edge deployment and environmental sustainability. Manufacturing edge AI case studies document dramatic improvements: hardware costs decreased 92% (from $225,000 for 50 GPU cards to $18,000 for 4 cards) while energy consumption dropped 65-80% through INT8 quantization.[41]
Mobile edge computing research shows up to 40% overall energy reduction, combining computational savings with reduced data transmission. BitNet models demonstrate 55-82% energy savings on CPUs compared to full-precision baselines.[32] The bitnet.cpp framework running the 2B4T model achieves remarkable edge device performance: 11 tokens per second on Raspberry Pi 5 and 48 tokens per second on Snapdragon X Elite. These metrics transform deployment economics; models that previously required $10,000+ GPU infrastructure now execute on $100 single-board computers.
The effectiveness of quantized inference depends heavily on hardware support for low-precision arithmetic. Different hardware platforms support different precision formats, and this fragmentation influences which quantization methods are practical for a given deployment target.
| Hardware | INT8 | INT4 | FP8 | FP4 | BF16/FP16 | Notes |
|---|---|---|---|---|---|---|
| NVIDIA Volta/Turing (V100, T4) | Yes | Turing only | No | No | FP16 | Tensor Core INT8 and INT4 arrived with Turing; Volta Tensor Cores are FP16-only |
| NVIDIA Ampere (A100) | Yes | Yes | No | No | Both | INT4 Tensor Core operations |
| NVIDIA Hopper (H100) | Yes | No | Yes | No | Both | Native FP8 support; INT4 removed |
| NVIDIA Blackwell (B100/B200) | Yes | No | Yes | Yes | Both | FP4 Tensor Core support added |
| Intel CPUs (VNNI) | Yes | No | No | No | No | Optimized INT8 dot products via VNNI |
| Apple Neural Engine (A16+) | Yes | Yes (channel-wise) | No | No | FP16 | INT8 groupwise, INT4 channel-wise |
| Qualcomm Hexagon NPU | Yes | Yes | No | No | FP16 | Native INT4 doubles tensor throughput |
| Google TPU v4/v5 | Yes | Yes | No | No | BF16 | Optimized for BF16 training |
| ARM CPUs (NEON) | Yes | No | No | No | FP16 | INT8 via NEON extensions |
NVIDIA's evolution illustrates the shifting landscape of quantization hardware support. The Volta architecture (2017) introduced Tensor Cores, initially for FP16 matrix math, and Turing (2018) extended them to INT8 and INT4. Ampere (2020) carried INT8 and INT4 Tensor Core operations forward. However, the Hopper architecture (2022) removed INT4 support in favor of FP8, reflecting the industry's recognition that floating-point formats better handle the heavy-tailed distributions common in transformer models.[34] Blackwell (2024) added FP4 Tensor Core support, continuing the trend toward low-bit floating-point formats.
Apple's Neural Engine primarily performs INT8 x INT8 and FP16 x FP16 matrix multiplications. Inputs can optionally use INT4 or INT8 linear quantization, with INT8 supporting groupwise quantization and INT4 limited to channel-wise quantization.[42] Qualcomm's Hexagon NPU includes native INT4 support, doubling tensor throughput compared to INT8 for supported operations.[43]
The absence of industry-wide quantization standards forces developers to maintain multiple quantized model versions for different deployment targets. Framework interoperability remains imperfect: PyTorch, TensorFlow, and ONNX use different quantization schemes, complicating model conversion. Kernel optimization presents an additional challenge. Standard deep learning frameworks lack specialized implementations for many quantization schemes, particularly exotic formats like 1-bit or mixed-precision. BitNet models demonstrate this gap: running them through the standard transformers library can be slower than full-precision inference despite a theoretical 32x reduction in arithmetic complexity, because the software falls back to inefficient dequantization before every operation.[22]
Evaluating the quality of a quantized model requires careful measurement of accuracy degradation across multiple dimensions. The two most common evaluation approaches are perplexity measurement and downstream task accuracy.
Perplexity is the standard metric for evaluating quantized language models. It measures how well a model predicts the next token in a sequence, with lower values indicating better performance. Perplexity is typically computed on standard evaluation datasets such as WikiText-2 or C4. A well-quantized model should show minimal perplexity increase; for example, 8-bit quantization of LLaMA models typically increases WikiText-2 perplexity by less than 0.1 points.[18]
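Perplexity is the exponential of the average negative log-likelihood per token, which the following sketch computes from a list of per-token log-probabilities. The probability values are hypothetical and chosen only to show how a small drop in predicted probability appears as a perplexity increase.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Hypothetical probabilities each model assigns to the correct next token.
baseline_probs = [0.50, 0.25, 0.40, 0.10]
quantized_probs = [0.48, 0.24, 0.40, 0.09]

ppl_base = perplexity([math.log(p) for p in baseline_probs])
ppl_quant = perplexity([math.log(p) for p in quantized_probs])
delta = ppl_quant - ppl_base  # the degradation attributed to quantization
```

A model that assigns probability 0.25 to every correct token has a perplexity of exactly 4, which gives an intuition for the scale: perplexity behaves like an effective number of equally likely choices.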
Downstream task accuracy evaluates quantized models on specific benchmarks such as MMLU, HumanEval, GSM8K, and ARC. This approach captures whether the quantized model retains its capabilities on practical tasks. Research shows that perplexity is a reliable predictor of downstream performance: models with low perplexity degradation after quantization generally maintain their task accuracy as well.[40]
Calibration dataset sensitivity is an important evaluation consideration. The choice and size of calibration data for PTQ methods can significantly affect results. GPTQ and AWQ typically use 128-256 calibration samples from C4 or WikiText-2, and results can vary based on the overlap between calibration data and evaluation benchmarks.
A practical rule of thumb for acceptable degradation: a 1% drop in accuracy may be acceptable if it leads to 2x speedup and 4x reduction in model size, but a 10% drop is typically prohibitive. The acceptable level depends on the application's requirements and the efficiency benefits achieved.
Convolutional neural networks demonstrate strong quantization robustness, with deeper architectures generally more tolerant than compact efficient networks. ResNet family models quantize successfully to INT8 with minimal accuracy loss: ResNet-18 and ResNet-50 both maintain performance within 0.3% of full-precision baselines while gaining 1.59-1.65x CPU speedup and 4x size reduction.[7]
MobileNet architectures present greater challenges due to aggressive efficiency optimizations. MobileNetV1 and V2's depth-wise separable convolutions prove sensitive to quantization, with per-tensor INT8 quantization causing 4-5% accuracy degradation. However, per-channel quantization recovers most losses, and quantization-aware training further improves results.[8]
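The gap between per-tensor and per-channel quantization can be reproduced with a toy example. The two channels below are hypothetical and deliberately mismatched in range, a situation that can occur in depth-wise convolution weights; the helper names are illustrative.

```python
def absmax_scale(values):
    """Symmetric INT8 scale that maps the largest magnitude to 127."""
    return max(abs(v) for v in values) / 127.0


def quantize_dequantize(values, scale):
    """Symmetric INT8 round trip with a given scale."""
    return [scale * max(-127, min(127, round(v / scale))) for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Two hypothetical output channels with very different weight ranges.
channels = [
    [0.010, -0.020, 0.015, -0.005],  # small-magnitude channel
    [8.0, -6.5, 7.2, -3.1],          # large-magnitude channel
]

# Per-tensor: one scale shared by every channel.
tensor_scale = absmax_scale([w for ch in channels for w in ch])
per_tensor_err = [
    mean_abs_error(ch, quantize_dequantize(ch, tensor_scale)) for ch in channels
]

# Per-channel: each channel gets its own scale.
per_channel_err = [
    mean_abs_error(ch, quantize_dequantize(ch, absmax_scale(ch))) for ch in channels
]
```

With a shared scale the small channel rounds entirely to zero, losing all of its information, while a per-channel scale reconstructs it almost exactly. The large channel is unaffected because it determines the shared scale in both cases.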
Object detection models benefit from hybrid quantization strategies that apply different bit-widths to backbones versus detection heads. YOLOv8 quantization with pruning achieves 64.2% compression ratio with approximately 4% detection accuracy loss, maintaining real-time performance on edge devices.[44]
Transformer architectures and large language models dominate recent quantization research due to their scale and deployment challenges. BERT models quantize successfully to 8 bits with minimal degradation. Q8BERT achieves 4x compression maintaining accuracy, while FP8-BERT addresses outlier challenges through floating-point representation.[45]
LLM quantization has evolved into a sophisticated subfield. The LLaMA 3.1 family (8B, 70B, 405B parameters) serves as a benchmark for quantization methods, with comprehensive evaluations showing 8-bit models maintaining 99%+ baseline accuracy and 4-bit models retaining 98.9% on code generation.[39] Different quantization algorithms exhibit distinct strengths:
Generative models for images and video present unique quantization challenges. Stable Diffusion XL quantization with TensorRT achieves 1.72x INT8 speedup and 1.95x FP8 speedup while preserving image quality through percentile-based calibration that excludes extreme outliers.[38] The iterative generation process amplifies quantization error accumulation across diffusion steps, requiring careful validation that perceptual quality remains acceptable.
Quantization is the primary enabling technology for deploying AI models on edge devices with limited compute, memory, and power budgets. In autonomous vehicles, perception models that process data from cameras, LiDAR, and radar must run with extremely low latency to make real-time driving decisions. Quantization accelerates these critical models on the vehicle's embedded computing hardware.[29]
Edge deployment in robotics benefits from quantization's power efficiency. Battery-powered robots require energy-efficient inference to maximize operational time. INT8 quantization of object detection, semantic segmentation, and SLAM models enables real-time processing on embedded GPUs like NVIDIA Jetson with 2-3x speedup and 40-60% power reduction.[46] Mobile phones leverage the Apple Neural Engine, Qualcomm Hexagon NPU, and similar accelerators to run quantized models for on-device tasks including speech recognition, image processing, and language understanding.
PyTorch provides three quantization paradigms reflecting the evolution of the framework's capabilities. Eager Mode Quantization (beta status) offers manual control, requiring explicit fusion and quantization/dequantization module placement. FX Graph Mode Quantization (maintenance mode) automates fusion and quantization through graph-level transformations. PyTorch 2 Export Quantization (prototype) leverages torch.export to capture entire model graphs, enabling more sophisticated optimizations.[8]
The PyTorch quantization API centers on configuration objects (QConfig) specifying observer modules for calibration and fake-quantization modules for QAT. Backend support spans diverse hardware: x86 CPUs through fbgemm and onednn libraries, ARM processors via qnnpack and xnnpack, and prototype NVIDIA GPU support through TensorRT integration.
TensorFlow Model Optimization Toolkit offers quantization through two main pathways: post-training quantization and quantization-aware training. Post-training quantization provides weight-only, dynamic range, and full integer quantization options. Float16 quantization reduces model size by half while maintaining GPU acceleration benefits.[47]
Full integer quantization converts weights and activations to 8-bit integers, achieving maximum efficiency for Edge TPU, NNAPI, and mobile deployment. The toolkit's representative dataset concept enables static quantization calibration, where users provide a data generator function that yields calibration samples from which quantization parameters are derived.
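What static calibration derives from a representative dataset can be sketched without any framework. The example assumes a simple min/max observer and an asymmetric INT8 mapping; the generator and function names are hypothetical stand-ins, and this is not the toolkit's actual converter code.

```python
def calibrate_min_max(batches):
    """Track the running min and max over all calibration samples."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo, hi = min(lo, min(batch)), max(hi, max(batch))
    return lo, hi


def affine_int8_params(lo, hi, qmin=-128, qmax=127):
    """Scale and zero-point mapping [lo, hi] onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain 0.0
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def representative_batches():
    """Hypothetical stand-in for a user-supplied calibration generator."""
    yield [0.0, 0.5, 1.5]
    yield [-1.0, 2.0, 3.0]


lo, hi = calibrate_min_max(representative_batches())
scale, zero_point = affine_int8_params(lo, hi)
```

For the observed range [-1, 3] the zero-point lands at -64, so the real value 0.0 maps exactly to an integer code, and the range endpoints map to -128 and 127.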
ONNX Runtime bridges framework boundaries, accepting models from PyTorch, TensorFlow, and other frameworks in the standardized ONNX format. Quantization support encompasses dynamic, static, and quantization-aware training with two representation formats. QOperator format represents quantization at the operator level, directly replacing floating-point operators with quantized equivalents. QDQ (Quantize-DeQuantize) format explicitly inserts quantization and dequantization nodes in the graph, providing finer granularity and better hardware portability.[48]
NVIDIA TensorRT provides production-grade inference optimization for NVIDIA GPUs, with quantization as a core capability. The framework supports INT8 and INT4 quantization through post-training and quantization-aware training workflows, with recent additions including FP8 and FP4 support via TensorRT Model Optimizer.[38]
Layer fusion combines operations like convolution, activation, and batch normalization into single optimized kernels, reducing memory traffic and enabling better quantization by preserving higher precision in intermediate computations.
Hugging Face Transformers has become the de facto standard for LLM deployment, with native quantization support for multiple methods. The BitsAndBytesConfig class enables 4-bit and 8-bit quantization through simple parameter specifications during model loading.[1]
NormalFloat4 (NF4) quantization provides information-theoretically optimal 4-bit representation for normally distributed weights, while double quantization reduces memory further by quantizing the quantization constants. GPTQConfig and AwqConfig provide interfaces to advanced post-training quantization methods. Pre-quantized models on Hugging Face Hub load directly with appropriate configuration objects, enabling immediate deployment without local quantization. PEFT (Parameter-Efficient Fine-Tuning) integration allows training LoRA adapters on quantized base models, with QLoRA specifically designed for 4-bit quantization.[19]
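The saving from double quantization can be worked out per parameter. The numbers below assume the defaults described in the QLoRA paper (one 32-bit constant per block of 64 weights, re-quantized to 8 bits in groups of 256 with a 32-bit second-level constant); the function names are illustrative.

```python
def scale_overhead_bits(block_size, scale_bits):
    """Extra bits per weight spent on one quantization constant per block."""
    return scale_bits / block_size


def double_quant_overhead_bits(block_size, scale_bits, group_size, group_scale_bits):
    """Overhead when the constants themselves are quantized in groups."""
    first_level = scale_bits / block_size
    second_level = group_scale_bits / (block_size * group_size)
    return first_level + second_level


# Assumed QLoRA-style defaults, stated in the paragraph above this block.
plain = scale_overhead_bits(block_size=64, scale_bits=32)
double = double_quant_overhead_bits(
    block_size=64, scale_bits=8, group_size=256, group_scale_bits=32
)
saved = plain - double
```

Under these assumptions the overhead falls from 0.5 to roughly 0.127 bits per parameter, which for a 65 billion parameter model amounts to about 3 GB.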
Apple's Core ML Tools provide linear quantization with 8-bit and 4-bit support, per-tensor, per-channel, and per-block granularity, and specialized algorithms like GPTQ integration for sequential models.[42] Per-block quantization (iOS 18+, macOS 15+) proves particularly effective for 4-bit weights, with blocks of 16-64 values sharing quantization parameters. The Neural Engine accelerates per-channel INT8 and FP16 operations, GPUs handle per-block quantization efficiently, and CPUs provide flexible support.
Activation outliers in transformer models represent the most significant challenge for aggressive quantization. Models exceeding 6.7 billion parameters develop extreme outlier channels with magnitude 100x larger than typical activations, concentrating in specific feature dimensions corresponding to residual stream layers (query-key-value projections, attention outputs).[17] These structured outliers emerge early during pretraining and persist throughout model evolution, preventing accurate per-tensor quantization below 8 bits.
Current solutions adopt multiple strategies:
Accuracy degradation below 4 bits presents fundamental challenges distinct from outlier problems. Quantization to 3 bits and below causes systematic performance deterioration even with advanced methods. The information-theoretic perspective illuminates the fundamental constraint: 2-bit quantization provides only 4 discrete levels per parameter, insufficient to represent the rich parameter distributions learned during training.
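The effect of shrinking the number of levels can be observed on synthetic data. The sketch applies plain min/max uniform quantization to normally distributed values generated with a fixed seed; it is a simplified illustration that omits the grouping and error-compensation steps real methods use.

```python
import random


def uniform_quantize(values, bits):
    """Round each value to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + step * round((v - lo) / step) for v in values]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]

errors = {
    bits: mse(weights, uniform_quantize(weights, bits)) for bits in (8, 4, 3, 2)
}
distinct_2bit = len(set(uniform_quantize(weights, 2)))
```

Each bit removed roughly doubles the step size, so the mean squared error grows by about a factor of four per bit, and at 2 bits every weight is forced onto one of only four values.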
Recent innovations like VPTQ achieve 95% accuracy preservation at 2 bits through vector-wise quantization, grouping parameters into vectors and quantizing vectors jointly rather than independently, leveraging inter-parameter correlations to reduce error.[31]
Rotation-based quantization methods apply mathematical transformations to eliminate outliers through distributional reshaping rather than outlier preservation. QuaRot uses randomized Hadamard transformations exploiting rotational invariance, rotating weight matrices and activation distributions without changing computed outputs but redistributing outlier magnitude across channels.[50]
SpinQuant extends this concept with learned rotations optimized during fine-tuning to minimize quantization error. DuQuant combines rotation with zigzag permutation, redistributing outliers across the feature dimension to balance quantization difficulty.[51] These approaches enable 4-bit quantization of weights, activations, and KV cache simultaneously, something previously unattainable due to activation outliers.
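The invariance that rotation-based methods rely on can be checked numerically. The sketch below uses a plain orthonormal Hadamard matrix on a single activation vector and weight row with hypothetical values; QuaRot additionally randomizes the transform and fuses it into the model's weights, which is not shown here.

```python
import math


def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = math.sqrt(n)
    return [[v / norm for v in row] for row in h]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


H = hadamard(8)
x = [0.1, -0.2, 50.0, 0.3, -0.1, 0.2, 0.0, -0.3]  # one outlier channel
w = [0.5, -1.0, 0.02, 0.7, 0.3, -0.4, 0.9, -0.6]  # one weight row

x_rot = matvec(H, x)  # rotated activations
w_rot = matvec(H, w)  # weights rotated by the same matrix

original_output = dot(w, x)
rotated_output = dot(w_rot, x_rot)
peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in x_rot)
```

Because the matrix is orthogonal, the dot product is preserved up to floating-point error, while the single outlier of magnitude 50 is spread across all eight channels and the peak magnitude drops to under 20. A smaller peak means a smaller quantization step for the same bit-width.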
Safety and alignment preservation during quantization represents an emerging concern as models deploy broadly. Recent research reports that quantization can increase safety risks: in the cited evaluations, quantized models generated more harmful outputs than their full-precision counterparts when tested on adversarial prompts.[52] Whether alignment and capability exhibit different sensitivity to quantization remains an open question that requires further investigation.