SmoothQuant
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,371 words
SmoothQuant is a post-training quantization (PTQ) method that enables 8-bit weight, 8-bit activation (W8A8) integer inference for large language models without retraining and with negligible accuracy degradation.[^1] It was introduced in November 2022 by Guangxuan Xiao and Ji Lin of MIT, Mickael Seznec, Hao Wu, and Julien Demouth of NVIDIA, and Song Han of MIT. The technique addresses a long-standing obstacle in LLM serving: activation outliers emerge systematically in models above roughly 6.7 billion parameters and cause standard INT8 activation quantization to fail catastrophically.[^1][^2] SmoothQuant resolves this by introducing a per-channel scaling factor that mathematically migrates quantization difficulty from the hard-to-quantize activations to the easier-to-quantize weights, leaving the network output algebraically unchanged.[^1][^3] The work was published at the 40th International Conference on Machine Learning (ICML 2023) and has since been adopted by NVIDIA TensorRT-LLM, NVIDIA FasterTransformer, Intel Neural Compressor, Microsoft ONNX Runtime, Amazon SageMaker, and AMD's Composable Kernel library for Instinct MI300X.[^2][^4][^5]
Inference for very large transformer models is bottlenecked by memory bandwidth and weight storage. Quantizing weights and activations from FP16 to INT8 halves memory traffic and, on hardware with INT8 tensor cores, can roughly double arithmetic throughput.[^1] Two prior approaches struggled with the largest open models. ZeroQuant applied dynamic per-token activation quantization with group-wise weight quantization and worked well for moderately sized models like GPT-J-6B, but its accuracy collapsed on OPT-175B.[^1] LLM.int8(), released earlier in 2022, preserved accuracy by detecting outlier feature dimensions and keeping them in FP16 while quantizing the rest to INT8; this mixed-precision decomposition is hard to schedule efficiently on GPUs and in practice often ran slower than the FP16 baseline.[^1][^6]
The SmoothQuant authors set out to find a method that satisfied three constraints simultaneously: hardware-efficient INT8-only matrix multiplications, accuracy parity with FP16 across the largest available LLMs, and no fine-tuning. Achieving all three required understanding why activation quantization fails at scale.[^1][^3]
The paper documents three empirical observations about activation outliers that motivate the algorithm.[^1][^6] First, outliers persist in a small number of fixed channels: across many input tokens, the same channels carry magnitudes approximately 100 times larger than the rest. Second, within a given channel the variance is relatively small, so the outlier channels behave almost like static scaling factors. Third, these patterns emerge systematically once models pass roughly 6.7 billion parameters, which is why per-tensor INT8 activation quantization works on smaller transformers but breaks on multi-billion-parameter LLMs.[^1][^6]
Because outliers concentrate in specific channels, per-channel activation quantization would in principle solve the precision problem, but it is incompatible with hardware-accelerated general matrix-multiply (GEMM) kernels that expect scale factors along the reduction axis to be uniform.[^1] Per-token dynamic quantization spreads the difficulty across tokens but does little to attenuate the persistent channel imbalance, and per-tensor static quantization simply yields enormous quantization error on the regular values whenever an outlier shows up.[^1]
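The effect of a single outlier channel on per-tensor quantization can be reproduced on synthetic data. The sketch below is illustrative only: the matrix sizes, the choice of channel 3 as the outlier, and the helper `quantize_dequantize` are assumptions made for this example, not values or code from the paper.

```python
import numpy as np

def quantize_dequantize(x, scale):
    """Symmetric INT8 fake quantization: snap to the INT8 grid, clip, map back."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))   # 64 tokens, 16 channels (synthetic)
X[:, 3] *= 100.0                # channel 3 plays the role of an outlier channel

# Per-tensor: a single scale for the whole matrix, dominated by the outlier channel.
per_tensor_scale = np.abs(X).max() / 127
err_tensor = np.abs(quantize_dequantize(X, per_tensor_scale) - X)

# Per-channel: one scale per column. Accurate, but not usable in a plain INT8 GEMM.
per_channel_scale = np.abs(X).max(axis=0, keepdims=True) / 127
err_channel = np.abs(quantize_dequantize(X, per_channel_scale) - X)

normal = np.arange(16) != 3     # mask selecting the non-outlier channels
mean_err_tensor = err_tensor[:, normal].mean()
mean_err_channel = err_channel[:, normal].mean()
print(f"mean error on normal channels, per-tensor:  {mean_err_tensor:.4f}")
print(f"mean error on normal channels, per-channel: {mean_err_channel:.4f}")
```

With one scale shared by the whole tensor, the ordinary channels are left with only a handful of usable quantization levels, which is the failure mode described above.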
SmoothQuant exploits a simple algebraic identity in a linear layer Y = X W. For any positive per-channel scaling vector s of length equal to the input dimension, the product can be rewritten as Y = (X diag(s)^-1) (diag(s) W).[^1][^3] The transformation divides each input channel of X by s_j and multiplies the corresponding row of W by s_j, leaving Y unchanged. By choosing s so that the smoothed activation X̂ = X diag(s)^-1 has a much smaller dynamic range while the smoothed weight Ŵ = diag(s) W remains within an INT8-friendly range, both factors become amenable to standard per-tensor or per-token INT8 quantization.[^1][^3]
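The identity is easy to check numerically. In this minimal sketch the shapes and the values in `s` are arbitrary choices for illustration; any positive vector gives the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))           # activations: 8 tokens, 4 input channels
W = rng.normal(size=(4, 5))           # weights: 4 input channels, 5 output channels
s = np.array([0.5, 2.0, 10.0, 1.0])   # arbitrary positive per-channel scales

X_hat = X / s            # X diag(s)^-1: divide column j of X by s_j
W_hat = s[:, None] * W   # diag(s) W:    multiply row j of W by s_j

Y = X @ W
Y_smoothed = X_hat @ W_hat
max_abs_diff = np.abs(Y - Y_smoothed).max()
print(f"largest difference between Y and the smoothed product: {max_abs_diff:.2e}")
```

The two products agree up to floating-point rounding, so the choice of s affects only how hard each factor is to quantize, not the layer output.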
The core contribution is the choice of s. SmoothQuant uses calibration data to estimate, for each input channel j, the activation maximum max(|X_j|) and the corresponding weight maximum max(|W_j|), then computes:[^1][^7]
s_j = max(|X_j|)^α / max(|W_j|)^(1 - α)
The exponent α is a single scalar hyperparameter that controls how the quantization difficulty is split between activations and weights. When α = 1 the formula reduces to s_j = max(|X_j|), so the smoothed activations share the same maximum across channels and all the difficulty has been pushed onto the weights. When α = 0 it reduces to s_j = 1 / max(|W_j|), which equalizes the weight rows instead and leaves the activation outliers in place. Intermediate values strike a balance.[^1][^7] The paper reports that α = 0.5 is "a well-balanced sweet spot" for OPT and BLOOM, and that GLM-130B requires the larger value α = 0.75 because roughly 30% of its activation channels behave as outliers, demanding more aggressive migration toward the weight side.[^1][^7]
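The formula for s_j can be sketched in a few lines of NumPy. The function name `smoothing_scales`, the `eps` guard against division by zero, and the synthetic data with one outlier channel are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np

def smoothing_scales(act_absmax, W, alpha=0.5, eps=1e-5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one value per input channel."""
    # Row j of W corresponds to input channel j; eps guards against all-zero rows.
    w_absmax = np.maximum(np.abs(W).max(axis=1), eps)
    return act_absmax**alpha / w_absmax**(1 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
X[:, 2] *= 100.0                     # synthetic outlier channel
W = rng.normal(size=(8, 8))
act_absmax = np.abs(X).max(axis=0)   # per-channel activation maxima

channel_maxima = {}
for alpha in (0.5, 1.0):
    s = smoothing_scales(act_absmax, W, alpha)
    X_hat, W_hat = X / s, s[:, None] * W
    x_max = np.abs(X_hat).max(axis=0)   # per-channel maxima after smoothing
    w_max = np.abs(W_hat).max(axis=1)
    channel_maxima[alpha] = (x_max, w_max)
    print(f"alpha={alpha}: activation max ratio {x_max.max() / x_max.min():.1f}, "
          f"weight max ratio {w_max.max() / w_max.min():.1f}")
```

At α = 0.5 the smoothed activation and weight maxima coincide channel by channel, and at α = 1 every smoothed activation channel has maximum 1, which matches the description of the two settings above.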
Because the scaling is folded offline into the weights (and into the preceding LayerNorm or matmul that produces X), no extra multiplication runs at inference time. The smoothed model has exactly the same compute graph as the original, except that weights and activations are now INT8.[^1][^3]
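The folding step can be illustrated for the case where X is produced by a LayerNorm. Dividing the LayerNorm gain and bias by s yields the smoothed activations directly, so no separate division is needed at runtime. The `layer_norm` helper and all tensor values below are hypothetical stand-ins for a real model's parameters.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Plain LayerNorm over the last axis with per-channel gain and bias."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))            # hidden states entering the LayerNorm
gamma = rng.normal(size=4)             # LayerNorm gain
beta = rng.normal(size=4)              # LayerNorm bias
W = rng.normal(size=(4, 6))            # weight of the following linear layer
s = np.array([0.5, 2.0, 8.0, 1.0])     # smoothing scales (arbitrary here)

# Reference: unmodified LayerNorm followed by the unmodified linear layer.
Y_ref = layer_norm(h, gamma, beta) @ W

# Folded: gain and bias absorb 1/s, the weight rows absorb s.
gamma_s, beta_s = gamma / s, beta / s
W_s = s[:, None] * W
Y_folded = layer_norm(h, gamma_s, beta_s) @ W_s

print(np.allclose(Y_ref, Y_folded))
```

Because both changes are applied to stored parameters, the forward pass performs the same operations as before.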
SmoothQuant calibrates the activation maxima with a small set of unlabeled text. The published recipe uses 512 random sentences from the Pile pre-training dataset to estimate max(|X_j|) per channel, and the smoothing factors plus the static quantization step sizes are computed once and applied unchanged across all downstream tasks.[^7] The grid search for α is similarly performed on a Pile validation subset.[^7] Intel Neural Compressor later added an auto-alpha procedure that sweeps α over a [0.0, 1.0] grid in 0.1 increments while minimizing per-operator output mean-squared error against the FP32 reference.[^8]
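Calibration followed by a grid search over α might look roughly like the sketch below. It is a toy single-layer version on synthetic data: the helper names, the batch shapes, the outlier channel, and the use of output mean-squared error as the selection criterion are assumptions for illustration, and the code does not reproduce the paper's scripts or Intel Neural Compressor's API.

```python
import numpy as np

def fake_quant(x):
    """Per-tensor symmetric INT8 fake quantization."""
    scale = np.abs(x).max() / 127
    return np.clip(np.round(x / scale), -127, 127) * scale

def calibrate_absmax(batches):
    """Running per-channel maximum of |X| over the calibration batches."""
    absmax = None
    for X in batches:
        m = np.abs(X).max(axis=0)
        absmax = m if absmax is None else np.maximum(absmax, m)
    return absmax

def w8a8_mse(X, W, s):
    """Output MSE of the fake-quantized smoothed layer against the float reference."""
    Y_q = fake_quant(X / s) @ fake_quant(s[:, None] * W)
    return float(np.mean((Y_q - X @ W) ** 2))

rng = np.random.default_rng(0)

def synthetic_batch():
    X = rng.normal(size=(32, 16))
    X[:, 5] *= 100.0               # synthetic outlier channel
    return X

W = rng.normal(size=(16, 16))
act_absmax = calibrate_absmax([synthetic_batch() for _ in range(8)])
w_absmax = np.abs(W).max(axis=1)

X_val = synthetic_batch()          # held-out batch used to score each alpha
errors = {}
for alpha in np.round(np.arange(0.0, 1.01, 0.1), 1):
    s = act_absmax**alpha / w_absmax**(1 - alpha)
    errors[float(alpha)] = w8a8_mse(X_val, W, s)

best_alpha = min(errors, key=errors.get)
baseline = w8a8_mse(X_val, W, np.ones(16))   # no smoothing at all
print(f"best alpha: {best_alpha}, MSE {errors[best_alpha]:.4f} vs unsmoothed {baseline:.4f}")
```

In a real model the same statistics would be gathered per layer with forward hooks, and the score would typically be a task metric or perplexity rather than a single layer's output error.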
The paper studies three granularity settings of increasing aggressiveness and decreasing latency overhead.[^1][^6]
| Mode | Weight quantization | Activation quantization | Notes |
|---|---|---|---|
| O1 | Per-tensor | Per-token dynamic | Recomputes activation scales per token at runtime |
| O2 | Per-tensor | Per-tensor dynamic | Recomputes one scale per tensor per forward pass |
| O3 | Per-tensor | Per-tensor static | Scales fixed at calibration; lowest latency |
O1 and O2 match FP16 accuracy on OPT-175B; O3 is the most efficient because it avoids runtime scale computation altogether, at the cost of a small average accuracy drop (about 0.8 percentage points across the reported tasks).[^6]
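The three activation schemes differ only in where the scale comes from. The following sketch shows the scale computation for each on synthetic data; the function names and the token magnitudes are assumptions chosen to make the difference visible, and weight quantization (per-tensor in all three modes) is omitted.

```python
import numpy as np

def quantize(x, scale):
    """Symmetric INT8 quantization with a given scale (scalar or broadcastable array)."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def per_token_dynamic(X):
    """O1-style: one scale per token (row), computed at runtime."""
    scale = np.abs(X).max(axis=1, keepdims=True) / 127
    return quantize(X, scale), scale

def per_tensor_dynamic(X):
    """O2-style: one scale per tensor, recomputed for every forward pass."""
    scale = np.abs(X).max() / 127
    return quantize(X, scale), scale

def per_tensor_static(X, calibrated_scale):
    """O3-style: one scale per tensor, fixed ahead of time from calibration data."""
    return quantize(X, calibrated_scale), calibrated_scale

rng = np.random.default_rng(0)
token_magnitudes = np.array([[0.1], [1.0], [1.0], [3.0]])   # tokens of different size

# Offline calibration on data drawn from the same synthetic distribution.
X_calib = rng.normal(size=(512, 16)) * np.tile(token_magnitudes, (128, 1))
static_scale = np.abs(X_calib).max() / 127

X = rng.normal(size=(4, 16)) * token_magnitudes             # runtime activations

results = {
    "per-token dynamic": per_token_dynamic(X),
    "per-tensor dynamic": per_tensor_dynamic(X),
    "per-tensor static": per_tensor_static(X, static_scale),
}
mean_abs_error = {
    name: float(np.abs(q.astype(np.float64) * scale - X).mean())
    for name, (q, scale) in results.items()
}
for name, err in mean_abs_error.items():
    print(f"{name}: mean abs error {err:.5f}")
```

Per-token scales adapt to each row and give the smallest error here, while the static scale needs no runtime reduction over the activations, which is the trade-off the table summarizes.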
The SmoothQuant paper evaluates the method on three then-flagship open LLM families: Meta AI's OPT (up to 175B parameters), BigScience's BLOOM (up to 176B), and Tsinghua/Zhipu's GLM-130B. For every model the W8A8 SmoothQuant variant tracks the FP16 baseline to within a fraction of a point on standard zero-shot benchmarks.[^1][^6] On OPT-175B with the aggressive O3 setting, reported zero-shot accuracies include 74.7% on LAMBADA (FP16: 74.7%), 59.2% on HellaSwag (FP16: 59.3%), 79.7% on PIQA (FP16: 79.7%), and 71.2% on WinoGrande (FP16: 72.6%), with WikiText perplexity of 11.17 versus 10.99 for FP16.[^6] For GLM-130B, the O1 setting matches FP16 accuracy and O3 degrades it by about 1 percentage point.[^7]
The authors also benchmark wall-clock latency and memory. Integrated into NVIDIA's FasterTransformer C++ inference library, SmoothQuant delivers up to 1.56x inference speedup and 2x reduction in GPU memory footprint relative to FP16, while halving the number of GPUs required to serve the largest models (one GPU instead of two for OPT-66B, four instead of eight for OPT-175B).[^1][^4] In a striking demonstration of memory savings, SmoothQuant enables the 530-billion-parameter MT-NLG model to be served on a single 8-GPU node, which is impossible at FP16.[^1][^4]
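The GPU counts follow from simple weight-storage arithmetic. The sketch below counts weights only and rests on two assumptions that are not stated in the text above: 80 GB of memory per GPU, and GPU counts rounded up to a power of two, as is common for tensor parallelism. KV cache and activation memory are ignored.

```python
import math

GPU_MEMORY_GB = 80   # assumption: 80 GB per accelerator

def weight_gigabytes(num_params, bytes_per_param):
    """Storage for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

def min_gpus_power_of_two(gigabytes, gpu_memory_gb=GPU_MEMORY_GB):
    """Smallest power-of-two GPU count whose combined memory holds the weights."""
    needed = math.ceil(gigabytes / gpu_memory_gb)
    return 1 << (needed - 1).bit_length()

def fits_on_node(gigabytes, gpus_per_node=8, gpu_memory_gb=GPU_MEMORY_GB):
    """Whether the weights fit in the combined memory of one node."""
    return gigabytes <= gpus_per_node * gpu_memory_gb

for name, n in [("OPT-66B", 66e9), ("OPT-175B", 175e9)]:
    fp16, int8 = weight_gigabytes(n, 2), weight_gigabytes(n, 1)
    print(f"{name}: FP16 {fp16:.0f} GB on {min_gpus_power_of_two(fp16)} GPUs, "
          f"INT8 {int8:.0f} GB on {min_gpus_power_of_two(int8)} GPUs")

mt_nlg_fp16 = weight_gigabytes(530e9, 2)
mt_nlg_int8 = weight_gigabytes(530e9, 1)
print(f"MT-NLG 530B on one 8-GPU node: FP16 fits={fits_on_node(mt_nlg_fp16)}, "
      f"INT8 fits={fits_on_node(mt_nlg_int8)}")
```

Under these assumptions the arithmetic reproduces the reported figures: one GPU instead of two for OPT-66B, four instead of eight for OPT-175B, and a 530-billion-parameter model that fits on an 8-GPU node only in INT8.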
| Method | Year | Weight precision | Activation precision | Outlier handling | OPT-175B accuracy |
|---|---|---|---|---|---|
| LLM.int8() | 2022 | INT8 | INT8 + FP16 mixed | Keep outlier channels in FP16 | Matches FP16, often slower than FP16 baseline[^1] |
| ZeroQuant | 2022 | INT8 group-wise | INT8 per-token dynamic | None | Accuracy collapses on OPT-175B[^1] |
| SmoothQuant | 2022 | INT8 per-tensor | INT8 per-tensor or per-token | Offline migration to weights via s_j | Matches FP16, up to 1.56x speedup[^1] |
SmoothQuant's principal advantage over LLM.int8() is that it never needs mixed-precision decomposition at runtime, so it can be deployed using only standard INT8 GEMM kernels. Its advantage over ZeroQuant is that the per-channel migration neutralizes the systematic outliers that defeat naive INT8 activation quantization in 100-billion-plus-parameter models.[^1][^6]
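The structural difference between the two approaches can be sketched on a toy layer. The mixed-precision function below is only loosely modeled on the idea of keeping outlier channels in floating point; it is not the bitsandbytes implementation, and the threshold, helper names, and data are assumptions for illustration. For brevity the smoothing scales are computed from the same batch rather than from separate calibration data.

```python
import numpy as np

def int8_matmul(X, W):
    """Per-tensor symmetric INT8 GEMM with integer accumulation, dequantized at the end."""
    sx = np.abs(X).max() / 127
    sw = np.abs(W).max() / 127
    Xq = np.clip(np.round(X / sx), -127, 127).astype(np.int32)
    Wq = np.clip(np.round(W / sw), -127, 127).astype(np.int32)
    return (Xq @ Wq) * (sx * sw)

def mixed_precision_matmul(X, W, threshold=6.0):
    """Two matmuls: outlier channels in floating point, the rest in INT8."""
    outlier = np.abs(X).max(axis=0) > threshold
    Y_float = X[:, outlier] @ W[outlier, :]
    Y_int8 = int8_matmul(X[:, ~outlier], W[~outlier, :])
    return Y_float + Y_int8

def smoothed_matmul(X, W, alpha=0.5):
    """One INT8 matmul after migrating difficulty to the weights."""
    s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
    return int8_matmul(X / s, s[:, None] * W)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
X[:, 3] *= 100.0                 # synthetic outlier channel
W = rng.normal(size=(16, 16))
Y = X @ W                        # float reference

def mse(Y_hat):
    return float(np.mean((Y_hat - Y) ** 2))

err_naive = mse(int8_matmul(X, W))
err_mixed = mse(mixed_precision_matmul(X, W))
err_smooth = mse(smoothed_matmul(X, W))
print(f"naive INT8: {err_naive:.4f}, mixed precision: {err_mixed:.4f}, smoothed: {err_smooth:.4f}")
```

Both alternatives reduce the error of naive per-tensor INT8, but the mixed-precision path needs a data-dependent split and a second matmul at runtime, whereas the smoothed path is a single INT8 GEMM.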
The official open-source implementation is hosted at github.com/mit-han-lab/smoothquant under the MIT-HAN Lab organization.[^4] It includes PyTorch reference code, a real INT8 OPT inference demo built on CUTLASS kernels, and W8A8 evaluation scripts for OPT, BLOOM, Llama 2, Llama 3, Falcon, Mistral 7B, and Mixtral.[^4] The repository's README records the alpha values that the authors recommend for each model family: 0.85–0.90 for Llama-2, 0.8 for Mistral and Mixtral, and 0.6–0.7 for Falcon.[^4]
The technique has since been integrated into multiple production inference and quantization stacks. The README's news log documents the timeline:[^4]
| Date | Integration |
|---|---|
| 2023-03 | Intel Neural Compressor adds SmoothQuant with auto-alpha tuning[^4][^8] |
| 2023-10 | NVIDIA TensorRT-LLM adds INT8 SmoothQuant as a supported quantization mode[^4][^9] |
| 2023-11 | Amazon SageMaker hosts SmoothQuant-quantized models[^4] |
| 2024-01 | Microsoft ONNX Runtime adds SmoothQuant support[^4] |
| 2024-03 | Official W8A8 recipes published for Llama-1/2/3, Falcon, Mistral, Mixtral[^4] |
| 2024-05 | AMD enables INT8 SmoothQuant inference on Instinct MI300X via Composable Kernel[^4] |
NVIDIA's documentation positions INT8 SmoothQuant in TensorRT-LLM as an option for GPUs that lack hardware FP8 support (Ada- and Hopper-generation GPUs provide FP8; Ampere and earlier do not), and as a fallback when FP8 accuracy is insufficient for a given workload.[^9] Inside TensorRT-LLM, SmoothQuant lives alongside FP8, FP4, INT4 AWQ, and INT4 GPTQ as one of the supported post-training quantization formats for LLM checkpoints.[^9]
SmoothQuant is significant primarily because it was the first PTQ method to achieve INT8-only matrix multiplication for 100-billion-plus-parameter LLMs without accuracy loss, and to demonstrate the wall-clock and memory savings end-to-end in a production-grade C++ runtime.[^1][^4] By halving the memory footprint of an LLM and allowing its matrix multiplications to run on INT8 tensor cores, which offer roughly twice the arithmetic throughput of FP16, it enables substantially larger models to fit on a single accelerator and reduces the number of GPUs needed for distributed inference, with direct economic implications for model serving.[^1][^4]
The method has also become a methodological influence on subsequent LLM-quantization research. AWQ (Activation-aware Weight Quantization) from the same lab generalizes the per-channel activation-aware scaling idea to weight-only quantization in INT4, where outliers in the activation magnitudes are used to identify the most salient weight channels.[^10] SmoothQuant+ (2023) extends the migration idea to 4-bit weight quantization. Hybrid methods that combine SmoothQuant-style channel scaling with rotation-based smoothing (such as QuaRot and SmoothRot) have explored even more aggressive 4-bit weight, 4-bit activation regimes.
In the broader landscape of model compression techniques for LLMs, SmoothQuant sits squarely in the post-training quantization category, complementing weight-only INT4 methods like GPTQ and AWQ, floating-point formats like FP8 and FP4, and tangentially related compression strategies such as pruning and knowledge distillation.[^1][^9]
Several caveats apply to SmoothQuant in practice.
The α hyperparameter is model-specific. The paper's general recommendation of α = 0.5 does not carry over to every model: GLM-130B needs 0.75, and the official scripts use 0.85–0.90 for Llama-2, so a small grid search or auto-tuning step is required for each new architecture.[^1][^4][^7]
Calibration is necessary. Although there is no gradient-based fine-tuning, SmoothQuant requires a representative text sample (the published recipe uses 512 sentences from the Pile) to estimate channel maxima, and the calibration distribution can affect downstream quality on out-of-domain inputs.[^7]
Static quantization costs some accuracy. Static per-tensor activation quantization (the O3 setting) shows a small but measurable accuracy regression compared with O1 and O2; deployments that prioritize accuracy over the last 5–10% of latency typically choose O1 or O2.[^6]
Pushing below INT8. SmoothQuant by itself targets W8A8. Reaching INT4 weights or INT4 activations generally requires combining its channel-scaling idea with additional techniques (group-wise weight quantization, rotation, mixed-precision activations, or quantization-aware training), because the per-channel smoothing alone is insufficient to absorb the larger quantization noise at four bits.[^1][^4]
| Topic | Relation to SmoothQuant |
|---|---|
| AWQ | Activation-aware weight scaling for INT4 weight-only quantization, from the same MIT lab; uses activation magnitudes to choose protective scales for salient weight channels |
| GPTQ | Hessian-based INT4 weight-only PTQ; orthogonal to SmoothQuant and often composed with it for 4-bit weights plus 8-bit activations |
| LLM.int8() | Mixed-precision INT8 method that keeps outlier channels in FP16; same goal but slower in practice |
| NF4 | 4-bit weight-only format for QLoRA fine-tuning; complementary, not a direct alternative |
| ExLlamaV2 / EXL2 | Mixed-bit-width inference engine for quantized LLMs |
| GGML / GGUF | CPU-oriented quantized inference format using INT4/INT5/INT8 weight schemes |
| Product quantization | Classical vector-database quantization; conceptually adjacent but targets retrieval rather than LLM serving |