Activation-aware Weight Quantization (AWQ) is a post-training quantization method for large language models that compresses weights to 4-bit (and optionally 3-bit) integers while keeping near-FP16 task accuracy. It was introduced in a paper by Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han of MIT, Tsinghua University, and the MIT-IBM Watson AI Lab, first posted to arXiv on June 1, 2023.[1][2] The method received the Best Paper Award at the Seventh Conference on Machine Learning and Systems (MLSys 2024) in Santa Clara, California.[3] AWQ identifies a small fraction of "salient" weight channels by inspecting activation magnitudes rather than weight magnitudes, then applies a per-channel scaling transformation to reduce quantization error on those channels. The companion TinyChat inference engine implements W4A16 (4-bit weight, 16-bit activation) GPU kernels, for which the authors report more than 3x speedup over the Hugging Face FP16 baseline on both desktop and mobile GPUs.[1][2]
Background
The growth of decoder-only transformer language models created a substantial deployment problem. A model with 70 billion parameters in 16-bit floating point requires roughly 140 GB of storage for its weights alone; a 7-billion-parameter model requires about 14 GB, which exceeds the VRAM on most consumer GPUs and many cloud instances. Lower-precision representations reduce both the storage footprint and the DRAM bandwidth consumed during inference, with the latter being the dominant cost during single-request decoding on a memory-bound accelerator.[1]
Quantization for neural networks has a long history, but the regime of interest for large language models (LLMs) is weight-only post-training quantization (PTQ) in 3- to 4-bit integer formats. Two prior baselines defined the space when AWQ was introduced:
- Round-to-nearest (RTN) rounds each weight to the closest representable value on a uniform integer grid. RTN is trivial to apply and incurs no calibration cost, but at 4-bit it loses non-trivial accuracy on smaller models, and at 3-bit it produces unusable perplexity on most LLMs.[1]
- GPTQ (Frantar et al., 2022) quantizes weights one column at a time using an inverse-Hessian update to compensate for the error introduced in already-quantized columns. GPTQ delivers near-FP16 4-bit accuracy on many models but takes hours on large models, depends on a relatively large calibration set, and can be sensitive to calibration-data domain.[4]
AWQ takes a different path. Rather than correcting quantization error after it occurs, AWQ asks which weights, if quantized poorly, would most damage downstream computation. The proposed answer is "those weights multiplied by large activations during inference," a signal that does not require gradient information.[1][2]
Paper and authorship
The arXiv preprint "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration" was first posted as version 1 on June 1, 2023.[5] Subsequent revisions on arXiv updated the title in v5 to "AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration" to reflect the strong on-device results added to the camera-ready version.[5] The paper was accepted to the MLSys 2024 conference, where it received the Best Paper Award; the official proceedings list the same author roster as the arXiv submission.[3][6]
Lead author Ji Lin completed the research as a PhD student in Song Han's group at MIT. Song Han is principal investigator of the HAN Lab at MIT (also distinguished scientist at NVIDIA). Author affiliations across the paper and the MIT HAN Lab project page span MIT, Tsinghua University, and the MIT-IBM Watson AI Lab.[2]
The official reference implementation lives at the GitHub repository mit-han-lab/llm-awq, released under the MIT License and including the TinyChat inference engine.[7]
Salient weights and the activation signal
The core empirical finding of the paper is illustrated by a controlled ablation. Consider quantizing a linear layer y = W x to INT4. Suppose 1% of the columns of W (corresponding to 1% of input channels) are left in FP16 while the remaining 99% are quantized with standard RTN. Three strategies for choosing the protected columns are compared on LLaMA-7B:[1]
- Random selection of 1% of channels.
- Selection by largest average weight magnitude across the column.
- Selection by largest average activation magnitude on the corresponding input channel.
Random selection and weight-magnitude selection give only marginal gains. Activation-magnitude selection closes most of the gap between RTN and FP16 on WikiText perplexity. The paper reports that mixed-precision protection of the top 1% of activation-selected channels moves LLaMA-7B perplexity from about 43.2 under plain RTN (in the paper's 3-bit demonstration) toward the FP16 baseline, while random and weight-magnitude selection do not.[1]
The intuition is that the output error of a quantized linear layer is (W minus Q(W)) times x. The contribution of column i to that error is (w_i minus Q(w_i)) times x_i. A weight column multiplied by an input channel that is always near zero contributes little error regardless of how poorly it is quantized; a column multiplied by an input channel with large magnitude contributes proportionally more. Activation magnitude, averaged over a calibration set, is therefore a cleaner importance signal than weight magnitude.[1]
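The per-column error argument can be checked on a toy example. The sketch below is illustrative only: the weight row, the activation vector, and the helper names `rtn_quantize` and `error_contributions` are made up for this example, and a simple symmetric 4-bit round-to-nearest grid stands in for Q.

```python
def rtn_quantize(values, n_bits=4):
    """Symmetric round-to-nearest onto a uniform grid shared by all values."""
    qmax = 2 ** (n_bits - 1) - 1          # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    step = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / step))) * step for v in values]


def error_contributions(w_row, x):
    """Return (|w_i - Q(w_i)|, |(w_i - Q(w_i)) * x_i|) per input channel."""
    q_row = rtn_quantize(w_row)
    weight_err = [abs(w - q) for w, q in zip(w_row, q_row)]
    output_err = [e * abs(xi) for e, xi in zip(weight_err, x)]
    return weight_err, output_err


# Made-up example: one output row of W and one activation vector.
w_row = [0.30, -0.52, 0.11, 0.70, -0.26, 0.44]
x = [0.02, 0.01, 8.00, 0.03, 0.02, 0.01]   # channel 2 carries an outlier activation

weight_err, output_err = error_contributions(w_row, x)
worst_channel = max(range(len(output_err)), key=output_err.__getitem__)
print("weight error per channel:", [round(e, 3) for e in weight_err])
print("output error per channel:", [round(e, 4) for e in output_err])
print("channel dominating the output error:", worst_channel)
```

In this toy case the channel with the outlier activation dominates the output error even though its weight is not the one that is rounded worst, which is the reason activation magnitude is used as the importance signal.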
Keeping 1% of columns in FP16 and the rest in INT4 is hardware-unfriendly because the resulting layer requires mixed-precision matrix multiplication. AWQ therefore replaces explicit precision protection with an equivalent per-channel scaling transformation that keeps every weight in the same INT4 format.[1][2]
Per-channel scaling
The mathematical core of AWQ is the following observation. For a single weight w in a salient channel and an input activation x, the output contribution before quantization is w times x. If w is multiplied by a scale s > 1 before quantization, the scaled weight is w' = s times w. Provided the scaled weight does not become the largest value in its quantization group, the grid step is unchanged, so quantizing w' to INT4 produces about the same expected absolute error delta as quantizing w. The relative error delta / w' then equals delta / (s times w), which is 1/s times the relative error of the unscaled weight. Larger s shrinks the relative error on this channel.[1]
For the layer output to remain unchanged, the corresponding input activation x must be divided by s. The substitution (Q(s times w) / s) times x equals Q(s times w) times (x / s), so dividing the activation cancels the weight scale. In a transformer block, the upstream operation feeding the linear layer is typically a layer normalization (or a prior linear projection), and the 1/s scaling can be absorbed into the affine parameters of that upstream operation at offline-quantization time. This means inference adds no extra arithmetic; the inverse scale is baked into the preceding layer.[1]
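The scaling argument can be written compactly. The block below is a sketch in notation assumed for illustration and not used elsewhere in this article: Δ is the quantization step of the group before scaling, Δ' the step after scaling, N the bit width, and RoundErr the rounding residual, whose expected magnitude is treated as unaffected by the scaling.

```latex
\begin{align*}
Q(w) &= \Delta \cdot \mathrm{Round}\!\left(\frac{w}{\Delta}\right),
  \qquad \Delta = \frac{\max\lvert\mathbf{w}\rvert}{2^{N-1}} \\
\mathrm{Err}\big(Q(w)\,x\big)
  &= \Delta \cdot \mathrm{RoundErr}\!\left(\frac{w}{\Delta}\right) \cdot x \\
\mathrm{Err}\big(Q(s\,w)\,\tfrac{x}{s}\big)
  &= \Delta' \cdot \mathrm{RoundErr}\!\left(\frac{s\,w}{\Delta'}\right) \cdot \frac{x}{s} \\
\frac{\mathrm{Err}\big(Q(s\,w)\,\tfrac{x}{s}\big)}{\mathrm{Err}\big(Q(w)\,x\big)}
  &\approx \frac{\Delta'}{\Delta} \cdot \frac{1}{s}
\end{align*}
```

When scaling one channel leaves the group maximum unchanged, Δ' is approximately Δ, so for s > 1 the ratio is below 1 and the error on the salient channel shrinks by roughly a factor of s.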
AWQ formulates the search for per-channel scales s = (s_1, ..., s_C) as the minimization of the per-layer output reconstruction error over a calibration set:
L(s) = || Q(W diag(s)) diag(s)^{-1} X − W X ||
where X is the matrix of input activations gathered from the calibration set and W is the FP16 weight matrix. The candidate scales are parameterized as s_i = (mean absolute activation of channel i) raised to a power alpha, with alpha searched over a small one-dimensional grid (the paper uses 20 candidate values).[1] This is a fast forward-pass search rather than a gradient-based optimization. The paper reports that the full quantization procedure runs in minutes for a 7B model on a single GPU and does not require backpropagation, second-order information, or full-model fine-tuning.[1][2]
A complication is that increasing s indefinitely is not free. A strongly scaled weight column can become the largest value in its quantization group, which enlarges the group's dynamic range, coarsens the INT4 grid (a larger step size) for the rest of the weights in that group, and can introduce error elsewhere. The grid search picks the alpha that balances protecting salient channels against disturbing unprotected ones.[1]
The paper also discusses a clipping step that adjusts the maximum value used to compute the per-group scale, trading a small additional clipping error for a tighter integer grid on the bulk of the weights.[1]
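The scale search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation: it uses pure Python lists, a per-row symmetric round-to-nearest quantizer as a stand-in for Q, randomly generated W and X, and hypothetical helper names (`quantize_rows`, `reconstruction_error`, `search_alpha`). The clipping step is omitted.

```python
import math
import random


def quantize_rows(W, n_bits=4):
    """Symmetric RTN with one step size per output row (a stand-in for Q)."""
    qmax = 2 ** (n_bits - 1) - 1
    out = []
    for row in W:
        step = (max(abs(v) for v in row) or 1.0) / qmax
        out.append([max(-qmax - 1, min(qmax, round(v / step))) * step for v in row])
    return out


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def reconstruction_error(W, X, s):
    """|| Q(W diag(s)) diag(s)^-1 X - W X || as a Frobenius norm."""
    scaled = [[w * si for w, si in zip(row, s)] for row in W]
    restored = [[q / si for q, si in zip(row, s)] for row in quantize_rows(scaled)]
    diff = [a - b
            for ra, rb in zip(matmul(restored, X), matmul(W, X))
            for a, b in zip(ra, rb)]
    return math.sqrt(sum(d * d for d in diff))


def search_alpha(W, X, n_grid=20):
    """Try s_i = mean|x_i| ** alpha for alpha = 0, 1/n_grid, ... and keep the best."""
    mean_abs = [max(sum(abs(v) for v in chan) / len(chan), 1e-8) for chan in X]
    best = None
    for k in range(n_grid):
        alpha = k / n_grid
        s = [m ** alpha for m in mean_abs]
        err = reconstruction_error(W, X, s)
        if best is None or err < best[1]:
            best = (alpha, err)
    return best


rng = random.Random(0)
n_out, n_in, n_tokens = 8, 16, 32
W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
# X has shape (input channels, tokens); channel 3 is given outlier-sized activations.
X = [[rng.uniform(-1, 1) * (20.0 if c == 3 else 1.0) for _ in range(n_tokens)]
     for c in range(n_in)]

rtn_error = reconstruction_error(W, X, [1.0] * n_in)   # alpha = 0 is plain RTN
best_alpha, best_error = search_alpha(W, X)
print(f"RTN error {rtn_error:.3f}; best alpha {best_alpha:.2f} with error {best_error:.3f}")
```

Because alpha = 0 (all scales equal to 1) is one of the candidates, the search can never do worse than plain round-to-nearest on the calibration data; how much it improves depends on how pronounced the activation outliers are.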
AWQ uses group-wise integer quantization for the actual INT4 encoding. With a group size of 128 (the standard setting), every 128 consecutive weights along the input-channel dimension share one FP16 scale and one zero-point. This is fine enough to track local distribution shape but coarse enough to limit storage overhead.[1][7]
For a single group of 128 4-bit weights:
- Weight payload: 128 x 4 bits = 64 bytes.
- Scale: 1 x FP16 = 2 bytes.
- Zero-point: 1 x FP16 (or packed integer in some variants) = up to 2 bytes.
Total: about 68 bytes per group, versus 256 bytes per group in FP16. The effective compression ratio is roughly 3.76x, somewhat less than the theoretical 4x because of the scale and zero-point overhead.[7]
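The group-wise encoding and the storage arithmetic above can be illustrated as follows. This is a sketch assuming a plain asymmetric uniform quantizer; the function names and the sample weights are made up for the example, and real kernels store the codes in a packed layout.

```python
import math


def quantize_group(weights, n_bits=4):
    """Asymmetric uniform quantization of one group -> (codes, scale, zero_point)."""
    qmax = 2 ** n_bits - 1                       # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = max(0, min(qmax, round(-lo / scale)))
    codes = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    return [(c - zero_point) * scale for c in codes]


def group_bytes(group_size=128, n_bits=4, scale_bytes=2, zero_point_bytes=2):
    """Storage for one group: packed payload plus its scale and zero-point."""
    return group_size * n_bits // 8 + scale_bytes + zero_point_bytes


group = [0.1 * math.sin(i) for i in range(128)]   # made-up weights spanning zero
codes, scale, zero_point = quantize_group(group)
restored = dequantize_group(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(group, restored))

awq_bytes = group_bytes()            # 64 + 2 + 2
fp16_bytes = 128 * 2
print("bytes per group:", awq_bytes, "vs FP16:", fp16_bytes)
print("compression ratio:", round(fp16_bytes / awq_bytes, 2))
print("effective bits per weight:", awq_bytes * 8 / 128)
```

The effective cost works out to 4.25 bits per weight, which is where the gap between the nominal 4x and the observed ratio comes from.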
The Hugging Face documentation refers to this configuration as bits: 4, group_size: 128, with zero_point: true in the model config.json. Group size 64 is also supported as a higher-fidelity variant, and version: "gemm", version: "gemv", or version: "exllama" selects the underlying CUDA kernel implementation.[8]
The shorthand W4A16g128 appears in model cards and serving stacks: 4-bit weights, 16-bit activations, group size 128. This distinguishes AWQ from weight-and-activation quantization schemes such as SmoothQuant (W8A8) or QuaRot (W4A4).[9]
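As an illustration of how the configuration fields map to the shorthand, the sketch below parses a config fragment and derives the W4A16g128 label. The field names follow those quoted above; the nesting under a `quantization_config` key and the helper `awq_shorthand` are assumptions made for this example, so check a real checkpoint's config.json before relying on the layout.

```python
import json

# Hypothetical config.json fragment; the nesting is an assumption.
config_text = """
{
  "quantization_config": {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
    "zero_point": true,
    "version": "gemm"
  }
}
"""


def awq_shorthand(config, activation_bits=16):
    """Build a label such as W4A16g128 from an AWQ quantization config."""
    quant = config.get("quantization_config", {})
    if quant.get("quant_method") != "awq":
        raise ValueError("config does not describe an AWQ checkpoint")
    return f"W{quant['bits']}A{activation_bits}g{quant['group_size']}"


label = awq_shorthand(json.loads(config_text))
print(label)
```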
Calibration data
AWQ requires a small calibration set to estimate per-channel activation magnitudes. The reference implementation defaults to a few hundred sequences drawn from the Pile or a similar web-text corpus; published recipes commonly use 128 to 512 samples.[1][7] Each sample is run through the model to record the mean absolute value of activations on each input channel of each linear layer. These statistics are averaged across the calibration set to yield channel importance scores used to compute s.[1]
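The statistic being collected is simple enough to sketch directly. The example below is a simplified illustration with made-up data and hypothetical function names: it averages absolute activations per input channel over all tokens of all samples for a single layer, whereas a real pipeline hooks every linear layer of the model.

```python
def channel_importance(calibration_samples):
    """Mean |activation| per input channel over every token of every sample.

    calibration_samples: iterable of samples, each a list of token vectors
    that all have the same number of channels.
    """
    totals, n_tokens = None, 0
    for sample in calibration_samples:
        for token_vector in sample:
            if totals is None:
                totals = [0.0] * len(token_vector)
            for channel, value in enumerate(token_vector):
                totals[channel] += abs(value)
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("calibration set is empty")
    return [t / n_tokens for t in totals]


def top_channels(importance, fraction=0.01):
    """Indices of the most important channels (at least one)."""
    k = max(1, round(len(importance) * fraction))
    order = sorted(range(len(importance)), key=lambda c: importance[c], reverse=True)
    return order[:k]


# Two made-up samples with three input channels each.
samples = [
    [[1.0, -4.0, 0.0], [3.0, 4.0, 0.0]],
    [[-2.0, 4.0, 0.5]],
]
importance = channel_importance(samples)
salient = top_channels(importance, fraction=0.34)
print(importance, salient)
```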
Because AWQ only needs first-order activation statistics (mean absolute values), not gradients or Hessians, the paper reports that it works with a smaller calibration set than GPTQ (whose original recipe calibrates on 128 segments of 2,048 tokens each[4]) and is less sensitive to the calibration domain.[1] The paper reports that under deliberate calibration-domain mismatch, AWQ perplexity rises by roughly 0.5 to 0.6 points on the evaluation distribution while GPTQ perplexity rises by 2.3 to 4.9 points under the same setup.[1] This calibration-domain tolerance has practical consequences for proprietary models where representative calibration data is hard to obtain.
TinyChat inference engine
AWQ-quantized checkpoints by themselves do not run faster than FP16; they require an inference kernel that dequantizes INT4 weights and multiplies them by FP16 activations efficiently. The MIT HAN Lab released TinyChat as the companion runtime.[2][7]
TinyChat implements fused W4A16 CUDA kernels that:
- Load packed INT4 weights directly from global memory.
- Unpack and dequantize them to FP16 in registers or shared memory.
- Multiply against FP16 activations.
- Apply activation functions (such as SiLU in the LLaMA MLP block) in the same kernel to avoid intermediate writes to global memory.
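The first three steps of the list above (load packed weights, unpack and dequantize, multiply) can be mimicked on the CPU for illustration. This is a sketch, not TinyChat's kernel: the low-nibble-first packing, the per-group zero-point handling, and all function names are assumptions for the example, and GPU kernels use different, hardware-specific weight layouts. The fused activation epilogue is left out.

```python
def pack_int4(codes):
    """Pack pairs of 4-bit codes into bytes, low nibble first (illustrative layout)."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))


def unpack_int4(packed):
    """Inverse of pack_int4: two codes per byte."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes


def w4a16_dot(packed_row, scales, zero_points, x, group_size):
    """Dot product of one packed weight row with activations x.

    Each weight is dequantized on the fly as (code - zero_point) * scale,
    using the scale and zero-point of its group, then multiplied by x[i].
    """
    acc = 0.0
    for i, (code, xi) in enumerate(zip(unpack_int4(packed_row), x)):
        g = i // group_size
        acc += (code - zero_points[g]) * scales[g] * xi
    return acc


codes = [1, 15, 8, 0]                 # four 4-bit weights, two groups of two
packed = pack_int4(codes)
result = w4a16_dot(packed, scales=[0.5, 0.25], zero_points=[8, 8],
                   x=[1.0, 2.0, 3.0, 4.0], group_size=2)
print(packed.hex(), result)
```

The point of fusing these steps on a GPU is that the dequantized FP16 weights never travel back to global memory; only the packed 4-bit payload is read from DRAM.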
End-to-end results reported in the paper and project page include 2.7x speedup on the NVIDIA RTX 4090 desktop GPU and 2.9x speedup on the NVIDIA Jetson Orin embedded GPU for LLaMA-class models running with TinyChat AWQ kernels relative to the Hugging Face FP16 baseline.[1][2] The system runs a quantized Llama 2 70B model on a Jetson Orin 64 GB platform at interactive generation speed, where the FP16 variant would not fit in memory at all.[2]
A subsequent release, TinyChat 2.0 (December 2024), extended the engine to vision-language models, added VILA and NVILA support, and reports 1.5x to 1.7x faster prefill compared to the original TinyChat through revised kernel scheduling.[2] The HAN Lab also publishes TinyChatEngine, a C++ counterpart targeting on-device deployment on Apple silicon, ARM CPUs, and other accelerators where CUDA is not available.[7]
Reported quantization results
The AWQ paper reports zero-shot perplexity and downstream accuracy across LLaMA-1, LLaMA-2, OPT, Mistral, and several vision-language models. Headline numbers from the paper at W4A16 group-size-128 include the following ranges (all on WikiText-2 perplexity, lower is better):[1]
- LLaMA-7B: FP16 about 5.68, AWQ INT4 about 5.78, RTN INT4 about 5.96 (group size 128), GPTQ INT4 close to AWQ.
- LLaMA-13B: FP16 about 5.09, AWQ INT4 about 5.18.
- LLaMA-2-70B: FP16 about 3.32, AWQ INT4 about 3.41.
At 3-bit, the gap widens but AWQ retains a large advantage over RTN. The paper reports LLaMA-7B WikiText-2 at INT3 of about 13.0 with AWQ versus 43.2 with RTN.[1]
Beyond perplexity, the paper evaluates common-sense reasoning suites (HellaSwag, PIQA, ARC, WinoGrande, and others) and the instruction-tuned Vicuna chatbot benchmark, with AWQ showing minimal degradation versus FP16 at 4-bit precision.[1] The work also reports successful low-bit quantization of vision-language models, including LLaVA-13B and OpenFlamingo-9B, evaluated across eleven multimodal benchmarks, making AWQ one of the first weight-only PTQ methods to demonstrate this generality.[1][2]
Comparison to other quantization methods
The following table summarizes how AWQ relates to the most-cited alternative weight-only and weight-and-activation quantization methods discussed in the AWQ paper and the surrounding literature.
| Method | Format | Approach | Calibration | Notes |
|---|---|---|---|---|
| RTN | W4A16 | Round-to-nearest | None | Trivial; degrades on small models, fails at 3-bit. |
| GPTQ | W4A16 | Hessian-based column update | 128 segments of 2,048 tokens in the original paper | Strong 4-bit accuracy; hours to run; sensitive to calibration domain.[4] |
| AWQ | W4A16 | Activation-aware channel scaling | About 128 to 512 samples | Minutes to run; tolerant of calibration mismatch.[1] |
| SmoothQuant | W8A8 | Per-channel migration of activation outliers to weights | Small set | Targets INT8 tensor-core compute, complementary to AWQ.[9] |
| QuaRot / SpinQuant | W4A4 | Rotation matrices to suppress activation outliers | Calibration set | Enables full INT4 compute at extra calibration cost. |
| AQLM, QuIP# | sub-4-bit | Vector-quantization codebooks | Calibration set | Targets 2-3 bit regime where AWQ degrades. |
| NF4 (used by QLoRA) | W4A16 | Non-uniform quantile grid | None | Used with bitsandbytes; slower inference kernels in practice. |
AWQ and GPTQ are the two methods most often compared head-to-head in production-serving contexts. The community comparisons consistently find that:
- AWQ is faster to produce than GPTQ at quantization time.
- GPTQ sometimes achieves slightly lower perplexity on smaller models, but the gap narrows or vanishes at 13B and above.
- AWQ tends to be slightly faster at inference time when both are paired with well-optimized kernels such as Marlin.[9]
Adoption
Hugging Face TGI and the broader Hugging Face Transformers library added support for loading AWQ-quantized checkpoints in late 2023. Users invoke standard AutoModelForCausalLM.from_pretrained on any model with quant_method: "awq" in its config.json; the library inspects the configuration and selects the appropriate AWQ kernel.[8] Supported backends include the original llm-awq, the community AutoAWQ library, and the ExLlamaV2 kernels (selected with AwqConfig(version="exllama"), which works on both NVIDIA and AMD GPUs).[8] Hugging Face also implements fused AWQ modules for LLaMA and Mistral architectures: the Q, K, V, and O projections plus MLP layers are combined into single fused operations, which the documentation reports doubles decode throughput at batch size 1 for benchmark workloads on Mistral-7B-OpenOrca.[8]
AutoAWQ
AutoAWQ is a community library by Casper Hansen, started in August 2023, that simplified the end-to-end AWQ workflow and extended support to additional architectures beyond the original llm-awq set.[10] AutoAWQ was for several years the most common path for producing community AWQ checkpoints: by mid-2024, over 7,000 AWQ-format models existed on the Hugging Face Hub, many produced by AutoAWQ users.[10] The library is released under the MIT License. The project README now states that AutoAWQ is officially deprecated and no longer maintained, with users directed toward vLLM's llm-compressor package and MLX-LM as recommended successors.[10]
vLLM
vLLM supports AWQ natively as a recommended INT4 format for production serving.[11] AWQ models can be loaded by passing quantization="awq" to the LLM constructor, with vLLM dispatching the appropriate W4A16 kernel for the host hardware. The llm-compressor package, maintained by the vLLM project, includes an AWQ calibration recipe so that producing a quantized checkpoint and serving it use a coherent workflow.[11]
NVIDIA TensorRT-LLM
NVIDIA's TensorRT-LLM supports INT4 AWQ as a native quantization recipe. The NVIDIA TensorRT Model Optimizer ships with an AWQ calibration utility that emits checkpoints consumable directly by TensorRT-LLM. This integration brings AWQ into enterprise serving pipelines including Amazon SageMaker JumpStart, Google Cloud Vertex AI, and NVIDIA NIM microservices.[12]
Other systems
Production-grade systems that have shipped AWQ support include SGLang, LMDeploy (Shanghai AI Lab), Hugging Face TGI, Intel Neural Compressor, and AMD Quark.[7][8][11] The llama.cpp project uses its own GGUF-format Q4_K and Q5_K quantizations rather than AWQ; the two ecosystems mostly do not interoperate at the kernel level, though both serve the same practical goal of running 4-bit LLMs on commodity hardware.[7]
Models commonly quantized with AWQ
By mid-2024, the AWQ ecosystem on the Hugging Face Hub had accumulated over six million downloads of AWQ-format checkpoints.[2] Models for which AWQ-quantized variants are routinely published include:
The MIT HAN Lab maintains a model zoo of pre-computed AWQ search results so that users can reproduce or extend the original quantization configurations without re-running the full search.[7]
The memory reduction from W4A16g128 storage is roughly 3.5x to 3.8x relative to FP16. In rule-of-thumb form:
- A 7B model shrinks from about 14 GB FP16 to roughly 4 GB AWQ.
- A 13B model shrinks from about 26 GB to roughly 7 GB.
- A 70B model shrinks from about 140 GB to roughly 38 GB.
These figures cover the weight payload only; activation memory and the KV cache remain in FP16 and scale with context length and batch size.[1][2]
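The rule-of-thumb figures can be reproduced with a short calculation. The sketch below assumes 4.25 effective bits per weight (4-bit payload plus an FP16 scale and zero-point per 128 weights) applied to every parameter, and uses a hypothetical 7B-class model shape for the KV cache; layers that stay in FP16, such as embeddings, make real checkpoints somewhat larger than this estimate.

```python
def weight_gb(params_billion, bits_per_weight):
    """Weight storage in GB (10^9 bytes) for a parameter count given in billions."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """FP16 KV cache: keys and values for every layer, head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value / 1e9


# 4-bit payload plus an FP16 scale and an FP16 zero-point shared by 128 weights.
AWQ_BITS = 4 + (16 + 16) / 128

for size in (7, 13, 70):
    fp16 = weight_gb(size, 16)
    awq = weight_gb(size, AWQ_BITS)
    print(f"{size}B: FP16 {fp16:.1f} GB, W4A16g128 about {awq:.1f} GB")

# Hypothetical 7B-class shape: 32 layers, 32 KV heads of dimension 128, 4096 cached tokens.
kv = kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"FP16 KV cache at 4096 tokens: {kv:.2f} GB")
```

The KV cache term is untouched by weight quantization and grows linearly with both context length and batch size, which is why it can rival the compressed weights at long contexts.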
Throughput at single-request decoding follows from the memory reduction. Decoding is memory-bandwidth-bound on most accelerators at batch size 1: each generated token requires reading the entire weight matrix from DRAM. Reducing weight precision from 16 bits to 4 bits cuts DRAM traffic by roughly 4x, with the observed end-to-end speedup typically landing in the 2.7x to 3.3x range once dequantization overhead and the unchanged FP16 activation traffic are accounted for.[1][2]
At larger batch sizes, each loaded weight is reused across more tokens, the arithmetic intensity rises, and the relative benefit of weight-only quantization shrinks. AWQ's W4A16 design is therefore most effective for latency-bound single-request inference and for memory-constrained deployments, less so for high-throughput offline batch serving where compute throughput dominates.[1]
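A roofline-style estimate makes the batch-size dependence concrete. The sketch below is a deliberately crude model under stated assumptions: each decode step costs the larger of the weight-load time and the matrix-multiply time, dequantization overhead and activation traffic are ignored, and the accelerator numbers (1 TB/s of memory bandwidth, 100 TFLOP/s of FP16 compute) are hypothetical.

```python
def decode_step_seconds(n_params, weight_bits, batch, mem_bandwidth, peak_flops):
    """Lower bound on one decode step: max(weight-load time, matmul time)."""
    load_time = n_params * weight_bits / 8 / mem_bandwidth   # read every weight once
    compute_time = 2 * n_params * batch / peak_flops         # one multiply-add per weight per sequence
    return max(load_time, compute_time)


def w4_speedup(n_params, batch, mem_bandwidth, peak_flops, awq_bits=4.25):
    """Ideal speedup of W4A16 over FP16 weights under the model above."""
    fp16 = decode_step_seconds(n_params, 16, batch, mem_bandwidth, peak_flops)
    awq = decode_step_seconds(n_params, awq_bits, batch, mem_bandwidth, peak_flops)
    return fp16 / awq


# Hypothetical accelerator: 1 TB/s of DRAM bandwidth, 100 TFLOP/s of FP16 compute.
BANDWIDTH, PEAK = 1e12, 1e14
speedups = {batch: w4_speedup(7e9, batch, BANDWIDTH, PEAK) for batch in (1, 8, 64, 512)}
for batch, speedup in speedups.items():
    print(f"batch {batch}: ideal speedup {speedup:.2f}x")
```

Under this model the ideal gain at batch size 1 equals the compression ratio, and it falls to 1x once both configurations are compute-bound. Measured speedups are lower than the ideal figure because the model leaves out dequantization work and non-weight memory traffic.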
Limitations
AWQ is a widely deployed method, but it has identifiable limitations:
- Calibration required. Although AWQ uses far fewer samples than GPTQ and is less sensitive to domain mismatch, it is not a zero-shot method like RTN. The calibration step adds a pipeline stage and requires representative input data, which can be a concern for proprietary or sensitive models.[1]
- W4A16 does not exploit INT8 tensor cores. Because activations stay in FP16, the underlying matrix multiplication is still a 16-bit operation after on-the-fly INT4 dequantization. The acceleration comes from DRAM-bandwidth savings, not from faster arithmetic. On compute-bound workloads (large batch, long-context prefill on accelerators with strong FP16 throughput), AWQ delivers little speedup over FP16.[1]
- Accuracy ceiling near 4 bits. AWQ is reliable at 4-bit, somewhat usable at 3-bit, and not usable at 2-bit for most tasks. Sub-4-bit operating points require methods such as AQLM or QuIP# that use vector-quantization codebooks.[1]
- Non-linear layers and the KV cache stay in FP16. AWQ quantizes only linear weight matrices. Layer norms, embedding tables, attention softmax outputs, and the KV cache remain in 16-bit. For very long context windows, KV cache memory can dominate, leaving AWQ's compression irrelevant for that component. KV cache quantization is a separate research area.[1]
- Stable outlier assumption. AWQ works because the same input channels consistently carry large activations across diverse inputs in standard transformer architectures. In novel architectures with input-dependent outlier patterns, a fixed per-channel scale may be less effective. This has not been a practical problem for the major open-weight model families (LLaMA, Mistral, Qwen, Falcon, OPT), but is a theoretical caveat for non-standard designs.[1]
Related methods and extensions
- Marlin AWQ. The Marlin kernel from IST Austria's IST-DASLab is a high-performance FP16-by-INT4 matrix-multiplication implementation that pairs with AWQ-format checkpoints. vLLM uses Marlin as the default kernel for AWQ models on supported GPUs, reporting substantially higher decode throughput than the original TinyChat-style kernels on the same workload.[11]
- OmniQuant. OmniQuant (Shao et al., 2023) extends the equivalent-transformation idea by jointly learning per-channel scales and clipping bounds with gradient descent rather than grid search, trading additional calibration compute for somewhat better accuracy in some regimes.
- QuaRot and SpinQuant. These methods apply learned or random orthogonal rotations to activations to spread outlier magnitudes across channels, enabling W4A4 quantization with usable accuracy. They are more expensive to calibrate but enable fully integer inference, which AWQ does not provide.
- AQLM and QuIP#. Vector-quantization codebook methods designed for the sub-4-bit regime, where AWQ degrades.
- SmoothQuant. Shares the per-channel scaling idea but applies it bidirectionally to enable W8A8 quantization with INT8 compute, complementary to AWQ's weight-only W4A16 design.[9]
References
- Lin, Ji; Tang, Jiaming; Tang, Haotian; Yang, Shang; Chen, Wei-Ming; Wang, Wei-Chen; Xiao, Guangxuan; Dang, Xingyu; Gan, Chuang; Han, Song. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint, 2023-06-01. https://arxiv.org/abs/2306.00978. Accessed 2026-05-26.
- MIT HAN Lab. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." Project page, MIT. https://hanlab.mit.edu/projects/awq. Accessed 2026-05-26.
- MLSys 2024 Conference Organizers. "MLSys 2024 Awards: Best Paper Award." Conference on Machine Learning and Systems, 2024. https://mlsys.org/virtual/2024/awards_detail. Accessed 2026-05-26.
- Frantar, Elias; Ashkboos, Saleh; Hoefler, Torsten; Alistarh, Dan. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv preprint, 2022-10-31. https://arxiv.org/abs/2210.17323. Accessed 2026-05-26.
- arXiv. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (arXiv:2306.00978)." Listing page with version history v1 (2023-06-01) through subsequent revisions. https://arxiv.org/abs/2306.00978. Accessed 2026-05-26.
- Lin, Ji; Tang, Jiaming; Tang, Haotian; Yang, Shang; Chen, Wei-Ming; Wang, Wei-Chen; Xiao, Guangxuan; Dang, Xingyu; Gan, Chuang; Han, Song. "AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration." Proceedings of Machine Learning and Systems (MLSys), 2024. https://proceedings.mlsys.org/paper_files/paper/2024/hash/42a452cbafa9dd64e9ba4aa95cc1ef21-Abstract-Conference.html. Accessed 2026-05-26.
- MIT HAN Lab. "llm-awq: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration" (GitHub repository). https://github.com/mit-han-lab/llm-awq. Accessed 2026-05-26.
- Hugging Face. "AWQ quantization documentation." Hugging Face Transformers docs. https://huggingface.co/docs/transformers/quantization/awq. Accessed 2026-05-26.
- Xiao, Guangxuan; Lin, Ji; Seznec, Mickael; Wu, Hao; Demouth, Julien; Han, Song. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." arXiv preprint, 2022-11-18. https://arxiv.org/abs/2211.10438. Accessed 2026-05-26.
- Hansen, Casper. "AutoAWQ" (GitHub repository). https://github.com/casper-hansen/AutoAWQ. Accessed 2026-05-26.
- vLLM Project. "AutoAWQ quantization documentation." https://docs.vllm.ai/en/latest/features/quantization/auto_awq.html. Accessed 2026-05-26.
- NVIDIA Developer Blog. "Optimizing LLMs for Performance and Accuracy with Post-Training Quantization." NVIDIA Developer. https://developer.nvidia.com/blog/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/. Accessed 2026-05-26.