AWQ
Last reviewed
May 7, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 4,673 words
AWQ (Activation-aware Weight Quantization) is a post-training quantization method for large language models that compresses model weights to 4-bit integers while preserving near-FP16 accuracy. Developed at MIT's HAN Lab and published in June 2023 by Ji Lin, Song Han, and colleagues, AWQ won the Best Paper Award at MLSys 2024. The method is widely used in production inference stacks including vLLM, NVIDIA TensorRT-LLM, Hugging Face Transformers, and Amazon SageMaker, and AWQ-format models had accumulated over six million downloads on the Hugging Face Hub as of mid-2024.
As large language models scaled to tens of billions of parameters, the gap between model size and available GPU memory became a central deployment challenge. A 70-billion-parameter model in FP16 requires roughly 140 GB of GPU memory, placing it out of reach for all but the largest server clusters. A 7-billion-parameter model demands around 14 GB at FP16, exceeding the VRAM on most consumer desktop GPUs and many cloud instances. Serving these models at scale required either very expensive multi-GPU configurations or a way to reduce memory footprint.
Quantization reduces the memory footprint by representing model parameters in lower-precision numeric formats. Eight-bit integer (INT8) quantization cuts weight storage roughly in half compared to FP16; 4-bit quantization cuts it to one quarter. The engineering challenge is achieving this compression without meaningful accuracy degradation.
The memory-bound nature of autoregressive decoding makes weight quantization especially attractive for inference. During the generation phase, the GPU loads the model's weight matrices from DRAM for each token produced. At batch size one, the computation is almost entirely limited by memory bandwidth rather than arithmetic throughput. The GPU's tensor cores sit largely idle while weights trickle in from DRAM. Reducing weight precision from 16 bits to 4 bits reduces DRAM traffic by roughly 4x, translating into proportional speedups on memory-bandwidth-limited hardware. This is particularly valuable for latency-sensitive applications and on edge devices where DRAM bandwidth is a scarce resource.
Before AWQ, post-training quantization methods for LLMs fell into two main categories. Round-to-nearest (RTN) quantization maps each weight to the nearest representable integer value with no additional optimization. It is trivially fast to apply but produces significant accuracy degradation at 4-bit precision, particularly for smaller models in the 1-to-7 billion parameter range. GPTQ (Frantar et al., 2022) achieved much better 4-bit accuracy by applying a layer-wise optimization: it uses approximate second-order information (the inverse Hessian of each layer's reconstruction error with respect to its weights) to compensate for quantization error, adjusting the remaining weights after each one is quantized. GPTQ reached near-FP16 perplexity on many models at 4 bits, but the Hessian computation required a relatively large calibration set (2,048 samples by default), took hours to run on large models, and showed sensitivity to the calibration data distribution.
AWQ identified a third path. Instead of correcting for quantization error after it occurs, AWQ asks which weights will suffer most from quantization and takes proactive steps to protect them, using a signal from the model's activation distribution to make that determination.
AWQ was introduced in the paper "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration" (arXiv:2306.00978), first published on June 1, 2023. The full author list is Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. All authors were affiliated with MIT at the time of publication. Ji Lin subsequently joined OpenAI, and Song Han remains principal investigator of the HAN Lab at MIT.
The paper was revised multiple times before its final publication; v5 on arXiv updated the title to "AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration" to reflect the strong edge-device results. The work was accepted to MLSys 2024 (the Seventh Conference on Machine Learning and Systems), where it received the Best Paper Award, the conference's highest honor. The accompanying code and TinyChat inference engine are maintained at github.com/mit-han-lab/llm-awq.
The paper's central observation is straightforward: weights are not all equally important for model quality, and the way to identify the important ones is to look at which input channels consistently carry large-magnitude activations. The paper shows through ablations that protecting just 1% of weight channels by this criterion dramatically reduces quantization error, closing most of the gap between pure INT4 RTN and FP16. The challenge is protecting those channels without resorting to hardware-inefficient mixed-precision arithmetic. AWQ solves this with a per-channel scaling transformation that is mathematically equivalent to precision protection but keeps all weights in the same INT4 format.
Beyond the language modeling results, the paper was also the first to demonstrate successful low-bit quantization of visual language models including LLaVA and OpenFlamingo, showing that the activation-based salience signal generalizes to multimodal architectures.
The AWQ algorithm starts with a key observation about how quantization error propagates through a linear layer. Consider a linear layer computing output y = W x, where W is the weight matrix and x is the input activation vector. After quantizing W to get Q(W), the output error is (W - Q(W)) x. For a weight column i (corresponding to input channel i), the contribution to output error is (w_i - Q(w_i)) * x_i, where x_i is the i-th element of the input activation.
If x_i has consistently large magnitude across inputs, even a small per-element weight quantization error in column i gets amplified into a large output error. If x_i is consistently near zero, the quantization error in that column barely affects the output regardless of its magnitude. This means the importance of a weight channel is determined not by the weight values themselves, but by the magnitude of the corresponding input activations.
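A small numeric sketch can make this argument concrete. The weight errors and activations below are made-up values chosen only for illustration; every column has the same quantization error, yet one channel dominates the output error because its activation is large.

```python
# Illustrative only: the errors and activations are made-up numbers.

def output_error(weight_errors, activations):
    """Per-channel contribution (w_i - Q(w_i)) * x_i to one output element."""
    return [dw * x for dw, x in zip(weight_errors, activations)]

# Identical quantization error in every weight column...
weight_errors = [0.01, 0.01, 0.01, 0.01]
# ...but channel 2 carries a much larger activation than the others.
activations = [0.1, 0.2, 25.0, 0.05]

per_channel = output_error(weight_errors, activations)
dominant = max(range(len(per_channel)), key=lambda i: abs(per_channel[i]))

print("per-channel output error:", per_channel)
print("channel dominating the output error:", dominant)
```

Ranking channels by activation magnitude rather than by weight magnitude follows directly from this product form.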
To verify this, the authors run an ablation: they protect different subsets of weight channels by keeping them in FP16 while quantizing the rest to low-bit integers with RTN. When they select the 1% of channels with the highest average activation magnitude, WikiText-2 perplexity on LLaMA-7B drops from 43.2 (pure 3-bit RTN) to 13.0, recovering most of the accuracy lost to quantization. When they select 1% of channels randomly, the improvement is minimal. When they select channels based on weight magnitude rather than activation magnitude, the improvement is also much smaller. This empirical result motivates the choice of activation magnitude as the salience criterion.
The naive implementation, keeping salient channels in FP16 and the rest in INT4, works but is hardware-inefficient. Mixed-precision computation within a single layer requires special handling in the inference kernel, introduces control flow overhead, and cannot take advantage of bulk INT4 tensor operations. AWQ avoids this by using an equivalent transformation.
The core algorithmic contribution of AWQ is a method to protect salient weight channels without mixed-precision arithmetic. The idea is to scale up the values of a salient weight column before quantization, and then undo the scale at inference time by scaling down the corresponding input activation.
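For a weight w in a salient column and a scale factor s > 1, the scaled weight is w' = w * s. When w' is quantized to INT4, the absolute rounding error delta is expected to be about the same as without scaling, because the INT4 grid spacing is set by the largest-magnitude weights in the quantization group and usually does not change when only a few columns are scaled. Once the 1/s factor is applied, the effective error on the original weight is delta / s, so the relative error delta / w' = delta / (w * s) is 1/s times that of the unscaled weight. A larger s gives more protection, up to the point where the scaled column starts to widen the group's range.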
The transformation is mathematically lossless if the corresponding input is divided by s: (Q(w * s) / s) * x = Q(w * s) * (x / s). In a transformer's attention and MLP layers, the activation x/s can be produced by absorbing the 1/s factor into the weight or bias of the preceding layer (typically a layer normalization). This means the inverse scaling adds no extra computation at inference; it is baked into the preceding layer's parameters during the offline quantization step.
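The effect of scaling can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes a uniform 4-bit grid over a fixed group range, a hypothetical scale S = 2, and salient weights small enough that scaling them does not change the group range. Under those assumptions the average rounding error of the salient weight should roughly halve.

```python
import math
import random

def rtn(value, lo, hi, bits=4):
    """Round `value` to the nearest point of a uniform grid spanning [lo, hi]."""
    step = (hi - lo) / (2 ** bits - 1)
    return lo + round((value - lo) / step) * step

rng = random.Random(0)
LO, HI = -1.0, 1.0   # assumed group range, set by the largest weights in the group
S = 2.0              # hypothetical scale factor for the salient channel
N = 2000

plain_err = scaled_err = 0.0
for _ in range(N):
    w = rng.uniform(-0.4, 0.4)                     # w * S stays inside [LO, HI]
    plain_err += abs(rtn(w, LO, HI) - w)           # quantize directly
    scaled_err += abs(rtn(w * S, LO, HI) / S - w)  # quantize scaled, then undo scale
plain_err /= N
scaled_err /= N

# Folding 1/S into the activation gives the same product as unscaling the weight.
w, x = 0.3, 3.0
unscale_weight = (rtn(w * S, LO, HI) / S) * x
unscale_activation = rtn(w * S, LO, HI) * (x / S)

print(f"mean error without scaling: {plain_err:.4f}")
print(f"mean error with scaling:    {scaled_err:.4f}")
```

The last two values illustrate why the inverse scale can live in the preceding layer: dividing the activation by S is numerically interchangeable with dividing the dequantized weight by S.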
The scale factors s_i for each channel are chosen via a grid search. The search evaluates a range of candidate scale values (proportional to powers of the per-channel activation magnitude) and selects the values that minimize quantization error on the calibration set. Because this is just a forward-pass measurement at the layer level, not full-model backpropagation, the search is fast. The paper reports that the entire quantization process runs in minutes for 7B models.
A subtlety is that increasing s too aggressively can hurt the unprotected channels. Once a scaled column's weights exceed the previous maximum of their quantization group, the group's range grows, which widens the INT4 grid spacing and increases the rounding error of the other weights that share the grid. The grid search picks the s that best balances protecting salient channels against disturbing the rest.
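The search described above can be sketched in a few lines. This is a simplified illustration rather than the reference implementation: it assumes scales of the form s_i = (mean |x_i|)^alpha with alpha swept over [0, 1], one quantization group per weight row, and synthetic data with a single outlier channel. All function names here are hypothetical.

```python
import random

def quantize_row(row, bits=4):
    """Asymmetric round-to-nearest over one row, returned in dequantized form."""
    lo, hi = min(row), max(row)
    step = (hi - lo) / (2 ** bits - 1) or 1.0
    return [lo + round((w - lo) / step) * step for w in row]

def layer_out(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def scaled_quant_error(W, X, scales):
    """Mean squared output error of Q(W * s) applied to x / s, versus W applied to x."""
    Wq = [quantize_row([w * s for w, s in zip(row, scales)]) for row in W]
    err = 0.0
    for x in X:
        ref = layer_out(W, x)
        out = layer_out(Wq, [xi / s for xi, s in zip(x, scales)])
        err += sum((a - b) ** 2 for a, b in zip(ref, out))
    return err / len(X)

def search_scales(W, X, n_grid=20):
    """Grid search over alpha; alpha = 0 reproduces plain RTN (all scales 1)."""
    n_in = len(W[0])
    act_mag = [max(sum(abs(x[i]) for x in X) / len(X), 1e-4) for i in range(n_in)]
    best = None
    for k in range(n_grid + 1):
        alpha = k / n_grid
        scales = [m ** alpha for m in act_mag]
        err = scaled_quant_error(W, X, scales)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best

rng = random.Random(0)
n_out, n_in = 8, 16
W = [[rng.gauss(0, 0.05) for _ in range(n_in)] for _ in range(n_out)]
# Channel 3 is a synthetic outlier channel with activations about 20x larger.
X = [[rng.gauss(0, 1) * (20.0 if i == 3 else 1.0) for i in range(n_in)]
     for _ in range(32)]

rtn_err = scaled_quant_error(W, X, [1.0] * n_in)
best_err, best_alpha, _ = search_scales(W, X)
print(f"RTN error {rtn_err:.5f}; best alpha {best_alpha:.2f} gives {best_err:.5f}")
```

Because alpha = 0 is part of the grid, the selected scales can never do worse than plain RTN on the calibration data, and only forward passes through a single layer are needed.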
AWQ requires a small calibration set to estimate per-channel activation magnitudes. The standard setup uses 128 to 512 randomly sampled sequences from the Pile or C4 text corpora. For each sequence, the model runs a forward pass and records the mean absolute activation magnitude for each input channel of each linear layer. These statistics are averaged over the calibration set to produce stable channel importance scores.
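The statistic collected during calibration is simple enough to sketch directly. The example below assumes activations have already been recorded as nested lists (sequences of token vectors) and averages per token; the function name and the toy data are hypothetical.

```python
def channel_importance(calib_activations):
    """Mean absolute activation per input channel.

    calib_activations: list of sequences; each sequence is a list of token
    vectors (one float per input channel) recorded at a linear layer's input.
    """
    n_channels = len(calib_activations[0][0])
    totals = [0.0] * n_channels
    count = 0
    for seq in calib_activations:
        for token_vec in seq:
            for i, v in enumerate(token_vec):
                totals[i] += abs(v)
            count += 1
    return [t / count for t in totals]

# Two tiny made-up "sequences" with three input channels each.
calib = [
    [[0.1, -4.0, 0.2], [-0.3, 5.0, 0.1]],
    [[0.2, -6.0, -0.1]],
]
scores = channel_importance(calib)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print("importance scores:", scores)
print("channels, most salient first:", ranking)
```

No gradients or second-order terms appear anywhere in this computation, which is the reason the calibration pass is cheap.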
Because AWQ only needs activation magnitudes, not gradients or second-order information, the calibration requirement is far lighter than GPTQ's. The paper shows that AWQ is also more robust to calibration domain mismatch: when the calibration data comes from a different domain than the model's intended use, AWQ's perplexity increases by only 0.5 to 0.6 points, whereas GPTQ's perplexity degrades by 2.3 to 4.9 points under the same domain mismatch. This makes AWQ practical in cases where a perfectly matched calibration set is unavailable.
AWQ uses grouped quantization for the actual INT4 compression. With a group size of 128 (the standard configuration), every consecutive 128 weights along the input-channel dimension share one quantization scale and one zero-point. Each group's range is determined independently, allowing the quantization grid to adapt locally to the weight distribution. This substantially reduces quantization error compared to per-tensor quantization (one scale for the entire weight matrix) or even per-channel quantization (one scale per output channel), at the cost of slight storage overhead.
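A minimal sketch of grouped quantization follows. It assumes one common formulation in which the zero-point is stored as the group's minimum value; real kernels may instead store an integer zero-point. The data are synthetic: 256 small weights plus one large outlier in the second group.

```python
def quantize_grouped(weights, group_size=128, bits=4):
    """Round-to-nearest with one (scale, zero_point) pair per group."""
    qmax = 2 ** bits - 1
    codes, scales, zeros = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax or 1.0
        scales.append(scale)
        zeros.append(lo)  # zero-point kept as the group minimum (an assumption)
        codes.extend(round((w - lo) / scale) for w in group)
    return codes, scales, zeros

def dequantize_grouped(codes, scales, zeros, group_size=128):
    return [zeros[i // group_size] + c * scales[i // group_size]
            for i, c in enumerate(codes)]

def abs_errors(weights, group_size):
    codes, scales, zeros = quantize_grouped(weights, group_size)
    restored = dequantize_grouped(codes, scales, zeros, group_size)
    return [abs(a - b) for a, b in zip(weights, restored)]

small = [(i - 64) / 1000 for i in range(128)]   # values in [-0.064, 0.063]
weights = small + small[:-1] + [2.0]            # the outlier lives in group 2

# Worst-case error on the first 128 weights under each grouping.
err_g128 = max(abs_errors(weights, 128)[:128])  # outlier is in another group
err_g256 = max(abs_errors(weights, 256)[:128])  # outlier shares the grid
print(f"g128: {err_g128:.4f}   one group of 256: {err_g256:.4f}")
```

With separate groups, the outlier only coarsens the grid of the group it belongs to, which is the local adaptation described above.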
Storage cost for a group-size-128 INT4 weight matrix: each group of 128 weights is stored as 128 x 4 bits = 64 bytes, plus one FP16 scale (2 bytes) and one FP16 zero-point (2 bytes), for a total of 68 bytes per group versus 256 bytes at FP16. The effective compression ratio is about 3.76x, somewhat less than the theoretical 4x due to scale and zero-point overhead.
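The same arithmetic can be written as a short script. The model-level estimate is a rough sketch that treats every parameter as a quantized linear-layer weight, which slightly understates the real footprint because embeddings and normalization parameters stay in FP16.

```python
def bytes_per_group(group_size=128, bits=4, scale_bytes=2, zero_bytes=2):
    """Packed weights plus one FP16 scale and one FP16 zero-point."""
    return group_size * bits / 8 + scale_bytes + zero_bytes

def compression_ratio(group_size=128, bits=4):
    fp16_bytes = group_size * 2
    return fp16_bytes / bytes_per_group(group_size, bits)

def weight_gb(n_params, group_size=128, bits=4):
    """Approximate quantized weight footprint in GB (10**9 bytes)."""
    return n_params / group_size * bytes_per_group(group_size, bits) / 1e9

print(bytes_per_group())               # 68.0
print(round(compression_ratio(), 2))   # 3.76
for n in (7e9, 13e9, 70e9):
    print(f"{n / 1e9:.0f}B params: FP16 {n * 2 / 1e9:.0f} GB, "
          f"W4g128 about {weight_gb(n):.1f} GB")
```

Smaller group sizes improve accuracy but raise the per-group overhead; passing `group_size=64` to `compression_ratio` shows the ratio dropping accordingly.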
This format is often written W4A16g128 in model cards and documentation: 4-bit weights (W4), 16-bit activations (A16), group size 128 (g128). The W4A16 portion distinguishes AWQ from weight-and-activation quantization schemes that reduce both to INT8 or INT4.
AWQ-quantized weights cannot be used for inference without a corresponding kernel that knows how to dequantize INT4 weights and multiply them against FP16 activations efficiently. The MIT HAN Lab developed TinyChat as the inference counterpart to AWQ.
TinyChat implements W4A16 CUDA kernels that perform on-the-fly dequantization. The kernel loads packed INT4 weights from DRAM into GPU shared memory, unpacks and dequantizes them to FP16 in registers, then performs the FP16 matrix multiply-accumulate. This is fused into a single kernel call to avoid intermediate writes back to global memory. The kernel also fuses the activation function (such as SiLU in LLaMA's MLP block) to further reduce memory traffic.
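The data flow of such a kernel (unpack, dequantize, multiply-accumulate) can be mimicked in plain Python. This is only a conceptual sketch: it assumes two 4-bit codes per byte with the low nibble first and one scale and zero-point per row, whereas real CUDA kernels use interleaved layouts, per-group parameters, and fused FP16 arithmetic.

```python
def pack_int4(codes):
    """Pack 4-bit codes (ints in 0..15) two per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))

def unpack_int4(packed, n):
    """Inverse of pack_int4; returns the first n codes."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out[:n]

def dequant_matvec(packed_rows, scales, zeros, x):
    """y = W x with each row of W stored as packed 4-bit codes."""
    y = []
    for packed, scale, zero in zip(packed_rows, scales, zeros):
        codes = unpack_int4(packed, len(x))
        # Dequantize on the fly and accumulate, never materializing W in full.
        y.append(sum((c - zero) * scale * xi for c, xi in zip(codes, x)))
    return y

codes = [0, 15, 8, 7]            # represents weights [-4.0, 3.5, 0.0, -0.5]
packed = pack_int4(codes)
y = dequant_matvec([packed], scales=[0.5], zeros=[8], x=[1.0, 2.0, 3.0, 4.0])
print(len(packed), "bytes for", len(codes), "weights; y =", y)
```

Only the packed bytes cross the memory bus; the expansion to higher precision happens next to the arithmetic, which is where the bandwidth saving comes from.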
The resulting end-to-end speedup over the Hugging Face FP16 baseline is reported as roughly 3x or more, with specific figures including 2.7x on an RTX 4090 for Llama-2-7B generation and 2.9x on a Jetson Orin. On the Jetson Orin with 64 GB unified memory, TinyChat runs Llama-2-7B at 30 tokens per second and Llama-2-13B at 17 tokens per second. The Llama-2-70B model, which at FP16 would require far more memory than the Jetson Orin can provide, becomes deployable with AWQ at an interactive generation speed. These results were among the first to demonstrate the viability of 70B-class models on resource-constrained edge hardware.
On desktop GPUs, TinyChat's AWQ kernels run Llama-3-8B at approximately 2.7x the speed of FP16 on an RTX 4090. The framework also reports 38 tokens per second for Llama-2-7B on a Jetson Orin with optimized scheduling.
TinyChat 2.0, released in December 2024, extended the system to visual language models. The updated framework supports VILA and NVILA multimodal models, achieves 1.5 to 1.7x faster prefilling throughput compared to TinyChat 1.0 through improved kernel scheduling and parallelism, and adds a web demo interface.
The HAN Lab separately publishes TinyChatEngine, a C++ inference library targeting on-device deployment on CPUs, ARM processors, and mobile hardware where CUDA is not available.
The following table compares AWQ against the most commonly used alternative quantization methods for LLMs.
| Method | Precision | Approach | Calibration data | Quantization time (7B) | WikiText-2 perplexity (LLaMA-7B W4g128) | Hardware-friendly |
|---|---|---|---|---|---|---|
| RTN | W4A16 | Round-to-nearest | None | Seconds | ~5.9-6.0 | Yes |
| GPTQ | W4A16 | Hessian-based layer-wise optimization | 2,048+ samples | 2-4 hours | ~5.7 | Yes |
| AWQ | W4A16 | Activation-aware channel scaling | 128-512 samples | 10-30 minutes | ~5.78 | Yes |
| bitsandbytes NF4 | W4A16 | Normal Float 4 (used in QLoRA) | None | Seconds | ~5.7-5.9 | Limited |
| SmoothQuant | W8A8 | Activation outlier migration | Small set | Minutes | N/A | Yes (INT8 cores) |
RTN at 4-bit performs noticeably worse than FP16 on smaller models. On LLaMA-7B with group size 128, RTN perplexity on WikiText-2 is roughly 5.9 to 6.0, against FP16's 5.68. At 3-bit precision the gap is severe: RTN degrades to 43.2 perplexity on LLaMA-7B, making the model essentially unusable. AWQ at 3-bit achieves 13.0 on the same benchmark, a substantial improvement, though still noticeably worse than 4-bit.
GPTQ achieves comparable or slightly better perplexity than AWQ at 4 bits on some models. On LLaMA-7B, GPTQ with group size 128 reaches approximately 5.67 WikiText-2 perplexity (essentially matching FP16), while AWQ reaches around 5.78. On LLaMA-13B and larger, the difference shrinks and the methods are often equivalent. GPTQ's main disadvantages are its quantization time (hours per model), its sensitivity to the calibration domain, and occasional numerical instability on some architectures. AWQ takes 10 to 30 minutes for a 7B model and is far more robust to calibration distribution.
Bitsandbytes NF4, used in the QLoRA fine-tuning workflow, uses a different numeric format: instead of a uniform integer grid, it places its 16 representable 4-bit values at quantiles of a standard normal distribution, matching the approximately normal distribution of trained neural network weights. This achieves comparable accuracy to GPTQ/AWQ at 4 bits. The inference speed of NF4 with bitsandbytes is generally slower than AWQ or GPTQ with optimized kernels, because NF4 inference kernels have historically been less aggressively optimized.
SmoothQuant (Xiao et al., 2022) operates in a different regime: W8A8 rather than W4A16. It migrates quantization difficulty from the activation side to the weight side by dividing each activation channel by a per-channel smoothing factor and multiplying the corresponding weight channel by the same factor, making both distributions manageable for INT8 quantization. This enables genuine INT8 matrix multiply-accumulate operations on hardware with strong INT8 throughput (NVIDIA A100, H100). SmoothQuant is complementary to AWQ: it targets applications where compute is the bottleneck (high batch sizes, compute-dense prefill), while AWQ targets memory-bound decode-phase inference at low to moderate batch sizes.
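The memory reduction from AWQ's W4A16g128 format is approximately 3.5 to 3.8x compared to FP16, accounting for scale and zero-point overhead. In practical terms, applying that ratio to the FP16 figures cited elsewhere in this article, a 7B model's roughly 14 GB of weights shrinks to about 3.7 to 4 GB, a 13B model's 26 GB to about 7 GB, and a 70B model's 140 GB to about 37 to 40 GB.
These numbers exclude activation memory and KV cache, which remain in FP16 and scale with context length and batch size.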
The throughput benefit at single-request latency follows roughly from the memory reduction. Memory-bandwidth-limited inference on GPU is dominated by DRAM reads for weight matrices. Reducing weight storage by ~4x reduces the DRAM traffic per token by ~4x, giving approximately 4x speedup before other overheads. In practice, TinyChat and similar optimized kernels achieve 2.7x to 3.3x speedup over FP16 implementations, with the remainder lost to dequantization compute, kernel launch overhead, and the unchanging non-weight memory traffic (activations, KV cache).
At higher batch sizes, the arithmetic intensity shifts and weight quantization becomes less valuable. When processing a batch of 64 requests simultaneously, each weight loaded from DRAM is reused 64 times, amortizing the memory cost. The memory bandwidth bottleneck eases and compute throughput becomes the limiting factor. INT4 does not help compute throughput unless integer multiply-accumulate can be exploited, which W4A16 does not do. This is why W4A16 AWQ quantization is most impactful for low-latency single-request inference and less beneficial for high-throughput offline batch processing.
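A crude roofline model illustrates the shift. The sketch below assumes a hypothetical accelerator with 1 TB/s of memory bandwidth and 100 TFLOP/s of FP16 compute, counts 2 FLOPs per weight per sequence, and ignores KV cache traffic, activations, and dequantization cost; the absolute numbers are not measurements.

```python
def decode_step_time(n_params, bytes_per_weight, batch, mem_bw, peak_flops):
    """Estimated seconds per decode step: the slower of memory and compute."""
    mem_time = n_params * bytes_per_weight / mem_bw
    compute_time = 2 * n_params * batch / peak_flops
    return max(mem_time, compute_time)

# Hypothetical hardware and model size, chosen for illustration only.
MEM_BW = 1e12        # bytes per second
PEAK_FLOPS = 100e12  # FP16 FLOP per second
N_PARAMS = 7e9

def speedup(batch):
    """FP16 step time divided by W4A16g128 step time at a given batch size."""
    fp16 = decode_step_time(N_PARAMS, 2.0, batch, MEM_BW, PEAK_FLOPS)
    w4 = decode_step_time(N_PARAMS, 68 / 128, batch, MEM_BW, PEAK_FLOPS)
    return fp16 / w4

for batch in (1, 16, 64, 256):
    print(f"batch {batch:>3}: estimated speedup {speedup(batch):.2f}x")
```

In this model the speedup equals the compression ratio while both formats are memory-bound, shrinks once the 4-bit path becomes compute-bound, and disappears when both are compute-bound.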
Hugging Face integrated AWQ support into the Transformers library in November 2023. Users can load any AWQ-quantized model from the Hugging Face Hub with the standard from_pretrained call:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    device_map="auto"
)
```
AWQ model checkpoints include a quantization_config block in their config.json with quant_method: "awq", bits: 4, and group_size: 128. The Transformers integration supports both the original llm-awq and the community autoawq backends.
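A checkpoint's configuration can be inspected before loading to confirm its quantization format. The sketch below is illustrative: the JSON excerpt is hypothetical apart from the three `quantization_config` fields named above, and `describe_quantization` is a helper invented for this example, not part of any library.

```python
import json

# Hypothetical excerpt of a config.json; only the three quantization_config
# fields mentioned in the text are taken from the article.
config_text = """
{
  "model_type": "mistral",
  "quantization_config": {"quant_method": "awq", "bits": 4, "group_size": 128}
}
"""

def describe_quantization(config):
    """Return a label such as 'W4A16g128' for AWQ checkpoints, else None."""
    q = config.get("quantization_config")
    if not q or q.get("quant_method") != "awq":
        return None
    return f"W{q['bits']}A16g{q['group_size']}"

label = describe_quantization(json.loads(config_text))
print(label)  # W4A16g128
```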
For Llama and Mistral architectures, Transformers supports fused AWQ modules, which combine Q/K/V projections, output projections, and MLP layers into single fused kernels. Fused modules improve decode throughput roughly 2x compared to unfused operation at batch size one. The integration also supports ExLlamaV2 CUDA kernels via AwqConfig(version="exllama"), which work on both NVIDIA and AMD GPUs.
The AWQ Hugging Face model ecosystem grew rapidly in late 2023 and 2024, driven in part by community quantizer TheBloke, who published AWQ versions of nearly every major open-weight model release. By mid-2024, the Hub hosted over 7,000 AWQ-format models.
AutoAWQ is an open-source library by developer Casper Hansen, started in August 2023. It provides a simplified interface for quantizing models with AWQ and running inference, supporting a wider range of architectures than the original HAN Lab implementation. AutoAWQ integrates natively with the Hugging Face Hub for pushing and pulling quantized model checkpoints. The project accumulated over two million downloads and was Transformers' recommended AWQ backend. AutoAWQ was deprecated in 2025 as its functionality was folded into mainline Transformers and llm-compressor.
vLLM supports AWQ natively and documents it as a recommended INT4 format for production serving. Starting from vLLM 0.6.1, AWQ models are served using Marlin kernels by default, which dramatically outperform the original TinyChat-style W4A16 kernels. Marlin AWQ achieves approximately 741 tokens per second output throughput on an A10G GPU for Llama-2-7B at batch size 1 to 4, compared to around 68 tokens per second with baseline AWQ kernels, a 10.9x improvement. This keeps AWQ throughput competitive with FP16 on memory-bandwidth-limited hardware, even after accounting for on-the-fly dequantization overhead.
The vLLM project also ships an llm-compressor package that supports AWQ calibration as part of the model compression pipeline, tightly coupled to vLLM's serving infrastructure.
NVIDIA's TensorRT-LLM supports INT4 AWQ quantization natively. The NVIDIA TensorRT Model Optimizer includes AWQ calibration as a built-in quantization recipe, with output that integrates directly with TensorRT-LLM for deployment. This makes AWQ accessible to enterprise users via AWS SageMaker (which runs TensorRT-LLM backends), Google Cloud Vertex AI, and NVIDIA NIM microservices.
Additional production-grade systems that support AWQ include Hugging Face Text Generation Inference (TGI), LMDeploy (from Shanghai AI Lab), Intel Neural Compressor, AMD Quark, and FastChat. The llama.cpp project uses GGUF-format quantization with its own Q4_K and Q5_K algorithms rather than AWQ, but the practical effect is similar: 4-bit compressed LLMs runnable on consumer hardware. Users choosing between AWQ and GGUF generally prefer AWQ for NVIDIA GPU inference (better kernel support, Hugging Face integration) and GGUF for CPU inference or mixed CPU/GPU execution.
The most prominent use case highlighted in the AWQ paper is edge deployment. Running a 13B-parameter model at FP16 requires roughly 26 GB of memory, which exceeds the capacity of virtually all consumer GPUs and all laptop GPUs. AWQ compresses the same model to about 7 GB, fitting on a consumer RTX 4070 Ti (12 GB) or even some high-end gaming laptops. On the NVIDIA Jetson Orin with 16 GB of unified memory, TinyChat runs Llama-2-13B at 17 tokens per second with AWQ, where FP16 cannot run the model at all.
This opens a range of applications: private on-device assistants that process sensitive data without sending it to the cloud, low-latency code completion tools that avoid network round-trips, local document analysis for enterprises with data sovereignty requirements, and offline inference on devices without reliable internet access.
For API providers and enterprise ML teams, AWQ's memory reduction directly reduces the GPU count needed to serve a given model. A 70B model requiring four A100 80GB GPUs at FP16 can run on a single A100 80GB with AWQ. This represents a 4x reduction in GPU cost for that workload. The freed-up VRAM also allows batching more concurrent requests, further improving per-GPU throughput.
For organizations running inference at scale, the combination of reduced GPU count and higher utilization per GPU translates to meaningful cost savings, which is why AWQ adoption spread rapidly through production serving stacks.
AWQ's fast quantization time (10 to 30 minutes for a 7B model) makes it practical to release quantized checkpoints alongside base models at or near launch. GPTQ's multi-hour quantization made same-day quantized releases difficult for many model releases. The AWQ workflow fits into a typical CI/CD pipeline: after a new model is trained and evaluated, quantized AWQ versions at different sizes can be produced and uploaded to the Hugging Face Hub within the same release cycle.
The TinyChat framework demonstrated AWQ on visual language models: LLaVA-13B, OpenFlamingo-9B, and later VILA and NVILA models. AWQ's approach generalizes to multimodal transformers because the activation-based salience criterion applies to any linear layer, regardless of whether the inputs come from text, image patches, or other modalities. The paper reports minimal accuracy loss across eleven vision-language benchmarks for quantized VLMs.
AWQ is a robust and practical method, but it has known limitations worth understanding before deployment.
Calibration data is required. While AWQ uses far fewer samples than GPTQ and is less sensitive to domain mismatch, it is not a zero-shot method like RTN. The calibration run adds a step to the quantization pipeline and requires representative input data. For proprietary models where deployment data is sensitive, obtaining a suitable calibration corpus requires care.
W4A16 does not exploit INT8 tensor cores. Because activations remain in FP16, the critical matrix multiplications are still FP16 operations (after on-the-fly INT4 dequantization). The speedup comes from reduced DRAM traffic, not from faster arithmetic. On hardware where the bottleneck is compute rather than memory bandwidth (large batches, long context prefill on high-bandwidth accelerators), AWQ provides little benefit over FP16.
Accuracy degrades below 4 bits. AWQ is reliable at 4-bit but degrades significantly at 3-bit and is essentially unusable at 2-bit for most tasks. Post-training quantization to 2-3 bits without major quality loss requires more sophisticated methods such as AQLM or QuIP# that use vector quantization codebooks.
Non-linear layers and KV cache are not compressed. AWQ quantizes linear (fully connected) layers only. Embedding tables, layer normalizations, attention softmax, and particularly the KV cache remain in FP16. For very long context windows, the KV cache can dominate memory usage, leaving AWQ's compression largely irrelevant for that component. KV cache quantization (INT8 or FP8 keys and values) is a separate technique addressing this.
The channel-scaling heuristic assumes stable outlier channels. AWQ works because the same input channels consistently carry large activations across diverse inputs for standard transformer architectures. Some model families or unusual architectures may have more input-dependent outlier patterns, which could reduce the effectiveness of the fixed per-channel scale. In practice this has not been a significant problem for the major open-weight model families (Llama, Mistral, Qwen, Falcon, OPT), but it is a theoretical limitation worth noting for novel architectures.
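To put the KV cache limitation noted above in numbers, the following back-of-the-envelope sketch compares an FP16 KV cache against AWQ-compressed weights. The model shape (32 layers, 32 key/value heads of dimension 128) is an assumption, roughly matching a 7B Llama-style model without grouped-query attention.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Keys and values for every layer, head, and cached token (FP16 default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Weight footprint from the W4A16g128 arithmetic: 68 bytes per 128 weights.
awq_weights_gb = 7e9 / 128 * 68 / 1e9

per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
print(f"KV cache per token: {per_token} bytes")
print(f"AWQ weights: about {awq_weights_gb:.1f} GB")
for seq_len in (2_048, 32_768):
    gb = kv_cache_bytes(32, 32, 128, seq_len) / 1e9
    print(f"context {seq_len:>6}: KV cache about {gb:.1f} GB")
```

Under these assumptions the cache for a single long-context request already exceeds the compressed weights, which is why KV cache quantization is usually treated as a separate optimization.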
Marlin is a high-performance CUDA kernel for FP16 x INT4 matrix multiplication, developed by IST Austria (IST-DASLab). It achieves close-to-ideal memory bandwidth utilization for INT4 weights through asynchronous prefetching from DRAM, a two-level tile structure that hides memory latency, and careful register allocation to maximize throughput. When combined with AWQ-quantized weights, the result (sometimes called Marlin AWQ) achieves approximately 741 tokens per second output throughput on an A10G GPU for Llama-2-7B, versus 68 tokens per second for the original AWQ kernels. Marlin AWQ is now the default kernel for AWQ models in vLLM.
AutoAWQ extended the original HAN Lab AWQ implementation to support a broader range of model architectures and simplified the end-to-end quantization workflow. It produced the majority of community AWQ model checkpoints on the Hugging Face Hub from late 2023 through 2024. AutoAWQ was deprecated in 2025 as Transformers and llm-compressor absorbed its capabilities.
OmniQuant (2023) extends AWQ's equivalent transformation idea by jointly learning optimal clipping bounds and transformation parameters through gradient-based optimization, rather than the grid search AWQ uses. This produces better accuracy than AWQ on some configurations, particularly for W4A4 (quantizing both weights and activations to 4 bits), at the cost of a more expensive calibration procedure.
QuaRot (2024) and SpinQuant (2024) apply random rotation matrices to transformer activations to eliminate the outlier channels that make activation quantization difficult. By spreading the dynamic range of activations more evenly across channels, these methods enable W4A4 quantization with better accuracy than AWQ at low bit widths. These approaches are more computationally demanding to calibrate but enable fully integer inference.
For extreme compression below 4 bits, methods like AQLM (Additive Quantization for Large Language Models) and QuIP# use learned vector quantization codebooks rather than scalar integer grids. A group of weights (say, 16 values) is encoded as a single codebook entry, achieving average bit widths of 2 bits or fewer. These methods maintain usable accuracy in regimes where AWQ and GPTQ degrade substantially, at the cost of more complex quantization and inference kernels.
SmoothQuant (Xiao et al., 2022) addresses the challenge of activation quantization by migrating quantization difficulty from activations to weights through per-channel scaling. AWQ and SmoothQuant share the insight that scaling a weight channel while applying the inverse scale to the matching activation channel can improve quantization quality. AWQ uses this to protect salient weights while quantizing only the weights (W4A16), whereas SmoothQuant uses it to make both weights and activations quantizable (W8A8). The two methods are complementary and can be applied together.