Model compression refers to a family of techniques designed to reduce the size, memory footprint, and computational cost of machine learning models while preserving as much of their original performance as possible. As large language models have grown from millions to hundreds of billions of parameters, model compression has become essential for deploying these systems in production environments where hardware resources, latency requirements, and energy budgets impose hard constraints.
The core motivation is straightforward: a 70-billion-parameter model stored in FP16 requires roughly 140 GB of GPU memory just to load the weights, far exceeding the capacity of a single consumer GPU. Compression techniques make it possible to run such models on smaller hardware, reduce inference costs, and enable on-device deployment for mobile and edge applications [1].
The field traces its modern roots to the 2015 paper "Deep Compression" by Han, Mao, and Dally, which combined pruning, trained quantization, and Huffman coding to shrink AlexNet by 35x (from 240 MB to 6.9 MB) without losing accuracy. That pipeline established the template most modern compression stacks still follow: cut redundant connections first, then reduce numerical precision, and finally apply lossless coding where possible [2]. A decade later, the same principles drive a vast ecosystem of compression methods that make 70-billion and 400-billion-parameter models runnable on hardware ordinary developers actually own.
The scale problem has gotten dramatic. A modern frontier LLM like Llama 3.1 405B contains roughly 810 GB of FP16 weights, which exceeds the combined memory of eight H100 GPUs. Smaller models are not immune: even a 7-billion-parameter chat model occupies about 14 GB at FP16, locking out laptops and most phones. Compression closes the gap on four axes at once:
| Axis | What compression buys | Why it matters |
|---|---|---|
| Memory | 2x to 16x smaller weights | Fits larger models on smaller hardware; lets a single GPU hold more concurrent users |
| Latency | 1.5x to 5x faster inference | Better user experience, lower per-token cost in serving |
| Energy | Proportional to memory bandwidth | Battery life on edge devices; lower data-center power draw |
| Cost | Smaller GPU SKUs become viable | Running Llama 3 70B on a single 48 GB card instead of two 80 GB cards |
These gains compound in production. A serving cluster that compresses its weights by 4x can either serve 4x more users on the same hardware or move down to cheaper GPUs and keep throughput flat. For edge AI and on-device deployment, compression is not a nice-to-have; it is the only path to running modern models at all.
Model compression encompasses several families of techniques. Each operates on a different principle, and they can often be combined for greater effect.
| Approach | Principle | Typical Compression | Retraining Required? | Key Trade-off |
|---|---|---|---|---|
| Quantization | Reduce numerical precision of weights and activations | 2x to 4x memory reduction (up to 16x at 2-bit) | Sometimes (QAT) or No (PTQ) | Precision loss at very low bit widths |
| Pruning | Remove redundant or low-importance weights or structures | 50% to 90% sparsity | Often beneficial | Irregular sparsity may not yield hardware speedups |
| Knowledge distillation | Train a smaller student model to mimic a larger teacher | Variable (architecture-dependent) | Yes (full training of student) | Student capacity limits performance ceiling |
| Low-rank factorization | Decompose weight matrices into products of smaller matrices | 2x to 5x parameter reduction | Usually fine-tuning | Approximation error accumulates across layers |
| Weight sharing | Reuse parameters across layers or blocks | Variable; ALBERT cuts 18x | Yes (built into pretraining) | Limits expressive capacity per layer |
| Architectural compression | Cheaper attention, MoE, speculative decoding | Indirect (compute, not weights) | Often needs training | Engineering complexity |
| Compilation and kernel optimization | Fuse ops, reuse memory, optimize for target hardware | Compute and memory savings without weight changes | None | Hardware-specific tooling |
Two notes on this taxonomy. First, the boundaries blur in practice; a real production stack might quantize, prune, and apply grouped query attention all at once. Second, compression is not free. Every method trades some quality for some efficiency, and the right point on that curve depends on the deployment.
Quantization reduces the numerical precision of model weights and, optionally, activations. A model originally stored in 32-bit floating point (FP32) or 16-bit floating point (FP16/BF16) is converted to lower-precision representations such as INT8, INT4, or FP8. Because each parameter occupies fewer bits, the model's memory footprint shrinks proportionally, and lower-precision arithmetic can execute faster on hardware that supports it [3].
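A minimal sketch of this idea, using symmetric per-tensor absmax scaling to INT8 (real libraries add per-channel or per-group scales and outlier handling; the random tensor here stands in for an actual weight matrix):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor INT8 quantization via absmax scaling."""
    scale = w.abs().max() / 127.0                         # map the largest magnitude to 127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                               # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("bytes per weight: 2 (FP16) -> 1 (INT8)")           # 2x smaller in memory
print(f"mean abs error:   {(w - w_hat).abs().mean():.5f}")
```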
Quantization has become the dominant compression technique for LLMs because it is cheap to apply, requires no retraining in the most common form, and combines well with everything else. The full topic gets its own treatment in quantization; this section covers the parts most relevant to compression as a whole.
The two main paradigms for quantization differ in when the precision reduction happens:
| Aspect | Post-Training Quantization (PTQ) | Quantization-Aware Training (QAT) |
|---|---|---|
| When applied | After training is complete | During training |
| Calibration data | Small calibration set (typically 128 to 1024 samples) | Full training dataset |
| Computational cost | Low (minutes to hours) | High (full training run) |
| Accuracy at INT8 | Near-lossless for most models | Near-lossless |
| Accuracy at INT4 | Noticeable degradation without advanced methods | Better preservation at very low bit widths |
| Use case | Quick deployment, large models where retraining is impractical | When maximum accuracy at low precision is required |
PTQ has become the dominant approach for LLM quantization because retraining models with hundreds of billions of parameters is prohibitively expensive. QAT remains valuable for smaller models or when the deployment target demands extreme compression, like 2-bit or 3-bit weights [4].
INT8 quantization maps weights to 8-bit integers, cutting memory in half compared to FP16. Most modern models tolerate INT8 quantization with negligible accuracy loss, making it the safest starting point.
INT4 quantization pushes further to 4-bit integers, achieving a 4x reduction from FP16. At this precision, naive round-to-nearest quantization causes significant degradation, so advanced methods like GPTQ and AWQ are required to maintain quality.
FP8 (8-bit floating point) is a newer format supported by NVIDIA's Hopper and Blackwell GPU architectures. FP8 retains a small exponent, giving it better dynamic range than INT8, which makes it particularly effective for activations that span wide value ranges. NF4 (4-bit NormalFloat), introduced by Tim Dettmers in QLoRA, is a non-uniform 4-bit format optimized for the Gaussian-like distribution of trained weights [5].
Several methods have emerged specifically for quantizing large language models. Each makes a different trade between calibration cost, quantization quality, and the bit widths it can handle.
| Method | Year | Approach | Key Innovation |
|---|---|---|---|
| LLM.int8() | 2022 | Mixed-precision INT8 with outlier handling | Identifies outlier feature dimensions and keeps them in FP16; pure INT8 for the rest. Underlies the bitsandbytes library [6] |
| GPTQ | 2022 | Layer-wise PTQ using approximate second-order information | Solves a reconstruction problem per layer to minimize output error; supports INT4 and INT3; quantizes 175B in 4 hours [7] |
| SmoothQuant | 2022 | Migrates quantization difficulty from activations to weights | Applies mathematically equivalent per-channel scaling to smooth activation outliers; enables W8A8 [8] |
| AWQ | 2023 | PTQ that protects salient weight channels | Identifies the ~1% of weights most important for accuracy (based on activation magnitudes) and scales them; faster calibration than GPTQ; MLSys 2024 best paper [9] |
| QLoRA / NF4 | 2023 | 4-bit base model with low-rank trainable adapters | NormalFloat 4 data type and double quantization; finetunes 65B on a single 48 GB GPU [5] |
| GGUF | 2023 | File format and quantization scheme for CPU/GPU inference | Designed for llama.cpp; supports mixed-precision quantization (e.g., Q4_K_M, Q5_K_S) with per-block scaling [10] |
| BitNet b1.58 | 2024 | Native 1.58-bit ternary {-1, 0, 1} weights from pretraining | Replaces multiplications with additions; matches FP16 LLaMA at 3B+ scale with 3.55x less memory and 2.71x faster inference [11] |
| QuIP# | 2024 | Incoherence processing with lattice codebooks | Achieves 2-bit quantization with reasonable quality through E8 lattice codebooks |
GPTQ works by solving a layer-wise reconstruction problem: for each layer, it finds the quantized weight matrix that minimizes the squared error between the original and quantized layer outputs on a small calibration set. The algorithm quantizes weights one column at a time and uses approximate inverse-Hessian information to update the not-yet-quantized weights, compensating for the error each step introduces. End-to-end inference speedups over FP16 reach about 3.25x on A100 and 4.5x on A6000 [7].
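The objective can be made concrete with a toy sketch that measures the layer-output error of a naively quantized weight matrix on calibration activations. This shows only the reconstruction objective, not the GPTQ solver itself; shapes and names are illustrative.

```python
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Naive round-to-nearest quantization, symmetric per output channel."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

def layer_reconstruction_error(w, w_q, calib_x):
    """|| X W^T - X W_q^T ||^2, the per-layer objective GPTQ minimizes."""
    return ((calib_x @ w.T - calib_x @ w_q.T) ** 2).mean()

w = torch.randn(4096, 4096)          # stand-in layer weight (out_features x in_features)
calib_x = torch.randn(512, 4096)     # stand-in calibration activations
w_rtn = rtn_quantize(w, bits=4)
print("RTN reconstruction error:", layer_reconstruction_error(w, w_rtn, calib_x).item())
# GPTQ searches for a quantized matrix with lower error than plain RTN on the same calibration set.
```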
AWQ takes a different philosophy. Rather than treating all weights equally, it recognizes that a small fraction of weight channels (roughly 1%) disproportionately affect model accuracy. AWQ identifies these salient channels by examining activation magnitudes and applies per-channel scaling to protect them during quantization. This approach requires less calibration data than GPTQ and often produces faster quantization [9].
The GGUF format, used by llama.cpp and its ecosystem, supports a variety of quantization levels denoted by names like Q4_K_M (4-bit with medium-sized K-quant blocks) or Q8_0 (8-bit with simple block quantization). The K-quant variants use non-uniform quantization with different bit widths for different parts of the model based on their sensitivity, achieving better quality than uniform quantization at the same average bit width [10].
BitNet b1.58 is a more radical departure. Instead of quantizing a trained model, it pretrains directly with weights restricted to {-1, 0, +1}. The 1.58 figure is log2(3), the number of bits a three-valued weight carries. Because every weight is -1, 0, or +1, matrix multiplication reduces almost entirely to addition, removing the most expensive arithmetic from inference. The 2024 paper showed that BitNet b1.58 matches full-precision LLaMA at the 3B parameter scale on perplexity and downstream tasks while running 2.71x faster and using 3.55x less GPU memory; Microsoft followed up in 2025 with an open-weights 2B model trained on 4 trillion tokens [11].
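The quantization function described in the BitNet b1.58 paper scales each weight matrix by its mean absolute value, rounds, and clips to {-1, 0, +1}; a minimal sketch of that absmean step (names are illustrative):

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} using absmean scaling (BitNet b1.58 style)."""
    gamma = w.abs().mean()                                  # one scale per matrix
    w_ternary = torch.clamp(torch.round(w / (gamma + eps)), -1, 1)
    return w_ternary, gamma

w = torch.randn(1024, 1024)
w_t, gamma = absmean_ternary(w)
# The dequantized estimate is w_t * gamma; matmuls against w_t need only additions and sign flips.
print(torch.unique(w_t))   # tensor([-1., 0., 1.])
```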
A worked example clarifies what these numbers buy. Llama 3 70B at FP16 needs roughly 140 GB to hold the weights alone, requiring at least two 80 GB H100s or A100s. Quantize the same model to INT4 with AWQ and the weights drop to about 35 GB, which fits on a single 48 GB L40S, an A6000, or two consumer-grade RTX 4090s through tensor parallelism. Inference latency typically improves by 2x to 3x at the same time, since memory bandwidth is the dominant bottleneck for autoregressive decoding. The KV cache adds a few gigabytes more depending on context length, so practical deployments leave headroom of 5 GB to 10 GB on top of the weight footprint [12].
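The arithmetic is easy to script; a sketch of the weight-memory estimate at different bit widths (the KV cache and activation headroom mentioned above come on top of these figures):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters * bits / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("ternary (1.58-bit)", 1.58)]:
    print(f"Llama 3 70B weights at {name}: {weight_gb(70, bits):.0f} GB")
# FP16: 140 GB, INT8: 70 GB, INT4: 35 GB, 1.58-bit: ~14 GB
```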
Pruning removes parameters from a trained model that contribute little to its output. The intuition is that neural networks are typically over-parameterized, and a significant fraction of weights can be zeroed out without meaningful performance loss.
Unstructured pruning removes individual weights anywhere in the network, producing sparse weight matrices. While this can achieve very high sparsity levels (90%+), the resulting irregular sparsity patterns are difficult to accelerate on standard GPU hardware, which is optimized for dense matrix operations.
Structured pruning removes entire neurons, attention heads, or layers. The resulting model is a smaller but architecturally standard dense network that runs efficiently on existing hardware without specialized sparse computation libraries.
| Pruning Type | Granularity | Hardware Friendliness | Achievable Sparsity | Practical Speedup |
|---|---|---|---|---|
| Unstructured | Individual weights | Low (requires sparse kernels) | 80% to 95% | Limited without specialized hardware |
| Semi-structured (2:4) | 2 of every 4 weights | High (NVIDIA Ampere+ support) | 50% exactly | ~2x on supported hardware |
| Structured (neurons/heads) | Rows, columns, or blocks | High (standard dense operations) | 20% to 60% | Proportional to parameters removed |
Pruning has its own family of methods, each with different trade-offs:
| Method | Year | Pruning Criterion | Notes |
|---|---|---|---|
| Magnitude pruning (Han et al.) | 2015 | Smallest absolute weight values | The original baseline; simple, often surprisingly competitive |
| Movement pruning (Sanh et al.) | 2020 | Movement during fine-tuning | Picks weights that drift toward zero; better for transfer learning than magnitude alone |
| SparseGPT (Frantar and Alistarh) | 2023 | Layer-wise reconstruction error | First method to prune 175B-class models in one shot; works in under 4.5 hours on OPT-175B [13] |
| Wanda (Sun et al.) | 2023 | Weight magnitude times input activation norm | No retraining, no weight updates, single forward pass; competitive with SparseGPT [14] |
| LLM-Pruner | 2023 | Coupled structure scoring | Structured pruning of LLaMA via dependency graphs |
The simplest pruning criterion is weight magnitude: weights with the smallest absolute values are assumed to be least important and are set to zero. Despite its simplicity, magnitude pruning remains a strong baseline. It can be applied iteratively, with pruning followed by fine-tuning cycles to recover accuracy.
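A minimal sketch of magnitude pruning on a single weight tensor, with an illustrative sparsity target; a full pipeline would apply this per layer and interleave fine-tuning:

```python
import torch

def magnitude_prune_(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude weights in place until `sparsity` of them are gone."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    mask = weight.abs() > threshold
    weight.mul_(mask)                                        # fine-tuning can then recover accuracy
    return weight

w = torch.randn(4096, 4096)
magnitude_prune_(w, sparsity=0.5)
print(f"sparsity: {(w == 0).float().mean():.2%}")
```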
SparseGPT, presented at ICML 2023, was the first method to demonstrate that massive language models (up to 176 billion parameters) could be pruned to 50% sparsity in one shot, without any retraining, at minimal accuracy loss. The method formulates pruning as a layer-wise sparse reconstruction problem and solves it using approximate second-order information, similar in spirit to GPTQ for quantization [13].
SparseGPT can achieve 60% unstructured sparsity with negligible perplexity increase on models like OPT-175B and BLOOM-176B, completing the pruning process in under 4.5 hours. It also supports semi-structured 2:4 and 4:8 sparsity patterns that can leverage hardware acceleration on NVIDIA Ampere and later GPUs.
Wanda (Pruning by Weights and Activations) takes the opposite philosophy from SparseGPT. Where SparseGPT solves a fairly heavy reconstruction problem, Wanda uses a remarkably simple pruning criterion: prune the weights with the smallest values of |W| times ||X||, where ||X|| is the L2 norm, over the calibration tokens, of the input feature that weight multiplies. A single forward pass through the calibration set is enough to compute these norms and identify which weights matter [14].
The insight behind Wanda is that LLMs have emergent activation outliers; a small subset of hidden state features is exceptionally large in magnitude. Multiplying weight magnitude by input activation norm captures the actual contribution each weight makes to the layer output, which turns out to track importance much better than magnitude alone. Wanda matches SparseGPT on LLaMA models while requiring no retraining and no weight updates, just a forward pass and a thresholding step.
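The whole criterion fits in a few lines. A sketch with stand-in tensors, pruning within each output row as the paper does (the calibration activations here are random rather than collected from a real forward pass):

```python
import torch

def wanda_prune_(w: torch.Tensor, calib_x: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Prune by |W_ij| * ||X_j||_2, comparing weights within each output row (Wanda-style)."""
    # calib_x: (num_tokens, in_features); w: (out_features, in_features)
    x_norm = calib_x.norm(p=2, dim=0)                 # L2 norm of each input feature
    score = w.abs() * x_norm                          # broadcasts across output rows
    k = int(w.shape[1] * sparsity)                    # weights to drop per row
    drop = score.argsort(dim=1)[:, :k]                # lowest-scoring columns in each row
    w.scatter_(1, drop, 0.0)
    return w

w = torch.randn(4096, 11008)                          # stand-in MLP projection
calib_x = torch.randn(2048, 11008)                    # stand-in calibration activations
wanda_prune_(w, calib_x, sparsity=0.5)
print(f"sparsity: {(w == 0).float().mean():.2%}")
```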
Research in 2023 and 2024 surfaced an uncomfortable result: at LLM scale, pruning generally hurts more per parameter removed than quantization does per bit removed. Frantar and Alistarh's own follow-up work showed that 50% sparsity costs more perplexity than INT4 quantization, and the quality cost rises steeply beyond 50% sparsity. The community has converged on quantization as the workhorse and pruning as an option to combine with it, especially when targeting the 2:4 semi-structured pattern that maps cleanly to NVIDIA's sparse Tensor Core support.
Knowledge distillation transfers the learned behavior of a large teacher model to a smaller student model. The foundational framework was established by Hinton, Vinyals, and Dean in their 2015 paper "Distilling the Knowledge in a Neural Network" [15].
In standard training, a model learns from hard labels (e.g., "this image is a cat"). In distillation, the student model is trained to match the teacher's full output probability distribution, known as soft labels or soft targets. These soft targets contain richer information than hard labels because they encode the teacher's learned similarities between classes.
For example, if a teacher model assigns probabilities of 0.7 to "cat," 0.2 to "tiger," and 0.05 to "dog," these soft targets reveal that the teacher considers cats more similar to tigers than to dogs. This relational knowledge, which Hinton called dark knowledge, helps the student generalize better than it could from hard labels alone.
A key innovation in Hinton's work is temperature scaling. The softmax function is computed with a temperature parameter T:
p_i = exp(z_i / T) / sum(exp(z_j / T))
Higher temperatures produce softer probability distributions, revealing more about the teacher's internal representations. During distillation, both teacher and student use the same elevated temperature (typically T=2 to T=20), and the student's loss is a weighted combination of the distillation loss (matching the teacher's soft targets) and the standard cross-entropy loss against hard labels.
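A minimal sketch of this combined loss, assuming teacher and student logits for the same batch; the temperature, mixing weight, and vocabulary size are illustrative, and the T-squared factor keeps gradient magnitudes comparable across temperatures as in the original paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of soft-target KL loss (at temperature T) and hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)          # teacher's softened distribution
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)                  # standard hard-label loss
    return alpha * kd + (1 - alpha) * ce

student_logits = torch.randn(8, 32000, requires_grad=True)       # stand-in batch of logits
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```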
Distillation has become a central strategy for creating smaller, deployable versions of large language models. The early NLP examples used encoder-only models like BERT:
| Model | Year | Teacher | Student | Result |
|---|---|---|---|---|
| DistilBERT (Sanh et al.) | 2019 | BERT-base | 6-layer student | 40% smaller, 60% faster, retains 97% of BERT's GLUE performance [16] |
| TinyBERT (Jiao et al.) | 2019 | BERT-base | 4-layer student with embedding distillation | 7.5x smaller, 9.4x faster |
| MobileBERT (Sun et al.) | 2020 | Custom inverted-bottleneck BERT | 25M parameter student | 4.3x smaller, runs on mobile devices |
For decoder-only LLMs, distillation has shifted from logit-matching to data distillation: the teacher generates training examples and the student trains on them. Stanford's Alpaca (2023) used 52,000 instruction-following examples generated by GPT-3.5 to fine-tune LLaMA 7B; UC Berkeley's Vicuna (2023) used ChatGPT conversation transcripts. The Gemma, Phi, and Qwen model families use distillation as part of their training pipelines to create compact yet capable models. The Phi series in particular has shown that careful synthetic-data distillation can give a 3-billion-parameter model performance close to far larger frontier models on certain benchmarks.
A 2025 study on optimal compression ordering found that the sequence of pruning, then distillation, then quantization (P-KD-Q) yields the best balance between compression ratio and preserved capability. Pruning first establishes the structural foundation, distillation recovers lost knowledge, and quantization applies last without interfering with architectural changes [17].
Low-rank factorization decomposes a large weight matrix W of dimensions m by n into the product of two smaller matrices: W is approximately equal to A times B, where A is m by r and B is r by n, with rank r much smaller than both m and n. This reduces the number of parameters from m*n to r*(m+n).
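A sketch of rank-r factorization via truncated SVD shows the mechanics and the parameter accounting; the matrix and rank are illustrative, and the factors would normally be fine-tuned afterwards. Note that a random matrix has little low-rank structure, so the approximation error here is much worse than on trained weights.

```python
import torch

def low_rank_factorize(w: torch.Tensor, r: int):
    """Approximate W (m x n) as A @ B with A: (m x r), B: (r x n) via truncated SVD."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    a = u[:, :r] * s[:r]            # fold singular values into the left factor
    b = vh[:r, :]
    return a, b

m, n, r = 4096, 4096, 256
w = torch.randn(m, n)
a, b = low_rank_factorize(w, r)
print("params before:", m * n, "after:", r * (m + n))     # 16.8M -> 2.1M
print("relative error:", (torch.norm(w - a @ b) / torch.norm(w)).item())
```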
Applied directly as a compression technique on pretrained weights, low-rank factorization is less popular than quantization for LLMs because the approximation error compounds across layers and the achievable compression is modest. The technique sees more use as a building block in adapter methods.
LoRA (Low-Rank Adaptation), introduced by Edward Hu and colleagues at Microsoft in 2021, applies low-rank updates during fine-tuning rather than compressing existing weights. The pretrained weights stay frozen; only two small matrices A (m by r) and B (r by n) are trained, with the effective fine-tuned weight being W + AB. For GPT-3 175B, LoRA reduces trainable parameters by 10,000x and GPU memory by 3x compared to full fine-tuning, while matching or beating full fine-tuning quality on RoBERTa, DeBERTa, GPT-2, and GPT-3 [18].
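A sketch of a LoRA-style linear layer, with illustrative rank and alpha; only the two small factors receive gradients, and the zero-initialized B factor means the adapter starts as a no-op.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update scaled by alpha / r."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # pretrained weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=16, alpha=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,}")
```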
LoRA itself is parameter-efficient fine-tuning rather than model compression in the strict sense, but it spawned a family of methods that combine low-rank ideas with compression:
| Method | Year | Idea | Combines With |
|---|---|---|---|
| LoRA | 2021 | Low-rank adapter matrices added to frozen base | Standard fine-tuning |
| QLoRA | 2023 | LoRA on top of a 4-bit NF4 quantized base; double quantization; paged optimizers | Quantization (4-bit base) [5] |
| DoRA | 2024 | Decomposes weight updates into magnitude and direction; applies LoRA to direction only | LoRA, quantization |
| LongLoRA | 2023 | Sparse local attention plus LoRA for long-context fine-tuning | Sparse attention |
| MoRA | 2024 | High-rank update through square matrix with non-parameter operations | Replaces LoRA |
QLoRA was the breakthrough that made fine-tuning genuinely accessible. By keeping the base model in 4-bit NF4 quantization and only training low-rank adapters on top, QLoRA fits 65-billion-parameter fine-tuning into a single 48 GB GPU. The Guanaco model family produced by the QLoRA paper reached 99.3% of ChatGPT's quality on the Vicuna benchmark using just 24 hours of fine-tuning on one GPU [5]. The technique is now standard in the open-source fine-tuning ecosystem.
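In code, the QLoRA recipe amounts to loading the base model in 4-bit NF4 through bitsandbytes and attaching LoRA adapters; a minimal sketch assuming recent transformers, peft, and bitsandbytes versions, with an illustrative model id and adapter config.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"               # illustrative; any causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                               # NF4 base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,                  # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)           # only the adapters are trainable
model.print_trainable_parameters()
```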
Weight sharing reduces parameter count by reusing the same parameters in multiple places. ALBERT (A Lite BERT, Lan et al. 2019) is the canonical example: it shares all transformer-block parameters across layers, so the 12 layers of a 12-layer ALBERT contribute only a single layer's worth of transformer parameters. ALBERT-Large has 18x fewer parameters than BERT-Large with comparable or better performance on GLUE, SQuAD, and RACE [19]. Universal Transformer (Dehghani et al. 2018) applies the same idea recurrently, using one shared transformer block iteratively rather than stacking distinct ones.
Weight sharing is harder to retrofit onto existing models than quantization or pruning because it requires training from scratch. Modern frontier LLMs generally do not use it, in part because the parameter savings come at the cost of expressive capacity that scaling laws reward.
Some of the largest practical compression wins come not from compressing a fixed architecture but from designing the architecture to need less compute or memory at inference time.
Standard multi-head attention stores separate key and value projections for each attention head, which makes the KV cache one of the dominant memory consumers during long-context generation. Multi-Query Attention (MQA) shares a single KV head across all query heads; Grouped-Query Attention (GQA, Ainslie et al. 2023) is the middle ground, splitting query heads into groups that share KV. GQA preserves nearly the quality of full multi-head attention while shrinking the KV cache by the group factor (often 4x to 8x). Llama 2 70B, Llama 3, and Mistral all use GQA; the smaller cache enables longer context windows on the same hardware [20].
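The saving is easy to quantify. The sketch below compares the KV cache of full multi-head attention against GQA using Llama-2-70B-like shapes (80 layers, 64 query heads, 8 KV heads, head dimension 128); batch size and context length are illustrative.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, batch, bytes_per=2):
    """Two tensors (K and V) per layer, each of shape (batch, kv_heads, context, head_dim)."""
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per

layers, head_dim, context, batch = 80, 128, 8192, 8
mha = kv_cache_bytes(layers, kv_heads=64, head_dim=head_dim, context=context, batch=batch)
gqa = kv_cache_bytes(layers, kv_heads=8,  head_dim=head_dim, context=context, batch=batch)
print(f"MHA KV cache: {mha / 2**30:.1f} GiB")   # 160 GiB at batch 8, 8k context
print(f"GQA KV cache: {gqa / 2**30:.1f} GiB")   # 20 GiB, an 8x reduction
```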
Mixture of Experts (MoE) routes each token to only a subset of expert subnetworks, so a model with hundreds of billions of total parameters can run with the per-token compute of a much smaller dense model. Mixtral 8x7B is a well-known example: 47 billion total parameters but only 13 billion active per token, giving it the inference cost of a 13B model and the quality (on many tasks) closer to a 70B model. MoE is not weight compression in the traditional sense; the weights still exist. It compresses compute, which often matters more in practice.
Speculative decoding (Leviathan et al. 2023) accelerates inference without changing the model output distribution. A small "draft" model proposes several tokens, and the large target model verifies them in parallel; tokens that match are accepted, mismatches restart from the divergence point. The 2023 paper demonstrated 2x to 3x acceleration on T5-XXL with identical sampling distributions, and the technique has since become standard in vLLM, TensorRT-LLM, and other production inference engines [21].
Like MoE, speculative decoding does not shrink weights; it amortizes them by getting more useful work per forward pass through the large model. Combined with quantization, the gains multiply.
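A simplified sketch of one speculative step, using greedy decoding for both models (the published method uses a rejection-sampling rule that preserves the target's sampling distribution; greedy verification is the easy-to-read special case). `draft_model` and `target_model` are assumed to be callables returning next-token logits for every position of a 1-D token tensor.

```python
import torch

def speculative_step(target_model, draft_model, tokens: torch.Tensor, k: int = 4):
    """Draft k tokens greedily, verify with one target forward pass, keep the agreed prefix."""
    draft = tokens.clone()
    for _ in range(k):                                       # k cheap draft passes (no caching here)
        next_tok = draft_model(draft)[-1].argmax()
        draft = torch.cat([draft, next_tok.view(1)])

    target_preds = target_model(draft).argmax(dim=-1)        # one target pass scores all drafts

    accepted = tokens.clone()
    for i in range(len(tokens), len(draft)):
        expected = target_preds[i - 1]                       # target's choice given the prefix before i
        if draft[i] == expected:
            accepted = torch.cat([accepted, draft[i].view(1)])
        else:
            accepted = torch.cat([accepted, expected.view(1)])   # take the target's token and stop
            break
    else:
        accepted = torch.cat([accepted, target_preds[-1].view(1)])  # bonus token when all drafts match
    return accepted
```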
The last layer of compression lives in the inference engine. Even an uncompressed model can run several times faster (or slower) depending on how its operators are scheduled and whether the runtime exploits hardware features.
| Engine | Origin | Strengths |
|---|---|---|
| llama.cpp | Community (Gerganov) | CPU and Apple Silicon; GGUF; runs almost anywhere; quantizations from Q2_K to Q8_0 |
| vLLM | UC Berkeley | PagedAttention, continuous batching, broad quantization support; throughput leader for many workloads |
| TensorRT-LLM | NVIDIA | Tightest NVIDIA hardware integration, FP8 on Hopper/Blackwell, in-flight batching, compiled kernels |
| TGI (Text Generation Inference) | Hugging Face | Production-ready serving with Hugging Face model hub integration |
| SGLang | LMSYS | Structured generation, RadixAttention for prefix caching |
| LMDeploy | InternLM | Throughput-focused; tensor parallelism with fused kernels |
| MLC LLM | CMU / Apache TVM | Cross-platform compilation including iOS, Android, WebGPU |
| ExLlamaV2 | Community | Fast 2-8 bit GPTQ-style quantization on consumer GPUs |
| ONNX Runtime | Microsoft | Cross-framework deployment with quantization, hardware backends |
None of these engines compress models in the literal weight-reduction sense; they make compressed models actually run at the speed their bit width promises. The Marlin kernel inside vLLM, for instance, achieves 741 tokens per second decoding INT4 AWQ models on an A100 [22], which is essentially impossible without hand-tuned mixed-precision GEMM kernels.
A developer compressing a model in 2026 has a deep toolbox:
| Library | What it does |
|---|---|
| bitsandbytes | The original LLM.int8() implementation; 8-bit and 4-bit quantization for PyTorch, drives QLoRA fine-tuning [6] |
| AutoGPTQ | Open-source GPTQ implementation, packs quantized models for inference |
| AutoAWQ | Open-source AWQ implementation, similar role to AutoGPTQ |
| llama.cpp | C/C++ inference with GGUF, the de facto CPU/Apple Silicon path |
| Optimum (Hugging Face) | Wraps multiple quantization backends with a unified API |
| NVIDIA TensorRT Model Optimizer | Unified library for quantization, pruning, distillation, and speculative decoding |
| Intel Neural Compressor | CPU-focused compression with PTQ, QAT, pruning, distillation |
| Microsoft DeepSpeed-Inference | Compression with model-parallel inference |
| Microsoft BitNet inference | Reference 1-bit and 1.58-bit inference framework |
For most practical LLM deployments, the workflow is: pick a base model, choose AWQ or GPTQ for GPU serving (or GGUF for CPU/Apple Silicon), quantize once with a small calibration set, deploy with vLLM or TensorRT-LLM. The whole pipeline takes hours, not days, and the resulting model usually loses less than a point of MMLU accuracy.
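As a concrete instance of that workflow, a sketch following AutoAWQ's documented usage pattern; the model id, output path, and config values are illustrative, and exact arguments can vary between AutoAWQ versions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"            # illustrative base model
quant_path = "mistral-7b-instruct-awq"                        # where to write the quantized weights
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)          # runs calibration internally
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
# The saved checkpoint can then be served with an AWQ-aware engine such as vLLM.
```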
Reported quality losses from compression are surprisingly small at moderate bit widths but degrade quickly at extremes. The numbers below are typical patterns from public benchmarks; exact figures vary by model and method.
| Method | Compression | Typical perplexity hit | Typical MMLU hit | When it breaks |
|---|---|---|---|---|
| FP16 baseline | 1x | 0 | 0 | n/a |
| INT8 (LLM.int8, SmoothQuant) | 2x | <0.1 | <0.5 pts | Almost never |
| FP8 (Hopper/Blackwell) | 2x | <0.1 | <0.5 pts | Reasoning at long context |
| INT4 GPTQ | 4x | 0.1 to 0.5 | 1 to 2 pts | Aggressive group sizes |
| INT4 AWQ | 4x | 0.1 to 0.4 | 1 to 2 pts | Multi-step reasoning |
| GGUF Q4_K_M | ~3.5x | 0.1 to 0.3 | 1 to 2 pts | Smaller models suffer more |
| GGUF Q2_K | ~6x | 0.5 to 1.5 | 5+ pts | General degradation |
| 2:4 sparsity (SparseGPT) | ~1.6x effective | 0.2 to 0.6 | 1 to 3 pts | At LLM scale, often worse than INT4 |
| BitNet b1.58 (native) | ~10x | 0 (matches FP16 at 3B+) | matches | Requires pretraining from scratch |
Accuracy on reasoning-heavy benchmarks like GSM8K and HumanEval tends to drop more than on general knowledge benchmarks like MMLU when models are aggressively quantized. Long-context tasks are also more sensitive to KV cache quantization than weight quantization.
Compression is not free, and several patterns emerge once methods get pushed.
Quantization degrades reasoning faster than it degrades surface fluency. A 4-bit model often sounds nearly identical to its FP16 counterpart on conversational tasks but stumbles on multi-step math or code. The reason appears to be that reasoning chains accumulate quantization error across many forward passes.
Pruning hurts more than quantization at LLM scale. The 2023 SparseGPT paper itself noted that 50% sparsity loses more perplexity than INT4 quantization on the same model. This is part of why pruning has not displaced quantization in production.
Distillation loses long-tail knowledge. A student model trained on the teacher's outputs picks up the common patterns easily but misses rare facts the teacher learned from training data the student never sees. This shows up as worse performance on niche topics and reduced ability to recall specific entities.
Compression interacts with calibration data quality. PTQ methods like GPTQ and AWQ depend on a few hundred to a few thousand calibration samples. Using calibration data that does not match the deployment distribution (English-only calibration for a multilingual model, or news text for a coding model) can cause noticeable quality regressions that look like the method failed.
Hardware support changes the math. INT4 looks like a 4x compression on paper, but if the GPU has no INT4 Tensor Cores, the actual speedup is closer to memory-bandwidth-limited 2x. Conversely, FP8 on Blackwell or 2:4 sparsity on Ampere can give nearly the full theoretical speedup because the silicon was designed for them.
KV cache compression is its own problem. Weight quantization barely touches the KV cache, which dominates memory at long context. Methods like AWQ-KV, KIVI, and FP8 KV cache are emerging specifically for this, but the trade-offs are different from weight quantization and the field is still working out the right defaults.
The choice of compression method depends on the deployment scenario:
For quick deployment on consumer GPUs, start with quantization using GGUF (for CPU/Apple Silicon) or AWQ/GPTQ (for NVIDIA GPUs). INT4 quantization with AWQ or GPTQ is the most popular approach for running 70B-parameter models on hardware with 24 GB to 48 GB of VRAM.
For maximum throughput in production serving, combine INT4/FP8 quantization with optimized inference engines like vLLM or TensorRT. Add speculative decoding when the workload is latency-bound rather than throughput-bound. Marlin-AWQ kernels achieve some of the highest throughput, reaching 741 tokens per second in benchmarks [22].
For creating a permanently smaller model, use structured pruning followed by distillation. NVIDIA's approach of pruning a large model (e.g., Llama 3.1 8B to 4B parameters) followed by distillation from the original model has shown strong results, with pruned models achieving 30% speed improvements while retaining most of the original quality [23].
For edge and mobile deployment, combine aggressive quantization (INT4 or lower) with GGUF and llama.cpp, or use MLC LLM for cross-platform compilation including WebGPU. The GGUF format with llama.cpp enables running quantized models on CPUs and Apple Silicon devices without a discrete GPU.
For maintaining near-original quality, use INT8 quantization or FP8 on supported hardware. These higher-precision quantization formats typically cause less than 1% accuracy degradation.
For building a small model from scratch with extreme efficiency, native low-bit pretraining like BitNet b1.58 is the frontier option, though it requires training the model from the ground up rather than retrofitting an existing one.
As of early 2026, model compression is a rapidly evolving field driven by the practical necessity of deploying increasingly large models. Several trends define the current landscape.
The llama.cpp ecosystem and GGUF format have democratized model compression, making it possible for individuals to run quantized versions of state-of-the-art models on personal hardware. Community members routinely quantize new open-weight models within hours of their release, and a healthy collection of pre-quantized models is available on Hugging Face for nearly every popular open-weight release.
Hardware is adapting to compression. NVIDIA's Blackwell architecture includes native FP8 and FP4 support, and the 2:4 structured sparsity pattern supported since Ampere provides a clean path from pruning to hardware acceleration. Apple's M-series chips have unified memory architectures that benefit dramatically from quantized models.
Research continues toward more extreme compression. Methods like QuIP# and AQLM push toward 2-bit quantization, and 1-bit and 1.58-bit models are an active research frontier. BitNet b1.58 in particular has demonstrated that native low-bit training is viable, though it currently requires pretraining from scratch rather than retrofitting existing models.
KV cache compression is becoming as important as weight compression for long-context serving. As context windows reach 1 million tokens or more, the KV cache can dwarf the model weights in memory cost, and techniques specifically targeting it are receiving heavy attention.
The combination of multiple compression techniques, applied in the optimal order of pruning, distillation, then quantization, represents the current best practice for maximum compression with preserved capability [17]. Tools like NVIDIA's TensorRT Model Optimizer provide unified libraries that support quantization, pruning, distillation, and speculative decoding in a single framework.
The broader trajectory points the same direction: as models get bigger, the gap between the resources needed to train them and the resources available to deploy them widens, and compression closes that gap. Every major open-weight release in 2025 came with quantized variants on day one, often before the FP16 weights had finished syncing across mirrors. That is the new normal.