Model compression refers to a family of techniques designed to reduce the size, memory footprint, and computational cost of machine learning models while preserving as much of their original performance as possible. As large language models have grown from millions to hundreds of billions of parameters, model compression has become essential for deploying these systems in production environments where hardware resources, latency requirements, and energy budgets impose hard constraints.
The core motivation is straightforward: a 70-billion-parameter model stored in FP16 requires roughly 140 GB of GPU memory just to load the weights, far exceeding the capacity of a single consumer GPU. Compression techniques make it possible to run such models on smaller hardware, reduce inference costs, and enable on-device deployment for mobile and edge applications [1].
The field traces its modern roots to the 2015 paper "Deep Compression" by Han, Mao, and Dally, which combined pruning, trained quantization, and Huffman coding to shrink AlexNet by 35x (from 240 MB to 6.9 MB) without losing accuracy. That pipeline established the template most modern compression stacks still follow: cut redundant connections first, then reduce numerical precision, and finally apply lossless coding where possible [2]. A decade later, the same principles drive a vast ecosystem of compression methods that make 70-billion and 400-billion-parameter models runnable on hardware ordinary developers actually own.
The scale problem has gotten dramatic. A modern frontier LLM like Llama 3.1 405B contains roughly 810 GB of FP16 weights, which exceeds the combined memory of eight H100 GPUs. Smaller models are not immune: even a 7-billion-parameter chat model occupies about 14 GB at FP16, locking out laptops and most phones. Compression closes the gap on four axes at once:
| Axis | What compression buys | Why it matters |
|---|---|---|
| Memory | 2x to 16x smaller weights | Fits larger models on smaller hardware; lets a single GPU hold more concurrent users |
| Latency | 1.5x to 5x faster inference | Better user experience, lower per-token cost in serving |
| Energy | Proportional to memory bandwidth | Battery life on edge devices; lower data-center power draw |
| Cost | Smaller GPU SKUs become viable | Running Llama 3 70B on a single 48 GB card instead of two 80 GB cards |
These gains compound in production. A serving cluster that compresses its weights by 4x can either serve 4x more users on the same hardware or move down to cheaper GPUs and keep throughput flat. For edge AI and on-device deployment, compression is not a nice-to-have; it is the only path to running modern models at all.
Model compression encompasses several families of techniques. Each operates on a different principle, and they can often be combined for greater effect.
| Approach | Principle | Typical Compression | Retraining Required? | Key Trade-off |
|---|---|---|---|---|
| Quantization | Reduce numerical precision of weights and activations | 2x to 4x memory reduction (up to 16x at 2-bit) | Sometimes (QAT) or No (PTQ) | Precision loss at very low bit widths |
| Pruning | Remove redundant or low-importance weights or structures | 50% to 90% sparsity | Often beneficial | Irregular sparsity may not yield hardware speedups |
| Knowledge distillation | Train a smaller student model to mimic a larger teacher | Variable (architecture-dependent) | Yes (full training of student) | Student capacity limits performance ceiling |
| Low-rank factorization | Decompose weight matrices into products of smaller matrices | 2x to 5x parameter reduction | Usually fine-tuning | Approximation error accumulates across layers |
| Weight sharing | Reuse parameters across layers or blocks | Variable; ALBERT cuts 18x | Yes (built into pretraining) | Limits expressive capacity per layer |
| Architectural compression | Cheaper attention, MoE, speculative decoding | Indirect (compute, not weights) | Often needs training | Engineering complexity |
| Compilation and kernel optimization | Fuse ops, reuse memory, optimize for target hardware | Compute and memory savings without weight changes | None | Hardware-specific tooling |
Two notes on this taxonomy. First, the boundaries blur in practice; a real production stack might quantize, prune, and apply grouped query attention all at once. Second, compression is not free. Every method trades some quality for some efficiency, and the right point on that curve depends on the deployment.
Quantization reduces the numerical precision of model weights and, optionally, activations. A model originally stored in 32-bit floating point (FP32) or 16-bit floating point (FP16/BF16) is converted to lower-precision representations such as INT8, INT4, or FP8. Because each parameter occupies fewer bits, the model's memory footprint shrinks proportionally, and lower-precision arithmetic can execute faster on hardware that supports it [3].
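A minimal sketch of this idea, using symmetric per-tensor absmax scaling to INT8 (real libraries add per-channel or per-group scales and outlier handling; the random tensor here stands in for an actual weight matrix):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor INT8 quantization via absmax scaling."""
    scale = w.abs().max() / 127.0                         # map the largest magnitude to 127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                               # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("bytes per weight: 2 (FP16) -> 1 (INT8)")           # 2x smaller in memory
print(f"mean abs error:   {(w - w_hat).abs().mean():.5f}")
```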
Quantization has become the dominant compression technique for LLMs because it is cheap to apply, requires no retraining in the most common form, and combines well with everything else. The full topic gets its own treatment in quantization; this section covers the parts most relevant to compression as a whole.
The two main paradigms for quantization differ in when the precision reduction happens:
| Aspect | Post-Training Quantization (PTQ) | Quantization-Aware Training (QAT) |
|---|---|---|
| When applied | After training is complete | During training |
| Calibration data | Small calibration set (typically 128 to 1024 samples) | Full training dataset |
| Computational cost | Low (minutes to hours) | High (full training run) |
| Accuracy at INT8 | Near-lossless for most models | Near-lossless |
| Accuracy at INT4 | Noticeable degradation without advanced methods | Better preservation at very low bit widths |
| Use case | Quick deployment, large models where retraining is impractical | When maximum accuracy at low precision is required |
PTQ has become the dominant approach for LLM quantization because retraining models with hundreds of billions of parameters is prohibitively expensive. QAT remains valuable for smaller models or when the deployment target demands extreme compression, like 2-bit or 3-bit weights [4].
INT8 quantization maps weights to 8-bit integers, cutting memory in half compared to FP16. Most modern models tolerate INT8 quantization with negligible accuracy loss, making it the safest starting point.
INT4 quantization pushes further to 4-bit integers, achieving a 4x reduction from FP16. At this precision, naive round-to-nearest quantization causes significant degradation, so advanced methods like GPTQ and AWQ are required to maintain quality.
FP8 (8-bit floating point) is a newer format supported by NVIDIA's Hopper and Blackwell GPU architectures. FP8 retains a small exponent, giving it better dynamic range than INT8, which makes it particularly effective for activations that span wide value ranges. NF4 (4-bit NormalFloat), introduced by Tim Dettmers in QLoRA, is a non-uniform 4-bit format optimized for the Gaussian-like distribution of trained weights [5].
Several methods have emerged specifically for quantizing large language models. Each makes a different trade between calibration cost, quantization quality, and the bit widths it can handle.
| Method | Year | Approach | Key Innovation |
|---|---|---|---|
| LLM.int8() | 2022 | Mixed-precision INT8 with outlier handling | Identifies outlier feature dimensions and keeps them in FP16; pure INT8 for the rest. Underlies the bitsandbytes library [6] |
| GPTQ | 2022 | Layer-wise PTQ using approximate second-order information | Solves a reconstruction problem per layer to minimize output error; supports INT4 and INT3; quantizes 175B in 4 hours [7] |
| SmoothQuant | 2022 | Migrates quantization difficulty from activations to weights | Applies mathematically equivalent per-channel scaling to smooth activation outliers; enables W8A8 [8] |
| AWQ | 2023 | PTQ that protects salient weight channels | Identifies the ~1% of weights most important for accuracy (based on activation magnitudes) and scales them; faster calibration than GPTQ; MLSys 2024 best paper [9] |
| QLoRA / NF4 | 2023 | 4-bit base model with low-rank trainable adapters | NormalFloat 4 data type and double quantization; finetunes 65B on a single 48 GB GPU [5] |
| GGUF | 2023 | File format and quantization scheme for CPU/GPU inference | Designed for llama.cpp; supports mixed-precision quantization (e.g., Q4_K_M, Q5_K_S) with per-block scaling [10] |
| BitNet b1.58 | 2024 | Native 1.58-bit ternary {-1, 0, 1} weights from pretraining | Replaces multiplications with additions; matches FP16 LLaMA at 3B+ scale with 3.55x less memory and 2.71x faster inference [11] |
| QuIP# | 2024 | Incoherence processing with lattice codebooks | Achieves 2-bit quantization with reasonable quality through E8 lattice codebooks |
GPTQ works by solving a layer-wise reconstruction problem: for each layer, it finds the quantized weight matrix that minimizes the squared error between the original and quantized layer outputs on a small calibration set. The algorithm quantizes weights one column at a time and uses approximate inverse-Hessian information to update the not-yet-quantized weights, compensating for the error each step introduces. End-to-end inference speedups over FP16 reach about 3.25x on A100 and 4.5x on A6000 [7].
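The objective can be made concrete with a toy sketch that measures the layer-output error of a naively quantized weight matrix on calibration activations. This shows only the reconstruction objective, not the GPTQ solver itself; shapes and names are illustrative.

```python
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Naive round-to-nearest quantization, symmetric per output channel."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

def layer_reconstruction_error(w, w_q, calib_x):
    """|| X W^T - X W_q^T ||^2, the per-layer objective GPTQ minimizes."""
    return ((calib_x @ w.T - calib_x @ w_q.T) ** 2).mean()

w = torch.randn(4096, 4096)          # stand-in layer weight (out_features x in_features)
calib_x = torch.randn(512, 4096)     # stand-in calibration activations
w_rtn = rtn_quantize(w, bits=4)
print("RTN reconstruction error:", layer_reconstruction_error(w, w_rtn, calib_x).item())
# GPTQ searches for a quantized matrix with lower error than plain RTN on the same calibration set.
```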
AWQ takes a different philosophy. Rather than treating all weights equally, it recognizes that a small fraction of weight channels (roughly 1%) disproportionately affect model accuracy. AWQ identifies these salient channels by examining activation magnitudes and applies per-channel scaling to protect them during quantization. This approach requires less calibration data than GPTQ and often produces faster quantization [9].
The GGUF format, used by llama.cpp and its ecosystem, supports a variety of quantization levels denoted by names like Q4_K_M (4-bit with medium-sized K-quant blocks) or Q8_0 (8-bit with simple block quantization). The K-quant variants use non-uniform quantization with different bit widths for different parts of the model based on their sensitivity, achieving better quality than uniform quantization at the same average bit width [10].
BitNet b1.58 is a more radical departure. Instead of quantizing a trained model, it pretrains directly with weights restricted to {-1, 0, +1}. The 1.58 figure is log2(3), the number of bits a three-valued weight carries. Because every weight is -1, 0, or +1, matrix multiplication reduces almost entirely to addition, removing the most expensive arithmetic from inference. The 2024 paper showed that BitNet b1.58 matches full-precision LLaMA at the 3B parameter scale on perplexity and downstream tasks while running 2.71x faster and using 3.55x less GPU memory; Microsoft followed up in 2025 with an open-weights 2B model trained on 4 trillion tokens [11].
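The quantization function described in the BitNet b1.58 paper scales each weight matrix by its mean absolute value, rounds, and clips to {-1, 0, +1}; a minimal sketch of that absmean step (names are illustrative):

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} using absmean scaling (BitNet b1.58 style)."""
    gamma = w.abs().mean()                                  # one scale per matrix
    w_ternary = torch.clamp(torch.round(w / (gamma + eps)), -1, 1)
    return w_ternary, gamma

w = torch.randn(1024, 1024)
w_t, gamma = absmean_ternary(w)
# The dequantized estimate is w_t * gamma; matmuls against w_t need only additions and sign flips.
print(torch.unique(w_t))   # tensor([-1., 0., 1.])
```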
A worked example clarifies what these numbers buy. Llama 3 70B at FP16 needs roughly 140 GB to hold the weights alone, requiring at least two 80 GB H100s or A100s. Quantize the same model to INT4 with AWQ and the weights drop to about 35 GB, which fits on a single 48 GB L40S, an A6000, or two consumer-grade RTX 4090s through tensor parallelism. Inference latency typically improves by 2x to 3x at the same time, since memory bandwidth is the dominant bottleneck for autoregressive decoding. The KV cache adds a few gigabytes more depending on context length, so practical deployments leave headroom of 5 GB to 10 GB on top of the weight footprint [12].
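The arithmetic is easy to script; a sketch of the weight-memory estimate at different bit widths (the KV cache and activation headroom mentioned above come on top of these figures):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters * bits / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("ternary (1.58-bit)", 1.58)]:
    print(f"Llama 3 70B weights at {name}: {weight_gb(70, bits):.0f} GB")
# FP16: 140 GB, INT8: 70 GB, INT4: 35 GB, 1.58-bit: ~14 GB
```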
Pruning removes parameters from a trained model that contribute little to its output. The intuition is that neural networks are typically over-parameterized, and a significant fraction of weights can be zeroed out without meaningful performance loss.
Unstructured pruning removes individual weights anywhere in the network, producing sparse weight matrices. While this can achieve very high sparsity levels (90%+), the resulting irregular sparsity patterns are difficult to accelerate on standard GPU hardware, which is optimized for dense matrix operations.
Structured pruning removes entire neurons, attention heads, or layers. The resulting model is a smaller but architecturally standard dense network that runs efficiently on existing hardware without specialized sparse computation libraries.
| Pruning Type | Granularity | Hardware Friendliness | Achievable Sparsity | Practical Speedup |
|---|---|---|---|---|
| Unstructured | Individual weights | Low (requires sparse kernels) | 80% to 95% | Limited without specialized hardware |
| Semi-structured (2:4) | 2 of every 4 weights | High (NVIDIA Ampere+ support) | 50% exactly | ~2x on supported hardware |
| Structured (neurons/heads) | Rows, columns, or blocks | High (standard dense operations) | 20% to 60% | Proportional to parameters removed |
Pruning has its own family of methods, each with different trade-offs:
| Method | Year | Pruning Criterion | Notes |
|---|---|---|---|
| Magnitude pruning (Han et al.) | 2015 | Smallest absolute weight values | The original baseline; simple, often surprisingly competitive |
| Movement pruning (Sanh et al.) | 2020 | Movement during fine-tuning | Picks weights that drift toward zero; better for transfer learning than magnitude alone |
| SparseGPT (Frantar and Alistarh) | 2023 | Layer-wise reconstruction error | First method to prune 175B-class models in one shot; works in under 4.5 hours on OPT-175B [13] |
| Wanda (Sun et al.) | 2023 | Weight magnitude times input activation norm | No retraining, no weight updates, single forward pass; competitive with SparseGPT [14] |
| LLM-Pruner | 2023 | Coupled structure scoring | Structured pruning of LLaMA via dependency graphs |
The simplest pruning criterion is weight magnitude: weights with the smallest absolute values are assumed to be least important and are set to zero. Despite its simplicity, magnitude pruning remains a strong baseline. It can be applied iteratively, with pruning followed by fine-tuning cycles to recover accuracy.
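A minimal sketch of magnitude pruning on a single weight tensor, with an illustrative sparsity target; a full pipeline would apply this per layer and interleave fine-tuning:

```python
import torch

def magnitude_prune_(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude weights in place until `sparsity` of them are gone."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    mask = weight.abs() > threshold
    weight.mul_(mask)                                        # fine-tuning can then recover accuracy
    return weight

w = torch.randn(4096, 4096)
magnitude_prune_(w, sparsity=0.5)
print(f"sparsity: {(w == 0).float().mean():.2%}")
```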
SparseGPT, presented at ICML 2023, was the first method to demonstrate that massive language models (up to 176 billion parameters) could be pruned to 50% sparsity in one shot, without any retraining, at minimal accuracy loss. The method formulates pruning as a layer-wise sparse reconstruction problem and solves it using approximate second-order information, similar in spirit to GPTQ for quantization [13].
SparseGPT can achieve 60% unstructured sparsity with negligible perplexity increase on models like OPT-175B and BLOOM-176B, completing the pruning process in under 4.5 hours. It also supports semi-structured 2:4 and 4:8 sparsity patterns that can leverage hardware acceleration on NVIDIA Ampere and later GPUs.
Wanda (Pruning by Weights and Activations) takes the opposite philosophy from SparseGPT. Where SparseGPT solves a fairly heavy reconstruction problem, Wanda uses a remarkably simple pruning criterion: prune the weights with the smallest values of |W| times ||X||, where ||X|| is the L2 norm, over the calibration tokens, of the input feature that weight multiplies. A single forward pass through the calibration set is enough to compute these norms and identify which weights matter [14].
The insight behind Wanda is that LLMs have emergent activation outliers; a small subset of hidden state features is exceptionally large in magnitude. Multiplying weight magnitude by input activation norm captures the actual contribution each weight makes to the layer output, which turns out to track importance much better than magnitude alone. Wanda matches SparseGPT on LLaMA models while requiring no retraining and no weight updates, just a forward pass and a thresholding step.
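The whole criterion fits in a few lines. A sketch with stand-in tensors, pruning within each output row as the paper does (the calibration activations here are random rather than collected from a real forward pass):

```python
import torch

def wanda_prune_(w: torch.Tensor, calib_x: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Prune by |W_ij| * ||X_j||_2, comparing weights within each output row (Wanda-style)."""
    # calib_x: (num_tokens, in_features); w: (out_features, in_features)
    x_norm = calib_x.norm(p=2, dim=0)                 # L2 norm of each input feature
    score = w.abs() * x_norm                          # broadcasts across output rows
    k = int(w.shape[1] * sparsity)                    # weights to drop per row
    drop = score.argsort(dim=1)[:, :k]                # lowest-scoring columns in each row
    w.scatter_(1, drop, 0.0)
    return w

w = torch.randn(4096, 11008)                          # stand-in MLP projection
calib_x = torch.randn(2048, 11008)                    # stand-in calibration activations
wanda_prune_(w, calib_x, sparsity=0.5)
print(f"sparsity: {(w == 0).float().mean():.2%}")
```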
Research in 2023 and 2024 surfaced an uncomfortable result: at LLM scale, pruning generally hurts more per parameter removed than quantization does per bit removed. Frantar and Alistarh's own follow-up work showed that 50% sparsity costs more perplexity than INT4 quantization, and the quality cost rises steeply beyond 50% sparsity. The community has converged on quantization as the workhorse and pruning as an option to combine with it, especially when targeting the 2:4 semi-structured pattern that maps cleanly to NVIDIA's sparse Tensor Core support.
Knowledge distillation transfers the learned behavior of a large teacher model to a smaller student model. The foundational framework was established by Hinton, Vinyals, and Dean in their 2015 paper "Distilling the Knowledge in a Neural Network" [15].
In standard training, a model learns from hard labels (e.g., "this image is a cat"). In distillation, the student model is trained to match the teacher's full output probability distribution, known as soft labels or soft targets. These soft targets contain richer information than hard labels because they encode the teacher's learned similarities between classes.
For example, if a teacher model assigns probabilities of 0.7 to "cat," 0.2 to "tiger," and 0.05 to "dog," these soft targets reveal that the teacher considers cats more similar to tigers than to dogs. This relational knowledge, which Hinton called dark knowledge, helps the student generalize better than it could from hard labels alone.
A key innovation in Hinton's work is temperature scaling. The softmax function is computed with a temperature parameter T:
p_i = exp(z_i / T) / sum(exp(z_j / T))
Higher temperatures produce softer probability distributions, revealing more about the teacher's internal representations. During distillation, both teacher and student use the same elevated temperature (typically T=2 to T=20), and the student's loss is a weighted combination of the distillation loss (matching the teacher's soft targets) and the standard cross-entropy loss against hard labels.
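A minimal sketch of this combined loss, assuming teacher and student logits for the same batch; the temperature, mixing weight, and vocabulary size are illustrative, and the T-squared factor keeps gradient magnitudes comparable across temperatures as in the original paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of soft-target KL loss (at temperature T) and hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)          # teacher's softened distribution
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)                  # standard hard-label loss
    return alpha * kd + (1 - alpha) * ce

student_logits = torch.randn(8, 32000, requires_grad=True)       # stand-in batch of logits
teacher_logits = torch.randn(8, 32000)
labels = torch.randint(0, 32000, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```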
Distillation has become a central strategy for creating smaller, deployable versions of large language models. The early NLP examples used encoder-only models like BERT:
| Model | Year | Teacher | Student | Result |
|---|---|---|---|---|
| DistilBERT (Sanh et al.) | 2019 | BERT-base | 6-layer student | 40% smaller, 60% faster, retains 97% of BERT's GLUE performance [16] |
| TinyBERT (Jiao et al.) | 2019 | BERT-base | 4-layer student with embedding distillation | 7.5x smaller, 9.4x faster |
| MobileBERT (Sun et al.) | 2020 | Custom inverted-bottleneck BERT | 25M parameter student | 4.3x smaller, runs on mobile devices |
For decoder-only LLMs, distillation has shifted from logit-matching to data distillation: the teacher generates training examples and the student trains on them. Stanford's Alpaca (2023) used 52,000 instruction-following examples generated by GPT-3.5 to fine-tune LLaMA 7B; UC Berkeley's Vicuna (2023) used ChatGPT conversation transcripts. The Gemma, Phi, and Qwen model families use distillation as part of their training pipelines to create compact yet capable models. The Phi series in particular has shown that careful synthetic-data distillation can give a 3-billion-parameter model performance close to far larger frontier models on certain benchmarks.
A 2025 study on optimal compression ordering found that the sequence of pruning, then distillation, then quantization (P-KD-Q) yields the best balance between compression ratio and preserved capability. Pruning first establishes the structural foundation, distillation recovers lost knowledge, and quantization applies last without interfering with architectural changes [17].
Low-rank factorization decomposes a large weight matrix W of dimensions m by n into the product of two smaller matrices: W is approximately equal to A times B, where A is m by r and B is r by n, with rank r much smaller than both m and n. This reduces the number of parameters from m*n to r*(m+n).
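A sketch of rank-r factorization via truncated SVD shows the mechanics and the parameter accounting; the matrix and rank are illustrative, and the factors would normally be fine-tuned afterwards. Note that a random matrix has little low-rank structure, so the approximation error here is much worse than on trained weights.

```python
import torch

def low_rank_factorize(w: torch.Tensor, r: int):
    """Approximate W (m x n) as A @ B with A: (m x r), B: (r x n) via truncated SVD."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    a = u[:, :r] * s[:r]            # fold singular values into the left factor
    b = vh[:r, :]
    return a, b

m, n, r = 4096, 4096, 256
w = torch.randn(m, n)
a, b = low_rank_factorize(w, r)
print("params before:", m * n, "after:", r * (m + n))     # 16.8M -> 2.1M
print("relative error:", (torch.norm(w - a @ b) / torch.norm(w)).item())
```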
Applied directly as a compression technique on pretrained weights, low-rank factorization is less popular than quantization for LLMs because the approximation error compounds across layers and the achievable compression is modest. The technique sees more use as a building block in adapter methods.
LoRA (Low-Rank Adaptation), introduced by Edward Hu and colleagues at Microsoft in 2021, applies low-rank updates during fine-tuning rather than compressing existing weights. The pretrained weights stay frozen; only two small matrices A (m by r) and B (r by n) are trained, with the effective fine-tuned weight being W + AB. For GPT-3 175B, LoRA reduces trainable parameters by 10,000x and GPU memory by 3x compared to full fine-tuning, while matching or beating full fine-tuning quality on RoBERTa, DeBERTa, GPT-2, and GPT-3 [18].
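A sketch of a LoRA-style linear layer, with illustrative rank and alpha; only the two small factors receive gradients, and the zero-initialized B factor means the adapter starts as a no-op.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update scaled by alpha / r."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # pretrained weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=16, alpha=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,}")
```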
LoRA itself is parameter-efficient fine-tuning rather than model compression in the strict sense, but it spawned a family of methods that combine low-rank ideas with compression:
| Method | Year | Idea | Combines With |
|---|---|---|---|
| LoRA | 2021 | Low-rank adapter matrices added to frozen base | Standard fine-tuning |
| QLoRA | 2023 | LoRA on top of a 4-bit NF4 quantized base; double quantization; paged optimizers | Quantization (4-bit base) [5] |
| DoRA | 2024 | Decomposes weight updates into magnitude and direction; applies LoRA to direction only | LoRA, quantization |
| LongLoRA | 2023 | Sparse local attention plus LoRA for long-context fine-tuning | Sparse attention |
| MoRA | 2024 | High-rank update through square matrix with non-parameter operations | Replaces LoRA |
QLoRA was the breakthrough that made fine-tuning genuinely accessible. By keeping the base model in 4-bit NF4 quantization and only training low-rank adapters on top, QLoRA fits 65-billion-parameter fine-tuning into a single 48 GB GPU. The Guanaco model family produced by the QLoRA paper reached 99.3% of ChatGPT's quality on the Vicuna benchmark using just 24 hours of fine-tuning on one GPU [5]. The technique is now standard in the open-source fine-tuning ecosystem.
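In code, the QLoRA recipe amounts to loading the base model in 4-bit NF4 through bitsandbytes and attaching LoRA adapters; a minimal sketch assuming recent transformers, peft, and bitsandbytes versions, with an illustrative model id and adapter config.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"               # illustrative; any causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                               # NF4 base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,                  # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)           # only the adapters are trainable
model.print_trainable_parameters()
```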
Weight sharing reduces parameter count by reusing the same parameters in multiple places. ALBERT (A Lite BERT, Lan et al. 2019) is the canonical example: it shares all transformer-block parameters across layers, so the 12 layers of a 12-layer ALBERT contribute only a single layer's worth of transformer parameters. ALBERT-Large has 18x fewer parameters than BERT-Large with comparable or better performance on GLUE, SQuAD, and RACE [19]. Universal Transformer (Dehghani et al. 2018) applies the same idea recurrently, using one shared transformer block iteratively rather than stacking distinct ones.
Weight sharing is harder to retrofit onto existing models than quantization or pruning because it requires training from scratch. Modern frontier LLMs generally do not use it, in part because the parameter savings come at the cost of expressive capacity that scaling laws reward.
Some of the largest practical compression wins come not from compressing a fixed architecture but from designing the architecture to need less compute or memory at inference time.
Standard multi-head attention stores separate key and value projections for each attention head, which makes the KV cache one of the dominant memory consumers during long-context generation. Multi-Query Attention (MQA) shares a single KV head across all query heads; Grouped-Query Attention (GQA, Ainslie et al. 2023) is the middle ground, splitting query heads into groups that share KV. GQA preserves nearly the quality of full multi-head attention while shrinking the KV cache by the group factor (often 4x to 8x). Llama 2 70B, Llama 3, and Mistral all use GQA; the smaller cache enables longer context windows on the same hardware [20].
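The saving is easy to quantify. The sketch below compares the KV cache of full multi-head attention against GQA using Llama-2-70B-like shapes (80 layers, 64 query heads, 8 KV heads, head dimension 128); batch size and context length are illustrative.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, batch, bytes_per=2):
    """Two tensors (K and V) per layer, each of shape (batch, kv_heads, context, head_dim)."""
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per

layers, head_dim, context, batch = 80, 128, 8192, 8
mha = kv_cache_bytes(layers, kv_heads=64, head_dim=head_dim, context=context, batch=batch)
gqa = kv_cache_bytes(layers, kv_heads=8,  head_dim=head_dim, context=context, batch=batch)
print(f"MHA KV cache: {mha / 2**30:.1f} GiB")   # 160 GiB at batch 8, 8k context
print(f"GQA KV cache: {gqa / 2**30:.1f} GiB")   # 20 GiB, an 8x reduction
```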
Mixture of Experts (MoE) routes each token to only a subset of expert subnetworks, so a model with hundreds of billions of total parameters can run with the per-token compute of a much smaller dense model. Mixtral 8x7B is a well-known example: 47 billion total parameters but only 13 billion active per token, giving it the inference cost of a 13B model and the quality (on many tasks) closer to a 70B model. MoE is not weight compression in the traditional sense; the weights still exist. It compresses compute, which often matters more in practice.
Speculative decoding (Leviathan et al. 2023) accelerates inference without changing the model output distribution. A small "draft" model proposes several tokens, and the large target model verifies them in parallel; tokens that match are accepted, mismatches restart from the divergence point. The 2023 paper demonstrated 2x to 3x acceleration on T5-XXL with identical sampling distributions, and the technique has since become standard in vLLM, TensorRT-LLM, and other production inference engines [21].
Like MoE, speculative decoding does not shrink weights; it amortizes them by getting more useful work per forward pass through the large model. Combined with quantization, the gains multiply.
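A simplified sketch of one speculative step, using greedy decoding for both models (the published method uses a rejection-sampling rule that preserves the target's sampling distribution; greedy verification is the easy-to-read special case). `draft_model` and `target_model` are assumed to be callables returning next-token logits for every position of a 1-D token tensor.

```python
import torch

def speculative_step(target_model, draft_model, tokens: torch.Tensor, k: int = 4):
    """Draft k tokens greedily, verify with one target forward pass, keep the agreed prefix."""
    draft = tokens.clone()
    for _ in range(k):                                       # k cheap draft passes (no caching here)
        next_tok = draft_model(draft)[-1].argmax()
        draft = torch.cat([draft, next_tok.view(1)])

    target_preds = target_model(draft).argmax(dim=-1)        # one target pass scores all drafts

    accepted = tokens.clone()
    for i in range(len(tokens), len(draft)):
        expected = target_preds[i - 1]                       # target's choice given the prefix before i
        if draft[i] == expected:
            accepted = torch.cat([accepted, draft[i].view(1)])
        else:
            accepted = torch.cat([accepted, expected.view(1)])   # take the target's token and stop
            break
    else:
        accepted = torch.cat([accepted, target_preds[-1].view(1)])  # bonus token when all drafts match
    return accepted
```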
The last layer of compression lives in the inference engine. Even an uncompressed model can run several times faster (or slower) depending on how its operators are scheduled and whether the runtime exploits hardware features.
| Engine | Origin | Strengths |
|---|---|---|
| llama.cpp | Community (Gerganov) | CPU and Apple Silicon; GGUF; runs almost anywhere; quantizations from Q2_K to Q8_0 |
| vLLM | UC Berkeley | PagedAttention, continuous batching, broad quantization support; throughput leader for many workloads |
| TensorRT-LLM | NVIDIA | Tightest NVIDIA hardware integration, FP8 on Hopper/Blackwell, in-flight batching, compiled kernels |
| TGI (Text Generation Inference) | Hugging Face | Production-ready serving with Hugging Face model hub integration |
| SGLang | LMSYS | Structured generation, RadixAttention for prefix caching |
| LMDeploy | InternLM | Throughput-focused; tensor parallelism with fused kernels |
| MLC LLM | CMU / Apache TVM | Cross-platform compilation including iOS, Android, WebGPU |
| ExLlamaV2 | Community | Fast 2-8 bit GPTQ-style quantization on consumer GPUs |
| ONNX Runtime | Microsoft | Cross-framework deployment with quantization, hardware backends |
None of these engines compress models in the literal weight-reduction sense; they make compressed models actually run at the speed their bit width promises. The Marlin kernel inside vLLM, for instance, achieves 741 tokens per second decoding INT4 AWQ models on an A100 [22], which is essentially impossible without hand-tuned mixed-precision GEMM kernels.
A developer compressing a model in 2026 has a deep toolbox:
| Library | What it does |
|---|---|
| bitsandbytes | The original LLM.int8() implementation; 8-bit and 4-bit quantization for PyTorch, drives QLoRA fine-tuning [6] |
| AutoGPTQ | Open-source GPTQ implementation, packs quantized models for inference |
| AutoAWQ | Open-source AWQ implementation, similar role to AutoGPTQ |
| llama.cpp | C/C++ inference with GGUF, the de facto CPU/Apple Silicon path |
| Optimum (Hugging Face) | Wraps multiple quantization backends with a unified API |
| NVIDIA TensorRT Model Optimizer | Unified library for quantization, pruning, distillation, and speculative decoding |
| Intel Neural Compressor | CPU-focused compression with PTQ, QAT, pruning, distillation |
| Microsoft DeepSpeed-Inference | Compression with model-parallel inference |
| Microsoft BitNet inference | Reference 1-bit and 1.58-bit inference framework |
For most practical LLM deployments, the workflow is: pick a base model, choose AWQ or GPTQ for GPU serving (or GGUF for CPU/Apple Silicon), quantize once with a small calibration set, deploy with vLLM or TensorRT-LLM. The whole pipeline takes hours, not days, and the resulting model usually loses less than a point of MMLU accuracy.
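As a concrete instance of that workflow, a sketch following AutoAWQ's documented usage pattern; the model id, output path, and config values are illustrative, and exact arguments can vary between AutoAWQ versions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"            # illustrative base model
quant_path = "mistral-7b-instruct-awq"                        # where to write the quantized weights
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)          # runs calibration internally
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
# The saved checkpoint can then be served with an AWQ-aware engine such as vLLM.
```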
Reported quality losses from compression are surprisingly small at moderate bit widths but degrade quickly at extremes. The numbers below are typical patterns from public benchmarks; exact figures vary by model and method.
| Method | Compression | Typical perplexity hit | Typical MMLU hit | When it breaks |
|---|---|---|---|---|
| FP16 baseline | 1x | 0 | 0 | n/a |
| INT8 (LLM.int8, SmoothQuant) | 2x | <0.1 | <0.5 pts | Almost never |
| FP8 (Hopper/Blackwell) | 2x | <0.1 | <0.5 pts | Reasoning at long context |
| INT4 GPTQ | 4x | 0.1 to 0.5 | 1 to 2 pts | Aggressive group sizes |
| INT4 AWQ | 4x | 0.1 to 0.4 | 1 to 2 pts | Multi-step reasoning |
| GGUF Q4_K_M | ~3.5x | 0.1 to 0.3 | 1 to 2 pts | Smaller models suffer more |
| GGUF Q2_K | ~6x | 0.5 to 1.5 | 5+ pts | General degradation |
| 2:4 sparsity (SparseGPT) | ~1.6x effective | 0.2 to 0.6 | 1 to 3 pts | At LLM scale, often worse than INT4 |
| BitNet b1.58 (native) | ~10x | 0 (matches FP16 at 3B+) | matches | Requires pretraining from scratch |
Accuracy on reasoning-heavy benchmarks like GSM8K and HumanEval tends to drop more than on general knowledge benchmarks like MMLU when models are aggressively quantized. Long-context tasks are also more sensitive to KV cache quantization than weight quantization.
Compression is not free, and several patterns emerge once methods get pushed.
Quantization degrades reasoning faster than it degrades surface fluency. A 4-bit model often sounds nearly identical to its FP16 counterpart on conversational tasks but stumbles on multi-step math or code. The reason appears to be that reasoning chains accumulate quantization error across many forward passes.
Pruning hurts more than quantization at LLM scale. The 2023 SparseGPT paper itself noted that 50% sparsity loses more perplexity than INT4 quantization on the same model. This is part of why pruning has not displaced quantization in production.
Distillation loses long-tail knowledge. A student model trained on the teacher's outputs picks up the common patterns easily but misses rare facts the teacher learned from training data the student never sees. This shows up as worse performance on niche topics and reduced ability to recall specific entities.
Compression interacts with calibration data quality. PTQ methods like GPTQ and AWQ depend on a few hundred to a few thousand calibration samples. Using calibration data that does not match the deployment distribution (English-only calibration for a multilingual model, or news text for a coding model) can cause noticeable quality regressions that look like the method failed.
Hardware support changes the math. INT4 looks like a 4x compression on paper, but if the GPU has no INT4 Tensor Cores, the actual speedup is closer to memory-bandwidth-limited 2x. Conversely, FP8 on Blackwell or 2:4 sparsity on Ampere can give nearly the full theoretical speedup because the silicon was designed for them.
KV cache compression is its own problem. Weight quantization barely touches the KV cache, which dominates memory at long context. Methods like AWQ-KV, KIVI, and FP8 KV cache are emerging specifically for this, but the trade-offs are different from weight quantization and the field is still working out the right defaults.
The choice of compression method depends on the deployment scenario:
For quick deployment on consumer GPUs, start with quantization using GGUF (for CPU/Apple Silicon) or AWQ/GPTQ (for NVIDIA GPUs). INT4 quantization with AWQ or GPTQ is the most popular approach for running 70B-parameter models on hardware with 24 GB to 48 GB of VRAM.
For maximum throughput in production serving, combine INT4/FP8 quantization with optimized inference engines like vLLM or TensorRT. Add speculative decoding when the workload is latency-bound rather than throughput-bound. Marlin-AWQ kernels achieve some of the highest throughput, reaching 741 tokens per second in benchmarks [22].
For creating a permanently smaller model, use structured pruning followed by distillation. NVIDIA's approach of pruning a large model (e.g., Llama 3.1 8B to 4B parameters) followed by distillation from the original model has shown strong results, with pruned models achieving 30% speed improvements while retaining most of the original quality [23].
For edge and mobile deployment, combine aggressive quantization (INT4 or lower) with GGUF and llama.cpp, or use MLC LLM for cross-platform compilation including WebGPU. The GGUF format with llama.cpp enables running quantized models on CPUs and Apple Silicon devices without a discrete GPU.
For maintaining near-original quality, use INT8 quantization or FP8 on supported hardware. These higher-precision quantization formats typically cause less than 1% accuracy degradation.
For building a small model from scratch with extreme efficiency, native low-bit pretraining like BitNet b1.58 is the frontier option, though it requires training the model from the ground up rather than retrofitting an existing one.
As of early 2026, model compression is a rapidly evolving field driven by the practical necessity of deploying increasingly large models. Several trends define the current landscape.
The llama.cpp ecosystem and GGUF format have democratized model compression, making it possible for individuals to run quantized versions of state-of-the-art models on personal hardware. Community members routinely quantize new open-weight models within hours of their release, and a healthy collection of pre-quantized models is available on Hugging Face for nearly every popular open-weight release.
Hardware is adapting to compression. NVIDIA's Blackwell architecture includes native FP8 and FP4 support, and the 2:4 structured sparsity pattern supported since Ampere provides a clean path from pruning to hardware acceleration. Apple's M-series chips have unified memory architectures that benefit dramatically from quantized models.
Research continues toward more extreme compression. Methods like QuIP# and AQLM push toward 2-bit quantization, and 1-bit and 1.58-bit models are an active research frontier. BitNet b1.58 in particular has demonstrated that native low-bit training is viable, though it currently requires pretraining from scratch rather than retrofitting existing models.
KV cache compression is becoming as important as weight compression for long-context serving. As context windows reach 1 million tokens or more, the KV cache can dwarf the model weights in memory cost, and techniques specifically targeting it are receiving heavy attention.
The combination of multiple compression techniques, applied in the optimal order of pruning, distillation, then quantization, represents the current best practice for maximum compression with preserved capability [17]. Tools like NVIDIA's TensorRT Model Optimizer provide unified libraries that support quantization, pruning, distillation, and speculative decoding in a single framework.
The broader trajectory points the same direction: as models get bigger, the gap between the resources needed to train them and the resources available to deploy them widens, and compression closes that gap. Every major open-weight release in 2025 came with quantized variants on day one, often before the FP16 weights had finished syncing across mirrors. That is the new normal.