Model compression refers to a family of techniques designed to reduce the size, memory footprint, and computational cost of machine learning models while preserving as much of their original performance as possible. As large language models have grown from millions to hundreds of billions of parameters, model compression has become essential for deploying these systems in production environments where hardware resources, latency requirements, and energy budgets impose hard constraints.
The core motivation is straightforward: a 70-billion-parameter model stored in FP16 requires roughly 140 GB of GPU memory just to load the weights, far exceeding the capacity of a single consumer GPU. Compression techniques make it possible to run such models on smaller hardware, reduce inference costs, and enable on-device deployment for mobile and edge applications [1].
Model compression encompasses four primary families of techniques. Each operates on a different principle, and they can often be combined for greater effect.
| Approach | Principle | Typical Compression | Retraining Required? | Key Trade-off |
|---|---|---|---|---|
| Quantization | Reduce numerical precision of weights and activations | 2x to 4x memory reduction | Sometimes (QAT) or No (PTQ) | Precision loss at very low bit widths |
| Pruning | Remove redundant or low-importance weights or structures | 50% to 90% sparsity | Often beneficial | Irregular sparsity may not yield hardware speedups |
| Knowledge distillation | Train a smaller student model to mimic a larger teacher | Variable (architecture-dependent) | Yes (full training of student) | Student capacity limits performance ceiling |
| Low-rank factorization | Decompose weight matrices into products of smaller matrices | 2x to 5x parameter reduction | Usually fine-tuning | Approximation error accumulates across layers |
Quantization reduces the numerical precision of model weights and, optionally, activations. A model originally stored in 32-bit floating point (FP32) or 16-bit floating point (FP16/BF16) is converted to lower-precision representations such as INT8, INT4, or FP8. Because each parameter occupies fewer bits, the model's memory footprint shrinks proportionally, and lower-precision arithmetic can execute faster on hardware that supports it [2].
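The arithmetic behind this is easy to see in a minimal sketch. The example below uses symmetric per-tensor round-to-nearest quantization; the helper names and the exact scheme are illustrative assumptions, not any particular library's API:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = np.abs(w).max() / 127.0           # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print("max abs error:", np.abs(w - w_hat).max())  # bounded by scale / 2
```

Each INT8 value occupies one byte instead of two (FP16) or four (FP32), and the worst-case rounding error is half the quantization step.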
The two main paradigms for quantization differ in when the precision reduction happens:
| Aspect | Post-Training Quantization (PTQ) | Quantization-Aware Training (QAT) |
|---|---|---|
| When applied | After training is complete | During training |
| Calibration data | Small calibration set (typically 128 to 1024 samples) | Full training dataset |
| Computational cost | Low (minutes to hours) | High (full training run) |
| Accuracy at INT8 | Near-lossless for most models | Near-lossless |
| Accuracy at INT4 | Noticeable degradation without advanced methods | Better preservation at very low bit widths |
| Use case | Quick deployment, large models where retraining is impractical | When maximum accuracy at low precision is required |
PTQ has become the dominant approach for LLM quantization because retraining models with hundreds of billions of parameters is prohibitively expensive. QAT remains valuable for smaller models or when the deployment target demands extreme compression (e.g., 2-bit or 3-bit) [3].
INT8 quantization maps weights to 8-bit integers, cutting memory in half compared to FP16. Most modern models tolerate INT8 quantization with negligible accuracy loss, making it the safest starting point.
INT4 quantization pushes further to 4-bit integers, achieving a 4x reduction from FP16. At this precision, naive round-to-nearest quantization causes significant degradation, so advanced methods like GPTQ and AWQ are required to maintain quality.
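The gap between these bit widths shows up even in naive round-to-nearest quantization. The sketch below (illustrative helper name, Gaussian dummy weights) compares reconstruction error at 8 and 4 bits:

```python
import numpy as np

def rtn_quantize(w, bits):
    """Round-to-nearest symmetric quantization at a given bit width."""
    qmax = 2 ** (bits - 1) - 1               # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                          # dequantized reconstruction

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
errs = {bits: np.abs(w - rtn_quantize(w, bits)).mean() for bits in (8, 4)}
print(errs)                                   # 4-bit error is far larger
```

With only 15 representable levels instead of 255, the 4-bit quantization step, and hence the rounding error, is roughly an order of magnitude larger, which is why INT4 methods like GPTQ and AWQ add compensation mechanisms on top of plain rounding.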
FP8 (8-bit floating point) is a newer format supported by NVIDIA's Hopper and Blackwell GPU architectures. FP8 retains a small exponent, giving it better dynamic range than INT8, which makes it particularly effective for activations that span wide value ranges.
Several methods have emerged specifically for quantizing large language models:
| Method | Year | Approach | Key Innovation |
|---|---|---|---|
| GPTQ | 2023 | Layer-wise PTQ using approximate second-order information | Solves a reconstruction problem per layer to minimize output error; supports INT4 and INT3 |
| AWQ (Activation-Aware Weight Quantization) | 2023 | PTQ that protects salient weight channels | Identifies the small fraction of weights most important for accuracy (based on activation magnitudes) and skips or scales them; faster calibration than GPTQ |
| GGUF | 2023 | File format and quantization scheme for CPU/GPU inference | Designed for llama.cpp; supports mixed-precision quantization (e.g., Q4_K_M, Q5_K_S) with per-block scaling |
| SmoothQuant | 2023 | Migrates quantization difficulty from activations to weights | Applies mathematically equivalent per-channel scaling to smooth activation outliers |
| QuIP# | 2024 | Uses incoherence processing for extreme quantization | Achieves 2-bit quantization with reasonable quality through lattice codebooks |
GPTQ works by solving a layer-wise reconstruction problem: for each layer, it finds the quantized weight matrix that minimizes the squared error between the original and quantized layer outputs on a small calibration set. The algorithm processes weights in order and uses approximate inverse Hessian information to compensate for quantization error [4].
AWQ takes a different philosophy. Rather than treating all weights equally, it recognizes that a small fraction of weight channels (roughly 1%) disproportionately affect model accuracy. AWQ identifies these salient channels by examining activation magnitudes and applies per-channel scaling to protect them during quantization. This approach requires less calibration data than GPTQ and often produces faster quantization [5].
The GGUF format, used by llama.cpp and its ecosystem, supports a variety of quantization levels denoted by names like Q4_K_M (4-bit with medium-sized K-quant blocks) or Q8_0 (8-bit with simple block quantization). The "K-quant" variants use non-uniform quantization with different bit widths for different layers based on their sensitivity, achieving better quality than uniform quantization at the same average bit width [6].
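Per-block scaling can be sketched as follows. This is a simplified illustration of the idea, not the actual GGUF bit layout; the block size and helper names are assumptions:

```python
import numpy as np

def blockwise_quantize(w, bits=4, block=32):
    """Round-to-nearest quantization with one scale per block of `block`
    consecutive weights (the idea behind per-block scaling; not the exact
    GGUF on-disk format)."""
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(-1, block)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                      # guard all-zero blocks
    q = np.clip(np.round(groups / scales), -qmax, qmax)
    return (q * scales).reshape(w.shape)           # dequantized reconstruction

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
err_block  = np.abs(w - blockwise_quantize(w, block=32)).mean()
err_tensor = np.abs(w - blockwise_quantize(w, block=w.size)).mean()
print(err_block, err_tensor)
```

Because each block's scale adapts to its local maximum rather than the global one, a single outlier weight degrades only its own block, and the average error drops relative to per-tensor quantization at the same bit width.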
Pruning removes parameters from a trained model that contribute little to its output. The intuition is that neural networks are typically over-parameterized, and a significant fraction of weights can be zeroed out without meaningful performance loss.
Unstructured pruning removes individual weights anywhere in the network, producing sparse weight matrices. While this can achieve very high sparsity levels (90%+), the resulting irregular sparsity patterns are difficult to accelerate on standard GPU hardware, which is optimized for dense matrix operations.
Structured pruning removes entire neurons, attention heads, or layers. The resulting model is a smaller but architecturally standard dense network that runs efficiently on existing hardware without specialized sparse computation libraries.
| Pruning Type | Granularity | Hardware Friendliness | Achievable Sparsity | Practical Speedup |
|---|---|---|---|---|
| Unstructured | Individual weights | Low (requires sparse kernels) | 80% to 95% | Limited without specialized hardware |
| Semi-structured (2:4) | 2 of every 4 weights | High (NVIDIA Ampere+ support) | 50% exactly | ~2x on supported hardware |
| Structured (neurons/heads) | Rows, columns, or blocks | High (standard dense operations) | 20% to 60% | Proportional to parameters removed |
The simplest pruning criterion is weight magnitude: weights with the smallest absolute values are assumed to be least important and are set to zero. Despite its simplicity, magnitude pruning remains a strong baseline. It can be applied iteratively, with pruning followed by fine-tuning cycles to recover accuracy.
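A minimal sketch of one-shot magnitude pruning (the helper name and thresholding logic are illustrative):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with smallest |value|."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128)).astype(np.float32)
pruned = magnitude_prune(w, 0.5)
print("sparsity:", np.mean(pruned == 0))
```

In iterative magnitude pruning, this step alternates with fine-tuning, gradually raising the sparsity target while the remaining weights recover the lost accuracy.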
SparseGPT, presented at ICML 2023, was the first method to demonstrate that massive language models (up to 176 billion parameters) could be pruned to 50% sparsity in one shot, without any retraining, at minimal accuracy loss. The method formulates pruning as a layer-wise sparse reconstruction problem and solves it using approximate second-order information, similar in spirit to GPTQ for quantization [7].
SparseGPT can achieve 60% unstructured sparsity with negligible perplexity increase on models like OPT-175B and BLOOM-176B, completing the pruning process in under 4.5 hours. It also supports semi-structured 2:4 and 4:8 sparsity patterns that can leverage hardware acceleration on NVIDIA Ampere and later GPUs.
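The 2:4 pattern itself is simple to express: in every group of four consecutive weights, keep the two largest in magnitude and zero the rest. A minimal sketch (illustrative, not a hardware-accelerated implementation):

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Enforce 2:4 semi-structured sparsity: in each group of 4 consecutive
    weights, keep the 2 with largest magnitude and zero the other 2."""
    groups = w.reshape(-1, 4)
    # indices of the two smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    out = groups.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(w.shape)

w = np.array([0.9, -0.1, 0.4, 0.05, -0.8, 0.7, 0.02, -0.3], dtype=np.float32)
print(prune_2_4(w))   # each group of 4 keeps its two largest-magnitude entries
```

The fixed 2-of-4 layout is what lets NVIDIA's sparse tensor cores skip the zeroed entries with a compact metadata index, which irregular unstructured sparsity cannot offer.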
Research in 2024 and 2025 has built on SparseGPT with methods like Wanda (Weights and Activations), which uses a simpler pruning criterion based on the product of weight magnitude and input activation norm, achieving comparable results with even lower computational cost.
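Wanda's criterion fits in a few lines. The sketch below assumes weights stored as an out x in matrix and calibration activations as samples x in; the shapes and helper name are illustrative:

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """Wanda-style pruning: score each weight by |w_ij| * ||x_j||_2 and,
    within each output row, zero the lowest-scoring fraction (one-shot,
    no retraining). W is out x in; X is samples x in."""
    scores = np.abs(W) * np.linalg.norm(X, axis=0)   # per-input-channel norms
    k = int(W.shape[1] * sparsity)
    drop = np.argsort(scores, axis=1)[:, :k]         # lowest scores per row
    out = W.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 128)).astype(np.float32)
X = rng.normal(size=(64, 128)).astype(np.float32)
pruned = wanda_prune(W, X)
print("sparsity:", np.mean(pruned == 0))
```

Unlike SparseGPT, no Hessian approximation or weight update is needed; the activation norms alone capture which input channels matter.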
Knowledge distillation transfers the learned behavior of a large "teacher" model to a smaller "student" model. The foundational framework was established by Hinton, Vinyals, and Dean in their 2015 paper "Distilling the Knowledge in a Neural Network" [8].
In standard training, a model learns from hard labels (e.g., "this image is a cat"). In distillation, the student model is trained to match the teacher's full output probability distribution, known as "soft labels" or "soft targets." These soft targets contain richer information than hard labels because they encode the teacher's learned similarities between classes.
For example, if a teacher model assigns probabilities of 0.7 to "cat," 0.2 to "tiger," and 0.05 to "dog," these soft targets reveal that the teacher considers cats more similar to tigers than to dogs. This relational knowledge, which Hinton called "dark knowledge," helps the student generalize better than it could from hard labels alone.
A key innovation in Hinton's work is temperature scaling. The softmax function is computed with a temperature parameter T:
p_i = exp(z_i / T) / sum_j exp(z_j / T)
Higher temperatures produce softer probability distributions, revealing more about the teacher's internal representations. During distillation, both teacher and student use the same elevated temperature (typically T=2 to T=20), and the student's loss is a weighted combination of the distillation loss (matching the teacher's soft targets) and the standard cross-entropy loss against hard labels.
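A minimal sketch of temperature-scaled softmax and the resulting distillation loss (the T^2 factor follows Hinton et al.'s convention for keeping gradient magnitudes comparable across temperatures; the logits and helper names are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T flattens the distribution."""
    z = np.asarray(z, dtype=np.float64) / T
    z -= z.max()                              # numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [5.0, 3.0, 1.0]              # hypothetical cat/tiger/dog logits
for T in (1, 4):
    print(f"T={T}:", np.round(softmax(teacher_logits, T), 3))

def distill_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between teacher and student soft targets, scaled by T^2."""
    p = softmax(teacher_logits, T)                 # teacher soft targets
    log_q = np.log(softmax(student_logits, T))     # student log-probabilities
    return -(p * log_q).sum() * T * T
```

In a full training loop this term would be combined with the hard-label cross-entropy loss via a weighting coefficient, as described above; the loss is minimized exactly when the student's logits reproduce the teacher's distribution.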
Distillation has become a central strategy for creating smaller, deployable versions of large language models.
A 2025 study on optimal compression ordering found that the sequence of Pruning, then Distillation, then Quantization (P-KD-Q) yields the best balance between compression ratio and preserved capability. Pruning first establishes the structural foundation, distillation recovers lost knowledge, and quantization applies last without interfering with architectural changes [9].
Low-rank factorization decomposes a large m x n weight matrix W into the product of two smaller matrices, W ≈ AB, where A is m x r and B is r x n, with rank r much smaller than both m and n. This reduces the number of parameters from mn to r(m + n).
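The standard way to compute the best rank-r factorization is truncated SVD, which the Eckart-Young theorem shows is optimal in Frobenius norm. A minimal sketch with illustrative dimensions:

```python
import numpy as np

def low_rank_factorize(W, r):
    """Best rank-r approximation of W via truncated SVD (Eckart-Young):
    returns A (m x r) and B (r x n) with W ~= A @ B."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]            # fold singular values into A
    B = Vt[:r, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512)).astype(np.float32)  # illustrative layer shape
A, B = low_rank_factorize(W, r=64)
params_before = W.size              # m * n = 131,072
params_after = A.size + B.size      # r * (m + n) = 49,152
print(params_before, params_after)
```

At inference time the layer is applied as two smaller matrix multiplies, (x @ A) @ B, so the parameter reduction translates directly into fewer multiply-accumulates.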
This approach is closely related to LoRA (Low-Rank Adaptation), which applies low-rank updates during fine-tuning rather than compressing existing weights. Full low-rank factorization of pretrained weights is less common for LLMs than quantization or pruning, but it sees use in combination with other techniques and in specialized settings like compressing embedding layers.
The following table summarizes the practical considerations for choosing among compression methods:
| Criterion | Quantization | Pruning | Distillation | Low-Rank Factorization |
|---|---|---|---|---|
| Memory reduction | 2x to 4x | Proportional to sparsity (with sparse format) | Depends on student architecture | 2x to 5x |
| Inference speedup | 1.5x to 3x with hardware support | Limited (unstructured) to proportional (structured) | Determined by student size | Moderate |
| Implementation effort | Low (many off-the-shelf tools) | Medium | High (requires training pipeline) | Medium |
| Quality preservation | Excellent at INT8; good at INT4 with GPTQ/AWQ | Good at 50% sparsity; degrades at higher rates | Depends on student capacity | Variable |
| Computational cost | Minutes (PTQ) | Hours (one-shot methods) | Days to weeks (full training) | Hours (factorization + fine-tuning) |
| Combinability | Combines well with pruning and distillation | Combines well with quantization | Often the final step | Combines with quantization |
The choice of compression method depends on the deployment scenario:
For quick deployment on consumer GPUs: Start with quantization using GGUF (for CPU/Apple Silicon) or AWQ/GPTQ (for NVIDIA GPUs). INT4 quantization with AWQ or GPTQ is the most popular approach for running 70B-parameter models, whose 4-bit weights alone occupy roughly 35 GB, on hardware with 40 GB to 48 GB of VRAM (or two 24 GB GPUs).
For maximum throughput in production serving: Combine INT4/FP8 quantization with optimized inference engines like vLLM or TensorRT-LLM. Marlin-AWQ kernels achieve the highest throughput, reaching 741 tokens per second in benchmarks [10].
For creating a permanently smaller model: Use structured pruning followed by distillation. NVIDIA's approach of pruning a large model (e.g., Llama 3.1 8B to 4B parameters) followed by distillation from the original model has shown strong results, with pruned models achieving 30% speed improvements while retaining most of the original quality [11].
For edge and mobile deployment: Combine aggressive quantization (INT4 or lower) with structured pruning. The GGUF format with llama.cpp enables running quantized models on CPUs and Apple Silicon devices without a discrete GPU.
For maintaining near-original quality: Use INT8 quantization or FP8 on supported hardware. These higher-precision quantization formats typically cause less than 1% accuracy degradation.
As of early 2026, model compression is a rapidly evolving field driven by the practical necessity of deploying increasingly large models. Several trends define the current landscape:
The llama.cpp ecosystem and GGUF format have democratized model compression, making it possible for individuals to run quantized versions of state-of-the-art models on personal hardware. Community members routinely quantize new open-weight models within hours of their release.
Hardware is adapting to compression. NVIDIA's Blackwell architecture includes native FP8 and FP4 support, and the 2:4 structured sparsity pattern supported since Ampere provides a clean path from pruning to hardware acceleration.
Research continues toward more extreme compression. Methods like QuIP# and AQLM push toward 2-bit quantization, and 1-bit models (binary or ternary weights) are an active research frontier, though they currently require specialized training procedures and suffer larger accuracy penalties.
The combination of multiple compression techniques, applied in the optimal order of pruning, distillation, then quantization, represents the current best practice for maximum compression with preserved capability [9]. Tools like NVIDIA's TensorRT Model Optimizer provide unified libraries that support quantization, pruning, distillation, and speculative decoding in a single framework.