GPTQ
Last reviewed
May 7, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 4,296 words
GPTQ (Generative Pre-trained Transformer Quantization) is a post-training quantization method for large language models that compresses weight matrices to 4, 3, or 2 bits using approximate second-order information. Introduced by Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh at IST Austria's Distributed Algorithms and Systems Lab (DASLab) and ETH Zurich, GPTQ can quantize a 175-billion-parameter model in roughly four GPU hours while preserving near-FP16 accuracy. The paper was accepted at ICLR 2023 (arXiv:2210.17323) and has become one of the most widely adopted quantization approaches for running large models on consumer and data-center GPUs.
As large language models grew from hundreds of millions to hundreds of billions of parameters, the computational and memory cost of inference scaled proportionally. A 70-billion-parameter model in 16-bit floating-point requires roughly 140 GB of GPU memory just to hold the weights, far exceeding what a single GPU can accommodate. Distributing inference across multiple GPUs adds latency, cost, and engineering complexity.
Quantization addresses this by representing weights using fewer bits. A 4-bit integer uses one quarter of the storage of a 16-bit float, which in theory allows a 70B model to fit in around 35 GB. The challenge is doing this without destroying the model's accuracy.
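The storage arithmetic above can be checked with a few lines of Python. This is a minimal sketch that counts weights only, using 1 GB = 10^9 bytes; the helper name `weight_memory_gb` is made up for illustration, and activations, the KV cache, and quantization metadata are ignored.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(70e9, 16)   # 70B parameters at 16 bits each
int4_gb = weight_memory_gb(70e9, 4)    # the same model at 4 bits each
print(f"FP16: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
```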
The simplest approach, called round-to-nearest (RTN), independently rounds each weight to the nearest representable value in the target format. RTN is fast and requires no calibration data, but it accumulates per-layer quantization errors that compound across the depth of a transformer. At 4-bit precision, RTN often degrades perplexity modestly but manageably; at 3-bit it frequently fails catastrophically, producing output that is no longer coherent.
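A minimal NumPy sketch of round-to-nearest is shown below. It assumes a uniform asymmetric grid with one min-max scale and zero-point per row, which is one common choice and not necessarily the exact variant used in any particular benchmark; the function name `rtn_quantize` is illustrative.

```python
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round each row of w to a uniform asymmetric grid; return dequantized values."""
    levels = 2**bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / levels     # step size per row
    zero = np.round(-w_min / scale)                      # integer zero-point per row
    q = np.clip(np.round(w / scale) + zero, 0, levels)   # integer codes in [0, levels]
    return (q - zero) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64))
mse4 = float(np.mean((w - rtn_quantize(w, bits=4)) ** 2))
mse3 = float(np.mean((w - rtn_quantize(w, bits=3)) ** 2))
print(f"4-bit MSE: {mse4:.4f}, 3-bit MSE: {mse3:.4f}")
```

Each weight is rounded on its own, with no attempt to let other weights absorb the error. That independence is what makes RTN fast, and also what the Hessian-based methods improve on.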
Before GPTQ, the best alternatives to RTN came from the Optimal Brain Surgeon family of methods, which used second-order (Hessian-based) information to choose which weights to prune or quantize and how to compensate for the resulting errors. These methods were accurate but so computationally expensive that they had only been applied to models with a few hundred million parameters at most. GPTQ's core contribution was to make this class of methods fast enough to run on models with 175 billion parameters.
The paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" (Frantar et al., 2022) was submitted to arXiv on October 31, 2022, and accepted at ICLR 2023. The official implementation is maintained by IST DASLab at https://github.com/IST-DASLab/gptq.
The paper demonstrates that GPTQ can compress OPT-175B from 16-bit to 4-bit in approximately 4.2 GPU hours on a single NVIDIA A100, achieving a WikiText2 perplexity of 8.37 versus 8.34 for FP16, a gap of only 0.03. Round-to-nearest at the same bitwidth gives 10.54, a much larger 2.2-point degradation. At 3-bit, RTN on OPT-175B produces a perplexity above 7,000 (complete model collapse), while GPTQ holds at 8.68. These numbers were striking because no prior method had achieved anything close to this accuracy at 3-bit for a model of that scale.
For inference throughput, the paper reports a 3.25x generation speedup on an NVIDIA A100 and a 4.5x speedup on an A6000 when comparing GPTQ-compressed 3-bit models against FP16 baselines, primarily because less data needs to be transferred from GPU memory to compute units during the memory-bandwidth-bound autoregressive decoding phase.
GPTQ builds on Optimal Brain Quantization (OBQ), which itself descends from Optimal Brain Surgeon (OBS), a technique from the early 1990s for pruning neural networks by removing the weights that cause the least increase in output error. OBQ adapted this to quantization: rather than setting a weight to zero, it rounds the weight to a nearby quantized value, then adjusts remaining weights to compensate for the error introduced.
For a single linear layer with weight matrix W and inputs X, OBQ minimizes the layer-wise reconstruction error:
min_{W_hat} ||WX - W_hat X||_F^2
The update rule after quantizing a single weight uses the inverse of the layer's Hessian matrix H = 2XX^T. After quantizing weight w_q, the remaining weights are updated as:
delta W = -(w_q - quant(w_q)) / [H^{-1}]_{qq} * H^{-1}_{:,q}
This compensates for the quantization error by redistributing it across the remaining, not-yet-quantized weights. OBQ's approach produces excellent results but has a per-layer runtime of O(d_row * d_col^3), which is prohibitive for weight matrices with tens of thousands of rows and columns.
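The update rule can be illustrated on a single row of W. The sketch below performs only one OBQ step on toy data; the full procedure would also remove the quantized coordinate from the inverse Hessian before the next step, which is omitted here. The names `obq_step` and `layer_error` and the toy grid with step 0.5 are assumptions made for illustration.

```python
import numpy as np

def obq_step(w, H_inv, q, quant):
    """Quantize coordinate q of row w and compensate on the other coordinates."""
    err = (w[q] - quant(w[q])) / H_inv[q, q]
    return w - err * H_inv[:, q]

rng = np.random.default_rng(0)
d_col, n = 6, 32
X = rng.normal(size=(d_col, n))          # layer inputs, one column per sample
H = 2 * X @ X.T                          # Hessian of the layer-wise objective
H_inv = np.linalg.inv(H)
w = rng.normal(size=d_col)               # one row of the weight matrix

def quant(v):
    """Toy quantizer: round to a grid with step 0.5."""
    return np.round(v * 2) / 2

w_obq = obq_step(w, H_inv, q=0, quant=quant)   # quantize and compensate
w_plain = w.copy()
w_plain[0] = quant(w[0])                       # quantize without compensation

def layer_error(w_hat):
    """Squared output error of this row over the calibration inputs."""
    return float(np.sum(((w - w_hat) @ X) ** 2))

print(layer_error(w_obq), layer_error(w_plain))
```

After the step, coordinate 0 sits exactly on the grid and the other coordinates have moved slightly. Because the compensated update minimizes the quadratic objective subject to that constraint, its output error cannot exceed the error of rounding alone.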
GPTQ introduces three algorithmic changes that collectively reduce the complexity to something tractable at billion-parameter scale.
Arbitrary order quantization. OBQ quantizes weights greedily, picking the weight that causes the least error at each step and using a different optimal order for every row of the matrix. Frantar et al. observe that for large matrices, quantizing all rows in the same column order yields essentially the same accuracy as per-row optimal ordering. This is because the Hessian matrix H depends only on the layer's inputs X, not on the weights themselves, so the relative importance of columns is shared across rows. Fixing a single column order for all rows reduces the dominant cost from O(d_row * d_col^3) to O(max(d_row * d_col^2, d_col^3)), a substantial reduction.
Lazy batch updates. Even after the ordering simplification, repeatedly updating the full remaining-weight matrix after each column quantization causes too many GPU memory reads and writes. GPTQ instead processes weights in blocks of 128 columns at a time. Within a block, updates to weights outside the block are deferred; the external weights are updated in a single pass after the block is done. This batching keeps the working set in fast on-chip memory and reduces global memory bandwidth by roughly an order of magnitude.
Cholesky decomposition for numerical stability. Inverting large Hessian matrices repeatedly causes numerical errors to accumulate. GPTQ precomputes the Cholesky decomposition of the Hessian inverse once per layer at the start of quantization. This makes subsequent updates numerically stable and avoids the matrix inversion errors that would otherwise cause accuracy problems, especially on models with billions of parameters. A small dampening term, 1% of the average diagonal value of H, is added to the diagonal of H before decomposition to handle near-zero eigenvalues.
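The three changes can be combined into a compact NumPy sketch. It is modeled loosely on the structure of the reference implementation but is not that code: the per-row min-max quantizer, the function names, the toy data, and the small block size of 8 are all assumptions chosen to keep the example short and runnable.

```python
import numpy as np

def make_row_quantizer(W, bits):
    """Per-row asymmetric min-max grid, fixed before quantization starts."""
    levels = 2**bits - 1
    w_min, w_max = W.min(axis=1), W.max(axis=1)
    scale = np.maximum(w_max - w_min, 1e-8) / levels
    zero = np.round(-w_min / scale)
    def quant(col):                       # col holds one weight per row
        q = np.clip(np.round(col / scale) + zero, 0, levels)
        return (q - zero) * scale
    return quant

def gptq_quantize(W, X, bits=4, block=128, percdamp=0.01):
    W = W.astype(np.float64).copy()
    d_col = W.shape[1]
    quant = make_row_quantizer(W, bits)
    H = 2.0 * X @ X.T
    H += percdamp * np.mean(np.diag(H)) * np.eye(d_col)    # dampening
    # Upper Cholesky factor U of H^{-1}, so that H^{-1} = U.T @ U
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    Q = np.zeros_like(W)
    for i1 in range(0, d_col, block):     # same column order for every row
        i2 = min(i1 + block, d_col)
        Err = np.zeros((W.shape[0], i2 - i1))
        for i in range(i1, i2):
            q = quant(W[:, i])
            Q[:, i] = q
            err = (W[:, i] - q) / U[i, i]
            W[:, i:i2] -= np.outer(err, U[i, i:i2])   # update inside the block only
            Err[:, i - i1] = err
        W[:, i2:] -= Err @ U[i1:i2, i2:]  # lazy update of all later columns at once
    return Q

rng = np.random.default_rng(0)
d_row, d_col, n = 32, 32, 256
# Correlated toy inputs: a few shared directions plus a little noise
X = rng.normal(size=(d_col, 4)) @ rng.normal(size=(4, n)) + 0.1 * rng.normal(size=(d_col, n))
W = rng.normal(size=(d_row, d_col))

Q_gptq = gptq_quantize(W, X, bits=4, block=8)
rtn = make_row_quantizer(W, bits=4)
Q_rtn = np.stack([rtn(W[:, j]) for j in range(d_col)], axis=1)

def layer_error(Q):
    return float(np.linalg.norm(W @ X - Q @ X) ** 2)

err_gptq, err_rtn = layer_error(Q_gptq), layer_error(Q_rtn)
print(f"GPTQ error: {err_gptq:.2f}, RTN error: {err_rtn:.2f}")
```

Both quantizers use the same grid, so any difference in layer error comes from the error compensation. The toy inputs are deliberately correlated, since compensation has little to work with when input features are independent.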
GPTQ processes one transformer layer at a time. For each layer, it records a batch of calibration activations by running a small dataset (typically 128 sequences from C4 or WikiText2) through the model up to that layer. These activations serve as the matrix X in the Hessian computation. The layer's weight matrix is then quantized using the three innovations above, and the algorithm moves to the next layer. Because each layer is handled independently and the calibration data can be discarded afterward, GPU memory usage during quantization stays relatively low.
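Collecting the Hessian does not require holding all calibration activations at once; it can be accumulated batch by batch. The sketch below keeps a running average H = (2/N) Σ x xᵀ, which differs from 2XXᵀ only by the factor 1/N. That factor cancels in the update rule, since the rule divides one entry of the inverse Hessian by another. The class name and the toy batch shapes are illustrative.

```python
import numpy as np

class HessianAccumulator:
    """Running estimate of H = (2 / N) * sum_i x_i x_i^T over calibration tokens."""

    def __init__(self, d_col: int):
        self.H = np.zeros((d_col, d_col))
        self.n = 0

    def add_batch(self, x: np.ndarray) -> None:
        """x has shape (tokens, d_col): the inputs seen by one linear layer."""
        t = x.shape[0]
        self.H *= self.n / (self.n + t)    # rescale the old average to the new count
        self.n += t
        self.H += (2.0 / self.n) * x.T @ x

rng = np.random.default_rng(0)
batches = [rng.normal(size=(t, 8)) for t in (5, 7, 3)]
acc = HessianAccumulator(8)
for b in batches:
    acc.add_batch(b)

all_x = np.concatenate(batches)
H_direct = (2.0 / all_x.shape[0]) * all_x.T @ all_x   # same quantity in one shot
```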
The basic GPTQ algorithm uses a single scale and zero-point per output channel (per-channel quantization). Group quantization divides each weight row into smaller contiguous blocks, called groups, and assigns separate scale and zero-point parameters to each group. A typical group size is 128, meaning that for every 128 consecutive weights in a row, there is one scale and one offset.
Group quantization improves accuracy, particularly at 3-bit precision, because the local dynamic range within a group is usually narrower than the full row's range, allowing for better use of the available quantization levels. The cost is a small increase in storage: a 4-bit model with group size 128 stores one 16-bit scale per 128 weights, adding roughly 0.125 bits per weight (about 3% of the 4-bit payload) compared to per-channel quantization, but still far less than the original FP16 model.
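The sketch below puts numbers on both effects: the storage cost of per-group metadata and the accuracy benefit of a narrower local range. The `zero_bits` parameter is a hypothetical knob for implementations that also store a zero-point per group, and the outlier at index 5 is planted by hand to make the range effect visible.

```python
import numpy as np

def effective_bits(bits, group_size, scale_bits=16, zero_bits=0):
    """Average bits per weight once per-group metadata is included."""
    return bits + (scale_bits + zero_bits) / group_size

def quantize_groups(w_row, bits=4, group_size=128):
    """Min-max round-to-nearest with a separate grid for each contiguous group."""
    levels = 2**bits - 1
    g = w_row.reshape(-1, group_size)
    g_min = g.min(axis=1, keepdims=True)
    g_max = g.max(axis=1, keepdims=True)
    scale = np.maximum(g_max - g_min, 1e-8) / levels
    zero = np.round(-g_min / scale)
    q = np.clip(np.round(g / scale) + zero, 0, levels)
    return ((q - zero) * scale).reshape(-1)

bpw = effective_bits(4, 128)          # 4 + 16/128 bits per weight
overhead = (bpw - 4) / 4              # fraction of the 4-bit payload

rng = np.random.default_rng(0)
row = rng.normal(size=1024)
row[5] = 20.0                         # a single outlier stretches the full-row range
mse_row = float(np.mean((row - quantize_groups(row, group_size=1024)) ** 2))
mse_grp = float(np.mean((row - quantize_groups(row, group_size=128)) ** 2))
print(f"{bpw} bits/weight, per-row MSE {mse_row:.4f}, grouped MSE {mse_grp:.4f}")
```

With grouping, the outlier only degrades the one group that contains it; the other seven groups keep a fine grid.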
An optional heuristic called activation ordering (sometimes written as act_order in code) further improves accuracy, particularly for smaller models or aggressive bit-widths. Without act-order, GPTQ quantizes columns left-to-right. With act-order, columns are sorted in decreasing order of their contribution to the Hessian diagonal before quantization begins. In practice this means quantizing first the weight columns corresponding to activations with the highest variance, which are the ones that would cause the most damage if quantized poorly.
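In code, activation ordering amounts to a column permutation applied before quantization and undone afterwards. The sketch below shows only that bookkeeping on toy data; the quantization step itself is left out, and the function name is illustrative.

```python
import numpy as np

def act_order_permutation(H):
    """Column order with the largest Hessian diagonal entries first."""
    return np.argsort(-np.diag(H))

rng = np.random.default_rng(0)
# Toy inputs whose features have very different magnitudes
X = rng.normal(size=(6, 64)) * rng.uniform(0.1, 3.0, size=(6, 1))
H = 2 * X @ X.T
W = rng.normal(size=(3, 6))

perm = act_order_permutation(H)
W_perm = W[:, perm]                   # reorder weight columns
H_perm = H[perm][:, perm]             # reorder Hessian rows and columns to match
# ... quantize W_perm column by column using H_perm ...
inv_perm = np.argsort(perm)
W_restored = W_perm[:, inv_perm]      # restore the original column order
```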
For LLaMA-7B at 4-bit, enabling both --true-sequential (which respects the sequential structure of the attention and MLP sublayers within a block rather than treating each transformer block as a single layer) and --act-order reduces WikiText2 perplexity from around 7.15 to 6.09, a substantial improvement. For larger models the benefit of act-order is smaller because the per-column variance is already more evenly distributed.
Act-order has a practical downside: when used without --static-groups, the column reordering interacts with group boundaries and requires tracking a permutation during inference, which can add latency. The --static-groups option precomputes group grids relative to the reordered columns, removing this inference-time cost.
4-bit is GPTQ's primary target and the configuration most commonly used in practice. At 4-bit, models typically retain perplexity within 0.1-0.5 points of the FP16 baseline, depending on model size and architecture. Larger models quantize more cleanly than smaller ones: a 70B model at 4-bit is usually indistinguishable from FP16 on most benchmarks, while a 7B model may show a more noticeable degradation. In terms of storage, 4-bit quantization with group size 128 reduces a model's weight footprint to roughly 4.1 bits per parameter on average when accounting for the group scales.
At 3-bit, GPTQ's advantage over RTN is most dramatic. RTN typically collapses at 3-bit for models under 30B parameters, while GPTQ maintains usable quality. On OPT-66B, act-order drops WikiText2 perplexity from 14.16 to 9.95 at 3-bit, a difference of more than 4 points. The practical use case for 3-bit is fitting very large models into constrained GPU memory: a 70B model at 3-bit (with group scales) occupies roughly 28-30 GB, fitting on a 32 GB GPU.
The paper also explores 2-bit and ternary (-1, 0, +1) quantization, primarily to demonstrate the algorithm's behavior under extreme compression rather than as a production-ready configuration. At 2-bit, OPT-175B achieves a perplexity of around 10-12 depending on the group size, which remains coherent. Ternary quantization of OPT-175B gives a perplexity of approximately 9.2, notable mainly because ternary weights can be implemented with extremely efficient custom hardware using addition and subtraction rather than multiplication.
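The point about ternary arithmetic can be made concrete: with weights restricted to -1, 0, and +1, each output is a sum of some inputs minus a sum of others, followed by a single scaling. The sketch below mimics this in NumPy on random toy data; it illustrates the arithmetic only, not an efficient kernel, and the function name is an assumption.

```python
import numpy as np

def ternary_matvec(t, scale, x):
    """Compute (scale * t) @ x for t in {-1, 0, +1} without multiplying by weights."""
    pos = np.where(t == 1, x, 0.0).sum(axis=1)    # inputs that are added
    neg = np.where(t == -1, x, 0.0).sum(axis=1)   # inputs that are subtracted
    return scale * (pos - neg)                    # one multiply per output row

rng = np.random.default_rng(0)
t = rng.integers(-1, 2, size=(4, 16))             # entries drawn from {-1, 0, 1}
scale = rng.uniform(0.5, 1.5, size=4)             # one scale per output row
x = rng.normal(size=16)

y = ternary_matvec(t, scale, x)
y_ref = (scale[:, None] * t) @ x                  # ordinary dense product
```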
In practice, 2-bit models show noticeable quality degradation for most tasks and are rarely used for general-purpose inference. They remain relevant for research into hardware-optimized inference and as a reference point for the accuracy ceiling of post-training methods at extreme compression ratios.
The table below summarizes the main trade-offs between GPTQ and the other widely deployed quantization methods for LLMs.
| Method | Calibration required | Quantization time | GPU memory at inference | Supports fine-tuning | Best use case |
|---|---|---|---|---|---|
| GPTQ | Yes (small dataset) | Hours (175B model) | ~4x reduction at 4-bit | Via adapters (PEFT) | GPU inference, large model compression |
| AWQ | Yes (small dataset) | Minutes | ~4x reduction at 4-bit | Via adapters | GPU inference, better generalization |
| bitsandbytes (NF4) | No | None (on-the-fly) | ~4x reduction | Yes (QLoRA) | Fine-tuning, quick deployment |
| RTN | No | Seconds | ~4x reduction at 4-bit | N/A | Fast prototyping, large models |
| GGUF (llama.cpp) | No | Minutes | Varies (Q4 to Q8) | No | CPU inference, Apple Silicon |
Round-to-nearest is fast and requires no calibration data, but at 3-bit or below the accuracy gap over GPTQ is large enough that RTN is effectively unusable. At 4-bit, GPTQ typically outperforms RTN by 1-3 perplexity points for sub-30B models, which often translates to measurable downstream benchmark differences.
AWQ (Activation-aware Weight Quantization, Lin et al., 2023) takes a different approach to protecting accuracy: it identifies a small fraction of salient weight channels by examining activation magnitudes, then scales those channels up before quantization to reduce their relative error. AWQ is generally faster to apply than GPTQ (minutes rather than hours for large models) and has been shown to generalize better to out-of-distribution inputs. GPTQ uses second-order information to compensate errors across the entire weight matrix, which can give it a slight edge on in-distribution benchmarks but may overfit to the calibration distribution.
In practice, the accuracy difference between GPTQ and AWQ at 4-bit is small for most tasks. AWQ tends to be preferred when quantization speed matters or when the deployment requires robustness to diverse input domains. GPTQ has a larger catalog of pre-quantized models on the Hugging Face Hub, built primarily through the community contributions of Tom Jobbins ("TheBloke"), who quantized hundreds of popular models during 2023.
bitsandbytes (developed by Tim Dettmers) quantizes weights on the fly during model loading using NF4 (NormalFloat4), a data type designed specifically for normally distributed neural network weights. Because bitsandbytes does not require an offline calibration pass, it is the simplest way to load a large model in reduced precision. Its primary advantage is compatibility with QLoRA fine-tuning: the 4-bit base model stays frozen while low-rank adapters are trained on top, enabling fine-tuning of very large models on a single GPU.
The main tradeoff is inference speed. bitsandbytes is slower than GPTQ with optimized kernels (ExLlama or Marlin) for pure generation workloads, because the on-the-fly dequantization is not as efficiently fused as the offline-prepared GPTQ kernels.
The memory savings from GPTQ are straightforward to calculate: a 70B model at FP16 uses about 140 GB. At 4-bit with group size 128, the weights take up roughly 35-37 GB, fitting on a single 40 GB A100 or two consumer 24 GB GPUs.
The inference speed story is more nuanced. LLM text generation is memory-bandwidth bound during the autoregressive decoding phase: the GPU spends most of its time loading weight matrices from HBM memory rather than performing arithmetic. A 4-bit representation has one quarter the bytes to transfer, which in principle gives up to a 4x speedup. In practice, the speedup at batch size 1 is lower, because dequantization takes some compute and because the result depends heavily on how efficiently the quantized kernels are implemented. Well-optimized kernels can approach 3-4x on the matrix multiplications themselves, while end-to-end gains are often much smaller.
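The bandwidth argument gives a simple upper bound on decoding speed: if every generated token requires reading all weights once, tokens per second cannot exceed memory bandwidth divided by the weight footprint. The sketch below evaluates that bound for a hypothetical 13B-parameter model and an assumed round bandwidth of 2,000 GB/s; both numbers are illustrative, and the bound ignores the KV cache, activations, and dequantization cost.

```python
def decode_tokens_per_s(n_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound when each generated token reads every weight once from GPU memory."""
    bytes_per_token = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

fp16_bound = decode_tokens_per_s(13e9, 16, 2000)   # assumed model size and bandwidth
int4_bound = decode_tokens_per_s(13e9, 4, 2000)
ideal_speedup = int4_bound / fp16_bound
print(f"FP16 bound: {fp16_bound:.0f} tok/s, 4-bit bound: {int4_bound:.0f} tok/s")
```

The ratio of the two bounds is exactly the ratio of bit widths, which is where the theoretical 4x comes from. Measured throughput usually sits well below either bound.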
At larger batch sizes the picture changes. When many sequences are processed simultaneously, the arithmetic intensity (compute per byte) increases, and the memory bandwidth advantage of quantized weights shrinks. This is why the ExLlama kernel provides large benefits at batch size 1 but does not scale as well to batch sizes above 16-32. The Marlin kernel (discussed in the successor work section) addresses this by using a different tiling strategy.
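A rough roofline calculation shows why the advantage fades with batch size. For a weight matrix multiplied by a batch of token vectors, the arithmetic is proportional to the batch size while the weight traffic is fixed, so arithmetic intensity grows linearly with the batch. The sketch below assumes a hypothetical GPU with 300 TFLOP/s of peak compute and 2 TB/s of memory bandwidth, counts only weight traffic, and ignores dequantization; the function names are illustrative.

```python
def arithmetic_intensity(batch_tokens, bits_per_weight):
    """FLOPs per byte of weight traffic for y = W x, with 2 FLOPs per multiply-add."""
    return 2 * batch_tokens / (bits_per_weight / 8)

def crossover_batch(peak_flops, bandwidth_bytes, bits_per_weight):
    """Batch size at which the matmul stops being memory-bandwidth bound."""
    ridge = peak_flops / bandwidth_bytes           # FLOPs per byte the GPU can sustain
    return ridge * (bits_per_weight / 8) / 2

PEAK_FLOPS = 300e12      # assumed peak compute, FLOP/s
BANDWIDTH = 2e12         # assumed memory bandwidth, bytes/s

fp16_crossover = crossover_batch(PEAK_FLOPS, BANDWIDTH, 16)
int4_crossover = crossover_batch(PEAK_FLOPS, BANDWIDTH, 4)
print(f"FP16 crossover: {fp16_crossover:.1f}, 4-bit crossover: {int4_crossover:.1f}")
```

Under these assumptions a 4-bit matmul becomes compute-bound at a batch size four times smaller than its FP16 counterpart. Past that point, moving fewer bytes no longer buys additional throughput.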
A benchmark from the Hugging Face integration blog, run on a single NVIDIA A100-SXM4-80GB GPU with batch size 1, prompt length 512, and 512 output tokens, shows:
| Configuration | Per-token latency (ms) | Throughput (tokens/s) | Peak memory (MB) |
|---|---|---|---|
| FP16 baseline | 36.96 | 27.06 | 29,153 |
| GPTQ 4-bit (ExLlama kernel) | 33.71 | 29.66 | 10,484 |
| GPTQ 4-bit (AutoGPTQ legacy kernel) | 46.44 | 21.53 | 10,345 |
The ExLlama kernel here is both faster and uses less memory than FP16, while the older AutoGPTQ kernel is slower despite using less memory, illustrating how much kernel quality matters.
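The relative differences are easier to read as ratios against the FP16 row. The snippet below simply recomputes them from the figures in the table; the dictionary keys are shorthand labels, not official configuration names.

```python
# (per-token latency ms, throughput tokens/s, peak memory MB), copied from the table
rows = {
    "fp16": (36.96, 27.06, 29153),
    "gptq_exllama": (33.71, 29.66, 10484),
    "gptq_legacy": (46.44, 21.53, 10345),
}

base_latency, base_tps, base_mem = rows["fp16"]
summary = {}
for name, (latency, tps, mem) in rows.items():
    # throughput relative to FP16, and memory reduction factor relative to FP16
    summary[name] = (tps / base_tps, base_mem / mem)
    print(f"{name}: {summary[name][0]:.2f}x throughput, {summary[name][1]:.2f}x less memory")
```

In this particular benchmark the ExLlama kernel gains about 10% in throughput while cutting peak memory by a factor of roughly 2.8, and the legacy kernel loses about 20% in throughput for a similar memory saving.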
The original implementation at https://github.com/IST-DASLab/gptq covers OPT, BLOOM, and LLaMA families with Python and CUDA code. The repository includes evaluation scripts for WikiText2, PTB, and C4 perplexity, as well as zero-shot task benchmarks via the lm-evaluation-harness. It serves primarily as a research reference; production users generally use one of the downstream libraries described below.
Shortly after the original paper, the community contributor qwopqwop200 released GPTQ-for-LLaMa (https://github.com/qwopqwop200/GPTQ-for-LLaMa), a fork adapted specifically for LLaMA and its instruction-tuned variants. This project was the first to demonstrate GPTQ working on the LLaMA models that had been released in early 2023, and it attracted significant attention in the open-source community. Several forks appeared in parallel, including versions by oobabooga that integrated with the text-generation-webui project.
AutoGPTQ (https://github.com/AutoGPTQ/AutoGPTQ) is a Python library that provides a high-level API for quantizing and loading GPTQ models across a wide range of transformer architectures. It became the standard way to create GPTQ-quantized models on Hugging Face because it integrates with the Transformers library and supports model serialization to and from the Hub. In August 2023, Hugging Face incorporated AutoGPTQ support directly into the transformers library via the GPTQConfig class, making GPTQ a first-class quantization option alongside bitsandbytes.
GPTQModel (https://github.com/ModelCloud/GPTQModel) is a maintained fork of AutoGPTQ that has since diverged substantially from the original. It adds asymmetric quantization (potentially lower quantization error than symmetric), faster quantization speed, lower memory usage during quantization, broader architecture coverage including multimodal models (Qwen2-VL, Ovis), and support for additional hardware platforms including AMD ROCm, Apple Silicon, and Intel datacenter GPUs. Hugging Face's transformers documentation now recommends GPTQModel over AutoGPTQ for new projects, citing the lack of continued support in AutoGPTQ for new model families.
ExLlama is a Python/C++/CUDA inference library written by the developer turboderp, optimized for running 4-bit GPTQ models on consumer NVIDIA GPUs. It implemented the first substantially optimized GPTQ inference kernel, delivering significantly better throughput than the original GPTQ CUDA kernels. ExLlama became integrated into AutoGPTQ and Hugging Face Transformers as the default kernel for GPTQ models.
ExLlamaV2 followed with further kernel optimizations and expanded model support. It is activated in Transformers via the exllama_config parameter (set to version 2) in GPTQConfig.
Since August 2023, Hugging Face Transformers has provided native GPTQ support through the GPTQConfig class. Users can quantize a model directly:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    quantization_config=gptq_config,
)
Pre-quantized models can be loaded without specifying a quantization config:
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
The Marlin backend can be activated for additional speed by passing backend="marlin" to GPTQConfig.
A major factor in GPTQ's adoption was the systematic quantization effort by Tom Jobbins, known on Hugging Face as "TheBloke". Beginning in mid-2023, TheBloke quantized hundreds of popular open-source LLMs to GPTQ format and published them on the Hugging Face Hub with detailed model cards explaining the quantization parameters and expected performance tradeoffs. His repositories covering Llama 2, Mistral, Falcon, and many other models each attracted tens of thousands of downloads, building a de facto library of ready-to-use quantized models. This work reduced the barrier to running large models substantially: users could download a pre-quantized GPTQ model and load it with two lines of Python rather than running a multi-hour quantization job themselves.
vLLM, the high-throughput serving framework, supports GPTQ models including both the standard ExLlama-based path and the Marlin kernel path. For batch sizes above a few tokens, the Marlin-GPTQ combination in vLLM benchmarks around 712 tokens per second on a modern A100, compared to 741 tokens per second for Marlin-AWQ, with both exceeding FP16 throughput at those batch sizes due to memory savings freeing up resources for larger batches.
Running 70B models on a single GPU. A 70B model at FP16 requires at least two 80 GB A100s or four 40 GB A100s. GPTQ at 4-bit fits the weights on one 40 GB A100, and at 3-bit they fit on a single 32 GB GPU. Consumer cards with 24 GB, such as the NVIDIA RTX 3090 or 4090, are too small for the roughly 28-30 GB of 3-bit weights and need a second GPU. This is the most common reason practitioners reach for GPTQ.
Faster single-user inference. For chatbot-style applications with one user at a time, the memory bandwidth savings translate directly into token generation speed. A single user on a 4-bit model will often get higher tokens per second than on the FP16 model of the same family, despite slightly lower model quality.
Model serving at reduced GPU cost. Fitting a model on fewer GPUs reduces hardware rental costs for cloud-hosted inference. A deployment that previously required four A100s might run on two using GPTQ, cutting infrastructure cost roughly in half.
Research and local experimentation. Researchers with access to a single consumer GPU can run 13B or 34B models at 4-bit that would otherwise be inaccessible on their hardware. This has been important for the academic community working with open-source models.
Combined with fine-tuning adapters. While you cannot fine-tune the quantized weights themselves with GPTQ, you can add LoRA adapters via PEFT on top of a loaded GPTQ model. The Hugging Face PEFT library supports this workflow, though the ExLlama kernels should be disabled during the adapter training phase.
Calibration data sensitivity. GPTQ's accuracy depends on the calibration dataset matching the intended deployment distribution. The paper uses C4 or WikiText2 as defaults. If a model is quantized on a general-domain corpus but then used primarily for code generation, the quantization may not have optimally preserved the code-relevant weights. AWQ researchers have noted that GPTQ can show 2-4 perplexity point degradation when tested on a different domain from its calibration set, compared to less than 1 point for AWQ.
Long quantization time for large models. Quantizing a 175B model takes approximately 4 GPU hours on an A100. For a 70B model the time is roughly 1-2 hours. While this is a one-time cost, it requires access to high-memory GPUs and is impractical to repeat frequently. Users typically rely on pre-quantized model releases rather than quantizing from scratch.
GPU requirement. GPTQ's optimized kernels (ExLlama, Marlin) are CUDA-specific and require an NVIDIA GPU for full performance. Running GPTQ-quantized weights on CPU is possible through some implementations but is substantially slower and lacks the kernel-level optimizations that make GPTQ fast on GPU. Users wanting to run inference on CPU or Apple Silicon generally choose GGUF format instead. GPTQModel has added some CPU and AMD ROCm support, but NVIDIA GPU remains the primary target.
Not suited for fine-tuning quantized weights. GPTQ quantizes weights offline; those quantized integers cannot be updated by gradient descent without special handling. PEFT adapter methods work around this but add training complexity.
Performance variability with smaller models. For models under 7B parameters, the accuracy gap between GPTQ and RTN at 4-bit is smaller, and some benchmarks show marginal or no advantage for GPTQ over simpler methods. The second-order approach gives the largest benefit for very large matrices, which are more prevalent in larger models.
Kernel performance at larger batch sizes. The ExLlama kernel is optimized for single-sequence inference (batch size 1). At batch sizes above 16-32, the memory bandwidth advantage of quantized weights shrinks, and the Marlin kernel is needed to maintain throughput. For high-batch-size serving, careful kernel selection and benchmarking are necessary.
Marlin (Mixed Auto-Regressive Linear) is a highly optimized 4-bit GPTQ inference kernel for NVIDIA A100 and later GPUs, developed at IST DASLab (arXiv:2408.11743). Standard 4-bit matmul kernels are limited by the overhead of dequantization and do not saturate GPU compute at small batch sizes. Marlin redesigns the memory access pattern to tile work across GPU thread blocks in a way that keeps both the memory system and the arithmetic units busy simultaneously.
The result is near-ideal speedups: Marlin delivers approximately 3.87x throughput improvement over FP16 for batch sizes up to 16-32 tokens, compared to the theoretical maximum of 4x. Prior kernels achieved this only at batch size 1. Marlin is available as a backend option in Hugging Face Transformers via the backend="marlin" parameter in GPTQConfig and is supported in vLLM as the default GPTQ inference path for Ampere-class GPUs.
A follow-up extension, MARLIN (full caps, for Mixed-Precision Auto-Regressive Parallel Inference), generalizes the approach to W4A8 (4-bit weights, 8-bit activations) quantization for scenarios where activations are also quantized.
ExLlamaV3 (turboderp-org/exllamav3) extends the ExLlama series with the EXL3 weight format. EXL3 is a streamlined variant of QTIP (Quantization with Trellises and Incoherence Processing) from Cornell RelaxML, adapted for practical use. The format is designed to give better perplexity-per-bit than standard GPTQ at sub-4-bit configurations, particularly in the 2-3 bit range. As of 2025, EXL3 is in active development and has been discussed as a potential successor format for the community distribution of quantized models.
GPTQModel continues to add quantization methods beyond the original GPTQ algorithm, including ParoQuant, FOEM (First-Order Error Matters), and FP8 support, while also broadening hardware coverage. The library serves as the primary maintained implementation for the broader GPTQ family of methods.
Several research methods targeting the 2-bit regime have improved significantly over GPTQ at extreme compression. QuIP# (Cornell) and AQLM (Additive Quantization of Language Models) use lattice codebooks and learned codebook-based quantization respectively to push accuracy at 2-bit beyond what scalar quantization methods like GPTQ can achieve. These are more computationally expensive to apply and are primarily research systems rather than production tools.