# GPTQ

> Source: https://aiwiki.ai/wiki/gptq
> Updated: 2026-06-21
> Categories: AI Inference, Deep Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot post-training quantization method that compresses the weights of large language models to 3 or 4 bits using approximate second-order (Hessian-based) information, with what its authors call "negligible accuracy degradation relative to the uncompressed baseline."[1] Introduced by Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh at IST Austria's Distributed Algorithms and Systems Lab (DASLab) and ETH Zurich, GPTQ can quantize a 175-billion-parameter model in roughly four GPU hours and was the first method to "execute an 175 billion-parameter model inside a single GPU for generative inference."[1] The paper was accepted at ICLR 2023 (arXiv:2210.17323, submitted October 31, 2022) and has become one of the most widely adopted quantization approaches for running large models on consumer and data-center GPUs.[1]

## What problem does GPTQ solve?

As large language models grew from hundreds of millions to hundreds of billions of parameters, the computational and memory cost of inference scaled proportionally. A 70-billion-parameter model in 16-bit floating-point requires roughly 140 GB of GPU memory just to hold the weights, far exceeding what a single GPU can accommodate. Distributing inference across multiple GPUs adds latency, cost, and engineering complexity.

[Quantization](/wiki/quantization) addresses this by representing weights using fewer bits. A 4-bit integer uses one quarter of the storage of a 16-bit float, which in theory allows a 70B model to fit in around 35 GB. The challenge is doing this without destroying the model's accuracy.

The simplest approach, called round-to-nearest (RTN), independently rounds each weight to the nearest representable value in the target format. RTN is fast and requires no calibration data, but it accumulates per-layer quantization errors that compound across the depth of a transformer. At 4-bit precision, RTN often degrades perplexity modestly but manageably; at 3-bit it frequently fails catastrophically, producing output that is no longer coherent.[1]
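As a rough illustration of round-to-nearest, the sketch below quantizes each row of a random matrix to a uniform min/max grid. The function name `rtn_quantize` and the per-row grid are assumptions made for this example, not the scheme of any particular library.

```python
import numpy as np

def rtn_quantize(weights: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round each row of `weights` to a uniform grid with 2**bits levels.

    Returns the dequantized (floating-point) weights so the error can be measured.
    """
    levels = 2 ** bits - 1
    w_min = weights.min(axis=1, keepdims=True)
    w_max = weights.max(axis=1, keepdims=True)
    # Guard against constant rows, whose range would otherwise be zero.
    scale = np.maximum(w_max - w_min, 1e-12) / levels
    codes = np.clip(np.round((weights - w_min) / scale), 0, levels)
    return codes * scale + w_min

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 256))
for bits in (4, 3):
    max_err = np.abs(W - rtn_quantize(W, bits)).max()
    print(f"{bits}-bit max abs error: {max_err:.4f}")
```

Each weight is rounded independently of the inputs the layer will see, which is why RTN needs no calibration data and also why it cannot compensate for the error it introduces.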

Before GPTQ, the best alternatives to RTN came from the Optimal Brain Surgeon family of methods, which used second-order (Hessian-based) information to choose which weights to prune or quantize and how to compensate for the resulting errors.[14] These methods were accurate but so computationally expensive that they had only been applied to models with a few hundred million parameters at most. GPTQ's core contribution was to make this class of methods fast enough to run on models with 175 billion parameters.[1]

## What did the GPTQ paper report?

The paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" (Frantar et al., 2022) was submitted to arXiv on October 31, 2022, and accepted at ICLR 2023.[1] The official implementation is maintained by IST DASLab at https://github.com/IST-DASLab/gptq.[2] The abstract describes GPTQ as "a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient."[1]

The paper demonstrates that GPTQ can compress OPT-175B from 16-bit to 4-bit in approximately 4.2 GPU hours on a single NVIDIA A100, achieving a WikiText2 perplexity of 8.37 versus 8.34 for FP16, a gap of only 0.03. Round-to-nearest at the same bitwidth gives 10.54, a much larger 2.2-point degradation. At 3-bit, RTN on OPT-175B produces a perplexity above 7,000 (complete model collapse), while GPTQ holds at 8.68. These numbers were striking because no prior method had achieved anything close to this accuracy at 3-bit for a model of that scale.[1]

For inference throughput, the paper reports an end-to-end 3.25x generation speedup on an NVIDIA A100 and a 4.5x speedup on an A6000 when comparing GPTQ-compressed 3-bit models against FP16 baselines, primarily because less data needs to be transferred from GPU memory to compute units during the memory-bandwidth-bound autoregressive decoding phase.[1] The paper also shows the method "can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels."[1]

## How does the GPTQ algorithm work?

### Optimal Brain Quantization as the foundation

GPTQ builds on Optimal Brain Quantization (OBQ), which itself descends from Optimal Brain Surgeon (OBS), a technique from the early 1990s for pruning neural networks by removing the weights that cause the least increase in output error.[14] OBQ adapted this to quantization: rather than setting a weight to zero, it rounds the weight to a nearby quantized value, then adjusts remaining weights to compensate for the error introduced.[1]

For a single linear layer with weight matrix W and inputs X, OBQ minimizes the layer-wise reconstruction error:

```
min_{W_hat} ||WX - W_hat X||_F^2
```

The update rule after quantizing a single weight uses the inverse of the layer's Hessian matrix H = 2XX^T. After quantizing weight w_q, the remaining weights are updated as:

```
delta W = -(w_q - quant(w_q)) / [H^{-1}]_{qq} * H^{-1}_{:,q}
```

This compensates for the quantization error by redistributing it across the remaining, not-yet-quantized weights. OBQ's approach produces excellent results but has a per-layer runtime of O(d_row * d_col^3), which is prohibitive for weight matrices with tens of thousands of rows and columns.[1]
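A minimal numerical sketch of this update for one row of W follows. The fixed-step quantizer `quant`, the matrix sizes, and the synthetic correlated inputs are assumptions chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d_col, n_samples = 16, 256
# Correlated inputs: a shared component is added to every input feature.
X = rng.standard_normal((d_col, n_samples)) + rng.standard_normal((1, n_samples))
w = rng.standard_normal(d_col)  # one row of W

H = 2.0 * X @ X.T
H_inv = np.linalg.inv(H)

def quant(value: float, step: float = 0.5) -> float:
    """Hypothetical uniform quantizer with a fixed step size."""
    return float(np.round(value / step) * step)

def layer_error(w_hat: np.ndarray) -> float:
    """Squared reconstruction error ||wX - w_hat X||^2 for this row."""
    return float(np.sum((w @ X - w_hat @ X) ** 2))

q = 0                              # index of the weight being quantized
err_q = w[q] - quant(w[q])

# Plain rounding: change only w[q].
w_naive = w.copy()
w_naive[q] = quant(w[q])

# OBQ: change w[q] and shift all other weights to absorb the error.
delta = -(err_q / H_inv[q, q]) * H_inv[:, q]
w_obq = w + delta

print("rounding only :", layer_error(w_naive))
print("with OBQ delta:", layer_error(w_obq))
```

The update lands `w_obq[q]` exactly on the quantized value, and the remaining error equals `0.5 * err_q**2 / H_inv[q, q]`, which is never larger than the error from rounding alone.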

### Three key innovations

GPTQ introduces three algorithmic changes that collectively reduce the complexity to something tractable at billion-parameter scale.[1]

**Arbitrary order quantization.** OBQ quantizes weights greedily, picking the weight that causes the least error at each step and using a different optimal order for every row of the matrix. Frantar et al. observe that for large matrices, quantizing all rows in the same column order yields essentially the same accuracy as per-row optimal ordering. This is because the Hessian matrix H depends only on the layer's inputs X, not on the weights themselves, so the relative importance of columns is shared across rows. Fixing a single column order for all rows reduces the dominant cost from O(d_row * d_col^3) to O(max(d_row * d_col^2, d_col^3)), a substantial reduction.[1]

**Lazy batch updates.** Even after the ordering simplification, repeatedly updating the full remaining-weight matrix after each column quantization causes too many GPU memory reads and writes. GPTQ instead processes weights in blocks of 128 columns at a time. Within a block, updates to weights outside the block are deferred; the external weights are updated in a single pass after the block is done. This batching keeps the working set in fast on-chip memory and reduces global memory bandwidth by roughly an order of magnitude.[1]

**Cholesky decomposition for numerical stability.** Inverting large Hessian matrices repeatedly causes numerical errors to accumulate. GPTQ instead precomputes the Cholesky decomposition of the Hessian inverse once per layer at the start of quantization and reads the rows it needs from that factor. This keeps subsequent updates numerically stable and avoids the matrix inversion errors that would otherwise cause accuracy problems, especially on models with more than a few billion parameters. A small dampening term, set to 1% of the average diagonal value of H, is added to the diagonal of H before decomposition to handle near-zero eigenvalues.[1]
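The three changes can be put together in a compact NumPy sketch for a single layer. This is an illustrative reading of the description above, not the reference implementation: the helper names, the per-row min/max grid, the synthetic inputs, and the small block size are all assumptions.

```python
import numpy as np

def snap_to_grid(values, scale, offset, levels):
    """Round to a per-row uniform grid. `values` is (rows,) or (rows, cols);
    transposing lets the (rows,) scale/offset broadcast in both cases."""
    codes = np.clip(np.round((values.T - offset) / scale), 0, levels)
    return (codes * scale + offset).T

def gptq_quantize(W, X, bits=4, block_size=128, percdamp=0.01):
    """Sketch of GPTQ for one linear layer. W: (d_row, d_col), X: (d_col, n)."""
    W = W.astype(np.float64)  # astype copies, so the caller's W is untouched
    d_row, d_col = W.shape
    levels = 2 ** bits - 1
    offset = W.min(axis=1)
    scale = np.maximum(W.max(axis=1) - offset, 1e-12) / levels

    H = 2.0 * X @ X.T
    H[np.diag_indices(d_col)] += percdamp * np.mean(np.diag(H))  # dampening
    H_inv = np.linalg.inv(H)
    H_inv = (H_inv + H_inv.T) / 2        # remove round-off asymmetry
    U = np.linalg.cholesky(H_inv).T      # upper factor: H_inv = U.T @ U

    Q = np.zeros_like(W)
    for start in range(0, d_col, block_size):
        end = min(start + block_size, d_col)
        block_err = np.zeros((d_row, end - start))
        for j in range(start, end):      # same column order for every row
            q = snap_to_grid(W[:, j], scale, offset, levels)
            Q[:, j] = q
            err = (W[:, j] - q) / U[j, j]
            # Compensate only inside the current block for now.
            W[:, j:end] -= np.outer(err, U[j, j:end])
            block_err[:, j - start] = err
        # Lazy batch update: touch the columns right of the block once.
        W[:, end:] -= block_err @ U[start:end, end:]
    return Q, (scale, offset, levels)

rng = np.random.default_rng(0)
d_row, d_col, n = 16, 64, 512
mix = rng.standard_normal((d_col, 4))    # a few shared latent factors
X = mix @ rng.standard_normal((4, n)) + 0.5 * rng.standard_normal((d_col, n))
W = rng.standard_normal((d_row, d_col))

Q_gptq, grid = gptq_quantize(W, X, bits=4, block_size=16)
Q_rtn = snap_to_grid(W, *grid)           # RTN baseline on the same grid

def layer_error(W_hat):
    return float(np.sum((W @ X - W_hat @ X) ** 2))

print("RTN  error:", layer_error(Q_rtn))
print("GPTQ error:", layer_error(Q_gptq))
```

Both results use the same quantization grid, so any difference in reconstruction error comes from the error compensation rather than from the choice of levels. The inputs are deliberately correlated; with uncorrelated inputs H is close to diagonal and the compensation has little to work with.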

### Layer-wise processing

GPTQ processes one transformer layer at a time. For each layer, it records a batch of calibration activations by running a small dataset (typically 128 sequences from C4 or WikiText2) through the model up to that layer. These activations serve as the matrix X in the Hessian computation. The layer's weight matrix is then quantized using the three innovations above, and the algorithm moves to the next layer. Because each layer is handled independently and the calibration data can be discarded afterward, GPU memory usage during quantization stays relatively low.[1]
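Because calibration activations arrive in batches, the Hessian can be accumulated incrementally instead of materializing all of X at once. The class below is a sketch under that assumption; the name `HessianAccumulator` is hypothetical, and the normalization by sample count is a convenience that does not change the update rule, since the rule only uses ratios of entries of the inverse Hessian.

```python
import numpy as np

class HessianAccumulator:
    """Running estimate of H = (2 / N) * X @ X.T over calibration batches."""

    def __init__(self, d_col: int):
        self.H = np.zeros((d_col, d_col))
        self.n_samples = 0

    def add_batch(self, X_batch: np.ndarray) -> None:
        """X_batch has shape (d_col, batch_tokens)."""
        batch = X_batch.shape[1]
        total = self.n_samples + batch
        # Rescale the old average, then add the new batch's contribution.
        self.H *= self.n_samples / total
        self.H += (2.0 / total) * X_batch @ X_batch.T
        self.n_samples = total

rng = np.random.default_rng(0)
batches = [rng.standard_normal((32, 100)) for _ in range(3)]

acc = HessianAccumulator(d_col=32)
for X_batch in batches:
    acc.add_batch(X_batch)      # each batch can be discarded after this call

X_all = np.concatenate(batches, axis=1)
print(np.allclose(acc.H, 2.0 / X_all.shape[1] * X_all @ X_all.T))
```

Only the d_col by d_col matrix is kept between batches, which is one reason memory use during quantization stays modest.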

### Group quantization

The basic GPTQ algorithm uses a single scale and zero-point per output channel (per-channel quantization). Group quantization divides each weight row into smaller contiguous blocks, called groups, and assigns separate scale and zero-point parameters to each group. A typical group size is 128, meaning that for every 128 consecutive weights in a row, there is one scale and one offset.

Group quantization improves accuracy, particularly at 3-bit precision, because the local dynamic range within a group is usually narrower than the full row's range, allowing for better use of the available quantization levels. The cost is a small increase in storage: a 4-bit model with group size 128 stores one 16-bit scale per 128 weights, adding 0.125 bits per weight (roughly 3% on top of the 4-bit payload) compared to per-channel quantization, which is still far less than the original FP16 model.
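The effect of the narrower per-group range can be seen with a small round-to-nearest experiment. The function `quantize_groups` and the planted outlier are assumptions for illustration; GPTQ itself would additionally apply error compensation on top of this grid.

```python
import numpy as np

def quantize_groups(W, bits=4, group_size=128):
    """Round-to-nearest with one min/max grid per group of consecutive weights."""
    d_row, d_col = W.shape
    if d_col % group_size != 0:
        raise ValueError("d_col must be a multiple of group_size")
    levels = 2 ** bits - 1
    G = W.reshape(d_row, d_col // group_size, group_size)
    g_min = G.min(axis=2, keepdims=True)
    scale = np.maximum(G.max(axis=2, keepdims=True) - g_min, 1e-12) / levels
    codes = np.clip(np.round((G - g_min) / scale), 0, levels)
    return (codes * scale + g_min).reshape(d_row, d_col)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 512))
W[:, 0] = 20.0  # one outlier per row stretches the per-row range

per_row = quantize_groups(W, bits=3, group_size=512)  # one group spans the row
grouped = quantize_groups(W, bits=3, group_size=128)

mse_row = float(np.mean((W - per_row) ** 2))
mse_group = float(np.mean((W - grouped) ** 2))
scale_bits_per_weight = 16 / 128  # one FP16 scale shared by 128 weights

print(f"per-row MSE: {mse_row:.4f}")
print(f"grouped MSE: {mse_group:.4f}")
print(f"extra bits per weight for scales: {scale_bits_per_weight}")
```

Only the group containing the outlier pays for its wide range; the other groups get a much finer grid.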

### Activation ordering (act-order)

An optional heuristic called activation ordering (sometimes written as `act_order` in code) further improves accuracy, particularly for smaller models or aggressive bit-widths. Without act-order, GPTQ quantizes columns left-to-right. With act-order, columns are sorted in decreasing order of their contribution to the Hessian diagonal before quantization begins. In practice this means quantizing first the weight columns corresponding to activations with the highest variance, which are the ones that would cause the most damage if quantized poorly.
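The reordering itself is just a column permutation that must be undone after quantization. The sketch below shows that bookkeeping only; the quantization step is replaced by a placeholder copy, and the sizes and input scaling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_row, d_col, n = 4, 8, 64
# Give each input feature a different magnitude so the ordering is non-trivial.
X = rng.standard_normal((d_col, n)) * rng.uniform(0.1, 3.0, size=(d_col, 1))
W = rng.standard_normal((d_row, d_col))
H = 2.0 * X @ X.T

# Sort columns by decreasing Hessian diagonal (largest activations first).
perm = np.argsort(-np.diag(H))
W_perm = W[:, perm]
H_perm = H[np.ix_(perm, perm)]

# ... quantize W_perm column by column using H_perm ...
Q_perm = W_perm.copy()  # placeholder standing in for the quantized result

# Restore the original column order so the layer still matches its inputs.
inv_perm = np.argsort(perm)
Q = Q_perm[:, inv_perm]

print("quantization order:", perm)
print("order restored:", np.array_equal(Q, W))
```

Permuting both the rows and columns of H keeps it consistent with the permuted weight columns, since H is indexed by input feature on both axes.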

For [LLaMA](/wiki/llama)-7B at 4-bit, enabling both `--true-sequential` (which respects the sequential structure of the attention and MLP sublayers within a block rather than treating each transformer block as a single layer) and `--act-order` reduces WikiText2 perplexity from around 7.15 to 6.09, a substantial improvement.[2] For larger models the benefit of act-order is smaller because the per-column variance is already more evenly distributed.

Act-order has a practical downside: when used without `--static-groups`, the column reordering interacts with group boundaries and requires tracking a permutation during inference, which can add latency. The `--static-groups` option precomputes group grids relative to the reordered columns, removing this inference-time cost.[2]

## What are the bit-width variants?

### 4-bit quantization

4-bit is GPTQ's primary target and the configuration most commonly used in practice. At 4-bit, models typically retain perplexity within 0.1-0.5 points of the FP16 baseline, depending on model size and architecture. Larger models quantize more cleanly than smaller ones: a 70B model at 4-bit is usually indistinguishable from FP16 on most benchmarks, while a 7B model may show a more noticeable degradation. In terms of storage, 4-bit quantization with group size 128 reduces a model's weight footprint to roughly 4.1 bits per parameter on average when accounting for the group scales.

### 3-bit quantization

At 3-bit, GPTQ's advantage over RTN is most dramatic. RTN typically collapses at 3-bit for models under 30B parameters, while GPTQ maintains usable quality.[1] On OPT-66B, act-order drops WikiText2 perplexity from 14.16 to 9.95 at 3-bit, a difference of more than 4 points.[1] The practical use case for 3-bit is fitting very large models into constrained GPU memory: a 70B model at 3-bit (with group scales) occupies roughly 28-30 GB, fitting on a 32 GB GPU.

### 2-bit and ternary quantization

The paper also explores 2-bit and ternary (-1, 0, +1) quantization, primarily to demonstrate the algorithm's behavior under extreme compression rather than as a production-ready configuration. At 2-bit, OPT-175B achieves a perplexity of around 10-12 depending on the group size, which remains coherent. Ternary quantization of OPT-175B gives a perplexity of approximately 9.2, notable mainly because ternary weights can be implemented with extremely efficient custom hardware using addition and subtraction rather than multiplication.[1]

In practice, 2-bit models show noticeable quality degradation for most tasks and are rarely used for general-purpose inference. They remain relevant for research into hardware-optimized inference and as a reference point for the accuracy ceiling of post-training methods at extreme compression ratios.

## How does GPTQ compare with other quantization methods?

The table below summarizes the main trade-offs between GPTQ and the other widely deployed quantization methods for LLMs.

| Method | Calibration required | Quantization time | GPU memory at inference | Supports fine-tuning | Best use case |
|---|---|---|---|---|---|
| [GPTQ](/wiki/gptq) | Yes (small dataset) | Hours (175B model) | ~4x reduction at 4-bit | Via adapters (PEFT) | GPU inference, large model compression |
| [AWQ](/wiki/awq) | Yes (small dataset) | Minutes | ~4x reduction at 4-bit | Via adapters | GPU inference, better generalization |
| bitsandbytes (NF4) | No | None (on-the-fly) | ~4x reduction | Yes (QLoRA) | Fine-tuning, quick deployment |
| RTN | No | Seconds | ~4x reduction at 4-bit | N/A | Fast prototyping, large models |
| GGUF (llama.cpp) | No | Minutes | Varies (Q4 to Q8) | No | CPU inference, Apple Silicon |

### GPTQ vs RTN

Round-to-nearest is fast and requires no calibration data, but at 3-bit or below its accuracy falls so far behind GPTQ that RTN is effectively unusable.[1] At 4-bit, GPTQ typically outperforms RTN by 1-3 perplexity points for sub-30B models, which often translates to measurable downstream benchmark differences.[1]

### GPTQ vs AWQ

[AWQ](/wiki/awq) (Activation-aware Weight Quantization, Lin et al., 2023) takes a different approach to protecting accuracy: it identifies a small fraction of salient weight channels by examining activation magnitudes, then scales those channels up before quantization to reduce their relative error.[4] AWQ is generally faster to apply than GPTQ (minutes rather than hours for large models) and has been shown to generalize better to out-of-distribution inputs.[4] GPTQ uses second-order information to compensate errors across the entire weight matrix, which can give it a slight edge on in-distribution benchmarks but may overfit to the calibration distribution.

In practice, the accuracy difference between GPTQ and AWQ at 4-bit is small for most tasks. AWQ tends to be preferred when quantization speed matters or when the deployment requires robustness to diverse input domains. GPTQ has a larger catalog of pre-quantized models on the Hugging Face Hub, built primarily through the community contributions of Tom "TheBloke" Jobbins, who quantized hundreds of popular models during 2023.

### GPTQ vs bitsandbytes

bitsandbytes (developed by Tim Dettmers) quantizes weights on the fly during model loading using NF4 (NormalFloat4), a data type designed specifically for normally distributed neural network weights.[3] Because bitsandbytes does not require an offline calibration pass, it is the simplest way to load a large model in reduced precision. Its primary advantage is compatibility with QLoRA fine-tuning: the 4-bit base model stays frozen while low-rank adapters are trained on top, enabling fine-tuning of very large models on a single GPU.

The main tradeoff is inference speed. bitsandbytes is slower than GPTQ with optimized kernels (ExLlama or Marlin) for pure generation workloads, because the on-the-fly dequantization is not as efficiently fused as the offline-prepared GPTQ kernels.[6]

## What are the performance and memory tradeoffs?

The memory savings from GPTQ are straightforward to calculate: a 70B model at FP16 uses about 140 GB. At 4-bit with group size 128, the weights take up roughly 35-37 GB, fitting on a single 40 GB A100 or two consumer 24 GB GPUs.
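The arithmetic can be written as a small helper. The function name and the storage layout (one 16-bit scale plus one zero-point of the same width as the weights per group) are assumptions; real checkpoint formats differ in how they pack this metadata, and the estimate ignores activations and the KV cache.

```python
def weight_footprint_gb(n_params, bits, group_size=None, scale_bits=16, zero_bits=None):
    """Approximate weight storage in GB (10**9 bytes).

    With group_size set, each group is assumed to carry one scale and one
    zero-point; zero_bits defaults to the weight bit-width.
    """
    if zero_bits is None:
        zero_bits = bits
    bits_per_weight = float(bits)
    if group_size is not None:
        bits_per_weight += (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9

n_params = 70e9
print(f"FP16:             {weight_footprint_gb(n_params, 16):.1f} GB")
print(f"4-bit, no groups: {weight_footprint_gb(n_params, 4):.1f} GB")
print(f"4-bit, g=128:     {weight_footprint_gb(n_params, 4, group_size=128):.1f} GB")
```

Under these assumptions the grouped 4-bit figure comes out a little above 36 GB, inside the 35-37 GB range quoted above.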

The inference speed story is more nuanced. LLM text generation is memory-bandwidth bound during the autoregressive decoding phase: the GPU spends most of its time loading weight matrices from HBM memory rather than performing arithmetic. A 4-bit representation has one quarter the bytes to transfer, which in principle gives a 4x speedup. In practice, the realized speedup at batch size 1 is lower, because dequantization takes some compute, and it depends heavily on how efficiently the quantized kernels are implemented.
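The bandwidth argument can be turned into a back-of-the-envelope bound. The helper below is a sketch that assumes every weight is read exactly once per generated token and ignores the KV cache, activations, and dequantization cost; the 13B parameter count and the round 2,000 GB/s bandwidth figure are illustrative assumptions, not measurements.

```python
def decode_tokens_per_second_bound(n_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on batch-1 decode speed if weights are read once per token."""
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes

n_params = 13e9          # assumed model size
bandwidth_gb_s = 2000    # assumed memory bandwidth in GB/s

bound_fp16 = decode_tokens_per_second_bound(n_params, 16, bandwidth_gb_s)
bound_int4 = decode_tokens_per_second_bound(n_params, 4, bandwidth_gb_s)

print(f"FP16 bound:  {bound_fp16:.1f} tokens/s")
print(f"4-bit bound: {bound_int4:.1f} tokens/s")
print(f"ideal ratio: {bound_int4 / bound_fp16:.1f}x")
```

The ratio is exactly the ratio of bit-widths, which is where the 4x ceiling comes from; measured speedups fall below it by however much the kernel spends on anything other than streaming weights.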

At larger batch sizes the picture changes. When many sequences are processed simultaneously, the arithmetic intensity (compute per byte) increases, and the memory bandwidth advantage of quantized weights shrinks. This is why the ExLlama kernel provides large benefits at batch size 1 but does not scale as well to batch sizes above 16-32. The Marlin kernel (discussed under "What came after GPTQ?") addresses this by using a different tiling strategy.[5]

A benchmark from the Hugging Face integration blog, run on a single NVIDIA A100-SXM4-80GB GPU with batch size 1, prompt length 512, and 512 output tokens, shows:[6]

| Configuration | Per-token latency (ms) | Throughput (tokens/s) | Peak memory (MB) |
|---|---|---|---|
| FP16 baseline | 36.96 | 27.06 | 29,153 |
| GPTQ 4-bit (ExLlama kernel) | 33.71 | 29.66 | 10,484 |
| GPTQ 4-bit (AutoGPTQ legacy kernel) | 46.44 | 21.53 | 10,345 |

The ExLlama kernel here is both faster and uses less memory than FP16, while the older AutoGPTQ kernel is slower despite using less memory, illustrating how much kernel quality matters.[6]
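For reference, the relative speed and memory figures implied by the table can be computed directly from its rows. The snippet only restates the numbers above; the dictionary layout is an arbitrary choice for this example.

```python
# (per-token latency in ms, throughput in tokens/s, peak memory in MB)
rows = {
    "FP16 baseline": (36.96, 27.06, 29153),
    "GPTQ 4-bit (ExLlama kernel)": (33.71, 29.66, 10484),
    "GPTQ 4-bit (AutoGPTQ legacy kernel)": (46.44, 21.53, 10345),
}

base_latency, _, base_memory = rows["FP16 baseline"]
speedup = {name: base_latency / lat for name, (lat, _, _) in rows.items()}
memory_ratio = {name: base_memory / mem for name, (_, _, mem) in rows.items()}

for name in rows:
    print(f"{name}: {speedup[name]:.2f}x speed, {memory_ratio[name]:.2f}x less memory")
```

On this benchmark the ExLlama kernel is about 1.1x faster than FP16 while using roughly 2.8x less peak memory, and the legacy kernel is about 0.8x the FP16 speed.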

## Ecosystem and adoption

### IST DASLab reference implementation

The original implementation at https://github.com/IST-DASLab/gptq covers OPT, BLOOM, and LLaMA families with Python and CUDA code.[2] The repository includes evaluation scripts for WikiText2, PTB, and C4 perplexity, as well as zero-shot task benchmarks via the lm-evaluation-harness.[2] It serves primarily as a research reference; production users generally use one of the downstream libraries described below.

### GPTQ-for-LLaMa

Shortly after the original paper, the community contributor qwopqwop200 released GPTQ-for-LLaMa (https://github.com/qwopqwop200/GPTQ-for-LLaMa), a fork adapted specifically for [LLaMA](/wiki/llama) and its instruction-tuned variants.[13] This project was the first to demonstrate GPTQ working on the LLaMA models that had been released in early 2023, and it attracted significant attention in the open-source community. Several forks appeared in parallel, including versions by oobabooga that integrated with the text-generation-webui project.

### AutoGPTQ

AutoGPTQ (https://github.com/AutoGPTQ/AutoGPTQ) is a Python library that provides a high-level API for quantizing and loading GPTQ models across a wide range of transformer architectures.[8] It became the standard way to create GPTQ-quantized models on Hugging Face because it integrates with the Transformers library and supports model serialization to and from the Hub. In August 2023, Hugging Face incorporated AutoGPTQ support directly into the transformers library via the `GPTQConfig` class, making GPTQ a first-class quantization option alongside bitsandbytes.[6]

### GPTQModel

GPTQModel (https://github.com/ModelCloud/GPTQModel) is a maintained fork of AutoGPTQ that has since diverged substantially from the original.[9] It adds asymmetric quantization (potentially lower quantization error than symmetric), faster quantization speed, lower memory usage during quantization, broader architecture coverage including multimodal models (Qwen2-VL, Ovis), and support for additional hardware platforms including AMD ROCm, Apple Silicon, and Intel datacenter GPUs.[9] Hugging Face's transformers documentation now recommends GPTQModel over AutoGPTQ for new projects, citing the lack of continued support in AutoGPTQ for new model families.[7]

### ExLlama and ExLlamaV2

ExLlama is a Python/C++/CUDA inference library written by the developer turboderp, optimized for running 4-bit GPTQ models on consumer NVIDIA GPUs.[10] It implemented the first substantially optimized GPTQ inference kernel, delivering significantly better throughput than the original GPTQ CUDA kernels. ExLlama became integrated into AutoGPTQ and [Hugging Face Transformers](/wiki/transformers_library) as the default kernel for GPTQ models.[6]

ExLlamaV2 followed with further kernel optimizations and expanded model support.[10] It is activated in Transformers via the `exllama_config` parameter (set to version 2) in `GPTQConfig`.[7]

### Hugging Face Transformers integration

Since August 2023, [Hugging Face Transformers](/wiki/transformers_library) has provided native GPTQ support through the `GPTQConfig` class.[7] Users can quantize a model directly:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    quantization_config=gptq_config
)
```

Pre-quantized models can be loaded without specifying a quantization config:

```python
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)
```

The Marlin backend can be activated for additional speed by passing `backend="marlin"` to `GPTQConfig`.[7]

### TheBloke and community model distribution

A major factor in GPTQ's adoption was the systematic quantization effort by Tom Jobbins, known on Hugging Face as "TheBloke". Beginning in mid-2023, TheBloke quantized hundreds of popular open-source LLMs to GPTQ format and published them on the Hugging Face Hub with detailed model cards explaining the quantization parameters and expected performance tradeoffs. His repositories covering Llama 2, Mistral, Falcon, and many other models each attracted tens of thousands of downloads, building a de facto library of ready-to-use quantized models. This work substantially lowered the barrier to running large models: users could download a pre-quantized GPTQ model and load it with two lines of Python rather than running a multi-hour quantization job themselves.

### vLLM integration

vLLM, the high-throughput serving framework, supports GPTQ models through both the standard ExLlama-based path and the Marlin kernel path. In one reported benchmark on an A100 at batch sizes above a few tokens, the Marlin-GPTQ combination in vLLM reaches around 712 tokens per second, compared to 741 tokens per second for Marlin-AWQ. Both exceed FP16 throughput at those batch sizes, because the memory saved on weights frees up room for larger batches.

## What is GPTQ used for?

**Running 70B models on a single GPU.** A 70B model at FP16 requires at least two 80 GB A100s or four 40 GB A100s. GPTQ at 4-bit fits it on one 40 GB A100, and at 3-bit it fits on a single 32 GB GPU such as an NVIDIA RTX 5090; 24 GB cards such as the RTX 3090 or 4090 are too small for the roughly 28-30 GB of 3-bit weights. This is the most common reason practitioners reach for GPTQ.

**Faster single-user inference.** For chatbot-style applications with one user at a time, the memory bandwidth savings translate directly into token generation speed. A single user on a 4-bit model will often get higher tokens per second than on the FP16 model of the same family, despite slightly lower model quality.

**Model serving at reduced GPU cost.** Fitting a model on fewer GPUs reduces hardware rental costs for cloud-hosted inference. A deployment that previously required four A100s might run on two using GPTQ, cutting infrastructure cost roughly in half.

**Research and local experimentation.** Researchers with access to a single consumer GPU can run 13B or 34B models at 4-bit that would otherwise be inaccessible on their hardware. This has been important for the academic community working with open-source models.

**Combined with fine-tuning adapters.** While you cannot fine-tune the quantized weights themselves with GPTQ, you can add [LoRA](/wiki/lora) adapters via PEFT on top of a loaded GPTQ model. The Hugging Face PEFT library supports this workflow, though the ExLlama kernels should be disabled during the adapter training phase.[7]

## What are the limitations of GPTQ?

**Calibration data sensitivity.** GPTQ's accuracy depends on the calibration dataset matching the intended deployment distribution. The paper uses C4 or WikiText2 as defaults. If a model is quantized on a general-domain corpus but then used primarily for code generation, the quantization may not have optimally preserved the code-relevant weights. AWQ researchers have noted that GPTQ can show 2-4 perplexity point degradation when tested on a different domain from its calibration set, compared to less than 1 point for AWQ.[4]

**Long quantization time for large models.** Quantizing a 175B model takes approximately 4 GPU hours on an A100. For a 70B model the time is roughly 1-2 hours. While this is a one-time cost, it requires access to high-memory GPUs and is impractical to repeat frequently. Users typically rely on pre-quantized model releases rather than quantizing from scratch.[1]

**GPU requirement.** GPTQ's optimized kernels (ExLlama, Marlin) are CUDA-specific and require an NVIDIA GPU for full performance. Running GPTQ-quantized weights on CPU is possible through some implementations but is substantially slower and lacks the kernel-level optimizations that make GPTQ fast on GPU. Users wanting to run inference on CPU or Apple Silicon generally choose GGUF format instead. GPTQModel has added some CPU and AMD ROCm support, but NVIDIA GPU remains the primary target.[9]

**Not suited for fine-tuning quantized weights.** GPTQ quantizes weights offline; those quantized integers cannot be updated by gradient descent without special handling. PEFT adapter methods work around this but add training complexity.

**Performance variability with smaller models.** For models under 7B parameters, the accuracy gap between GPTQ and RTN at 4-bit is smaller, and some benchmarks show marginal or no advantage for GPTQ over simpler methods. The second-order approach gives the largest benefit for very large matrices, which are more prevalent in larger models.

**Kernel performance at larger batch sizes.** The ExLlama kernel is optimized for single-sequence inference (batch size 1). At batch sizes above 16-32, the memory bandwidth advantage of quantized weights shrinks, and the Marlin kernel is needed to maintain throughput.[5] For high-batch-size serving, careful kernel selection and benchmarking are necessary.

## What came after GPTQ?

### Marlin

Marlin (Mixed Auto-Regressive Linear) is a highly optimized 4-bit GPTQ inference kernel for NVIDIA A100 and later GPUs, developed at IST DASLab (arXiv:2408.11743).[5][12] Standard 4-bit matmul kernels are limited by the overhead of dequantization and do not saturate GPU compute at small batch sizes. Marlin redesigns the memory access pattern to tile work across GPU thread blocks in a way that keeps both the memory system and the arithmetic units busy simultaneously.[5]

The result is near-ideal speedups: Marlin delivers approximately 3.87x throughput improvement over FP16 for batch sizes up to 16-32 tokens, compared to the theoretical maximum of 4x.[5] Prior kernels achieved this only at batch size 1. Marlin is available as a backend option in Hugging Face Transformers via the `backend="marlin"` parameter in `GPTQConfig` and is supported in vLLM as the default GPTQ inference path for Ampere-class GPUs.[7]

A follow-up extension, MARLIN (full caps, for Mixed-Precision Auto-Regressive Parallel Inference), generalizes the approach to W4A8 (4-bit weights, 8-bit activations) quantization for scenarios where activations are also quantized.[5]

### ExLlamaV3 and EXL3

ExLlamaV3 (turboderp-org/exllamav3) extends the ExLlama series with the EXL3 weight format.[11] EXL3 is a streamlined variant of QTIP (Quantization with Trellises and Incoherence Processing) from Cornell RelaxML, adapted for practical use. The format is designed to give better perplexity-per-bit than standard GPTQ at sub-4-bit configurations, particularly in the 2-3 bit range. As of 2025, EXL3 is in active development and has been discussed as a potential successor format for the community distribution of quantized models.[11]

### GPTQModel evolution

GPTQModel continues to add quantization methods beyond the original GPTQ algorithm, including ParoQuant, FOEM (First-Order Error Matters), and FP8 support, while also broadening hardware coverage.[9] The library serves as the primary maintained implementation for the broader GPTQ family of methods.

### QuIP# and AQLM

Several research methods targeting the 2-bit regime have improved significantly over GPTQ at extreme compression. QuIP# (Cornell) and AQLM (Additive Quantization of Language Models) use lattice codebooks and learned additive codebooks, respectively, to push accuracy at 2-bit beyond what scalar quantization methods like GPTQ can achieve. These are more computationally expensive to apply and are primarily research systems rather than production tools.

## See also

- [Quantization](/wiki/quantization)
- [AWQ](/wiki/awq)
- [LLaMA](/wiki/llama)
- [Hugging Face Transformers](/wiki/transformers_library)
- [LoRA](/wiki/lora)
- [bitsandbytes](/wiki/bitsandbytes)
- [vLLM](/wiki/vllm)
- [GGUF](/wiki/gguf)
- [QLoRA](/wiki/qlora)
- [Post-training quantization](/wiki/post_training_quantization)

## References

1. Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323. ICLR 2023. https://arxiv.org/abs/2210.17323
2. IST DASLab. GPTQ Reference Implementation. GitHub. https://github.com/IST-DASLab/gptq
3. Dettmers, T., Lewis, M., Shleifer, S., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339.
4. Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978. MLSys 2024.
5. Frantar, E., Castro, R. L., Chen, J., Hoefler, T., & Alistarh, D. (2024). MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models. arXiv:2408.11743.
6. Hugging Face. Making LLMs lighter with AutoGPTQ and transformers. https://huggingface.co/blog/gptq-integration
7. Hugging Face. GPTQ Quantization Documentation. https://huggingface.co/docs/transformers/en/quantization/gptq
8. AutoGPTQ. GitHub. https://github.com/AutoGPTQ/AutoGPTQ
9. GPTQModel. GitHub. https://github.com/ModelCloud/GPTQModel
10. turboderp. ExLlamaV2. GitHub. https://github.com/turboderp-org/exllamav2
11. turboderp. ExLlamaV3. GitHub. https://github.com/turboderp-org/exllamav3
12. IST DASLab. Marlin kernel. GitHub. https://github.com/IST-DASLab/marlin
13. qwopqwop200. GPTQ-for-LLaMa. GitHub. https://github.com/qwopqwop200/GPTQ-for-LLaMa
14. Hassibi, B., Stork, D. G., & Wolff, G. J. (1993). Optimal Brain Surgeon and general network pruning. IEEE International Conference on Neural Networks.