ExLlamaV2 (EXL2)

AI Inference Developer Tools Open Source AI

24 min read

Updated Jun 27, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 27, 2026

Fact-checked

In review queue

Sources

21 citations

Revision

v3 · 4,751 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

EXL2 (ExLlamaV2 format) is an open-source, mixed-bit weight-quantization format for compressing large language models so they run fast on a single consumer-class NVIDIA GPU. It is the native format of the ExLlamaV2 inference library created by the developer who publishes under the pseudonym turboderp, and it extends the GPTQ algorithm by letting different parts of a model be stored at different precisions to hit any average target between 2 and 8 bits per weight.^[1] In the words of the official ExLlamaV2 README, "EXL2 is based on the same optimization method as GPTQ and supports 2, 3, 4, 5, 6 and 8-bit quantization," and "the format allows for mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight" so that "more important weights (columns) are quantized with more bits."^[1]

ExLlamaV2 is the successor to the original ExLlama project from 2023 and was, through 2023 to 2025, among the fastest open-source backends for running quantized LLaMA and similar decoder-only transformers on consumer cards such as the RTX 3090 and RTX 4090.^[1]^[3] It was first written for the LLaMA family and has since added support for many other architectures including Mistral, Mixtral, Qwen, and Gemma variants.^[3] The library reached version 0.3.2 on 2025-07-13, after which the project was archived in favor of a successor project, ExLlamaV3, whose EXL3 format moves from GPTQ-derived measurement to a streamlined variant of the QTIP trellis quantizer from Cornell RelaxML.^[3]^[4]^[7]

Infobox

Field	Value
Project	ExLlamaV2
Format	EXL2 (mixed-bit weight quantization)
Developer	turboderp (GitHub: turboderp-org)
First preliminary release	2023-08-12
Latest release	0.3.2 (2025-07-13)
License	MIT
Primary language	Python, CUDA, C++
Successor	ExLlamaV3 (EXL3)
Reference backend	TabbyAPI
Repository	github.com/turboderp-org/exllamav2

^[1]^[3]^[4]^[5]

What is EXL2?

EXL2 is the weight-quantization file format used by ExLlamaV2, a fast inference library for running LLMs locally on modern consumer-class GPUs.^[1]^[2] A quantized model in EXL2 form stores its weights at low bit precision (an average of roughly 2 to 8 bits per weight instead of the 16 bits of a half-precision model), which shrinks the file and the VRAM footprint enough that large models fit on a single graphics card. The format is built around the GPTQ optimization method but generalizes it: rather than fixing one bit width for the whole model, EXL2 mixes bit widths so that the most sensitive weights keep more precision and the least sensitive weights are compressed more aggressively.^[1]^[9] The result is described by the project as "a new quantization format" that supports "2, 3, 4, 5, 6 and 8-bit quantization" and allows "mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight."^[1]

The term EXL2 refers specifically to the format; ExLlamaV2 (sometimes abbreviated EXL2 informally) is the library that produces and runs it. Both are maintained by turboderp under the MIT license, with source on GitHub at github.com/turboderp-org/exllamav2.^[1]^[3]

History

The project began as ExLlama, a memory-efficient rewrite of the Hugging Face Transformers implementation of LLaMA aimed at quantized weights, published on GitHub by turboderp in mid-2023.^[6] The author describes ExLlama as "a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights."^[6] The original ExLlama targeted GPTQ models on a single CUDA GPU and quickly became the fastest open-source backend for 4-bit LLaMA-1 and LLaMA-2 inference on consumer cards such as the RTX 3090 and RTX 4090.^[3]^[6]

A preliminary, tentative release of ExLlamaV2 was tagged on 2023-08-12.^[4] The rewrite introduced the EXL2 quantization format, a new generator with paged batching, and reworked CUDA kernels. Through 2024 and 2025, ExLlamaV2 added support for further model families and multimodal inputs. The 0.2.3 release on 2024-09-29 removed the safetensors hard dependency and added the XTC sampler and YaRN context extension; 0.2.4 (2024-11-12) added Pixtral and refactored the multimodal pipeline; 0.2.5 through 0.2.7 (December 2024) added Qwen2-VL image then basic video support and Cohere2 and Granite3 architectures; 0.2.8 (2025-02-08) added Qwen2.5-VL; 0.2.9 (2025-04-23) added Gemma 3, Mistral 3.1 (text and vision), and GLM4; and 0.3.0 (2025-05-12) added Qwen3 and Qwen3-MoE.^[4]

In April 2025 turboderp announced an early preview of ExLlamaV3, a fresh codebase using a streamlined variant of the QTIP quantizer from Cornell RelaxML rather than EXL2's GPTQ-derived measurement loop, and computing Hessians on the fly with a fused Viterbi kernel.^[7] ExLlamaV2's final tagged release (0.3.2) shipped on 2025-07-13, and the repository was subsequently archived with development continuing in ExLlamaV3.^[3]^[4]

The EXL2 author should not be confused with Tim Dettmers, who created QLoRA and the bitsandbytes library; those projects are unrelated to ExLlamaV2 and pursue a different quantization approach.

How does EXL2 quantization work?

Relation to GPTQ

EXL2 inherits the core optimization step from the GPTQ algorithm of Frantar, Ashkboos, Hoefler, and Alistarh (arXiv 2210.17323, October 2022, ICLR 2023).^[8] GPTQ is a one-shot post-training weight-quantization method based on approximate second-order (Hessian-based) information, capable of compressing GPT models with 175 billion parameters in roughly four GPU hours down to 3 to 4 bits per weight with little accuracy degradation.^[8] GPTQ typically quantizes an entire linear layer (or fixed-size groups within it) to a single bit-width.

EXL2 generalizes this in two ways. First, it allows the converter to mix multiple bit widths and group sizes within the same matrix: more sensitive columns or sub-blocks can be stored at higher precision and less sensitive ones at lower precision.^[1]^[9] As the README puts it, "more important weights (columns) are quantized with more bits."^[1] Second, the converter searches over the available per-row settings to hit a user-specified average bits-per-weight (bpw) target while minimizing total reconstruction error.^[9] The official ExLlamaV2 README describes the result as "a new quantization format" that "supports 2, 3, 4, 5, 6 and 8-bit quantization" and allows "mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight."^[1]

Calibration and the two-pass converter

The bundled convert.py script runs in two phases.^[9] The first pass is a measurement phase: the converter requantizes each module of the model roughly twelve times under different bit-width and group-size choices and records the resulting per-module quantization error against a calibration set.^[9] The second pass uses those measurements to select, per row, the actual quantization parameters that minimize total error subject to the target average bits-per-weight, then writes the quantized weights to disk.^[9]

Calibration data may be supplied by the user in Parquet format or drawn from the converter's built-in default set. The script concatenates the supplied text into one long string and uses the first rows * length tokens as calibration material, with default settings of 100 rows of 2048 tokens each for the quantization pass and 16 rows by default during the measurement pass.^[9] WikiText is a common third-party choice for the calibration corpus.^[10]

The -b flag sets the target average bits per weight, and -hb controls the precision of the output (head) layer specifically, with permitted values of 2, 3, 4, 5, 6, or 8 bits.^[9] The documentation notes that the nominal 6-bit setting in practice produces a mixed-precision quantization averaging about 6.3 bits per weight, because of the per-row search.^[9] After conversion, the script reports a calibration perplexity; values above roughly 30 indicate problems and values in the thousands indicate complete failure of the run.^[9]

Hardware requirements for the conversion itself depend on model width rather than depth: converting a 70B model requires roughly 64 GB of system RAM and 24 GB of CUDA VRAM, while a 7B model needs about 16 GB of RAM and 8 GB of VRAM.^[9]

Supported bit widths and file structure

EXL2 supports per-row choices among 2, 3, 4, 5, 6, and 8-bit quantization, combined to produce average rates that, in practice, fall in the 2.5 to 8 bpw range; published quants on Hugging Face commonly include 2.4, 2.55, 3.0, 4.0, 4.25, 4.65, 5.0, 6.0, and 8.0 bpw variants.^[1]^[10]^[11] The format stores quantized weights, scales, and a metadata table describing the per-row bit allocation in a single safetensors-compatible file. Per ExLlamaV3 documentation, EXL2 renames some tensors during conversion to fit a Llama-style structure, a behavior that ExLlamaV3 explicitly drops to ease portability to other frameworks.^[7]

Supported model architectures

ExLlamaV2 began with LLaMA-1 architecture support inherited from the original ExLlama, then extended to a long list of decoder-only families through its release history.^[3]^[4] The list of architectures with first-class support, in roughly the order they were added, includes Llama 1 and 2; Mistral 7B and Mixtral 8x7B / 8x22B mixture-of-experts; Llama 3 and 3.1; Qwen 1 / 1.5 / 2; Gemma 1 / 2; Cohere2 (Aya / Command R); Granite3; Qwen2-VL and Qwen2.5-VL (multimodal); Gemma 3 (text and vision); Mistral 3.1 (text and vision); GLM4; and Qwen3 / Qwen3-MoE.^[4] Pixtral support was added in 0.2.4 and basic Qwen2-VL video support in 0.2.7.^[4] Architectures supported only at FP16 in some intermediate releases may not have a full EXL2 quantization path until a later version, so users are advised to consult release notes for the specific architecture they want to quantize.^[4]

FlashAttention and paged attention

ExLlamaV2 relies on FlashAttention for its attention kernels. From version 0.0.21 onward, ExLlamaV2 supports paged attention via FlashAttention 2.5.7 or newer, which enables non-contiguous KV cache storage and underlies the library's dynamic batching generator.^[1] The dynamic generator added in the same line of releases supports continuous batching, smart prompt caching, K/V cache deduplication, asyncio-streamed generation, and consolidates earlier single and batched generators behind one API.^[1] The 4-bit ("Q4") KV cache mode is supported in the dynamic generator and is reported to perform better than the older FP8 cache mode that was dropped in the consolidation.^[1]

Paged attention as used by ExLlamaV2 is the same conceptual mechanism introduced by PagedAttention in vLLM, namely splitting the KV cache into fixed-size blocks that can be allocated non-contiguously, but ExLlamaV2 accesses it through the FlashAttention 2.5.7+ implementation rather than vLLM's own kernels.^[1]

Dynamic generator: continuous batching, prefix caching, and deduplication

The dynamic generator implements continuous batching using paged attention. Rather than the static, pre-allocated caches of the older generator, the dynamic generator maintains a job queue and activates as many jobs as fit in the current cache, freeing pages as jobs complete.^[19] Because pages are managed by a block table rather than by contiguous slabs, the only memory wasted is what is needed to align sequences to page boundaries, currently 256 tokens; this eliminates the padding waste of fixed-batch generators.^[19]

The same page table lets the dynamic generator perform two related kinds of cache reuse. When multiple prompts share a common prefix, such as a fixed system message, the generator points multiple sequences at the same cached keys and values rather than recomputing them, sharply reducing prefill time and VRAM use.^[19] Separately, recently used pages are not immediately freed when a job completes; the generator can later reattach a new job to those pages if it sees the same prefix, enabling effective prompt caching across successive chat turns.^[19] The Python API exposes both single completions and batched completions via generator.generate(prompt=...) and a streaming iterate() method that yields token-level result dictionaries keyed by user-supplied job identifiers.^[19]

Tensor parallelism and multi-GPU inference

ExLlamaV2 added tensor-parallel inference in autumn 2024, joining its existing pipeline-parallel mode for splitting models across multiple GPUs.^[20] Tensor parallelism partitions each linear layer's weight matrices across GPUs and runs the matmul in parallel on the shards, so each forward pass uses all cards simultaneously rather than sequentially as in pipeline mode. Community reports indicate that the new tensor-parallel path made ExLlamaV2 a viable choice for multi-GPU batched serving of EXL2-quantized models, where llama.cpp still lacks comparable tensor-parallel and continuous-batching support.^[20]

Speculative decoding

ExLlamaV2 includes built-in support for speculative decoding through a smaller draft model. In speculative decoding a fast draft model proposes a short run of tokens that the target model verifies in a single parallel forward pass, accepting the longest prefix consistent with its own distribution and falling back to standard sampling for the first rejected position. Output is provably identical to running the target model alone.^[12]

In ExLlamaV2 the draft model can be any model whose tokenizer matches the target's, including a much smaller member of the same family (for example a Qwen 0.5B draft alongside a Qwen 72B target).^[12] Community deployments report speed increases of roughly 100% to 200% on tasks where draft acceptance is high.^[12] Speculative decoding is exposed both through ExLlamaV2's own dynamic generator and through the OpenAI-compatible front end provided by TabbyAPI, which lets users configure a draft model and draft length in config.yml.^[13]

Hardware support

ExLlamaV2 targets modern NVIDIA consumer cards. The bundled CUDA kernels and the FlashAttention 2.5.7+ paged-attention path require Ampere-generation or newer GPUs for full performance; older architectures will run but without paged-attention batching.^[1]^[13] Reference benchmarks from the project's README report, for example, around 205 tokens per second for LLaMA-1 7B in 4-bit on an RTX 4090 and 770 tokens per second for TinyLlama 1.1B in EXL2.^[3] A LLaMA-2 70B model fits on a single 24 GB consumer card such as an RTX 3090 or 4090 at roughly 2.55 bits per weight with a 2048-token context, with output described as coherent and mostly stable at that precision.^[1]^[11]

While ExLlamaV2 is primarily a CUDA-only project, community forks and unofficial builds offer support for AMD HIP / ROCm on Linux, allowing some AMD cards such as the Radeon RX 7900 XTX to run EXL2 models.^[21] Performance on those targets is generally below the NVIDIA reference path, and several integrations explicitly recommend NVIDIA Ampere-or-newer hardware for the best single-GPU INT4 throughput.^[13]^[21]

KV cache quantization

ExLlamaV2 supports running the KV cache itself at reduced precision in addition to the weights. The supported modes include FP16 (default), an FP8 mode in earlier generators, and a 4-bit "Q4" cache mode in the dynamic generator that is reported by the project to outperform the older FP8 path.^[1] Reducing the cache precision lets longer contexts fit on a given card; for example, on a single 24 GB GPU, 4-bit cache typically roughly doubles the maximum context length over FP16 cache at the same weight precision, at the cost of small quality degradation on long-context tasks.^[1]^[11]

Samplers and decoding controls

ExLlamaV2 implements the conventional set of decoding samplers and a few less-common ones. The 0.2.3 release explicitly added the XTC (exclude-top-choices) sampler popularized in the local-LLM community, alongside standard temperature, top-k, top-p, min-p, repetition-penalty, and dynamic-temperature controls.^[4] The dynamic generator also supports JSON-schema-constrained decoding when wrapped by TabbyAPI, classifier-free guidance, and sampler overrides applied per request through the API.^[13]

How fast is EXL2, and how good is the quality?

The most-cited public head-to-head comparison of EXL2 against other 4-bit quantization formats is Maxime Labonne's November 2023 post on ExLlamaV2^[10] and the oobabooga benchmark published the same month, which evaluated GPTQ, AWQ, EXL2, and llama.cpp's GGUF Q4_K_M / Q4_K_S on Llama 2 13B on the WikiText perplexity task.^[14] Numbers from the latter are reproduced below.

Format	Variant	WikiText perplexity (lower = better)	VRAM (GB)
EXL2	4.900 bpw	4.31	(not reported)
EXL2	4.650 bpw	4.32	(not reported)
AWQ	4-bit g32	4.33	10.6
GGUF (llama.cpp)	Q4_K_M	4.33	(not reported)
GPTQ	4-bit g32 act-order	4.34	(not reported)
GGUF (llama.cpp)	Q4_K_S	4.34	8.6
Transformers `load_in_4bit`	NF4	4.36	(not reported)
EXL2	4.000 bpw	(not reported)	7.9
GPTQ	4-bit g128 act-order	(not reported)	7.9

Source: oobabooga, "GPTQ, AWQ, EXL2, llama.cpp: detailed comparison" (2023).^[14]

On generation speed, the same benchmark reports EXL2 4.250 bpw at about 56.9 tokens per second, GPTQ via ExLlamaV2 at 64.1 tokens per second, and llama.cpp's Q4_K_S at 35.3 tokens per second on the same hardware, with llama.cpp taking roughly 2.22 times longer than ExLlamaV2 to process a 3200-token prompt.^[14] Multiple independent comparisons reach a similar qualitative ranking: when the model fits entirely in GPU VRAM, EXL2 is among the fastest formats available on NVIDIA, but it loses that advantage as soon as part of the model needs to be offloaded, because ExLlamaV2 has no CPU-offloading path.^[15]^[16] At 4-bit, the spread between any two well-tuned 4-bit formats on perplexity is generally smaller than the spread between any 4-bit and any 8-bit format.^[14]^[15] These figures come from community and vendor benchmarks; absolute throughput depends heavily on the specific GPU, driver, and context length used.

Implementations and Downstream Use

TabbyAPI

TabbyAPI is the official API backend recommended by the ExLlamaV2 project. It is a FastAPI application that wraps ExLlamaV2 (and ExLlamaV3) and exposes an OpenAI-compatible REST API, with additional endpoints for model loading, LoRA management, sampling overrides, and embeddings.^[13] TabbyAPI is maintained by kingbri, Splice86, and turboderp.^[13]

TabbyAPI supports the same model formats as the underlying library, currently EXL3 (recommended) and EXL2 / GPTQ / FP16-BF16 (the latter group marked deprecated as of the current TabbyAPI README).^[13] It includes continuous batching with paged attention on Ampere or newer NVIDIA GPUs, JSON-schema-constrained decoding, multi-LoRA scaling, Jinja2 chat templates compatible with Hugging Face model cards, speculative decoding via a draft model, and tool/function calling compatible with the OpenAI API schema.^[13]

text-generation-webui

Llama-era open-source local chat UI text-generation-webui (oobabooga's project) shipped an ExLlamaV2 loader as one of several inference backends alongside Hugging Face Transformers, llama.cpp, TensorRT-LLM, and HQQ.^[17] The loader supports both EXL2 and GPTQ checkpoints. The integration includes wrappers (Exllamav2HF) that expose an ExLlamaV2 model as a transformers.PreTrainedModel, so it can be driven via the standard generate() method, and exposes ExLlamaV2's cache-quantization and speculative-decoding options through the UI.^[17]

Other integrations

ExLlamaV2 is also used as a backend in ExUI (turboderp's own minimal web UI), the lollms-webui project, and various community front ends.^[3] A third-party project, exl2-for-all, generalized the EXL2 quantization process to architectures beyond the Llama-style ones natively supported by ExLlamaV2.^[18] The format is widely distributed on Hugging Face: many community quantizers, including LoneStriker, bartowski, and turboderp's own account, publish multiple-bpw EXL2 variants of popular open-weight models.^[11]

Front-end clients

Because TabbyAPI exposes the OpenAI Chat Completions schema, any chat client that targets that schema can drive an ExLlamaV2 server with minimal configuration. The most commonly mentioned community clients include SillyTavern (for character-based roleplay, with documented TabbyAPI integration), Open WebUI, and LM Studio for sessions that import an OpenAI-compatible base URL.^[13] The HuggingFace Hub adapter shipped with text-generation-webui can also load EXL2 models directly from a model ID, which lets users move between FP16 Transformers checkpoints and EXL2 quantizations of the same model with a one-line config change.^[17]

What is EXL2 used for?

EXL2's main practical role is to make large, dense, open-weight LLMs runnable on a single consumer GPU at interactive speeds. The frequently cited reference workload is fitting Llama 2 70B onto one 24 GB card such as an RTX 3090 or 4090 at roughly 2.55 bits per weight, which keeps the model entirely in VRAM and so preserves ExLlamaV2's tokens-per-second advantage.^[1]^[11] Mid-range cards (12 to 16 GB) generally use EXL2 for 7B to 13B models at 4 to 6 bits per weight; higher-end multi-GPU rigs use EXL2 to host 70B-class models at 4 to 5 bpw with longer contexts and speculative decoding for further speedups.^[11]^[15]

Beyond local chat, the combination of ExLlamaV2, TabbyAPI, and speculative decoding has been adopted by hobbyist and small-team deployments that want an OpenAI-compatible API on their own hardware without operating a heavyweight server such as vLLM or SGLang.^[13]^[15] Use cases reported in community write-ups include creative-writing assistants, retrieval-augmented question answering, code completion, and offline research notebooks on workstation GPUs.^[17]^[15]

The library is also exposed through framework adapters: LangChain's ExLlamaV2 LLM integration lets Python applications call EXL2 models through LangChain's chains and agents abstractions without writing CUDA, and Hugging Face's Transformers can wrap an ExLlamaV2 model behind its PreTrainedModel interface via the Exllamav2HF shim shipped with text-generation-webui.^[17] Together these adapters let EXL2 fit into pipelines originally designed for FP16 Transformers models, with the quantized backend swapped in transparently.

EXL2 quantized weights are widely distributed on the Hugging Face Hub. Community quantizers including LoneStriker, bartowski, and turboderp's own account publish multiple bits-per-weight variants of new open-weight releases within days of upload, allowing users to pick the highest bpw that fits their card.^[11] Because the format encodes its per-row bit allocation in the file itself, no separate configuration file is needed at load time beyond the standard tokenizer and config files distributed with the model.^[9]^[11]

What are the limitations of EXL2?

ExLlamaV2 and the EXL2 format have several known limitations.

No CPU offloading. Unlike llama.cpp (GGUF), ExLlamaV2 keeps the entire model in GPU memory; if a quantized model does not fit in VRAM, the user must shard across multiple GPUs or pick a different backend.^[15]^[16]
NVIDIA-only and Ampere-or-newer for full features. The CUDA kernels and the paged-attention path require recent NVIDIA cards; AMD ROCm and Apple Silicon are not first-class supported targets, in contrast to llama.cpp.^[13]^[16]
Long, RAM-heavy conversions. EXL2 quantization runs the measurement pass twelve times across the model; community reports note that a 70B EXL2 quantization on a single 24 GB GPU can take many hours of wall time and substantial system RAM.^[9]^[11]
Non-portable tensor layout. ExLlamaV3's release notes explicitly call out that EXL2 renames some tensors during conversion to coerce models into a Llama-style structure, making EXL2 checkpoints harder to load in other frameworks; this is one motivation for the EXL3 redesign.^[7]
Architecture coverage trails llama.cpp. Because each architecture needs a custom forward pass in the ExLlamaV2 CUDA kernels, support for new model families generally lands later in ExLlamaV2 than in llama.cpp.^[4]^[16]
Project archived. As of mid-2025, ExLlamaV2 is archived with development continuing in ExLlamaV3; bug fixes for new model families are no longer expected on the V2 branch.^[3]^[4]

On the format side, several head-to-head comparisons note that pure 4-bit EXL2 at 4.0 bpw can be a hair behind the best of AWQ and Q4_K_M on Llama-2-13B WikiText perplexity, and that EXL2 only clearly wins on perplexity once the average bit rate is allowed to rise to roughly 4.65 or 4.9 bpw, at which point it also uses correspondingly more VRAM than competing 4-bit formats.^[14]^[15]

How does EXL2 compare to GGUF, GPTQ, and AWQ?

Format	Origin	Bit-width strategy	Typical use case
GPTQ	Frantar et al., 2022 (arXiv 2210.17323)	Fixed bit width per layer (commonly 4-bit)	Earlier-generation GPU inference, baseline for derivative work^[8]
EXL2	turboderp / ExLlamaV2, 2023	Mixed bit width per row, average 2 to 8 bpw	Single-GPU NVIDIA inference at interactive speed^[1]^[9]
AWQ	Lin et al., 2023	Activation-aware fixed bit width (commonly 4-bit)	Strong 4-bit perplexity, broad framework support^[14]^[15]
GGUF (llama.cpp)	ggerganov, 2023	Family of mixed K-quant schemes (Q2_K through Q8_0, with `_S` / `_M` / `_L` variants)	CPU+GPU hybrid, broad hardware portability^[14]^[16]
QTIP / EXL3	Cornell RelaxML, 2024-2025; ExLlamaV3	Vector quantization with Viterbi decoding	EXL2 successor, NVIDIA inference^[7]

QLoRA's 4-bit NF4 ("NormalFloat-4") format used by Hugging Face Transformers' load_in_4bit flag is generally categorized as a separate, training-oriented quantization scheme rather than an inference format; in the oobabooga benchmark above it sits at the worst-perplexity end of the table at 4-bit.^[14]

A practical takeaway from the published comparisons is that the choice of format depends more on hardware and deployment shape than on raw perplexity. On a single NVIDIA card with the model fully in VRAM, EXL2 (and its successor EXL3) lead on tokens per second; on heterogeneous CPU-plus-GPU setups or Apple Silicon, GGUF's offloading flexibility is decisive; AWQ is favored by serving stacks that prioritize 4-bit perplexity and broad framework portability; and GPTQ persists mainly as a baseline understood by every quantization-aware stack.^[14]^[15]^[16] The 2-to-3-bit ultra-low-precision range is the regime where EXL2's mixed bit allocation showed its clearest historical advantage, by letting users target 2.4 to 2.55 bpw while keeping enough higher-bit rows on the most sensitive layers to remain usable, something fixed-bit GPTQ at 2-bit was generally not able to do.^[1]^[11]

What is the difference between EXL2 and EXL3?

EXL3 is the weight-quantization format of ExLlamaV3, the successor library turboderp began previewing in April 2025.^[7] Where EXL2 derives from the GPTQ measurement loop and mixes bit widths to hit an average bitrate between 2 and 8 bpw, EXL3 uses "a streamlined variant of QTIP from Cornell RelaxML," a trellis-coded vector quantizer that encodes high-dimensional weight vectors into tail-biting trellis structures.^[7] EXL3 computes Hessians on the fly with a fused Viterbi kernel, which lets it convert a model in a single pass (a couple of minutes for small models, up to a few hours for 70B-class models on one RTX 4090) rather than EXL2's slower two-pass measurement-then-quantization flow.^[7] A practical result reported in the ExLlamaV3 documentation is that Llama-3.1-70B in EXL3 stays coherent down to 1.6 bpw, and with the output layer at 3 bpw and a 4096-token cache it can run in under 16 GB of VRAM.^[7] ExLlamaV3 also keeps the original model file structure largely intact instead of renaming tensors into a Llama-style layout, improving portability over EXL2.^[7] EXL2 remains widely used because of its large back catalogue of pre-quantized models on Hugging Face, but new development and the project's own recommendation have shifted to EXL3.^[3]^[7]

ELI5: What is EXL2?

Imagine you have a very big, very heavy book (an AI model) and you want it to fit on a small shelf (your gaming graphics card). EXL2 is a clever way to shrink the book by printing the boring pages in tiny print and the important pages in normal print, so the whole thing still makes sense but takes up much less space. The ExLlamaV2 program both shrinks the book this way and then reads it back to you very quickly, as long as the whole book fits on the shelf.

References

turboderp, "ExLlamaV2 README", GitHub, 2025-07-13. https://github.com/turboderp-org/exllamav2/blob/master/README.md. Accessed 2026-05-21. ↩
turboderp, "ExLlamaV2: A fast inference library for running LLMs locally on modern consumer-class GPUs (repository description)", GitHub, 2025-07-13. https://github.com/turboderp-org/exllamav2. Accessed 2026-05-21. ↩
turboderp-org, "exllamav2 repository (project status and integrations)", GitHub, 2025-07-13. https://github.com/turboderp-org/exllamav2. Accessed 2026-05-21. ↩
turboderp-org, "Releases: turboderp-org/exllamav2", GitHub, 2025-07-13. https://github.com/turboderp-org/exllamav2/releases. Accessed 2026-05-21. ↩
Internet Archive, "github.com-turboderp-exllamav2 snapshot 2023-09-14", Internet Archive, 2023-09-14. https://archive.org/details/github.com-turboderp-exllamav2_-_2023-09-14_15-53-34. Accessed 2026-05-21. ↩
turboderp, "exllama: A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights", GitHub, 2023. https://github.com/turboderp/exllama. Accessed 2026-05-21. ↩
turboderp-org, "ExLlamaV3 README and conversion notes", GitHub, 2025. https://github.com/turboderp-org/exllamav3. Accessed 2026-05-21. ↩
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh, "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", arXiv, 2022-10-31. https://arxiv.org/abs/2210.17323. Accessed 2026-05-21. ↩
turboderp, "ExLlamaV2 doc/convert.md: Conversion script reference", GitHub, 2024. https://github.com/turboderp-org/exllamav2/blob/master/doc/convert.md. Accessed 2026-05-21. ↩
Maxime Labonne, "ExLlamaV2: The Fastest Library to Run LLMs", TDS Archive (Medium), 2023-11-20. https://medium.com/data-science/exllamav2-the-fastest-library-to-run-llms-32aeda294d26. Accessed 2026-05-21. ↩
Benjamin Marie, "Run Llama 2 70B on Your GPU with ExLlamaV2", The Kaitchup / Medium, 2023-11. https://medium.com/data-science/run-llama-2-70b-on-your-gpu-with-exllamav2-588141a88598. Accessed 2026-05-21. ↩
grimjim, "Speculative decoding only requires that the tokenizers for the two LLMs used line up", Hugging Face Posts, 2024. https://huggingface.co/posts/grimjim/820999393776814. Accessed 2026-05-21. ↩
theroyallab, "tabbyAPI: The official API server for Exllama", GitHub, 2025. https://github.com/theroyallab/tabbyAPI/. Accessed 2026-05-21. ↩
oobabooga, "A detailed comparison between GPTQ, AWQ, EXL2, q4_K_M, q4_K_S, and load_in_4bit", oobabooga blog, 2023-11. https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/. Accessed 2026-05-21. ↩
CraftRigs, "GGUF vs GPTQ vs AWQ vs EXL2: Which Quantization Format Should You Use?", CraftRigs, 2024. https://craftrigs.com/guides/gguf-vs-gptq-vs-awq-exl2-quantization-formats-explained/. Accessed 2026-05-21. ↩
Hardware Corner, "Quantization for Local LLMs: How It Works and Which Formats Fit Your Setup", Hardware Corner, 2024. https://www.hardware-corner.net/quantization-local-llms-formats/. Accessed 2026-05-21. ↩
DeepWiki, "ExLlama Integration in oobabooga/text-generation-webui", DeepWiki, 2025. https://deepwiki.com/oobabooga/text-generation-webui/3.2-exllama-integration. Accessed 2026-05-21. ↩
chu-tianxiang, "exl2-for-all: EXL2 quantization generalized to other models", GitHub, 2024. https://github.com/chu-tianxiang/exl2-for-all. Accessed 2026-05-21. ↩
turboderp, "ExLlamaV2 doc/dynamic.md: Dynamic generator reference", GitHub, 2024. https://github.com/turboderp-org/exllamav2/blob/master/doc/dynamic.md. Accessed 2026-05-21. ↩
Ahmad Osman, "Stop Wasting Your Multi-GPU Setup With llama.cpp: Use vLLM or ExLlamaV2 for Tensor Parallelism", Osman's Odyssey, 2024. https://www.ahmadosman.com/blog/do-not-use-llama-cpp-or-ollama-on-multi-gpus-setups-use-vllm-or-exllamav2/. Accessed 2026-05-21. ↩
wasdtech, "Install ExLlamaV2 for AMD HIP/ROCm on Linux", wasdtech.altervista.org, 2024. https://wasdtech.altervista.org/install-exllamav2-for-amd-hip-rocm-on-linux/. Accessed 2026-05-21. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributor · full history

Suggest edit

What links here

AWQ (Activation-aware Weight Quantization)LLM.int8()Mistral 7B Optimum-Quanto SmoothQuant

Infobox

What is EXL2?

History

How does EXL2 quantization work?

Relation to GPTQ

Calibration and the two-pass converter

Supported bit widths and file structure

Supported model architectures

FlashAttention and paged attention

Dynamic generator: continuous batching, prefix caching, and deduplication

Tensor parallelism and multi-GPU inference

Speculative decoding

Hardware support

KV cache quantization

Samplers and decoding controls

How fast is EXL2, and how good is the quality?

Implementations and Downstream Use

TabbyAPI

text-generation-webui

Other integrations

Front-end clients

What is EXL2 used for?

What are the limitations of EXL2?

How does EXL2 compare to GGUF, GPTQ, and AWQ?

What is the difference between EXL2 and EXL3?

ELI5: What is EXL2?

See also

References

Improve this article

Related Articles

NVIDIA Dynamo

Optimum-Quanto

Text Generation Inference (TGI)

OpenVINO

NVIDIA Triton Inference Server

TensorFlow Serving

What links here

Related Articles

NVIDIA Dynamo

Optimum-Quanto

Text Generation Inference (TGI)

OpenVINO

NVIDIA Triton Inference Server

TensorFlow Serving

What links here