ExLlamaV2 (EXL2)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,108 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,108 words
Add missing citations, update stale details, or suggest a clearer explanation.
ExLlamaV2 is an open-source inference library and an associated weight-quantization format (commonly referred to as EXL2) for running large language models locally on consumer-class NVIDIA GPUs. It was created and is maintained by the developer who publishes under the pseudonym turboderp and is the successor to the original ExLlama project from 2023.[^1] The EXL2 format extends ideas from GPTQ by allowing mixed bit widths within a single model: different rows and layers can be stored at different precisions so that an average target bit-rate between roughly 2 and 8 bits per weight can be hit while keeping perplexity close to the full-precision baseline.[^1][^2] ExLlamaV2 was designed first for the LLaMA family of decoder-only transformers and has since added support for many other architectures including Mistral, Mixtral, Qwen, and Gemma variants.[^3] The library reached version 0.3.2 on 2025-07-13, after which the project was archived in favor of a successor project, ExLlamaV3.[^3][^4]
| Field | Value |
|---|---|
| Project | ExLlamaV2 |
| Format | EXL2 (mixed-bit weight quantization) |
| Developer | turboderp (GitHub: turboderp-org) |
| First preliminary release | 2023-08-12 |
| Latest release | 0.3.2 (2025-07-13) |
| License | MIT |
| Primary language | Python, CUDA, C++ |
| Successor | ExLlamaV3 |
| Reference backend | TabbyAPI |
| Repository | github.com/turboderp-org/exllamav2 |
[^1][^3][^4][^5]
The project began as ExLlama, a memory-efficient rewrite of the Hugging Face Transformers implementation of LLaMA aimed at quantized weights, published on GitHub by turboderp in mid-2023.[^6] The author describes ExLlama as "a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights."[^6] The original ExLlama targeted GPTQ models on a single CUDA GPU and quickly became the fastest open-source backend for 4-bit LLaMA-1 and LLaMA-2 inference on consumer cards such as the RTX 3090 and RTX 4090.[^3][^6]
A preliminary, tentative release of ExLlamaV2 was tagged on 2023-08-12.[^4] The rewrite introduced the EXL2 quantization format, a new generator with paged batching, and reworked CUDA kernels. Through 2024 and 2025, ExLlamaV2 added support for further model families and multimodal inputs. The 0.2.3 release on 2024-09-29 removed the safetensors hard dependency and added the XTC sampler and YaRN context extension; 0.2.4 (2024-11-12) added Pixtral and refactored the multimodal pipeline; 0.2.5 through 0.2.7 (December 2024) added Qwen2-VL image then basic video support and Cohere2 and Granite3 architectures; 0.2.8 (2025-02-08) added Qwen2.5-VL; 0.2.9 (2025-04-23) added Gemma 3, Mistral 3.1 (text and vision), and GLM4; and 0.3.0 (2025-05-12) added Qwen3 and Qwen3-MoE.[^4]
In April 2025 turboderp announced an early preview of ExLlamaV3, a fresh codebase using a streamlined variant of the QTIP quantizer from Cornell RelaxML rather than EXL2's GPTQ-derived measurement loop, and computing Hessians on the fly with a fused Viterbi kernel.[^7] ExLlamaV2's final tagged release (0.3.2) shipped on 2025-07-13, and the repository was subsequently archived with development continuing in ExLlamaV3.[^3][^4]
The EXL2 author should not be confused with Tim Dettmers, who created QLoRA and the bitsandbytes library; those projects are unrelated to ExLlamaV2 and pursue a different quantization approach.
EXL2 inherits the core optimization step from the GPTQ algorithm of Frantar, Ashkboos, Hoefler, and Alistarh (arXiv 2210.17323, October 2022, ICLR 2023).[^8] GPTQ is a one-shot post-training weight-quantization method based on approximate second-order (Hessian-based) information, capable of compressing GPT models with 175 billion parameters in roughly four GPU hours down to 3 to 4 bits per weight with little accuracy degradation.[^8] GPTQ typically quantizes an entire linear layer (or fixed-size groups within it) to a single bit-width.
EXL2 generalizes this in two ways. First, it allows the converter to mix multiple bit widths and group sizes within the same matrix: more sensitive columns or sub-blocks can be stored at higher precision and less sensitive ones at lower precision.[^1][^9] Second, the converter searches over the available per-row settings to hit a user-specified average bits-per-weight (bpw) target while minimizing total reconstruction error.[^9] The official ExLlamaV2 README describes the result as "a new quantization format" that "supports 2, 3, 4, 5, 6 and 8-bit quantization" and allows "mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight."[^1]
The bundled convert.py script runs in two phases.[^9] The first pass is a measurement phase: the converter requantizes each module of the model roughly twelve times under different bit-width and group-size choices and records the resulting per-module quantization error against a calibration set.[^9] The second pass uses those measurements to select, per row, the actual quantization parameters that minimize total error subject to the target average bits-per-weight, then writes the quantized weights to disk.[^9]
Calibration data may be supplied by the user in Parquet format or drawn from the converter's built-in default set. The script concatenates the supplied text into one long string and uses the first rows * length tokens as calibration material, with default settings of 100 rows of 2048 tokens each for the quantization pass and 16 rows by default during the measurement pass.[^9] WikiText is a common third-party choice for the calibration corpus.[^10]
The -b flag sets the target average bits per weight, and -hb controls the precision of the output (head) layer specifically, with permitted values of 2, 3, 4, 5, 6, or 8 bits.[^9] The documentation notes that the nominal 6-bit setting in practice produces a mixed-precision quantization averaging about 6.3 bits per weight, because of the per-row search.[^9] After conversion, the script reports a calibration perplexity; values above roughly 30 indicate problems and values in the thousands indicate complete failure of the run.[^9]
Hardware requirements for the conversion itself depend on model width rather than depth: converting a 70B model requires roughly 64 GB of system RAM and 24 GB of CUDA VRAM, while a 7B model needs about 16 GB of RAM and 8 GB of VRAM.[^9]
EXL2 supports per-row choices among 2, 3, 4, 5, 6, and 8-bit quantization, combined to produce average rates that, in practice, fall in the 2.5 to 8 bpw range; published quants on Hugging Face commonly include 2.4, 2.55, 3.0, 4.0, 4.25, 4.65, 5.0, 6.0, and 8.0 bpw variants.[^1][^10][^11] The format stores quantized weights, scales, and a metadata table describing the per-row bit allocation in a single safetensors-compatible file. Per ExLlamaV3 documentation, EXL2 renames some tensors during conversion to fit a Llama-style structure, a behavior that ExLlamaV3 explicitly drops to ease portability to other frameworks.[^7]
ExLlamaV2 began with LLaMA-1 architecture support inherited from the original ExLlama, then extended to a long list of decoder-only families through its release history.[^3][^4] The list of architectures with first-class support, in roughly the order they were added, includes Llama 1 and 2; Mistral 7B and Mixtral 8x7B / 8x22B mixture-of-experts; Llama 3 and 3.1; Qwen 1 / 1.5 / 2; Gemma 1 / 2; Cohere2 (Aya / Command R); Granite3; Qwen2-VL and Qwen2.5-VL (multimodal); Gemma 3 (text and vision); Mistral 3.1 (text and vision); GLM4; and Qwen3 / Qwen3-MoE.[^4] Pixtral support was added in 0.2.4 and basic Qwen2-VL video support in 0.2.7.[^4] Architectures supported only at FP16 in some intermediate releases may not have a full EXL2 quantization path until a later version, so users are advised to consult release notes for the specific architecture they want to quantize.[^4]
ExLlamaV2 relies on FlashAttention for its attention kernels. From version 0.0.21 onward, ExLlamaV2 supports paged attention via FlashAttention 2.5.7 or newer, which enables non-contiguous KV cache storage and underlies the library's dynamic batching generator.[^1] The dynamic generator added in the same line of releases supports continuous batching, smart prompt caching, K/V cache deduplication, asyncio-streamed generation, and consolidates earlier single and batched generators behind one API.[^1] The 4-bit ("Q4") KV cache mode is supported in the dynamic generator and is reported to perform better than the older FP8 cache mode that was dropped in the consolidation.[^1]
Paged attention as used by ExLlamaV2 is the same conceptual mechanism introduced by PagedAttention in vLLM, namely splitting the KV cache into fixed-size blocks that can be allocated non-contiguously, but ExLlamaV2 accesses it through the FlashAttention 2.5.7+ implementation rather than vLLM's own kernels.[^1]
The dynamic generator implements continuous batching using paged attention. Rather than the static, pre-allocated caches of the older generator, the dynamic generator maintains a job queue and activates as many jobs as fit in the current cache, freeing pages as jobs complete.[^19] Because pages are managed by a block table rather than by contiguous slabs, the only memory wasted is what is needed to align sequences to page boundaries, currently 256 tokens; this eliminates the padding waste of fixed-batch generators.[^19]
The same page table lets the dynamic generator perform two related kinds of cache reuse. When multiple prompts share a common prefix, such as a fixed system message, the generator points multiple sequences at the same cached keys and values rather than recomputing them, sharply reducing prefill time and VRAM use.[^19] Separately, recently used pages are not immediately freed when a job completes; the generator can later reattach a new job to those pages if it sees the same prefix, enabling effective prompt caching across successive chat turns.[^19] The Python API exposes both single completions and batched completions via generator.generate(prompt=...) and a streaming iterate() method that yields token-level result dictionaries keyed by user-supplied job identifiers.[^19]
ExLlamaV2 added tensor-parallel inference in autumn 2024, joining its existing pipeline-parallel mode for splitting models across multiple GPUs.[^20] Tensor parallelism partitions each linear layer's weight matrices across GPUs and runs the matmul in parallel on the shards, so each forward pass uses all cards simultaneously rather than sequentially as in pipeline mode. Community reports indicate that the new tensor-parallel path made ExLlamaV2 a viable choice for multi-GPU batched serving of EXL2-quantized models, where llama.cpp still lacks comparable tensor-parallel and continuous-batching support.[^20]
ExLlamaV2 includes built-in support for speculative decoding through a smaller draft model. In speculative decoding a fast draft model proposes a short run of tokens that the target model verifies in a single parallel forward pass, accepting the longest prefix consistent with its own distribution and falling back to standard sampling for the first rejected position. Output is provably identical to running the target model alone.[^12]
In ExLlamaV2 the draft model can be any model whose tokenizer matches the target's, including a much smaller member of the same family (for example a Qwen 0.5B draft alongside a Qwen 72B target).[^12] Community deployments report speed increases of roughly 100% to 200% on tasks where draft acceptance is high.[^12] Speculative decoding is exposed both through ExLlamaV2's own dynamic generator and through the OpenAI-compatible front end provided by TabbyAPI, which lets users configure a draft model and draft length in config.yml.[^13]
ExLlamaV2 targets modern NVIDIA consumer cards. The bundled CUDA kernels and the FlashAttention 2.5.7+ paged-attention path require Ampere-generation or newer GPUs for full performance; older architectures will run but without paged-attention batching.[^1][^13] Reference benchmarks from the project's README report, for example, around 205 tokens per second for LLaMA-1 7B in 4-bit on an RTX 4090 and 770 tokens per second for TinyLlama 1.1B in EXL2.[^3] A LLaMA-2 70B model fits on a single 24 GB consumer card such as an RTX 3090 or 4090 at roughly 2.55 bits per weight with a 2048-token context, with output described as coherent and mostly stable at that precision.[^1][^11]
While ExLlamaV2 is primarily a CUDA-only project, community forks and unofficial builds offer support for AMD HIP / ROCm on Linux, allowing some AMD cards such as the Radeon RX 7900 XTX to run EXL2 models.[^21] Performance on those targets is generally below the NVIDIA reference path, and several integrations explicitly recommend NVIDIA Ampere-or-newer hardware for the best single-GPU INT4 throughput.[^13][^21]
ExLlamaV2 supports running the KV cache itself at reduced precision in addition to the weights. The supported modes include FP16 (default), an FP8 mode in earlier generators, and a 4-bit "Q4" cache mode in the dynamic generator that is reported by the project to outperform the older FP8 path.[^1] Reducing the cache precision lets longer contexts fit on a given card; for example, on a single 24 GB GPU, 4-bit cache typically roughly doubles the maximum context length over FP16 cache at the same weight precision, at the cost of small quality degradation on long-context tasks.[^1][^11]
ExLlamaV2 implements the conventional set of decoding samplers and a few less-common ones. The 0.2.3 release explicitly added the XTC (exclude-top-choices) sampler popularized in the local-LLM community, alongside standard temperature, top-k, top-p, min-p, repetition-penalty, and dynamic-temperature controls.[^4] The dynamic generator also supports JSON-schema-constrained decoding when wrapped by TabbyAPI, classifier-free guidance, and sampler overrides applied per request through the API.[^13]
The most-cited public head-to-head comparison of EXL2 against other 4-bit quantization formats is Maxime Labonne's November 2023 post on ExLlamaV2[^10] and the oobabooga benchmark published the same month, which evaluated GPTQ, AWQ, EXL2, and llama.cpp's GGUF Q4_K_M / Q4_K_S on Llama 2 13B on the WikiText perplexity task.[^14] Numbers from the latter are reproduced below.
| Format | Variant | WikiText perplexity (lower = better) | VRAM (GB) |
|---|---|---|---|
| EXL2 | 4.900 bpw | 4.31 | (not reported) |
| EXL2 | 4.650 bpw | 4.32 | (not reported) |
| AWQ | 4-bit g32 | 4.33 | 10.6 |
| GGUF (llama.cpp) | Q4_K_M | 4.33 | (not reported) |
| GPTQ | 4-bit g32 act-order | 4.34 | (not reported) |
| GGUF (llama.cpp) | Q4_K_S | 4.34 | 8.6 |
Transformers load_in_4bit | NF4 | 4.36 | (not reported) |
| EXL2 | 4.000 bpw | (not reported) | 7.9 |
| GPTQ | 4-bit g128 act-order | (not reported) | 7.9 |
Source: oobabooga, "GPTQ, AWQ, EXL2, llama.cpp: detailed comparison" (2023).[^14]
On generation speed, the same benchmark reports EXL2 4.250 bpw at about 56.9 tokens per second, GPTQ via ExLlamaV2 at 64.1 tokens per second, and llama.cpp's Q4_K_S at 35.3 tokens per second on the same hardware, with llama.cpp taking roughly 2.22 times longer than ExLlamaV2 to process a 3200-token prompt.[^14] Multiple independent comparisons reach a similar qualitative ranking: when the model fits entirely in GPU VRAM, EXL2 is among the fastest formats available on NVIDIA, but it loses that advantage as soon as part of the model needs to be offloaded, because ExLlamaV2 has no CPU-offloading path.[^15][^16] At 4-bit, the spread between any two well-tuned 4-bit formats on perplexity is generally smaller than the spread between any 4-bit and any 8-bit format.[^14][^15]
TabbyAPI is the official API backend recommended by the ExLlamaV2 project. It is a FastAPI application that wraps ExLlamaV2 (and ExLlamaV3) and exposes an OpenAI-compatible REST API, with additional endpoints for model loading, LoRA management, sampling overrides, and embeddings.[^13] TabbyAPI is maintained by kingbri, Splice86, and turboderp.[^13]
TabbyAPI supports the same model formats as the underlying library, currently EXL3 (recommended) and EXL2 / GPTQ / FP16-BF16 (the latter group marked deprecated as of the current TabbyAPI README).[^13] It includes continuous batching with paged attention on Ampere or newer NVIDIA GPUs, JSON-schema-constrained decoding, multi-LoRA scaling, Jinja2 chat templates compatible with Hugging Face model cards, speculative decoding via a draft model, and tool/function calling compatible with the OpenAI API schema.[^13]
Llama-era open-source local chat UI text-generation-webui (oobabooga's project) shipped an ExLlamaV2 loader as one of several inference backends alongside Hugging Face Transformers, llama.cpp, TensorRT-LLM, and HQQ.[^17] The loader supports both EXL2 and GPTQ checkpoints. The integration includes wrappers (Exllamav2HF) that expose an ExLlamaV2 model as a transformers.PreTrainedModel, so it can be driven via the standard generate() method, and exposes ExLlamaV2's cache-quantization and speculative-decoding options through the UI.[^17]
ExLlamaV2 is also used as a backend in ExUI (turboderp's own minimal web UI), the lollms-webui project, and various community front ends.[^3] A third-party project, exl2-for-all, generalized the EXL2 quantization process to architectures beyond the Llama-style ones natively supported by ExLlamaV2.[^18] The format is widely distributed on Hugging Face: many community quantizers, including LoneStriker, bartowski, and turboderp's own account, publish multiple-bpw EXL2 variants of popular open-weight models.[^11]
Because TabbyAPI exposes the OpenAI Chat Completions schema, any chat client that targets that schema can drive an ExLlamaV2 server with minimal configuration. The most commonly mentioned community clients include SillyTavern (for character-based roleplay, with documented TabbyAPI integration), Open WebUI, and LM Studio for sessions that import an OpenAI-compatible base URL.[^13] The HuggingFace Hub adapter shipped with text-generation-webui can also load EXL2 models directly from a model ID, which lets users move between FP16 Transformers checkpoints and EXL2 quantizations of the same model with a one-line config change.[^17]
EXL2's main practical role is to make large, dense, open-weight LLMs runnable on a single consumer GPU at interactive speeds. The frequently cited reference workload is fitting Llama 2 70B onto one 24 GB card such as an RTX 3090 or 4090 at roughly 2.55 bits per weight, which keeps the model entirely in VRAM and so preserves ExLlamaV2's tokens-per-second advantage.[^1][^11] Mid-range cards (12 to 16 GB) generally use EXL2 for 7B to 13B models at 4 to 6 bits per weight; higher-end multi-GPU rigs use EXL2 to host 70B-class models at 4 to 5 bpw with longer contexts and speculative decoding for further speedups.[^11][^15]
Beyond local chat, the combination of ExLlamaV2, TabbyAPI, and speculative decoding has been adopted by hobbyist and small-team deployments that want an OpenAI-compatible API on their own hardware without operating a heavyweight server such as vLLM or SGLang.[^13][^15] Use cases reported in community write-ups include creative-writing assistants, retrieval-augmented question answering, code completion, and offline research notebooks on workstation GPUs.[^17][^15]
The library is also exposed through framework adapters: LangChain's ExLlamaV2 LLM integration lets Python applications call EXL2 models through LangChain's chains and agents abstractions without writing CUDA, and Hugging Face's Transformers can wrap an ExLlamaV2 model behind its PreTrainedModel interface via the Exllamav2HF shim shipped with text-generation-webui.[^17] Together these adapters let EXL2 fit into pipelines originally designed for FP16 Transformers models, with the quantized backend swapped in transparently.
EXL2 quantized weights are widely distributed on the Hugging Face Hub. Community quantizers including LoneStriker, bartowski, and turboderp's own account publish multiple bits-per-weight variants of new open-weight releases within days of upload, allowing users to pick the highest bpw that fits their card.[^11] Because the format encodes its per-row bit allocation in the file itself, no separate configuration file is needed at load time beyond the standard tokenizer and config files distributed with the model.[^9][^11]
ExLlamaV2 and the EXL2 format have several known limitations.
On the format side, several head-to-head comparisons note that pure 4-bit EXL2 at 4.0 bpw can be a hair behind the best of AWQ and Q4_K_M on Llama-2-13B WikiText perplexity, and that EXL2 only clearly wins on perplexity once the average bit rate is allowed to rise to roughly 4.65 or 4.9 bpw, at which point it also uses correspondingly more VRAM than competing 4-bit formats.[^14][^15]
| Format | Origin | Bit-width strategy | Typical use case |
|---|---|---|---|
| GPTQ | Frantar et al., 2022 (arXiv 2210.17323) | Fixed bit width per layer (commonly 4-bit) | Earlier-generation GPU inference, baseline for derivative work[^8] |
| EXL2 | turboderp / ExLlamaV2, 2023 | Mixed bit width per row, average 2 to 8 bpw | Single-GPU NVIDIA inference at interactive speed[^1][^9] |
| AWQ | Lin et al., 2023 | Activation-aware fixed bit width (commonly 4-bit) | Strong 4-bit perplexity, broad framework support[^14][^15] |
| GGUF (llama.cpp) | ggerganov, 2023 | Family of mixed K-quant schemes (Q2_K through Q8_0, with _S / _M / _L variants) | CPU+GPU hybrid, broad hardware portability[^14][^16] |
| QTIP / EXL3 | Cornell RelaxML, 2024-2025; ExLlamaV3 | Vector quantization with Viterbi decoding | EXL2 successor, NVIDIA inference[^7] |
QLoRA's 4-bit NF4 ("NormalFloat-4") format used by Hugging Face Transformers' load_in_4bit flag is generally categorized as a separate, training-oriented quantization scheme rather than an inference format; in the oobabooga benchmark above it sits at the worst-perplexity end of the table at 4-bit.[^14]
A practical takeaway from the published comparisons is that the choice of format depends more on hardware and deployment shape than on raw perplexity. On a single NVIDIA card with the model fully in VRAM, EXL2 (and its successor EXL3) lead on tokens per second; on heterogeneous CPU-plus-GPU setups or Apple Silicon, GGUF's offloading flexibility is decisive; AWQ is favored by serving stacks that prioritize 4-bit perplexity and broad framework portability; and GPTQ persists mainly as a baseline understood by every quantization-aware stack.[^14][^15][^16] The 2-to-3-bit ultra-low-precision range is the regime where EXL2's mixed bit allocation showed its clearest historical advantage, by letting users target 2.4 to 2.55 bpw while keeping enough higher-bit rows on the most sensitive layers to remain usable, something fixed-bit GPTQ at 2-bit was generally not able to do.[^1][^11]