Optimum-Quanto
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,030 words
Optimum Quanto, commonly referred to as Quanto, is a PyTorch-based quantization toolkit developed and maintained by Hugging Face that provides linear weight and activation quantization across multiple devices and modalities.[^1][^2] Originally introduced in March 2024 under the package name quanto, the library was renamed to optimum-quanto in May 2024 and moved into the Hugging Face Optimum ecosystem as a dedicated quantization backend.[^2][^3] Quanto supports weight quantization in two-bit, four-bit, and eight-bit integer formats as well as eight-bit floating point (qint2, qint4, qint8, qfloat8), and activation quantization in qint8 and qfloat8, with execution on CPU, NVIDIA CUDA, Intel XPU, and Apple Silicon Metal Performance Shaders (MPS) backends.[^1][^4] The library is integrated into Hugging Face Transformers through the QuantoConfig class, and into Hugging Face Diffusers for Diffusion Transformer models such as FLUX and Stable Diffusion 3.[^4][^5] As of version 0.2.7 released on 2025-03-06, the project is in maintenance mode, with the Hugging Face team recommending bitsandbytes or torchao for active development.[^3][^6]
| Field | Value |
|---|---|
| Developer | Hugging Face Inc., Special Ops Team[^6] |
| Lead author | David Corvoysier (dacorvo)[^2][^7] |
| Initial blog announcement | 2024-03-18[^2] |
| Initial PyPI release of quanto | February 2024 (pre-rename)[^3] |
| Rename to optimum-quanto | 2024-05-31 (v0.2.1)[^3] |
| Latest release covered | v0.2.7 (2025-03-06)[^3][^6] |
| Repository | github.com/huggingface/optimum-quanto[^1] |
| License | Apache 2.0[^1][^6] |
| Language | Python (~70%), CUDA (~27%), C++ (~2%)[^1] |
| Minimum PyTorch | 2.6 (as of v0.2.7)[^3] |
| Minimum Python | 3.9[^6] |
| Status | Maintenance mode[^1][^6] |
The broader context for Quanto is the rapid expansion of quantization as a deployment technique for large neural networks during 2022 to 2024. As model parameter counts climbed past tens of billions, fitting weights and activations into accelerator memory in their original fp32 or fp16 form became impractical for most consumer GPUs and many datacenter configurations. Several quantization toolkits emerged in response, each targeting a particular workload: GPTQ (2022) and AWQ (2023) focused on calibration-based weight-only quantization for transformer LLMs; bitsandbytes added 8-bit (LLM.int8) and 4-bit (NF4 and FP4) support optimized for QLoRA fine-tuning; llama.cpp's GGML/GGUF formats provided CPU-friendly k-quants; and PyTorch itself shipped both eager-mode and FX graph-mode quantization in its core library.[^13][^10] By early 2024, Hugging Face had identified a gap in this landscape: there was no PyTorch-native, modality-agnostic quantization library that worked equally well on CPU, CUDA, MPS, and Intel XPU, supported both weight and activation quantization, and integrated cleanly with the Transformers and Diffusers ecosystems. Quanto was designed to fill that gap.[^2][^4]
Quanto was first announced publicly on 2024-03-18 in a Hugging Face blog post titled "Quanto: a PyTorch quantization backend for Optimum," authored by David Corvoysier, Younes Belkada, and Marc Sun.[^2] At the time of announcement the package was distributed on PyPI as plain quanto and lived in the huggingface/quanto repository. David Corvoysier, a Senior Software and Machine Learning Engineer who joined Hugging Face in June 2023 from BrainChip, was the primary author and continues to maintain the project.[^7] Coverage at launch by industry press described Quanto as "a Python quantization toolkit to reduce the computational and memory costs of evaluating deep learning models" and emphasized its support for int2, int4, int8, and float8 weights together with int8 and float8 activations.[^8]
The motivation for building Quanto, articulated in the launch blog post, was that "recent quantization methods appear to be focused on quantizing Large Language Models (LLMs), whereas quanto intends to provide extremely simple quantization primitives for simple quantization schemes (linear quantization, per-group quantization) that are adaptable across any modality."[^2] In other words, where libraries such as GPTQ, AWQ and bitsandbytes were optimized for transformer-based LLMs, Quanto was designed to be modality-agnostic so that the same primitives could be applied to vision models, audio models such as Whisper, and diffusion pipelines.[^2][^4]
The project was renamed and rehomed within the Optimum ecosystem in mid-2024. According to the official release notes, version 0.2.1, published on 2024-05-31, was "the first one with the new package name," at which point the repository moved to huggingface/optimum-quanto and the PyPI package was published under optimum-quanto.[^3] The original quanto package on PyPI was effectively superseded; subsequent imports use the optimum.quanto namespace.[^1][^4]
Several feature-bearing releases followed. Version 0.2.0 (2024-05-24) added a requantize helper, a Stable Diffusion example, an improved linear backward implementation, and AWQ int4 kernels.[^3] Version 0.2.3 (2024-07-25) introduced an HQQ optimizer, a QuantizedModelForCausalLM wrapper, and command-line integration via optimum-cli.[^3]
On 2024-07-30, Sayak Paul and David Corvoysier published a follow-up blog post, "Memory-efficient Diffusion Transformers with Quanto and Diffusers," demonstrating use of Quanto with PixArt-Sigma, Stable Diffusion 3, Aura Flow, Hunyuan DiT, Lumina, and Latte.[^5] The post showed memory reductions from 18.765 GB to roughly 8.2 GB for Stable Diffusion 3 when applying qfloat8 weight quantization to both the transformer backbone and selected text encoders.[^5] These results were obtained on an NVIDIA H100 with CUDA 12.2 and PyTorch 2.4.0.[^5] Subsequently, an official Diffusers QuantoConfig backend was added that allows quantization to be applied at from_pretrained time on Linear layers within DiT models, including the FluxTransformer2DModel for the FLUX.1 family.[^9]
As of late 2024 and through 2025, the project README and the Hugging Face team have publicly stated that Quanto is in maintenance mode. The repository description states that the project "is currently in maintenance mode" and accepts "pull requests only for minor bug fixes, documentation improvements, and other maintenance tasks," with major new features or breaking changes unlikely to be merged.[^1][^6] The maintainers recommend bitsandbytes and torchao as alternatives for production-ready features and active development.[^1] The most recent release at the time of writing, v0.2.7 (2025-03-06), bumped the minimum PyTorch version to 2.6, fixed CUDA extension compilation on non-Linux systems, and resolved state-dictionary access after activation quantization.[^3]
Quanto was designed around two stated goals: versatility and simplicity.[^2][^4] The HF Optimum documentation lists four headline design properties: all features are available in eager mode (so the library works on non-traceable models), the system supports quantization-aware training, quantized models are compatible with torch.compile, and quantized models are device-agnostic across CUDA, XPU, MPS, and CPU.[^4][^9]
Rather than separating dynamic and static quantization with distinct APIs (as is conventional in PyTorch's native quantization workflow), Quanto employs a unified flow in which models are dynamically quantized by default and may later be "frozen" so that weights are stored as their quantized integer or float8 representations.[^2][^8] This avoids the upfront need for graph capture, calibration datasets, or explicit QuantStub/DeQuantStub insertion that is required by some PyTorch FX graph mode quantization workflows.[^2]
Standard PyTorch quantization offers three interfaces: eager mode quantization, FX graph mode quantization, and PyTorch 2 Export quantization. In eager mode, quantization is performed by module swapping and requires the user to manually insert quantization and dequantization stubs in the forward function. In FX graph mode, the tracer inspects executed code and automatically inserts observers, quantize, and dequantize operations.[^10]
Quanto is fundamentally an eager-mode library: it does not require symbolic tracing of the model graph and therefore works with control-flow-heavy or non-traceable architectures, which historically have been awkward in FX graph mode.[^2][^4] At the same time, Quanto automates much of the boilerplate normally associated with eager-mode quantization: a single quantize() call walks the module tree and replaces supported submodules (nn.Linear, nn.Conv2d, nn.LayerNorm) with Quanto's quantization-aware equivalents (QLinear, QConv2D, quantized LayerNorm).[^1][^6] Calls to freeze() then realize the integer or float8 weights, and quantization_map() produces metadata enabling round-trip serialization through standard torch.save/load or Safetensors.[^2]
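The module-swapping flow described above can be illustrated with a toy, standard-library-only sketch. It does not use Quanto itself: the Linear and QLinear dataclasses, the nested-dict "module tree", and the quantize and freeze helpers are hypothetical stand-ins that only mirror the names in the text, and the symmetric int8 rule is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Linear:
    """Stand-in for a float layer (hypothetical, not torch.nn.Linear)."""
    weight: list[float]


@dataclass
class QLinear:
    """Toy quantization-aware replacement that keeps float weights until frozen."""
    weight: list[float]
    frozen: bool = False
    qweight: list[int] = field(default_factory=list)
    scale: float = 1.0

    def _quantize(self):
        # Symmetric int8 (assumed): the largest magnitude maps to 127.
        scale = max(abs(w) for w in self.weight) / 127 or 1.0
        return [round(w / scale) for w in self.weight], scale

    def effective_weight(self):
        # Before freezing, quantize on the fly; afterwards reuse the stored integers.
        q, s = (self.qweight, self.scale) if self.frozen else self._quantize()
        return [v * s for v in q]

    def freeze(self):
        self.qweight, self.scale = self._quantize()
        self.weight = []  # the float copy is dropped once integers are stored
        self.frozen = True


def quantize(model: dict) -> None:
    """Walk a nested-dict module tree and swap Linear for QLinear in place."""
    for name, child in model.items():
        if isinstance(child, dict):
            quantize(child)
        elif isinstance(child, Linear):
            model[name] = QLinear(child.weight)


def freeze(model: dict) -> None:
    """Replace float weights with their quantized form in every QLinear."""
    for child in model.values():
        if isinstance(child, dict):
            freeze(child)
        elif isinstance(child, QLinear):
            child.freeze()


model = {"block": {"proj": Linear([0.5, -1.0, 0.25])}, "norm": "unchanged"}
quantize(model)
before = model["block"]["proj"].effective_weight()  # quantized on the fly
freeze(model)
after = model["block"]["proj"].effective_weight()   # read from stored integers
```

In this sketch the weights a layer effectively uses are identical before and after freezing; only the storage changes, which is the point of the unified dynamic-then-frozen flow.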
At the core of Quanto is a custom torch.Tensor subclass that holds quantized data together with a scale (and, for some formats, a zero point or group structure) and projects floating-point values into the destination integer or float8 range so as to minimize saturation and zeroing.[^4][^6] When a QLinear is invoked, the weight is read in its low-precision representation, the matmul is performed with kernels that accept the chosen precision pair (for example bf16-int4 or int8-int8), and the integer accumulator (typically int32) is dequantized back to the activation dtype before being returned.[^1][^2]
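As a rough illustration of that data path, the following pure-Python sketch quantizes two vectors with a symmetric per-tensor scale, accumulates their product in integers, and dequantizes once at the end. The function names and the symmetric scaling rule are assumptions for illustration, not Quanto's actual kernels or tensor subclass.

```python
def quantize_symmetric(values, bits=8):
    """Return (ints, scale) with the largest magnitude mapped to the top of the range."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    # Clamp so rounding can never leave the representable range.
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale


def int_dot(weights, activations):
    """Quantize both operands, accumulate in integers, dequantize the result."""
    qw, sw = quantize_symmetric(weights)
    qa, sa = quantize_symmetric(activations)
    acc = sum(w * a for w, a in zip(qw, qa))  # plays the role of the int32 accumulator
    return acc * sw * sa                      # one multiplication by both scales


weights = [0.12, -0.5, 0.33, 0.9]
activations = [1.0, 0.25, -0.75, 0.5]
exact = sum(w * a for w, a in zip(weights, activations))
approx = int_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Because the scales factor out of the sum, the dequantization costs a single multiplication per output element regardless of the inner dimension.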
The library follows two conventions that match common practice in model compression:[^11]
The current dtype matrix is summarized below; the Python types are exposed as qint2, qint4, qint8, and qfloat8 (with sub-variants qfloat8_e4m3fn and qfloat8_e5m2) under optimum.quanto.[^4][^12]
| Surface | Supported types | Notes |
|---|---|---|
| Weights | qint2, qint4, qint8, qfloat8 | qfloat8 supports E4M3 and E5M2 sub-formats[^12] |
| Activations | qint8, qfloat8 | Per-tensor static scales (default range [-1, 1])[^6] |
| Bias | not quantized | Preserves accumulator accuracy[^6] |
| Modules | nn.Linear, nn.Conv2d, nn.LayerNorm | Replaced by QLinear, QConv2D, quantized LayerNorm[^1][^6] |
CUDA kernels are provided for accelerated matrix multiplications in the combinations int8-int8, fp16-int4, bf16-int8, and bf16-int4.[^1][^4] qfloat8 is not supported on MPS at the time of writing.[^2] qint4 is restricted to bfloat16 activations on H100-class hardware and offers larger memory savings but, lacking native int4 compute on most GPUs, can increase inference latency relative to qint8.[^12]
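The practical difference between the two float8 sub-formats is the trade-off between range and precision. The sketch below decodes 8-bit patterns by hand, assuming the widely used FP8 bit layouts (the same ones behind PyTorch's float8_e4m3fn and float8_e5m2 dtypes); that Quanto's qfloat8 variants map onto exactly these layouts is an assumption, and decode_fp8 is a hypothetical helper.

```python
import math


def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one 8-bit float given its exponent/mantissa split."""
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_all_ones = (1 << exp_bits) - 1
    if ieee_specials and exp == exp_all_ones:
        # E5M2 keeps IEEE-style infinities and NaNs.
        return sign * float("inf") if man == 0 else float("nan")
    if not ieee_specials and exp == exp_all_ones and man == (1 << man_bits) - 1:
        # E4M3FN has no infinities; only the all-ones pattern is NaN.
        return float("nan")
    if exp == 0:
        # Subnormal numbers: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


E4M3FN = dict(exp_bits=4, man_bits=3, bias=7, ieee_specials=False)
E5M2 = dict(exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


def max_finite(fmt):
    """Largest finite value over all 256 bit patterns of a format."""
    values = (decode_fp8(b, **fmt) for b in range(256))
    return max(v for v in values if math.isfinite(v))


print(max_finite(E4M3FN), max_finite(E5M2))
```

Under these layouts E4M3 tops out at 448 with three mantissa bits, while E5M2 reaches 57344 with only two, so E4M3 resolves values more finely inside a much narrower range.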
The canonical Quanto workflow exposed in the README and blog post consists of five steps, of which two are optional:[^1][^2]
1. Quantize: from optimum.quanto import quantize, qint8 followed by quantize(model, weights=qint8, activations=qint8) traverses the module tree and inserts quantization-aware modules. After this call the model is "dynamically quantized": floating-point weights remain stored at full precision but are quantized on the fly during forward passes.
2. Calibrate (optional): with Calibration(momentum=0.9): model(samples). The calibration context updates an exponential moving average of activation ranges using representative input batches.[^1]
3. Tune (optional): quantization-aware training is performed by running the model in train() mode and back-propagating through output.dequantize(). This can recover accuracy lost in post-training quantization, especially for aggressive qint4 or qint2 settings.[^1][^2]
4. Freeze: freeze(model) replaces the stored floating-point weights with their quantized integer or float8 representations. After freezing, weight storage shrinks by the bit-width ratio (for example by roughly 4x for qint8 from fp32 or 2x from fp16, and by roughly 8x for qint4 from fp32).[^2]
5. Serialize: safetensors.torch.save_file(model.state_dict(), "model.safetensors") together with a quantization_map(model) JSON blob recording per-module dtypes and scales, allowing exact restoration on reload.[^2]

Quanto plugs into Hugging Face Transformers through the QuantoConfig class, which is then passed to from_pretrained():[^4][^8]
from transformers import AutoModelForCausalLM, QuantoConfig

quant_config = QuantoConfig(weights="int8")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    dtype="auto",
    device_map="auto",
    quantization_config=quant_config,
)
The weights argument accepts "int8", "int4", "int2", or "float8", and defaults to "int8". An activations argument allows None, "int8", or "float8". However, the Transformers integration historically only exposes weight quantization: the documentation explicitly notes that "the Transformers integration only supports weight quantization. Use the Quanto library directly if you need activation quantization, calibration, or QAT."[^4]
The same pattern works for audio models such as Whisper:[^1]
import torch
from transformers import AutoModelForSpeechSeq2Seq, QuantoConfig

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=QuantoConfig(weights="int8"),
)
Quanto-quantized models are also compatible with torch.compile for additional latency improvements, although on the Diffusers side compilation support is currently limited to int8 weights.[^4][^9]
In Diffusers, Quanto serves as one of several quantization backends for diffusion transformer models. A QuantoConfig (re-exported as diffusers.QuantoConfig) accepts a weights_dtype argument and is passed into model from_pretrained calls:[^9]
import torch
from diffusers import FluxTransformer2DModel, QuantoConfig

quantization_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
Diffusers' Quanto backend currently quantizes only nn.Linear modules, even though the underlying library can also handle nn.Conv2d and nn.LayerNorm. A modules_to_not_convert argument allows users to exclude specific layers (commonly proj_out) to preserve image quality, particularly when using qint4.[^9] The backend also supports from_single_file loading, save_pretrained serialization, and integration with PEFT-based LoRA training on large models.[^9]
For static activation quantization, Quanto exposes a Calibration context manager that observes activation tensors during forward passes and updates per-tensor scales using an exponential moving average controlled by a momentum parameter:[^1][^2]
from optimum.quanto import Calibration

# Forward passes inside the context update the activation scales
with Calibration(momentum=0.9):
    model(samples)
Because activations are quantized per-tensor by default, large outlier values can cause significant quantization error; calibration with representative data is therefore particularly important for aggressive activation precisions such as qint8.[^11]
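To make the role of the momentum parameter concrete, here is a small sketch of an exponential moving average over per-batch absolute maxima. The RangeTracker class is hypothetical, and the update rule shown (new = momentum * old + (1 - momentum) * batch) is the conventional EMA form, assumed here rather than taken from Quanto's source.

```python
class RangeTracker:
    """Toy per-tensor activation range estimator using an exponential moving average."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.amax = None  # running estimate of the largest absolute activation

    def update(self, batch):
        batch_amax = max(abs(x) for x in batch)
        if self.amax is None:
            # The first batch initializes the estimate directly.
            self.amax = batch_amax
        else:
            self.amax = self.momentum * self.amax + (1 - self.momentum) * batch_amax
        return self.amax

    def scale(self, qmax=127):
        """Static scale that maps the tracked range onto the int8 range."""
        return self.amax / qmax


tracker = RangeTracker(momentum=0.9)
for batch in ([0.5, -1.0], [0.8, 0.2], [6.0, -0.1]):
    tracker.update(batch)
print(tracker.amax, tracker.scale())
```

With a momentum of 0.9, the single outlier of 6.0 in the last batch moves the tracked range to only about 1.48, so one unusual batch does not dominate the static scale. Values beyond the tracked range would saturate at quantization time.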
After calibration, fine-tuning under quantization-aware training is performed by running the model in train() mode and propagating gradients through the dequantized outputs:[^1][^2]
model.train()
for data, target in train_loader:
    optimizer.zero_grad()
    # Dequantize the quantized output so the loss is computed in floating point
    output = model(data).dequantize()
    loss = torch.nn.functional.nll_loss(output, target)
    loss.backward()
    optimizer.step()
This can recover accuracy that is often lost at low-bit weight precisions, and it can be combined with the HQQ optimizer introduced in version 0.2.3.[^3]
Frozen Quanto models can be saved either via the standard torch.save path or, more commonly, via Safetensors. Two artifacts are persisted: the quantized state dictionary, which holds integer or float8 weight tensors together with scales and zero points, and a JSON quantization_map describing per-module dtype and configuration. The pair can be restored without re-running quantization, enabling distribution of compressed checkpoints. The Diffusers Quanto blog measured a checkpoint-size reduction from 2.44 GB to 587 MB (roughly a 76% reduction) for the PixArt-Sigma transformer when serialized with qfloat8 weights.[^5]
The same persistence scheme is used by higher-level helpers such as QuantizedModelForCausalLM, introduced in version 0.2.3, which packages a quantized causal language model together with its tokenizer-compatible config and exposes a quantize class method analogous to Transformers' own from_pretrained constructor. A symmetric QuantizedPixArtTransformer2DModel helper exists on the Diffusers side, demonstrated in the official blog post for PixArt-Sigma.[^3][^5]
The introduction blog post reports perplexity benchmarks on Meta-Llama-3.1-8B for several quantization configurations without applying post-training-optimization algorithms such as HQQ or AWQ, alongside latency measurements taken on an NVIDIA A10 GPU.[^2] The takeaway, as stated by the authors, is that "linear quantization for weights (float8, int8, int4, int2) with accuracy very similar to full-precision models" is achievable using only Quanto's default linear quantization scheme.[^4]
For diffusion transformers, the Diffusers integration blog reports the following memory footprints on an H100, with full benchmark scripts published in an accompanying dataset repository:[^5]
| Model | Configuration | GPU memory |
|---|---|---|
| PixArt-Sigma 0.611B | FP16, no quantization | 12.087 GB |
| PixArt-Sigma 0.611B | FP8 weights (transformer only) | 11.548 GB |
| PixArt-Sigma 0.611B | FP8 weights (transformer + text encoder) | 5.364 GB |
| Stable Diffusion 3 2.028B | FP16, no quantization | 18.765 GB |
| Stable Diffusion 3 2.028B | FP8 weights (transformer + TE1 + TE3) | roughly 8.2 GB |
Latency overheads in the same benchmark are modest for FP8: for example, PixArt-Sigma latency rises from 4.482 seconds unquantized to 5.141 seconds with both transformer and text encoder quantized.[^5]
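The relative savings implied by the table can be worked out directly. The snippet below is plain arithmetic on the reported figures; the reduction helper and the dictionary labels are names introduced here for illustration, and the Stable Diffusion 3 entry uses the approximate 8.2 GB value.

```python
def reduction(before_gb, after_gb):
    """Percentage of memory saved when going from before_gb to after_gb."""
    return 100 * (1 - after_gb / before_gb)


# (unquantized GB, quantized GB) pairs taken from the table above
footprints = {
    "PixArt-Sigma, transformer only": (12.087, 11.548),
    "PixArt-Sigma, transformer + text encoder": (12.087, 5.364),
    "Stable Diffusion 3, transformer + TE1 + TE3": (18.765, 8.2),
}

savings = {name: round(reduction(*pair), 1) for name, pair in footprints.items()}
for name, percent in savings.items():
    print(f"{name}: {percent}% less GPU memory")
```

Quantizing only the PixArt-Sigma transformer saves under 5%, while also quantizing the text encoder saves roughly 56%, which suggests that most of that pipeline's footprint sits in the text encoder rather than in the 0.611B-parameter transformer.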
CUDA accelerated kernels exist for int8-int8, fp16-int4, bf16-int8, and bf16-int4 matmuls.[^1][^4] On hardware without native int4 compute (which includes most consumer GPUs and pre-H100 datacenter GPUs), int4 quantization saves memory but does not generally improve latency over int8 because the int4 values must be unpacked to a wider dtype before computation.[^12] On Apple Silicon MPS, Quanto runs in int8 and int4 modes but does not currently support qfloat8 (the library raises an error in this configuration).[^2]
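The unpacking step mentioned above can be sketched in a few lines. This example stores two signed 4-bit values per byte and expands them back to ordinary integers before use. The low-nibble-first layout and the pack_int4 and unpack_int4 helpers are assumptions for illustration; the text does not describe Quanto's actual packing format.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Expand packed nibbles back to signed integers, dropping any padding."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 represent the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


values = [-8, 7, 0, -1, 3]
packed = pack_int4(values)
restored = unpack_int4(packed, len(values))
print(len(values), "values stored in", len(packed), "bytes")
```

The storage halves relative to int8, but every value passes through the extra shift-and-mask step before it can take part in a multiplication. That is why the memory saving does not automatically translate into lower latency.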
The library's accelerated CUDA path was extended over time. Version 0.2.0 added AWQ-style int4 kernels, and subsequent releases tied the Quanto build to native CUDA extensions, which is why v0.2.7 raised the minimum supported PyTorch version to 2.6 and fixed extension compilation on non-Linux systems.[^3] For Intel XPU, the library relies on PyTorch's XPU backend and has shipped testing fixes specifically for this device in the latest release.[^3] The repository's language composition is approximately 70% Python, 27% CUDA, and 2% C++, reflecting both the high-level orchestration in Python and the kernel-heavy nature of the CUDA dispatch path.[^1]
Quanto is one of several quantization backends shipping with Hugging Face Transformers. The table below summarizes how it compares to the other natively supported backends.[^4][^13]
| Backend | Weight precisions | Activation quantization | Calibration | Modality scope | Devices |
|---|---|---|---|---|---|
| Quanto | int2, int4, int8, float8 | int8, float8 (library only) | Optional, EMA-based | Modality-agnostic[^1] | CPU, CUDA, MPS, XPU[^4] |
| bitsandbytes | 8-bit (LLM.int8), 4-bit (NF4/FP4) | No | Zero-shot, no calibration | Linear layers in any modality | CUDA-focused[^13] |
| AutoGPTQ (GPTQ) | 2 to 8 bits | No | Requires calibration set | Primarily LLMs | CUDA, AMD via ROCm[^13] |
| AutoAWQ (AWQ) | 4-bit | No | Activation-aware calibration | Primarily LLMs | CUDA[^13] |
The key differentiators of Quanto, relative to the other backends in the table above, are:

- Activation quantization: Quanto supports activation quantization in qint8 and qfloat8, though this is currently exposed only via the standalone optimum.quanto API and not via QuantoConfig in Transformers.[^4]
- PyTorch-native execution: Quanto runs in PyTorch eager mode and is compatible with torch.compile. This gives it broader compatibility but means it does not match the raw throughput of ExLlamaV2, llama.cpp's GGML tensors, or AWQ's CUDA kernels for pure LLM inference.[^4][^13]

Compared with NormalFloat 4-bit (NF4) in bitsandbytes, which targets QLoRA fine-tuning of LLMs, Quanto's qint4 is a linear quantizer rather than a non-uniform one and lacks NF4's optimization for normally-distributed weights, but it is available on CPU and MPS, not only on CUDA.[^4][^13]
Quanto's most common application is weight-only post-training quantization of decoder LLMs. Through the Transformers QuantoConfig path, models such as Llama 3.1, OPT-125m, and Mistral can be loaded in qint8 or qint4 with a single line of code, reducing GPU memory by roughly 2x or 4x relative to fp16 without the need for a calibration dataset.[^4][^8] The blog announcement demonstrated this with Meta-Llama-3.1-8B on a single A10.[^2]
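The 2x and 4x figures follow from the bit-width ratio alone. A back-of-the-envelope estimate, assuming a round 8 billion parameters and ignoring scales, unquantized layers, activations, and the KV cache (all of which add to the real footprint), looks like this; weight_gib is a helper name introduced here.

```python
def weight_gib(num_params, bits):
    """Approximate weight storage in GiB for a given bit width."""
    return num_params * bits / 8 / 2**30


params = 8_000_000_000  # assumed round figure for an 8B-parameter model
sizes = {
    name: weight_gib(params, bits)
    for name, bits in [("fp16", 16), ("qint8", 8), ("qint4", 4)]
}
ratios = {name: sizes["fp16"] / size for name, size in sizes.items()}

for name in sizes:
    print(f"{name}: {sizes[name]:.2f} GiB ({ratios[name]:.0f}x smaller than fp16)")
```

Measured savings are usually somewhat smaller than these ratios because per-channel or per-group scales and any modules left in full precision are stored alongside the quantized weights.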
The original launch examples explicitly include Whisper, demonstrating int8 weight quantization of openai/whisper-large-v3 for memory-constrained audio transcription deployments.[^1]
Memory pressure on consumer GPUs is a particular concern for DiT models such as Stable Diffusion 3, FLUX.1, PixArt-Sigma, Hunyuan DiT, Lumina, Aura Flow, and Latte. Quanto's qfloat8 weight quantization, applied to both the transformer backbone and the text encoders, has been shown to bring SD3 inference from 18.765 GB down to roughly 8.2 GB on an H100 with negligible quality loss in standard sampling.[^5][^9] For more aggressive compression, qint4 with selected layers excluded (commonly proj_out) brings PixArt-Sigma inference memory down to about 3 GB.[^5]
Because Quanto operates on generic nn.Linear, nn.Conv2d, and nn.LayerNorm modules rather than transformer-specific patterns, it has been applied to convolutional vision backbones such as VGG-19 (though with mixed VRAM-savings results that are tracked as open issues) and is recommended by the Hugging Face docs whenever the task spans more than one modality.[^1][^12]
The MPS backend is one of the design points that distinguishes Quanto from CUDA-only libraries. Because the library uses generic PyTorch tensors and dispatches through standard PyTorch operations, models quantized to qint8 or qint4 weights can be loaded and executed on Apple Silicon hardware through PyTorch's Metal Performance Shaders backend without additional conversion steps. The intended use cases include local LLM inference on Mac laptops and prototyping pipelines on developer machines, complementing Hugging Face's broader push for on-device deployment via projects like SmolLM and the Apple-specific Core ML and MLX ecosystems.[^1][^4]
- Although the library supports qint8 and qfloat8 activation quantization with optional calibration and QAT, the QuantoConfig in Transformers exposes only weight quantization.[^4]
- qint4 produces memory savings but no clear latency gains, because the int4 weights must be unpacked to bf16 or fp16 before matmul.[^12]
- qfloat8 is not supported on Apple MPS at present and raises an error.[^2]
- The Diffusers backend quantizes only nn.Linear layers, supports torch.compile only for int8 weights, and does not allow loading models that were quantized directly with the standalone Quanto library through ModelMixin.from_pretrained.[^9]

Quanto sits alongside several other quantization toolkits in the Hugging Face ecosystem, each with different design tradeoffs:

- bitsandbytes: 8-bit (LLM.int8) and 4-bit (NF4/FP4) quantization of any nn.Linear layer, with strong support for QLoRA fine-tuning on CUDA.[^13]

Quanto's distinguishing combination of features is the union of PyTorch-native eager-mode operation, broad device support including CPU and MPS, modality-agnostic primitives, and both weight and activation quantization in the same library.[^1][^4]