Optimum-Quanto

Optimum Quanto, commonly referred to as Quanto, is a PyTorch-based quantization toolkit developed and maintained by Hugging Face that provides linear weight and activation quantization across multiple devices and modalities.[^1][^2] Originally introduced in March 2024 under the package name quanto, the library was renamed to optimum-quanto in May 2024 and moved into the Hugging Face Optimum ecosystem as a dedicated quantization backend.[^2][^3] Quanto supports weight quantization in two-bit, four-bit, and eight-bit integer formats as well as eight-bit floating point (qint2, qint4, qint8, qfloat8), and activation quantization in qint8 and qfloat8, with execution on CPU, NVIDIA CUDA, Intel XPU, and Apple Silicon Metal Performance Shaders (MPS) backends.[^1][^4] The library is integrated into Hugging Face Transformers through the QuantoConfig class, and into Hugging Face Diffusers for Diffusion Transformer models such as FLUX and Stable Diffusion 3.[^4][^5] As of version 0.2.7 released on 2025-03-06, the project is in maintenance mode, with the Hugging Face team recommending bitsandbytes or torchao for active development.[^3][^6]

Infobox

Field	Value
Developer	Hugging Face Inc., Special Ops Team[^6]
Lead author	David Corvoysier (dacorvo)[^2][^7]
Initial blog announcement	2024-03-18[^2]
Initial PyPI release of `quanto`	February 2024 (pre-rename)[^3]
Rename to `optimum-quanto`	2024-05-31 (v0.2.1)[^3]
Latest release covered	v0.2.7 (2025-03-06)[^3][^6]
Repository	github.com/huggingface/optimum-quanto[^1]
License	Apache 2.0[^1][^6]
Language	Python (~70%), CUDA (~27%), C++ (~2%)[^1]
Minimum PyTorch	2.6 (as of v0.2.7)[^3]
Minimum Python	3.9[^6]
Status	Maintenance mode[^1][^6]

Background

The broader context for Quanto is the rapid expansion of quantization as a deployment technique for large neural networks during 2022 to 2024. As model parameter counts climbed past tens of billions, fitting weights and activations into accelerator memory in their original fp32 or fp16 form became impractical for most consumer GPUs and many datacenter configurations. Several quantization toolkits emerged in response, each targeting a particular workload: GPTQ (2022) and AWQ (2023) focused on calibration-based weight-only quantization for transformer LLMs; bitsandbytes added 8-bit (LLM.int8) and 4-bit (NF4 and FP4) support optimized for QLoRA fine-tuning; llama.cpp's GGML/GGUF formats provided CPU-friendly k-quants; and PyTorch itself shipped both eager-mode and FX graph-mode quantization in its core library.[^13][^10] By early 2024, Hugging Face had identified a gap in this landscape: there was no PyTorch-native, modality-agnostic quantization library that worked equally well on CPU, CUDA, MPS, and Intel XPU, supported both weight and activation quantization, and integrated cleanly with the Transformers and Diffusers ecosystems. Quanto was designed to fill that gap.[^2][^4]

History

Origin and initial release

Quanto was first announced publicly on 2024-03-18 in a Hugging Face blog post titled "Quanto: a PyTorch quantization backend for Optimum," authored by David Corvoysier, Younes Belkada, and Marc Sun.[^2] At the time of announcement the package was distributed on PyPI as plain quanto and lived in the huggingface/quanto repository. David Corvoysier, a Senior Software and Machine Learning Engineer who joined Hugging Face in June 2023 from BrainChip, was the primary author and continues to maintain the project.[^7] Coverage at launch by industry press described Quanto as "a Python quantization toolkit to reduce the computational and memory costs of evaluating deep learning models" and emphasized its support for int2, int4, int8, and float8 weights together with int8 and float8 activations.[^8]

The motivation for building Quanto, articulated in the launch blog post, was that "recent quantization methods appear to be focused on quantizing Large Language Models (LLMs), whereas quanto intends to provide extremely simple quantization primitives for simple quantization schemes (linear quantization, per-group quantization) that are adaptable across any modality."[^2] In other words, where libraries such as GPTQ, AWQ and bitsandbytes were optimized for transformer-based LLMs, Quanto was designed to be modality-agnostic so that the same primitives could be applied to vision models, audio models such as Whisper, and diffusion pipelines.[^2][^4]

Rename to optimum-quanto

The project was renamed and rehomed within the Optimum ecosystem in mid-2024. According to the official release notes, version 0.2.1, published on 2024-05-31, was "the first one with the new package name," at which point the repository moved to huggingface/optimum-quanto and the PyPI package was published under optimum-quanto.[^3] The original quanto package on PyPI was effectively superseded; subsequent imports use the optimum.quanto namespace.[^1][^4]

Several feature-bearing releases bracketed the rename. Version 0.2.0 (2024-05-24), published a week before it, added a requantize helper, a Stable Diffusion example, an improved linear backward implementation, and AWQ int4 kernels.[^3] Version 0.2.3 (2024-07-25) introduced an HQQ optimizer, a QuantizedModelForCausalLM wrapper, and command-line integration via optimum-cli.[^3]

Diffusers integration and DiT support

On 2024-07-30, Sayak Paul and David Corvoysier published a follow-up blog post, "Memory-efficient Diffusion Transformers with Quanto and Diffusers," demonstrating use of Quanto with PixArt-Sigma, Stable Diffusion 3, Aura Flow, Hunyuan DiT, Lumina, and Latte.[^5] The post showed memory reductions from 18.765 GB to roughly 8.2 GB for Stable Diffusion 3 when applying qfloat8 weight quantization to both the transformer backbone and selected text encoders.[^5] These results were obtained on an NVIDIA H100 with CUDA 12.2 and PyTorch 2.4.0.[^5] Subsequently, an official Diffusers QuantoConfig backend was added that allows quantization to be applied at from_pretrained time on Linear layers within DiT models, including the FluxTransformer2DModel for the FLUX.1 family.[^9]

Maintenance status

As of late 2024 and through 2025, the project README and the Hugging Face team have publicly stated that Quanto is in maintenance mode. The repository description states that the project "is currently in maintenance mode" and accepts "pull requests only for minor bug fixes, documentation improvements, and other maintenance tasks," with major new features or breaking changes unlikely to be merged.[^1][^6] The maintainers recommend bitsandbytes and torchao as alternatives for production-ready features and active development.[^1] The most recent release at the time of writing, v0.2.7 (2025-03-06), bumped the minimum PyTorch version to 2.6, fixed CUDA extension compilation on non-Linux systems, and resolved state-dictionary access after activation quantization.[^3]

Technical Details

Design philosophy

Quanto was designed around two stated goals: versatility and simplicity.[^2][^4] The HF Optimum documentation lists four headline design properties: all features are available in eager mode (so the library works on non-traceable models), the system supports quantization-aware training, quantized models are compatible with torch.compile, and quantized models are device-agnostic across CUDA, XPU, MPS, and CPU.[^4][^9]

Rather than separating dynamic and static quantization with distinct APIs (as is conventional in PyTorch's native quantization workflow), Quanto employs a unified flow in which models are dynamically quantized by default and may later be "frozen" so that weights are stored as their quantized integer or float8 representations.[^2][^8] This avoids the upfront need for graph capture (as in PyTorch FX graph mode quantization), calibration datasets, or the explicit QuantStub/DeQuantStub insertion required by PyTorch eager mode quantization.[^2]

Eager mode versus FX graph mode

Standard PyTorch quantization offers three interfaces: eager mode quantization, FX graph mode quantization, and PyTorch 2 Export quantization. In eager mode, quantization is performed by module swapping and requires the user to manually insert quantization and dequantization stubs in the forward function. In FX graph mode, the tracer inspects executed code and automatically inserts observers, quantize, and dequantize operations.[^10]

Quanto is fundamentally an eager-mode library: it does not require symbolic tracing of the model graph and therefore works with control-flow-heavy or non-traceable architectures, which historically have been awkward in FX graph mode.[^2][^4] At the same time, Quanto automates much of the boilerplate normally associated with eager-mode quantization: a single quantize() call walks the module tree and replaces supported submodules (nn.Linear, nn.Conv2d, nn.LayerNorm) with Quanto's quantization-aware equivalents (QLinear, QConv2D, quantized LayerNorm).[^1][^6] Calls to freeze() then realize the integer or float8 weights, and quantization_map() produces metadata enabling round-trip serialization through standard torch.save/load or Safetensors.[^2]

Tensor subclass and scale handling

At the core of Quanto is a custom torch.Tensor subclass that holds quantized data together with a scale (and, for some formats, a zero point or group structure) and projects floating-point values into the destination integer or float8 range so as to minimize saturation and zeroing.[^4][^6] When a QLinear is invoked, the weight is read in its low-precision representation, the matmul is performed with kernels that accept the chosen precision pair (for example bf16-int4 or int8-int8), and the integer accumulator (typically int32) is dequantized back to the activation dtype before being returned.[^1][^2]
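The data flow in the paragraph above (low-precision operands, a wide integer accumulator, one dequantization at the end) can be illustrated with a small pure-Python sketch. It is not Quanto's implementation, which relies on tensor subclasses and dedicated kernels; the helper names `quantize_symmetric` and `int8_dot` are hypothetical, and symmetric absmax scaling is assumed for simplicity.

```python
# Toy int8 x int8 dot product with an integer accumulator, followed by
# dequantization. Helper names are hypothetical and symmetric absmax
# scaling is an assumption; this is not Quanto's actual code.

def quantize_symmetric(values, bits=8):
    """Map floats to signed integers that share a single scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def int8_dot(q_a, scale_a, q_b, scale_b):
    """Accumulate in integers, then rescale once at the end."""
    acc = sum(a * b for a, b in zip(q_a, q_b))      # stands in for the int32 accumulator
    return acc * scale_a * scale_b                  # dequantize to float

weights = [0.5, -1.0, 0.25, 0.75]
activations = [1.0, 0.5, -2.0, 0.1]

q_w, s_w = quantize_symmetric(weights)
q_x, s_x = quantize_symmetric(activations)

exact = sum(w * x for w, x in zip(weights, activations))
approx = int8_dot(q_w, s_w, q_x, s_x)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The two results differ only by rounding error, because each operand was snapped to one of 255 integer levels before the multiply.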

The library follows three conventions that match common practice in model compression:[^11]

Weights in linear and convolutional layers are quantized per-channel along the output-features axis, so each output feature has its own scale.
Activations are quantized per-tensor with static scales, because most linear-algebra operations downstream of an activation are not compatible with per-axis inputs.
Biases are left in their original precision to preserve arithmetic accuracy.[^6][^11]
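To see why weights get one scale per output feature, the following sketch compares a single shared scale with per-row scales on a toy weight matrix whose rows differ widely in magnitude. It is a pure-Python illustration under assumed symmetric int8 scaling; the helper names are hypothetical and none of this is Quanto code.

```python
# Toy comparison of per-tensor and per-channel (per-output-row) int8 scales.
# Hypothetical helper names; symmetric absmax scaling is assumed.

QMAX = 127  # largest magnitude used for symmetric int8

def absmax_scale(values):
    m = max(abs(v) for v in values)
    return m / QMAX if m > 0 else 1.0

def roundtrip(values, scale):
    """Quantize with the given scale, then dequantize."""
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in values]

def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

# Two output features (rows) with very different magnitudes.
weight = [
    [0.010, -0.020, 0.015],   # small-magnitude row
    [8.000, -6.000, 4.000],   # large-magnitude row
]

# Per-tensor: one scale shared by every row.
shared = absmax_scale([v for row in weight for v in row])
per_tensor_err = [max_error(row, roundtrip(row, shared)) for row in weight]

# Per-channel: each output row gets its own scale.
per_channel_err = [max_error(row, roundtrip(row, absmax_scale(row))) for row in weight]

print("per-tensor errors: ", per_tensor_err)
print("per-channel errors:", per_channel_err)
```

With the shared scale, every value in the small row rounds to zero, which is the "zeroing" failure mode that per-channel scales avoid.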

Supported quantization types

The current dtype matrix is summarized below; the Python types are exposed as qint2, qint4, qint8, and qfloat8 (with sub-variants qfloat8_e4m3fn and qfloat8_e5m2) under optimum.quanto.[^4][^12]

Surface	Supported types	Notes
Weights	`qint2`, `qint4`, `qint8`, `qfloat8`	`qfloat8` supports E4M3 and E5M2 sub-formats[^12]
Activations	`qint8`, `qfloat8`	Per-tensor static scales (default range [-1, 1])[^6]
Bias	not quantized	Preserves accumulator accuracy[^6]
Modules	`nn.Linear`, `nn.Conv2d`, `nn.LayerNorm`	Replaced by `QLinear`, `QConv2D`, quantized `LayerNorm`[^1][^6]

CUDA kernels are provided for accelerated matrix multiplications in the combinations int8-int8, fp16-int4, bf16-int8, and bf16-int4.[^1][^4] qfloat8 is not supported on MPS at the time of writing.[^2] qint4 is restricted to bfloat16 activations on H100-class hardware and offers larger memory savings but, lacking native int4 compute on most GPUs, can increase inference latency relative to qint8.[^12]
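The two float8 sub-formats trade mantissa bits for exponent bits, which changes their dynamic range. As a rough illustration, the sketch below derives the largest finite value of each layout from its bit fields, assuming the conventions of PyTorch's torch.float8_e4m3fn and torch.float8_e5m2 dtypes. The helper name `max_finite` is hypothetical.

```python
# Largest finite value of the two float8 layouts, derived from their bit
# fields. Assumes the conventions of torch.float8_e4m3fn / torch.float8_e5m2.

def max_finite(exp_bits, man_bits, bias, reserves_top_exponent):
    """Largest finite value of a small binary floating-point format.

    reserves_top_exponent=True:  the all-ones exponent encodes inf/NaN (E5M2).
    reserves_top_exponent=False: only all-ones exponent plus all-ones
                                 mantissa encodes NaN ("fn" style, E4M3FN).
    """
    top_exp = 2 ** exp_bits - 1
    if reserves_top_exponent:
        exp_field = top_exp - 1
        man_field = 2 ** man_bits - 1
    else:
        exp_field = top_exp
        man_field = 2 ** man_bits - 2   # all-ones mantissa is taken by NaN
    return 2.0 ** (exp_field - bias) * (1 + man_field / 2 ** man_bits)

e4m3fn_max = max_finite(exp_bits=4, man_bits=3, bias=7, reserves_top_exponent=False)
e5m2_max = max_finite(exp_bits=5, man_bits=2, bias=15, reserves_top_exponent=True)
print(e4m3fn_max, e5m2_max)  # 448.0 57344.0
```

E4M3 offers finer resolution over a narrower range, while E5M2 covers a much wider range with coarser steps.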

Five-step workflow

The canonical Quanto workflow exposed in the README and blog post consists of five steps, of which two are optional:[^1][^2]

Quantize. from optimum.quanto import quantize, qint8 followed by quantize(model, weights=qint8, activations=qint8) traverses the module tree and inserts quantization-aware modules. After this call the model is "dynamically quantized": floating-point weights remain stored at full precision but are quantized on the fly during forward passes.
Calibrate (optional). When activations are quantized, the static scales used per tensor are recorded by running the model under with Calibration(momentum=0.9): model(samples). The calibration context updates an exponential moving average of activation ranges using representative input batches.[^1]
Tune (optional). Quantization-aware training (QAT) is supported by running the model in train() mode and back-propagating through output.dequantize(). This can recover accuracy lost in post-training quantization, especially for aggressive qint4 or qint2 settings.[^1][^2]
Freeze. freeze(model) replaces the stored floating-point weights with their quantized integer or float8 representations. After freezing, weight storage shrinks by the bit-width ratio (for example by roughly 4x for qint8 from fp32 or 2x from fp16, and by roughly 8x for qint4 from fp32).[^2]
Serialize. Frozen models can be saved with safetensors.torch.save_file(model.state_dict(), "model.safetensors") together with a quantization_map(model) JSON blob recording per-module dtypes and scales, allowing exact restoration on reload.[^2]
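The bit-width ratios quoted in the freeze step can be reproduced with a few lines of arithmetic. The sketch below is a back-of-the-envelope estimate only: it ignores scales, zero points, and any layers left unquantized, and the 8-billion-parameter model size is chosen purely for illustration.

```python
# Back-of-the-envelope weight-storage arithmetic for the freeze step.
# Ignores scales, zero points, and unquantized layers, so real checkpoints
# are somewhat larger. Function names are illustrative.

BITS = {"fp32": 32, "fp16": 16, "qint8": 8, "qfloat8": 8, "qint4": 4, "qint2": 2}

def compression_ratio(src, dst):
    """How many times smaller the dst format is than the src format."""
    return BITS[src] / BITS[dst]

def weight_bytes(num_params, fmt):
    """Bytes needed to store num_params weights in the given format."""
    return num_params * BITS[fmt] // 8

for dst in ("qint8", "qint4", "qint2"):
    print(dst, "vs fp32:", compression_ratio("fp32", dst),
          "vs fp16:", compression_ratio("fp16", dst))

# Hypothetical 8-billion-parameter model.
params = 8_000_000_000
print(weight_bytes(params, "fp16") / 1e9, "GB in fp16")
print(weight_bytes(params, "qint4") / 1e9, "GB in qint4")
```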

Integration with Transformers

Quanto plugs into Hugging Face Transformers through the QuantoConfig class, which is then passed to from_pretrained():[^4][^8]

from transformers import AutoModelForCausalLM, QuantoConfig

quant_config = QuantoConfig(weights="int8")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    dtype="auto",
    device_map="auto",
    quantization_config=quant_config,
)

The weights argument accepts "int8", "int4", "int2", or "float8", and defaults to "int8". An activations argument allows None, "int8", or "float8". However, the Transformers integration historically only exposes weight quantization: the documentation explicitly notes that "the Transformers integration only supports weight quantization. Use the Quanto library directly if you need activation quantization, calibration, or QAT."[^4]

The same pattern works for audio models such as Whisper:[^1]

from transformers import AutoModelForSpeechSeq2Seq, QuantoConfig
import torch

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=QuantoConfig(weights="int8"),
)

Quanto-quantized models are also compatible with torch.compile for additional latency improvements, although on the Diffusers side compilation support is currently limited to int8 weights.[^4][^9]

Integration with Diffusers

In Diffusers, Quanto serves as one of several quantization backends for diffusion transformer models. A QuantoConfig (re-exported as diffusers.QuantoConfig) accepts a weights_dtype argument and is passed into model from_pretrained calls:[^9]

from diffusers import FluxTransformer2DModel, QuantoConfig
import torch

quantization_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

Diffusers' Quanto backend currently quantizes only nn.Linear modules, even though the underlying library can also handle nn.Conv2d and nn.LayerNorm. A modules_to_not_convert argument allows users to exclude specific layers (commonly proj_out) to preserve image quality, particularly when using qint4.[^9] The backend also supports from_single_file loading, save_pretrained serialization, and integration with PEFT-based LoRA training on large models.[^9]

Quantization Workflow Details

Calibration

For static activation quantization, Quanto exposes a Calibration context manager that observes activation tensors during forward passes and updates per-tensor scales using an exponential moving average controlled by a momentum parameter:[^1][^2]

from optimum.quanto import Calibration

with Calibration(momentum=0.9):
    model(samples)

Because activations are quantized per-tensor by default, large outlier values can cause significant quantization error; calibration with representative data is therefore particularly important for aggressive activation precisions such as qint8.[^11]
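The effect of momentum and of outliers on a per-tensor scale can be sketched in plain Python. The update rule used here (new scale = momentum × old scale + (1 − momentum) × observed scale) is an assumed convention for exponential moving averages, and the `EmaScale` class is hypothetical; it is not Quanto's Calibration class.

```python
# Toy exponential-moving-average calibration of a per-tensor int8 scale.
# The update rule and class name are assumptions made for illustration.

QMAX = 127

class EmaScale:
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.scale = None  # unset until the first batch is observed

    def observe(self, batch):
        """Update the running scale from one batch of activation values."""
        observed = max(abs(v) for v in batch) / QMAX
        if self.scale is None:
            self.scale = observed
        else:
            m = self.momentum
            self.scale = m * self.scale + (1 - m) * observed
        return self.scale

calib = EmaScale(momentum=0.9)
for batch in ([0.5, -1.0, 0.8], [0.9, -0.7, 1.0], [1.0, 0.2, -0.6]):
    calib.observe(batch)
typical = calib.scale

# One batch containing an outlier pulls the scale up, which coarsens the
# quantization grid for every ordinary value in the tensor.
with_outlier = calib.observe([0.4, 50.0, -0.3])
print(f"typical scale={typical:.5f}, after outlier={with_outlier:.5f}")
```

A higher momentum damps the influence of any single batch, which is one reason representative calibration data matters more than the number of batches.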

Quantization-aware training

After calibration, fine-tuning under quantization-aware training is performed by running the model in train() mode and propagating gradients through the dequantized outputs:[^1][^2]

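import torch

# model, train_loader, and optimizer are assumed to be defined already
model.train()
for data, target in train_loader:
    optimizer.zero_grad()
    # With quantized activations the output is a quantized tensor,
    # so dequantize it before computing the loss
    output = model(data).dequantize()
    loss = torch.nn.functional.nll_loss(output, target)
    loss.backward()
    optimizer.step()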

This recovers accuracy degradation that often appears with low-bit weight precisions and can be combined with HQQ-style optimizers introduced in version 0.2.3.[^3]
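One common way to train through a rounding step is the straight-through estimator (STE): the forward pass uses the quantized weight, while the backward pass treats rounding as the identity. The toy below applies that idea to a single scalar weight. It is a sketch of the general technique under assumed settings (grid spacing, learning rate, squared-error loss), not a description of Quanto's training internals.

```python
# Toy quantization-aware training on one weight with a straight-through
# estimator. All names and hyperparameters are illustrative assumptions.

STEP = 0.25  # spacing of the assumed quantization grid

def fake_quantize(w):
    """Snap w to the nearest grid point."""
    return round(w / STEP) * STEP

def train(target, lr=0.1, steps=50):
    w = 0.0   # full-precision "shadow" weight updated by the optimizer
    x = 1.0   # single training input
    for _ in range(steps):
        pred = fake_quantize(w) * x          # forward pass sees the quantized weight
        grad = 2 * (pred - target) * x       # squared-error gradient, passed straight to w
        w -= lr * grad
    return w

w = train(target=0.7)
print(w, fake_quantize(w))
```

Because 0.7 is not on the grid, the shadow weight settles near the boundary between the two neighbouring grid points, 0.5 and 0.75, rather than converging to a single value.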

Serialization

Frozen Quanto models can be saved either via the standard torch.save/torch.load weights_only path or, more commonly, via Safetensors. Two artifacts are persisted: the quantized state dictionary, which holds integer or float8 weight tensors together with scales and zero points, and a JSON quantization_map describing per-module dtype and configuration. The pair can be restored without re-running quantization, enabling distribution of compressed checkpoints. The Diffusers Quanto blog measured a checkpoint-size reduction from 2.44 GB to 587 MB (roughly a 76% reduction) for the PixArt-Sigma transformer when serialized with qfloat8 weights.[^5]

The same persistence scheme is used by higher-level helpers such as QuantizedModelForCausalLM, introduced in version 0.2.3, which packages a quantized causal language model together with its tokenizer-compatible config and exposes a quantize class method analogous to Transformers' own from_pretrained constructor. A symmetric QuantizedPixArtTransformer2DModel helper exists on the Diffusers side, demonstrated in the official blog post for PixArt-Sigma.[^3][^5]

Performance

Memory and accuracy

The introduction blog post reports perplexity benchmarks on Meta-Llama-3.1-8B for several quantization configurations without applying post-training-optimization algorithms such as HQQ or AWQ, alongside latency measurements taken on an NVIDIA A10 GPU.[^2] The takeaway, as stated by the authors, is that "linear quantization for weights (float8, int8, int4, int2) with accuracy very similar to full-precision models" is achievable using only Quanto's default linear quantization scheme.[^4]

For diffusion transformers, the Diffusers integration blog reports the following memory footprints on an H100, with full benchmark scripts published in an accompanying dataset repository:[^5]

Model	Configuration	GPU memory
PixArt-Sigma 0.611B	FP16, no quantization	12.087 GB
PixArt-Sigma 0.611B	FP8 weights (transformer only)	11.548 GB
PixArt-Sigma 0.611B	FP8 weights (transformer + text encoder)	5.364 GB
Stable Diffusion 3 2.028B	FP16, no quantization	18.765 GB
Stable Diffusion 3 2.028B	FP8 weights (transformer + TE1 + TE3)	roughly 8.2 GB

Latency overheads in the same benchmark are minor for FP8 (for example 4.482 seconds versus 5.141 seconds for PixArt-Sigma with both transformer and text encoder quantized).[^5]
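The relative savings implied by the table can be computed directly from the quoted figures. The snippet below uses only those numbers; the helper name `saving` is illustrative, and the SD3 result inherits the "roughly 8.2 GB" approximation.

```python
# Relative memory savings computed from the figures quoted in the table.

def saving(before_gb, after_gb):
    """Fractional reduction going from before_gb to after_gb."""
    return 1 - after_gb / before_gb

pixart_transformer_only = saving(12.087, 11.548)  # FP8 transformer only
pixart_with_text_encoder = saving(12.087, 5.364)  # FP8 transformer + text encoder
sd3 = saving(18.765, 8.2)                         # FP8 transformer + TE1 + TE3 (approximate)

print(f"PixArt-Sigma, transformer only: {pixart_transformer_only:.1%}")
print(f"PixArt-Sigma, with text encoder: {pixart_with_text_encoder:.1%}")
print(f"Stable Diffusion 3: {sd3:.1%}")
```

For PixArt-Sigma, quantizing the transformer alone saves under 5% of memory, while also quantizing the text encoder saves more than half, so most of the footprint in that pipeline sits in the text encoder.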

Hardware notes

CUDA accelerated kernels exist for int8-int8, fp16-int4, bf16-int8, and bf16-int4 matmuls.[^1][^4] On hardware without native int4 compute (which includes most consumer GPUs and pre-H100 datacenter GPUs), int4 quantization saves memory but does not generally improve latency over int8 because the int4 values must be unpacked to a wider dtype before computation.[^12] On Apple Silicon MPS, Quanto runs in int8 and int4 modes but does not currently support qfloat8 (the library raises an error in this configuration).[^2]
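The unpacking cost mentioned above comes from how 4-bit values are stored: two per byte, so each value must be extracted and sign-extended before it can take part in arithmetic. The sketch below shows one possible layout (low nibble first). It is illustrative only; real kernels, including Quanto's, may order or group packed values differently.

```python
# Packing two signed 4-bit integers into one byte and unpacking them again.
# The nibble order is an assumption made for this illustration.

def pack_int4(values):
    """Pack signed int4 values (range -8..7) pairwise into bytes."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    if any(not -8 <= v <= 7 for v in values):
        raise ValueError("value outside the int4 range")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)

def unpack_int4(packed):
    """Recover signed int4 values by sign-extending each nibble."""
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values

weights = [-8, 7, 0, -1, 3, -4]
packed = pack_int4(weights)
assert unpack_int4(packed) == weights
print(len(weights), "values stored in", len(packed), "bytes")
```

Storage is halved relative to int8, but every matmul pays for the mask, shift, and sign-extension work unless the hardware consumes packed int4 directly.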

The library's accelerated CUDA path was extended over time. Version 0.2.0 added AWQ-style int4 kernels, and subsequent releases tied the Quanto build to native CUDA extensions, which is why v0.2.7 raised the minimum supported PyTorch version to 2.6 and fixed extension compilation on non-Linux systems.[^3] For Intel XPU, the library relies on PyTorch's XPU backend and has shipped testing fixes specifically for this device in the latest release.[^3] The codebase is approximately 70% Python, 27% CUDA, and 2% C++, reflecting both the high-level orchestration in Python and the kernel-heavy nature of the CUDA dispatch path.[^1]

Variants and Comparison

Quanto is one of several quantization backends shipping with Hugging Face Transformers. The table below summarizes how it compares to the other natively supported backends.[^4][^13]

Backend	Weight precisions	Activation quantization	Calibration	Modality scope	Devices
Quanto	int2, int4, int8, float8	int8, float8 (library only)	Optional, EMA-based	Modality-agnostic[^1]	CPU, CUDA, MPS, XPU[^4]
bitsandbytes	8-bit (LLM.int8), 4-bit (NF4/FP4)	No	Zero-shot, no calibration	Linear layers in any modality	CUDA-focused[^13]
AutoGPTQ (GPTQ)	2 to 8 bits	No	Requires calibration set	Primarily LLMs	CUDA, AMD via ROCm[^13]
AutoAWQ (AWQ)	4-bit	No	Activation-aware calibration	Primarily LLMs	CUDA[^13]

The key differentiators of Quanto, relative to the other backends in the table above, are:

Modality and device agnosticism. Unlike GPTQ and AWQ, which were designed around LLM workloads and lean on calibration over text token streams, Quanto's primitives are not LLM-specific and run on CPU, CUDA, MPS, and XPU.[^1][^4]
Activation quantization. Among the native Transformers backends, only Quanto supports activation quantization in qint8 and qfloat8, though this is currently exposed only via the standalone optimum.quanto API and not via QuantoConfig in Transformers.[^4]
Eager-mode workflow. Quanto does not require symbolic tracing or a separate calibration dataset for weight-only quantization, which simplifies application to control-flow-heavy or non-traceable models.[^2][^10]
No specialized inference engine. Quanto does not ship its own runtime; it produces standard PyTorch modules that work with torch.compile. This gives it broader compatibility but means it does not match the raw throughput of ExLlamaV2, llama.cpp's GGML tensors, or AWQ's CUDA kernels for pure LLM inference.[^4][^13]

Compared with NormalFloat 4-bit (NF4) in bitsandbytes, which targets QLoRA fine-tuning of LLMs, Quanto's qint4 is a linear quantizer rather than a non-uniform one and lacks NF4's optimization for normally-distributed weights, but it is available on CPU and MPS, not only on CUDA.[^4][^13]

Applications

Large language models

Quanto's most common application is weight-only post-training quantization of decoder LLMs. Through the Transformers QuantoConfig path, models such as Llama 3.1, OPT-125m, and Mistral can be loaded in qint8 or qint4 with a single line of code, reducing GPU memory by roughly 2x or 4x relative to fp16 without the need for a calibration dataset.[^4][^8] The blog announcement demonstrated this with Meta-Llama-3.1-8B on a single A10.[^2]

Speech models

The original launch examples explicitly include Whisper, demonstrating int8 weight quantization of openai/whisper-large-v3 for memory-constrained audio transcription deployments.[^1]

Diffusion transformers

Memory pressure on consumer GPUs is a particular concern for DiT models such as Stable Diffusion 3, FLUX.1, PixArt-Sigma, Hunyuan DiT, Lumina, Aura Flow, and Latte. Quanto's qfloat8 weight quantization, applied to both the transformer backbone and the text encoders, has been shown to bring SD3 inference from 18.765 GB down to roughly 8.2 GB on an H100 with negligible quality loss in standard sampling.[^5][^9] For more aggressive compression, qint4 with selected layers excluded (commonly proj_out) brings the PixArt-Sigma memory footprint down to about 3 GB.[^5]

Vision and other modalities

Because Quanto operates on generic nn.Linear, nn.Conv2d, and nn.LayerNorm modules rather than transformer-specific patterns, it has been applied to convolutional vision backbones such as VGG-19 (though with mixed VRAM-savings results that are tracked as open issues) and is recommended by the Hugging Face docs whenever the task spans more than one modality.[^1][^12]

On-device and Apple Silicon inference

The MPS backend is one of the design points that distinguishes Quanto from CUDA-only libraries. Because the library uses generic PyTorch tensors and dispatches through standard PyTorch operations, models quantized to qint8 or qint4 weights can be loaded and executed on Apple Silicon hardware through PyTorch's Metal Performance Shaders backend without additional conversion steps. The intended use cases include local LLM inference on Mac laptops and prototyping pipelines on developer machines, complementing Hugging Face's broader push for on-device deployment via projects like SmolLM and the Apple-specific Core ML and MLX ecosystems.[^1][^4]

Limitations

Maintenance mode. The most prominent limitation is project status. The README states the library "is currently in maintenance mode" and that "for production-ready quantization features or active development, alternative projects such as bitsandbytes or torchao are recommended."[^1][^6]
No activation quantization in Transformers. While the standalone library exposes qint8 and qfloat8 activation quantization with optional calibration and QAT, the QuantoConfig in Transformers exposes only weight quantization.[^4]
No native int4 hardware on most GPUs. Outside of H100-class hardware, qint4 produces memory savings but no clear latency gains, because the int4 weights must be unpacked to bf16 or fp16 before matmul.[^12]
MPS gaps. qfloat8 is not supported on Apple MPS at present and raises an error.[^2]
Outlier sensitivity in activations. Per-tensor activation quantization to int8 can yield large quantization errors when tensors contain heavy-tailed outliers, a known limitation of straight linear quantization that more advanced techniques such as SmoothQuant attempt to mitigate.[^11]
Diffusers caveats. The Diffusers Quanto backend currently quantizes only nn.Linear layers, supports torch.compile only for int8 weights, and does not allow loading models that were quantized directly with the standalone Quanto library through ModelMixin.from_pretrained.[^9]
Pre-alpha development status. The PyPI metadata classifies the package as "Development Status :: 2 - Pre-Alpha," signaling that interface guarantees are limited and that the public API may change between minor releases.[^6]
No specialized inference engine. Quanto does not provide a custom runtime, which simplifies integration but limits raw throughput compared with libraries that bundle hand-tuned kernels for specific quantization formats.[^4]

Related Work

Quanto sits alongside several other quantization toolkits in the Hugging Face ecosystem, each with different design tradeoffs:

bitsandbytes. Zero-calibration int8 (LLM.int8) and 4-bit (NF4, FP4) weight quantization for any nn.Linear layer, with strong support for QLoRA fine-tuning on CUDA.[^13]
AutoGPTQ. Calibration-based n-bit quantization (2 to 8) for LLMs with optimized CUDA and ROCm kernels.[^13]
AutoAWQ. Activation-aware 4-bit weight quantization for LLMs.[^13]
torchao. PyTorch's own quantization library, now recommended by the Hugging Face team as a Quanto alternative for active development.[^1]
GGML/GGUF. Quantization formats consumed by llama.cpp and related runtimes.[^13]
ExLlamaV2 / EXL2. GPTQ-style quantization with custom 2 to 8 bit kernels for high-throughput LLM inference.[^13]

Quanto's distinguishing combination of features is the union of PyTorch-native eager-mode operation, broad device support including CPU and MPS, modality-agnostic primitives, and both weight and activation quantization in the same library.[^1][^4]

References
