LLM.int8()
Last reviewed
May 25, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,559 words
LLM.int8() is an 8-bit matrix multiplication scheme for large language model inference that preserves accuracy for models of up to 175 billion parameters. It combines vector-wise quantization with a mixed-precision decomposition: a small number of outlier feature dimensions are isolated and computed in 16-bit precision, while the remaining ~99.9% of values are multiplied in 8-bit integer arithmetic.[1] The method was introduced by Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer in the paper "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale", submitted to arXiv on 15 August 2022 and published at NeurIPS 2022.[1][2] It was released as part of the bitsandbytes library and integrated into Hugging Face Transformers and Accelerate in August 2022, where it is exposed through the load_in_8bit=True flag (later through BitsAndBytesConfig).[3][4] LLM.int8() was the first 8-bit quantization scheme that worked on transformer language models above roughly 6.7 billion parameters without measurable accuracy degradation, and it became the default 8-bit inference path in the Hugging Face ecosystem.[3][4]
By 2022, the largest publicly available transformer language models, including OPT-175B from Meta and BLOOM-176B from BigScience, contained between 175 and 176 billion parameters.[1][3] Storing these weights in 16-bit floating point required around 352 gigabytes of memory, which exceeded the capacity of any single GPU and forced inference onto multi-GPU servers built around eight NVIDIA A100 80 GB accelerators.[3] Reducing the weight precision to 8-bit integer would halve the storage to roughly 176 gigabytes and allow the same models to run on four A100 80 GB GPUs, a substantial cost reduction for inference.[3]
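The storage figures above follow from simple arithmetic on the parameter count. A minimal sketch, assuming decimal gigabytes and counting only the weights (no activations, key-value cache, or framework overhead); the helper name `weight_memory_gb` is illustrative and not part of any library:

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


bloom_params = 176e9  # BLOOM-176B parameter count, as quoted in the text

fp16_gb = weight_memory_gb(bloom_params, 16)
int8_gb = weight_memory_gb(bloom_params, 8)
print(f"16-bit: {fp16_gb:.0f} GB, int8: {int8_gb:.0f} GB")
```

The number of GPUs needed in practice is higher than the weight footprint alone suggests, because activations and runtime buffers also need room.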
Prior post-training quantization work had shown that small transformers and convolutional networks could be quantized to 8-bit integers with negligible accuracy loss, and several earlier papers explored sub-8-bit weight quantization for models below 1 billion parameters.[1] Naive int8 quantization, in which a single per-tensor scale converts both activations and weights to signed integers in the range [-127, 127], failed to scale: it preserved accuracy up to around 2.7 billion parameters and then degraded sharply.[1] By 6.7 billion parameters, naive int8 destroyed model quality enough that perplexity diverged on standard benchmarks.[1][5] Dettmers and collaborators identified this failure mode as the key obstacle to 8-bit deployment of frontier-scale models and the motivation for LLM.int8().[1][5]
The post-training quantization line of work has continued in parallel with LLM.int8(). GPTQ, introduced by Frantar, Ashkboos, Hoefler, and Alistarh in October 2022, uses approximate second-order information to quantize 175-billion-parameter models to 3-4 bits in roughly four GPU hours, requiring a small calibration set.[6] AWQ, introduced by Lin and collaborators in June 2023, applies uniform 4-bit weight quantization while scaling the salient channels identified through activation statistics, and won the MLSys 2024 Best Paper Award.[7] SmoothQuant, by Xiao, Lin, Seznec, Wu, Demouth, and Han, takes a different approach by migrating the quantization difficulty from activations to weights through a per-channel rescaling step, enabling W8A8 quantization at lower hardware cost.[8] LLM.int8() differs from these methods in that it is a mixed-precision int8/fp16 inference scheme that requires no calibration data and operates online during the forward pass.[1][4]
The central empirical observation behind LLM.int8() is that hidden states in transformer language models contain a small number of dimensions with magnitudes far larger than the rest, and that the prevalence and magnitude of these outlier dimensions grow rapidly with model scale.[1][5] Dettmers reported that for transformer models below approximately 2.7 billion parameters, hidden state values are typically distributed in the range [-3.5, 3.5], while at 6.7 billion parameters and above, certain dimensions exhibit values in the ranges [-60, -6] or [6, 60].[4][5]
The paper characterizes the emergence as a phase transition. In smaller models, outlier dimensions appear in a few layers and behave probabilistically. At and above the 6.7-billion-parameter threshold, 100% of layers route their outliers through the same coordinated set of hidden dimensions, and the outliers expand from approximately 15 affected dimensions in 6-billion-parameter models to around 60 in 13-billion-parameter models.[5] Dettmers framed this as an "emergent feature" of scale, with outliers concentrating into specific dimensions that participate in highly sparse, almost discrete attention patterns and in feature-removal functions implemented by the feedforward sublayers.[5]
These outlier features dominate predictive performance. Although they occupy only about 0.1% of the total feature dimensions in a typical hidden state, removing them or rounding them through standard int8 quantization causes catastrophic accuracy loss across language modeling benchmarks.[1] The dynamic range mismatch is the underlying cause: a per-tensor int8 scale calibrated against a maximum of 60 maps the typical activations in the [-3.5, 3.5] range onto only about 15 of the 255 available integer levels (codes -7 through 7), which destroys the precision of the bulk of the computation.[4] Conversely, a scale calibrated to the typical range clips the outliers, and outlier clipping cascades through subsequent layers because the same dimensions reliably re-emerge as outliers in the next block.[1][5]
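The dynamic range mismatch can be checked numerically. The following is a small sketch, assuming symmetric absmax quantization to the range [-127, 127] with a single scale for the whole tensor; the grid of "typical" values and the helper name `absmax_quantize` are illustrative choices, not taken from the paper:

```python
def absmax_quantize(values):
    """Symmetric per-tensor quantization to integer codes in [-127, 127]."""
    scale = 127 / max(abs(v) for v in values)
    return [round(v * scale) for v in values]


# Typical activations: -3.5 to 3.5 in steps of 0.1 (71 distinct values).
typical = [i / 10 for i in range(-35, 36)]

codes_clean = absmax_quantize(typical)
codes_with_outlier = absmax_quantize(typical + [60.0])[:-1]  # drop the outlier's own code

distinct_clean = len(set(codes_clean))                # 71: every value keeps its own code
distinct_with_outlier = len(set(codes_with_outlier))  # 15: codes -7 through 7
print(distinct_clean, distinct_with_outlier)
```

With the outlier present, the 71 distinct inputs collapse onto 15 integer codes, so neighbouring activations become indistinguishable after quantization.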
LLM.int8() handles the bulk of the matrix multiplication in int8 and the outlier dimensions in fp16, then sums the two contributions to produce an fp16 result.[1][4]
For an inner-product computation between hidden states X of shape (batch, in_features) and a weight matrix W of shape (in_features, out_features), naive quantization assigns a single scalar scale per tensor. Vector-wise quantization instead assigns one absolute-maximum scale per row of X and one absolute-maximum scale per column of W.[4] Each row of X is divided by its absolute maximum, multiplied by 127, and rounded to int8; each column of W is treated the same way. The matrix product is then computed on int8 tensor cores with int32 accumulation, and the result is dequantized by multiplying by the outer product of the two absolute-maximum vectors and dividing by 127² (one factor of 127 for each quantized operand).[4] This granularity ensures that rows or columns containing larger values do not consume the entire dynamic range available to other rows or columns.[1][4]
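The vector-wise scheme can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the bitsandbytes kernel: matrices are nested lists, the integer accumulator is an ordinary Python int rather than int32, and the function names are hypothetical.

```python
def quantize_vectors(vectors):
    """Absmax-quantize each vector to int8 codes; return (codes, absmax values)."""
    absmax = [max(abs(v) for v in vec) or 1.0 for vec in vectors]  # avoid division by zero
    codes = [[round(127 * v / m) for v in vec] for vec, m in zip(vectors, absmax)]
    return codes, absmax


def vectorwise_int8_matmul(X, W):
    """Approximate X @ W with one scale per row of X and one per column of W."""
    Xq, x_absmax = quantize_vectors(X)               # rows of X
    Wq, w_absmax = quantize_vectors(list(zip(*W)))   # columns of W
    return [
        [
            # integer dot product, then dequantize with both absmax values and 127 * 127
            sum(a * b for a, b in zip(x_row, w_col)) * sx * sw / (127 * 127)
            for w_col, sw in zip(Wq, w_absmax)
        ]
        for x_row, sx in zip(Xq, x_absmax)
    ]


X = [[0.5, -1.0, 2.0], [10.0, 20.0, -40.0]]   # second row is 20x larger than the first
W = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]

approx = vectorwise_int8_matmul(X, W)
exact = [[sum(x * W[k][c] for k, x in enumerate(row)) for c in range(2)] for row in X]
max_error = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
print(approx, exact, max_error)
```

Because each row of X carries its own scale, the small-magnitude first row is quantized just as finely as the large-magnitude second row.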
Vector-wise quantization alone is insufficient at scale because a single outlier within a row still forces the scale to accommodate it, leaving the bulk of values in that row with poor effective precision.[1] Mixed-precision decomposition partitions the input columns of X and the corresponding rows of W based on whether the column magnitude exceeds a threshold, set to 6.0 by default in the public implementation.[4]
Letting O denote the set of column indices in X whose absolute maximum across the batch exceeds the threshold, the computation is decomposed into two terms. The first term performs the matrix product between the outlier columns of X and the outlier rows of W directly in fp16. The second term performs the matrix product between the non-outlier columns of X and the non-outlier rows of W after row-wise quantization of X to int8 and column-wise quantization of W to int8, then dequantizes the int32 accumulator back to fp16 using the outer product of the two scale vectors.[1][4] The two partial results are summed to obtain the final fp16 output.[4] Because outliers occupy only a small fraction of the hidden dimensions, the fp16 term covers fewer than 0.1% of values for transformer models at the 175-billion-parameter scale, and the paper reports that "more than 99.9% of values are multiplied in 8-bit".[1][2]
After the two matrix products are computed, the int8 partial result is dequantized by elementwise multiplication with the outer product of the row and column scales and division by 127² (one factor of 127 per quantized operand), then added to the fp16 outlier partial result.[4] The full computation is differentiable with respect to the input, but quantization-aware fine-tuning is not the primary use case; LLM.int8() is intended for inference and for fine-tuning only the unquantized layers added on top.[4]
The default outlier threshold of 6.0 was selected empirically. The Hugging Face documentation notes that 8-bit quantization works well for hidden values around 5 in magnitude but that beyond this there is a significant performance penalty, and that the threshold can be lowered for smaller or less stable models; setting it to 0.0 is a special case that skips the outlier decomposition entirely, which speeds up inference at some accuracy cost.[4]
A worked example of the per-token routing illustrates the structure. For an input batch of shape (seq_len, hidden_dim) passing through a linear layer of shape (hidden_dim, ffn_dim), the algorithm first computes the column-wise absolute maxima of the input. If k columns exceed the threshold, those k columns are extracted, the corresponding k rows of the weight are extracted in fp16, and their fp16 product of shape (seq_len, ffn_dim) is computed conventionally. The remaining hidden_dim - k columns are quantized row-wise to int8 with one scale per token, the remaining hidden_dim - k rows of the weight are quantized column-wise to int8 with one scale per output feature, the int8 matmul is performed, and the int32 accumulator is dequantized via the outer product of token scales and feature scales divided by 127².[1][4] Because k is typically less than 0.1% of hidden_dim at the 175B scale, the fp16 partial matmul is far smaller than a naive fp16 matmul, and the bulk of the arithmetic remains on int8 tensor cores.[1]
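The routing described above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions rather than the library implementation: the fp16 path is emulated with ordinary Python floats, matrices are nested lists, and `llm_int8_matmul` and its helpers are hypothetical names. Passing an infinite threshold disables the outlier path, which gives a convenient baseline for comparison.

```python
def quantize_vectors(vectors):
    """Absmax-quantize each vector to int8 codes; return (codes, absmax values)."""
    absmax = [max((abs(v) for v in vec), default=0.0) or 1.0 for vec in vectors]
    codes = [[round(127 * v / m) for v in vec] for vec, m in zip(vectors, absmax)]
    return codes, absmax


def llm_int8_matmul(X, W, threshold=6.0):
    """Approximate X @ W with an int8 path plus an unquantized outlier path."""
    n_in, n_out = len(W), len(W[0])

    # Input feature j is an outlier if any entry in column j of X exceeds the threshold.
    is_outlier = [max(abs(row[j]) for row in X) > threshold for j in range(n_in)]
    outlier = [j for j in range(n_in) if is_outlier[j]]
    regular = [j for j in range(n_in) if not is_outlier[j]]

    # Int8 path: one scale per row of X, one per column of W, regular features only.
    Xq, x_absmax = quantize_vectors([[row[j] for j in regular] for row in X])
    Wq, w_absmax = quantize_vectors([[W[j][c] for j in regular] for c in range(n_out)])

    out = []
    for i, row in enumerate(X):
        out_row = []
        for c in range(n_out):
            acc = sum(a * b for a, b in zip(Xq[i], Wq[c]))       # integer accumulator
            low = acc * x_absmax[i] * w_absmax[c] / (127 * 127)  # dequantize
            high = sum(row[j] * W[j][c] for j in outlier)        # outlier path, unquantized
            out_row.append(low + high)
        out.append(out_row)
    return out


def max_abs_error(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


X = [[0.5, -1.0, 45.0, 2.0], [1.5, 0.25, -60.0, -3.0]]  # feature 2 is an outlier
W = [[0.1, -0.2], [0.3, 0.4], [0.05, -0.02], [-0.5, 0.6]]
exact = [[sum(x * W[j][c] for j, x in enumerate(row)) for c in range(2)] for row in X]

err_mixed = max_abs_error(llm_int8_matmul(X, W, threshold=6.0), exact)
err_plain = max_abs_error(llm_int8_matmul(X, W, threshold=float("inf")), exact)
print(f"with decomposition: {err_mixed:.4f}, without: {err_plain:.4f}")
```

On this toy input the error with the decomposition is well over an order of magnitude smaller than without it, because the outlier column no longer dictates the row scales of the int8 path.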
LLM.int8() is implemented in the open-source bitsandbytes library, which Dettmers originally released to accompany the paper.[9] The repository later moved to the bitsandbytes-foundation organization on GitHub, and the project is currently sponsored by Hugging Face and Intel.[9] As of the 0.49.2 release on 16 February 2026, bitsandbytes provides three primary capabilities: 8-bit optimizers based on block-wise quantization, LLM.int8() inference via the Linear8bitLt module, and 4-bit QLoRA quantization via the Linear4bit module.[9]
The Hugging Face Transformers integration was announced in a joint blog post by Younes Belkada and Tim Dettmers on 17 August 2022, "A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes".[3] The integration replaces every torch.nn.Linear module in a loaded model with a bitsandbytes.nn.Linear8bitLt module, with the exception of the language modeling head, and quantizes the weights to int8 the first time the model is moved to GPU memory.[3] Originally the user-facing flag was load_in_8bit=True passed to from_pretrained; subsequent Transformers releases moved the parameter into a BitsAndBytesConfig object that supports finer-grained control through llm_int8_threshold, llm_int8_enable_fp32_cpu_offload, and llm_int8_skip_modules.[4]
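In practice the configuration-object form looks roughly like the sketch below. It assumes that transformers, accelerate, and bitsandbytes are installed and that a supported GPU is available; the wrapper function `load_model_in_8bit` is a hypothetical convenience, and the model identifier is only an example.

```python
def load_model_in_8bit(model_id="facebook/opt-6.7b", threshold=6.0):
    """Load a causal language model with LLM.int8() weights.

    Requires transformers, accelerate, bitsandbytes and a supported GPU.
    """
    # Imported lazily so that defining this function does not require the libraries.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_threshold=threshold,  # outlier threshold for the fp16 path
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let Accelerate place the layers on available devices
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

Calling `load_model_in_8bit()` downloads the checkpoint and quantizes the linear layers at load time, so the first call can take a while for large models.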
The library also ships an 8-bit Adam optimizer that quantizes optimizer state to 8 bits using block-wise quantization while maintaining 32-bit-equivalent convergence.[9] The 8-bit optimizers run on NVIDIA Pascal-generation GPUs and newer, while the LLM.int8() inference path requires tensor cores from the Turing generation or newer (for example the RTX 20 series and T4 on Turing, or the A100 on Ampere) because it relies on their int8 tensor core path.[4][9] More recent bitsandbytes versions extend backend coverage to AMD ROCm, Intel XPU, Intel Gaudi 2 and Gaudi 3 accelerators, and CPU implementations using AVX2 or AVX512.[9]
The original paper evaluated LLM.int8() on language modeling and zero-shot reasoning benchmarks across the full OPT family from 125 million to 175 billion parameters and the BLOOM family up to 176 billion parameters, comparing against fp16 baselines.[1][3]
For OPT-175B, the Hugging Face blog reported HellaSwag normalized accuracy of 0.7849 in int8 versus 0.7849 in fp16 (a difference of 0), HellaSwag raw accuracy of 0.5921 versus 0.5931, PIQA accuracy of 0.7965 versus 0.7959, LAMBADA perplexity of 3.0142 versus 3.0152, and Winogrande accuracy of 0.7174 versus 0.7245, with all differences falling within standard error margins.[3] For BLOOM-176B, HellaSwag normalized accuracy was 0.7274 in int8 versus 0.7303 in bf16, PIQA accuracy was 0.7835 versus 0.7884, and LAMBADA accuracy was 0.6808 versus 0.6718.[3] The authors characterized the differences as unmeasurable for large models given the noise floor of these benchmarks.[3]
The blog post also reported memory savings: BLOOM-176B requires approximately 352 GB in bf16, reducible to 176 GB in int8, allowing it to run on four A100 80 GB GPUs rather than eight.[3] Smaller models showed corresponding savings: the T5-11B checkpoint drops from 42 GB in fp32 to 11 GB in int8, which makes it fit on a single Google Colab T4 GPU.[3]
Inference latency in the original implementation was slower than fp16 for small batch sizes. On BLOOM-176B the slowdown was reported at 15-23% versus fp16; on T5-11B at batch size 1 the int8 path took 25 ms per token in the optimized version versus 11.7 ms for fp16, and on T5-3B the gap was larger because the matrix dimensions involved are too small to fully amortize the overhead of the mixed-precision decomposition.[3] The authors noted that further optimizations were possible and that the speedup of int8 tensor cores becomes apparent only at sufficient batch sizes and matrix sizes.[3]
Following LLM.int8(), Dettmers extended the quantization toolkit toward 4-bit weight formats for fine-tuning rather than only inference. QLoRA, introduced by Dettmers, Pagnoni, Holtzman, and Zettlemoyer in the paper "QLoRA: Efficient Finetuning of Quantized LLMs" submitted to arXiv on 23 May 2023, combines a new NormalFloat 4-bit (NF4) data type with double quantization of the quantization constants and paged optimizers to enable fine-tuning of 65-billion-parameter models on a single 48 GB GPU.[10] The NF4 data type was designed to be information-theoretically optimal for normally distributed weights, exploiting the fact that pretrained transformer weights are approximately Gaussian after layer-wise normalization.[10] The Guanaco model family released with the paper reached 99.3% of the performance of ChatGPT in head-to-head GPT-4 evaluation after 24 hours of fine-tuning on a single GPU.[10]
Double quantization quantizes the per-block scale factors themselves, saving an additional 0.4 bits per parameter on average, and paged optimizers use unified memory transfers between GPU and CPU to handle gradient checkpointing memory spikes without out-of-memory failures.[10] These features were added to bitsandbytes through the Linear4bit module and exposed in Transformers through BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True).[4]
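The size of the double-quantization saving can be reproduced with a short calculation. The block sizes and bit widths below are not stated in this article; they are the settings described in the QLoRA paper (64 weights per block with a 32-bit scale, and 8-bit second-level scales grouped in blocks of 256), and should be read as assumptions of this sketch.

```python
WEIGHT_BLOCK = 64    # weights sharing one quantization constant (assumed, per QLoRA)
SCALE_BLOCK = 256    # first-level constants sharing one second-level constant (assumed)

# Without double quantization: one 32-bit constant per block of 64 weights.
overhead_single = 32 / WEIGHT_BLOCK

# With double quantization: 8-bit constants, plus one 32-bit constant per 256 of them.
overhead_double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * SCALE_BLOCK)

saving = overhead_single - overhead_double
print(f"{overhead_single:.3f} -> {overhead_double:.3f} bits/param, saving {saving:.3f}")
```

Under these assumptions the saving is about 0.37 bits per parameter, which rounds to the 0.4 figure quoted above.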
The 4-bit path largely supplanted LLM.int8() as the default quantization choice for memory-constrained fine-tuning and consumer-grade inference because it offers a further 2x memory reduction while preserving most of the quality.[4] LLM.int8() remains the choice when the highest accuracy is required and 8-bit memory is sufficient.[4]
LLM.int8() differs from the major contemporary post-training quantization methods along several axes.
GPTQ, introduced by Frantar, Ashkboos, Hoefler, and Alistarh in October 2022 and accepted at ICLR 2023, is a one-shot weight-only post-training quantization method that uses approximate second-order information to quantize weights to 3 or 4 bits.[6] GPTQ requires a calibration dataset of a few hundred examples and operates offline, producing a smaller model that can then be served by a compatible inference engine, whereas LLM.int8() performs its mixed-precision decomposition at runtime on the activations and requires no calibration data.[1][6] GPTQ targets lower bit widths but does not handle activation quantization, so activations remain in fp16 during inference.[6]
AWQ, introduced by Lin and collaborators in June 2023, also performs weight-only quantization to 4 bits but identifies a small set of salient channels by examining activation statistics and scales those channels through equivalent transformations to preserve precision without backpropagation.[7] AWQ won the MLSys 2024 Best Paper Award and is widely deployed in inference engines such as TensorRT-LLM and vLLM.[7] Like GPTQ, AWQ is offline and weight-only, while LLM.int8() is online and quantizes both activations and weights for the non-outlier portion of the computation.[1][7]
SmoothQuant, introduced by Xiao, Lin, Seznec, Wu, Demouth, and Han, is a W8A8 post-training quantization scheme that migrates the quantization difficulty from activations to weights through a mathematically equivalent per-channel scaling transformation, allowing both activations and weights to be quantized to 8-bit integers with simpler kernels.[8] SmoothQuant addresses the same outlier problem identified by LLM.int8() but takes a different route: rather than decomposing the matrix multiplication into mixed-precision components, it preconditions the model so that the outliers become more uniformly distributed across activations and weights, enabling unified int8 arithmetic at higher throughput.[8] SmoothQuant requires a calibration step, while LLM.int8() does not.[1][8]
In summary, LLM.int8() is unique among these methods in being a runtime mixed-precision scheme that requires no calibration but does require dynamic outlier detection on every forward pass.[1][4] This makes the LLM.int8() kernels more complex than uniform W8A8 kernels and harder to fuse into existing inference engines, which is part of the reason that subsequent deployment-focused work has converged on weight-only schemes such as GPTQ and AWQ for production serving.[6][7]
A second axis of distinction is bit width. LLM.int8() preserves an 8-bit weight format with 16-bit handling of outlier columns, yielding roughly 2x memory reduction versus fp16. GPTQ and AWQ target 4-bit weights, yielding roughly 4x reduction at the cost of more elaborate offline preparation and reduced robustness when activations are out of calibration distribution.[6][7] SmoothQuant targets 8-bit weights and 8-bit activations together, yielding similar memory savings to LLM.int8() but with simpler runtime kernels.[8] For models above approximately 100 billion parameters, the relative quality preservation of LLM.int8() at 8-bit was a major reason it was the first scheme adopted at the BLOOM-176B and OPT-175B scale.[1][3]
LLM.int8() was integrated into Hugging Face Transformers and Accelerate in August 2022 and rapidly became the default option for loading large models in reduced precision on the Hugging Face platform.[3][4] As of the 2026 documentation, the bitsandbytes integration page lists LLM.int8() alongside QLoRA as the two primary quantization features exposed through BitsAndBytesConfig, with load_in_8bit=True as the canonical example for halving model memory.[4]
The integration extended to the parameter-efficient fine-tuning library PEFT, which supports adapter fine-tuning of LLM.int8()-loaded base models such as google/flan-t5-large and facebook/opt-6.7b.[4] Users can also push 8-bit quantized models to the Hugging Face Hub via push_to_hub once a quantization config has been attached, and load them later without re-specifying the quantization configuration.[4]
The bitsandbytes library has been adopted by other model serving and fine-tuning frameworks. The Hugging Face TGI inference server supports loading bitsandbytes-quantized models, and frameworks like Axolotl, LLaMA-Factory, and Unsloth use bitsandbytes for memory-efficient training. LLM.int8() is the bitsandbytes-supported path for 8-bit inference on NVIDIA Turing-generation and newer hardware.[9]
Beyond the Hugging Face ecosystem, the underlying ideas of LLM.int8() (per-token or per-channel quantization granularity and outlier-aware kernels) influenced later quantization libraries and inference engines. Engines such as vLLM and TensorRT-LLM later added support for AWQ and GPTQ as their preferred weight-only quantization schemes, while LLM.int8() remained associated with the bitsandbytes implementation accessible through Hugging Face APIs.[4][9]
The principal limitations of LLM.int8() in 2026 relate to performance, hardware coverage, and the rise of competing methods.
Inference latency is the most discussed limitation. The mixed-precision decomposition requires three kernels per linear layer: an outlier extraction and routing kernel, an int8 matmul on the non-outlier slice, and an fp16 matmul on the outlier slice, plus the dequantization and summation. On batch sizes below the threshold at which the int8 tensor cores fully amortize their fixed overhead, the int8 path can be slower than the fp16 baseline.[3] The Hugging Face blog originally reported a 15-23% slowdown on BLOOM-176B and a larger relative gap on smaller models such as T5-3B, though subsequent kernel optimizations have narrowed these gaps.[3]
On NVIDIA Hopper (H100) and newer architectures with mature fp8 tensor cores, and on Ada Lovelace consumer GPUs, the relative benefit of int8 over fp16 is reduced because fp16 throughput is already very high and fp8 paths offer a more straightforward route to 8-bit acceleration.[4][9] As a result, deployment-focused users have largely migrated to weight-only 4-bit schemes (GPTQ, AWQ) for serving, and use LLM.int8() or QLoRA primarily for fine-tuning on memory-constrained hardware.[4][6][7]
Hardware coverage was historically limited. LLM.int8() required NVIDIA Turing tensor cores or newer, excluding Pascal-generation GPUs such as the GTX 1080 and excluding CPUs.[4] Recent bitsandbytes releases expanded the supported backends to include AMD ROCm, Intel XPU, Intel Gaudi 2 and Gaudi 3, and CPU paths, but the original LLM.int8() path was tied to NVIDIA hardware for the first several years.[9]
Finally, the outlier threshold parameter requires some tuning for the most stable behavior. The default value of 6.0 works well for most models in the OPT and BLOOM families, but smaller models or fine-tuned models can benefit from a lower threshold, and the Hugging Face documentation explicitly recommends experimentation with the llm_int8_threshold parameter for unstable cases.[4]