SpinQuant

Updated Jun 8, 2026

Overview

SpinQuant is a post-training quantization method for large language models that inserts learned rotation matrices into a transformer network to make its weights, activations, and KV cache easier to represent at 4 bits. It was introduced in the paper "SpinQuant: LLM Quantization with Learned Rotations" by Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort, all at Meta. The paper first appeared on arXiv on May 26, 2024, and was published at the International Conference on Learning Representations (ICLR) in 2025 ^[1]. Reference code is released by Facebook Research ^[2].

The central observation behind SpinQuant is that multiplying the hidden states of a transformer by an orthogonal rotation matrix can be made mathematically invisible to the network output, yet the rotation reshapes the distribution of values so that large outliers are spread out and quantization becomes far less lossy. Random rotations already help, but they vary widely in quality. SpinQuant instead treats the rotations as trainable parameters and optimizes them directly against a quantization objective using Cayley stochastic gradient descent (Cayley SGD) on the Stiefel manifold of orthogonal matrices. On LLaMA-2 7B with weights, activations, and KV cache all quantized to 4 bits, SpinQuant narrows the average zero-shot accuracy gap to full precision to 2.9 points ^[1]. The method was later used to ship quantized Llama 3.2 models for on-device inference ^[5].

Background: outliers and rotation

Quantizing only the weights of an LLM to 4 bits is relatively well understood, but quantizing the activations (and the KV cache) is much harder. The difficulty comes from activation outliers: a small number of feature channels in transformer hidden states take on values that are orders of magnitude larger than the rest. Because uniform integer quantization assigns the same step size across an entire tensor (or group), these outliers stretch the numeric range and force most ordinary values into only a few quantization levels, destroying accuracy. Earlier methods such as SmoothQuant addressed this by migrating scale between activations and weights ^[7], but outliers in the largest models remained a bottleneck for fully 4-bit (weight, activation, and KV) inference.
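
The effect of a single outlier on a shared step size can be illustrated with a minimal sketch. The symmetric round-to-nearest quantizer and the toy values below are assumptions chosen for illustration, not the exact scheme of any particular method.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric uniform quantization with one step size for the whole list."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = max(abs(v) for v in values) / qmax      # step size set by the largest value
    out = []
    for v in values:
        q = max(-qmax - 1, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical activations: eight ordinary channels, then the same
# channels with the last one replaced by a large outlier.
ordinary = [0.9, -0.7, 0.4, -0.2, 0.6, -0.5, 0.3, 0.8]
with_outlier = ordinary[:-1] + [60.0]

err_plain = mean_abs_error(ordinary, quantize_dequantize(ordinary))

deq_outlier = quantize_dequantize(with_outlier)
# Error measured only on the seven ordinary channels.
err_outlier = mean_abs_error(with_outlier[:-1], deq_outlier[:-1])

print(err_plain, err_outlier)
print(deq_outlier[:-1])   # the ordinary channels after quantization
```

In this toy case the outlier stretches the step size so far that every ordinary channel rounds to zero, which is the failure mode described above.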

A different line of work attacks the problem geometrically. If a hidden vector is multiplied by an orthogonal (rotation) matrix, its individual coordinates are mixed together. Concentrated outlier energy gets redistributed across many channels, producing a more uniform, lower-magnitude, and therefore more quantization-friendly distribution. This connects to the notion of incoherence from QuIP, which showed that pre- and post-multiplying weight matrices by random orthogonal matrices makes them "incoherent," bounding the per-element magnitude and improving low-bit weight quantization ^[4]. The key practical question is how to insert such rotations without changing what the network computes, and which rotations to use.
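
A small sketch can make the redistribution concrete. It uses a normalized Hadamard matrix as the rotation, which is one convenient orthogonal matrix; the input vector is a made-up example.

```python
import math


def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


# Hypothetical hidden vector with one outlier channel.
x = [0.5, -0.3, 0.2, 40.0, -0.4, 0.1, 0.3, -0.2]
rotated = matvec(hadamard(8), x)

peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in rotated)

print(peak_before, peak_after)      # the largest magnitude shrinks
print(norm(x), norm(rotated))       # the vector length is preserved
```

Because the rotation is orthogonal, the vector's norm is unchanged; only the way the energy is distributed across coordinates differs.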

How SpinQuant works

Rotation invariance

SpinQuant relies on the fact that an orthogonal matrix R satisfies R Rᵀ = I, so a rotation can always be undone by its transpose. If a rotation R is applied to the output of one weight matrix and its inverse Rᵀ is applied to the input of the next, the two cancel and the end-to-end function is unchanged. This is the same "computational invariance" property used by SliceGPT and QuaRot ^[3]^[6]: a carefully paired rotation can be folded (merged) into the surrounding weight matrices ahead of time, so at inference the rotated network has exactly the same architecture and cost as the original, but its intermediate tensors are easier to quantize.
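
The folding argument can be checked numerically on two stacked linear layers. This is a simplified sketch: the matrices are arbitrary toy values, and it ignores the normalization layers and nonlinearities that a real transformer places between projections.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


# A 4x4 orthogonal matrix (normalized Hadamard), standing in for a rotation R.
s = 0.5
R = [[s, s, s, s], [s, -s, s, -s], [s, s, -s, -s], [s, -s, -s, s]]

# Toy weights: W1 maps 3 -> 4, W2 maps 4 -> 2.
W1 = [[0.2, -1.0, 0.5], [1.5, 0.3, -0.7], [-0.4, 0.8, 0.1], [0.9, -0.2, 0.6]]
W2 = [[0.3, -0.6, 1.1, 0.2], [-0.8, 0.4, 0.5, -1.2]]
x = [[1.0], [-2.0], [0.5]]

y_ref = matmul(W2, matmul(W1, x))

# Fold the rotation into the weights ahead of time.
W1_rot = matmul(R, W1)                # rotate the output of the first layer
W2_rot = matmul(W2, transpose(R))     # undo it at the input of the next layer

hidden_rot = matmul(W1_rot, x)        # rotated intermediate tensor
y_rot = matmul(W2_rot, hidden_rot)

max_diff = max(abs(a[0] - b[0]) for a, b in zip(y_ref, y_rot))
print(max_diff)                       # zero up to floating-point rounding
```

The intermediate tensor `hidden_rot` differs from the original hidden state, which is where quantization would be applied, while the final output is the same.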

SpinQuant identifies four rotation insertion points in a LLaMA-style transformer, labelled R1 through R4:

R1 is a single rotation on the residual stream (of size equal to the embedding dimension). It is merged into every weight matrix that reads from or writes to the residual path, including embeddings, attention and feed-forward input projections, and output projections.
R2 is a head-wise rotation (of size equal to the attention head dimension) applied to the value projection in each attention block and undone at the output projection. It is also absorbed into the weights.
R3 is an online rotation applied to the queries and keys inside attention. Because rotary position embeddings (RoPE) are applied between the projection and the attention scores, R3 cannot be merged into a weight and must be executed at runtime; it helps quantize the KV cache.
R4 is an online rotation applied to the activations entering the feed-forward down-projection, where the gated SwiGLU activation produces strong outliers.

R1 and R2 are absorbable rotations that disappear into the weights and add no inference cost. R3 and R4 are kept as fast Hadamard transforms (Hadamard matrix multiplications) executed online, adding a small runtime overhead but no learned parameters.
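
For the online rotations, the appeal of a Hadamard transform is that it can be applied in O(n log n) additions and subtractions rather than a full matrix multiply. The following is a generic textbook fast Walsh-Hadamard transform, offered as an illustration and not as the kernel used in the reference code.

```python
import math


def fwht(x):
    """Normalized fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly stage: combine pairs of elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)        # makes the transform orthogonal
    return [v * scale for v in out]


x = [1.0, 0.0, -2.0, 3.0, 0.5, -1.5, 4.0, 2.0]
y = fwht(x)
roundtrip = fwht(y)                   # the normalized transform is its own inverse

print(y)
print(roundtrip)
```

Since the transform has no learned parameters, only the butterfly structure needs to be implemented on the target hardware.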

Learned rotations via Cayley SGD

The novelty of SpinQuant is that the mergeable rotations R1 and R2 are not fixed at random but learned. The authors observed that different random rotations, including random Hadamard and random orthogonal matrices, can differ by several points of accuracy on the same model, so the choice of rotation matters and can be optimized. SpinQuant minimizes the network's quantization loss (the cross-entropy of the quantized model on a small calibration set) with respect to R1 and R2 while keeping the model weights frozen.

Because R1 and R2 must remain orthogonal throughout training, ordinary gradient descent would break the constraint. SpinQuant therefore optimizes on the Stiefel manifold, the space of orthonormal matrices, using Cayley SGD. The Cayley update maps the gradient into a skew-symmetric matrix Y and applies a Cayley transform of the form R' = (I − (α/2)Y)⁻¹ (I + (α/2)Y) R, where α is the step size; the result is guaranteed to stay orthogonal as long as R started orthogonal ^[1]. Calibration is cheap: roughly 800 WikiText-2 samples and 100 optimization iterations, about 1.3 hours on a single NVIDIA A100 node for LLaMA-2 7B. SpinQuant can be combined with a weight-only quantizer such as GPTQ; in that configuration the rotations are optimized with only activation quantization active, and GPTQ then handles the weight rounding, which the authors found improves the final result ^[1].
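
A single Cayley step can be sketched in a few lines. The skew-symmetric matrix here is built as D Rᵀ − R Dᵀ from a direction matrix D (for example a negated loss gradient); that construction, the sign convention, and the step size are assumptions for illustration, and the projection used in the paper may differ in detail. The function names are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cayley_step(R, D, lr):
    """Return (I - lr/2 Y)^-1 (I + lr/2 Y) R with Y = D R^T - R D^T."""
    n = len(R)
    A = matmul(D, transpose(R))
    Y = [[A[i][j] - A[j][i] for j in range(n)] for i in range(n)]  # skew-symmetric
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    left = [[eye[i][j] - 0.5 * lr * Y[i][j] for j in range(n)] for i in range(n)]
    right = [[eye[i][j] + 0.5 * lr * Y[i][j] for j in range(n)] for i in range(n)]
    return solve(left, matmul(right, R))


s = 0.5
R0 = [[s, s, s, s], [s, -s, s, -s], [s, s, -s, -s], [s, -s, -s, s]]
# Arbitrary direction matrix, standing in for a negated gradient.
D = [[0.1, -0.4, 0.3, 0.0], [0.7, 0.2, -0.5, 0.1],
     [-0.3, 0.6, 0.0, 0.2], [0.4, -0.1, 0.8, -0.6]]

R1 = cayley_step(R0, D, lr=0.1)
RRt = matmul(R1, transpose(R1))
ortho_err = max(abs(RRt[i][j] - (1.0 if i == j else 0.0))
                for i in range(4) for j in range(4))
moved = max(abs(R1[i][j] - R0[i][j]) for i in range(4) for j in range(4))
print(ortho_err, moved)
```

The updated matrix differs from the starting point, yet R1 R1ᵀ remains the identity up to rounding, so no re-orthogonalization step is needed between iterations.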

Relationship to QuaRot and Hadamard methods

SpinQuant is closely related to QuaRot (Ashkboos et al., 2024), a concurrent rotation-based scheme that also rotates weights, activations, and the KV cache for end-to-end 4-bit inference ^[3]. The main difference is that QuaRot uses fixed random Hadamard rotations and does not learn them, whereas SpinQuant optimizes its mergeable rotations. Both build on the incoherence-processing idea from QuIP and QuIP# ^[4] and the computational-invariance idea from SliceGPT ^[6].

The two methods also differ in inference cost and robustness. SpinQuant uses two online Hadamard rotations per transformer block (R3 and R4), fewer than QuaRot, and because its rotations are tuned it is less sensitive to the unlucky-random-rotation problem. Empirically, SpinQuant reduces the quantization accuracy gap relative to QuaRot, with the advantage largest on harder-to-quantize models: on LLaMA-3 8B it cuts the gap to full precision by up to 45.1 percent relative to QuaRot, while on the easier LLaMA-2 7B its lead is roughly 1.5 points ^[1]. The trade-off is that SpinQuant requires a short optimization step, while QuaRot is purely closed-form.

Results

SpinQuant was evaluated mainly in the W4A4KV4 setting (4-bit weights, 4-bit activations, and 4-bit KV cache), reporting average zero-shot accuracy across a suite of commonsense-reasoning benchmarks. The table below summarizes the headline numbers reported in the paper ^[1].

Model	SpinQuant W4A4KV4 avg zero-shot accuracy	Gap to full precision
LLaMA-2 7B	64.0	2.9 points
LLaMA-2 13B	66.9	1.4 points
LLaMA-2 70B	71.2	1.7 points
LLaMA-3 8B	65.2	4.4 points
LLaMA-3 70B	69.3	5.2 points

On LLaMA-2 7B in the W4A4KV4 setting, SpinQuant outperformed the quantization-aware training baseline LLM-QAT by 19.1 points and SmoothQuant by 25.0 points ^[1]. The paper also reports that learned rotations substantially outperform random rotations, confirming that the optimization, not merely the presence of a rotation, drives the gains. LLaMA-3 models are harder to quantize than LLaMA-2 models, as reflected in their larger residual gaps (4.4 to 5.2 points versus 1.4 to 2.9 points).

Deployment in quantized Llama models

SpinQuant moved from research into production quickly. On October 24, 2024, Meta released quantized versions of Llama 3.2 1B and Llama 3.2 3B aimed at phones and other edge devices ^[5]. Meta shipped each model with two quantization recipes: a pipeline combining quantization-aware training with LoRA adaptors, which prioritizes accuracy, and SpinQuant, the post-training method, which prioritizes portability because it needs no access to the original training data or pipeline. The on-device scheme quantizes the linear layers in the transformer blocks to 4-bit groupwise weights (group size 32) with 8-bit per-token dynamic activations, rather than the fully 4-bit-activation setting used for the research benchmarks. Meta reported average reductions of about 56 percent in model size and 41 percent in memory use, with 2 to 4 times faster inference on mobile hardware ^[5]. SpinQuant was also part of a live on-device demonstration at Meta Connect in September 2024.
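
The two granularities in the on-device scheme can be sketched as follows. This assumes symmetric round-to-nearest quantization for both weights and activations, which the text above does not specify; the function names and data are hypothetical and this is not Meta's implementation.

```python
def quantize_groupwise(row, group_size=32, bits=4):
    """Quantize one weight row with a separate scale per group of channels."""
    qmax = 2 ** (bits - 1) - 1
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0   # guard against all-zero groups
        scales.append(scale)
        q.extend(max(-qmax - 1, min(qmax, round(v / scale))) for v in group)
    return q, scales


def dequantize_groupwise(q, scales, group_size=32):
    return [v * scales[i // group_size] for i, v in enumerate(q)]


def quantize_per_token(tokens, bits=8):
    """Dynamic quantization: one scale per token, computed from the live values."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for tok in tokens:
        scale = max(abs(v) for v in tok) / qmax or 1.0
        out.append(([max(-qmax - 1, min(qmax, round(v / scale))) for v in tok], scale))
    return out


# Toy weight row of 64 channels with one large value in the second group.
row = [((i * 37) % 17 - 8) / 8.0 for i in range(64)]
row[40] = 25.0
q, scales = quantize_groupwise(row)
deq = dequantize_groupwise(q, scales)
group0_err = max(abs(w - d) for w, d in zip(row[:32], deq[:32]))

# Two toy tokens whose magnitudes differ by 100x.
acts = quantize_per_token([[0.1, -0.2, 0.05, 0.3], [10.0, -20.0, 5.0, 30.0]])

print(scales, group0_err)
print(acts)
```

With group size 32, the large value only inflates the scale of its own group, and per-token scaling lets tokens of very different magnitude use the full 8-bit range.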

Limitations

SpinQuant has several caveats. As a calibration-based post-training method, its learned rotations are fit to a small set of calibration text, and the authors note that real-world deployment may involve different data and activation distributions, so generalization beyond the calibration set warrants further study ^[1]. The online rotations R3 and R4 cannot be merged into weights and add a fast-Hadamard-transform cost at inference, however small. The mergeable rotations R1 and R2 are learned, but R3 and R4 remain fixed Hadamard transforms rather than optimized, and ablations show that the bulk of the benefit comes from a subset of the rotations while some add little incremental accuracy. Fully 4-bit activation quantization still leaves a measurable gap on the hardest models, which is why production releases used the more conservative 4-bit-weight, 8-bit-activation scheme and offered a quantization-aware-training alternative when accuracy mattered most ^[5]. Finally, the method is specialized to the transformer structure it analyzes (LLaMA and similar decoder-only models), and applying it to other architectures requires re-deriving where invariant rotations can be inserted.

References

1. Liu, Z., Zhao, C., Fedorov, I., Soran, B., Choudhary, D., Krishnamoorthi, R., Chandra, V., Tian, Y., and Blankevoort, T. "SpinQuant: LLM Quantization with Learned Rotations." arXiv:2405.16406, May 26, 2024; published at ICLR 2025. https://arxiv.org/abs/2405.16406
2. Facebook Research. "SpinQuant" (code repository). https://github.com/facebookresearch/SpinQuant
3. Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. "QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs." arXiv:2404.00456, 2024; NeurIPS 2024. https://arxiv.org/abs/2404.00456
4. Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. "QuIP: 2-Bit Quantization of Large Language Models With Guarantees." arXiv:2307.13304, 2023. https://arxiv.org/abs/2307.13304
5. Meta AI. "Introducing quantized Llama models with increased speed and a reduced memory footprint." October 24, 2024. https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/
6. Ashkboos, S., Croci, M. L., Nascimento, M. G., Hoefler, T., and Hensman, J. "SliceGPT: Compress Large Language Models by Deleting Rows and Columns." arXiv:2401.15024, 2024; ICLR 2024. https://arxiv.org/abs/2401.15024
7. Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." arXiv:2211.10438, 2022; ICML 2023. https://arxiv.org/abs/2211.10438
