SpinQuant
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,852 words
SpinQuant is a post-training quantization method for large language models that inserts learned rotation matrices into a transformer network to make its weights, activations, and KV cache easier to represent at 4 bits. It was introduced in the paper "SpinQuant: LLM Quantization with Learned Rotations" by Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort, all at Meta. The paper first appeared on arXiv on May 26, 2024, and was published at the International Conference on Learning Representations (ICLR) in 2025 [1]. Reference code is released by Facebook Research [2].
The central observation behind SpinQuant is that multiplying the hidden states of a transformer by an orthogonal rotation matrix can be made mathematically invisible to the network output, yet the rotation reshapes the distribution of values so that large outliers are spread out and quantization becomes far less lossy. Random rotations already help, but they vary widely in quality. SpinQuant instead treats the rotations as trainable parameters and optimizes them directly against a quantization objective using Cayley stochastic gradient descent (Cayley SGD) on the Stiefel manifold of orthogonal matrices. On LLaMA-2 7B with weights, activations, and KV cache all quantized to 4 bits, SpinQuant narrows the average zero-shot accuracy gap to full precision to 2.9 points [1]. The method was later used to ship quantized Llama 3.2 models for on-device inference [5].
Quantizing only the weights of an LLM to 4 bits is relatively well understood, but quantizing the activations (and the KV cache) is much harder. The difficulty comes from activation outliers: a small number of feature channels in transformer hidden states take on values that are orders of magnitude larger than the rest. Because uniform integer quantization assigns the same step size across an entire tensor (or group), these outliers stretch the numeric range and force most ordinary values into only a few quantization levels, destroying accuracy. Earlier methods such as SmoothQuant addressed this by migrating scale between activations and weights [7], but outliers in the largest models remained a bottleneck for fully 4-bit (weight, activation, and KV) inference.
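The effect of a single outlier on a shared step size can be seen in a small NumPy sketch. The symmetric per-tensor quantizer, the tensor size, and the outlier value of 100 are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def quantize_dequantize(x, bits=4):
    """Symmetric per-tensor uniform quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit integers
    scale = np.max(np.abs(x)) / qmax           # one step size for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def levels_used(x, bits=4):
    """Count how many distinct integer levels the tensor actually occupies."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return len(np.unique(np.clip(np.round(x / scale), -qmax - 1, qmax)))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1024)            # ordinary activation values
x_outlier = x.copy()
x_outlier[0] = 100.0                           # one outlier channel (illustrative value)

# Mean squared error measured on the ordinary entries only (index 1 onward).
err_clean = np.mean((x[1:] - quantize_dequantize(x)[1:]) ** 2)
err_outlier = np.mean((x_outlier[1:] - quantize_dequantize(x_outlier)[1:]) ** 2)

print("levels used:", levels_used(x), "vs", levels_used(x_outlier))
print("error on ordinary values:", err_clean, "vs", err_outlier)
```

With the outlier present the step size becomes 100 / 7, roughly 14.3, so every value with magnitude below about 7.1 rounds to zero and the ordinary entries lose essentially all of their information.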
A different line of work attacks the problem geometrically. If a hidden vector is multiplied by an orthogonal (rotation) matrix, its individual coordinates are mixed together. Concentrated outlier energy gets redistributed across many channels, producing a more uniform, lower-magnitude, and therefore more quantization-friendly distribution. This connects to the notion of incoherence from QuIP, which showed that pre- and post-multiplying weight matrices by random orthogonal matrices makes them "incoherent," bounding the per-element magnitude and improving low-bit weight quantization [4]. The key practical question is how to insert such rotations without changing what the network computes, and which rotations to use.
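A minimal sketch of this redistribution, assuming a random orthogonal matrix drawn through a QR decomposition (one common construction, not necessarily the one used by QuIP or SpinQuant) and an illustrative outlier value of 60:

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw a random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))             # fix column signs for a well-defined draw

rng = np.random.default_rng(0)
n = 256
x = rng.normal(0.0, 1.0, size=n)               # hidden vector with ordinary values
x[3] = 60.0                                    # one outlier channel (illustrative value)

R = random_orthogonal(n, rng)
x_rot = x @ R                                  # rotate the hidden vector

norm_before, norm_after = np.linalg.norm(x), np.linalg.norm(x_rot)
peak_before, peak_after = np.max(np.abs(x)), np.max(np.abs(x_rot))

print("norm:", norm_before, "->", norm_after)  # a rotation preserves the vector norm
print("peak:", peak_before, "->", peak_after)  # the largest coordinate shrinks
```

The norm is unchanged because R is orthogonal, while the peak magnitude drops because the outlier's energy is shared among all 256 coordinates.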
SpinQuant relies on the fact that an orthogonal matrix R satisfies R times R-transpose equals the identity, so a rotation can always be undone. If a rotation R is applied to the output of one weight matrix and its inverse R-transpose is applied to the input of the next, the two cancel and the end-to-end function is unchanged. This is the same "computational invariance" property used by SliceGPT and QuaRot [3][6]: a carefully paired rotation can be folded (merged) into the surrounding weight matrices ahead of time, so at inference the rotated network has exactly the same architecture and cost as the original, but its intermediate tensors are easier to quantize.
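The folding step can be sketched for two back-to-back linear maps. This is a simplified illustration: it ignores normalization layers and nonlinearities, uses a row-vector convention (y = x @ W), and the names `W1`, `W2`, and the layer sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 8
W1 = rng.normal(size=(d_in, d_hidden))         # first linear map
W2 = rng.normal(size=(d_hidden, d_out))        # next linear map
x = rng.normal(size=(4, d_in))                 # a batch of 4 input vectors

# Any orthogonal matrix works here; this one comes from a QR decomposition.
R, _ = np.linalg.qr(rng.normal(size=(d_hidden, d_hidden)))

# Fold the rotation into the weights ahead of time.
W1_rot = W1 @ R                                # R applied to the output of the first map
W2_rot = R.T @ W2                              # R-transpose applied to the input of the next

h_ref = x @ W1                                 # original intermediate tensor
h_rot = x @ W1_rot                             # rotated intermediate tensor (the one to quantize)

y_ref = h_ref @ W2
y_rot = h_rot @ W2_rot                         # R @ R.T cancels, so the output is unchanged

print("max output difference:", np.max(np.abs(y_ref - y_rot)))
```

The rotated network has the same shapes and the same number of matrix multiplications as the original; only the intermediate tensor differs.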
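SpinQuant identifies four rotation insertion points in a LLaMA-style transformer, labelled R1 through R4 [1]:
- R1 rotates the residual stream (the hidden dimension shared across blocks).
- R2 rotates, per attention head, the activations between the value projection and the output projection.
- R3 rotates the queries and keys inside attention so that the KV cache can be quantized.
- R4 rotates the activations entering the down projection of the feed-forward block.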
R1 and R2 are absorbable rotations that disappear into the weights and add no inference cost. R3 and R4 are kept as fast Hadamard transforms (Hadamard matrix multiplications) executed online, adding a small runtime overhead but no learned parameters.
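The online rotations are cheap because a Hadamard transform of length n needs only about n log n additions and subtractions rather than a dense n by n multiply. The following is a generic NumPy sketch of the fast Walsh-Hadamard transform with orthonormal scaling; it is not the kernel used in the SpinQuant code, and it assumes the length is a power of two.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis."""
    x = np.array(x, dtype=np.float64)
    n = x.shape[-1]
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    lead = x.shape[:-1]
    h = 1
    while h < n:
        # View the data as blocks of size 2h and split each block into two halves.
        blocks = x.reshape(*lead, n // (2 * h), 2, h)
        a, b = blocks[..., 0, :], blocks[..., 1, :]
        # Butterfly: sums go to the first half, differences to the second half.
        x = np.stack((a + b, a - b), axis=-2).reshape(*lead, n)
        h *= 2
    return x / np.sqrt(n)                      # scaling makes the transform orthogonal

def hadamard_matrix(n):
    """Dense Sylvester-ordered Hadamard matrix, used only to check the fast version."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
    return H

rng = np.random.default_rng(0)
v = rng.normal(size=(3, 64))
fast = fwht(v)
dense = v @ hadamard_matrix(64) / np.sqrt(64)
print("max difference from dense multiply:", np.max(np.abs(fast - dense)))
```

Because the scaled Hadamard matrix is symmetric and orthogonal, applying `fwht` twice returns the original vector.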
The novelty of SpinQuant is that the mergeable rotations R1 and R2 are not fixed at random but learned. The authors observed that different random rotations, including random Hadamard and random orthogonal matrices, can differ by several points of accuracy on the same model, so the choice of rotation matters and can be optimized. SpinQuant minimizes the network's quantization loss (the cross-entropy of the quantized model on a small calibration set) with respect to R1 and R2 while keeping the model weights frozen.
Because R1 and R2 must remain orthogonal throughout training, ordinary gradient descent would break the constraint. SpinQuant therefore optimizes on the Stiefel manifold, the space of orthonormal matrices, using Cayley SGD. The Cayley update maps the gradient into a skew-symmetric matrix and applies a Cayley transform of the form (I minus alpha/2 times Y) inverse times (I plus alpha/2 times Y) times R, which is guaranteed to stay orthogonal as long as R started orthogonal [1]. Calibration is cheap: roughly 800 WikiText-2 samples and 100 optimization iterations, about 1.3 hours on a single NVIDIA A100 node for LLaMA-2 7B. SpinQuant can be combined with a weight-only quantizer such as GPTQ; in that configuration the rotations are optimized with only activation quantization active, and GPTQ then handles the weight rounding, which the authors found improves the final result [1].
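A minimal sketch of one Cayley update of the form described above, run on a toy objective. Several things here are assumptions of the sketch rather than details from the paper: the toy loss (distance to a hypothetical target rotation) stands in for the quantized network's cross-entropy, the skew-symmetric matrix Y is built as half of (R G-transpose minus G R-transpose) with its sign chosen so that a small step decreases the loss, and the step size and iteration count are arbitrary.

```python
import numpy as np

def cayley_step(R, G, alpha):
    """One Cayley update R <- (I - alpha/2 Y)^-1 (I + alpha/2 Y) R.

    R is an orthogonal (n, n) matrix, G is the Euclidean gradient dL/dR.
    """
    n = R.shape[0]
    A = G @ R.T
    Y = 0.5 * (A.T - A)                        # skew-symmetric: Y.T == -Y
    I = np.eye(n)
    # Solve a linear system instead of forming the inverse explicitly.
    return np.linalg.solve(I - 0.5 * alpha * Y, (I + 0.5 * alpha * Y) @ R)

def random_orthogonal(n, rng):
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
n = 16
target = random_orthogonal(n, rng)             # hypothetical "good" rotation for the toy loss
R = random_orthogonal(n, rng)                  # random orthogonal starting point

def toy_loss(R):
    return 0.5 * np.sum((R - target) ** 2)

losses = [toy_loss(R)]
for _ in range(100):
    G = R - target                             # gradient of the toy loss with respect to R
    R = cayley_step(R, G, alpha=0.5)
    losses.append(toy_loss(R))

orth_error = np.max(np.abs(R.T @ R - np.eye(n)))
print("loss:", losses[0], "->", losses[-1])
print("deviation from orthogonality:", orth_error)
```

The Cayley transform of a skew-symmetric matrix is itself orthogonal, so R stays on the manifold at every step without any re-projection.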
SpinQuant is closely related to QuaRot (Ashkboos et al., 2024), a concurrent rotation-based scheme that also rotates weights, activations, and the KV cache for end-to-end 4-bit inference [3]. The main difference is that QuaRot uses fixed random Hadamard rotations and does not learn them, whereas SpinQuant optimizes its mergeable rotations. Both build on the incoherence-processing idea from QuIP and QuIP-sharp [4] and the computational-invariance idea from SliceGPT [6].
The two methods also differ in inference cost and robustness. SpinQuant uses two online Hadamard rotations per transformer block (R3 and R4), fewer than QuaRot, and because its rotations are tuned it is less sensitive to the unlucky-random-rotation problem. Empirically, SpinQuant reduces the quantization accuracy gap relative to QuaRot, with the advantage largest on harder-to-quantize models: on LLaMA-3 8B it cuts the gap to full precision by up to 45.1 percent relative to QuaRot, while on the easier LLaMA-2 7B the improvement is roughly 1.5 points [1]. The trade-off is that SpinQuant requires a short optimization step, while QuaRot is purely closed-form.
SpinQuant was evaluated mainly in the W4A4KV4 setting (4-bit weights, 4-bit activations, and 4-bit KV cache), reporting average zero-shot accuracy across a suite of commonsense-reasoning benchmarks. The table below summarizes the headline numbers reported in the paper [1].
| Model | SpinQuant W4A4KV4 avg zero-shot accuracy | Gap to full precision |
|---|---|---|
| LLaMA-2 7B | 64.0 | 2.9 points |
| LLaMA-2 13B | 66.9 | 1.4 points |
| LLaMA-2 70B | 71.2 | 1.7 points |
| LLaMA-3 8B | 65.2 | 4.4 points |
| LLaMA-3 70B | 69.3 | 5.2 points |
On LLaMA-2 7B in the 4-4-4 setting, SpinQuant outperformed the quantization-aware training baseline LLM-QAT by 19.1 points and SmoothQuant by 25.0 points [1]. The paper also reports that learned rotations substantially outperform random rotations, confirming that the optimization, not merely the presence of a rotation, drives the gains. LLaMA-3 models are harder to quantize than LLaMA-2 models, reflected in their larger residual gaps (4.4 to 5.2 points versus 1.4 to 2.9 points).
SpinQuant moved from research into production quickly. On October 24, 2024, Meta released quantized versions of Llama 3.2 1B and Llama 3.2 3B aimed at phones and other edge devices [5]. Meta shipped each model with two quantization recipes: a quantization-aware training plus LoRA adaptor pipeline that prioritizes accuracy, and SpinQuant, the post-training method that prioritizes portability because it needs no access to the original training data or pipeline. The on-device scheme quantizes the linear layers in the transformer blocks to 4-bit groupwise weights (group size 32) with 8-bit per-token dynamic activations, rather than the fully 4-bit-activation setting used for the research benchmarks. Meta reported average reductions of about 56 percent in model size and 41 percent in memory use, with 2 to 4 times faster inference on mobile hardware [5]. SpinQuant was also part of a live on-device demonstration at Meta Connect in September 2024.
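The on-device number format can be illustrated with a short NumPy sketch. Only the bit widths and the group size of 32 come from the description above; the symmetric rounding scheme, the function names, and the layer shape are assumptions of this sketch and may differ from the shipped implementation.

```python
import numpy as np

def quantize_weights_groupwise(W, bits=4, group_size=32):
    """Symmetric group-wise quantization: one scale per group of input channels."""
    out_f, in_f = W.shape                      # in_f must be divisible by group_size
    qmax = 2 ** (bits - 1) - 1
    Wg = W.reshape(out_f, in_f // group_size, group_size)
    scale = np.max(np.abs(Wg), axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # avoid division by zero for all-zero groups
    q = np.clip(np.round(Wg / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_weights(q, scale):
    """Map integer codes back to floats and restore the (out, in) layout."""
    return (q * scale).reshape(q.shape[0], -1)

def quantize_activations_per_token(x, bits=8):
    """Dynamic per-token quantization: one scale per row, computed at run time."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))                 # hypothetical linear layer (out, in)
x = rng.normal(size=(4, 128))                  # four tokens

qW, sW = quantize_weights_groupwise(W)
qx, sx = quantize_activations_per_token(x)

y_ref = x @ W.T
y_quant = (qx * sx) @ dequantize_weights(qW, sW).T
rel_err = np.linalg.norm(y_ref - y_quant) / np.linalg.norm(y_ref)
print("relative output error:", rel_err)
```

Smaller groups give each scale fewer values to cover, which limits how far one large weight can stretch the step size for its neighbours, at the cost of storing more scales.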
SpinQuant has several caveats. As a calibration-based post-training method, its learned rotations are fit to a small set of calibration text, and the authors note that real-world deployment may involve different data and activation distributions, so generalization beyond the calibration set warrants further study [1]. The online rotations R3 and R4 cannot be merged into weights and add a fast-Hadamard-transform cost at inference, however small. The mergeable rotations R1 and R2 are learned, but R3 and R4 remain fixed Hadamard transforms rather than optimized, and ablations show that the bulk of the benefit comes from a subset of the rotations while some add little incremental accuracy. Fully 4-bit activation quantization still leaves a measurable gap on the hardest models, which is why production releases used the more conservative 4-bit-weight, 8-bit-activation scheme and offered a quantization-aware-training alternative when accuracy mattered most [5]. Finally, the method is specialized to the transformer structure it analyzes (LLaMA and similar decoder-only models), and applying it to other architectures requires re-deriving where invariant rotations can be inserted.