PiSSA
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,166 words
PiSSA (Principal Singular values and Singular vectors Adaptation) is a parameter-efficient fine-tuning method for large language models that initializes LoRA-style low-rank adapter matrices from the dominant components of a singular value decomposition of the pretrained weight, rather than from random noise and zeros.[^1] The method was introduced in April 2024 by Fanxu Meng, Zhaohui Wang, and Muhan Zhang at the Institute for Artificial Intelligence of Peking University, and accepted as a Spotlight paper at NeurIPS 2024.[^1][^2] PiSSA shares the architecture and trainable-parameter count of LoRA but, by training the directions that already carry the most signal in the pretrained weight matrix, converges faster and reaches higher accuracy on benchmarks such as GSM8K, MATH, and HumanEval.[^1] An implementation has been merged into the Hugging Face PEFT library as an option of the init_lora_weights argument.[^3]
Low-rank adaptation, introduced by Edward Hu and colleagues at Microsoft in 2021, freezes the weights of a pretrained model and adds, to each chosen weight matrix, a trainable low-rank update written as the product of two small matrices.[^4] If the frozen weight is W in R^(m x n), LoRA learns a delta of the form delta W = B A, where A is r-by-n and B is m-by-r with r much smaller than min(m, n).[^4] To ensure that the model behaves identically to the base model at the start of training, the canonical LoRA recipe initializes A with a Kaiming or Gaussian distribution and B with zeros, so that BA is the zero matrix and the effective weight is W + 0 = W.[^4][^3]
This zero-and-noise scheme has two consequences. First, immediately after initialization the gradient with respect to A is also zero, because the chain rule passes through B which is zero; only B receives an update from the first step.[^1] Second, the random orientation of A means early updates are sent in arbitrary directions of the weight space, rather than along the directions of W that already encode the model's most useful behavior.[^1] These observations motivated multiple lines of follow-up work, including PiSSA, that change how LoRA is initialized while keeping its inference-time form intact.[^3]
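The first consequence can be checked numerically. The following is a minimal NumPy sketch, assuming a toy squared-error loss and small illustrative dimensions (neither comes from the LoRA or PiSSA papers); it uses the LoRA convention above, in which the effective weight is W + B A with A of shape r-by-n and B of shape m-by-r.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, batch = 6, 5, 2, 4      # toy sizes, chosen only for illustration

W = rng.normal(size=(m, n))      # frozen pretrained weight
A = rng.normal(size=(r, n))      # LoRA down-projection: random init
B = np.zeros((m, r))             # LoRA up-projection: zero init

X = rng.normal(size=(batch, n))  # toy inputs
T = rng.normal(size=(batch, m))  # toy regression targets (assumption)

# Forward pass with effective weight W + B A (identical to W at init).
Y = X @ (W + B @ A).T
G = Y - T                        # dL/dY for the toy loss L = 0.5 * ||Y - T||^2

# Chain rule through delta W = B A.
grad_delta = G.T @ X             # dL/d(delta W), shape (m, n)
grad_A = B.T @ grad_delta        # shape (r, n); zero because B is zero
grad_B = grad_delta @ A.T        # shape (m, r); nonzero, oriented by the random A

print(np.allclose(B @ A, 0.0), np.abs(grad_A).max(), np.abs(grad_B).max())
```

In this sketch only B moves on the first optimizer step, and the direction it moves in is determined by the random draw of A.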
A separate line of evidence comes from the "intrinsic dimensionality" literature, which argues that fine-tuning of large pretrained models effectively takes place on a low-dimensional manifold within the much larger parameter space.[^1] PiSSA's authors connect this observation to the structure of W itself: if a small number of singular directions of the pretrained weight carry most of its operator norm, then a small adapter aligned with those directions can express most of the variation that any low-rank adapter could ever produce.[^1] Aligning the adapter at the start, rather than allowing it to drift toward those directions during training, removes the wasted initial steps that an unaligned random initialization would incur.[^1]
The paper was authored by Fanxu Meng, Zhaohui Wang, and Muhan Zhang, all at the Institute for Artificial Intelligence of Peking University at the time of submission, with Muhan Zhang holding a joint affiliation at the State Key Laboratory of General Artificial Intelligence.[^1] Fanxu Meng is a doctoral student in Muhan Zhang's group whose dissertation work focuses on parameter-efficient fine-tuning and efficient long-context inference; PiSSA is one of several papers from the group exploring low-rank and quantization-aware decompositions of pretrained models.[^10] A first preprint was posted to arXiv on April 3, 2024 (v1), with revisions on April 14, 2024 (v2), May 28, 2024 (v3), and April 9, 2025 (v4); the conference version appeared at NeurIPS 2024 with Spotlight status, placing it in roughly the top 2% of accepted submissions according to the venue's announcement.[^1][^2][^10]
For each linear layer to be adapted, with pretrained weight W in R^(m x n), PiSSA computes a thin singular value decomposition W = U Sigma V^T, where U is m-by-min(m, n), Sigma is a diagonal matrix of singular values arranged in descending order, and V^T is min(m, n)-by-n.[^1] A user-chosen rank r partitions the decomposition into two parts. The "principal" part keeps the top r singular triples, giving W_pri = U_r Sigma_r V_r^T with U_r in R^(m x r), Sigma_r in R^(r x r), and V_r in R^(n x r). The "residual" part is the remainder, W_res = U_(>r) Sigma_(>r) V_(>r)^T, and is held frozen during fine-tuning.[^1]
The principal component is then represented in factored form as the product of two trainable matrices A and B, defined with the singular values split symmetrically between them. Following the formulation in the paper and the implementation in PEFT, the down-projection A and up-projection B are[^1][^3]
A = U_r * Sigma_r^(1/2) in R^(m x r)
B = Sigma_r^(1/2) * V_r^T in R^(r x n)
At initialization A B = U_r Sigma_r V_r^T = W_pri, so the model's effective weight is W_res + W_pri = W and the forward pass exactly matches the base model.[^1] During training only A and B receive gradient updates; W_res is treated as a non-trainable buffer of the same dtype as the rest of the frozen base.[^1][^3]
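The decomposition can be sketched in a few lines of NumPy. This is an illustrative sketch under the shapes given above (A is m-by-r, B is r-by-n), not the PEFT implementation; the function name pissa_init and the toy matrix sizes are assumptions made for the example.

```python
import numpy as np

def pissa_init(W, r):
    """Split W into trainable factors (A, B) and a frozen residual W_res."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)  # thin SVD, S in descending order
    sqrt_S = np.sqrt(S[:r])
    A = U[:, :r] * sqrt_S                  # (m, r): U_r Sigma_r^(1/2)
    B = sqrt_S[:, None] * Vt[:r]           # (r, n): Sigma_r^(1/2) V_r^T
    W_res = (U[:, r:] * S[r:]) @ Vt[r:]    # remaining singular triples, kept frozen
    return A, B, W_res

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))                # toy "pretrained" weight
A, B, W_res = pissa_init(W, r=2)
X = rng.normal(size=(3, 8))                # toy batch of inputs

# The effective weight and the forward pass both match the base model at init.
print(np.allclose(W_res + A @ B, W))
print(np.allclose(X @ W, X @ W_res + (X @ A) @ B))
```

Splitting the square root of the singular values evenly between the two factors keeps A and B at comparable scale, which is the symmetric split described above.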
The PiSSA forward pass for an input X is
Y = X (W_res + A B) = X W_res + (X A) B,
which is structurally identical to LoRA's forward except that the frozen weight is the residual W_res rather than the full W, and the adapter is initialized so that A B equals W_pri.[^1] Because A and B both start at non-zero, principal-direction values, the gradients of the loss with respect to both factors are non-zero from the first step, and they point along the dominant directions of the pretrained weight.[^1] The paper reports that, in practice, the gradient norm of a PiSSA adapter through early training tracks the gradient norm of a full fine-tuning run more closely than LoRA's, which sits at a noticeably lower magnitude in the same window.[^1]
More formally, the LoRA loss gradient with respect to A at initialization is the product of the upstream gradient with the zero matrix B, which yields zero, while the gradient with respect to B is the product of the upstream gradient with a random matrix A and therefore points in an essentially arbitrary direction.[^1] In the PiSSA initialization both factors are non-zero, so neither gradient vanishes, and both products propagate signal through subspaces spanned by the principal singular vectors of W.[^1] The argument is heuristic rather than a formal convergence bound, but it offers a clear explanation for the empirical observation that PiSSA's training-loss curve overlaps with full fine-tuning for the first several hundred steps while LoRA's curve sits visibly above both.[^1]
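The argument can be written out with the chain rule. The derivation below is a sketch in the PiSSA shape convention used above (A is m-by-r, B is r-by-n, effective weight W_res + A B); the symbol G for the upstream gradient is introduced here for illustration and does not appear in the text above.

```latex
% Forward pass and upstream gradient
Y = X\,(W_{\mathrm{res}} + A B), \qquad G = \frac{\partial L}{\partial Y}

% Gradients of the two factors; X^\top G is the gradient full fine-tuning would use
\frac{\partial L}{\partial A} = X^\top G\, B^\top, \qquad
\frac{\partial L}{\partial B} = A^\top X^\top G

% At the PiSSA initialization A = U_r \Sigma_r^{1/2},\; B = \Sigma_r^{1/2} V_r^\top
\frac{\partial L}{\partial A} = \left(X^\top G\right) V_r\, \Sigma_r^{1/2}, \qquad
\frac{\partial L}{\partial B} = \Sigma_r^{1/2}\, U_r^\top \left(X^\top G\right)
```

Both expressions are projections of the full fine-tuning gradient onto the principal singular subspaces of W, scaled by the square roots of the top singular values. If either factor is initialized to zero, as in LoRA, the gradient of the other factor is a product with that zero matrix and vanishes at the first step.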
Computing a thin SVD of every adapted layer of a multi-billion-parameter model with the standard LAPACK routine is expensive enough to dominate the cost of short fine-tuning runs.[^3] To address this, PiSSA uses the randomized SVD algorithm of Halko, Martinsson, and Tropp, which approximates the leading r singular triples through a small number of subspace iterations.[^1] The PEFT integration exposes the number of subspace iterations through the option name pissa_niter_<k>, where k controls the trade-off between speed and approximation error: pissa uses the full SVD (minutes per model), while pissa_niter_4 typically returns in seconds with negligible loss on downstream metrics.[^3] The paper reports that a full SVD pass over the adapted layers of a 7B-scale model takes on the order of 434 seconds on the authors' hardware, while the randomized variant with niter = 1 finishes in roughly 5 to 12 seconds and the niter = 4 setting reaches an initialization approximation error in the range of about 1.26 x 10^-4 to 2.00 x 10^-3 in operator norm, depending on rank.[^1] The initialization error decreases monotonically with the number of subspace iterations, allowing users to dial in the speed-accuracy point appropriate to their hardware budget.[^1]
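The idea behind the randomized variant can be sketched as follows. This is a minimal NumPy illustration of a Halko-style range finder with subspace iterations, not the routine PEFT calls; the function name, the oversample parameter, and the synthetic test matrix are assumptions made for the example.

```python
import numpy as np

def randomized_svd(W, r, niter=4, oversample=8, seed=0):
    """Approximate the top-r singular triples of W with niter subspace iterations."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    k = min(r + oversample, min(m, n))
    Q, _ = np.linalg.qr(W @ rng.normal(size=(n, k)))  # initial estimate of the range of W
    for _ in range(niter):                             # each pass sharpens the estimate
        Z, _ = np.linalg.qr(W.T @ Q)
        Q, _ = np.linalg.qr(W @ Z)
    # Exact SVD of the small projected matrix, then lift back to the original space.
    U_small, S, Vt = np.linalg.svd(Q.T @ W, full_matrices=False)
    return (Q @ U_small)[:, :r], S[:r], Vt[:r]

# Synthetic matrix with singular values 1, 1/2, ..., 1/100 (illustrative only).
rng = np.random.default_rng(1)
U0, _ = np.linalg.qr(rng.normal(size=(200, 100)))
V0, _ = np.linalg.qr(rng.normal(size=(100, 100)))
W = (U0 * (1.0 / np.arange(1, 101))) @ V0.T

S_exact = np.linalg.svd(W, compute_uv=False)[:8]
for niter in (0, 4):
    _, S_approx, _ = randomized_svd(W, r=8, niter=niter)
    print(niter, np.abs(S_approx - S_exact).max())
```

Only matrices with about r + oversample columns are ever orthogonalized or decomposed, which is why the cost stays far below that of a full SVD when r is small relative to the layer dimensions.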
Because the inference-time form W_res + A B is identical in shape to a LoRA-adapted layer, a PiSSA-trained adapter can be redistributed in standard LoRA format without shipping a custom residual base. The PEFT integration provides a conversion path that, given the original W, recovers a LoRA delta delta W = A' B' minus A B, where A' and B' are the trained factors and A and B their initial values, such that loading the converted weights on top of the unmodified base reproduces the trained model.[^3] This avoids storing W_res separately for each fine-tune and preserves compatibility with serving infrastructure that expects LoRA-format checkpoints.[^3] The conversion procedure stores a pair of matrices of shape (m, 2r) and (2r, n) rather than (m, r) and (r, n); the doubling captures the difference between the trained and the initial adapter so that no separate residual file is needed at deployment.[^3]
The cost of this convenience is that the converted LoRA delta typically has rank up to 2r rather than r, slightly increasing the storage footprint and the per-step inference overhead. For deployments that prefer minimum-size adapters and are willing to ship a custom residual, the original PiSSA format (separate W_res plus rank-r A, B) avoids this doubling.[^3]
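The rank-2r structure of the converted delta follows from stacking the trained and initial factors side by side. The sketch below illustrates the algebra in NumPy; it is not the PEFT conversion code, and the "trained" factors A1 and B1 are hypothetical stand-ins produced by perturbing the initial ones.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
W = rng.normal(size=(m, n))                      # original base weight

# PiSSA initialization (A is m x r, B is r x n).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A0 = U[:, :r] * np.sqrt(S[:r])
B0 = np.sqrt(S[:r])[:, None] * Vt[:r]
W_res = W - A0 @ B0

# Stand-in for training: perturb the factors (hypothetical trained values).
A1 = A0 + 0.1 * rng.normal(size=A0.shape)
B1 = B0 + 0.1 * rng.normal(size=B0.shape)

# delta W = A1 B1 - A0 B0 = [A1 | A0] [B1 ; -B0], a product of rank at most 2r.
lora_left = np.concatenate([A1, A0], axis=1)     # (m, 2r)
lora_right = np.concatenate([B1, -B0], axis=0)   # (2r, n)

# Applying the delta to the unmodified base reproduces the PiSSA-trained weight.
print(np.allclose(W + lora_left @ lora_right, W_res + A1 @ B1))
```

The residual never has to be stored: it is implied by the original W together with the initial factors folded into the stacked matrices.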
The PiSSA paper reports results across 12 base models ranging from 184M to 70B parameters and 13 downstream tasks, comprising five natural language generation benchmarks (including GSM8K, MATH, and HumanEval) and eight natural language understanding tasks drawn from the GLUE benchmark.[^1] Across these settings PiSSA consistently outperforms LoRA at matched rank and training budget.[^1] Selected head-to-head numbers from the paper's main result tables are reproduced below.
| Base model | Benchmark | LoRA | PiSSA | Delta (pp) |
|---|---|---|---|---|
| Mistral 7B | GSM8K | 67.70% | 72.86% | +5.16 |
| Mistral 7B | MATH | 19.68% | 21.54% | +1.86 |
| Mistral 7B | HumanEval | 43.90% | 46.95% | +3.05 |
| LLaMA 2 7B | GSM8K | 42.30% | 53.07% | +10.77 |
| LLaMA 2 7B | MATH | 5.50% | 7.44% | +1.94 |
| LLaMA 2 7B | HumanEval | 18.29% | 21.95% | +3.66 |
| Gemma 7B | GSM8K | 74.90% | 77.94% | +3.04 |
| Gemma 7B | MATH | 31.28% | 31.94% | +0.66 |
| Gemma 7B | HumanEval | 53.66% | 54.27% | +0.61 |
Sources: PiSSA paper main tables.[^1]
The largest absolute gains are seen on the LLaMA 2 7B base, where PiSSA's GSM8K accuracy is more than 10 percentage points above LoRA's at matched rank. The smallest gains appear on already-strong bases such as Gemma 7B on HumanEval, where both LoRA and PiSSA sit close to the ceiling for the rank tested.[^1] The paper additionally presents convergence curves showing that, on the same training data and learning rate schedule, the cross-entropy loss of a PiSSA-adapted model drops more rapidly in the first several hundred steps than a LoRA-adapted model of the same rank, and that the final loss reached by PiSSA is below LoRA's at the end of training.[^1] These curves are presented as evidence for the qualitative argument that aligning the adapter with the principal directions of W from step zero shortens the initial transient in which LoRA's random-direction updates contribute little.[^1]
The paper sweeps the adapter rank r through the values 1, 4, 8, 16, 32, 64, and 128 and reports that PiSSA's accuracy advantage over LoRA holds at every value tested, although the size of the advantage varies.[^1] At very small ranks (r in the low single digits) the gap is largest in relative terms because the principal-direction initialization captures a disproportionate share of W's spectrum that random initialization would need many steps to find. At the largest ranks tested (r = 128) the two methods come closer together, since both are now expressive enough to recover most of the relevant subspace within the training budget.[^1] The number of trainable parameters across the tested configurations is reported in the range of roughly 168M to 320M, depending on rank and model size, and is matched between LoRA and PiSSA at every comparison point.[^1]
The reported experiments span LLaMA 2, LLaMA 3, Mistral 7B, Gemma 7B, Qwen 1.5 7B, the Yi-1.5 34B model, the DeepSeek-MoE 16B model, Mixtral 8x7B, and the DeBERTa v3 base model at the small-scale end.[^1][^2] PiSSA is applied to all linear layers of the attention and MLP blocks, mirroring the broader "apply LoRA everywhere" recipe that became the default in PEFT around 2024.[^1][^3] The breadth of the benchmark sweep is among the larger ones for a 2024 PEFT method and is one of the reasons cited by the NeurIPS reviewers when accepting the paper as a Spotlight.[^2]
PiSSA combines with weight-only quantization of the residual matrix in a manner analogous to QLoRA, producing a variant the authors call QPiSSA.[^1] The structural argument is straightforward: because the largest singular components have been pulled out of W into the trainable factors A and B, the residual W_res has a narrower spectrum and a distribution that is closer to the Gaussian assumption used by the NF4 4-bit data type introduced in QLoRA.[^1][^5] Quantizing W_res therefore introduces less reconstruction error than quantizing the full W, which in turn translates into a smaller post-quantization performance gap between the fine-tuned model and its full-precision counterpart.[^1]
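The comparison can be stated compactly. The block below is a sketch of the two initialization errors being compared, with the matrix norm left generic because the choice of norm is not fixed by the argument; nf4 denotes quantization to the NF4 data type followed by dequantization.

```latex
% QLoRA: the adapter starts at zero, so the whole of W is quantized
\text{error}_{\mathrm{QLoRA}} = \left\lVert W - \mathrm{nf4}(W) \right\rVert

% QPiSSA: A B = W_pri is kept in full precision, only the residual is quantized
\text{error}_{\mathrm{QPiSSA}}
  = \left\lVert W - \bigl(\mathrm{nf4}(W_{\mathrm{res}}) + A B\bigr) \right\rVert
  = \left\lVert W_{\mathrm{res}} - \mathrm{nf4}(W_{\mathrm{res}}) \right\rVert
```

The second equality uses W = W_res + A B at initialization, so the QPiSSA error depends only on how well the residual quantizes.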
The paper reports that, on LLaMA 2 7B, QPiSSA reduces the 4-bit quantization error relative to QLoRA by roughly 18.97% in operator-norm terms, and that on LLaMA 3 70B the GSM8K accuracy after fine-tuning rises from 81.73% with QLoRA to 86.05% with QPiSSA, a gap of 4.32 percentage points at the same rank and bit-width.[^1] At the 7B scale on GSM8K, QPiSSA reaches 49.13% accuracy, ahead of both QLoRA at 39.8% and LoftQ at 40.71% under matched 4-bit precision and matched training budget.[^1] QPiSSA also includes an optional iterative refinement, denoted by a T-iterations parameter, that recomputes the SVD of W minus the NF4-quantized residual and reduces the residual quantization error further at the cost of a few additional initialization seconds.[^1]
For comparison, the closely related LoftQ method (October 2023) addresses the same quantization-and-fine-tuning gap by alternating between quantization of W and a low-rank correction that approximates the difference between W and its quantization.[^6] PiSSA differs in that the correction is the principal part of W rather than the residual of quantization, and that the trainable adapter retains the principal structure throughout fine-tuning rather than being chosen specifically to absorb quantization noise.[^1] The two methods can be viewed as complementary points in a design space that the PEFT documentation now organizes as a family of "smart" LoRA initializations.[^3]
The optional QPiSSA-T-iters procedure proceeds as follows. After the initial SVD-based decomposition of W into a residual W_res and principal factors A_0, B_0, the residual is quantized to NF4. The next iteration recomputes the SVD on W minus nf4(W_res) and assigns the resulting principal factors as the new A_t, B_t; the residual is then re-derived as W minus A_t B_t, requantized, and the loop repeats for T steps.[^1] The paper reports that running for T = 5 reduces the residual quantization error by an additional factor of approximately two relative to T = 1, with negligible additional wall-clock cost given that each iteration involves a randomized SVD that finishes in seconds.[^1]
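The loop can be sketched in NumPy. This is an illustration of the control flow only: toy_quantize is a simple per-tensor uniform quantizer standing in for NF4, the error is measured in the Frobenius norm, and the function names and the synthetic matrix are all assumptions made for the example rather than details from the paper.

```python
import numpy as np

def toy_quantize(M, bits=4):
    """Per-tensor absmax uniform quantizer; a simple stand-in for NF4."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(M).max() / levels
    return np.round(M / scale) * scale

def principal_factors(M, r):
    """Top-r SVD factors with singular values split symmetrically."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r] * np.sqrt(S[:r]), np.sqrt(S[:r])[:, None] * Vt[:r]

def qpissa_iters(W, r, T=5):
    A, B = principal_factors(W, r)               # iteration 1: plain PiSSA split
    W_res = W - A @ B
    errors = [np.linalg.norm(W - (toy_quantize(W_res) + A @ B))]
    for _ in range(T - 1):
        # Refit the adapter to what the quantized residual fails to represent.
        A, B = principal_factors(W - toy_quantize(W_res), r)
        W_res = W - A @ B                        # re-derive the residual, then requantize
        errors.append(np.linalg.norm(W - (toy_quantize(W_res) + A @ B)))
    return A, B, toy_quantize(W_res), errors

# Synthetic weight: a strong rank-4 component plus unit-variance noise.
rng = np.random.default_rng(0)
W = 5.0 * rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64)) + rng.normal(size=(64, 64))

baseline = np.linalg.norm(W - toy_quantize(W))   # quantize all of W, adapter at zero
A, B, W_res_q, errors = qpissa_iters(W, r=4, T=5)
print(baseline, errors)
```

On this synthetic matrix the first entry of errors is well below the baseline, because removing the dominant components shrinks the value range the quantizer has to cover. How the error evolves over later iterations depends on the quantizer and the matrix, so the sketch prints the sequence rather than asserting a trend.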
PiSSA is available in the official Hugging Face PEFT library as an option to the init_lora_weights argument of LoraConfig.[^3] Passing init_lora_weights="pissa" triggers a full SVD per adapted layer at the time get_peft_model is called; passing init_lora_weights="pissa_niter_4" (or another iteration count) instead invokes the randomized SVD variant.[^3] No other parts of the LoRA configuration need to change; the same rank, alpha, dropout, and target-module settings apply unchanged, which makes PiSSA a drop-in replacement for the default initialization in existing fine-tuning pipelines.[^3] The PEFT documentation lists PiSSA alongside other principled initialization options (OLoRA, EVA, CorDA) that have been added since 2024.[^3]
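A configuration along these lines might look like the sketch below. LoraConfig and its init_lora_weights argument are the PEFT names described above; the helper functions, the rank and alpha values, and the target module names are illustrative assumptions and depend on the base model being adapted.

```python
def pissa_init_name(niter=None):
    """Return the init_lora_weights string: full SVD if niter is None, else randomized."""
    return "pissa" if niter is None else f"pissa_niter_{niter}"

def build_pissa_config(rank=16, niter=4):
    """Build a LoraConfig that uses PiSSA initialization (hypothetical helper)."""
    # Imported lazily so pissa_init_name can be used without PEFT installed.
    from peft import LoraConfig
    return LoraConfig(
        r=rank,
        lora_alpha=rank,                      # illustrative value, not prescribed here
        target_modules=["q_proj", "v_proj"],  # assumed module names; model-dependent
        init_lora_weights=pissa_init_name(niter),
    )

# Typical use (requires transformers, peft, and a locally available base model):
#   model = AutoModelForCausalLM.from_pretrained("<base-model-id>")
#   peft_model = get_peft_model(model, build_pissa_config())
print(pissa_init_name(), pissa_init_name(4))
```

Switching an existing LoRA pipeline to PiSSA in this sketch amounts to changing the single init_lora_weights field; the rest of the configuration is left as it was.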
To avoid recomputing the SVD for every fine-tune, the PiSSA authors publish pre-decomposed base models on the Hugging Face Hub for several popular open releases, including variants of LLaMA 2, LLaMA 3, and the Qwen 2 family.[^7] These checkpoints package the residual W_res together with the principal factors A and B, allowing users to load a ready-to-train PiSSA configuration without ever invoking the SVD locally.[^7] The official repository at MuLabPKU/PiSSA on GitHub maintains training scripts, DeepSpeed configurations, and links to the published checkpoints.[^7]
The original code release is licensed under CC BY 4.0 and includes training scripts for math, code, and conversational benchmarks, as well as scripts that demonstrate conversion of a trained PiSSA adapter back into LoRA-format checkpoints for serving.[^7] Beyond the upstream repository, PiSSA has been incorporated into community fine-tuning toolkits including LLaMA Factory and PaddleNLP, in addition to its inclusion in the main branch of PEFT.[^10]
PiSSA's first author Fanxu Meng and several collaborators have published follow-up work that extends the principal-direction-initialization idea to new settings.[^10] HD-PiSSA, presented at EMNLP 2025 as an oral, distributes different principal components of the pretrained weights across multiple GPUs in a data-parallel training setup so that the effective rank of the adapter spans the union of the per-GPU components, achieving more than 16 times the effective updated rank of conventional data-parallel LoRA or PiSSA when training on eight GPUs.[^8] Other related work in the cluster includes TransMLA, a method for converting standard multi-head attention into multi-head latent attention that, like PiSSA, leans on low-rank factorizations of pretrained weights to preserve behavior while reducing serving cost.[^10] These projects share Muhan Zhang's group at the Institute for Artificial Intelligence at Peking University as their academic home.[^10]
PiSSA sits inside a cluster of 2023-2024 PEFT methods that share LoRA's two-matrix shape but differ in how they initialize, decompose, or constrain the factors.
| Method | Year | Initialization strategy | Notes |
|---|---|---|---|
| LoRA | 2021 | Random Gaussian for A, zeros for B | Original method, widely deployed[^4] |
| QLoRA | 2023 | LoRA initialization atop a 4-bit NF4 quantized base | Enables 65B fine-tuning on a single 48 GB GPU[^5] |
| LoftQ | 2023 | Alternates quantization of W with a LoRA correction approximating W minus its quantization | Lowers quantization-induced fine-tuning gap[^6] |
| DoRA | 2024 | LoRA on direction with a separately trained magnitude vector | Decomposes pretrained weight into magnitude and direction[^9] |
| PiSSA | 2024 | A and B set from the top r SVD components of W; residual frozen | Faster convergence and better final accuracy than LoRA at matched rank[^1] |
DoRA in particular is reported by the PiSSA authors to be compatible with PiSSA: stacking PiSSA initialization atop DoRA's magnitude-direction decomposition gives an additional improvement on GSM8K with Mistral 7B in their ablations.[^1][^9] The PEFT documentation describes PiSSA, CorDA, OLoRA, and EVA as members of a broader family of "smart" LoRA initializations that exploit structural information about either the weights or the data distribution.[^3]
PiSSA's most-discussed limitations come from the paper's own conclusions and from comparisons against the methods it builds on.
Although fast, the SVD initialization is an additional upfront step whose cost scales with the total number of adapted parameters. For very large bases with broad target-module coverage, the full SVD path can take minutes; the randomized variant is recommended in most production settings, but it produces an approximation rather than an exact decomposition.[^1][^3] The choice of rank r is more consequential than in LoRA, because r determines how much of W's spectrum is moved into the trainable factors. The paper notes that, while PiSSA outperforms LoRA at every rank tested, the relative improvement varies with both r and the base model, and the authors do not provide a closed-form recipe for choosing r beyond reporting that ranks similar to those used in the LoRA literature transfer to PiSSA.[^1]
PiSSA is not strictly better than LoRA in every reported setting. The paper presents results for the DeBERTa v3 base on the GLUE benchmark, where the gap to LoRA narrows substantially compared with the math and code benchmarks, and on some configurations the methods are statistically similar.[^1] The paper's own limitations section also notes that the method has been evaluated almost exclusively on linear layers in transformer-based large language models, and that the effect on convolutional layers or multimodal architectures is not yet well characterized.[^1] A related open question is whether PiSSA benefits from adaptive-rank methods such as AdaLoRA, where the per-layer rank is learned during training; the paper reports a preliminary positive ablation, but the interaction has not been studied in detail outside the original work.[^1]
A more theoretical concern is that the analysis relating gradient alignment to convergence speed is intuitive rather than rigorous: the paper does not prove a bound on convergence rate as a function of singular-value spectrum, and so the "principal directions move first" argument remains a heuristic supported by curves rather than by a formal optimization result.[^1] This has been noted in the NeurIPS reviews accompanying the paper and is identified by the authors themselves as a direction for future theoretical analysis.[^2]
PiSSA's contribution is narrow in scope but practical in impact. It changes a single line of a standard LoRA training script (the initialization) and leaves the trainable-parameter count, optimizer setup, and forward and backward passes structurally unchanged.[^1][^3] In return it reports consistent improvements on math and code benchmarks across base models from 184M to 70B parameters, including in combination with 4-bit quantization where the gap to full-precision fine-tuning has historically been hardest to close.[^1] Inclusion in the main branch of PEFT gives the method a low-friction adoption path: users who already deploy LoRA can switch to PiSSA by changing one configuration field, and back-converting trained adapters to LoRA format means the serving stack does not need to change.[^3] In the broader landscape of 2024 PEFT methods, PiSSA is one of a small number whose authors have published pre-decomposed checkpoints, which further lowers the entry cost for downstream users.[^7]
PiSSA also has methodological significance beyond its raw benchmark numbers. By foregrounding the role of initialization in low-rank adapters, it helped catalyze a small wave of follow-up work that reuses the same architectural shape as LoRA but exploits structural information from the pretrained weights or from data activations to choose the initial subspace.[^3] Methods in this broader cluster, not all of which postdate PiSSA, include LoftQ (initialization aware of subsequent quantization), DoRA (decomposition into magnitude and direction), OLoRA (initialization via QR decomposition of the pretrained weight), CorDA (context-aware decomposition of the weight using a downstream-task covariance estimate), and EVA (initialization from the right-singular vectors of layer input activations).[^3][^6][^9] The PEFT documentation organizes these as members of a broader family of "smart" LoRA initializations, of which PiSSA was one of the earliest to be merged.[^3]
The combination of strong empirical numbers, an integration into the dominant open-source PEFT library, pre-decomposed checkpoints on the Hugging Face Hub, and adoption in third-party toolchains such as LLaMA Factory and PaddleNLP appears to have made PiSSA one of the more widely cited 2024 LoRA variants by the start of 2026.[^7][^10] Its choice as a NeurIPS Spotlight, with the venue's own description placing it in the top 2% or so of accepted papers, is the clearest formal indicator of community reception.[^2][^10]