VeRA (Vector-based Random Matrix Adaptation)
Vector-based Random Matrix Adaptation (VeRA) is a parameter-efficient fine-tuning (PEFT) method for adapting large pretrained neural networks to downstream tasks while training only a tiny fraction of the parameters required by LoRA.[^1] Rather than learning a separate pair of low-rank matrices for every adapted layer, VeRA freezes a single pair of randomly initialized projection matrices A and B and shares them across all adapted layers, learning only two small per-layer scaling vectors d and b that selectively re-weight the rows and columns of these frozen projections.[^1] On the GLUE benchmark with RoBERTa-large the method reaches LoRA-parity accuracy with roughly ten times fewer trainable parameters, and on instruction tuning of 7B and 13B LLaMA models it approaches LoRA-quality MT-Bench scores while reducing the trainable-parameter count by approximately a hundred-fold.[^1] VeRA was introduced by Dawid Kopiczko, Tijmen Blankevoort, and Yuki M. Asano in an October 2023 arXiv preprint, accepted as a poster at ICLR 2024, and merged into the HuggingFace PEFT library as a first-class tuner in May 2024.[^1][^2][^3][^4]
Infobox
| Property | Value |
|---|---|
| Full name | Vector-based Random Matrix Adaptation |
| Method type | Parameter-efficient fine-tuning, low-rank reparameterization |
| Authors | Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano |
| Affiliations | QUVA Lab, University of Amsterdam; Qualcomm AI Research |
| First public release | arXiv 2310.11454, 17 October 2023[^1] |
| Final paper version | v2, 16 January 2024[^1] |
| Venue | ICLR 2024 (poster) |
| OpenReview ID | NjNfLdxr3A[^2] |
| Reference implementation | peft.VeraConfig / peft.VeraModel[^3] |
| First PEFT release | v0.11.0, 16 May 2024[^4] |
| Headline result | 10x fewer trainable parameters than LoRA on RoBERTa-large GLUE at parity accuracy[^1] |
Background and Motivation
The growth of Transformer-based foundation models created a tension between adaptation flexibility and storage. Full fine-tuning of a multi-billion-parameter model demands storing one independent checkpoint per task or per user, which is impractical for cloud services that maintain millions of customized adapters. The 2021 LoRA proposal by Hu et al. addressed this cost by reparameterizing each adapted weight matrix W as the sum of the frozen pretrained weight and a low-rank product BA with rank r much smaller than the layer dimensions, training only A and B.[^1] LoRA reduced both memory during training (no optimizer state for frozen weights) and storage at deployment time (adapters often a few megabytes), while introducing no additional inference latency because the trained delta can be merged into W at deployment.
Even so, LoRA's footprint scales unfavorably for cloud-scale personalization. The VeRA authors illustrate with a representative example: applying LoRA with rank 16 to only the query and value projection layers of GPT-3 175B consumes about 288 MB of single-precision storage per adapter, which becomes roughly 275 TB across one million per-user adapters.[^1] At the same time, prior work on intrinsic dimensionality of fine-tuning by Aghajanyan, Gupta, and Zettlemoyer reported that the 90%-of-full-task-performance intrinsic dimension d_90 for RoBERTa-base is only 896, more than two orders of magnitude below the roughly 0.3 million parameters typically used in a LoRA adapter for that model.[^1] This gap suggested that a substantially smaller trainable subspace might suffice.
A second line of evidence motivated VeRA: a body of work showing that randomly initialized, frozen neural network components can be surprisingly competent. The Lottery Ticket Hypothesis of Frankle and Carbin identified high-performing subnetworks inside random initializations; Ramanujan et al. demonstrated that high-accuracy subnetworks exist without any training; Frankle et al. showed that training only batch-normalization parameters atop frozen random convolutional weights yields nontrivial accuracy; and Aghajanyan et al. showed that training a small number of parameters projected back into the full weight space via fixed random matrices can recover roughly 90% of full fine-tuning accuracy.[^1] Combining the intrinsic-dimensionality argument with these random-feature results, the VeRA authors hypothesized that a single pair of frozen random low-rank projections shared across layers, modulated by tiny trainable scaling vectors, could match LoRA at far lower parameter cost.
Author Affiliations and Publication History
VeRA originated as a joint project between the Qualcomm AI Research center in Amsterdam and the QUVA Lab at the University of Amsterdam, a public-private research collaboration that has produced numerous compression and efficient-inference papers.[^1] Dawid J. Kopiczko, the first author, was a PhD student at QUVA Lab; Tijmen Blankevoort was then a director at Qualcomm AI Research focusing on neural network efficiency; Yuki M. Asano was a faculty member at the University of Amsterdam working on self-supervised learning and representation learning.[^1] The paper was submitted to arXiv as preprint 2310.11454 on 17 October 2023, with a revised v2 posted on 16 January 2024 that aligned with the camera-ready submission.[^1] The work was accepted to ICLR 2024 as a poster presentation, with OpenReview thread NjNfLdxr3A; the final OpenReview status was updated on 21 April 2024.[^2]
How It Works
LoRA recap
For a frozen pretrained weight matrix W_0 in R^(m x n), LoRA introduces two trainable matrices A in R^(r x n) and B in R^(m x r), where r is much smaller than min(m, n) (the bottleneck rank).[^1] During the forward pass, a hidden representation x is mapped to
h = W_0 x + BA x.
Only A and B receive gradients; W_0 is held frozen. After training, the product BA can be added to W_0, so deployed models incur no extra latency.[^1] Each adapted layer carries its own independent A and B, so the total trainable parameter count scales as roughly 2 * L_tuned * d_model * r, where L_tuned is the number of adapted layers and d_model is the hidden size.
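The recap above can be illustrated with a minimal NumPy sketch. The toy dimensions and random values are assumptions chosen for illustration, and B is drawn randomly here (rather than zero-initialized as in LoRA) only so that the delta is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 4, 2                 # output dim, input dim, bottleneck rank (toy sizes)

W0 = rng.normal(size=(m, n))      # frozen pretrained weight
A = rng.normal(size=(r, n))       # trainable down-projection
B = rng.normal(size=(m, r))       # trainable up-projection (random here, for illustration)
x = rng.normal(size=n)            # one hidden representation

# Forward pass with the adapter kept separate: h = W_0 x + B A x
h_adapter = W0 @ x + B @ (A @ x)

# After training, the delta is folded into the weight, so inference
# uses a single matrix and incurs no extra latency.
W_merged = W0 + B @ A
h_merged = W_merged @ x

lora_params = A.size + B.size     # r*n + m*r trainable values for this one layer
```

Both forward paths give the same output up to floating-point error, which is why a merged adapter adds no inference cost.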
The VeRA reparameterization
VeRA modifies LoRA in two coordinated ways. First, A in R^(r x n) and B in R^(m x r) are randomly initialized once, then frozen and shared across every adapted layer, so a single A and a single B serve all layers regardless of depth. Second, for each adapted layer the method introduces two trainable scaling vectors, d in R^r and b in R^m, represented as diagonal matrices Lambda_d and Lambda_b. The forward pass becomes
h = W_0 x + Lambda_b B Lambda_d A x,
with only Lambda_b and Lambda_d updated by gradient descent.[^1] Conceptually, Lambda_d scales rows of A and Lambda_b scales rows of the product B Lambda_d A, so the per-layer scaling vectors can effectively switch off or amplify any row or column of the shared frozen projections. Because A and B are static and never need to be stored as trained parameters, they need not be low-rank in the strict sense that LoRA requires; the rank r controls only the dimension of Lambda_d (and hence the size of the per-layer trainable vector for the d coordinate).
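A small NumPy sketch of this forward pass may help. It assumes toy dimensions and random stand-ins for trained vectors, and the helper names (vera_forward, vera_delta) are hypothetical rather than taken from the paper or from PEFT. The diagonal matrices are applied as elementwise products, which is equivalent.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, num_layers = 6, 6, 3, 2          # toy sizes

# One frozen random pair, shared by every adapted layer.
A = rng.normal(size=(r, n))
B = rng.normal(size=(m, r))

# Each layer owns its frozen W_0 plus two small trainable vectors d and b.
layers = [
    {"W0": rng.normal(size=(m, n)),
     "d": rng.normal(size=r),             # stand-in for a trained d
     "b": rng.normal(size=m)}             # stand-in for a trained b
    for _ in range(num_layers)
]

def vera_forward(layer, x):
    # h = W_0 x + Lambda_b B Lambda_d A x, with the diagonals applied elementwise
    return layer["W0"] @ x + layer["b"] * (B @ (layer["d"] * (A @ x)))

def vera_delta(layer):
    # Explicit weight delta, usable for merging into W_0 as with LoRA
    return np.diag(layer["b"]) @ B @ np.diag(layer["d"]) @ A

x = rng.normal(size=n)
trainable = sum(layer["d"].size + layer["b"].size for layer in layers)
```

Only d and b differ between layers, so the trainable count here is num_layers * (m + r), and the explicit delta can be merged into W_0 just as with LoRA.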
The number of trainable parameters in VeRA satisfies
|Theta_VeRA| = L_tuned * (d_model + r),
contrasted with LoRA's
|Theta_LoRA| = 2 * L_tuned * d_model * r.[^1]
For r = 1, VeRA already roughly halves LoRA's trainable parameters; the gap widens linearly with r, because raising the rank in VeRA only adds L_tuned new entries per increment (one for each layer's d vector), while in LoRA it adds 2 * L_tuned * d_model parameters.[^1] For wide and deep models such as GPT-3 with 96 attention blocks and d_model = 12288, this difference becomes dramatic.
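The two formulas can be evaluated directly. The sketch below assumes two adapted projections per block, square d_model x d_model layers, and no classification head; LoRA rank 8 for RoBERTa-large is an assumption chosen because it reproduces the 0.8M figure quoted in this article.

```python
def lora_params(l_tuned: int, d_model: int, r: int) -> int:
    # |Theta_LoRA| = 2 * L_tuned * d_model * r
    return 2 * l_tuned * d_model * r

def vera_params(l_tuned: int, d_model: int, r: int) -> int:
    # |Theta_VeRA| = L_tuned * (d_model + r)
    return l_tuned * (d_model + r)

# RoBERTa-large: 24 blocks x 2 adapted projections = 48 layers, d_model = 1024
roberta_large_lora = lora_params(48, 1024, 8)      # 786,432  (~0.8M)
roberta_large_vera = vera_params(48, 1024, 256)    # 61,440   (~0.061M)

# GPT-3: 96 blocks x 2 adapted projections = 192 layers, d_model = 12288
gpt3_lora_r16 = lora_params(192, 12288, 16)        # 75,497,472 (~75.5M)

# Single-precision storage (4 bytes per value), in binary units
gpt3_lora_mib = gpt3_lora_r16 * 4 / 2**20                    # 288.0 MiB per adapter
gpt3_lora_tib_1m = gpt3_lora_r16 * 4 * 1_000_000 / 2**40     # ~274.66 TiB for one million adapters
```

The byte figures suggest that the 288 MB and 275 TB values quoted earlier are expressed in binary units (MiB and TiB).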
The paper's Table 1 makes the contrast concrete. Adapting only query and key layers of GPT-3 with rank 16 requires about 75.5 million trainable LoRA parameters (288 MB single-precision) versus about 2.8 million trainable VeRA parameters (10.5 MB), and at rank 256 the contrast widens to 1.21 billion LoRA versus 8.7 million VeRA parameters (4.6 GB vs 33 MB).[^1]
Storage and the PRNG seed trick
Because A and B are frozen and produced from a deterministic random number generator, a trained VeRA adapter does not need to store the projection matrices at all: it only needs the per-layer d and b vectors plus a single integer PRNG seed used to regenerate A and B from a fixed initialization rule.[^1] Reload simply re-samples A and B with the saved seed and the chosen initialization scheme. This shrinks the on-disk footprint of an adapter to little more than the size of the d and b vectors. The HuggingFace PEFT implementation exposes this option as save_projection=False on VeraConfig and stores projection_prng_key alongside the learned vectors; the default save_projection=True is more portable across PyTorch versions and devices because it serializes A and B rather than relying on RNG reproducibility.[^3]
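The seed trick can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the PEFT serialization format: make_projections is a hypothetical helper, the uniform(-1, 1) draw stands in for the real initialization rule, and the checkpoint dictionary layout is invented for the example.

```python
import numpy as np

def make_projections(seed: int, m: int, n: int, r: int):
    # Deterministically regenerate the frozen pair from a seed (hypothetical helper).
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(r, n))
    B = rng.uniform(-1.0, 1.0, size=(m, r))
    return A, B

m, n, r, seed = 8, 5, 4, 1234
A, B = make_projections(seed, m, n, r)

# Stand-ins for trained scaling vectors (a separate generator plays the role of training).
train_rng = np.random.default_rng(99)
d = train_rng.normal(size=r)
b = train_rng.normal(size=m)
delta_trained = (b[:, None] * B) @ (d[:, None] * A)    # Lambda_b B Lambda_d A

# The "checkpoint" holds only the vectors and the seed, never A or B.
checkpoint = {"seed": seed, "d": d, "b": b}

# Reload: re-sample A and B from the seed, then rebuild the same delta.
A2, B2 = make_projections(checkpoint["seed"], m, n, r)
delta_reloaded = (checkpoint["b"][:, None] * B2) @ (checkpoint["d"][:, None] * A2)

stored_values = d.size + b.size + 1                    # r + m + 1 numbers on disk
```

The reload is exact only if the generator and initialization rule behave identically at save and load time, which is the portability concern behind keeping save_projection=True.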
Initialization
The paper recommends Kaiming initialization (also known as He initialization) for the shared matrices A and B, in particular the PyTorch implementation of the uniform Kaiming variant, because scaling by matrix dimensions keeps the variance of the product B A invariant to rank choice and removes the need to retune the learning rate when r changes.[^1] The scaling vector b is initialized to zero, mirroring LoRA's zero initialization for B, so that the residual delta vanishes on the first forward pass and the pretrained behavior is preserved at the start of training.[^1] The vector d is initialized to a single non-zero constant across all elements; in the main GLUE experiments this constant is 0.1, which the authors treat as a tunable hyperparameter.[^1] Ablations report that initializing d at 1.0 is markedly worse than at 10^-1 or 10^-7, presumably because smaller initial magnitudes give the optimizer more freedom for early sign changes in the modulated rows of the frozen projections.[^1]
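The effect of these choices at step zero can be checked with a short NumPy sketch. The bound sqrt(6 / fan_in) is what PyTorch's kaiming_uniform_ produces with its default arguments; which dimension is treated as fan_in for each shared matrix is an assumption here, and kaiming_uniform is a hypothetical re-implementation for illustration.

```python
import math
import numpy as np

def kaiming_uniform(shape, fan_in, rng):
    # Uniform in [-bound, bound] with bound = sqrt(6 / fan_in)
    bound = math.sqrt(6.0 / fan_in)
    return rng.uniform(-bound, bound, size=shape)

rng = np.random.default_rng(0)
m, n, r = 16, 12, 8                                 # toy sizes

A = kaiming_uniform((r, n), fan_in=n, rng=rng)      # frozen, shared
B = kaiming_uniform((m, r), fan_in=r, rng=rng)      # frozen, shared

d = np.full(r, 0.1)                                 # d_init = 0.1 for every element
b = np.zeros(m)                                     # b starts at zero

x = rng.normal(size=n)
delta_out = b * (B @ (d * (A @ x)))                 # Lambda_b B Lambda_d A x at initialization
```

Because b is zero, delta_out is exactly zero for any input, so the adapted layer reproduces the pretrained layer before the first update.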
Different layer shapes
A practical complication is that not all adapted linear layers have the same input or output dimension. The HuggingFace PEFT documentation explains the convention adopted in its implementation: A and B are allocated to the largest required size across all adapted layers, and for any specific layer the relevant submatrices are sliced out of the shared buffers during the forward pass.[^3] For instance, when adapting one layer of shape (100, 20) and one of shape (80, 50) at rank r, the buffer A has shape (r, 50) and B has shape (100, r); the (100, 20) layer slices its A from the first 20 columns of the shared A and its B from the first 100 rows of the shared B.[^3]
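The slicing convention can be sketched as follows, using the two layer shapes from the example above. The layer names and the sliced_projections helper are hypothetical, and shapes are read as (out_features, in_features); this illustrates the idea rather than reproducing PEFT's code.

```python
import numpy as np

r = 4
layer_shapes = {"layer_1": (100, 20), "layer_2": (80, 50)}   # (out_features, in_features)

# Allocate the shared buffers at the largest size any adapted layer needs.
max_out = max(out_f for out_f, _ in layer_shapes.values())   # 100
max_in = max(in_f for _, in_f in layer_shapes.values())      # 50

rng = np.random.default_rng(0)
shared_A = rng.normal(size=(r, max_in))     # shape (r, 50)
shared_B = rng.normal(size=(max_out, r))    # shape (100, r)

def sliced_projections(name):
    # Take the leading columns of A and the leading rows of B for this layer.
    out_f, in_f = layer_shapes[name]
    return shared_A[:, :in_f], shared_B[:out_f, :]

A1, B1 = sliced_projections("layer_1")      # shapes (r, 20) and (100, r)
A2, B2 = sliced_projections("layer_2")      # shapes (r, 50) and (80, r)
```

NumPy basic slicing returns views, so no extra copies of the projections are made in this sketch.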
Experimental Results
The original paper evaluates VeRA in four settings: GLUE natural language understanding with RoBERTa, E2E natural language generation with GPT-2, instruction tuning of 7B and 13B LLaMA and Llama 2 models, and image classification with the Vision Transformer (ViT).[^1] Throughout, the method is compared to LoRA at matched or larger rank, plus several adapter and BitFit baselines.
GLUE (RoBERTa)
On the GLUE benchmark, VeRA is applied to the query and value projection matrices of every self-attention block in RoBERTa-base and RoBERTa-large; the classification head is trained in full following the LoRA protocol.[^1] The shared matrices are initialized with Kaiming uniform; d_init = 0.1; VeRA rank is 1024 for RoBERTa-base and 256 for RoBERTa-large.[^1] Because of time and budget constraints, the time-consuming MNLI and QQP tasks are omitted; as a result, unlike in Hu et al.'s setup, no MNLI-pretrained initialization is used for MRPC, RTE, or STS-B.[^1] Each configuration is run five times with different seeds and the median best-epoch result is reported.[^1]
Key results from Table 2 of the paper.[^1]
| Model | Method | Trainable params | SST-2 | MRPC | CoLA | QNLI | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| RoBERTa-base | Full FT | 125M | 94.8 | 90.2 | 63.6 | 92.8 | 78.7 | 91.2 | 85.2 |
| RoBERTa-base | BitFit | 0.1M | 93.7 | 92.7 | 62.0 | 91.8 | 81.5 | 90.8 | 85.4 |
| RoBERTa-base | LoRA | 0.3M | 95.1 | 89.7 | 63.4 | 93.3 | 86.6 | 91.5 | 86.6 |
| RoBERTa-base | VeRA | 0.043M | 94.6 | 89.5 | 65.6 | 91.8 | 78.7 | 90.7 | 85.2 |
| RoBERTa-large | LoRA | 0.8M | 96.2 | 90.2 | 68.2 | 94.8 | 85.2 | 92.3 | 87.8 |
| RoBERTa-large | VeRA | 0.061M | 96.1 | 90.9 | 68.0 | 94.4 | 85.9 | 91.7 | 87.8 |
On RoBERTa-large VeRA matches LoRA's average score of 87.8 with about 7.6% as many trainable parameters (0.061M versus 0.8M, a roughly thirteen-fold reduction), and slightly outperforms LoRA on MRPC and RTE.[^1] On RoBERTa-base, the average score (85.2) trails LoRA (86.6) by 1.4 points but matches full fine-tuning, again at roughly 14% of LoRA's trainable parameter count.[^1] The authors summarize this regime as parity at "an order of magnitude fewer parameters."[^1]
E2E (GPT-2)
The E2E dataset by Novikova et al. is a constrained natural language generation benchmark mapping restaurant-domain meaning representations to surface utterances. Following Hu et al., VeRA is applied to GPT-2 Medium and Large.[^1] Selected results from Table 3 of the paper:
| Model | Method | Trainable params | BLEU | NIST | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| GPT-2 Medium | Full FT | 354.92M | 68.2 | 8.62 | 46.2 | 71.0 | 2.47 |
| GPT-2 Medium | LoRA | 0.35M | 68.9 | 8.69 | 46.4 | 71.3 | 2.51 |
| GPT-2 Medium | VeRA | 0.098M | 70.1 | 8.81 | 46.6 | 71.5 | 2.50 |
| GPT-2 Large | LoRA | 0.77M | 70.1 | 8.80 | 46.7 | 71.9 | 2.52 |
| GPT-2 Large | VeRA | 0.17M | 70.3 | 8.85 | 46.9 | 71.6 | 2.54 |
On GPT-2 Medium VeRA exceeds LoRA on BLEU, NIST, METEOR, and ROUGE-L with about 28% of LoRA's parameters, trailing only on CIDEr (2.50 vs 2.51); on GPT-2 Large VeRA exceeds LoRA on BLEU, NIST, METEOR, and CIDEr but trails on ROUGE-L (71.6 vs 71.9), with roughly a fourfold parameter reduction.[^1] The paper headlines this regime as "VeRA outperforms LoRA with 3 and 4 times less trainable parameters, for GPT2 Medium and Large respectively."[^1]
Instruction tuning (LLaMA and Llama 2)
Following the QLoRA setup of Dettmers et al., both LoRA and VeRA are applied to every linear layer except the top one of LLaMA 7B and 13B and Llama 2 7B and 13B, quantization is enabled for single-GPU training, and the cleaned Alpaca dataset (51K instruction-response pairs) is used for one epoch following a learning rate sweep.[^1] LoRA uses rank 64 (159.9M trainable parameters for 7B, 250.3M for 13B); VeRA uses rank 1024 (1.6M for 7B, 2.4M for 13B).[^1] Models are evaluated on MT-Bench, a multi-turn benchmark with 80 questions judged by GPT-4 on a 10-point scale.[^1]
Results from Table 4 of the paper:
| Model | Method | Trainable params | MT-Bench score |
|---|---|---|---|
| LLaMA 13B (base) | - | - | 2.61 |
| LLaMA 7B | LoRA | 159.9M | 5.03 |
| LLaMA 7B | VeRA | 1.6M | 4.77 |
| LLaMA 13B | LoRA | 250.3M | 5.31 |
| LLaMA 13B | VeRA | 2.4M | 5.22 |
| Llama 2 7B | LoRA | 159.9M | 5.19 |
| Llama 2 7B | VeRA | 1.6M | 5.08 |
| Llama 2 13B | LoRA | 250.3M | 5.77 |
| Llama 2 13B | VeRA | 2.4M | 5.93 |
Despite an approximately hundred-fold reduction in trainable parameters, VeRA stays within 0.26 points of LoRA across the four models and beats LoRA on Llama 2 13B by 0.16 points.[^1] On Vicuna-Eval, a precursor to MT-Bench reported in Appendix F, VeRA scored 7.48 versus LoRA's 7.36 on instruction-tuned LLaMA 7B trained on 10K Alpaca examples, with an adapter storage requirement of 5.4 MB rather than 610 MB for the LoRA counterpart.[^1]
Image classification (ViT)
The method was also evaluated on image classification by adapting ViT Base and Large variants pretrained on ImageNet-21k to four downstream datasets: CIFAR-100, Food-101, Flowers-102, and RESISC45, training on only 10 samples per class for 10 epochs.[^1] LoRA uses rank 8, VeRA uses rank 256, both applied to the query and value layers.[^1] Selected results from Table 5:
| Model | Method | Trainable params | CIFAR100 | Food101 | Flowers102 | RESISC45 |
|---|---|---|---|---|---|---|
| ViT-B | Head only | - | 77.7 | 86.1 | 98.4 | 67.2 |
| ViT-B | Full FT | 85.8M | 86.5 | 90.8 | 98.9 | 78.9 |
| ViT-B | LoRA | 294.9K | 85.9 | 89.9 | 98.8 | 77.7 |
| ViT-B | VeRA | 24.6K | 84.8 | 89.0 | 99.0 | 77.0 |
| ViT-L | Head only | - | 79.4 | 76.5 | 98.9 | 67.8 |
| ViT-L | Full FT | 303.3M | 86.8 | 78.7 | 98.8 | 79.0 |
| ViT-L | LoRA | 786.4K | 87.0 | 79.5 | 99.1 | 78.3 |
| ViT-L | VeRA | 61.4K | 87.5 | 79.2 | 99.2 | 78.6 |
For ViT-Base, VeRA stays within roughly one accuracy point of LoRA on three datasets and slightly beats LoRA on Flowers102, at roughly 8% of LoRA's parameter count.[^1] For ViT-Large, VeRA beats LoRA on CIFAR100, Flowers102, and RESISC45, again at about 8% of LoRA's parameter count.[^1]
Scaling rank and parameter count
To probe the rank-versus-accuracy tradeoff, the authors sweep rank r in {1, 4, 16, 64, 256, 1024} for VeRA and {1, 2, 4, 8, 16, 32, 64} for LoRA on the GLUE RTE task using RoBERTa-large.[^1] In the parameter-matched regime, VeRA at higher rank beats LoRA by approximately four accuracy points on RTE.[^1] Figure 2 of the paper also shows VeRA operating at trainable-parameter budgets below LoRA's smallest configuration, a useful low-parameter regime that LoRA cannot reach.
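The parameter budgets behind this sweep follow from the two counting formulas. The sketch below assumes the same setup as the GLUE experiments (query and value projections in all 24 blocks of RoBERTa-large, so L_tuned = 48 and d_model = 1024) and ignores the classification head; whether the sweep used exactly this configuration is an assumption.

```python
L_TUNED, D_MODEL = 48, 1024    # RoBERTa-large, query + value in 24 blocks (assumed)

VERA_RANKS = (1, 4, 16, 64, 256, 1024)
LORA_RANKS = (1, 2, 4, 8, 16, 32, 64)

vera_counts = {r: L_TUNED * (D_MODEL + r) for r in VERA_RANKS}
lora_counts = {r: 2 * L_TUNED * D_MODEL * r for r in LORA_RANKS}

smallest_lora = min(lora_counts.values())    # LoRA at rank 1: 98,304
largest_vera = max(vera_counts.values())     # VeRA at rank 1024: 98,304

for r in VERA_RANKS:
    print(f"VeRA r={r:<5d} {vera_counts[r]:>10,d}")
for r in LORA_RANKS:
    print(f"LoRA r={r:<5d} {lora_counts[r]:>10,d}")
```

Under these assumptions VeRA at rank 1024 uses exactly as many trainable parameters as LoRA at rank 1, and every lower VeRA rank sits below LoRA's smallest budget.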
Ablations
The paper isolates several design choices.
- Single scaling vector. Removing either d or b and keeping the other degrades accuracy, with the only d configuration outperforming only b. On RTE both reduced variants fall far short of the full method and are highly unstable: only d scores 67.0 +/- 13.9 and only b scores 64.3 +/- 11.5, while the full VeRA reaches 85.8 +/- 0.7.[^1] The authors attribute the asymmetry between the two to d's effect on rows of both shared matrices, while b only scales rows of their product.
- Initialization of shared matrices. Kaiming uniform and Kaiming normal both significantly outperform uniform sampling from [0, 0.1]; for example on RTE, Kaiming uniform reaches 85.8, Kaiming normal 82.6, and uniform [0, 0.1] only 53.1.[^1]
- Initialization of d. Values of 10^-1 and 10^-7 outperform 1.0; on MRPC, d_init = 10^-7 narrowly leads with 90.8 versus 90.5 for 10^-1 and 70.3 for 1.0.[^1]
- Shared versus unique frozen matrices. Sharing one A and B across all layers gives essentially identical accuracy to sampling unique random matrices per layer on RTE (84.6 vs 84.6), STS-B (91.5 vs 91.5), and within roughly half a point on MRPC and CoLA.[^1] Because shared matrices are far cheaper to store, the shared configuration is the default.
- Adaptation magnitude. The largest learned changes in d concentrate in the query projections rather than the value projections, and in later layers more than earlier layers, consistent with patterns observed in AdaLoRA and P-tuning v2.[^1]
Training-time and memory cost
In Appendix C, the paper benchmarks LoRA versus VeRA at identical rank (64) fine-tuning LLaMA 7B on the same dataset and reports VeRA at 578 minutes versus LoRA at 568 minutes (a 1.8% increase in wall-clock training time) and VeRA at 21.69 GB versus LoRA at 23.42 GB of GPU memory (a 7.4% reduction).[^1] The extra multiplications introduced by Lambda_b and Lambda_d in the forward pass produce a modest compute overhead, while the omission of optimizer state and gradients for the shared matrices reduces memory.[^1]
Cosine similarity to LoRA
Appendix D compares LoRA's learned delta with VeRA's learned delta layer by layer through cosine similarity at rank 64.[^1] The average pairwise similarity is approximately 2e-3 between LoRA and VeRA, versus -8e-5 between LoRA and a random projection.[^1] The similarity is largest in later layers, again echoing the magnitude pattern.
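A cosine similarity between two weight deltas can be computed by flattening each matrix into a vector. The sketch below uses random stand-in matrices and a hypothetical helper name; the exact procedure used in the paper's appendix is not specified here, so treat this as one plausible reading.

```python
import numpy as np

def cosine_similarity(delta_a: np.ndarray, delta_b: np.ndarray) -> float:
    # Flatten both deltas and take the normalized dot product.
    a, b = delta_a.ravel(), delta_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
delta_1 = rng.normal(size=(256, 256))    # stand-in for one method's learned delta
delta_2 = rng.normal(size=(256, 256))    # stand-in for an unrelated delta

same_direction = cosine_similarity(delta_1, 2.0 * delta_1)    # scale does not matter
opposite = cosine_similarity(delta_1, -delta_1)
unrelated = cosine_similarity(delta_1, delta_2)               # close to zero
```

For independent random matrices of this size the similarity is typically on the order of 1e-3 in magnitude, which gives a sense of scale for the values reported above.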
Reference Implementation in HuggingFace PEFT
VeRA was added to the open-source HuggingFace PEFT library as a first-class tuner and announced in the v0.11.0 release on 16 May 2024.[^4] In the same release, two other PEFT methods were added: BOFT (Butterfly Orthogonal Fine-Tuning) and PiSSA (an SVD-based LoRA initialization), and HQQ and EETQ quantization were extended to LoRA.[^4]
VeraConfig exposes the central knobs of the method.[^3]
- r (int, default 256): the VeRA "rank" controlling the dimension of the d vector. The PEFT documentation recommends choosing higher values than LoRA ranks because VeRA's parameter count is dominated by per-layer scalars rather than per-layer matrices.[^3]
- target_modules: which modules to wrap. Only nn.Linear layers are supported in PEFT's VeRA implementation as of v0.19.0.[^3]
- projection_prng_key (int, default 0): the PyTorch PRNG seed used to regenerate vera_A and vera_B when loading a checkpoint without serialized projections.[^3]
- save_projection (bool, default True): whether to save the random projection matrices in the checkpoint state dict alongside the per-layer scaling vectors. Setting it to False relies on projection_prng_key for reload and shrinks the checkpoint, at the cost of reproducibility guarantees across PyTorch versions and devices.[^3]
- vera_dropout (float): dropout probability applied to the VeRA path during training.[^3]
- d_initial (float, default 0.1): initial value for vera_lambda_d, matching the d_init = 0.1 used in the GLUE experiments of the paper.[^1][^3]
- init_weights, fan_in_fan_out, bias, and standard PEFT hooks for layer enumeration and dispatch.[^3]
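A configuration using the options listed above might look like the following sketch. It requires the peft package; the target module names "query" and "value" are an assumption that fits RoBERTa-style attention blocks and would differ for other architectures, and the dropout value is illustrative.

```python
from peft import VeraConfig

# Values mirror the documented defaults where one is listed above.
config = VeraConfig(
    r=256,                              # dimension of each layer's d vector
    target_modules=["query", "value"],  # assumed module names; must resolve to nn.Linear layers
    projection_prng_key=0,              # seed used to regenerate vera_A / vera_B
    save_projection=True,               # serialize A and B for portability
    vera_dropout=0.0,                   # illustrative value
    d_initial=0.1,                      # initial value of vera_lambda_d
)
```

The resulting object is passed to PEFT's get_peft_model together with a base model in the same way as other PEFT configurations.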
To handle the heterogeneous-shape case described above, the implementation allocates A and B at the maximum required size across all targeted layers and slices submatrices per layer on the forward pass.[^3] The PEFT codebase tracks several known limitations: VeRA does not interoperate cleanly with quantized base models (see GitHub issues #1872 and #1921) because the shared projection logic conflicts with bitsandbytes layer wrappers that LoRA supports natively, and multi-GPU support required additional fixes (PR #2163).[^3]
Position in the PEFT Method Landscape
VeRA belongs to the broader family of LoRA-style reparameterization methods (sometimes called LoRA variants) that have proliferated since 2022. A simplified taxonomy:
- Per-layer low-rank adapters: LoRA itself, AdaLoRA, and QLoRA keep the per-layer A_i and B_i pattern, with AdaLoRA dynamically reallocating ranks across layers and QLoRA combining LoRA with 4-bit quantization of the base model.[^1]
- Decomposition-style adapters: DoRA decomposes the weight update into a magnitude vector and a direction obtained from a LoRA-style low-rank product, intending to better match full fine-tuning dynamics. DoRA was added to PEFT in v0.9.0 (February 2024).[^4]
- Gradient-projection methods: GaLore is not an adapter in the strict sense; it projects gradients into a low-rank subspace that is updated periodically, rather than reparameterizing the weights.
- Shared / random-projection adapters: VeRA is the canonical instance, with later work such as VB-LoRA (NeurIPS 2024) and RandLoRA pushing the idea further by sharing vector banks or learnable mixtures over random projections. PEFT issue trackers note that RandLoRA, added in PR #2464, is, "similarly to VeRA," based on non-learnable random low-rank matrices combined through learnable matrices.
A second class of PEFT methods does not reparameterize weight matrices at all; these methods instead insert new modules or vectors, or train only a small existing subset of parameters:
- Adapter modules (Houlsby et al. 2019, Lin et al. 2020, Pfeiffer et al. 2021, Rücklé et al. 2021): two small fully-connected layers inserted between transformer sublayers. Adapters add inference latency that LoRA, DoRA, and VeRA do not.[^1]
- Prefix Tuning (Li and Liang 2021): a learned sequence of "virtual tokens" prepended to every layer's keys and values.
- Prompt Tuning and P-Tuning v2 (Lester et al. 2021, Liu et al. 2021): learned soft prompts at the input embedding layer or every layer.
- IA3 (Liu et al. 2022): scales the keys, values, and intermediate activations of attention and feed-forward layers by trained 1D vectors, conceptually a vector-only modulation that VeRA generalizes by combining vector modulation with a shared random projection.
- BitFit (Zaken et al. 2022): trains only the bias parameters.[^1]
Comparison summary
The following table summarizes per-adapter trainable-parameter counts and reported headline results for VeRA against a few comparable methods, drawn from numbers explicitly reported in the VeRA paper.[^1]
| Method | Trainable params (RoBERTa-large GLUE) | Inference latency overhead | Mechanism |
|---|---|---|---|
| Full FT | 355M | none | all weights |
| BitFit | 0.1M (RoBERTa-base) | none | biases only |
| Adapter^H | 6M | yes (extra layers) | inserted MLPs |
| Adapter^L | 0.8M | yes | inserted MLPs |
| LoRA | 0.8M | none (merge) | per-layer low-rank |
| LoRA-FA | 3.7M | none | freezes A, learns B |
| DoRA | comparable to LoRA | none | magnitude + direction |
| VeRA | 0.061M | none | shared random A, B + per-layer d, b |
Across the methods compared, VeRA achieves the smallest trainable-parameter count among approaches that add zero inference latency and stay competitive on the headline GLUE average.[^1] It is, however, complementary rather than strictly superior to LoRA-style methods: per the paper's discussion and ablations, the dramatic parameter savings depend on choosing a much larger rank r than LoRA would typically use, and on hyperparameter retuning relative to LoRA defaults.[^1]
Significance and Applications
The chief consequence of VeRA is that the per-adapter storage cost grows with the number of adapted layers times the sum of hidden size and rank, rather than times their product as in LoRA. The authors emphasize three concrete settings where this matters.[^1]
Per-user personalization at cloud scale. Cloud-hosted assistants that maintain a customized adapter per end user can fit hundreds of times more adapters in a fixed memory budget. The Appendix F instruction-tuning example illustrates the magnitude: VeRA's 5.4 MB per-adapter footprint versus LoRA's 610 MB at comparable Vicuna-Eval scores translates into roughly 110 times more concurrently-served users per GPU.[^1]
Edge and on-device inference. On-device deployment scenarios, including Edge AI and mobile inference, place hard limits on storage for fine-tuned models. Shrinking the adapter to a few megabytes while maintaining performance opens room for application-specific or even per-installation tuning on consumer hardware. The Qualcomm AI Research affiliation of one of the authors reflects this practical motivation; the company manufactures inference accelerators where adapter footprint is a real constraint.
Frequent task switching. Inference systems that switch between many fine-tuned models in flight (multi-tenant serving, agent routing, MoE-style routed expertise) benefit from VeRA's smaller per-adapter state because more adapters can be kept in fast memory simultaneously.
VeRA also has a more diffuse research effect: by demonstrating that encoder-scale fine-tuning can be done with adapters of tens to hundreds of kilobytes rather than megabytes, it strengthens the empirical claim that the effective intrinsic dimensionality of task adaptation is much lower than LoRA's parameter count and motivates further work on extreme PEFT methods such as VB-LoRA and LoRA-XS.
Limitations and Criticisms
Several caveats apply to VeRA in practice.
Higher rank requires retuning. Because the parameter cost of raising r in VeRA is so small (one extra entry per layer), the recommended rank values are much larger than for LoRA. PEFT's default r=256 versus LoRA's typical 8 or 16 reflects this.[^3] Users porting LoRA recipes to VeRA must reset rank, learning rate, and d_init, often through a sweep.
No quantization interop yet. As of the most recent PEFT releases, VeRA does not support quantized base models in the same first-class way LoRA does. GitHub issues #1872 and #1921 document failures when combining VeRA with bitsandbytes 4-bit and 8-bit linear layers, whereas QLoRA (LoRA on a 4-bit quantized base model) is the canonical efficient-fine-tuning recipe for 7B-and-up models.[^3] Within PEFT, this restricts VeRA to settings where the base model can be kept in higher precision or to architectures small enough that quantization is not strictly required. The limitation concerns the PEFT implementation: the paper's own instruction-tuning experiments trained VeRA with quantization enabled.[^1]
Linear layers only. The PEFT implementation accepts only nn.Linear targets, ruling out direct adaptation of nn.Embedding, nn.Conv2d, or nn.LayerNorm parameters that some LoRA variants and DoRA support.[^3]
Hyperparameter sensitivity in low-data regimes. Although the GLUE average matches LoRA on RoBERTa-large, the only b and only d ablations show that VeRA's variance can blow up dramatically when key components are removed, with standard deviations above 10 accuracy points on RTE.[^1] In low-resource scenarios where running a full hyperparameter sweep is expensive, the upside of using a tiny VeRA adapter has to be weighed against the risk of unstable convergence.
Modest GLUE-base regression. On the smaller RoBERTa-base model, VeRA underperforms LoRA on average (85.2 vs 86.6), and the paper does not claim parity at the base scale.[^1] The "matches LoRA" headline strictly applies to the larger model where intrinsic-dimension arguments are most favorable.
Limited architectural coverage at publication. The original paper studies only Transformer encoders (RoBERTa), decoders (GPT-2, LLaMA), and Vision Transformers. Generalization to convolutional networks, diffusion models, speech models, or sentence encoders was left to future work.[^1]
Reception and Subsequent Work
VeRA was accepted as a poster at ICLR 2024 (held in May 2024) and has been cited as a parameter-efficiency benchmark by later PEFT papers and surveys.[^2] Notable follow-up work that explicitly extends or compares to VeRA:
- VB-LoRA (NeurIPS 2024) by Yang Li et al. generalizes the shared-projection idea further by storing a vector bank from which per-layer vectors are sampled.
- LoRA-XS (Bałazy et al. 2024) reduces parameter count below LoRA in a different way, by training only a small r x r matrix placed between frozen low-rank matrices derived from the SVD of the pretrained weights.
- RandLoRA (Albert et al. 2025, later contributed to PEFT) takes a closely related approach: non-learnable random low-rank matrices combined through learnable mixing matrices, an idea cited in the PEFT changelog as "similarly to VeRA."
- Several PEFT surveys (for example, the 2024 PEFT taxonomy by Mao et al. and the 2025 survey by Han et al.) cite VeRA in their LoRA-variant taxonomies.
The arXiv paper page on HuggingFace had attracted multiple community implementations and analyses prior to the official PEFT integration; the LinkedIn post by Kopiczko announcing the work garnered widespread engagement in the PEFT research community.
VeRA's nearest predecessor is LoRA itself, which established the approach of decomposing a fine-tuning delta into a per-layer low-rank product BA. QLoRA combines that decomposition with 4-bit base-model quantization, and VeRA's instruction-tuning experiments inherit QLoRA's quantized-training recipe almost verbatim, only swapping the LoRA layers for VeRA layers.[^1] DoRA, accepted at ICML 2024 and added to PEFT two minor releases before VeRA (v0.9.0 versus v0.11.0), decomposes a LoRA update into magnitude and direction components and outperforms LoRA in some low-rank settings; it represents an orthogonal axis of variation to VeRA, focusing on what is parameterized rather than how parameters are shared.[^4] GaLore, by Zhao et al. and accepted at ICML 2024, applies a low-rank projection to gradients during full-parameter optimization rather than to the weights themselves, addressing a related but distinct objective of training-time memory reduction. Low-rank adaptation is the umbrella term for this whole family of methods. Instruction tuning, in turn, is the primary downstream regime in which VeRA's parameter savings translate to deployable per-user adapters.
See also
References