DoRA (Weight-Decomposed Low-Rank Adaptation)

Updated Jul 11, 2026

DoRA (Weight-Decomposed Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) method for large neural networks introduced in February 2024 by researchers from NVIDIA, the Hong Kong University of Science and Technology, and National Taiwan University.^[1] The method decomposes each pre-trained weight matrix into a column-wise magnitude vector and a directional matrix, then applies the standard LoRA low-rank update only to the directional component while training the magnitude vector with ordinary gradient descent.^[1]^[2] As the paper states, "DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters."^[1] In benchmarks reported by the original authors, DoRA consistently outperforms vanilla LoRA on commonsense reasoning, multi-turn instruction following, and image and video-text understanding tasks across LLaMA, LLaMA 2, LLaMA 3, LLaVA, and VL-BART, often closing or exceeding the gap to full fine-tuning at comparable parameter budgets.^[1]^[3] The method was presented as an Oral paper at the 41st International Conference on Machine Learning (ICML 2024) and is implemented in the Hugging Face PEFT library, Axolotl, and other downstream training frameworks.^[4]^[5]^[6] This article describes the parameter-efficient fine-tuning method and is unrelated to the Noetix Robotics Dora humanoid robot.

What is DoRA?

DoRA is a drop-in upgrade to LoRA, the most widely used parameter-efficient fine-tuning technique. Where LoRA learns a single low-rank update added to the frozen pre-trained weights, DoRA first splits each weight matrix into a magnitude (how large each column is) and a direction (where each column points), then learns those two quantities separately: the magnitude through a small free parameter vector and the direction through a LoRA-style low-rank update.^[1] The motivation is empirical: the authors' weight-decomposition analysis found that full fine-tuning and LoRA change weights in structurally different ways, and that decoupling magnitude from direction lets a low-cost adapter behave much more like full fine-tuning.^[1]^[3] Because the two components can be recombined into a single matrix after training, DoRA preserves LoRA's defining property of zero inference overhead.^[1]^[3]

Origins and authorship

The DoRA paper lists seven authors split between NVIDIA Research and academic collaborators: Shih-Yang Liu (jointly affiliated with NVIDIA Research and the Hong Kong University of Science and Technology and listed as first author), Chien-Yi Wang and Hongxu Yin of NVIDIA Research, Pavlo Molchanov leading the NVIDIA Research efficient deep learning group, Yu-Chiang Frank Wang of NVIDIA Research Taiwan and National Taiwan University, Kwang-Ting Cheng of HKUST, and Min-Hung Chen of NVIDIA Research Taiwan (corresponding author).^[1]^[2] The work was performed primarily at NVIDIA's Taiwan research group and submitted to arXiv on February 14, 2024 as version 1 of arXiv:2402.09353; the paper went through revisions culminating in version 6 on July 9, 2024 to incorporate camera-ready changes and additional LLaMA 3 experiments.^[1]

How was DoRA received at ICML 2024?

DoRA was accepted to ICML 2024 with an Oral presentation slot, a tier that, according to the conference's program committee, covered roughly 1.5% of submissions that year.^[6]^[2] The Oral designation gave the method substantial visibility during the spring and summer of 2024, including coverage in the NVIDIA Developer Technical Blog (June 28, 2024) and a wave of LoRA-variant follow-up papers that adopted DoRA either as a baseline to beat or as a component in larger PEFT systems.^[3]

What problem does DoRA solve?

Parameter-efficient fine-tuning methods adapt large pre-trained models to downstream tasks by training only a small fraction of the weights, dramatically reducing memory and compute requirements relative to full fine-tuning.^[7] LoRA, introduced by Hu et al. in 2021, freezes the pre-trained weights and trains a low-rank update of the form $W' = W_0 + BA$, where B and A are low-rank matrices of inner dimension r much smaller than the matrix dimensions.^[7] Because the low-rank update can be merged into the base weights after training, LoRA adds no inference latency relative to the original model.^[7]

Despite its popularity, LoRA frequently leaves a residual accuracy gap relative to full fine-tuning, particularly on harder tasks and at lower ranks.^[1]^[3] The DoRA paper frames this directly: "there still often exists an accuracy gap between these methods and full fine-tuning (FT)."^[1] Practitioners and researchers have proposed numerous LoRA variants attempting to close this gap, including approaches that rescale the low-rank factors, allocate ranks per layer, or combine LoRA with quantization (as in QLoRA).^[7] DoRA approaches the problem from a different angle: rather than tweaking the form of the low-rank update, it asks whether LoRA and full fine-tuning differ in the kinds of weight changes they produce, and uses that analysis to motivate a new parameterization.^[1]

Weight decomposition and weight normalization

The decomposition that motivates DoRA has a long history in neural network optimization. Salimans and Kingma's 2016 weight normalization technique reparameterizes each weight vector w as $w = g \frac{v}{\lVert v \rVert}$, where g is a scalar gain and v is a learned direction.^[8] DoRA generalizes this idea to entire weight matrices used in attention and feed-forward layers of transformer models: each pre-trained matrix W is written as the product of a learnable magnitude vector m (one scalar per column) and a unit-normalized directional matrix $V / \lVert V \rVert_c$, where $\lVert V \rVert_c$ denotes the vector of column-wise Euclidean norms.^[1]

By construction, this decomposition is exact for any pre-trained W (set m to $\lVert W \rVert_c$ and V to W).^[1] The key choice DoRA makes is how to update m and V during fine-tuning: m is trained as a small free parameter vector, while V is updated through a low-rank LoRA-style increment, keeping the total trainable parameter count close to LoRA.^[1]
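The exactness of the decomposition can be checked numerically. The following is a minimal sketch in plain Python (lists of rows, no external libraries); the helper names `column_norms`, `decompose`, and `recompose` and the toy matrix values are illustrative assumptions, not taken from the paper or any library.

```python
import math

def column_norms(W):
    """Euclidean norm of every column of W, where W is a list of rows."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*W)]

def decompose(W):
    """Split W into a magnitude vector m and a direction matrix with unit columns.

    Assumes no column of W is all zeros (otherwise the direction is undefined).
    """
    m = column_norms(W)
    direction = [[row[j] / m[j] for j in range(len(m))] for row in W]
    return m, direction

def recompose(m, direction):
    """Scale each unit column by its magnitude to rebuild the matrix."""
    return [[m[j] * row[j] for j in range(len(m))] for row in direction]

# Toy 2x3 stand-in for a pre-trained weight matrix (hypothetical values).
W = [[3.0, 0.0, 1.0],
     [4.0, 2.0, 2.0]]

m, direction = decompose(W)
W_rebuilt = recompose(m, direction)

print("magnitudes:", m)
print("rebuilt:", W_rebuilt)
```

Every column of `direction` has unit norm, and `W_rebuilt` agrees with `W` up to floating-point rounding, which is the sense in which the decomposition is exact.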

How does the weight-decomposition analysis motivate DoRA?

Section 3 of the DoRA paper introduces a weight-decomposition analysis tool that examines how weight magnitude and direction change during training under different fine-tuning regimes.^[1] For each weight matrix, the authors compute the change in magnitude ($\Delta M$) and the change in direction ($\Delta D$) at each training step relative to the pre-trained initialization, then plot the two quantities against each other for individual layers.^[1] The change in direction is measured as one minus the cosine similarity between the corresponding column vectors of the fine-tuned and pre-trained matrices, while the change in magnitude is measured as the absolute difference between the column-wise norms.^[1]
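The two diagnostics described above can be sketched for a single weight matrix as follows. This is an illustrative plain-Python version, not the authors' analysis code: averaging the per-column values into one number per matrix is an assumption of this sketch, and the function name `magnitude_direction_change` and the toy matrices are hypothetical.

```python
import math

def columns(W):
    """Return the columns of W (a list of rows) as lists."""
    return [list(col) for col in zip(*W)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def magnitude_direction_change(W0, W_ft):
    """Return (delta_m, delta_d) between pre-trained W0 and fine-tuned W_ft.

    delta_m: mean absolute difference of column norms.
    delta_d: mean of (1 - cosine similarity) between matching columns.
    """
    cols0, cols_ft = columns(W0), columns(W_ft)
    k = len(cols0)
    delta_m = sum(abs(norm(c) - norm(c0)) for c0, c in zip(cols0, cols_ft)) / k
    delta_d = sum(1.0 - cosine(c0, c) for c0, c in zip(cols0, cols_ft)) / k
    return delta_m, delta_d

W0 = [[1.0, 0.0],
      [0.0, 1.0]]
scaled = [[2.0, 0.0],
          [0.0, 2.0]]    # columns doubled in length, directions unchanged
rotated = [[0.0, -1.0],
           [1.0, 0.0]]   # columns rotated by 90 degrees, lengths unchanged

print(magnitude_direction_change(W0, scaled))
print(magnitude_direction_change(W0, rotated))
```

A pure rescaling moves only the magnitude measure and a pure rotation moves only the direction measure, which is what allows the two quantities to be plotted against each other.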

Two patterns emerge: full fine-tuning produces a roughly negative correlation between magnitude change and direction change, meaning that layers that shift their directions tend to keep similar magnitudes (and vice versa).^[1]^[3] LoRA, by contrast, exhibits a positive correlation: magnitude and direction tend to move together, suggesting that the low-rank parameterization is unable to disentangle the two kinds of update that full fine-tuning naturally performs.^[1] The authors interpret this as evidence that LoRA's learning capacity is restricted not just by rank but by the implicit coupling between magnitude and direction in the low-rank update form.^[1]

This observation reframes the LoRA-versus-full-fine-tuning gap as not purely a question of how many parameters are trainable, but of which directions in weight space the parameterization can independently access.^[1]^[3] If full fine-tuning's success depends partly on being able to scale a direction without rotating it (or vice versa), then any update parameterization that ties scaling to rotation will lose accuracy, even at parameter budgets that might otherwise be sufficient.^[1]

DoRA is designed to break this coupling explicitly: by giving the magnitude its own learned vector and confining the low-rank update to the directional part, the method recovers a magnitude-direction update pattern much closer to full fine-tuning's, while keeping the parameter count near LoRA's.^[1]^[3] According to NVIDIA, "DoRA demonstrates the ability to make only substantial directional adjustments with relatively minimal changes in magnitude or the reverse, while showing learning patterns closer to FT. This signifies its superior learning capacity over LoRA."^[3] In the analysis figures reported by the authors, DoRA's magnitude/direction correlation closely tracks that of full fine-tuning rather than that of LoRA.^[1]^[3] This visual diagnostic is one of the paper's most-cited contributions independently of the DoRA method itself, since it provides a low-cost tool for inspecting any new PEFT parameterization against full fine-tuning as a reference.^[3]

How does DoRA work?

Formulation

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d_{out} \times d_{in}}$, DoRA writes the fine-tuned weight as:

$$W' = m \, \frac{V + \Delta V}{\lVert V + \Delta V \rVert_c}$$

where:

m is a learnable vector in $\mathbb{R}^{1 \times d_{in}}$ (one scalar per input feature column), initialized to the column-wise Euclidean norms of $W_0$,
V is a frozen matrix initialized to $W_0$,
$\Delta V = BA$ is a low-rank update with $B \in \mathbb{R}^{d_{out} \times r}$, $A \in \mathbb{R}^{r \times d_{in}}$, and rank $r$ typically chosen in $\{4, 8, 16, 32\}$,
$\lVert \cdot \rVert_c$ denotes the column-wise $\ell_2$ norm, and the division and multiplication are broadcast column-wise.^[1]

At initialization ($\Delta V = 0$ and m equal to the original column norms), the decomposition reproduces $W_0$ exactly, so the network's output is unchanged.^[1] During training, gradients flow through both m and the low-rank factors A and B, while V and the rest of the base model remain frozen.^[1] The total number of trainable parameters per adapted matrix is $r(d_{in} + d_{out}) + d_{in}$, which is dominated by the LoRA term and only marginally larger than vanilla LoRA's $r(d_{in} + d_{out})$.^[1]
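The formulation and the parameter count can be illustrated with a small sketch. It uses plain Python lists and toy values; the function names are hypothetical, and zero-initializing B (so that the update starts at zero) follows common LoRA practice and is an assumption here rather than something stated above. The 4096 x 4096 example corresponds to a square projection at LLaMA-7B's hidden size and is chosen only for illustration.

```python
import math

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def column_norms(W):
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*W)]

def dora_weight(W0, m, B, A):
    """W' = m * (W0 + B A) / ||W0 + B A||_c, broadcast over columns."""
    BA = matmul(B, A)
    V = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, BA)]
    norms = column_norms(V)
    return [[m[j] * row[j] / norms[j] for j in range(len(m))] for row in V]

def lora_trainable_params(d_in, d_out, r):
    return r * (d_in + d_out)

def dora_trainable_params(d_in, d_out, r):
    # LoRA factors plus one magnitude scalar per column.
    return lora_trainable_params(d_in, d_out, r) + d_in

W0 = [[3.0, 0.0, 1.0],
      [4.0, 2.0, 2.0]]           # d_out = 2, d_in = 3 (toy values)
m = column_norms(W0)             # magnitude initialized to the column norms
A = [[0.5, -1.0, 2.0]]           # r x d_in with r = 1
B_zero = [[0.0], [0.0]]          # d_out x r, zero so that B A = 0 at the start

W_init = dora_weight(W0, m, B_zero, A)      # reproduces W0
B_trained = [[0.1], [-0.2]]                 # pretend B has been updated
W_new = dora_weight(W0, m, B_trained, A)    # direction changed, magnitudes still m

print(W_init)
print(column_norms(W_new), m)
print(lora_trainable_params(4096, 4096, 32), dora_trainable_params(4096, 4096, 32))
```

With m left untouched, the column norms of `W_new` still equal m even though the directions have moved, which shows that the low-rank term affects only the direction. For the 4096 x 4096 case at rank 32, the magnitude vector adds 4,096 parameters to LoRA's 262,144.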

Gradient detachment trick

A naive implementation of the formula above is more expensive than LoRA because backpropagating through the column-wise norm adds gradient computations and stored intermediates for $V + \Delta V$.^[1] The DoRA paper proposes detaching $\lVert V + \Delta V \rVert_c$ from the autograd graph, treating it as a constant during backpropagation: the norm is still recomputed in every forward pass, but no gradient flows through it.^[1] This removes the gradient contribution that would otherwise pass through the norm while empirically having minimal effect on training quality, and reduces DoRA's training-time memory overhead to roughly that of LoRA plus the magnitude vector.^[1]
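The effect of detaching the norm can be seen for a single column. The derivation below is a sketch in notation introduced here for illustration: $u$ is one column of $V + \Delta V$, $m$ is that column's scalar magnitude, $w'$ is the resulting column of $W'$, and $g$ is the loss gradient with respect to $w'$.

```latex
\begin{aligned}
w' &= m \, \frac{u}{\lVert u \rVert},
\qquad g = \frac{\partial \mathcal{L}}{\partial w'} \\[4pt]
\text{norm in the graph:}\quad
\frac{\partial \mathcal{L}}{\partial u}
  &= \frac{m}{\lVert u \rVert}
     \left( I - \frac{u\,u^{\top}}{\lVert u \rVert^{2}} \right) g,
\qquad
\frac{\partial \mathcal{L}}{\partial m}
   = \frac{u^{\top} g}{\lVert u \rVert} \\[4pt]
\text{norm detached } (c = \lVert u \rVert \text{ held constant}):\quad
\frac{\partial \mathcal{L}}{\partial u}
  &\approx \frac{m}{c}\, g
\end{aligned}
```

With the norm in the graph, the gradient for $u$ is rescaled and projected onto the subspace orthogonal to $u$. Detaching drops the projection term and keeps only the rescaling, so nothing needs to be backpropagated through the norm; the gradient for $m$ is unchanged.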

Does DoRA add inference overhead?

No. Once training is complete, DoRA's update can be merged back into a single full-rank weight matrix $W' = m (V + BA) / \lVert V + BA \rVert_c$, identical in shape and computational cost to the original $W_0$.^[1]^[3] As the NVIDIA technical blog puts it, "DoRA can be considered a costless replacement for LoRA, as its decomposed magnitude and direction components can be merged back into the pretrained weight after the training, ensuring that there is no extra inference overhead."^[3] This means deployment infrastructure that loads a merged checkpoint sees no architectural difference between DoRA-finetuned and base-model inference, preserving LoRA's "zero inference overhead" property.^[1]^[3]

If adapters are kept un-merged (for example, to swap between multiple task-specific adapters on a shared base model), the per-forward overhead includes the column-norm computation and the broadcast scaling, which the Hugging Face PEFT documentation reports adds noticeable time relative to LoRA: in benchmarks on a Llama 3.1 8B model, un-merged DoRA without caching ran approximately 139% slower than LoRA and used about 4% more memory; with the library's DoRA caching helper enabled in evaluation mode, the time penalty fell to about 17% but memory usage rose by 41%.^[5] For production inference, the PEFT team recommends merging the DoRA adapter into the base weights to eliminate both penalties.^[5]
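The difference between the merged and un-merged paths can be sketched as follows. This is a plain-Python illustration under the column-wise convention used in this article, not a description of how PEFT or the NVIDIA code is implemented; all function names and values are hypothetical.

```python
import math

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def adapted_direction(W0, m, B, A):
    """Return V = W0 + B A and the per-column factor m_j / ||V[:, j]||."""
    r = len(A)
    V = [[W0[i][j] + sum(B[i][k] * A[k][j] for k in range(r))
          for j in range(len(W0[0]))] for i in range(len(W0))]
    norms = [math.sqrt(sum(v * v for v in col)) for col in zip(*V)]
    return V, [mj / nj for mj, nj in zip(m, norms)]

def merge(W0, m, B, A):
    """Materialize W' once after training; inference then uses W' alone."""
    V, scale = adapted_direction(W0, m, B, A)
    return [[scale[j] * row[j] for j in range(len(scale))] for row in V]

def unmerged_forward(W0, m, B, A, x):
    """Keep the adapter separate: the norm and scaling are redone on every call."""
    _, scale = adapted_direction(W0, m, B, A)
    xs = [s * xj for s, xj in zip(scale, x)]   # scaling columns = scaling inputs
    base = matvec(W0, xs)
    low_rank = matvec(B, matvec(A, xs))
    return [b + l for b, l in zip(base, low_rank)]

W0 = [[3.0, 0.0, 1.0],
      [4.0, 2.0, 2.0]]
m = [5.5, 1.8, 2.0]          # pretend-trained magnitudes
B = [[0.1], [-0.2]]
A = [[0.5, -1.0, 2.0]]
x = [1.0, -2.0, 0.5]

y_merged = matvec(merge(W0, m, B, A), x)
y_unmerged = unmerged_forward(W0, m, B, A, x)
print(y_merged, y_unmerged)
```

Both paths give the same output. The merged path pays for the norm once, while the un-merged path repeats it on every forward pass, which is the kind of extra work the overhead figures above refer to.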

Which layer types does DoRA support?

DoRA is defined for any weight matrix and has been extended to several common layer types. The reference NVIDIA implementation supports Linear, Conv1d, Conv2d, and bitsandbytes-quantized linear layers.^[2] The Hugging Face PEFT implementation initially supported only non-quantized linear layers in v0.9.0 (released February 28, 2024), then added Conv2D support and DoRA with bitsandbytes quantization ("QDoRA") in v0.10.0 (released March 21, 2024).^[4]^[9] PEFT documentation notes that DoRA in PEFT supports embedding, linear, and Conv2d layers and is incompatible with a small number of more complex PEFT features such as VeLoRA composition.^[5]

How does DoRA compare to LoRA?

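The paper reports extensive evaluations on language and vision-language benchmarks, primarily on the LLaMA family for language tasks and on LLaVA and VL-BART for multimodal tasks.^[1]^[3] NVIDIA's summary of the headline gains states that DoRA outperforms LoRA on tasks "such as common-sense reasoning (+3.7/+1.0 on Llama 7B/13B, +2.9 on Llama 2 7B, and +4.4 on Llama 3 8B)."^[3] Against the tables below, the +2.9 figure for Llama 2 7B corresponds to the halved-rank DoRA† configuration (80.5% versus LoRA's 77.6%); full-rank DoRA scores 79.7%, a gain of 2.1 points.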

Commonsense reasoning

The most cited benchmark in the paper is an eight-task commonsense reasoning suite assembled by the authors: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-easy (ARC-e), ARC-challenge (ARC-c), and OpenBookQA (OBQA).^[1]^[3] Models are fine-tuned on a Commonsense170K instruction dataset and evaluated zero-shot on each downstream task; reported scores are mean accuracy across the eight tasks.^[1]

Headline results, with parameter percentages quoted relative to the base model:

Model	Method	Trainable params	Average accuracy
LLaMA-7B	LoRA	0.83%	74.7%
LLaMA-7B	DoRA	0.84%	78.4%
LLaMA-13B	LoRA	0.67%	80.5%
LLaMA-13B	DoRA	0.68%	81.5%
LLaMA2-7B	LoRA	0.83%	77.6%
LLaMA2-7B	DoRA	0.84%	79.7%
LLaMA3-8B	LoRA	0.70%	80.8%
LLaMA3-8B	DoRA	0.71%	85.2%

Source: arXiv:2402.09353 Table 1 and the official project page.^[1]^[3]

Across the four base models, DoRA's average accuracy exceeds LoRA's by between 1.0 and 4.4 percentage points while training only a marginal additional fraction of parameters (the magnitude vector).^[1]^[3] The gap is largest on LLaMA 3 8B (+4.4) and smallest on LLaMA-13B (+1.0).^[1]^[3]
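The per-model gaps can be recomputed directly from the table. The short sketch below simply subtracts the LoRA average from the DoRA average for each row; the dictionary layout and variable names are illustrative.

```python
# (LoRA average accuracy, DoRA average accuracy) as listed in the table above.
results = {
    "LLaMA-7B": (74.7, 78.4),
    "LLaMA-13B": (80.5, 81.5),
    "LLaMA2-7B": (77.6, 79.7),
    "LLaMA3-8B": (80.8, 85.2),
}

# Round to one decimal place to hide floating-point noise in the subtraction.
gains = {model: round(dora - lora, 1) for model, (lora, dora) in results.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain} points")

print("smallest gain:", min(gains.values()), "largest gain:", max(gains.values()))
```

The gains come out as +3.7, +1.0, +2.1, and +4.4 points, matching the stated range of 1.0 to 4.4.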

Halved-rank experiments (DoRA†)

The authors additionally report a "DoRA dagger" variant in which DoRA uses approximately half the LoRA rank, intended to test whether the method's gains come purely from added parameters or from improved learning capacity.^[1] This ablation is important because DoRA's full rank-16 or rank-32 configuration adds the magnitude vector on top of an unchanged LoRA, so a skeptical reader might wonder whether any benefit is simply due to the extra trainable scalars rather than to the structural advantages of the magnitude/direction split.^[1]^[3] By halving the rank, the authors construct DoRA configurations that have fewer total trainable parameters than the LoRA baseline while still maintaining the magnitude/direction decomposition; if DoRA's improvements survive this stricter parameter accounting, they cannot be explained away as a simple capacity argument.^[1]

In these experiments, DoRA at roughly half the rank still matches or exceeds LoRA at full rank:^[1]^[3]

Model	Method	Trainable params	Average accuracy
LLaMA-7B	LoRA (r=32)	0.83%	74.7%
LLaMA-7B	DoRA† (r=16)	0.43%	77.5%
LLaMA-13B	LoRA (r=32)	0.67%	80.5%
LLaMA-13B	DoRA† (r=16)	0.35%	80.8%
LLaMA2-7B	LoRA (r=32)	0.83%	77.6%
LLaMA2-7B	DoRA† (r=16)	0.43%	80.5%
LLaMA3-8B	LoRA (r=32)	0.70%	80.8%
LLaMA3-8B	DoRA† (r=16)	0.35%	85.0%

These halved-rank results are the basis for the paper's claim that DoRA provides genuine extra learning capacity rather than simply benefiting from additional trainable parameters.^[1]^[3] On LLaMA 2 7B, the rank-halved configuration outperforms full-rank DoRA (80.5% versus 79.7%), and on LLaMA 3 8B it nearly matches it (85.0% versus 85.2%), suggesting that the magnitude/direction split accounts for most of the gain and that the additional rank contributes only marginally on top of it.^[1]^[3]

Multimodal and vision-language tasks

On image and video-text understanding tasks with VL-BART, the NVIDIA technical blog reports DoRA improving average benchmark scores by 0.9 to 1.9 points relative to LoRA at matched parameter budgets.^[3] On visual instruction tuning with LLaVA 1.5 7B, DoRA improves average scores by approximately 0.6 points over LoRA.^[3] On the Multi-Turn Benchmark (MT-Bench) for instruction tuning, DoRA improves by 0.3 to 0.4 points over LoRA on Llama variants.^[3]

DVoRA: DoRA combined with VeRA

The DoRA repository includes "DVoRA," a variant in which the directional low-rank update is replaced with a Vector-based Random matrix Adaptation (VeRA) parameterization, which shares random projection matrices across layers and trains only small scaling vectors.^[2] On MT-Bench using cleaned Alpaca data, DVoRA achieves an average score of 5.0 versus VeRA's 4.3 while adding approximately 0.02% additional trainable parameters relative to the full model.^[2]
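Written in the notation of the formulation section, the DVoRA parameterization can be sketched as below. The symbols $\Lambda_b$ and $\Lambda_d$ (diagonal matrices built from the trainable scaling vectors $b$ and $d$) follow the usual presentation of VeRA and are assumptions here, since the text above does not spell the formula out.

```latex
\begin{aligned}
\text{VeRA:}\quad  W' &= W_0 + \Lambda_b \, B \, \Lambda_d \, A \\[4pt]
\text{DVoRA:}\quad W' &= m \,
  \frac{V + \Lambda_b \, B \, \Lambda_d \, A}
       {\lVert V + \Lambda_b \, B \, \Lambda_d \, A \rVert_c}
\end{aligned}
```

Here $B$ and $A$ are frozen random matrices shared across layers, so the trainable parameters per adapted matrix are the two scaling vectors plus the magnitude vector $m$.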

Where can you use DoRA? Implementations and adoption

Reference implementations

Two official implementations are publicly available: the NVIDIA NVlabs/DoRA repository (released under the NVIDIA Source Code License-NC, non-commercial) and a companion nbasyl/DoRA repository maintained by the lead author Shih-Yang Liu.^[2]^[10] The NVlabs repo includes fine-tuning recipes for LLaMA, LLaMA 2, LLaMA 3, LLaVA 1.5, and VL-BART, and references Stable Diffusion integration via Hugging Face Diffusers.^[2]

Hugging Face PEFT

DoRA is supported in the Hugging Face PEFT library starting with version 0.9.0 (released February 28, 2024), enabled by setting use_dora=True in a LoraConfig.^[4] PEFT v0.10.0 (March 21, 2024) added Conv2D support, plus quantized linear layers via bitsandbytes (the integration the PEFT team refers to as "QDoRA").^[9] Later PEFT releases added a DoraCaching helper context and an ephemeral_gpu_offload runtime option to reduce DoRA's training-time speed penalty by reusing intermediate computations when models are in evaluation mode or when dropout is zero.^[5]

The PEFT documentation lists three caveats: DoRA in PEFT supports only embedding, linear, and Conv2D layers; un-merged DoRA introduces nontrivial overhead and should be merged before inference; and QDoRA in combination with DeepSpeed ZeRO-2 has been reported to cause issues.^[5]

Other frameworks

DoRA is supported in Axolotl via a peft_use_dora configuration option and in the LLaMA-Factory toolkit, both of which build on Hugging Face PEFT.^[11] Community guides describe DoRA alongside LoRA, QLoRA, and Spectrum as standard PEFT methods for adapting open-weight models on consumer hardware.^[11] Unsloth and similar high-performance fine-tuning libraries have also added DoRA support as it propagated through the PEFT ecosystem.^[5]

NVIDIA's June 2024 technical blog announced planned support for DoRA across the company's NIM, TensorRT, NeMo, and Metropolis product platforms, positioning it as a "costless replacement for LoRA" for customers building on NVIDIA inference and training infrastructure.^[3]

What is QDoRA? 4-bit quantized DoRA

A particularly visible downstream application is QDoRA, a 4-bit quantized variant of DoRA developed in collaboration between NVIDIA and Answer.AI in 2024.^[3]^[12] QDoRA applies the DoRA decomposition on top of 4-bit NF4-quantized base weights (the same quantization scheme used by QLoRA), so that the magnitude vector and low-rank directional update are trained in higher precision while the bulk of the network remains in 4 bits.^[12]

Answer.AI's April 22, 2024 technical post by Kerem Turgutlu reported Orca-Math fine-tuning experiments on Llama 2 7B and Llama 3 8B and 70B comparing full fine-tuning, QLoRA, and QDoRA with Fully Sharded Data Parallel (FSDP) training.^[12] On 10,000 Orca-Math samples, the reported exact-match scores were 0.182 for full fine-tuning, 0.176 for QDoRA, and 0.098 for QLoRA; on 100,000 samples, the scores were 0.260, 0.312, and 0.118 respectively.^[12] At the larger data scale, QDoRA exceeded full fine-tuning's exact-match score while training approximately 2% of the model parameters and using roughly an order of magnitude less peak GPU memory than full fine-tuning.^[12]

Diffusion and image personalization

DoRA has been used for text-to-image personalization through the Hugging Face Diffusers library. The NVIDIA blog and project page describe DreamBooth-style fine-tuning of Stable Diffusion and SDXL models in which DoRA replaces LoRA as the parameterization for the cross-attention adapters, reporting better identity preservation and prompt fidelity in qualitative comparisons.^[3] Personalization workflows benefit from DoRA in two distinct ways: the magnitude/direction decomposition appears to help retain pre-trained knowledge of the subject's surrounding context (lighting, pose, scene composition) while specializing the directional component for the new identity, and the merged-checkpoint inference path means that personalized DoRA adapters can be deployed in production diffusion pipelines (such as commercial image-generation services) with no inference cost penalty relative to a plain Stable Diffusion checkpoint.^[2]^[3]

Why does DoRA matter? Applications and significance

Practical benefits

The headline practical claim for DoRA is "LoRA-like cost, full-fine-tuning-like quality."^[1]^[3] In the regimes studied by the original paper and follow-up reports, DoRA closes most of the residual accuracy gap between LoRA and full fine-tuning at parameter budgets indistinguishable from LoRA's, and in some settings (commonsense reasoning on LLaMA 3 8B, Orca-Math with QDoRA) exceeds full fine-tuning while training a small fraction of the parameters.^[1]^[3]^[12]

For practitioners, this matters because LoRA is the de facto standard PEFT method for adapting open-weight LLMs and vision-language models on consumer or single-node hardware, and a drop-in replacement that improves quality with minimal change to training recipes is unusually low-friction.^[5]^[11] In Hugging Face PEFT, switching from LoRA to DoRA requires only adding use_dora=True to an existing LoraConfig.^[4]^[5]

Implications for PEFT research

DoRA's weight-decomposition analysis has also influenced how the broader PEFT literature reasons about why LoRA-family methods underperform full fine-tuning.^[1] Several follow-up papers have extended the magnitude/direction lens to other PEFT designs, proposed alternative decompositions (for example, separating layer norms or biases as their own learnable scalars), or combined DoRA with rank-allocation strategies, adaptive rank pruning, and other PEFT variants.^[3] In the Hugging Face PEFT library, DoRA sits alongside a growing set of LoRA variants that target the same fine-tuning-gap problem from different angles.^[5]

Scope of applicability

Because DoRA modifies the LoRA reparameterization rather than introducing a new architecture, it is in principle applicable wherever LoRA is applicable: transformer-based large language models and vision-language models, convolutional models (via the Conv2D and Conv1D variants), Stable Diffusion and SDXL image generators, and any other architecture where dense weight matrices dominate the parameter count.^[2]^[5]

In practice, the choice of which matrices to apply DoRA to mirrors LoRA conventions: most published results target the query and value projections of the attention sublayers (and increasingly the key and output projections, plus the gate, up, and down projections of the feed-forward sublayers) of decoder-only large language models such as the LLaMA family.^[1]^[2] For convolutional and diffusion architectures, applying DoRA to the cross-attention projection matrices typically yields the largest quality improvements per added parameter.^[2]^[3]

How does DoRA compare to other LoRA improvements?

A number of LoRA-improvement papers were proposed in 2023 and 2024 that targeted similar accuracy gaps from different angles. AdaLoRA adaptively allocates rank across layers based on the importance of each weight matrix; LoRA+ uses different learning rates for the A and B factors; PiSSA initializes the low-rank factors from the dominant singular vectors of the pre-trained matrix; and rsLoRA rescales the LoRA output by $1/\sqrt{r}$ instead of $1/r$ to stabilize training at high ranks.^[5] DoRA's contribution is orthogonal to these in the sense that the magnitude/direction split could in principle be combined with any of them (and some Hugging Face PEFT documentation suggests trying DoRA alongside LoRA+ or rsLoRA in particular), though the original DoRA paper does not present such joint experiments.^[5]
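The rsLoRA rescaling mentioned above can be made concrete with a short sketch. The numerator `alpha` is the usual LoRA scaling hyperparameter and is an assumption here, since the text only gives the $1/r$ and $1/\sqrt{r}$ dependence; the function names are illustrative.

```python
import math

def lora_scale(alpha, r):
    """Standard LoRA multiplier applied to the low-rank product B A."""
    return alpha / r

def rslora_scale(alpha, r):
    """Rank-stabilized variant: divide by sqrt(r) instead of r."""
    return alpha / math.sqrt(r)

alpha = 16
for r in (4, 16, 64, 256):
    print(f"r={r:>3}  LoRA scale={lora_scale(alpha, r):.4f}  "
          f"rsLoRA scale={rslora_scale(alpha, r):.4f}")
```

With `alpha` fixed, the standard multiplier shrinks in proportion to the rank, while the rank-stabilized one shrinks only with its square root. Because either multiplier applies to the low-rank product alone, the choice is independent of whether DoRA's magnitude/direction split is used on top.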

What are DoRA's limitations?

Training-time overhead

DoRA is not free: each forward and backward pass through a DoRA-adapted layer computes a column-wise norm and a broadcast division, both of which cost more than LoRA's plain matrix product.^[1]^[5] As noted above, Hugging Face PEFT benchmarks on a Llama 3.1 8B model show that un-merged DoRA training takes roughly 2.4 times as long as LoRA without optimizations, falling to roughly 1.2 times with the DoRA caching helper enabled.^[5] On the same benchmark, an optimized DoRA training run achieved 2.292 train samples per second versus 1.779 without the optimization, a 29% speedup from the in-library optimizations alone.^[5]
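The figures quoted in this article are expressed in two ways ("percent slower" and "times as long"), and the conversion between them is simple arithmetic. The sketch below only re-derives the numbers already quoted; the helper name `time_ratio` is hypothetical.

```python
def time_ratio(percent_slower):
    """Convert 'X% slower' into a wall-clock ratio relative to the baseline."""
    return 1.0 + percent_slower / 100.0

uncached = time_ratio(139)   # un-merged DoRA without caching
cached = time_ratio(17)      # with the caching helper enabled

# Throughput figures quoted above, in train samples per second.
optimized, unoptimized = 2.292, 1.779
speedup = optimized / unoptimized - 1.0

print(f"uncached: {uncached:.2f}x LoRA time")
print(f"cached:   {cached:.2f}x LoRA time")
print(f"optimization speedup: {speedup:.1%}")
```

This reproduces the roughly 2.4x and 1.2x ratios and the 29% speedup stated above.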

The training-time gap matters for very large models or when fine-tuning experiments are wall-clock-bound. For inference, the gap disappears as long as the adapter is merged before deployment.^[1]^[3]

Memory overhead

DoRA's column-wise norm requires retaining additional activations during backpropagation, so peak training memory is somewhat higher than for LoRA at the same rank.^[5] With caching enabled, PEFT reports approximately 41% additional memory consumption relative to LoRA, which can be the binding constraint when fine-tuning the largest open-weight models on single GPUs.^[5]

Implementation maturity and compatibility

As a younger method than LoRA, DoRA has narrower ecosystem support. PEFT documentation lists explicit incompatibilities with some composition features (VeLoRA), and Axolotl issue reports flag specific configurations such as 8-bit DoRA training with FSDP that have not worked reliably.^[5]^[11] QDoRA in combination with DeepSpeed ZeRO-2 has also been reported as problematic.^[5]

Licensing

The reference NVlabs/DoRA implementation is released under the NVIDIA Source Code License-NC, which prohibits commercial use of that codebase.^[2] The Hugging Face PEFT reimplementation is permissively licensed and is the practical path for commercial deployment.^[4]^[5] Users building products on DoRA need to be aware of which implementation they are pulling code from.

Benchmark coverage

While DoRA reports strong results on commonsense reasoning, vision-language tasks, and a handful of instruction-tuning evaluations, its public benchmark coverage is narrower than LoRA's, which has been studied across thousands of downstream tasks in the broader PEFT literature.^[1]^[7] Independent replications on additional benchmarks have generally confirmed DoRA's advantages on the tasks measured, though, like any new method, performance can vary by task and base model.^[3]^[12]

Related work

DoRA sits within a broader ecosystem of low-rank and decomposition-based PEFT methods. Notable adjacent approaches include:

LoRA (Low-Rank Adaptation): the parent method that DoRA modifies, freezing pre-trained weights and adding a trainable low-rank update.^[7]
QLoRA: combines LoRA with 4-bit NF4 quantization of the base weights, enabling fine-tuning of very large models on a single GPU. QDoRA is the DoRA analog of QLoRA.^[3]^[12]
VeRA (Vector-based Random matrix Adaptation): shares fixed random projection matrices across layers and trains only small scaling vectors. DVoRA combines DoRA's magnitude/direction decomposition with VeRA's directional update.^[2]
Weight Normalization (Salimans and Kingma, 2016): the historical predecessor of DoRA's magnitude/direction decomposition, applied at the level of individual neuron weight vectors rather than full matrices.^[8]
Low-rank adaptation as a general technique class, encompassing LoRA, DoRA, and many other variants.^[7]

Method	Decomposition	Trainable params	Inference overhead
Full fine-tuning	None (all weights)	100%	None
LoRA	$W_0 + BA$	$r(d_{in} + d_{out})$	None (merged)
DoRA	$m (V + BA) / \lVert V + BA \rVert_c$	$r(d_{in} + d_{out}) + d_{in}$	None (merged)
QLoRA	$\mathrm{NF4}(W_0) + BA$	$r(d_{in} + d_{out})$	None (merged)
QDoRA	$m (\mathrm{NF4}(V) + BA) / \lVert \mathrm{NF4}(V) + BA \rVert_c$	$r(d_{in} + d_{out}) + d_{in}$	None (merged)
VeRA	shared random A, B + scalars	small scalar vectors	None (merged)
DVoRA	DoRA with VeRA-style update	small scalar vectors + m	None (merged)

Sources: original papers and the NVlabs DoRA README.^[1]^[2]^[7]

Disambiguation

The name "Dora" appears in several unrelated contexts in artificial intelligence and robotics. The most prominent collision is with the Noetix Robotics Dora humanoid robot, a separate product unrelated to the parameter-efficient fine-tuning method described here. Earlier academic uses of the name include an unrelated "DoRA" standing for "Distributed Optimization Reinforcement Algorithm" and various proper names. When citing this method, the canonical reference is Liu et al., "DoRA: Weight-Decomposed Low-Rank Adaptation," arXiv:2402.09353.^[1]

References

[1] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen, "DoRA: Weight-Decomposed Low-Rank Adaptation", arXiv, 2024-02-14 (v1; revised through v6 on 2024-07-09). https://arxiv.org/abs/2402.09353. Accessed 2026-05-20.
[2] NVIDIA Labs, "NVlabs/DoRA: [ICML2024 (Oral)] Official PyTorch implementation of DoRA", GitHub, 2024. https://github.com/NVlabs/DoRA. Accessed 2026-05-20.
[3] Shih-Yang Liu, Chien-Yi Wang, Min-Hung Chen and colleagues, "Introducing DoRA, a High-Performing Alternative to LoRA for Fine-Tuning", NVIDIA Technical Blog, 2024-06-28. https://developer.nvidia.com/blog/introducing-dora-a-high-performing-alternative-to-lora-for-fine-tuning/. Accessed 2026-05-20.
[4] Hugging Face, "Release v0.9.0: Merging LoRA weights, new quantization options, DoRA support, and more", GitHub, 2024-02-28. https://github.com/huggingface/peft/releases/tag/v0.9.0. Accessed 2026-05-20.
[5] Hugging Face, "LoRA developer guide (PEFT documentation)", Hugging Face Docs. https://huggingface.co/docs/peft/main/en/developer_guides/lora. Accessed 2026-05-20.
[6] ICML, "DoRA: Weight-Decomposed Low-Rank Adaptation (Oral)", International Conference on Machine Learning 2024 program, 2024. https://icml.cc/virtual/2024/oral/35576. Accessed 2026-05-20.
[7] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, "LoRA: Low-Rank Adaptation of Large Language Models", arXiv, 2021-06-17. https://arxiv.org/abs/2106.09685. Accessed 2026-05-20.
[8] Tim Salimans and Diederik P. Kingma, "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks", arXiv, 2016-02-25. https://arxiv.org/abs/1602.07868. Accessed 2026-05-20.
[9] Hugging Face, "Release v0.10.0: Fine-tune larger QLoRA models with DeepSpeed and FSDP, layer replication, enhance DoRA", GitHub, 2024-03-21. https://github.com/huggingface/peft/releases/tag/v0.10.0. Accessed 2026-05-20.
[10] Shih-Yang Liu, "nbasyl/DoRA: Official implementation of DoRA: Weight-Decomposed Low-Rank Adaptation", GitHub, 2024. https://github.com/nbasyl/DoRA. Accessed 2026-05-20.
[11] Superteams.ai, "A Definitive Guide to Fine-Tuning LLMs Using Axolotl and Llama-Factory", Superteams blog, 2024. https://www.superteams.ai/blog/a-definitive-guide-to-fine-tuning-llms-using-axolotl-and-llama-factory. Accessed 2026-05-20.
[12] Kerem Turgutlu (Answer.AI), "Efficient finetuning of Llama 3 with FSDP QDoRA", Answer.AI blog, 2024-04-22. https://www.answer.ai/posts/2024-04-26-fsdp-qdora-llama3.html. Accessed 2026-05-20.
