DARE (Drop And REscale)
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,870 words
DARE (Drop And REscale) is a training-free preprocessing technique for model merging that sparsifies the parameter changes introduced by fine-tuning before those changes are combined across several models. It was introduced by Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li, with co-authors affiliated with Alibaba Group, in the paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch." The work first appeared as an arXiv preprint on November 6, 2023, and was published at the International Conference on Machine Learning (ICML) in 2024. [1]
The method rests on a single empirical observation: the delta parameters produced by supervised fine-tuning, meaning the elementwise difference between a fine-tuned model and its pre-trained base, are highly redundant. Because of that redundancy, a large fraction of them can be reset to zero without measurably degrading the fine-tuned model. DARE turns this observation into a recipe: randomly drop a proportion p of the delta parameters using a Bernoulli mask, then rescale the survivors by a factor of 1 / (1 - p) so that the expected value of each delta entry is preserved. The sparsified deltas from several homologous models can then be combined, by simple averaging, task arithmetic, or TIES-Merging, with far less destructive interference than merging the dense deltas directly. [1][2]
The goal DARE addresses is fusing the abilities of multiple specialist models, for example one tuned for mathematics, one for code, and one for instruction following, into a single model without any additional training, gradient computation, or GPUs. All of the models must be homologous, meaning they were fine-tuned from the same pre-trained checkpoint and therefore share an identical architecture and initialization. This shared origin is what makes the per-parameter differences comparable and additive. [1]
DARE itself is not a merging algorithm. It is a sparsification step applied to each model's delta independently, and it is designed to be composed with an existing merging operator. The headline empirical claim of the paper is that for typical supervised fine-tuning deltas, DARE can eliminate 90 percent or even 99 percent of the delta parameters with negligible effect on the individual model's performance, and that doing so before merging reduces the interference that normally causes merged models to lose capability. The authors also report that the benefit grows with model scale, to the point where a merged model can exceed the performance of any of its individual source models. [1][2]
A fine-tuned model can be described by its base weights plus a difference vector. If the pre-trained weights are written as a vector and the fine-tuned weights as another, the delta parameters are their elementwise subtraction. This same object is called a task vector in the task arithmetic literature, where adding or subtracting task vectors is used to add or remove capabilities. [1]
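As a minimal sketch of this definition, the snippet below computes a delta from two flat weight lists and checks that adding it back to the base recovers the fine-tuned weights. Plain Python lists stand in for weight tensors, and the function name `delta_parameters` and the toy values are assumptions made for illustration, not taken from the paper.

```python
def delta_parameters(finetuned, base):
    """Elementwise difference between fine-tuned and pre-trained weights."""
    if len(finetuned) != len(base):
        raise ValueError("homologous models must have the same number of parameters")
    return [ft - b for ft, b in zip(finetuned, base)]


# Toy weights (made-up values, for illustration only).
base = [0.5, -0.25, 1.0, 0.0]
finetuned = [0.501, -0.2515, 1.0, 0.002]

delta = delta_parameters(finetuned, base)

# Adding the delta back onto the base reproduces the fine-tuned model.
restored = [b + d for b, d in zip(base, delta)]
```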
The core finding behind DARE is that these supervised fine-tuning deltas occupy a very small numerical range, typically within about 0.002 in absolute value, and that the information they carry is spread across the parameters with heavy redundancy. The authors demonstrate this by showing that randomly zeroing the overwhelming majority of delta entries leaves benchmark accuracy essentially intact, provided the survivors are rescaled to compensate. The paper frames this as evidence that supervised fine-tuning mostly unlocks abilities already latent in the pre-trained model rather than writing substantial new information into the weights. A practical corollary, confirmed in the paper's analysis, is that the redundancy is specific to lightweight supervised fine-tuning: when models are produced by continued pre-training, the deltas have much larger magnitude and variance, and aggressively dropping them is far more damaging. [1][6]
DARE is applied to one model's delta at a time and consists of two operations whose initials give the method its name. [1]
First, Drop. Each delta parameter is independently set to zero with probability p, the drop rate, using a random Bernoulli mask. Equivalently, each parameter survives with probability 1 - p. This step is unstructured and data-free: it does not look at the magnitude of a weight, at training data, or at any importance score, which distinguishes it from magnitude-based pruning that deterministically keeps the largest weights.
Second, REscale. The surviving delta parameters are divided by (1 - p), equivalently multiplied by 1 / (1 - p). Because a parameter survives with probability 1 - p, the expected value of each rescaled delta entry equals its original value, so the sparsified delta is an unbiased estimate of the dense delta and the model's expected outputs are approximately preserved. This is the same correction used by inverted dropout, with the important difference that DARE applies it once to the weight differences at merge time rather than stochastically to activations during training.
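The two operations can be sketched in a few lines. This is an illustrative, standard-library-only version that treats a delta as a flat list of floats. The function name `dare`, the seed, and the constant toy delta are assumptions for the example, not the authors' code.

```python
import random


def dare(delta, p, rng):
    """Drop each entry with probability p, rescale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("drop rate p must satisfy 0 <= p < 1")
    scale = 1.0 / (1.0 - p)
    # rng.random() is uniform on [0, 1), so an entry survives (draw >= p)
    # with probability 1 - p, independently of its magnitude.
    return [d * scale if rng.random() >= p else 0.0 for d in delta]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
delta = [0.001] * 100_000       # toy delta with a constant small value
sparse = dare(delta, p=0.9, rng=rng)

# Roughly 10 percent of entries survive, and the mean stays close to 0.001.
kept_fraction = sum(1 for d in sparse if d != 0.0) / len(sparse)
mean_after = sum(sparse) / len(sparse)
```

With a large enough delta, `kept_fraction` lands near 1 - p and `mean_after` near the original mean, which is the unbiasedness property described above.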
After both operations, the sparsified and rescaled delta is added back onto the shared base weights. When merging several models, DARE is run on each model's delta separately, and the resulting sparse deltas are combined by the chosen merging operator. In the simplest case this is an unweighted average or a scaled sum (task arithmetic), but DARE can equally feed a more elaborate operator such as TIES-Merging. The merged weights are the base weights plus the combined, sparsified deltas. Because most entries in each delta are now zero, two models rarely try to modify the same coordinate at once, which is the mechanism by which DARE reduces interference. [1][2]
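The merge pipeline described above might look like the following sketch, which uses a scaled sum (task arithmetic) as the merging operator. The names `merge_with_dare`, `scaling`, and `seed`, along with the toy "math" and "code" weights, are hypothetical and chosen only for the example.

```python
import random


def dare(delta, p, rng):
    """Random drop with probability p, then rescale survivors by 1 / (1 - p)."""
    scale = 1.0 / (1.0 - p)
    return [d * scale if rng.random() >= p else 0.0 for d in delta]


def merge_with_dare(base, finetuned_models, p, scaling=1.0, seed=0):
    """Base weights plus a scaled sum of DARE-sparsified deltas."""
    rng = random.Random(seed)
    merged = list(base)
    for weights in finetuned_models:
        # DARE is applied to each model's delta separately.
        delta = [w - b for w, b in zip(weights, base)]
        for i, d in enumerate(dare(delta, p, rng)):
            merged[i] += scaling * d
    return merged


# Toy homologous models sharing one base (made-up values).
base = [0.0, 1.0, -1.0, 0.5]
math_model = [0.002, 1.0, -1.001, 0.5]
code_model = [0.0, 0.999, -1.0, 0.502]

merged = merge_with_dare(base, [math_model, code_model], p=0.5)
dense = merge_with_dare(base, [math_model, code_model], p=0.0)  # no dropping
```

Setting `p=0.0` reduces the sketch to plain task arithmetic on dense deltas, which makes it easy to compare the two on the same inputs.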
The drop rate p is the method's main hyperparameter. For the small supervised fine-tuning deltas studied in the paper, very high rates such as 0.9 to 0.99 work well. In general-purpose merging toolkits the retained fraction (called density, equal to 1 - p) is often set more conservatively, for example keeping 20 to 50 percent of the deltas, because models in the wild are not always pure supervised fine-tunes. [2][3]
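To make the relationship between density, drop rate, and rescaling concrete, the small sketch below tabulates the rescale factor implied by a few densities. The helper name `rescale_factor` is an assumption for illustration, and the densities are simply sample points from the ranges mentioned above.

```python
def rescale_factor(density):
    """Factor applied to surviving deltas: 1 / (1 - p) with p = 1 - density."""
    if not 0.0 < density <= 1.0:
        raise ValueError("density must satisfy 0 < density <= 1")
    return 1.0 / density


for density in (0.5, 0.2, 0.1, 0.01):
    drop_rate = 1.0 - density
    print(f"density={density:<5} drop rate p={drop_rate:.2f} "
          f"rescale factor={rescale_factor(density):.0f}")
```

The conservative toolkit settings therefore multiply survivors by single-digit factors, while the paper's most aggressive drop rates multiply them by one to two orders of magnitude.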
The paper validates DARE on both encoder and decoder language models, with the most striking results on decoder-based large language models in the 7B to 13B range. The merged models are built from publicly released specialists such as WizardLM (instruction following), WizardMath (mathematical reasoning), and a WizardCoder Python variant (code generation), all derived from the LLaMA family, and are evaluated on benchmarks including GSM8K and MATH for mathematics, HumanEval and MBPP for code, and AlpacaEval for instruction following. [1][2]
A representative result is that merging WizardLM with WizardMath raises the instruction model's GSM8K accuracy from 2.2 to 66.3 while the merged model retains its instruction-following ability, demonstrating that a capability can be absorbed from a sibling model essentially for free. Using DARE as a preprocessing step, the authors produced a merged 7B model that ranked first among 7B-parameter models on the Open LLM Leaderboard at the time. The paper reports that this absorb-abilities effect is more pronounced at larger scale, where the merged model can surpass every individual source model. [1][2]
The authors also document where the technique breaks down. DARE degrades when the drop rate is pushed so high that the 1 / (1 - p) rescaling factor becomes very large and amplifies noise, and when the underlying deltas are themselves large in magnitude, as happens with continued pre-training rather than light supervised fine-tuning. A 2024 follow-up study, "DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models," analyzed these failure modes in detail and proposed adjustments for very high pruning rates. [1][6]
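Both failure modes are consistent with a short variance calculation for a single delta entry under the Bernoulli mask. The derivation below is a sketch that follows from the drop-and-rescale definition rather than a result quoted from either paper; the symbols $\delta_i$ (original delta entry), $m_i$ (keep mask), and $\tilde{\delta}_i$ (sparsified entry) are notation introduced here.

```latex
\begin{align}
\tilde{\delta}_i &= \frac{m_i\,\delta_i}{1-p},
  \qquad m_i \sim \mathrm{Bernoulli}(1-p) \\
\mathbb{E}\big[\tilde{\delta}_i\big] &= \frac{(1-p)\,\delta_i}{1-p} = \delta_i \\
\mathrm{Var}\big[\tilde{\delta}_i\big]
  &= \frac{\delta_i^{2}\,(1-p)\,p}{(1-p)^{2}}
   = \frac{p}{1-p}\,\delta_i^{2}
\end{align}
```

The estimate stays unbiased at any drop rate, but its per-entry variance scales with both $p/(1-p)$ and $\delta_i^2$: at p = 0.9 it is $9\,\delta_i^2$ and at p = 0.99 it is $99\,\delta_i^2$, so very high drop rates or large-magnitude deltas each inflate the noise injected into the weights.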
DARE is deliberately modular: it replaces the dense delta of each model with a sparse, rescaled one and then hands off to a separate merging operator. In the original paper it is composed with average merging, task arithmetic, TIES-Merging, Fisher merging, and RegMean, and in most cases it improves the merged result over running those operators on dense deltas. [1]
The relationship with TIES-Merging is the most commonly cited. TIES-Merging (Yadav et al., 2023) reduces interference through its own pipeline of trimming small-magnitude entries, electing a single sign per parameter across models, and averaging only the agreeing values. DARE's random drop step can be substituted for or layered before TIES trimming, which is why the popular open-source toolkit mergekit exposes the combination directly. [3][5]
| mergekit method | DARE drop and rescale | TIES sign election | Combination of surviving deltas |
|---|---|---|---|
| dare_linear | yes | no | linear weighted sum or average |
| dare_ties | yes | yes | sign-elected, agreement-weighted sum |
| ties | no (uses magnitude trim) | yes | sign-elected, agreement-weighted sum |
| task_arithmetic | no | no | scaled sum of dense deltas |
In mergekit terminology the retained fraction is the density parameter (density = 1 - p), and each model can be given its own weight in the final sum. The dare_ties method applies DARE's stochastic drop and rescale and then performs the TIES sign-election step, while dare_linear omits sign election and simply adds the sparsified, rescaled deltas. [3]
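The difference between the two DARE rows of the table can be illustrated with a simplified sketch. This is not mergekit's implementation: it omits per-model weights and normalization options, and the function names `linear_merge` and `elect_and_merge` and the input deltas are hypothetical. The inputs are assumed to be deltas that have already been through drop and rescale.

```python
def sign(x):
    """Return 1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def linear_merge(sparse_deltas):
    """No sign election: add the sparsified deltas coordinate by coordinate."""
    return [sum(column) for column in zip(*sparse_deltas)]


def elect_and_merge(sparse_deltas):
    """Per coordinate, elect the sign of the summed deltas, then average
    only the nonzero entries whose sign agrees with it."""
    merged = []
    for column in zip(*sparse_deltas):
        elected = sign(sum(column))
        agreeing = [d for d in column if d != 0.0 and sign(d) == elected]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Two already-sparsified deltas (made-up values); they disagree in sign
# at coordinate 0 and agree at coordinate 2.
delta_a = [0.02, 0.0, -0.01, 0.0]
delta_b = [-0.01, 0.03, -0.03, 0.0]

summed = linear_merge([delta_a, delta_b])
elected = elect_and_merge([delta_a, delta_b])
```

At the conflicting coordinate the plain sum lets the two models partially cancel, whereas the sign-elected version keeps only the entry that matches the dominant direction.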
DARE is conceptually related to two older ideas, pruning and dropout. Unlike magnitude pruning, it drops parameters at random and rescales the survivors rather than keeping the largest-magnitude weights, and the paper shows that this random strategy is competitive with or better than magnitude pruning for sparsifying supervised fine-tuning deltas. From dropout, it borrows the exact 1 / (1 - p) magnitude-preserving rescale, but applies it once to weight-space differences at merge time instead of repeatedly to activations during training. [1]
DARE's primary application is no-cost capability fusion: taking several specialist checkpoints that share a base model and producing one model that inherits their combined skills without any retraining. This has made it a standard building block in the model-merging tooling that proliferated across the open-source community, where DARE-based recipes (especially dare_ties) are frequently used to assemble high-ranking community models on public leaderboards. Because the procedure needs only the model weights, it is inexpensive and reproducible, and it is often combined with other merge operators rather than used alone. [2][3]
More broadly, DARE provided influential evidence for the view that supervised fine-tuning largely surfaces pre-existing abilities rather than encoding dense new information, a point that has informed later work on delta-parameter compression, parameter-efficient distribution of fine-tuned models, and the interference dynamics of model merging. Its drop-and-rescale primitive continues to appear as a configurable sparsification option in merging libraries and in subsequent research that revisits how aggressively fine-tuning deltas can be pruned. [1][3][6]