TIES-Merging
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,906 words
TIES-Merging is a training-free model merging method that combines several models fine-tuned from a shared pre-trained checkpoint into a single multitask model while explicitly resolving the interference between their parameter updates [1]. The name is an acronym for TrIm, Elect Sign, and Merge, the three operations the method applies in sequence. It was introduced in the paper "TIES-Merging: Resolving Interference When Merging Models" by Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal, published at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023) [1]. The authors were affiliated with the University of North Carolina at Chapel Hill, with Choshen additionally at IBM Research and MIT [1].
The method targets a failure mode of simpler merging schemes such as task arithmetic: when the parameter changes from different fine-tuned models are summed, redundant updates and conflicting signs partially cancel useful information, shrinking the magnitude of influential parameters in the merged model and widening the gap to a multitask-trained model as more checkpoints are combined [1]. TIES-Merging inserts two cleanup operations, trimming and sign election, before averaging. Across language and vision benchmarks it reports consistent gains over task arithmetic, simple weight averaging, Fisher Merging, and RegMean [1].
Model merging builds a single set of weights by combining multiple existing models in parameter space, with no further gradient training and, in the common case, no access to the original training data. It works because models fine-tuned from the same pre-trained initialization tend to lie in a connected, low-loss region of weight space, so their weights can be interpolated or added without catastrophic loss [1]. Merging is used to create one multitask model from many single-task models, to improve robustness and out-of-domain generalization, and to compress a collection of checkpoints into one artifact.
A useful abstraction for merging fine-tuned models is the task vector, introduced by Ilharco et al. for task arithmetic [2]. For a model fine-tuned on task t, the task vector is the elementwise difference between the fine-tuned weights and the initialization: tau_t = theta_ft,t - theta_init. The task vector isolates the change produced by fine-tuning. Task arithmetic merges n models by summing their task vectors and adding the result back to the initialization with a scaling coefficient: theta_merged = theta_init + lambda * sum_t tau_t [2]. Simple weight averaging, as used by model soups, is the special case lambda = 1/n, in which the task vectors are averaged rather than summed and the merged weights equal the mean of the fine-tuned weights [3]. More elaborate baselines include Fisher Merging, which weights each parameter by an approximate Fisher information estimate of its importance [5], and RegMean, which solves a closed-form least-squares problem per linear layer to match the merged model's activations to those of the individual models [6].
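The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration on flattened toy weights, not code from either paper; the function names `task_vector` and `task_arithmetic` and all numeric values are assumptions made for the example.

```python
import numpy as np

def task_vector(theta_ft, theta_init):
    """tau_t = theta_ft,t - theta_init (elementwise difference)."""
    return theta_ft - theta_init

def task_arithmetic(theta_init, task_vectors, lam=1.0):
    """theta_merged = theta_init + lam * sum_t tau_t."""
    return theta_init + lam * np.sum(task_vectors, axis=0)

# Toy flattened weights; the values are illustrative only.
theta_init = np.array([0.0, 1.0, -1.0])
theta_a = np.array([0.5, 1.0, -1.5])   # hypothetical model fine-tuned on task A
theta_b = np.array([-0.5, 2.0, -1.0])  # hypothetical model fine-tuned on task B

taus = [task_vector(theta_a, theta_init), task_vector(theta_b, theta_init)]

# Task arithmetic with lam = 1 sums the task vectors.
merged_sum = task_arithmetic(theta_init, taus, lam=1.0)

# lam = 1/n turns the sum into a mean, i.e. plain weight averaging.
merged_avg = task_arithmetic(theta_init, taus, lam=1.0 / len(taus))
```

With lam = 1/n the result coincides with the mean of `theta_a` and `theta_b`. In this toy example the first parameter is moved by +0.5 in one model and -0.5 in the other, so the two updates cancel completely in the merged weights.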
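TIES-Merging identifies two sources of interference that these methods leave unaddressed [1]: redundant parameter values, where small changes that matter little to their own task dilute an influential change made by another model, and sign disagreement, where models move the same parameter in opposite directions so that their updates partially cancel.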
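As a hedged illustration with made-up numbers, the sketch below shows how a plain mean over three task vectors shrinks one influential update of 0.9 on a single parameter, first when the other entries are near zero (redundancy) and then when another model pushes the opposite way (sign conflict).

```python
import numpy as np

# Values of ONE parameter across three task vectors (illustrative only).

# Redundancy: only the first model changes the parameter meaningfully.
redundant = np.array([0.9, 0.01, -0.02])

# Sign conflict: the second model moves the parameter the opposite way.
conflict = np.array([0.9, -0.7, 0.0])

# A plain mean treats every entry as equally informative.
diluted = redundant.mean()
cancelled = conflict.mean()

print(f"redundant entries: 0.9 shrinks to {diluted:.3f}")
print(f"conflicting signs: 0.9 shrinks to {cancelled:.3f}")
```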
TIES-Merging first decomposes each task vector into a sign vector and a magnitude vector: tau_t = gamma_t ⊙ mu_t, where ⊙ is the elementwise product, gamma_t = sgn(tau_t) holds values in {+1, 0, -1}, and mu_t = |tau_t| holds the absolute magnitudes [1]. Given n task vectors, the method then applies the following steps, summarized by Algorithm 1 of the paper.
In the trim step, each task vector is processed independently: only the entries with the largest magnitude are kept and the rest are reset to zero. The paper keeps the top-k% of entries by magnitude and trims the bottom (100 - k)%, producing a sparsified task vector tau_hat_t [1]. Setting a trimmed entry to zero is equivalent to resetting that parameter to its pre-trained value, so trimming does not perturb the initialization. The authors find that keeping only the top 20% of entries (k = 20) leaves single-task performance essentially unchanged, which indicates that the discarded changes are largely redundant [1].
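A minimal sketch of magnitude-based trimming on one flattened task vector follows. The function name `trim`, the rounding of the kept count, and the tie handling are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np

def trim(tau, k=20.0):
    """Keep the top-k% of entries of `tau` by magnitude and zero the rest.

    Assumption: the kept count is rounded to the nearest integer, and
    entries tied with the threshold are all kept.
    """
    mags = np.abs(tau).ravel()
    n_keep = int(round(mags.size * k / 100.0))
    if n_keep == 0:
        return np.zeros_like(tau)
    # After partitioning, index (size - n_keep) holds the n_keep-th largest magnitude.
    kth = mags.size - n_keep
    threshold = np.partition(mags, kth)[kth]
    return np.where(np.abs(tau) >= threshold, tau, 0.0)

# Ten illustrative entries; with k = 20 only the two largest magnitudes survive.
tau = np.array([0.05, -0.9, 0.1, 0.02, 0.6, -0.01, 0.03, -0.04, 0.2, 0.0])
tau_hat = trim(tau, k=20.0)
```

Here the surviving entries are -0.9 and 0.6. Every other position becomes zero, which corresponds to leaving that parameter at its pre-trained value.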
The elect-sign step resolves sign disagreement before any averaging. For each parameter, the method sums the signed trimmed task vectors across all models and takes the sign of that sum, electing the direction with the greatest total magnitude across models: gamma_m = sgn(sum_t tau_hat_t) [1]. Because the election is weighted by magnitude rather than decided by counting models, a single large update can outweigh several small opposing ones. The resulting aggregate sign vector gamma_m records, for every parameter, whether the dominant movement across the merged models is positive or negative.
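The election rule gamma_m = sgn(sum_t tau_hat_t) is a one-liner once the trimmed task vectors are stacked row-wise. The sketch below uses an illustrative function name, `elect_sign`, and made-up values.

```python
import numpy as np

def elect_sign(trimmed):
    """gamma_m = sgn(sum_t tau_hat_t), computed independently per parameter.

    `trimmed` has shape (n_models, n_params): one trimmed task vector per row.
    """
    return np.sign(np.sum(trimmed, axis=0))

# Three trimmed task vectors over three parameters (illustrative values).
trimmed = np.array([
    [ 0.9, 0.0, -0.3],
    [-0.3, 0.5,  0.0],
    [-0.3, 0.4,  0.1],
])
gamma_m = elect_sign(trimmed)
```

For the first parameter two of the three models vote negative, yet the elected sign is positive because the single +0.9 update outweighs the combined -0.6. The second parameter elects positive and the third elects negative.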
In the disjoint-merge step, each parameter averages only the trimmed values whose sign agrees with the elected sign, ignoring the disagreeing entries and the zeros left by trimming. Formally, for parameter p the agreeing set is A_p = {t : sgn(tau_hat_t,p) = gamma_m,p}, and the merged value is the disjoint mean tau_m,p = (1 / |A_p|) * sum over t in A_p of tau_hat_t,p [1]. Because the mean is taken only over the agreeing, non-zero entries, a parameter influenced by a single model is not diluted by averaging in many near-zero or opposing values, which is the key difference from a plain sum or average.
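The disjoint mean can be sketched with a boolean agreement mask. The function name `disjoint_mean` is illustrative, the elected signs are assumed to have been computed already, and leaving a parameter at zero when no entry agrees is an assumption of this sketch.

```python
import numpy as np

def disjoint_mean(trimmed, gamma_m):
    """Per parameter, average only the non-zero entries whose sign equals gamma_m."""
    agree = (np.sign(trimmed) == gamma_m) & (trimmed != 0)   # membership in A_p
    count = agree.sum(axis=0)                                # |A_p|
    total = np.where(agree, trimmed, 0.0).sum(axis=0)
    # Assumption: where A_p is empty the merged entry stays at zero,
    # i.e. the parameter keeps its pre-trained value.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

# Trimmed task vectors (rows) and their elected signs, illustrative values.
trimmed = np.array([
    [ 0.9, 0.0, -0.3],
    [-0.3, 0.5,  0.0],
    [-0.3, 0.4,  0.1],
])
gamma_m = np.array([1.0, 1.0, -1.0])

tau_m = disjoint_mean(trimmed, gamma_m)
plain_mean = trimmed.mean(axis=0)
```

For the first parameter the disjoint mean keeps the full 0.9, whereas the plain mean over all three rows is only 0.1. For the second parameter the zero left by trimming is excluded, so the result is the mean of 0.5 and 0.4.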
The merged task vector tau_m is finally scaled and added to the initialization to produce the merged model: theta_m = theta_init + lambda * tau_m, where lambda is a scaling hyperparameter, as in task arithmetic [1][2]. When a validation set is available, the authors tune k and lambda on it. When no validation set is available, they recommend a fixed recipe of keeping the top 20% of entries and setting lambda = 1, which can be applied in any setting without per-task tuning [1].
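Putting the steps together, the following is a compact end-to-end sketch under simplifying assumptions: every model is a single flattened 1-D weight vector, the kept count is rounded with at least one entry retained, and ties at the threshold are kept. The function name `ties_merge` is illustrative, and the code is not the authors' reference implementation.

```python
import numpy as np

def ties_merge(theta_init, finetuned, k=20.0, lam=1.0):
    """Trim, elect sign, disjoint-merge, then theta_m = theta_init + lam * tau_m."""
    # Task vectors, one per row: shape (n_models, n_params).
    taus = np.stack([theta_ft - theta_init for theta_ft in finetuned])

    # Trim: keep the top-k% of each task vector by magnitude.
    n_keep = max(1, int(round(taus.shape[1] * k / 100.0)))
    mags = np.abs(taus)
    thresh = np.sort(mags, axis=1)[:, -n_keep][:, None]   # per-row threshold
    trimmed = np.where(mags >= thresh, taus, 0.0)

    # Elect sign: direction with the larger total magnitude per parameter.
    gamma_m = np.sign(trimmed.sum(axis=0))

    # Disjoint merge: mean over non-zero entries that agree with gamma_m.
    agree = (np.sign(trimmed) == gamma_m) & (trimmed != 0)
    count = agree.sum(axis=0)
    total = np.where(agree, trimmed, 0.0).sum(axis=0)
    tau_m = np.divide(total, count, out=np.zeros_like(total), where=count > 0)

    # Scale and add back to the shared initialization.
    return theta_init + lam * tau_m

# Toy example: two hypothetical fine-tuned models over five parameters.
theta_init = np.zeros(5)
theta_a = np.array([1.0, 0.1, 0.0, -0.05, 0.02])
theta_b = np.array([-0.6, 0.02, 0.8, 0.01, 0.0])

# k = 40 keeps two of the five entries per task vector in this toy setting.
theta_m = ties_merge(theta_init, [theta_a, theta_b], k=40.0, lam=1.0)
```

In this example the first parameter receives +1.0 from one model and -0.6 from the other. The positive sign is elected and the merged value stays at 1.0, where a plain average would give 0.2. The defaults k = 20 and lam = 1 mirror the validation-free recipe described above.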
The paper evaluates TIES-Merging across modalities, model families, model sizes, and both full fine-tuning and parameter-efficient fine-tuning (PEFT) [1]. The PEFT setting merges (IA)^3 adapters on the T0-3B model over 11 tasks; the NLP setting merges full T5-base and T5-large models over 7 tasks; and the vision setting merges CLIP-style ViT-B/32 and ViT-L/14 image encoders over 8 image-classification datasets [1].
The table below reports average in-domain accuracy (%) when a validation set is used to tune hyperparameters [1]. The fine-tuned and multitask rows are reference points, not merging methods: the former is a per-task upper bound, the latter jointly trains on all tasks.
| Method | (IA)^3, 11 tasks | T5-base, 7 tasks | T5-large, 7 tasks | ViT-B/32, 8 tasks | ViT-L/14, 8 tasks |
|---|---|---|---|---|---|
| Fine-tuned (per-task upper bound) | 71.4 | 82.8 | 88.8 | 90.5 | 94.2 |
| Multitask training | 73.1 | 83.6 | 88.1 | 88.9 | 93.5 |
| Fisher Merging | 62.2 | 68.9 | 64.6 | 68.3 | 82.2 |
| RegMean | 58.0 | 71.2 | 73.2 | 71.8 | 83.7 |
| Task Arithmetic | 63.9 | 73.2 | 73.3 | 70.1 | 84.5 |
| TIES-Merging | 66.4 | 73.9 | 76.9 | 73.6 | 86.0 |
With a validation set, TIES-Merging is the strongest merging method in every column, beating the next-best baseline by an average of about 2.3% absolute in the NLP settings and 1.7% absolute in the vision settings, and by 2.5% on the 11-task (IA)^3 benchmark [1]. The headroom to the per-task and multitask references shows that merging still trails dedicated training, but TIES narrows that gap relative to prior merging methods.
The advantages persist without a validation set. Using the fixed top-20%, lambda = 1 recipe, TIES-Merging improves over the best baseline by 6.6% on ViT-B/32 and 2.7% on ViT-L/14, and remains ahead on T5-large, although on T5-base its untuned recipe slightly trails task arithmetic [1]. For out-of-domain generalization, where T5 models merged on seven tasks are evaluated on six held-out datasets from the T0 mixture, TIES-Merging beats the strongest baseline by 1.0% absolute for T5-base and 4.4% absolute for T5-large [1]. The paper also shows that the margin over task arithmetic grows as the number of merged tasks increases, consistent with its claim that interference compounds with more models, and ablations confirm that the sign-election step is the most important contributor [1].
TIES-Merging extends the task-vector framework of task arithmetic: both add scaled task vectors to a shared initialization, but TIES first trims redundant entries and reconciles sign conflicts so that the sum preserves influential parameters rather than cancelling them [1][2]. Model soups average the weights of models fine-tuned on the same task with different hyperparameters, whereas TIES targets merging across different tasks; both rely on the same low-loss connectivity of fine-tuned weights [3]. Relative to Fisher Merging and RegMean, TIES needs neither Fisher-information estimates nor stored activation statistics, and it operates directly on the task vectors [1][5][6].
TIES-Merging is closely related to DARE (Drop And REscale), introduced by Yu et al. in "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" [4]. DARE is a sparsification step that randomly drops a large fraction of task-vector entries (often 90% or more) and rescales the survivors by 1 / (1 - p) to preserve their expected magnitude. Where TIES trims deterministically by magnitude, DARE drops stochastically; the two are complementary, and DARE is frequently applied as a preprocessing step before the elect-sign and disjoint-merge stages of TIES, a combination commonly labeled DARE-TIES [4].
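The drop-and-rescale operation can be sketched as follows. This is a minimal illustration assuming an independent Bernoulli drop per entry with probability p; the function name `dare` and the seeded generator are choices made for the example, not the implementation from Yu et al.

```python
import numpy as np

def dare(tau, p=0.9, rng=None):
    """Randomly zero each entry with probability p; rescale survivors by 1 / (1 - p)."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(tau.shape) >= p          # True with probability 1 - p
    return np.where(keep, tau / (1.0 - p), 0.0)

# Illustrative task vector of ones, so the effect on the mean is easy to read.
rng = np.random.default_rng(0)
tau = np.ones(100_000)
tau_dare = dare(tau, p=0.9, rng=rng)

drop_fraction = np.mean(tau_dare == 0.0)   # close to p
mean_after = tau_dare.mean()               # close to the original mean of 1.0
```

Because survivors are multiplied by 1 / (1 - p), the expected value of each entry is unchanged even though roughly 90% of the entries are zeroed. In a DARE-TIES style pipeline, this step would stand in for magnitude trimming before sign election and the disjoint mean.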
TIES-Merging is one of the standard recipes in open-source model-merging toolkits, most prominently mergekit, where it and its DARE-TIES variant are selectable merge methods alongside linear averaging and task arithmetic [7]. Through these tools the technique is widely used by the open-weights community to fuse multiple fine-tuned large language models, including specialist LoRA and adapter checkpoints, into single models that inherit several abilities at once. Because it requires only the model weights and no training, TIES-Merging is attractive for cheaply assembling multitask and multi-domain models and as a building block in continual and federated learning settings where data cannot be pooled [1][7].