TIES-Merging

Updated Jul 11, 2026

What is TIES-Merging?

TIES-Merging is a training-free model merging method that combines several models fine-tuned from a shared pre-trained checkpoint into one multitask model while explicitly resolving the interference between their parameter updates ^[1]. The name is an acronym for TrIm, Elect Sign, and Merge, the three operations applied in sequence to each model's task vector: trim small-magnitude changes, elect a single aggregate sign per parameter across the models, then average only the trimmed values that agree with the elected sign ^[1]. It was introduced in the 2023 paper "TIES-Merging: Resolving Interference When Merging Models" and reduces two distinct sources of interference (redundant parameters and sign disagreement) that cause simpler methods such as task arithmetic and plain weight averaging to lose information ^[1].

Overview

The method was introduced by Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal in "TIES-Merging: Resolving Interference When Merging Models", published at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023, proceedings volume 36) ^[1]. Yadav, Tam, Raffel, and Bansal were affiliated with the University of North Carolina at Chapel Hill, and Choshen with IBM Research and MIT ^[1].

The method targets a failure mode of simpler merging schemes such as task arithmetic: when the parameter changes from different fine-tuned models are summed, redundant updates and conflicting signs partially cancel useful information, shrinking the magnitude of influential parameters in the merged model and widening the gap to a multitask-trained model as more checkpoints are combined ^[1]. TIES-Merging inserts two cleanup operations, trimming and sign election, before averaging. Across language and vision benchmarks, the paper reports consistent gains over task arithmetic, simple weight averaging, Fisher Merging, and RegMean ^[1].

The paper frames the problem in terms of two interference sources it sets out to fix, stating in its abstract that "prior merging techniques inadvertently lose valuable information due to two major sources of interference: (a) interference due to redundant parameter values and (b) disagreement on the sign of a given parameter's values across models." ^[1]

What is model merging and what are task vectors?

Model merging builds a single set of weights by combining multiple existing models in parameter space, with no further gradient training and, in the common case, no access to the original training data. It works because models fine-tuned from the same pre-trained initialization tend to lie in a connected, low-loss region of weight space, so their weights can be interpolated or added without catastrophic loss ^[1]. Merging is used to create one multitask model from many single-task models, to improve robustness and out-of-domain generalization, and to compress a collection of checkpoints into one artifact.

A useful abstraction for merging fine-tuned models is the task vector, introduced by Ilharco et al. for task arithmetic ^[2]. For a model fine-tuned on task $t$, the task vector is the elementwise difference between the fine-tuned weights and the initialization: $\tau_t = \theta_{\text{ft},t} - \theta_{\text{init}}$. The task vector isolates the change produced by fine-tuning. Task arithmetic merges $n$ models by summing their task vectors and adding the result back to the initialization with a scaling coefficient $\lambda$: $\theta_{\text{merged}} = \theta_{\text{init}} + \lambda \sum_t \tau_t$ ^[2]. Simple weight averaging, as used by model soups, is the special case $\lambda = 1/n$, in which the task vectors are averaged rather than summed and the result equals the mean of the fine-tuned weights ^[3]. More elaborate baselines include Fisher Merging, which weights each parameter by an approximate Fisher information estimate of its importance ^[5], and RegMean, which solves a closed-form least-squares problem per linear layer to match the merged model's activations to those of the individual models ^[6].
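The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from either paper: the `Weights` type (a dict of flat float lists standing in for a state dict), the function names, and the toy numbers are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: tensor name -> flat list of values.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, init: Weights) -> Weights:
    """tau_t = theta_ft - theta_init, computed elementwise per tensor."""
    return {
        name: [ft - base for ft, base in zip(finetuned[name], init[name])]
        for name in init
    }


def task_arithmetic(init: Weights, task_vectors: List[Weights], lam: float) -> Weights:
    """theta_merged = theta_init + lam * sum_t tau_t."""
    merged = {}
    for name, base in init.items():
        # Sum the task vectors parameter by parameter.
        summed = [sum(vals) for vals in zip(*(tv[name] for tv in task_vectors))]
        merged[name] = [b + lam * s for b, s in zip(base, summed)]
    return merged


# Toy weights (illustrative only).
theta_init = {"w": [0.0, 1.0, -1.0]}
theta_a = {"w": [0.5, 1.0, -1.5]}
theta_b = {"w": [-0.25, 2.0, -1.0]}

taus = [task_vector(theta_a, theta_init), task_vector(theta_b, theta_init)]
summed_model = task_arithmetic(theta_init, taus, lam=1.0)
# lam = 1/n turns the sum into an average of the fine-tuned weights.
averaged_model = task_arithmetic(theta_init, taus, lam=1.0 / len(taus))
```

With `lam = 1/n` the result coincides with the elementwise mean of `theta_a` and `theta_b`, which is the weight-averaging special case.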

TIES-Merging identifies two sources of interference that these methods leave unaddressed ^[1]:

Interference from redundant parameters. Fine-tuning changes many parameters by only a small amount, and these changes have little effect on the task's own performance. When such a redundant value is summed with a large, influential value from another model, it acts as noise that dilutes the influential value.
Interference from sign disagreement. A given parameter may receive a positive update from one task and a negative update from another. Averaging or summing these opposing values shrinks the parameter toward zero and can harm performance on both tasks. The authors show that sign conflicts arise even when merging only two models and become more frequent as the number of models grows ^[1].
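Both effects are easy to see on a toy example. The sketch below uses made-up numbers (not taken from the paper) with three task vectors over three parameters: the first parameter matters to only one model, the second carries only tiny updates, and the third has a genuine sign conflict. The helper names are hypothetical.

```python
from typing import List


def sign(x: float) -> int:
    """Return +1, 0, or -1."""
    return (x > 0) - (x < 0)


def has_sign_conflict(values: List[float]) -> bool:
    """True if the nonzero updates to one parameter disagree in sign."""
    signs = {sign(v) for v in values if v != 0}
    return len(signs) > 1


# Rows are task vectors, columns are parameters (toy numbers).
task_vectors = [
    [0.75, 0.01, 0.5],
    [0.0, -0.02, -0.5],
    [0.0, 0.01, 0.25],
]

per_parameter = list(zip(*task_vectors))
plain_mean = [sum(col) / len(col) for col in per_parameter]
conflicts = [has_sign_conflict(list(col)) for col in per_parameter]

# Parameter 0: one influential update (0.75) is diluted to 0.25 by two zeros.
# Parameter 2: opposing updates of +0.5 and -0.5 cancel, leaving only a
# fraction of the remaining +0.25 update.
```

Plain averaging shrinks the first parameter to a third of its influential value even though no model disagrees about it, and the third parameter loses most of its magnitude to cancellation.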

How does TIES-Merging work?

TIES-Merging first decomposes each task vector into a sign vector and a magnitude vector: $\tau_t = \gamma_t \odot \mu_t$, where $\gamma_t = \mathrm{sgn}(\tau_t)$ holds values in $\{+1, 0, -1\}$ and $\mu_t = \lvert \tau_t \rvert$ holds the absolute magnitudes ^[1]. Given $n$ task vectors, the method then applies the following three steps, summarized by Algorithm 1 of the paper.
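As a quick check of the decomposition, the following sketch splits a toy task vector into its sign and magnitude parts and multiplies them back together. The values and the helper name `sgn` are illustrative assumptions.

```python
def sgn(x: float) -> int:
    """Return +1, 0, or -1."""
    return (x > 0) - (x < 0)


tau = [0.5, -0.25, 0.0, 2.0]           # toy task vector
gamma = [sgn(v) for v in tau]          # sign vector, entries in {+1, 0, -1}
mu = [abs(v) for v in tau]             # magnitude vector
reconstructed = [g * m for g, m in zip(gamma, mu)]  # elementwise product
```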

1. Trim

For each task vector independently, keep only the entries with the largest magnitude and reset the rest to zero. The paper keeps the top-k% of entries by magnitude and trims the bottom (100 - k)%, producing a sparsified task vector $\hat{\tau}_t$ ^[1]. Setting a trimmed entry to zero is equivalent to resetting that parameter to its pre-trained value, so trimming never moves a parameter away from the initialization. The authors find that keeping only the top 20% of entries ($k = 20$) leaves single-task performance essentially unchanged, confirming that the discarded changes are largely redundant ^[1].
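A minimal sketch of the trim step on a flat list of values follows. Two details are assumptions of this sketch rather than anything the article specifies: the number of kept entries is rounded up with `math.ceil`, and ties in magnitude are broken by position.

```python
import math
from typing import List


def trim(tau: List[float], k_percent: float) -> List[float]:
    """Keep the top k% of entries by magnitude and zero out the rest."""
    if not 0.0 <= k_percent <= 100.0:
        raise ValueError("k_percent must be between 0 and 100")
    n_keep = math.ceil(len(tau) * k_percent / 100)  # rounding rule is an assumption
    # Indices ordered by descending magnitude; the sort is stable, so equal
    # magnitudes keep their original order.
    order = sorted(range(len(tau)), key=lambda i: abs(tau[i]), reverse=True)
    keep = set(order[:n_keep])
    return [v if i in keep else 0.0 for i, v in enumerate(tau)]


# Toy task vector with two influential entries among many small ones.
tau = [0.01, -0.9, 0.02, 0.0, 0.4, -0.03, 0.05, -0.01, 0.02, 0.6]
trimmed = trim(tau, k_percent=20)  # keeps 2 of 10 entries: -0.9 and 0.6
```

On real models the same operation would typically be applied with tensor operations over millions of parameters; whether the threshold is computed per tensor or over the whole flattened model is a design choice this sketch does not address.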

2. Elect Sign

Resolve sign disagreement before any averaging. For each parameter, the method sums the signed trimmed task vectors across all models and takes the sign of that sum, electing the direction with the greatest total magnitude across models: $\gamma_m = \mathrm{sgn}\left(\sum_t \hat{\tau}_t\right)$ ^[1]. The resulting aggregate sign vector $\gamma_m$ records, for every parameter, whether the dominant movement across the merged models is positive or negative.
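The election can be sketched as a column-wise sum followed by a sign. The toy input below is chosen so that, for the second parameter, two models vote negative but one larger positive update wins, showing that total magnitude rather than vote count decides. Function names and numbers are illustrative assumptions.

```python
from typing import List


def sgn(x: float) -> int:
    """Return +1, 0, or -1."""
    return (x > 0) - (x < 0)


def elect_sign(trimmed: List[List[float]]) -> List[int]:
    """gamma_m = sgn(sum_t tau_hat_t), computed per parameter."""
    totals = [sum(col) for col in zip(*trimmed)]
    return [sgn(total) for total in totals]


# Rows are trimmed task vectors, columns are parameters (toy numbers).
trimmed = [
    [0.5, -0.25, 0.0],
    [0.25, -0.5, 0.0],
    [-0.25, 1.0, 0.0],
]
elected = elect_sign(trimmed)  # column sums are 0.5, 0.25, 0.0
```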

3. Disjoint Merge

For each parameter, average only the trimmed values whose sign agrees with the elected sign, ignoring the disagreeing entries and the zeros left by trimming. Formally, for parameter $p$ the agreeing set is $A_p = \{t : \mathrm{sgn}(\hat{\tau}_{t,p}) = \gamma_{m,p}\}$, and the merged value is the disjoint mean $\tau_{m,p} = \frac{1}{\lvert A_p \rvert} \sum_{t \in A_p} \hat{\tau}_{t,p}$ ^[1]. When every trimmed value for a parameter is zero, the merged value is zero and the parameter stays at its pre-trained value. Because the mean is taken only over the agreeing, non-zero entries, a parameter influenced by a single model is not diluted by averaging in many near-zero or opposing values, which is the key difference from a plain sum or average.
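A sketch of the disjoint mean is shown below, with a plain mean computed alongside it for contrast. The elected signs are supplied by hand here to keep the example self-contained; in practice they would come from the sign-election step. All names and numbers are illustrative assumptions.

```python
from typing import List


def sgn(x: float) -> int:
    """Return +1, 0, or -1."""
    return (x > 0) - (x < 0)


def disjoint_merge(trimmed: List[List[float]], elected: List[int]) -> List[float]:
    """Average, per parameter, only the entries whose sign matches the elected sign."""
    merged = []
    for p, gamma in enumerate(elected):
        # Zeros never agree with a nonzero elected sign, so they are excluded.
        agreeing = [tv[p] for tv in trimmed if gamma != 0 and sgn(tv[p]) == gamma]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Rows are trimmed task vectors, columns are parameters (toy numbers).
trimmed = [
    [0.5, -0.25, 0.0],
    [0.25, -0.5, 0.0],
    [-0.25, 1.0, 0.0],
]
elected = [1, 1, 0]  # sign of each column sum, written out by hand

merged = disjoint_merge(trimmed, elected)
plain_mean = [sum(col) / len(col) for col in zip(*trimmed)]
```

For the second parameter the disjoint mean keeps the full value 1.0 from the one agreeing model, whereas the plain mean over all three models is 0.25 / 3.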

The merged task vector $\tau_m$ is finally scaled and added to the initialization to produce the merged model: $\theta_m = \theta_{\text{init}} + \lambda \tau_m$, where $\lambda$ is a scaling hyperparameter, as in task arithmetic ^[1]^[2]. When a validation set is available, the authors tune $k$ and $\lambda$ on it. When no validation set is available, they recommend a fixed recipe of keeping the top 20% of entries and setting $\lambda = 1$, which can be applied in any setting without per-task tuning ^[1].
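Putting the three steps and the final scaling together, the whole procedure can be sketched end to end on flat weight lists. This is an illustrative sketch under stated assumptions, not the authors' reference implementation: the function names, the `math.ceil` rounding in the trim step, and the toy weights are all hypothetical. The defaults mirror the no-validation recipe described above ($k = 20$, $\lambda = 1$).

```python
import math
from typing import List

Vector = List[float]


def sgn(x: float) -> int:
    """Return +1, 0, or -1."""
    return (x > 0) - (x < 0)


def trim(tau: Vector, k_percent: float) -> Vector:
    """Keep the top k% of entries by magnitude (rounded up) and zero the rest."""
    n_keep = math.ceil(len(tau) * k_percent / 100)
    order = sorted(range(len(tau)), key=lambda i: abs(tau[i]), reverse=True)
    keep = set(order[:n_keep])
    return [v if i in keep else 0.0 for i, v in enumerate(tau)]


def ties_merge(
    theta_init: Vector,
    finetuned: List[Vector],
    k_percent: float = 20.0,
    lam: float = 1.0,
) -> Vector:
    # Task vectors: tau_t = theta_ft - theta_init.
    task_vectors = [
        [ft - base for ft, base in zip(theta, theta_init)] for theta in finetuned
    ]
    # 1. Trim each task vector independently.
    trimmed = [trim(tau, k_percent) for tau in task_vectors]
    # 2. Elect one sign per parameter from the summed trimmed values.
    elected = [sgn(sum(col)) for col in zip(*trimmed)]
    # 3. Disjoint mean over the entries that agree with the elected sign.
    merged_tau = []
    for p, gamma in enumerate(elected):
        agreeing = [tv[p] for tv in trimmed if gamma != 0 and sgn(tv[p]) == gamma]
        merged_tau.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    # Scale and add back to the initialization.
    return [base + lam * t for base, t in zip(theta_init, merged_tau)]


# Toy weights; k = 40 keeps 2 of the 5 entries in each task vector.
theta_init = [0.5, 0.5, 0.5, 0.5, 0.5]
model_a = [1.5, 0.51, 0.0, 0.5, 0.52]
model_b = [0.0, 0.52, 0.5, 1.25, 0.51]

merged = ties_merge(theta_init, [model_a, model_b], k_percent=40, lam=1.0)
```

In this example the first parameter receives +1.0 from `model_a` and -0.5 from `model_b`. The elected sign is positive, so the merged update keeps the full +1.0 rather than the +0.5 a plain sum would give, and the small updates to the second and fifth parameters are trimmed away entirely.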

Why does TIES-Merging reduce interference?

The three steps each remove a specific way that prior methods waste information. Trimming deletes the small, redundant updates that would otherwise act as noise when summed against influential values from other models, which is safe because the paper shows the top 20% of entries carry essentially all the single-task performance ^[1]. Sign election commits every parameter to one direction before averaging, so a positive update from one task and a negative update from another no longer cancel toward zero ^[1]. The disjoint mean then averages only the agreeing, non-zero entries, so a parameter that matters to a single model keeps its magnitude instead of being diluted by many near-zero or opposing values ^[1]. Ablations in the paper confirm that the sign-election step is the single most important contributor to the gains, and the margin over task arithmetic grows as more tasks are merged, consistent with the claim that interference compounds with model count ^[1].

How well does TIES-Merging perform?

The paper evaluates TIES-Merging across modalities, model families, model sizes, and both full fine-tuning and parameter-efficient fine-tuning (PEFT) ^[1]. The PEFT setting merges (IA)^3 adapters on the T0-3B model over 11 tasks; the NLP setting merges full T5-base and T5-large models over 7 tasks; and the vision setting merges CLIP-style ViT-B/32 and ViT-L/14 image encoders over 8 image-classification datasets ^[1].

The table below reports average in-domain accuracy (%) when a validation set is used to tune hyperparameters ^[1]. The fine-tuned and multitask rows are reference points, not merging methods: the former is a per-task upper bound, the latter jointly trains on all tasks.

Method	(IA)^3, 11 tasks	T5-base, 7 tasks	T5-large, 7 tasks	ViT-B/32, 8 tasks	ViT-L/14, 8 tasks
Fine-tuned (per-task upper bound)	71.4	82.8	88.8	90.5	94.2
Multitask training	73.1	83.6	88.1	88.9	93.5
Fisher Merging	62.2	68.9	64.6	68.3	82.2
RegMean	58.0	71.2	73.2	71.8	83.7
Task Arithmetic	63.9	73.2	73.3	70.1	84.5
TIES-Merging	66.4	73.9	76.9	73.6	86.0

With a validation set, TIES-Merging is the strongest merging method in every column, beating the next-best baseline by an average of about 2.3% absolute in the NLP settings and 1.7% absolute in the vision settings, and by 2.5% on the 11-task (IA)^3 benchmark ^[1]. The headroom to the per-task and multitask references shows that merging still trails dedicated training, but TIES narrows that gap relative to prior merging methods.
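The per-column margins can be recomputed directly from the table. The sketch below copies the table's numbers and takes the best baseline in each column. Grouping the (IA)^3 column with the two T5 columns as "language" settings is an inference from the arithmetic, since that grouping is what reproduces the roughly 2.3-point average; it is not something the table itself states.

```python
# Average in-domain accuracy (%) copied from the table above.
columns = ["(IA)^3", "T5-base", "T5-large", "ViT-B/32", "ViT-L/14"]
baselines = {
    "Fisher Merging": [62.2, 68.9, 64.6, 68.3, 82.2],
    "RegMean": [58.0, 71.2, 73.2, 71.8, 83.7],
    "Task Arithmetic": [63.9, 73.2, 73.3, 70.1, 84.5],
}
ties = [66.4, 73.9, 76.9, 73.6, 86.0]

# Strongest baseline per column, then the TIES margin over it.
best_baseline = [max(col) for col in zip(*baselines.values())]
margins = [round(t - b, 1) for t, b in zip(ties, best_baseline)]

# Assumed grouping: first three columns are language settings, last two are vision.
language_avg = sum(margins[:3]) / 3
vision_avg = sum(margins[3:]) / 2

for name, margin in zip(columns, margins):
    print(f"{name}: +{margin}")
```

The margins come out to 2.5, 0.7, 3.6, 1.8, and 1.5 points. The three language columns average about 2.27 and the two vision columns average 1.65, matching the rounded figures quoted above.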

The advantages persist without a validation set. Using the fixed top-20%, $\lambda = 1$ recipe, TIES-Merging improves over the best baseline by 6.6% on ViT-B/32 and 2.7% on ViT-L/14, and remains ahead on T5-large, although on T5-base its untuned recipe slightly trails task arithmetic ^[1]. For out-of-domain generalization, where T5 models merged on seven tasks are evaluated on six held-out datasets from the T0 mixture, TIES-Merging beats the strongest baseline by 1.0% absolute for T5-base and 4.4% absolute for T5-large ^[1].

How does TIES-Merging compare to task arithmetic, model soups, and DARE?

TIES-Merging extends the task-vector framework of task arithmetic: both add scaled task vectors to a shared initialization, but TIES first trims redundant entries and reconciles sign conflicts so that the sum preserves influential parameters rather than cancelling them ^[1]^[2]. Model soups average the weights of models fine-tuned on the same task with different hyperparameters, whereas TIES targets merging across different tasks; both rely on the same low-loss connectivity of fine-tuned weights ^[3]. Relative to Fisher Merging and RegMean, TIES needs neither Fisher-information estimates nor stored activation statistics, and it operates directly on the task vectors ^[1]^[5]^[6].

TIES-Merging is closely related to DARE (Drop And REscale), introduced by Yu et al. in "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" ^[4]. DARE is a sparsification step that drops each task-vector entry at random with probability $p$ (often 0.9 or more) and rescales the survivors by $\frac{1}{1 - p}$ so that each entry keeps its original value in expectation. Where TIES trims deterministically by magnitude, DARE drops stochastically; the two are complementary, and DARE is frequently applied as a preprocessing step before the elect-sign and disjoint-merge stages of TIES, a combination commonly labeled DARE-TIES ^[4].
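The drop-and-rescale operation can be sketched as follows. This is an illustrative sketch, not the DARE authors' code or mergekit's implementation: the function name, the use of Python's `random.Random` for the Bernoulli mask, and the constant toy vector are assumptions.

```python
import random
from typing import List


def dare(tau: List[float], drop_rate: float, rng: random.Random) -> List[float]:
    """Drop each entry with probability drop_rate; rescale survivors by 1 / (1 - drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else v * scale for v in tau]


rng = random.Random(0)      # fixed seed so the sketch is repeatable
tau = [1.0] * 10_000        # toy task vector with constant entries
sparse = dare(tau, drop_rate=0.9, rng=rng)

kept_fraction = sum(1 for v in sparse if v != 0.0) / len(sparse)
mean_after = sum(sparse) / len(sparse)
```

Roughly one entry in ten survives and each survivor is scaled by about 10, so the mean of the sparsified vector stays close to the original mean of 1.0. In a DARE-TIES style pipeline, the output of such a step would take the place of the magnitude-trimmed vectors before sign election.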

Where is TIES-Merging used?

TIES-Merging is one of the standard recipes in open-source model-merging toolkits, most prominently mergekit, where it and its DARE-TIES variant are selectable merge methods alongside linear averaging and task arithmetic ^[7]. In mergekit, the DARE implementation explicitly offers a variant with the sign-election step of TIES (dare_ties) and one without it (dare_linear), reflecting that sign election is the distinguishing ingredient TIES contributes ^[7]. Through these tools the technique is widely used by the open-weights community to fuse multiple fine-tuned large language models, including specialist LoRA and adapter checkpoints, into single models that inherit several abilities at once. Because it requires only the model weights and no training, TIES-Merging is attractive for cheaply assembling multitask and multi-domain models and as a building block in continual and federated learning settings where data cannot be pooled ^[1]^[7].

References

1. Yadav, P., Tam, D., Choshen, L., Raffel, C., and Bansal, M. "TIES-Merging: Resolving Interference When Merging Models." Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2306.01708. https://arxiv.org/abs/2306.01708
2. Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., and Farhadi, A. "Editing Models with Task Arithmetic." International Conference on Learning Representations (ICLR 2023). arXiv:2212.04089. https://arxiv.org/abs/2212.04089
3. Wortsman, M., et al. "Model Soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time." International Conference on Machine Learning (ICML 2022). arXiv:2203.05482. https://arxiv.org/abs/2203.05482
4. Yu, L., Yu, B., Yu, H., Huang, F., and Li, Y. "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (DARE). International Conference on Machine Learning (ICML 2024). arXiv:2311.03099. https://arxiv.org/abs/2311.03099
5. Matena, M., and Raffel, C. "Merging Models with Fisher-Weighted Averaging." Advances in Neural Information Processing Systems 35 (NeurIPS 2022). arXiv:2111.09832. https://arxiv.org/abs/2111.09832
6. Jin, X., Ren, X., Preotiuc-Pietro, D., and Cheng, P. "Dataless Knowledge Fusion by Merging Weights of Language Models" (RegMean). International Conference on Learning Representations (ICLR 2023). arXiv:2212.09849. https://arxiv.org/abs/2212.09849
7. Goddard, C., et al. "Arcee's MergeKit: A Toolkit for Merging Large Language Models." arXiv:2403.13257. https://arxiv.org/abs/2403.13257
