Model merging is a technique in machine learning that combines the parameters of multiple trained models into a single unified model without requiring additional training. Rather than running several models during inference or retraining from scratch, model merging produces one set of weights that ideally inherits the strengths of each source model. The technique has gained significant traction in the large language model (LLM) community since 2022, enabling practitioners to create powerful multitask models by fusing specialized fine-tuned checkpoints at the weight level.
Model merging is sometimes called model fusion or weight-space ensembling. It differs from traditional ensemble learning in a fundamental way: ensembles combine the outputs of multiple models at inference time, while model merging combines the weights before inference, yielding a single model with no additional computational overhead at serving time.
Training large neural networks is computationally expensive. As the open-source AI community has grown, thousands of fine-tuned model variants have become publicly available on platforms like Hugging Face. Many of these models are fine-tuned from the same base checkpoint (for example, Llama 2 7B or Mistral 7B) but specialize in different tasks: one might excel at code generation, another at mathematical reasoning, and yet another at creative writing or instruction following.
Model merging offers a way to combine these specialized capabilities into a single model. The key advantages include:

- No additional training: merging is a post-hoc operation on existing checkpoints, requiring no gradient updates or training data.
- Single-model inference cost: the merged model has the same latency and memory footprint as any one of its sources.
- Reuse of community work: the thousands of publicly available fine-tunes become raw material for new models.
The theoretical foundation for model merging rests on the concept of linear mode connectivity. When multiple models are fine-tuned from the same pretrained initialization, they tend to converge to solutions that lie within the same basin of the loss function landscape. This means that if you linearly interpolate between the weights of two such models, the loss along that interpolation path remains low rather than spiking through a high-loss barrier.
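This interpolation is trivial to express in code. The sketch below treats checkpoints as dictionaries of flat weight lists (real checkpoints hold tensors, and the parameter name here is a placeholder):

```python
def interpolate_weights(weights_a, weights_b, t):
    """Linearly interpolate two checkpoints: (1 - t) * A + t * B.

    Both checkpoints map parameter names to flat lists of floats and
    must come from models with identical architectures.
    """
    return {name: [(1.0 - t) * a + t * b
                   for a, b in zip(weights_a[name], weights_b[name])]
            for name in weights_a}

# Toy stand-ins for two models fine-tuned from the same base.
model_a = {"layer.weight": [1.0, 2.0, 3.0]}
model_b = {"layer.weight": [5.0, 6.0, 7.0]}

midpoint = interpolate_weights(model_a, model_b, t=0.5)
```

Under linear mode connectivity, evaluating the model at every t in [0, 1] along this path shows no high-loss barrier between the two endpoints.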
This phenomenon was studied in depth by Frankle et al. (2020) in their work on linear mode connectivity and by Neyshabur et al. (2020) in research on transfer learning and loss landscapes. The key insight is that fine-tuning from a shared pretrained checkpoint constrains the models to a region of weight space where averaging or interpolating their parameters produces a model that is still performant.
Models derived from the same base checkpoint share the same fundamental representations learned during pretraining. Fine-tuning adjusts these representations only slightly for the target task, meaning the parameter deltas (differences between fine-tuned and base weights) are typically small. Empirical observations show that these deltas often fall within a narrow range of approximately 0.002 in magnitude. Because the adjustments are small and the underlying representation space is shared, combining deltas from different fine-tuning runs can transfer skills without catastrophic interference.
Several methods have been proposed for merging model weights, ranging from simple averaging to sophisticated algorithms that resolve conflicts between parameter updates. The table below provides a summary before each method is discussed in detail.
| Method | Paper | Year | Venue | Key Idea | Pros | Cons |
|---|---|---|---|---|---|---|
| Model Soups | Wortsman et al. | 2022 | ICML | Average weights of models fine-tuned with different hyperparameters | Simple; improves accuracy and robustness over best single model | No guarantee of improvement; ignores parameter conflicts |
| Task Arithmetic | Ilharco et al. | 2022 | ICLR 2023 | Compute task vectors (fine-tuned minus base weights) and add them | Intuitive; supports negation and analogy operations | Ignores interference between task vectors |
| TIES-Merging | Yadav et al. | 2023 | NeurIPS 2023 | Trim small changes, elect sign, merge aligned parameters | Resolves sign conflicts; works across modalities | More complex; requires density hyperparameter |
| DARE | Yu et al. | 2023 | ICML 2024 | Randomly drop delta parameters and rescale remaining ones | Eliminates up to 90-99% redundant deltas; reduces interference | Random pruning may discard important parameters |
| SLERP | Shoemake (1985); adapted for LLMs | 2023 | N/A | Spherical linear interpolation preserving directional information | Smooth interpolation; preserves weight geometry | Limited to merging two models at a time |
| Git Re-Basin | Ainsworth et al. | 2022 | ICLR 2023 | Permute neurons to align models before merging | Enables merging independently trained models | Computationally expensive; limited to smaller models |
| Passthrough (Frankenmerge) | Community method | 2023 | N/A | Concatenate layers from different models | Creates models with novel parameter counts | Experimental; no theoretical guarantees |
| Evolutionary Merging | Akiba et al. (Sakana AI) | 2024 | Nature Machine Intelligence | Use evolutionary algorithms to optimize merge recipes | Automated; discovers non-obvious combinations | Requires evaluation budget; slow search |
The Model Soups method, introduced by Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, and colleagues in their 2022 ICML paper "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time," was one of the first works to demonstrate the effectiveness of simple weight averaging for fine-tuned models.
The core idea is straightforward: instead of selecting the single best model from a hyperparameter sweep and discarding the rest, average the weights of multiple fine-tuned models. The authors proposed two strategies:

- Uniform soup: average the weights of every model in the sweep.
- Greedy soup: sort the models by held-out validation accuracy, then add each one to the soup in turn, keeping it only if the averaged model's validation accuracy does not decrease.
The greedy soup approach consistently outperformed both the best individual model and the uniform soup. Using a ViT-G model pre-trained on JFT, the greedy soup achieved 90.94% top-1 accuracy on ImageNet. The method also improved out-of-distribution robustness and zero-shot transfer performance on new downstream tasks.
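The greedy strategy is simple enough to sketch directly. The version below assumes a caller-supplied `accuracy_fn` for held-out evaluation and represents checkpoints as dictionaries of flat weight lists:

```python
def greedy_soup(models, accuracy_fn):
    """Greedy soup (Wortsman et al., 2022): starting from the best model,
    add each candidate to the running weight average only if held-out
    accuracy does not drop.

    models: weight dicts ({name: list of floats}) sorted by individual
    held-out accuracy, best first. accuracy_fn scores a weight dict.
    """
    def average(ingredients):
        n = len(ingredients)
        return {name: [sum(m[name][i] for m in ingredients) / n
                       for i in range(len(ingredients[0][name]))]
                for name in ingredients[0]}

    soup = [models[0]]                       # start from the best model
    best_acc = accuracy_fn(average(soup))
    for candidate in models[1:]:
        trial_acc = accuracy_fn(average(soup + [candidate]))
        if trial_acc >= best_acc:            # keep only helpful ingredients
            soup.append(candidate)
            best_acc = trial_acc
    return average(soup)
```

Because harmful candidates are rejected, the greedy soup can never score worse on the held-out set than the best individual model.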
Model Soups showed that weight averaging is not merely a curiosity but a practical technique that can rival or surpass traditional ensembles while maintaining the inference cost of a single model.
Task Arithmetic, introduced by Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi in their paper "Editing Models with Task Arithmetic" (ICLR 2023), formalized the concept of task vectors and demonstrated that arithmetic operations on these vectors produce predictable changes in model behavior.
A task vector is computed by subtracting the base pretrained weights from the fine-tuned weights:
task_vector = weights_finetuned - weights_pretrained
This vector represents the direction in weight space that improves performance on a given task. The authors showed three key operations:
Addition. Adding multiple task vectors together and applying the combined vector to the base model improves performance on all corresponding tasks simultaneously. Given task vectors for tasks A and B, the merged model is: weights_merged = weights_pretrained + alpha * (task_vector_A + task_vector_B), where alpha is a scaling coefficient.
Negation. Subtracting a task vector from the base model decreases performance on that task while leaving other capabilities relatively intact. This can be used to remove unwanted behaviors, such as reducing toxicity.
Analogy. When tasks share an analogy relationship ("A is to B as C is to D"), combining task vectors from three of the tasks can improve performance on the fourth, even without training data for that fourth task.
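The addition and negation operations are a few lines of arithmetic on weight dictionaries. The toy sketch below uses flat lists of floats rather than tensors and is not the authors' implementation:

```python
def task_vector(finetuned, base):
    """Task vector: fine-tuned weights minus pretrained weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}

def apply_task_vectors(base, vectors, alpha=1.0):
    """Add alpha * (sum of task vectors) to the base weights.
    A negative alpha implements the negation operation."""
    merged = {k: list(v) for k, v in base.items()}
    for vec in vectors:
        for k in vec:
            merged[k] = [w + alpha * d for w, d in zip(merged[k], vec[k])]
    return merged

# Toy example: two fine-tunes of the same (zero-initialized) base.
base = {"w": [0.0, 0.0]}
tv_a = task_vector({"w": [1.0, 0.0]}, base)   # "task A" direction
tv_b = task_vector({"w": [0.0, 2.0]}, base)   # "task B" direction

multitask = apply_task_vectors(base, [tv_a, tv_b], alpha=0.5)  # addition
forgotten = apply_task_vectors(base, [tv_a], alpha=-1.0)       # negation
```

Here `multitask` gains both task directions at once, while `forgotten` moves away from task A, mirroring the operations described above.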
Task Arithmetic was validated across multiple models, modalities, and tasks, including CLIP models on vision tasks and T5 models on NLP benchmarks. The simplicity of the approach made it a foundation for subsequent merging methods.
TIES-Merging (Trim, Elect Sign, and Merge), introduced by Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal at NeurIPS 2023, addresses a critical weakness of naive merging methods: interference between parameters from different models.
The authors identified two primary sources of interference:

- Redundant parameters: most delta values are small and contribute little to task performance, yet averaging them dilutes the influential changes coming from other models.
- Sign disagreement: different models may push the same parameter in opposite directions, so naive averaging cancels out both updates.
TIES-Merging resolves these issues through a three-step process:
Trim. Reset parameters that changed only a small amount during fine-tuning back to zero (relative to the base model). This is controlled by a density parameter that specifies what percentage of the largest-magnitude changes to retain.
Elect Sign. For each parameter position, determine the dominant sign across all models being merged by summing the magnitudes of positive and negative values. The sign with the greater total magnitude wins.
Merge. Average only the parameter values that agree with the elected sign. Parameters that disagree with the consensus sign are excluded from the average.
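The three steps can be sketched as follows. This toy version operates on flat lists of floats and applies a single global magnitude threshold per model, which is a simplification of the published algorithm:

```python
def ties_merge(base, finetuned_models, density=0.7):
    """TIES-Merging sketch (Yadav et al., 2023): trim, elect sign, merge.

    density: fraction of each model's largest-magnitude deltas to retain.
    """
    names = list(base)
    # Step 1 (Trim): zero out all but the top-density fraction of deltas.
    trimmed = []
    for model in finetuned_models:
        deltas = {k: [f - b for f, b in zip(model[k], base[k])] for k in names}
        flat = sorted((abs(d) for k in names for d in deltas[k]), reverse=True)
        keep = max(1, int(density * len(flat)))
        threshold = flat[keep - 1]
        trimmed.append({k: [d if abs(d) >= threshold else 0.0
                            for d in deltas[k]] for k in names})
    merged = {}
    for k in names:
        out = []
        for i in range(len(base[k])):
            values = [t[k][i] for t in trimmed]
            # Step 2 (Elect Sign): sign with greater total magnitude wins.
            pos = sum(v for v in values if v > 0)
            neg = -sum(v for v in values if v < 0)
            sign = 1.0 if pos >= neg else -1.0
            # Step 3 (Merge): average only deltas agreeing with that sign.
            agreeing = [v for v in values if v * sign > 0]
            delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
            out.append(base[k][i] + delta)
        merged[k] = out
    return merged
```

Note how a parameter that two models push in opposite directions keeps only the contributions matching the elected sign, rather than being averaged toward zero.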
This approach ensures that the merged model retains the most impactful parameter changes while avoiding destructive cancellation from sign conflicts. TIES-Merging outperformed simple averaging and Task Arithmetic across a range of modalities, domains, model sizes, and architectures.
DARE (Drop And REscale), introduced by Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li in their paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (ICML 2024), takes a different approach to reducing interference. Rather than using magnitude-based trimming, DARE randomly sparsifies the delta parameters.
The method works in two steps:

- Drop: randomly set each delta parameter to zero with probability p (the drop rate).
- Rescale: multiply the surviving deltas by 1 / (1 - p), so that the expected value of each delta is preserved.
The key empirical finding behind DARE is that fine-tuned delta parameters exhibit extreme redundancy. The authors demonstrated that 90% or even 99% of delta parameters can be eliminated without significantly degrading the fine-tuned model's performance. By sparsifying the deltas before merging, DARE dramatically reduces the chance of destructive interference between models.
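A minimal sketch of the drop-and-rescale step, using a toy weight representation and a seeded RNG for reproducibility:

```python
import random

def dare(base, finetuned, drop_rate=0.9, seed=0):
    """DARE sketch: drop each delta parameter with probability drop_rate,
    then rescale the survivors by 1 / (1 - drop_rate) so the expected
    value of every delta is unchanged."""
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_rate)
    sparsified = {}
    for name in base:
        deltas = [f - b for f, b in zip(finetuned[name], base[name])]
        kept = [0.0 if rng.random() < drop_rate else d * scale
                for d in deltas]
        sparsified[name] = [b + d for b, d in zip(base[name], kept)]
    return sparsified
```

Dividing by 1 - drop_rate keeps each delta unchanged in expectation, which is part of why performance survives even aggressive sparsification.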
DARE is not a standalone merging algorithm but rather a preprocessing step that can be combined with other merging methods. The two most common variants are:

- DARE-TIES: DARE sparsification followed by TIES-style sign election and merging.
- DARE-Linear: DARE sparsification followed by a simple weighted average, with no sign election.
Using DARE, the authors created a merged 7-billion-parameter model that achieved first place on the Open LLM Leaderboard at the time of publication.
SLERP (Spherical Linear Interpolation) was originally developed by Ken Shoemake in 1985 for animating rotations with quaternion curves. It was adapted for model merging by the open-source community as a way to interpolate between two sets of model weights while preserving geometric properties in high-dimensional space.
Standard linear interpolation (LERP) computes a weighted average: weights_merged = (1 - t) * weights_A + t * weights_B, where t is a value between 0 and 1. While simple, LERP can reduce the magnitude of the interpolated weight vectors in high-dimensional spaces, potentially losing important information.
SLERP addresses this by interpolating along the surface of a hypersphere rather than cutting through the interior. The process involves:

- Normalizing the two weight vectors to unit length.
- Computing the angle θ between them from their dot product.
- Interpolating along the connecting arc, weighting the vectors by sin((1 - t)θ)/sin(θ) and sin(tθ)/sin(θ) respectively.
Because SLERP respects the curvature of the weight space, it tends to produce smoother blends that better preserve the distinct characteristics of both source models. However, SLERP is limited to merging exactly two models at a time. To combine more than two models, practitioners use hierarchical merging, applying SLERP in successive pairs.
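A self-contained SLERP sketch over flat weight vectors; like practical implementations, it falls back to linear interpolation when the vectors are nearly parallel:

```python
import math

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_theta = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_theta = max(-1.0, min(1.0, cos_theta))   # guard against rounding
    theta = math.acos(cos_theta)
    if theta < eps:                              # nearly parallel: plain LERP
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    w_a = math.sin((1 - t) * theta) / math.sin(theta)
    w_b = math.sin(t * theta) / math.sin(theta)
    return [w_a * x + w_b * y for x, y in zip(a, b)]
```

At t = 0.5 between two orthogonal unit vectors, SLERP returns a unit-length result, whereas plain LERP would shrink the magnitude to roughly 0.71.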
SLERP has become one of the most popular merging methods in the community. The interpolation factor t can be set globally or configured per layer, allowing fine-grained control over which model's characteristics dominate in different parts of the network.
Git Re-Basin, introduced by Samuel K. Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa in 2022 (published at ICLR 2023), tackles a fundamental challenge: merging models that were trained independently, not from a shared pretrained checkpoint.
Neural networks have permutation symmetries in their hidden units. Two networks can represent the same function but with neurons in a different order. If you naively average the weights of two independently trained networks, the result is poor because corresponding neurons are not aligned.
Git Re-Basin proposes three algorithms to find a permutation of one model's neurons that aligns them with the other model's neurons, effectively placing both models in the same "basin" of the loss landscape. After this alignment step, standard weight averaging produces a much better merged model.
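The permutation symmetry itself is easy to demonstrate. In the toy two-layer network below, reordering the hidden units (rows of the first weight matrix together with the corresponding columns of the second) leaves the computed function unchanged even though the raw weights differ, which is exactly why naive averaging of unaligned networks fails:

```python
def mlp(x, w1, w2):
    """Tiny two-layer MLP: hidden = relu(W1 @ x); output = W2 @ hidden."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

# Model A: 2 inputs, 3 hidden units, 1 output.
w1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
w2 = [[2.0, -1.0, 0.5]]

# Model B computes the same function with its hidden units permuted:
# rows of W1 are reordered along with the matching columns of W2.
perm = [2, 0, 1]
w1_p = [w1[i] for i in perm]
w2_p = [[row[i] for i in perm] for row in w2]

x = [0.3, -0.7]
# mlp(x, w1, w2) equals mlp(x, w1_p, w2_p), yet averaging w1 with w1_p
# would blend unrelated neurons together.
```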
The authors demonstrated the first zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10. While the merged model performed better than naive interpolation, it did not quite match the performance of an output-space ensemble. Git Re-Basin is computationally expensive and has primarily been demonstrated on smaller models, but it represents an important theoretical contribution to understanding when and why merging works.
All the methods described above operate by combining corresponding parameters at each layer. Frankenmerge (also called passthrough in mergekit) takes a fundamentally different approach: instead of blending layers, it concatenates layers from different models to create a deeper network.
For example, given two 7-billion-parameter models with 32 transformer layers each, a frankenmerge might take all 32 layers from Model A and append the last 8 layers from Model B, producing a 40-layer model with approximately 9 billion parameters. The resulting model has an "exotic" parameter count that does not correspond to any standard architecture.
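Layer concatenation can be sketched as pure list slicing over per-layer weight blocks. The example below mirrors the 32 + 8 = 40 layer recipe just described, with string labels standing in for actual layer weights:

```python
def passthrough(layers_a, layers_b, slices):
    """Frankenmerge sketch: concatenate layer slices from two source
    models into a deeper stack, with no blending of weights.

    slices: list of (source, start, end) tuples, source being "a" or "b".
    """
    banks = {"a": layers_a, "b": layers_b}
    merged = []
    for source, start, end in slices:
        merged.extend(banks[source][start:end])
    return merged

# Two hypothetical 32-layer models.
model_a = [f"A{i}" for i in range(32)]
model_b = [f"B{i}" for i in range(32)]

# All 32 layers of A followed by the last 8 layers of B: a 40-layer model.
merged = passthrough(model_a, model_b, [("a", 0, 32), ("b", 24, 32)])
```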
This approach was inspired by Upstage's depth up-scaling technique used to create SOLAR-10.7B, which concatenated layers from a Llama 2 architecture to create a 48-layer model from two 32-layer sources, followed by continued pretraining. The community-driven frankenmerge technique skips the continued pretraining step, relying on the assumption that adjacent layers in transformer models perform similar functions and can be stacked without retraining.
Notable examples of frankenmerge models include:

- goliath-120b, which stacks layers from two Llama 2 70B fine-tunes (Xwin and Euryale) into a roughly 120-billion-parameter model.
- SOLAR-10.7B, produced with the closely related depth up-scaling recipe, though followed by continued pretraining, unlike a pure frankenmerge.
Frankenmerge is highly experimental and requires significant trial and error. There is no established formula for determining which layers to take from which model, and results can be unpredictable. Nonetheless, the community has produced impressive results with this technique.
In March 2024, Sakana AI, a Tokyo-based AI research lab founded by former Google Brain researchers, introduced evolutionary model merging in their paper "Evolutionary Optimization of Model Merging Recipes" (published in Nature Machine Intelligence in January 2025). The authors, Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha, proposed using evolutionary algorithms to automatically discover optimal merge configurations.
Prior to this work, model merging relied heavily on human intuition: practitioners manually selected which models to merge, which method to use, and what hyperparameters (such as interpolation weights and density values) to apply. Sakana AI's approach automates this process by treating the merge configuration as a search problem.
The evolutionary algorithm operates in two spaces:

- Parameter space: evolving the mixing coefficients (such as per-layer interpolation weights and densities) used to combine the source models' weights.
- Data flow space: evolving the inference path itself, that is, which layers from which source models a token passes through, and in what order.
Using this approach, the team created a Japanese Math LLM by merging a Japanese language model (Shisa-Gamma) with mathematics-focused models (WizardMath and Abel). The resulting model achieved state-of-the-art performance on Japanese LLM benchmarks, surpassing models with significantly more parameters, despite never being explicitly trained for Japanese mathematical reasoning.
They also created a culturally aware Japanese vision-language model that outperformed existing Japanese VLMs at describing culture-specific visual content. These results demonstrated that evolutionary search can discover merge recipes that human experts would be unlikely to find through manual experimentation.
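The parameter-space half of such a search can be illustrated with a toy sketch. The `fitness` function, model format, and mutation scheme below are all simplifying assumptions; the actual system uses CMA-ES and also searches the data flow space:

```python
import random

def evolve_merge_weights(models, fitness, generations=60, pop_size=8, seed=0):
    """Toy evolutionary search over per-model mixing coefficients for a
    weighted-average merge.

    models: list of weight dicts ({name: list of floats}).
    fitness: caller-supplied function scoring a merged weight dict.
    """
    rng = random.Random(seed)

    def merge(coeffs):
        total = sum(coeffs)
        return {name: [sum(c * m[name][i] for c, m in zip(coeffs, models)) / total
                       for i in range(len(models[0][name]))]
                for name in models[0]}

    # Start from random positive mixing coefficients.
    population = [[rng.random() + 1e-6 for _ in models] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda c: fitness(merge(c)), reverse=True)
        parents = ranked[: pop_size // 2]           # elitist selection
        children = [[max(1e-6, c + rng.gauss(0, 0.1)) for c in rng.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    best = max(population, key=lambda c: fitness(merge(c)))
    return merge(best)
```

Replacing the toy fitness with benchmark accuracy on a target task recovers the basic shape of automated merge-recipe search.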
Sakana AI later extended this line of work with CycleQD, a population-based approach using quality-diversity algorithms to produce diverse collections of merged models, each optimized for different capability profiles.
MergeKit is the most widely used open-source toolkit for model merging. It was created by Charles Goddard, a software engineer who had previously worked at NASA and Apple, and is now maintained by Arcee AI. The accompanying paper, "Arcee's MergeKit: A Toolkit for Merging Large Language Models" by Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz, was published at EMNLP 2024.
MergeKit supports the following merge methods:
| Method Name | MergeKit Identifier | Description |
|---|---|---|
| Linear (weighted average) | linear | Simple weighted average of model weights |
| SLERP | slerp | Spherical linear interpolation between two models |
| Task Arithmetic | task_arithmetic | Add scaled task vectors to a base model |
| TIES-Merging | ties | Trim, elect sign, and merge aligned parameters |
| DARE + TIES | dare_ties | DARE sparsification with TIES sign election |
| DARE + Linear | dare_linear | DARE sparsification with linear averaging |
| Passthrough | passthrough | Layer concatenation for frankenmerge |
MergeKit uses YAML configuration files to define merge recipes. A typical configuration specifies the source models, layer ranges, merge method, and method-specific parameters. For example, a SLERP merge configuration might look like:
```yaml
slices:
  - sources:
      - model: OpenPipe/mistral-ft-optimized-1218
        layer_range: [0, 32]
      - model: mlabonne/NeuralHermes-2.5-Mistral-7B
        layer_range: [0, 32]
merge_method: slerp
base_model: OpenPipe/mistral-ft-optimized-1218
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16
```
A key technical feature of MergeKit is its out-of-core processing approach, which allows merges to be performed with minimal memory. Merges can run entirely on CPU or be accelerated with as little as 8 GB of VRAM, making model merging accessible even without expensive hardware. MergeKit integrates with the Hugging Face Transformers library and Hub, enabling users to pull models directly from the Hub and upload merged results.
As of 2025, MergeKit has facilitated the creation of thousands of merged models, many of which have achieved top positions on open benchmarks.
The Hugging Face ecosystem provides extensive support for model merging beyond MergeKit: merged models can be uploaded and shared on the Hub like any other checkpoint, the Open LLM Leaderboard evaluates submitted merges (and flags them as such), and the PEFT library supports combining multiple LoRA adapters using merging strategies such as linear, TIES, and DARE.
The open-source community has produced numerous merged models that have achieved strong benchmark results. Below are some notable examples:
| Model | Creator | Method | Base Models | Parameters | Notable Achievement |
|---|---|---|---|---|---|
| goliath-120b | alpindale | Passthrough | Two Llama 2 70B fine-tunes (Xwin, Euryale) | 120B | One of the first successful frankenmerge models |
| SOLAR-10.7B | Upstage | Depth up-scaling | Llama 2 architecture | 10.7B | Top of Open LLM Leaderboard (Dec 2023); continued pretraining after merge |
| Marcoro14-7B-slerp | Maxime Labonne | SLERP | Marcoroni-7B-v3 + Mistral-7B-Merge-14-v0.1 | 7B | Best 7B model on Open LLM Leaderboard (Feb 2024) |
| Daredevil-7B | Maxime Labonne | DARE-TIES | Multiple Mistral-based models | 7B | Strong benchmark performance with DARE method |
| NeuralBeagle14-7B | Maxime Labonne | DARE-TIES | Multiple Mistral fine-tunes | 7B | High-performing community merge |
The proliferation of merged models on Hugging Face has been remarkable. At various points in 2023 and 2024, merged models occupied the top positions on the Open LLM Leaderboard, demonstrating that merging can compete with, and sometimes surpass, models produced through expensive training runs.
However, the community has also noted concerns about benchmark contamination: some fine-tuned models may have been trained on data that overlaps with benchmark test sets, and merging such models can propagate this contamination. For this reason, evaluation using diverse benchmarks, human evaluation, and LMSYS Chatbot Arena rankings is recommended alongside leaderboard scores.
Model merging and ensemble learning both aim to combine the strengths of multiple models, but they differ in their approach and trade-offs.
| Aspect | Model Merging | Ensemble Learning |
|---|---|---|
| Combination point | Weight space (before inference) | Output space (during inference) |
| Number of models at inference | 1 | Multiple |
| Inference cost | Same as a single model | Proportional to number of models |
| Memory requirement | Same as a single model | Proportional to number of models |
| Performance ceiling | Generally lower than ensembles | Generally higher |
| Flexibility | Requires same architecture | Can combine different architectures |
Ensembles typically achieve higher accuracy because they can capture diverse predictions and reduce variance through output averaging or voting. Model merging sacrifices some of this diversity by collapsing multiple models into one, but it gains practical advantages in deployment: a single merged model requires no special inference infrastructure, uses the same memory as any individual model, and has identical latency.
Research has shown that weight-space ensembles (model merging) can match or approach the performance of output-space ensembles in many settings, particularly when the source models share a common pretrained initialization. The Model Soups paper demonstrated this parity in several vision and NLP tasks.
Model merging has found applications across a variety of settings in the AI community.
The most common application is combining models that have been fine-tuned for different tasks or domains. For instance, merging an instruction-following model with a code-generation model and a mathematical reasoning model can produce a single model that handles all three types of queries. This eliminates the need for routing logic or running multiple specialized models.
As demonstrated by Model Soups, averaging weights across different hyperparameter configurations improves not only in-distribution accuracy but also out-of-distribution robustness. The merged model smooths over the idiosyncrasies of individual training runs, leading to more stable predictions across diverse inputs.
Task Arithmetic's negation operation enables a form of knowledge editing: by subtracting a task vector associated with an undesirable behavior (such as toxic language generation), practitioners can reduce that behavior in the merged model. This provides a lightweight alternative to full RLHF (Reinforcement Learning from Human Feedback) training for behavior modification.
Sakana AI's evolutionary merging demonstrated that models trained in different languages can be merged to create bilingual or multilingual models. Their Japanese Math LLM combined a Japanese language model with English-trained math models, transferring mathematical reasoning capabilities across language boundaries without explicit cross-lingual training.
In federated learning settings, multiple clients train local models on private data and then a central server merges these local models. Model merging techniques from the research community have direct applications in improving the quality of federated aggregation.
Model merging can help address catastrophic forgetting in continual learning scenarios. Rather than sequentially fine-tuning a model on new tasks (which risks forgetting earlier tasks), practitioners can fine-tune separate copies and merge them to retain capabilities across all tasks.
Despite its appeal, model merging has several important limitations.
Most merging methods require the source models to have identical architectures: the same number of layers, the same hidden dimensions, and the same number of attention heads. This is because merging operates on corresponding parameters at each position in the weight tensors. Models with different architectures cannot be merged using standard methods, though the passthrough technique offers a partial workaround by concatenating layers.
Recent research, such as "Transport and Merge: Cross-Architecture Merging for Large Language Models" (2025), has begun to explore techniques for merging models with different architectures, but these methods remain in early stages.
Merging works best when all source models are fine-tuned from the same base pretrained checkpoint. This shared initialization ensures that the models lie in the same basin of the loss landscape and that their parameter spaces are aligned. Merging models fine-tuned from different base checkpoints (for example, merging a Llama model with a Mistral model) typically produces poor results because the parameters are not in correspondence.
Git Re-Basin partially addresses this by aligning neuron orderings through permutation, but the technique is computationally expensive and has not been scaled to the largest modern models.
The quality of a merged model is difficult to predict in advance. Two models that individually perform well may produce a poor merge if their parameter updates interfere destructively. Conversely, merges that seem unlikely to work on paper sometimes produce surprisingly good results. This unpredictability means that model merging still involves significant trial and error.
As the number of models being merged increases, performance tends to degrade. Each additional model introduces more potential for interference and parameter conflicts. Methods like TIES-Merging and DARE mitigate this to some extent, but merging a large number of models (say, ten or more) into a single checkpoint remains challenging.
The field lacks a comprehensive theoretical framework that predicts when merging will succeed and to what degree. Most results are empirical, and the conditions under which merging provides reliable improvements are still being studied. This makes it difficult for practitioners to know in advance whether merging is the right approach for their use case.
Because model merging is so easy to perform, it has enabled a form of benchmark gaming on public leaderboards. Practitioners can quickly merge many combinations of models and submit the best-performing ones, potentially exploiting benchmark-specific weaknesses rather than producing genuinely better models. The community has responded by developing more robust evaluation frameworks and by flagging merged models on leaderboards.
Model merging is closely related to the broader field of model editing, which seeks to modify a trained model's knowledge or behavior without full retraining. Task Arithmetic explicitly frames merging as a model editing technique: task vectors can be added, subtracted, or combined to steer model behavior in specific directions.
The connection to knowledge composition is also significant. A 2025 paper, "Model Merging for Knowledge Editing," proposed a two-stage framework combining robust supervised fine-tuning with model merging to edit factual knowledge in LLMs. The first stage fine-tunes the model to internalize new knowledge, and the second stage merges the fine-tuned model with the original to preserve both the new knowledge and the model's general capabilities.
Model merging can also be viewed as a form of multi-task learning without explicit multi-task training. Traditional multi-task learning requires training on all tasks simultaneously, which introduces challenges like task balancing and data mixing. Model merging sidesteps these issues by allowing each task's model to be trained independently and combined afterward.
Based on community experience and published research, several practical guidelines have emerged:
Start with models from the same base. Merging is most likely to succeed when all source models are fine-tuned from the same pretrained checkpoint.
Use SLERP for two-model merges. When combining exactly two models, SLERP generally produces better results than linear interpolation because it preserves directional information in weight space.
Use DARE-TIES for multi-model merges. When combining three or more models, DARE-TIES tends to produce the best results by reducing interference through both sparsification and sign election.
Keep weights balanced. The sum of interpolation weights across models should generally fall between 0.9 and 1.1. Extreme weight imbalances can destabilize the merged model.
Set density below 0.5. For TIES and DARE methods, a density parameter below 0.5 (retaining less than half of the delta parameters) often works well, though values slightly above 0.5 can sometimes be better.
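As a concrete illustration of several of these guidelines, a MergeKit `dare_ties` configuration might look like the sketch below. The fine-tune names are hypothetical placeholders; note the densities at 0.5 and the interpolation weights summing to 1.0:

```yaml
models:
  - model: mistralai/Mistral-7B-v0.1
    # the base model contributes no task vector of its own
  - model: example/code-finetune-7b       # hypothetical fine-tune
    parameters:
      density: 0.5
      weight: 0.4
  - model: example/math-finetune-7b       # hypothetical fine-tune
    parameters:
      density: 0.5
      weight: 0.6
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
dtype: bfloat16
```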
Pay attention to first and last layers. The first and last layers of a transformer model tend to be the most sensitive. Some practitioners assign different interpolation weights to these layers.
Evaluate broadly. Do not rely on a single benchmark. Use multiple evaluation suites, including human evaluation when possible, to assess the quality of a merged model.
Iterate quickly. Model merging is fast. Take advantage of this by trying many configurations and evaluating each one, rather than spending time trying to find the optimal configuration analytically.
Model merging is an active area of research with several promising directions.
Cross-architecture merging. Enabling merging between models of different architectures and sizes would greatly expand the applicability of the technique. Early work in this area uses techniques like singular value decomposition (SVD) and transport methods to align parameter spaces across architectures.
Automated merge optimization. Building on Sakana AI's evolutionary approach, researchers are developing more sophisticated automated methods for discovering optimal merge recipes, including reinforcement learning-based and Bayesian optimization-based approaches.
Theoretical foundations. Developing a rigorous theoretical understanding of when and why merging works would allow practitioners to predict merge quality before performing the merge, saving time and compute.
Activation-informed merging. Rather than operating solely on static weights, newer methods use activation statistics from calibration data to guide the merging process, identifying which parameters are most important for preserving each model's capabilities.
Per-layer SVD merging. Research using per-layer singular value decomposition to isolate task-relevant directions in weight space has shown that singular vectors can guide merging to reduce destructive interference.
Merging beyond language models. While most current work focuses on LLMs, model merging has potential applications in computer vision, speech recognition, reinforcement learning, and multimodal models. The Model Soups paper already demonstrated benefits for vision models, and extending advanced merging techniques to other domains remains an open opportunity.