# Model merging

> Source: https://aiwiki.ai/wiki/model_merging
> Updated: 2026-06-23
> Categories: Deep Learning, Large Language Models, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Model merging** combines the parameters of multiple trained neural networks into a single unified model without any additional training. Rather than running several models during [inference](/wiki/inference) or retraining from scratch, model merging produces one set of weights that ideally inherits the strengths of each source model at no extra serving cost. The technique has gained significant traction in the [large language model](/wiki/large_language_model) (LLM) community since 2022, and the open-source toolkit [mergekit](/wiki/arcee_ai) alone has "facilitated the merging of thousands of models, contributing to some of the world's most powerful open-source model checkpoints."[7] At several points during 2023 and 2024, merged models occupied the top positions on the Open LLM Leaderboard, demonstrating that merging can rival or surpass models produced through expensive training runs.[4]

Model merging, sometimes called **model fusion** or **weight-space ensembling**, is a technique in [machine learning](/wiki/machine_learning). It differs from traditional [ensemble learning](/wiki/ensemble_learning) in a fundamental way: ensembles combine the *outputs* of multiple models at inference time, while model merging combines the *weights* before inference, yielding a single model with no additional computational overhead at serving time. Most methods require the source models to share the same architecture and, ideally, the same pretrained base checkpoint.

## What is model merging used for?

Training large neural networks is computationally expensive. As the open-source AI community has grown, thousands of fine-tuned model variants have become publicly available on platforms like [Hugging Face](/wiki/hugging_face). Many of these models are fine-tuned from the same base checkpoint (for example, [Llama 2](/wiki/llama) 7B or [Mistral](/wiki/mistral) 7B) but specialize in different tasks: one might excel at code generation, another at mathematical reasoning, and yet another at creative writing or instruction following.

Model merging offers a way to combine these specialized capabilities into a single model. The key advantages include:

- **No additional training required.** Merging operates directly on model weights, so it needs no [GPU](/wiki/gpu) compute for training, no training data, and no gradient updates.
- **Reduced inference cost.** Unlike ensembles, which must run multiple forward passes, a merged model runs a single forward pass at the same cost as any individual source model.
- **Rapid experimentation.** Practitioners can try many merge configurations in minutes rather than the hours or days required for fine-tuning.
- **Combination of diverse skills.** A merged model can inherit instruction-following ability from one source and domain expertise from another.

## Why does model merging work?

### Linear mode connectivity

The theoretical foundation for model merging rests on the concept of **linear mode connectivity**. When multiple models are fine-tuned from the same pretrained initialization, they tend to converge to solutions that lie within the same basin of the [loss function](/wiki/loss_function) landscape. This means that if you linearly interpolate between the weights of two such models, the loss along that interpolation path remains low rather than spiking through a high-loss barrier.

This phenomenon was studied in depth by Frankle et al. (2020) in their work on linear mode connectivity and the lottery ticket hypothesis,[9] and by Neyshabur et al. (2020) in research on [transfer learning](/wiki/transfer_learning) and loss landscapes. The key insight is that fine-tuning from a shared pretrained checkpoint constrains the models to a region of weight space where averaging or interpolating their parameters produces a model that is still performant.
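
The interpolation path and the loss barrier described above can be written compactly. The notation below (`\theta_A` and `\theta_B` for the two weight vectors, `\mathcal{L}` for the loss, `B` for the barrier height) is introduced here for illustration, and the barrier definition is one common formulation rather than the only one used in the literature.

```latex
\theta(\alpha) = (1-\alpha)\,\theta_A + \alpha\,\theta_B, \qquad \alpha \in [0, 1]

B(\theta_A, \theta_B) = \max_{\alpha \in [0,1]} \Big[ \mathcal{L}\big(\theta(\alpha)\big) - \big( (1-\alpha)\,\mathcal{L}(\theta_A) + \alpha\,\mathcal{L}(\theta_B) \big) \Big]
```

Under this formulation, two models are linearly mode connected when `B` is close to zero, which is the situation in which averaging their weights is expected to be safe.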

### Shared representation spaces

Models derived from the same base checkpoint share the same fundamental representations learned during pretraining. [Fine-tuning](/wiki/fine_tuning) adjusts these representations only slightly for the target task, meaning the parameter deltas (differences between fine-tuned and base weights) are typically small. The DARE authors report that supervised fine-tuning delta parameter values are "typically small (within 0.002) with extreme redundancy."[4] Because the adjustments are small and the underlying representation space is shared, combining deltas from different fine-tuning runs can transfer skills without catastrophic interference.

## What are the main model merging techniques?

Several methods have been proposed for merging model weights, ranging from simple averaging to sophisticated algorithms that resolve conflicts between parameter updates. The table below provides a summary before each method is discussed in detail.

| Method | Paper | Year | Venue | Key Idea | Pros | Cons |
|---|---|---|---|---|---|---|
| Model Soups | Wortsman et al. | 2022 | ICML | Average weights of models fine-tuned with different hyperparameters | Simple; improves accuracy and robustness over best single model | No guarantee of improvement; ignores parameter conflicts |
| Task Arithmetic | Ilharco et al. | 2022 | ICLR 2023 | Compute task vectors (fine-tuned minus base weights) and add them | Intuitive; supports negation and analogy operations | Ignores interference between task vectors |
| TIES-Merging | Yadav et al. | 2023 | NeurIPS 2023 | Trim small changes, elect sign, merge aligned parameters | Resolves sign conflicts; works across modalities | More complex; requires density hyperparameter |
| DARE | Yu et al. | 2023 | ICML 2024 | Randomly drop delta parameters and rescale remaining ones | Can drop 90% or even 99% of delta parameters with little performance loss; reduces interference | Random pruning may discard important parameters |
| SLERP | Shoemake (1985); adapted for LLMs | 2023 | N/A | Spherical linear interpolation preserving directional information | Smooth interpolation; preserves weight geometry | Limited to merging two models at a time |
| Git Re-Basin | Ainsworth et al. | 2022 | ICLR 2023 | Permute neurons to align models before merging | Enables merging independently trained models | Computationally expensive; limited to smaller models |
| Passthrough (Frankenmerge) | Community method | 2023 | N/A | Concatenate layers from different models | Creates models with novel parameter counts | Experimental; no theoretical guarantees |
| Evolutionary Merging | Akiba et al. (Sakana AI) | 2024 | Nature Machine Intelligence | Use evolutionary algorithms to optimize merge recipes | Automated; discovers non-obvious combinations | Requires evaluation budget; slow search |

### Model Soups

The **Model Soups** method, introduced by Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, and colleagues in their 2022 ICML paper "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time," was one of the first works to demonstrate the effectiveness of simple weight averaging for fine-tuned models.[1] The paper's central observation is that "averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness."[1]

The core idea is straightforward: instead of selecting the single best model from a hyperparameter sweep and discarding the rest, average the weights of multiple fine-tuned models. The authors proposed two strategies:

- **Uniform (naive) soup:** Average the weights of all models from the hyperparameter sweep equally.
- **Greedy soup:** Sort models by validation performance, then iteratively add each model to the soup only if including it improves validation accuracy. If adding a model degrades performance, skip it.
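
The two strategies can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: weights are represented as plain dictionaries of float lists, and `evaluate` is a hypothetical callback that returns a validation score for a set of weights.

```python
from typing import Callable, Dict, List

Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Uniform soup: element-wise mean of same-shaped weight dictionaries."""
    n = len(models)
    first = models[0]
    return {
        name: [sum(m[name][i] for m in models) / n for i in range(len(first[name]))]
        for name in first
    }


def greedy_soup(models: List[Weights], evaluate: Callable[[Weights], float]) -> Weights:
    """Greedy soup: add models in order of validation score, keeping each
    candidate only if the averaged soup scores strictly better with it."""
    ranked = sorted(models, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best_score = evaluate(average_weights(ingredients))
    for candidate in ranked[1:]:
        score = evaluate(average_weights(ingredients + [candidate]))
        if score > best_score:
            ingredients.append(candidate)
            best_score = score
    return average_weights(ingredients)
```

In practice `evaluate` would load the averaged weights into the model and measure accuracy on a held-out validation set, so the greedy variant costs one evaluation per candidate.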

The greedy soup approach consistently outperformed both the best individual model and the uniform soup. Using a [ViT](/wiki/vision_transformer)-G model pre-trained on JFT, the greedy soup achieved 90.94% top-1 accuracy on [ImageNet](/wiki/imagenet), a new state of the art at the time.[1] The method also improved out-of-distribution robustness and zero-shot transfer performance on new downstream tasks.

Model Soups showed that weight averaging is not merely a curiosity but a practical technique that can rival or surpass traditional ensembles while maintaining the inference cost of a single model.

### Task Arithmetic

**Task Arithmetic**, introduced by Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi in their paper "Editing Models with Task Arithmetic" (ICLR 2023), formalized the concept of **task vectors** and demonstrated that arithmetic operations on these vectors produce predictable changes in model behavior.[2] A task vector, in the authors' words, "specifies a direction in the weight space of a pre-trained model, such that movement in that direction improves performance on the task."[2]

A task vector is computed by subtracting the base pretrained weights from the fine-tuned weights:

**task_vector = weights_finetuned - weights_pretrained**

This vector represents the direction in weight space that improves performance on a given task. The authors showed three key operations:[2]

1. **Addition.** Adding multiple task vectors together and applying the combined vector to the base model improves performance on all corresponding tasks simultaneously. Given task vectors for tasks A and B, the merged model is: **weights_merged = weights_pretrained + alpha * (task_vector_A + task_vector_B)**, where alpha is a scaling coefficient.

2. **Negation.** Subtracting a task vector from the base model *decreases* performance on that task while leaving other capabilities relatively intact. This can be used to remove unwanted behaviors, such as reducing toxicity.

3. **Analogy.** When tasks share an analogy relationship ("A is to B as C is to D"), combining task vectors from three of the tasks can improve performance on the fourth, even without training data for that fourth task.
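
The task-vector formula and the addition and negation operations can be sketched as follows. This is an illustrative toy, assuming weights stored as dictionaries of float lists; the function names are hypothetical and do not come from the paper's code.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, pretrained: Weights) -> Weights:
    """task_vector = weights_finetuned - weights_pretrained, per tensor."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def negate(vector: Weights) -> Weights:
    """Flip the direction of a task vector (used to suppress a behavior)."""
    return {name: [-v for v in values] for name, values in vector.items()}


def apply_task_vectors(pretrained: Weights, vectors: List[Weights], alpha: float = 1.0) -> Weights:
    """weights_merged = weights_pretrained + alpha * sum(task_vectors)."""
    merged = {}
    for name, base in pretrained.items():
        summed = [sum(v[name][i] for v in vectors) for i in range(len(base))]
        merged[name] = [b + alpha * s for b, s in zip(base, summed)]
    return merged
```

Addition passes several task vectors to `apply_task_vectors`, while negation passes `negate(task_vector(...))`. The scaling coefficient `alpha` is typically chosen on held-out data.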

Task Arithmetic was validated across multiple models, modalities, and tasks, including [CLIP](/wiki/clip) models on vision tasks and [T5](/wiki/t5) models on NLP benchmarks. The simplicity of the approach made it a foundation for subsequent merging methods.

### TIES-Merging

**TIES-Merging** (Trim, Elect Sign, and Merge), introduced by Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal at [NeurIPS](/wiki/neurips) 2023, addresses a critical weakness of naive merging methods: interference between parameters from different models.[3]

The authors identified two primary sources of interference:[3]

- **Redundant parameter values.** Many parameters change only slightly during fine-tuning and contribute noise rather than useful task-specific information.
- **Sign disagreement.** For a given parameter position, different fine-tuned models may push the value in opposite directions (one positive, one negative), and averaging these conflicting updates cancels out useful information.

TIES-Merging resolves these issues through a three-step process:

1. **Trim.** Zero out the task-vector entries that changed only slightly during fine-tuning, which restores those parameters to their base-model values. A density hyperparameter sets the fraction of largest-magnitude changes to retain.

2. **Elect Sign.** For each parameter position, determine the dominant sign across all models being merged by summing the magnitudes of positive and negative values. The sign with the greater total magnitude wins.

3. **Merge.** Average only the parameter values that agree with the elected sign. Parameters that disagree with the consensus sign are excluded from the average.

This approach ensures that the merged model retains the most impactful parameter changes while avoiding destructive cancellation from sign conflicts. The authors report that TIES-Merging "outperforms existing methods in diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings."[3]
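
The three steps can be illustrated on flat task vectors. The sketch below is a simplified reading of the procedure described above, not the authors' implementation: it works on plain Python lists, treats `density` as a fraction between 0 and 1, and returns a merged task vector that would still need to be added back to the base weights.

```python
from typing import List


def trim(task_vector: List[float], density: float) -> List[float]:
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = int(round(density * len(task_vector)))
    by_magnitude = sorted(
        range(len(task_vector)), key=lambda i: abs(task_vector[i]), reverse=True
    )
    keep = set(by_magnitude[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(task_vector)]


def ties_merge(task_vectors: List[List[float]], density: float) -> List[float]:
    """Trim each task vector, elect a sign per position, average agreeing values."""
    trimmed = [trim(tv, density) for tv in task_vectors]
    merged = []
    for values in zip(*trimmed):
        positive_mass = sum(v for v in values if v > 0)
        negative_mass = -sum(v for v in values if v < 0)
        sign = 1.0 if positive_mass >= negative_mass else -1.0
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged
```

For example, with a single position holding 0.9 in one task vector and -0.3 in another, a plain mean gives 0.3, whereas the sign election keeps only the dominant positive update and returns 0.9.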

### DARE

**DARE** (Drop And REscale), introduced by Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li in their paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (ICML 2024), takes a different approach to reducing interference.[4] Rather than using magnitude-based trimming, DARE randomly sparsifies the delta parameters.

The method works in two steps:

1. **Drop.** Randomly reset a proportion *p* of delta parameters (the difference between fine-tuned and pretrained weights) to zero.
2. **Rescale.** Multiply the remaining delta parameters by 1/(1-p) to maintain the expected magnitude of the original deltas.

The key empirical finding behind DARE is that fine-tuned delta parameters exhibit extreme redundancy. The authors demonstrated that DARE "can effortlessly eliminate 90% or even 99% of delta parameters" without significantly degrading the fine-tuned model's performance.[4] They also stress that the rescale operation is essential: dropping 30% or 40% of delta parameters without rescaling noticeably worsens results. By sparsifying the deltas before merging, DARE dramatically reduces the chance of destructive interference between models.
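
The drop and rescale steps amount to a few lines of code. The sketch below is a minimal illustration on a flat list of delta parameters; the function name and the `seed` argument are assumptions made for reproducibility, not part of the authors' released code. Rescaling by 1/(1-p) keeps the expected value of each delta unchanged, because a delta survives with probability 1-p.

```python
import random
from typing import List, Optional


def dare(delta: List[float], p: float, seed: Optional[int] = None) -> List[float]:
    """Drop each delta with probability p, then rescale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("drop rate p must satisfy 0 <= p < 1")
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else d * scale for d in delta]
```

The sparsified deltas are then added back to the pretrained weights, or passed to a merging method such as TIES or linear averaging.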

DARE is not a standalone merging algorithm but rather a preprocessing step that can be combined with other merging methods. The two most common variants are:

- **DARE-TIES:** Apply DARE sparsification, then merge using the TIES-Merging sign election and alignment procedure.
- **DARE-Linear:** Apply DARE sparsification, then merge using simple linear averaging (Task Arithmetic style).

Using DARE, the authors created a merged model, supermario_v2, which ranked first among 7-billion-parameter models on the Open LLM Leaderboard until January 28, 2024.[4]

### SLERP (Spherical Linear Interpolation)

**SLERP** (Spherical Linear Interpolation) was originally developed by Ken Shoemake in 1985 for animating rotations with quaternion curves.[10] It was adapted for model merging by the open-source community as a way to interpolate between two sets of model weights while preserving geometric properties in high-dimensional space.

Standard linear interpolation (LERP) computes a weighted average: **weights_merged = (1 - t) * weights_A + t * weights_B**, where t is a value between 0 and 1. While simple, LERP can reduce the magnitude of the interpolated weight vectors in high-dimensional spaces, potentially losing important information.
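
To see why the magnitude shrinks, consider the simplified case (an assumption made here for illustration) in which both weight vectors, written `a` and `b`, have unit length and are separated by an angle `\Omega`:

```latex
\|(1-t)\,a + t\,b\|^2 = (1-t)^2 + t^2 + 2t(1-t)\cos\Omega = 1 - 2t(1-t)(1-\cos\Omega)

t = \tfrac{1}{2}: \quad \big\|\tfrac{1}{2}(a+b)\big\| = \sqrt{\tfrac{1+\cos\Omega}{2}} = \cos\tfrac{\Omega}{2}
```

For orthogonal vectors (an angle of 90 degrees) the midpoint has a norm of about 0.71, so the blended vector is roughly 29% shorter than either input.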

SLERP addresses this by interpolating along the surface of a hypersphere rather than cutting through the interior. The process involves:

1. Normalize the input weight vectors to unit length.
2. Compute the angle between the vectors using their dot product.
3. Calculate scale factors using sine ratios based on the interpolation parameter t and the angle.
4. Combine the scaled vectors to produce the interpolated result.
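
The steps above can be sketched as follows. This is a minimal illustration on flat lists, under the assumption that normalization is used only to measure the angle and that the sine-based scale factors are then applied to the original vectors. The fallback to linear interpolation for nearly parallel vectors is a common numerical safeguard rather than something the steps above prescribe.

```python
import math
from typing import List


def lerp(a: List[float], b: List[float], t: float) -> List[float]:
    """Plain linear interpolation: (1 - t) * a + t * b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def slerp(a: List[float], b: List[float], t: float, eps: float = 1e-8) -> List[float]:
    """Spherical linear interpolation between two weight vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        return lerp(a, b, t)  # angle is undefined for a zero vector
    # Cosine of the angle between the normalized vectors, clamped for acos.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp(a, b, t)  # nearly parallel: the sine ratios are unstable
    scale_a = math.sin((1.0 - t) * omega) / sin_omega
    scale_b = math.sin(t * omega) / sin_omega
    return [scale_a * x + scale_b * y for x, y in zip(a, b)]
```

For two orthogonal unit vectors at `t = 0.5`, `slerp` returns a vector of unit length, whereas `lerp` returns one of length about 0.71.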

Because SLERP follows the arc between the two weight vectors rather than the straight chord, it avoids the shrinkage that LERP introduces and tends to produce smoother blends that better preserve the distinct characteristics of both source models. However, SLERP is defined for exactly two models at a time. To combine more than two models, practitioners merge hierarchically, applying SLERP to successive pairs.

SLERP has become one of the most popular merging methods in the community. The interpolation factor t can be set globally or configured per layer, allowing fine-grained control over which model's characteristics dominate in different parts of the network.

### Git Re-Basin

**Git Re-Basin**, introduced by Samuel K. Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa in 2022 (published at ICLR 2023), tackles a fundamental challenge: merging models that were trained independently, not from a shared pretrained checkpoint.[5]

Neural networks have **permutation symmetries** in their hidden units. Two networks can represent the same function but with neurons in a different order. If you naively average the weights of two independently trained networks, the result is poor because corresponding neurons are not aligned.

Git Re-Basin proposes three algorithms to find a permutation of one model's neurons that aligns them with the other model's neurons, effectively placing both models in the same "basin" of the loss landscape. After this alignment step, standard weight averaging produces a much better merged model.
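
The permutation symmetry that these algorithms exploit can be checked directly on a tiny network. The sketch below is only a demonstration of the symmetry, not one of the paper's alignment algorithms: it builds a hypothetical two-layer ReLU network, reorders its hidden units, and leaves the search for a good permutation out of scope.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def mlp(x: List[float], w1: Matrix, b1: List[float], w2: Matrix) -> List[float]:
    """Two-layer network: ReLU(w1 @ x + b1) followed by a linear layer w2."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def permute_hidden(
    w1: Matrix, b1: List[float], w2: Matrix, perm: List[int]
) -> Tuple[Matrix, List[float], Matrix]:
    """Reorder hidden units: rows of w1 and b1, and the matching columns of w2."""
    return (
        [w1[i] for i in perm],
        [b1[i] for i in perm],
        [[row[i] for i in perm] for row in w2],
    )
```

The permuted network computes the same function as the original even though its weight tensors differ position by position, which is why naive averaging of independently trained networks mixes unrelated neurons.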

The authors demonstrated the first zero-barrier linear mode connectivity between independently trained [ResNet](/wiki/resnet) models on [CIFAR-10](/wiki/cifar) and CIFAR-100, a result that improved as model width increased.[5] While the merged model performed better than naive interpolation, it did not quite match the performance of an output-space ensemble. Git Re-Basin is computationally expensive and has primarily been demonstrated on smaller models, but it represents an important theoretical contribution to understanding when and why merging works.

## What is a frankenmerge?

All the methods described above operate by combining corresponding parameters at each layer. **Frankenmerge** (also called **passthrough** in mergekit) takes a fundamentally different approach: instead of blending layers, it concatenates layers from different models to create a deeper network.

For example, given two 7-billion-parameter models with 32 [transformer](/wiki/transformer) layers each, a frankenmerge might take all 32 layers from Model A and append the last 8 layers from Model B, producing a 40-layer model with approximately 9 billion parameters. The resulting model has an "exotic" parameter count that does not correspond to any standard architecture.

This approach was inspired by Upstage's **depth up-scaling** technique used to create SOLAR-10.7B. SOLAR builds on the Llama 2 architecture initialized with Mistral 7B weights, then duplicates the 32-layer base model, trims 8 layers from each copy, and concatenates the remainder into a 48-layer network (s = 2 × (n − m) = 2 × (32 − 8) = 48, where n is the number of layers in the base model and m is the number trimmed from each copy), followed by continued pretraining.[8] The community-driven frankenmerge technique skips the continued pretraining step, relying on the assumption that adjacent layers in transformer models perform similar functions and can be stacked without retraining.
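
The layer arithmetic for both examples can be checked with a small helper. The function and model names below are hypothetical. For depth up-scaling, the sketch assumes that the trimmed layers are the ones adjacent to the seam (the last 8 of the first copy and the first 8 of the second), a detail the description above does not specify.

```python
from typing import List, Tuple

# (model name, first layer index, one past the last layer index)
Slice = Tuple[str, int, int]


def stack_layers(slices: List[Slice]) -> List[Tuple[str, int]]:
    """Concatenate layer ranges from one or more models into a single stack."""
    plan: List[Tuple[str, int]] = []
    for model, start, end in slices:
        plan.extend((model, i) for i in range(start, end))
    return plan


# Frankenmerge example: all 32 layers of model A plus the last 8 of model B.
frankenmerge = stack_layers([("model_a", 0, 32), ("model_b", 24, 32)])

# Depth up-scaling example: two copies of a 32-layer model, 8 layers trimmed from each.
depth_upscaled = stack_layers([("copy_1", 0, 24), ("copy_2", 8, 32)])
```

The first plan has 40 layers and the second has 48. If per-layer parameters dominate the total, a 40-layer stack built from 32-layer 7B models has about 7 × 40 / 32 = 8.75 billion parameters, consistent with the "approximately 9 billion" figure.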

Notable examples of frankenmerge models include:

- **goliath-120b**, created by alpindale, which merged two fine-tuned Llama 2 70B models (Xwin and Euryale) into a single 120-billion-parameter model using the passthrough method in mergekit.
- Various 9B-parameter models created from pairs of 7B models.

Frankenmerge is highly experimental and requires significant trial and error. There is no established formula for determining which layers to take from which model, and results can be unpredictable. Nonetheless, the community has produced impressive results with this technique.

## How does evolutionary model merging work?

In March 2024, [Sakana AI](/wiki/sakana_ai), a Tokyo-based AI research lab founded by former [Google](/wiki/google) Brain researchers, introduced **evolutionary model merging** in their paper "Evolutionary Optimization of Model Merging Recipes" (published in Nature Machine Intelligence in January 2025).[6] The authors, Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha, proposed using [evolutionary algorithms](/wiki/evolutionary_algorithm) to automatically discover optimal merge configurations.

Prior to this work, model merging relied heavily on human intuition: practitioners manually selected which models to merge, which method to use, and what hyperparameters (such as interpolation weights and density values) to apply. Sakana AI's approach automates this process by treating the merge configuration as a search problem. As the company explained, "evolutionary algorithms, inspired by natural selection, can unlock more effective merging solutions" because "these algorithms can explore a vast space of possibilities, discovering novel and unintuitive combinations that traditional methods and human intuition might miss."[6]

The evolutionary algorithm operates in two spaces:

- **Parameter space.** Optimizes the interpolation weights and merging hyperparameters for combining model weights.
- **Data flow space.** Optimizes how layers are connected and which layers are used from which model, enabling architectural modifications beyond simple weight averaging.
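
A parameter-space search of this kind can be sketched with a very simple evolutionary loop. The code below is a toy under several assumptions: it uses a basic mutate-and-select strategy rather than whatever optimizer the authors used, it searches only over a vector of merge weights in the range 0 to 1, and `fitness` is a hypothetical callback that would, in practice, build the merged model and score it on a benchmark.

```python
import random
from typing import Callable, List


def evolve_merge_weights(
    fitness: Callable[[List[float]], float],
    dim: int,
    generations: int = 50,
    population: int = 16,
    sigma: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Elitist evolutionary search over merge weights clipped to [0, 1]."""
    rng = random.Random(seed)
    parent = [0.5] * dim
    parent_score = fitness(parent)
    for _ in range(generations):
        # Mutate the current best recipe with Gaussian noise.
        children = [
            [min(1.0, max(0.0, w + rng.gauss(0.0, sigma))) for w in parent]
            for _ in range(population)
        ]
        scored = [(fitness(child), child) for child in children]
        top_score, top_child = max(scored, key=lambda pair: pair[0])
        # Keep the parent unless a child strictly improves on it.
        if top_score > parent_score:
            parent, parent_score = top_child, top_score
    return parent
```

Each fitness call corresponds to one merge plus one evaluation run, which is why the evaluation budget dominates the cost of this kind of search.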

Using this approach, the team created a Japanese Math LLM (EvoLLM-JP) by merging a Japanese language model (Shisa-Gamma) with mathematics-focused models (WizardMath and Abel). The resulting 7-billion-parameter model scored 52.4 on MGSM-JA (Japanese grade-school math) and 69.0 on the Japanese Language Model Evaluation Harness, surpassing every Japanese LLM under 70B parameters and even the previous 70B-parameter Japanese state of the art, despite never being explicitly trained for Japanese mathematical reasoning.[6]

They also created a culturally aware Japanese [vision-language model](/wiki/vision_language_model) (EvoVLM-JP) that outperformed the English baseline it was derived from (LLaVA-1.6-Mistral-7B) and existing Japanese VLMs at describing culture-specific visual content.[6] These results demonstrated that evolutionary search can discover merge recipes that human experts would be unlikely to find through manual experimentation.

Sakana AI later extended this line of work with **CycleQD**, a population-based approach using quality-diversity algorithms to produce diverse collections of merged models, each optimized for different capability profiles.

## What tools are used for model merging?

### MergeKit

**MergeKit** is the most widely used open-source toolkit for model merging. It was created by Charles Goddard, a software engineer who had previously worked at NASA and Apple, and is now maintained by [Arcee AI](/wiki/arcee_ai). The accompanying paper, "Arcee's MergeKit: A Toolkit for Merging Large Language Models" by Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz, was published at EMNLP 2024.[7]

MergeKit supports the following merge methods:

| Method Name | MergeKit Identifier | Description |
|---|---|---|
| Linear (weighted average) | `linear` | Simple weighted average of model weights |
| SLERP | `slerp` | Spherical linear interpolation between two models |
| Task Arithmetic | `task_arithmetic` | Add scaled task vectors to a base model |
| TIES-Merging | `ties` | Trim, elect sign, and merge aligned parameters |
| DARE + TIES | `dare_ties` | DARE sparsification with TIES sign election |
| DARE + Linear | `dare_linear` | DARE sparsification with linear averaging |
| Passthrough | `passthrough` | Layer concatenation for frankenmerge |

Newer releases have added further methods, including `della`, `breadcrumbs`, `model_stock`, `sce`, `karcher`, and `arcee_fusion`, bringing the total to more than a dozen supported algorithms.[7]

MergeKit uses YAML configuration files to define merge recipes. A typical configuration specifies the source models, layer ranges, merge method, and method-specific parameters. For example, a SLERP merge configuration might look like:

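```yaml
slices:
  - sources:
      # Both models contribute all 32 transformer layers.
      - model: OpenPipe/mistral-ft-optimized-1218
        layer_range: [0, 32]
      - model: mlabonne/NeuralHermes-2.5-Mistral-7B
        layer_range: [0, 32]
merge_method: slerp
# t = 0 returns base_model; t = 1 returns the other model.
base_model: OpenPipe/mistral-ft-optimized-1218
parameters:
  t:
    # A list is a gradient of t values spread across the layer stack.
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    # Fallback for all remaining tensors: an even blend.
    - value: 0.5
dtype: bfloat16
```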

A key technical feature of MergeKit is its **out-of-core** processing approach, which allows merges to be performed with minimal memory. According to the project's documentation, "merges can be run entirely on CPU or accelerated with as little as 8 GB of VRAM,"[7] making model merging accessible even without expensive hardware. MergeKit integrates with the Hugging Face Transformers library and Hub, enabling users to pull models directly from the Hub and upload merged results.

As of 2025, MergeKit has "facilitated the merging of thousands of models," many of which have achieved top positions on open benchmarks.[7]

### Hugging Face Integration

The [Hugging Face](/wiki/hugging_face) ecosystem provides extensive support for model merging beyond MergeKit:

- **PEFT ([Parameter-Efficient Fine-Tuning](/wiki/peft)) library.** Includes built-in support for merging [LoRA](/wiki/lora) adapters using methods like linear averaging, TIES, DARE, and others.
- **Model Hub.** Hosts thousands of merged models, with metadata tags (such as "merge" and "model merge") that allow users to discover and filter community merges.
- **Open LLM Leaderboard.** Tracks merged models as a distinct category, enabling the community to compare merged models against traditionally trained ones.

## Which merged models are well known?

The open-source community has produced numerous merged models that have achieved strong benchmark results. Below are some notable examples:

| Model | Creator | Method | Base Models | Parameters | Notable Achievement |
|---|---|---|---|---|---|
| goliath-120b | alpindale | Passthrough | Two Llama 2 70B fine-tunes (Xwin, Euryale) | 120B | One of the first successful frankenmerge models |
| SOLAR-10.7B | Upstage | Depth up-scaling | Llama 2 architecture, Mistral 7B weights | 10.7B | Top of Open LLM Leaderboard (Dec 2023); continued pretraining after merge |
| Marcoro14-7B-slerp | Maxime Labonne | SLERP | Marcoroni-7B-v3 + Mistral-7B-Merge-14-v0.1 | 7B | Best 7B model on Open LLM Leaderboard (Feb 2024) |
| Daredevil-7B | Maxime Labonne | DARE-TIES | Multiple Mistral-based models | 7B | Strong benchmark performance with DARE method |
| NeuralBeagle14-7B | Maxime Labonne | DARE-TIES | Multiple Mistral fine-tunes | 7B | High-performing community merge |

The proliferation of merged models on Hugging Face has been remarkable. At various points in 2023 and 2024, merged models occupied the top positions on the Open LLM Leaderboard, demonstrating that merging can compete with, and sometimes surpass, models produced through expensive training runs.[4]

However, the community has also noted concerns about **benchmark contamination**: some fine-tuned models may have been trained on data that overlaps with benchmark test sets, and merging such models can propagate this contamination. For this reason, evaluation using diverse benchmarks, human evaluation, and [LMSYS](/wiki/lmsys) [Chatbot Arena](/wiki/lmsys_chatbot_arena) rankings is recommended alongside leaderboard scores.

## How does model merging differ from ensembling?

Model merging and [ensemble learning](/wiki/ensemble_learning) both aim to combine the strengths of multiple models, but they differ in their approach and trade-offs.

| Aspect | Model Merging | Ensemble Learning |
|---|---|---|
| Combination point | Weight space (before inference) | Output space (during inference) |
| Number of models at inference | 1 | Multiple |
| Inference cost | Same as a single model | Proportional to number of models |
| Memory requirement | Same as a single model | Proportional to number of models |
| Performance ceiling | Generally lower than ensembles | Generally higher |
| Flexibility | Requires same architecture | Can combine different architectures |

Ensembles typically achieve higher accuracy because they can capture diverse predictions and reduce variance through output averaging or voting. Model merging sacrifices some of this diversity by collapsing multiple models into one, but it gains practical advantages in deployment: a single merged model requires no special inference infrastructure, uses the same memory as any individual model, and has identical latency.
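
The difference between the two combination points can be made concrete with a deliberately tiny model. The sketch below uses a hypothetical one-parameter "network" (a single ReLU unit) purely for illustration: the ensemble averages outputs after one forward pass per model, while the merge averages weights first and runs one forward pass.

```python
from typing import List


def tiny_model(w: float, x: float) -> float:
    """A one-parameter 'network': a single ReLU unit."""
    return max(0.0, w * x)


def ensemble_predict(weights: List[float], x: float) -> float:
    """Output-space combination: one forward pass per model, then average."""
    return sum(tiny_model(w, x) for w in weights) / len(weights)


def merged_predict(weights: List[float], x: float) -> float:
    """Weight-space combination: average the weights once, then one forward pass."""
    w_merged = sum(weights) / len(weights)
    return tiny_model(w_merged, x)
```

With weights 1.0 and 3.0 and input 2.0, both functions return 4.0, because the two models sit in the same linear regime. With weights 1.0 and -1.0, the ensemble returns 1.0 while the merged model returns 0.0, a small-scale picture of how merging can lose information when the source models disagree.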

Research has shown that weight-space ensembles (model merging) can match or approach the performance of output-space ensembles in many settings, particularly when the source models share a common pretrained initialization. The Model Soups paper demonstrated this parity in several vision and NLP tasks.[1]

## Applications

Model merging has found applications across a variety of settings in the AI community.

### Combining Specialized Models

The most common application is combining models that have been fine-tuned for different tasks or domains. For instance, merging an instruction-following model with a code-generation model and a mathematical reasoning model can produce a single model that handles all three types of queries. This eliminates the need for routing logic or running multiple specialized models.

### Improving Robustness

As demonstrated by Model Soups, averaging weights across different hyperparameter configurations improves not only in-distribution accuracy but also out-of-distribution robustness.[1] The merged model smooths over the idiosyncrasies of individual training runs, leading to more stable predictions across diverse inputs.

### Knowledge Editing

Task Arithmetic's negation operation enables a form of knowledge editing: by subtracting a task vector associated with an undesirable behavior (such as toxic language generation), practitioners can reduce that behavior in the merged model.[2] This provides a lightweight alternative to full [RLHF](/wiki/rlhf) (Reinforcement Learning from Human Feedback) training for behavior modification.

### Cross-Lingual Transfer

Sakana AI's evolutionary merging demonstrated that models trained in different languages can be merged to create bilingual or multilingual models. Their Japanese Math LLM combined a Japanese language model with English-trained math models, transferring mathematical reasoning capabilities across language boundaries without explicit cross-lingual training.[6]

### Federated Learning

In [federated learning](/wiki/federated_learning) settings, multiple clients train local models on private data and then a central server merges these local models. Model merging techniques from the research community have direct applications in improving the quality of federated aggregation.

### Continual Learning

Model merging can help address [catastrophic forgetting](/wiki/catastrophic_forgetting) in continual learning scenarios. Rather than sequentially fine-tuning a model on new tasks (which risks forgetting earlier tasks), practitioners can fine-tune separate copies and merge them to retain capabilities across all tasks.

## What are the limitations of model merging?

Despite its appeal, model merging has several important limitations.

### Architecture Compatibility

Most merging methods require the source models to have identical architectures: the same number of layers, the same hidden dimensions, and the same number of attention heads. This is because merging operates on corresponding parameters at each position in the weight tensors. Models with different architectures cannot be merged using standard methods, though the passthrough technique offers a partial workaround by concatenating layers.
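
A compatibility check of this kind can be sketched as a comparison of tensor names and shapes. The helper below is illustrative only: the function name is hypothetical, and the shape dictionaries stand in for whatever a real checkpoint loader would report.

```python
from typing import Dict, List, Tuple

Shapes = Dict[str, Tuple[int, ...]]


def shape_mismatches(a: Shapes, b: Shapes) -> List[str]:
    """List the tensors that prevent position-by-position merging of two models."""
    problems = []
    for name in sorted(set(a) | set(b)):
        if name not in a or name not in b:
            problems.append(f"{name}: present in only one model")
        elif a[name] != b[name]:
            problems.append(f"{name}: shape {a[name]} vs {b[name]}")
    return problems
```

An empty result means every tensor has a counterpart of the same shape. That condition is necessary for standard merging but does not guarantee a good merge, since the models may still come from different base checkpoints.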

Recent research, such as "Transport and Merge: Cross-Architecture Merging for Large Language Models" (2025), has begun to explore techniques for merging models with different architectures, but these methods remain in early stages.

### Same Base Model Requirement

Merging works best when all source models are fine-tuned from the same base pretrained checkpoint. This shared initialization ensures that the models lie in the same basin of the loss landscape and that their parameter spaces are aligned. Merging models fine-tuned from different base checkpoints (for example, merging a Llama model with a Mistral model) typically produces poor results because the parameters are not in correspondence.

Git Re-Basin partially addresses this by permuting the neurons of one model so that they line up with those of the other,[5] but the technique is computationally expensive and has not been scaled to the largest modern models.
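The toy example below illustrates why neuron ordering matters. It builds a small two-layer network, shuffles its hidden units to obtain a functionally identical copy, and compares naive averaging with averaging after alignment. The matching step here is a simple cosine-similarity argmax, chosen only because the copy is an exact permutation; it is a simplified stand-in and not the assignment procedure used by Git Re-Basin.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    """Two-layer ReLU network without biases."""
    return np.maximum(x @ w1, 0.0) @ w2

# Model A, and model B: the same network with its hidden units reordered.
w1_a = rng.normal(size=(3, 5))
w2_a = rng.normal(size=(5, 2))
perm = np.array([2, 0, 4, 1, 3])
w1_b, w2_b = w1_a[:, perm], w2_a[perm, :]

x = rng.normal(size=(4, 3))
reference = mlp(x, w1_a, w2_a)
same_function = np.allclose(reference, mlp(x, w1_b, w2_b))

# Naive averaging pairs up hidden units that have nothing to do with each other.
naive = mlp(x, (w1_a + w1_b) / 2, (w2_a + w2_b) / 2)

# Align B to A by matching each hidden unit of A to its most similar unit in B.
a_unit = w1_a / np.linalg.norm(w1_a, axis=0, keepdims=True)
b_unit = w1_b / np.linalg.norm(w1_b, axis=0, keepdims=True)
match = (a_unit.T @ b_unit).argmax(axis=1)
w1_b_aligned, w2_b_aligned = w1_b[:, match], w2_b[match, :]
aligned = mlp(x, (w1_a + w1_b_aligned) / 2, (w2_a + w2_b_aligned) / 2)

print("B computes the same function as A:", same_function)
print("naive average error:  ", np.abs(naive - reference).max())
print("aligned average error:", np.abs(aligned - reference).max())
```

For independently trained networks no exact permutation exists, so real alignment methods solve an optimization problem over permutations, which is where the computational cost comes from.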

### Unpredictable Quality

The quality of a merged model is difficult to predict in advance. Two models that individually perform well may produce a poor merge if their parameter updates interfere destructively. Conversely, merges that seem unlikely to work on paper sometimes produce surprisingly good results. This unpredictability means that model merging still involves significant trial and error.
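One rough diagnostic for destructive interference is the fraction of parameters that two fine-tuned models moved in opposite directions relative to the base. The sketch below computes that sign-conflict rate on toy vectors; the function name and the values are illustrative, and a low rate is only a heuristic signal, not a guarantee of a good merge.

```python
import numpy as np

def sign_conflict_rate(delta_a, delta_b):
    """Fraction of jointly nonzero coordinates where two updates disagree in sign."""
    both = (delta_a != 0) & (delta_b != 0)
    if not both.any():
        return 0.0
    conflicts = np.sign(delta_a[both]) != np.sign(delta_b[both])
    return float(conflicts.mean())

base = np.array([0.5, -0.2, 0.1, 0.9])
model_a = np.array([0.8, -0.5, 0.1, 1.0])  # fine-tune A
model_b = np.array([0.2, -0.6, 0.4, 1.3])  # fine-tune B

rate = sign_conflict_rate(model_a - base, model_b - base)
print(f"sign conflict rate: {rate:.2f}")
```

Here three coordinates were changed by both models and one of them moved in opposite directions, so averaging would partly cancel that update.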

### Performance Degradation at Scale

As the number of models being merged increases, performance tends to degrade. Each additional model introduces more potential for interference and parameter conflicts. Methods like TIES-Merging and DARE mitigate this to some extent, but merging a large number of models (say, ten or more) into a single checkpoint remains challenging.
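The two mitigation ideas mentioned above can be sketched on flat task vectors (fine-tuned weights minus base weights). The first function follows the DARE idea of randomly dropping delta entries and rescaling the survivors; the second follows the TIES idea of electing a sign per parameter and averaging only the deltas that agree with it. This is a simplified illustration under those assumptions: it omits the magnitude-based trimming step of TIES, and the function names are hypothetical rather than taken from any library.

```python
import numpy as np

def dare_sparsify(delta, density, rng):
    """Randomly keep a `density` fraction of entries and rescale them by 1/density."""
    mask = rng.random(delta.shape) < density
    return np.where(mask, delta / density, 0.0)

def ties_merge(deltas):
    """Elect a sign per parameter, then average only the deltas that agree with it."""
    stacked = np.stack(deltas)                  # (num_models, num_params)
    elected = np.sign(stacked.sum(axis=0))      # sign with the larger total magnitude
    agrees = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agrees.sum(axis=0), 1)  # avoid division by zero
    return (stacked * agrees).sum(axis=0) / counts

deltas = [
    np.array([0.4, -0.2, 0.0]),
    np.array([-0.1, -0.4, 0.3]),
    np.array([0.3, 0.1, 0.0]),
]

# Sign election alone: conflicting minority updates are discarded.
ties_only = ties_merge(deltas)
print(ties_only)

# DARE-style sparsification followed by sign election.
rng = np.random.default_rng(0)
sparse_deltas = [dare_sparsify(d, density=0.5, rng=rng) for d in deltas]
dare_ties = ties_merge(sparse_deltas)
print(dare_ties)
```

The merged delta would then be scaled and added back to the base weights. As more models are stacked, fewer coordinates have a clear majority sign, which is one way to see why interference grows with the number of sources.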

### No Theoretical Guarantees

The field lacks a comprehensive theoretical framework that predicts when merging will succeed and to what degree. Most results are empirical, and the conditions under which merging provides reliable improvements are still being studied. This makes it difficult for practitioners to know in advance whether merging is the right approach for their use case.

### Benchmark Gaming

Because model merging is so easy to perform, it has enabled a form of benchmark gaming on public leaderboards. Practitioners can quickly merge many combinations of models and submit the best-performing ones, potentially exploiting benchmark-specific weaknesses rather than producing genuinely better models. The community has responded by developing more robust evaluation frameworks and by flagging merged models on leaderboards.

## Connection to Model Editing and Knowledge Composition

Model merging is closely related to the broader field of **model editing**, which seeks to modify a trained model's knowledge or behavior without full retraining. Task Arithmetic explicitly frames merging as a model editing technique: task vectors can be added, subtracted, or combined to steer model behavior in specific directions.[2]
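The arithmetic can be sketched in a few lines. A task vector is the difference between fine-tuned and base weights; adding it with a positive coefficient introduces a behavior, and adding it with a negative coefficient suppresses one. The parameter dictionaries, model names, and scaling coefficients below are illustrative assumptions.

```python
import numpy as np

def task_vector(finetuned, base):
    """Difference between fine-tuned and base parameters."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base, vectors, scales):
    """Return base plus a scaled sum of task vectors (negative scale = negation)."""
    edited = {k: v.copy() for k, v in base.items()}
    for vec, scale in zip(vectors, scales):
        for k in edited:
            edited[k] += scale * vec[k]
    return edited

base = {"w": np.array([1.0, 1.0])}
math_model = {"w": np.array([1.5, 1.0])}   # hypothetical math fine-tune
toxic_model = {"w": np.array([1.0, 2.0])}  # hypothetical undesired fine-tune

tv_math = task_vector(math_model, base)
tv_toxic = task_vector(toxic_model, base)

# Add the math skill, and move away from the undesired behavior.
edited = apply_task_vectors(base, [tv_math, tv_toxic], scales=[1.0, -0.5])
print(edited["w"])
```

In practice the scaling coefficients are tuned on held-out data, since adding task vectors at full strength can overshoot.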

The connection to **knowledge composition** is also significant. A 2025 paper, "Model Merging for [Knowledge Editing](/wiki/knowledge_editing)," proposed a two-stage framework combining robust supervised fine-tuning with model merging to edit factual knowledge in LLMs. The first stage fine-tunes the model to internalize new knowledge, and the second stage merges the fine-tuned model with the original to preserve both the new knowledge and the model's general capabilities.

Model merging can also be viewed as a form of **multi-task learning** without explicit multi-task training. Traditional multi-task learning requires training on all tasks simultaneously, which introduces challenges like task balancing and data mixing. Model merging sidesteps these issues by allowing each task's model to be trained independently and combined afterward.

## Practical Tips for Model Merging

Several practical guidelines have emerged from community experience and published research:

1. **Start with models from the same base.** Merging is most likely to succeed when all source models are fine-tuned from the same pretrained checkpoint.

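2. **Use SLERP for two-model merges.** When combining exactly two models, SLERP generally produces better results than linear interpolation because it interpolates along the arc between the two weight vectors instead of the straight line between them, which avoids shrinking the magnitude of the merged weights.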

3. **Use DARE-TIES for multi-model merges.** When combining three or more models, DARE-TIES tends to produce the best results by reducing interference through both sparsification and sign election.

4. **Keep weights balanced.** The sum of interpolation weights across models should generally fall between 0.9 and 1.1. Extreme weight imbalances can destabilize the merged model.

5. **Start with a density of 0.5 or lower.** For TIES and DARE methods, a density parameter below 0.5 (retaining less than half of the delta parameters) often works well, though values slightly above 0.5 are sometimes better, so treat it as a starting point to tune.

6. **Pay attention to first and last layers.** The first and last layers of a transformer model tend to be the most sensitive. Some practitioners assign different interpolation weights to these layers.

7. **Evaluate broadly.** Do not rely on a single benchmark. Use multiple evaluation suites, including human evaluation when possible, to assess the quality of a merged model.

8. **Iterate quickly.** Model merging is fast. Take advantage of this by trying many configurations and evaluating each one, rather than spending time trying to find the optimal configuration analytically.
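To make the SLERP guideline concrete, the sketch below shows one common formulation of spherical linear interpolation applied to flattened weight vectors, with a fallback to linear interpolation when the vectors are nearly parallel. The function signature and the `eps` threshold are assumptions made for illustration; merging toolkits may differ in details such as per-tensor application.

```python
import numpy as np

def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two flattened weight vectors."""
    a_unit = w_a / (np.linalg.norm(w_a) + eps)
    b_unit = w_b / (np.linalg.norm(w_b) + eps)
    dot = np.clip(np.dot(a_unit, b_unit), -1.0, 1.0)
    omega = np.arccos(dot)  # angle between the two vectors
    if np.sin(omega) < eps:
        # Nearly parallel vectors: fall back to linear interpolation.
        return (1.0 - t) * w_a + t * w_b
    coeff_a = np.sin((1.0 - t) * omega) / np.sin(omega)
    coeff_b = np.sin(t * omega) / np.sin(omega)
    return coeff_a * w_a + coeff_b * w_b

w_a = np.array([1.0, 0.0])
w_b = np.array([0.0, 1.0])

spherical = slerp(w_a, w_b, t=0.5)
linear = 0.5 * w_a + 0.5 * w_b

print("SLERP norm: ", np.linalg.norm(spherical))
print("linear norm:", np.linalg.norm(linear))
```

For these two orthogonal unit vectors, the spherical midpoint keeps unit length while the linear midpoint is shorter, which is the magnitude loss that SLERP is meant to avoid.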

## Recent Developments (2025-2026)

### ACM Computing Surveys 2026 Comprehensive Review

In 2026, Enneng Yang, Li Shen, Guibing Guo, Xingwei Wang, Xiaochun Cao, Jie Zhang, and Dacheng Tao published a comprehensive survey, "Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications, and Opportunities," in *ACM Computing Surveys* (Volume 58, Issue 8, article 216, 41 pages). The survey proposes a new taxonomic framework that organizes the rapidly growing literature, covers the application of model merging across more than ten machine learning subfields (including continual learning, multi-task learning, and few-shot learning), and identifies cross-architecture merging, theoretical grounding, and multimodal merging as the most important open frontiers.[11]

### 2026 Era Survey (FUSE Taxonomy)

A March 2026 preprint by Mingyang Song and Mao Zheng, "Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions," organizes the field through a four-part FUSE taxonomy (Foundations, Unification strategies, Scenarios, Ecosystem). It reviews algorithmic approaches spanning weight averaging, task vector arithmetic, sparsification-enhanced methods, mixture-of-experts architectures, and evolutionary optimization, and surveys their use in multi-task learning, safety alignment, domain specialization, and federated learning.[12]

### Activation-Informed Merging

A February 2025 paper, "Activation-Informed Merging of Large Language Models" (AIM) by Amin Heyrani Nobari and colleagues, demonstrated that using activation statistics from a task-agnostic calibration set to prioritize critical base-model weights substantially reduces destructive interference compared to purely weight-based methods. AIM is framed as a complementary solution that can be applied on top of any existing merging method.[13]

## Future Directions

Model merging is an active area of research with several promising directions.

**Cross-architecture merging.** Enabling merging between models of different architectures and sizes would greatly expand the applicability of the technique. Early work in this area uses techniques like singular value decomposition (SVD) and transport methods to align parameter spaces across architectures.

**Automated merge optimization.** Building on Sakana AI's evolutionary approach, researchers are developing more sophisticated automated methods for discovering optimal merge recipes, including reinforcement learning-based and [Bayesian optimization](/wiki/bayesian_optimization)-based approaches.

**Theoretical foundations.** Developing a rigorous theoretical understanding of when and why merging works would allow practitioners to predict merge quality before performing the merge, saving time and compute.

**Activation-informed merging.** Rather than operating solely on static weights, newer methods use activation statistics from calibration data to guide the merging process, identifying which parameters are most important for preserving each model's capabilities.

**Per-layer SVD merging.** Research using per-layer singular value decomposition to isolate task-relevant directions in weight space has shown that singular vectors can guide merging to reduce destructive interference.

**Merging beyond language models.** While most current work focuses on LLMs, model merging has potential applications in [computer vision](/wiki/computer_vision), [speech recognition](/wiki/speech_recognition), [reinforcement learning](/wiki/reinforcement_learning), and multimodal models. The Model Soups paper already demonstrated benefits for vision models, and extending advanced merging techniques to other domains remains an open opportunity.
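As a rough illustration of the per-layer SVD direction listed above, the sketch below decomposes a single layer's task matrix (fine-tuned minus base weights) and keeps only its strongest singular directions. It is a generic low-rank truncation written for illustration, not a reproduction of any specific published method, and the function name and toy matrix are assumptions.

```python
import numpy as np

def truncate_task_matrix(delta, rank):
    """Keep the top-`rank` singular directions of a layer's weight update."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

# Toy update that is exactly rank 1: a single dominant direction.
delta = np.outer(np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.0, -1.0]))
compressed = truncate_task_matrix(delta, rank=1)

print(np.linalg.matrix_rank(compressed))
print(np.abs(compressed - delta).max())
```

Restricting each model's update to a few dominant directions leaves less room for updates from different models to overlap, which is the intuition behind using singular vectors to reduce interference.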

## References

1. Wortsman, M., Ilharco, G., Gadre, S.Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A.S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., Schmidt, L. (2022). "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time." *Proceedings of the 39th International Conference on Machine Learning (ICML)*. https://arxiv.org/abs/2203.05482

2. Ilharco, G., Ribeiro, M.T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., Farhadi, A. (2023). "Editing Models with Task Arithmetic." *11th International Conference on Learning Representations (ICLR)*. https://arxiv.org/abs/2212.04089

3. Yadav, P., Tam, D., Choshen, L., Raffel, C., Bansal, M. (2023). "TIES-Merging: Resolving Interference When Merging Models." *Advances in Neural Information Processing Systems (NeurIPS)*. https://arxiv.org/abs/2306.01708

4. Yu, L., Yu, B., Yu, H., Huang, F., Li, Y. (2024). "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch." *Proceedings of the 41st International Conference on Machine Learning (ICML)*. https://arxiv.org/abs/2311.03099

5. Ainsworth, S.K., Hayase, J., Srinivasa, S. (2023). "Git Re-Basin: Merging Models modulo Permutation Symmetries." *11th International Conference on Learning Representations (ICLR)*. https://arxiv.org/abs/2209.04836

6. Akiba, T., Shing, M., Tang, Y., Sun, Q., Ha, D. (2025). "Evolutionary Optimization of Model Merging Recipes." *Nature Machine Intelligence*. https://arxiv.org/abs/2403.13187

7. Goddard, C., Siriwardhana, S., Ehghaghi, M., Meyers, L., Karpukhin, V., Benedict, B., McQuade, M., Solawetz, J. (2024). "Arcee's MergeKit: A Toolkit for Merging Large Language Models." *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP): Industry Track*. https://arxiv.org/abs/2403.13257

8. Kim, D., Park, C., Kim, S., Lee, W., Song, W., Kim, Y., Kim, H., Kim, Y., Lee, H., Kim, J., et al. (2023). "SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling." *arXiv preprint arXiv:2312.15166*. https://arxiv.org/abs/2312.15166

9. Frankle, J., Dziugaite, G.K., Roy, D., Carbin, M. (2020). "Linear Mode Connectivity and the Lottery Ticket Hypothesis." *Proceedings of the 37th International Conference on Machine Learning (ICML)*.

10. Shoemake, K. (1985). "Animating Rotation with Quaternion Curves." *ACM SIGGRAPH Computer Graphics*, 19(3), 245-254.

11. Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., Tao, D. (2026). "Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications, and Opportunities." *ACM Computing Surveys*, 58(8), Article 216. https://dl.acm.org/doi/10.1145/3787849

12. Song, M., Zheng, M. (2026). "Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions." arXiv preprint arXiv:2603.09938. https://arxiv.org/abs/2603.09938

13. Heyrani Nobari, A., Alim, K., ArjomandBigdeli, A., Srivastava, A., Ahmed, F., Azizan, N. (2025). "Activation-Informed Merging of Large Language Models." arXiv preprint arXiv:2502.02421. https://arxiv.org/abs/2502.02421