Model soups

Updated Jul 13, 2026

Overview

Model soups is a weight-averaging technique (a form of model merging) that combines several independently fine-tuned neural networks into a single model by averaging their parameters. It was introduced by Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre and colleagues at the University of Washington, Columbia University, Google Research, Meta AI Research and Tel Aviv University in the paper "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time," presented at the International Conference on Machine Learning (ICML) in 2022 ^[1]^[2]. A public reference implementation is available ^[3].

The conventional recipe for getting the best model out of a fine-tuning hyperparameter sweep is to train many candidate models, evaluate each on a held-out validation set, keep the single best, and discard the rest. Model soups revisit that final selection step. Instead of throwing away the runner-up models, the method averages the weights of several candidates that were all fine-tuned from the same pretrained initialization but with different hyperparameters such as learning rate, data augmentation, random seed and training length. Because the output is a single set of weights, a soup has exactly the same inference and memory cost as any one of its ingredients, unlike an ensemble, which must run and store every member ^[1].

The central empirical finding is that this averaged model is frequently more accurate and more robust than the best individual ingredient. Applied to a Vision Transformer ViT-G/14 pretrained on JFT-3B and fine-tuned on ImageNet, a model soup reached 90.94% top-1 accuracy, a state-of-the-art result at the time of publication ^[1].

How it works

Let $\theta_0$ be a shared pretrained initialization, for example a CLIP image encoder ^[9], and let $\theta_1, \ldots, \theta_k$ be the weight vectors produced by $k$ independent fine-tuning runs that all start from $\theta_0$ and differ only in hyperparameters. A soup is a weighted average of a subset of these vectors. The paper proposes three recipes ^[1].

Uniform soup

The uniform soup is the plain mean of all $k$ candidates: $\theta_{\text{soup}} = \frac{1}{k} \sum_{i=1}^{k} \theta_i$. It requires no validation data and is trivially cheap to compute. Its weakness is that every ingredient is weighted equally, so a few weak or divergent runs can pull the average below the best single model.
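As a minimal sketch of the uniform recipe, the snippet below averages every parameter element-wise across the ingredients. It assumes each model is represented as a dictionary from parameter name to a flat list of floats; the `StateDict` alias and the `uniform_soup` helper are hypothetical names used for illustration, standing in for a real framework's checkpoint format.

```python
from typing import Dict, List

# Hypothetical stand-in for a framework checkpoint: name -> flattened values.
StateDict = Dict[str, List[float]]


def uniform_soup(ingredients: List[StateDict]) -> StateDict:
    """Return the element-wise mean of every parameter across all ingredients."""
    if not ingredients:
        raise ValueError("need at least one ingredient")
    k = len(ingredients)
    soup: StateDict = {}
    for name, first in ingredients[0].items():
        # Average position j of this parameter over the k models.
        soup[name] = [
            sum(model[name][j] for model in ingredients) / k
            for j in range(len(first))
        ]
    return soup


# Three toy "fine-tuned models" that share one two-element parameter.
models = [{"w": [1.0, 4.0]}, {"w": [2.0, 5.0]}, {"w": [3.0, 6.0]}]
soup = uniform_soup(models)
```

The result has the same shape as any single ingredient, which is why a soup costs no more to store or run than one fine-tuned model.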

Greedy soup

The greedy soup is constructed so that it can never be worse than the best ingredient on validation data. The candidates are first sorted in decreasing order of held-out validation accuracy. The soup starts empty; each model is then considered in turn and folded into the running average only if doing so does not decrease accuracy on the held-out set. Models that would hurt the soup are skipped. Because the procedure begins with the strongest model and accepts a new ingredient only when it helps, the greedy soup is guaranteed to be at least as accurate as the single best model on the held-out set, and in practice it usually exceeds it ^[1]. This is the recipe the authors recommend by default.
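The procedure can be sketched in a few lines. This is an illustration under stated assumptions rather than the reference implementation: weights are flat lists of floats, and `score` is a hypothetical callable standing in for held-out validation accuracy. The sketch seeds the soup with the top-ranked model, which is equivalent to starting empty because the first candidate is always accepted.

```python
from typing import Callable, List, Sequence

Weights = List[float]  # flattened parameters; a stand-in for a real checkpoint


def average(models: Sequence[Weights]) -> Weights:
    """Element-wise mean of a non-empty list of weight vectors."""
    return [sum(column) / len(models) for column in zip(*models)]


def greedy_soup(candidates: Sequence[Weights],
                score: Callable[[Weights], float]) -> Weights:
    """Visit candidates in decreasing score; keep those that do not hurt the soup."""
    ranked = sorted(candidates, key=score, reverse=True)
    ingredients = [ranked[0]]            # the best single model always enters
    best = score(ranked[0])
    for candidate in ranked[1:]:
        trial = score(average(ingredients + [candidate]))
        if trial >= best:                # accept only if the score does not drop
            ingredients.append(candidate)
            best = trial
    return average(ingredients)


def toy_score(theta: Weights) -> float:
    """Hypothetical validation score: higher is better, peaking at (1, 1)."""
    return -sum((value - 1.0) ** 2 for value in theta)


candidates = [[0.5, 1.0], [1.5, 1.0], [4.0, 4.0]]
soup = greedy_soup(candidates, toy_score)
```

In this toy run the first two candidates average to the optimum, while the third would drag the soup away from it and is skipped.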

Learned soup

The learned soup is a more expensive variant that does not simply include or exclude each model but instead learns continuous interpolation coefficients, optionally one per layer, by gradient-based minibatch optimization against the held-out set. It can extract a little more accuracy, but it requires holding all candidate models in memory at the same time during construction ^[1].
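One way to write the objective this paragraph describes is given below as a sketch. The notation is chosen for this illustration and need not match the paper's: $f(x; \theta)$ is the network's output for input $x$, $\ell$ is a classification loss such as cross-entropy, $(x_j, y_j)$ for $j = 1, \ldots, n$ are the held-out examples, and $\alpha_i$ is the learned coefficient of ingredient $i$.

$$
\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^k} \sum_{j=1}^{n} \ell\!\left( f\!\left(x_j;\ \sum_{i=1}^{k} \alpha_i \theta_i\right),\ y_j \right), \qquad \theta_{\text{soup}} = \sum_{i=1}^{k} \hat{\alpha}_i \theta_i
$$

The per-layer variant replaces each scalar $\alpha_i$ with one coefficient per layer. Because the gradient with respect to $\alpha$ depends on every $\theta_i$, all $k$ ingredients have to be loaded together during optimization.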

Why it works

Averaging the weights of two arbitrary neural networks usually yields a useless model, because the loss landscape is highly non-convex and independently trained networks settle into different, disconnected basins. The precondition that makes soups work is shared initialization. Models that are all fine-tuned from the same pretrained checkpoint tend to remain inside a single low-error region of weight space, a property closely tied to linear mode connectivity: the straight-line path between two such solutions stays low in loss, so points along it, including the midpoint, are themselves good models ^[1]^[8].
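The difference between solutions in one basin and solutions in separate basins can be illustrated by evaluating a loss along the straight line between two weight vectors. The sketch below uses a hypothetical one-dimensional double-well loss, and its barrier measure (path loss minus the chord between the endpoint losses) is one simple choice made for this example, not a definition taken from the cited papers.

```python
from typing import Callable, List

Weights = List[float]


def interpolate(theta_a: Weights, theta_b: Weights, t: float) -> Weights:
    """Point at fraction t on the straight line from theta_a to theta_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(theta_a: Weights, theta_b: Weights,
                 loss: Callable[[Weights], float], steps: int = 10) -> float:
    """Largest amount by which the path loss exceeds the chord between endpoint losses."""
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    gaps = []
    for i in range(steps + 1):
        t = i / steps
        chord = (1.0 - t) * loss_a + t * loss_b
        gaps.append(loss(interpolate(theta_a, theta_b, t)) - chord)
    return max(gaps)


def double_well(theta: Weights) -> float:
    """Hypothetical toy loss with two separate basins, at w = -1 and w = +1."""
    return (theta[0] ** 2 - 1.0) ** 2


same_basin = loss_barrier([0.75], [1.25], double_well)        # both near w = +1
different_basins = loss_barrier([-1.0], [1.0], double_well)   # opposite minima
```

For the pair in the same basin the path never rises above the chord, so the midpoint is a usable model. For the pair in different basins the midpoint sits on the ridge between the minima, which is the failure mode that shared initialization is meant to avoid.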

Within such a basin, averaging weights approximates averaging outputs. The paper gives an analysis that relates the gap between a weight-averaged model and a logit ensemble of the same models to two quantities: the flatness of the loss (low local curvature, as measured by the second derivative) and the confidence of the predictions. When the loss is locally flat and the predictions are confident, the two behave alike, so one averaged set of weights captures much of the benefit an ensemble would deliver at the cost of a single forward pass ^[1]. Averaging can also land in a flatter part of the loss surface than any individual endpoint, which is associated with better generalization. Empirically, the authors observe that the accuracy advantage of interpolating two fine-tuned models is correlated with the angle between the two solutions, measured from the shared initialization ^[1].
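The link between averaging weights and averaging outputs is easiest to see in the special case of a model whose logits are linear in its weights, where the two coincide exactly. The sketch below checks this for a hypothetical bias-free linear classifier; for real nonlinear networks the match is only approximate, which is the case the analysis addresses.

```python
from typing import List

Matrix = List[List[float]]  # one row of weights per class


def logits(weights: Matrix, x: List[float]) -> List[float]:
    """Logits of a bias-free linear classifier: one dot product per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def average_weights(models: List[Matrix]) -> Matrix:
    """Element-wise mean of the weight matrices (a uniform soup)."""
    k = len(models)
    return [[sum(vals) / k for vals in zip(*rows)] for rows in zip(*models)]


def ensemble_logits(models: List[Matrix], x: List[float]) -> List[float]:
    """Mean of the per-model logits (a logit ensemble)."""
    k = len(models)
    return [sum(vals) / k for vals in zip(*(logits(m, x) for m in models))]


models = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 2.0], [2.0, -1.0]],
]
x = [0.5, 2.0]
soup_out = logits(average_weights(models), x)   # one forward pass
ensemble_out = ensemble_logits(models, x)       # one forward pass per model
```

Both routes give the same logits here, but the soup needs a single forward pass while the ensemble needs one per member.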

Results

The original work validated soups across image and text models. For a CLIP ViT-B/32 and an ALIGN EfficientNet-L2 fine-tuned on ImageNet, the greedy soup beat the best model from the hyperparameter sweep by 0.7 and 0.5 percentage points respectively ^[1]. Gains also appeared on five natural distribution shift benchmarks (ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet and ImageNet-A), where soups often improved robustness by more than they improved in-distribution accuracy. The headline result was the ViT-G/14 greedy soup reaching 90.94% top-1 on ImageNet, surpassing the previous best of 90.88% set by CoAtNet ^[1]^[10]^[11]. The approach also extended to fine-tuned transformer text classifiers and improved zero-shot transfer to new downstream tasks ^[1].

Setting	Outcome
CLIP ViT-B/32 on ImageNet, greedy soup vs best single model	+0.7 percentage points
ALIGN EfficientNet-L2 on ImageNet, greedy soup vs best single model	+0.5 percentage points
ViT-G/14 (JFT-3B pretrained), greedy soup, ImageNet top-1	90.94% (new state of the art; prior best 90.88%)
Inference and memory cost vs one fine-tuned model	Identical
Inference and memory cost vs a $k$-member ensemble	Roughly $1/k$ (one network instead of $k$)

A consistent nuance is that a full output ensemble can still edge out a soup on in-distribution accuracy, while soups tend to match or beat ensembles on distribution shifts, and they do so at a fraction of the inference cost ^[1].

Relationship to other methods

Model soups sit within a broader family of weight-averaging and model merging techniques.

Stochastic weight averaging (SWA) averages weights sampled along a single training trajectory under a cyclical or high constant learning rate ^[4]. Exponential moving averaging (EMA) and the classical Polyak-Ruppert averages similarly smooth the weights of one run over time; the soups paper in fact stores an EMA of each run as a candidate ingredient. Soups differ from all of these by averaging the endpoints of multiple independent runs that differ in hyperparameters, rather than snapshots from one run.
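For contrast with soups, the two single-run schemes can be sketched as update rules applied to snapshots from one training trajectory. The helper names and the toy one-parameter snapshots are hypothetical, and the equal-weight running mean is only the averaging step of SWA, not its learning-rate schedule.

```python
from typing import List

Weights = List[float]


def running_mean(avg: Weights, theta: Weights, n_seen: int) -> Weights:
    """SWA-style equal-weight average after folding in one more snapshot."""
    return [(a * n_seen + t) / (n_seen + 1) for a, t in zip(avg, theta)]


def ema_update(ema: Weights, theta: Weights, decay: float) -> Weights:
    """Exponential moving average: recent snapshots count more."""
    return [decay * e + (1.0 - decay) * t for e, t in zip(ema, theta)]


snapshots = [[1.0], [2.0], [3.0]]  # weights saved along ONE training run

swa = snapshots[0]
ema = snapshots[0]
for n_seen, theta in enumerate(snapshots[1:], start=1):
    swa = running_mean(swa, theta, n_seen)
    ema = ema_update(ema, theta, decay=0.5)
```

Both rules consume successive states of the same run, whereas a soup averages the final weights of separate runs.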

WiSE-FT, from an overlapping group of authors, interpolates between a zero-shot model and its fine-tuned version to trade off in-distribution and out-of-distribution accuracy; it is essentially a two-point soup with the pretrained checkpoint as one ingredient ^[5].

Task arithmetic generalizes the idea to "task vectors," the difference between fine-tuned and pretrained weights, which can be added or subtracted to compose or remove capabilities ^[6]. When task vectors come from the same task and are averaged with equal weight, the construction reduces to a uniform soup. Later merging methods such as TIES-Merging and DARE extend averaging to models fine-tuned on different tasks by resolving sign conflicts and parameter interference ^[7]. Model soups are widely regarded as a foundational precursor to this model merging literature.
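The relationships described in this section can be checked numerically on toy weights. In the sketch below, `task_vector` and `apply_task_vectors` are hypothetical helper names and the weights are flat lists of floats. It shows that equal-weight averaging of same-task vectors reproduces the uniform soup, and that scaling a single task vector gives the two-point interpolation used by WiSE-FT.

```python
from typing import List

Weights = List[float]  # flattened parameters, a stand-in for a real checkpoint


def task_vector(theta_ft: Weights, theta_0: Weights) -> Weights:
    """Difference between fine-tuned and pretrained weights."""
    return [f - p for f, p in zip(theta_ft, theta_0)]


def apply_task_vectors(theta_0: Weights, vectors: List[Weights],
                       scale: float) -> Weights:
    """Add the scaled sum of task vectors to the pretrained weights."""
    summed = [sum(column) for column in zip(*vectors)]
    return [p + scale * s for p, s in zip(theta_0, summed)]


theta_0 = [1.0, 2.0]                     # shared pretrained initialization
fine_tuned = [[2.0, 2.0], [4.0, 6.0]]    # two runs started from theta_0
vectors = [task_vector(theta, theta_0) for theta in fine_tuned]

# Weight 1/k on k same-task vectors reproduces the uniform soup.
merged = apply_task_vectors(theta_0, vectors, scale=1.0 / len(vectors))
uniform = [sum(column) / len(fine_tuned) for column in zip(*fine_tuned)]

# One task vector scaled by alpha is a pretrained/fine-tuned interpolation.
alpha = 0.25
wise_ft = apply_task_vectors(theta_0, [vectors[0]], scale=alpha)
interpolated = [(1.0 - alpha) * p + alpha * f
                for p, f in zip(theta_0, fine_tuned[0])]
```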

Limitations

The method's effectiveness rests on assumptions that bound where it applies ^[1]:

Shared initialization is required. Averaging models trained from scratch with different random initializations generally fails, because they occupy different loss basins and the linear path between them passes through high-loss regions.
The uniform soup can be worse than the best single model when poor ingredients are included. The greedy and learned soups mitigate this, but they require a held-out validation set and extra evaluation passes at construction time.
The learned soup additionally needs all candidate models resident in memory at once.
The equivalence to ensembling is only approximate; it holds best in the flat-loss, confident-prediction regime, and ensembles can still win on in-distribution accuracy.
Soups do not save training compute. The full hyperparameter sweep still has to be run; the saving is that no model is wasted and inference stays the cost of a single network.
The strongest evidence is for fine-tuning large pretrained models. Benefits for small models or for training from scratch are less established.

References

1. Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., and Schmidt, L. "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time." ICML 2022. arXiv:2203.05482. https://arxiv.org/abs/2203.05482
2. Proceedings of the 39th International Conference on Machine Learning, PMLR 162, 2022. https://proceedings.mlr.press/v162/wortsman22a.html
3. mlfoundations. "model-soups" reference implementation. https://github.com/mlfoundations/model-soups
4. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. "Averaging Weights Leads to Wider Optima and Better Generalization." UAI 2018. arXiv:1803.05407. https://arxiv.org/abs/1803.05407
5. Wortsman, M., Ilharco, G., Kim, J. W., et al. "Robust fine-tuning of zero-shot models" (WiSE-FT). CVPR 2022. arXiv:2109.01903. https://arxiv.org/abs/2109.01903
6. Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., and Farhadi, A. "Editing Models with Task Arithmetic." ICLR 2023. arXiv:2212.04089. https://arxiv.org/abs/2212.04089
7. Yadav, P., Tam, D., Choshen, L., Raffel, C., and Bansal, M. "TIES-Merging: Resolving Interference When Merging Models." NeurIPS 2023. arXiv:2306.01708. https://arxiv.org/abs/2306.01708
8. Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. "Linear Mode Connectivity and the Lottery Ticket Hypothesis." ICML 2020. arXiv:1912.05671. https://arxiv.org/abs/1912.05671
9. Radford, A., Kim, J. W., Hallacy, C., et al. "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). ICML 2021. arXiv:2103.00020. https://arxiv.org/abs/2103.00020
10. Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. "Scaling Vision Transformers." CVPR 2022. arXiv:2106.04560. https://arxiv.org/abs/2106.04560
11. Dai, Z., Liu, H., Le, Q. V., and Tan, M. "CoAtNet: Marrying Convolution and Attention for All Data Sizes." NeurIPS 2021. arXiv:2106.04803. https://arxiv.org/abs/2106.04803
