Model soups
Last reviewed
Jun 8, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 1,683 words
A model soup is a single model obtained by averaging the parameters of several independently fine-tuned neural networks, a form of weight averaging (model merging). The technique was introduced by Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre and colleagues from the University of Washington, Columbia University, Google Research, Meta AI Research and Tel Aviv University, in the paper "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time," presented at the International Conference on Machine Learning (ICML) in 2022 [1][2]. A public reference implementation is available [3].
The conventional recipe for getting the best model out of a fine-tuning hyperparameter sweep is to train many candidate models, evaluate each on a held-out validation set, keep the single best, and discard the rest. Model soups revisit that final selection step. Instead of throwing away the runner-up models, the method averages the weights of several candidates that were all fine-tuned from the same pretrained initialization but with different hyperparameters such as learning rate, data augmentation, random seed and training length. Because the output is a single set of weights, a soup has exactly the same inference and memory cost as any one of its ingredients, unlike an ensemble, which must run and store every member [1].
The central empirical finding is that this averaged model is frequently more accurate and more robust than the best individual ingredient. Applied to a Vision Transformer ViT-G/14 pretrained on JFT-3B and fine-tuned on ImageNet, a model soup reached 90.94% top-1 accuracy, a state-of-the-art result at the time of publication [1].
Let theta_0 be a shared pretrained initialization, for example a CLIP image encoder [9], and let theta_1, ..., theta_k be the weight vectors produced by k independent fine-tuning runs that all start from theta_0 and differ only in hyperparameters. A soup is a weighted average of a subset of these vectors. The paper proposes three recipes [1].
The uniform soup is the plain mean of all k candidates: theta_soup = (1/k)(theta_1 + ... + theta_k). It requires no validation data and is trivially cheap to compute. Its weakness is that every ingredient is weighted equally, so a few weak or divergent runs can pull the average below the best single model.
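The uniform recipe can be sketched in a few lines. This is a minimal illustration rather than the reference implementation: the `Weights` alias (a dictionary of flat float lists) is a hypothetical stand-in for a real framework's state dict, and the three ingredients are made-up values.

```python
from typing import Dict, List

# Hypothetical stand-in for a framework state dict: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def uniform_soup(ingredients: List[Weights]) -> Weights:
    """Return the element-wise mean of all ingredient weight sets."""
    if not ingredients:
        raise ValueError("need at least one ingredient")
    k = len(ingredients)
    soup: Weights = {}
    for name in ingredients[0]:
        # Pair up the same coordinate of the same parameter across all k models.
        columns = zip(*(model[name] for model in ingredients))
        soup[name] = [sum(values) / k for values in columns]
    return soup


# Made-up ingredients that share the same parameter names and shapes.
theta_1 = {"w": [1.0, 2.0], "b": [0.0]}
theta_2 = {"w": [3.0, 4.0], "b": [1.0]}
theta_3 = {"w": [5.0, 0.0], "b": [2.0]}

soup = uniform_soup([theta_1, theta_2, theta_3])
print(soup)  # {'w': [3.0, 2.0], 'b': [1.0]}
```

The sketch assumes every ingredient has identical parameter names and shapes, which holds when all runs fine-tune the same architecture.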
The greedy soup is constructed so that it can never be worse than the best ingredient on validation data. The candidates are first sorted in decreasing order of held-out validation accuracy. The soup starts empty, so the top-ranked model is always the first ingredient; each remaining model is then considered in turn and folded into the running average only if doing so does not decrease accuracy on the held-out set. Models that would hurt the soup are skipped. Because the procedure begins with the strongest model and accepts a new ingredient only when held-out accuracy does not drop, the greedy soup is guaranteed to be at least as accurate as the single best model on the held-out set, and in practice it usually exceeds it [1]. This is the recipe the authors recommend by default.
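The greedy procedure can be sketched as follows. The threshold classifier, the held-out set and the three candidates are invented for illustration, and `Weights`, `average`, `greedy_soup` and `accuracy` are hypothetical names rather than functions from the reference implementation.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a framework state dict: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def average(models: List[Weights]) -> Weights:
    """Element-wise mean of several weight sets."""
    k = len(models)
    return {
        name: [sum(vals) / k for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


def greedy_soup(candidates: List[Weights], evaluate: Callable[[Weights], float]) -> Weights:
    """Add candidates in order of held-out score, keeping those that do not lower it."""
    ranked = sorted(candidates, key=evaluate, reverse=True)
    ingredients = [ranked[0]]            # the best single model is always included
    best_score = evaluate(ranked[0])
    for candidate in ranked[1:]:
        trial_score = evaluate(average(ingredients + [candidate]))
        if trial_score >= best_score:    # accept only if held-out accuracy does not drop
            ingredients.append(candidate)
            best_score = trial_score
    return average(ingredients)


# Toy held-out set (assumption): the label is 1 exactly when x > 1.0.
HELD_OUT = [(0.0, 0), (0.5, 0), (0.9, 0), (1.1, 1), (1.5, 1), (2.0, 1), (2.5, 1), (3.0, 1)]


def accuracy(weights: Weights) -> float:
    """Accuracy of the toy classifier 'predict 1 if x > t' on HELD_OUT."""
    t = weights["t"][0]
    hits = sum(int(x > t) == y for x, y in HELD_OUT)
    return hits / len(HELD_OUT)


candidates = [{"t": [0.8]}, {"t": [1.3]}, {"t": [2.7]}]
soup = greedy_soup(candidates, accuracy)
print(soup, accuracy(soup))
print(accuracy(average(candidates)))  # uniform soup of all three, for comparison
```

In this toy setup the best single candidate scores 0.875 and the uniform soup of all three scores 0.75, while the greedy soup skips the outlier threshold and reaches 1.0. Re-evaluating the running average at every step costs one held-out pass per candidate, which is the price of the guarantee.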
The learned soup is a more expensive variant that does not simply include or exclude each model but instead learns continuous interpolation coefficients, optionally one per layer, by gradient-based minibatch optimization against the held-out set. It can extract a little more accuracy, but it requires holding all candidate models in memory at the same time during construction [1].
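The idea of learning the mixing coefficients can be illustrated on a deliberately tiny problem. The sketch below is not the paper's procedure: it assumes a scalar model y = theta * x, a squared-error loss, full-batch gradient descent, and a softmax parameterization that keeps the coefficients non-negative and summing to one. The function name, data and step size are all hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def learn_soup_coefficients(thetas: List[float], xs: List[float], ys: List[float],
                            steps: int = 500, lr: float = 0.1) -> List[float]:
    """Gradient descent on mixing logits for the toy scalar model y = theta * x."""
    logits = [0.0] * len(thetas)          # zero logits correspond to the uniform soup
    for _ in range(steps):
        alphas = softmax(logits)
        theta = sum(a * t for a, t in zip(alphas, thetas))
        # Gradient of the mean squared error with respect to the mixed weight.
        grad_theta = sum(2.0 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Chain rule through the softmax: d(theta)/d(logit_j) = alpha_j * (theta_j - theta).
        logits = [
            logit - lr * grad_theta * a * (t - theta)
            for logit, a, t in zip(logits, alphas, thetas)
        ]
    return softmax(logits)


# Three made-up ingredients and a held-out set generated by y = 2.5 * x.
thetas = [1.0, 2.0, 4.0]
xs = [1.0, 2.0, 3.0]
ys = [2.5, 5.0, 7.5]

alphas = learn_soup_coefficients(thetas, xs, ys)
theta_mix = sum(a * t for a, t in zip(alphas, thetas))
print(alphas, theta_mix)
```

No single ingredient and no equal-weight average matches the held-out data here, but a weighted mixture can, which is the extra freedom the learned recipe buys. With real networks, every candidate's weights must be available to compute the mixture at each step, which is the memory cost noted above.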
Averaging the weights of two arbitrary neural networks usually yields a useless model, because the loss landscape is highly non-convex and independently trained networks settle into different, disconnected basins. The precondition that makes soups work is shared initialization. Models that are all fine-tuned from the same pretrained checkpoint tend to remain inside a single low-error region of weight space, a property closely tied to linear mode connectivity: the straight-line path between two such solutions stays low in loss, so points along it, including the midpoint, are themselves good models [1][8].
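A common way to probe this property is to evaluate the loss along the straight line between two solutions. The sketch below uses two invented one-parameter losses, one with a single basin and one with two separate minima. The barrier definition (worst excess of the on-path loss over the linear interpolation of the endpoint losses) is one convention among several, and all names are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a framework state dict: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def interpolate(theta_a: Weights, theta_b: Weights, t: float) -> Weights:
    """Point at fraction t along the straight line from theta_a to theta_b."""
    return {
        name: [(1.0 - t) * a + t * b for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


def loss_barrier(theta_a: Weights, theta_b: Weights,
                 loss: Callable[[Weights], float], num_points: int = 11) -> float:
    """Largest excess of on-path loss over the interpolated endpoint losses."""
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    gaps = []
    for i in range(num_points):
        t = i / (num_points - 1)
        on_path = loss(interpolate(theta_a, theta_b, t))
        gaps.append(on_path - ((1.0 - t) * loss_a + t * loss_b))
    return max(gaps)


def single_basin(theta: Weights) -> float:
    """Toy loss with one minimum at w = 1."""
    return (theta["w"][0] - 1.0) ** 2


def double_well(theta: Weights) -> float:
    """Toy loss with separate minima at w = -1 and w = +1."""
    return (theta["w"][0] ** 2 - 1.0) ** 2


same_basin = loss_barrier({"w": [0.5]}, {"w": [1.5]}, single_basin)
separate_basins = loss_barrier({"w": [-1.0]}, {"w": [1.0]}, double_well)
print(same_basin, separate_basins)
```

The first pair shows no barrier, so its midpoint is a usable model. The second pair has zero loss at both endpoints but a high loss at the midpoint, which is the failure mode of averaging networks that sit in different basins.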
Within such a basin, averaging weights approximates averaging outputs. The paper gives an analysis that relates the gap between a weight-averaged model and a logit ensemble of the same models to two quantities: the flatness of the loss (low local curvature, as measured by its second derivative) and the confidence of the predictions. When the loss is locally flat and the predictions are confident, the two behave alike, so one averaged set of weights captures much of the benefit an ensemble would deliver, at the cost of only a single forward pass [1]. Averaging can also land in a flatter part of the loss surface than any individual endpoint, which is associated with better generalization. Empirically, the authors observe that the accuracy advantage of interpolating two fine-tuned models is correlated with how flat the loss is along the path between them [1].
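The claim that weight averaging approximates output averaging for nearby solutions can be checked numerically on a toy model. The sketch below does not reproduce the paper's analysis: it assumes a two-parameter model v * tanh(w * x) and simply compares the soup's output with the ensemble's output for a close pair and a distant pair of invented weight settings.

```python
import math
from typing import Tuple

Theta = Tuple[float, float]  # hypothetical (w, v) parameters of the toy model


def model(x: float, w: float, v: float) -> float:
    """Toy nonlinear model standing in for a network's logit."""
    return v * math.tanh(w * x)


def soup_vs_ensemble_gap(theta_a: Theta, theta_b: Theta, x: float) -> float:
    """Absolute difference between the weight-averaged output and the averaged outputs."""
    ensemble = 0.5 * (model(x, *theta_a) + model(x, *theta_b))
    w_avg = 0.5 * (theta_a[0] + theta_b[0])
    v_avg = 0.5 * (theta_a[1] + theta_b[1])
    soup = model(x, w_avg, v_avg)
    return abs(soup - ensemble)


x = 0.5
base = (1.0, 1.0)
near = (1.1, 0.9)    # small perturbation of base, as after similar fine-tuning runs
far = (3.0, -1.0)    # unrelated solution

gap_near = soup_vs_ensemble_gap(base, near, x)
gap_far = soup_vs_ensemble_gap(base, far, x)
print(gap_near, gap_far)
```

For the nearby pair the soup and the ensemble agree closely, while for the distant pair they diverge by more than two orders of magnitude. If the model were exactly linear in its weights, the two would coincide, which is the intuition behind tying the gap to curvature.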
The original work validated soups across image and text models. For a CLIP ViT-B/32 and an ALIGN EfficientNet-L2 fine-tuned on ImageNet, the greedy soup beat the best model from the hyperparameter sweep by 0.7 and 0.5 percentage points respectively [1]. Gains also appeared on five natural distribution shift benchmarks (ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet and ImageNet-A), where soups often improved robustness by more than they improved in-distribution accuracy. The headline result was the ViT-G/14 greedy soup reaching 90.94% top-1 on ImageNet, surpassing the previous best of 90.88% set by CoAtNet [1][10][11]. The approach also extended to fine-tuned transformer text classifiers and improved zero-shot transfer to new downstream tasks [1].
| Setting | Outcome |
|---|---|
| CLIP ViT-B/32 on ImageNet, greedy soup vs best single model | +0.7 percentage points |
| ALIGN EfficientNet-L2 on ImageNet, greedy soup vs best single model | +0.5 percentage points |
| ViT-G/14 (JFT-3B pretrained), greedy soup, ImageNet top-1 | 90.94% (new state of the art; prior best 90.88%) |
| Inference and memory cost vs one fine-tuned model | Identical |
| Inference and memory cost vs a k-member ensemble | Roughly 1/k (one network instead of k) |
One caveat recurs across these experiments: a full output ensemble can still edge out a soup on in-distribution accuracy. Soups nonetheless tend to match or beat ensembles under distribution shift, and they do so at the inference cost of a single model [1].
Model soups sit within a broader family of weight-averaging and model merging techniques.
Stochastic weight averaging (SWA) averages weights sampled along a single training trajectory under a cyclical or high constant learning rate [4]. Exponential moving averaging (EMA) and the classical Polyak-Ruppert averages similarly smooth the weights of one run over time; the soups paper in fact stores an EMA of each run as a candidate ingredient. Soups differ from all of these by averaging the endpoints of multiple independent runs that differ in hyperparameters, rather than snapshots from one run.
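The single-run averages mentioned here differ mainly in how they weight the snapshots. The sketch below shows a running equal-weight mean, as used conceptually by SWA, next to an exponential moving average. Both functions, the decay value and the snapshot values are illustrative assumptions, not any library's API.

```python
from typing import Dict, List

# Hypothetical stand-in for a framework state dict: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def running_mean_update(mean: Weights, snapshot: Weights, num_averaged: int) -> Weights:
    """Equal-weight running mean over snapshots from one training trajectory."""
    return {
        name: [m + (s - m) / (num_averaged + 1) for m, s in zip(mean[name], snapshot[name])]
        for name in mean
    }


def ema_update(ema: Weights, snapshot: Weights, decay: float) -> Weights:
    """Exponential moving average: recent snapshots count more than old ones."""
    return {
        name: [decay * e + (1.0 - decay) * s for e, s in zip(ema[name], snapshot[name])]
        for name in ema
    }


# Made-up snapshots taken at three points of a single run.
snapshots = [{"w": [1.0]}, {"w": [2.0]}, {"w": [3.0]}]

swa = {"w": [0.0]}
for n, snap in enumerate(snapshots):
    swa = running_mean_update(swa, snap, n)

ema = snapshots[0]
for snap in snapshots[1:]:
    ema = ema_update(ema, snap, decay=0.5)

print(swa, ema)  # {'w': [2.0]} {'w': [2.25]}
```

Both updates consume snapshots from one trajectory in order. A soup instead averages the final weights of separate runs, so it needs no particular ordering and no special learning-rate schedule.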
WiSE-FT, from an overlapping group of authors, interpolates between a zero-shot model and its fine-tuned version to trade off in-distribution and out-of-distribution accuracy; it is essentially a two-point soup with the pretrained checkpoint as one ingredient [5].
Task arithmetic generalizes the idea to "task vectors," the difference between fine-tuned and pretrained weights, which can be added or subtracted to compose or remove capabilities [6]. When task vectors come from the same task and are averaged with equal weight, the construction reduces to a uniform soup. Later merging methods such as TIES-Merging and DARE extend averaging to models fine-tuned on different tasks by resolving sign conflicts and parameter interference [7]. Model soups are widely regarded as a foundational precursor to this model merging literature.
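The relationship between task vectors, uniform soups and WiSE-FT-style interpolation can be verified on small made-up weights. The helper names below are hypothetical. The sketch assumes all models share the pretrained initialization `theta_0` and the same parameter shapes.

```python
from typing import Dict, List

# Hypothetical stand-in for a framework state dict: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def task_vector(theta_ft: Weights, theta_0: Weights) -> Weights:
    """Difference between fine-tuned and pretrained weights."""
    return {n: [f - p for f, p in zip(theta_ft[n], theta_0[n])] for n in theta_0}


def apply_task_vectors(theta_0: Weights, vectors: List[Weights], scale: float) -> Weights:
    """Add the scaled sum of task vectors to the pretrained weights."""
    return {
        n: [p + scale * sum(vals) for p, vals in zip(theta_0[n], zip(*(v[n] for v in vectors)))]
        for n in theta_0
    }


theta_0 = {"w": [1.0, 1.0]}   # shared pretrained initialization (made-up values)
ft_a = {"w": [2.0, 0.0]}      # two fine-tuned models for the same task
ft_b = {"w": [4.0, 2.0]}

vectors = [task_vector(ft_a, theta_0), task_vector(ft_b, theta_0)]

# Equal-weight average of task vectors, added back to theta_0 ...
merged = apply_task_vectors(theta_0, vectors, scale=1.0 / len(vectors))
# ... matches the plain mean of the fine-tuned weights (a uniform soup).
uniform = {"w": [(a + b) / 2 for a, b in zip(ft_a["w"], ft_b["w"])]}

# Scaling a single task vector by alpha gives the two-point interpolation
# (1 - alpha) * theta_0 + alpha * theta_ft described above for WiSE-FT.
alpha = 0.25
wise_ft = apply_task_vectors(theta_0, [task_vector(ft_a, theta_0)], scale=alpha)

print(merged, uniform, wise_ft)
```

Writing all three constructions in terms of `theta_0` plus scaled task vectors makes the family resemblance explicit: they differ only in which vectors are included and how they are scaled.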
The method's effectiveness rests on assumptions that bound where it applies [1]: