DoReMi
Last reviewed
Jun 8, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,900 words
DoReMi (Domain Reweighting with Minimax Optimization) is a method for automatically choosing the proportions, or "domain weights," of each data source in a pretraining corpus for a large language model. It was introduced in the paper "DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining" by Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, and Adams Wei Yu, a collaboration between Stanford University and Google (Google Brain and what is now Google DeepMind), first posted in May 2023 and presented at NeurIPS 2023 [1]. Xie, Liang, and Ma were at Stanford, with Xie working as a student researcher at Google; the remaining authors were at Google [1][9].
The central idea is to train one small model to discover a good data mixture, then reuse that fixed mixture to train a much larger model without tuning the mixture at the large scale. In the headline experiment, a 280M-parameter proxy model set the mixture for an 8B-parameter model (about 30 times larger). For models trained on The Pile, the resulting mixture improved the 8B model's average few-shot downstream accuracy by 6.5 percentage points and let it reach the baseline model's accuracy in 2.6 times fewer training steps [1]. The method takes its name from how the mixture is found: a proxy model is trained with group distributionally robust optimization (Group DRO), a minimax procedure in which an adversary repeatedly shifts weight toward the domains where the proxy still has the most room to improve. The sequence of weights the adversary visits, averaged over the run, becomes the output mixture [1].
Modern pretraining corpora are assembled from many heterogeneous sources. The Pile, for example, combines 22 domains spanning filtered web text (Pile-CC), books, code from GitHub, academic papers from ArXiv and PubMed Central, question-and-answer text from StackExchange, and more [2]. The fraction of training tokens drawn from each domain is a major design choice: it strongly affects the final model's quality on both perplexity and downstream tasks, and getting it wrong wastes compute on data that does not transfer [1].
In practice these proportions were usually set by heuristics. The Pile, for instance, fixes weights largely by the available token counts of each source combined with manual judgments about quality [2]. A more principled alternative is to tune the mixture against a set of target downstream tasks, but that is expensive, requires training many candidate models, and bakes in assumptions about which tasks matter. DoReMi was designed to find a strong, general-purpose mixture cheaply and without any knowledge of downstream tasks, so that the same weights can be reused across model scales [1].
DoReMi runs in three stages. The first two use only small models; the expensive third stage trains the target large model once, on the optimized mixture.
In Stage 1, a small reference model (280M parameters in the main experiments) is trained normally on the baseline domain weights, for example the corpus's default proportions. This reference is not the deliverable. Its purpose is to record, for every training example, how hard that example is under a standard mixture. Those per-token reference losses, the negative log-likelihoods, define a per-domain baseline of achievable loss [1].
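As a rough illustration of what the reference run records, the sketch below computes per-token negative log-likelihoods from a matrix of logits using NumPy. The function name `per_token_nll`, the array shapes, and the toy inputs are assumptions made for illustration; the paper does not prescribe this code.

```python
import numpy as np

def per_token_nll(logits: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Per-token negative log-likelihood (in nats).

    logits:  (num_tokens, vocab_size) unnormalized scores from a model
    targets: (num_tokens,) integer ids of the tokens that actually occurred
    """
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each observed token.
    return -log_probs[np.arange(len(targets)), targets]

# Toy input (hypothetical): three tokens, a four-token vocabulary, uniform logits.
toy_logits = np.zeros((3, 4))
toy_targets = np.array([0, 1, 2])
ref_losses = per_token_nll(toy_logits, toy_targets)
```

With uniform logits over a four-token vocabulary, every token's loss equals log 4, about 1.386 nats. In a real run these values would come from the trained reference model and be stored alongside each example's domain label.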
In Stage 2, a proxy model of the same small size is trained with Group DRO, the group distributionally robust optimization procedure of Sagawa and colleagues [3]. The key quantity is the excess loss on each example: the proxy model's loss minus the reference model's loss. Subtracting the reference is what makes DoReMi work. A naive worst-case objective would simply chase the domains with the highest absolute loss, which tend to consist of inherently high-entropy or noisy text that no model can predict well. Excess loss instead measures the headroom a domain still offers, so the optimizer concentrates on domains where additional weighting actually yields improvement [1].
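The excess-loss idea can be sketched in a few lines of NumPy. The function name `per_domain_excess_loss`, the flat per-token arrays, and the toy numbers are assumptions for illustration. Clipping negative values to zero is offered as an option on the assumption that it matches the published algorithm.

```python
import numpy as np

def per_domain_excess_loss(proxy_losses, ref_losses, domain_ids,
                           num_domains, clip_at_zero=True):
    """Average (proxy loss - reference loss) per domain.

    proxy_losses, ref_losses: per-token losses, same shape
    domain_ids: integer domain label for each token
    """
    excess = np.asarray(proxy_losses, dtype=float) - np.asarray(ref_losses, dtype=float)
    if clip_at_zero:
        # Tokens the proxy already predicts better than the reference add nothing.
        excess = np.maximum(excess, 0.0)
    totals = np.bincount(domain_ids, weights=excess, minlength=num_domains)
    counts = np.bincount(domain_ids, minlength=num_domains)
    # Guard against empty domains in a batch.
    return totals / np.maximum(counts, 1)

# Toy batch (hypothetical): domain 1 is noisy, so its absolute loss is high
# for both models, but the proxy has no headroom left there.
proxy = np.array([3.0, 2.5, 6.0, 6.1])
ref = np.array([2.0, 2.0, 6.0, 6.2])
domains = np.array([0, 0, 1, 1])
excess_by_domain = per_domain_excess_loss(proxy, ref, domains, num_domains=2)
```

In the toy batch, domain 1 has the higher absolute loss but zero excess loss, so only domain 0 would attract extra weight from the adversary.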
Training solves a minimax problem over the weighted sum of each domain's average excess loss: the proxy parameters are chosen to minimize that sum, while the domain weights, constrained to the probability simplex, are chosen to maximize it. The proxy descends on its parameters while an adversary updates the domain weights by exponentiated gradient ascent, a multiplicative-weights rule that multiplies each domain's weight by exp(step size times that domain's excess loss), so domains with high excess loss gain weight. After each update the weights are renormalized and lightly smoothed toward a uniform distribution (with a small constant of about 0.001) so that no domain collapses to zero [1]. Crucially, the proxy model itself is discarded. What DoReMi keeps is the trajectory of domain weights: the final mixture is the average of the weights across all training steps, which is the standard way to read a single solution out of this kind of online minimax game [1].
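A minimal sketch of the adversary's side of this game, under stated assumptions: the function names, the default step size of 1.0, and the fixed toy excess losses are illustrative choices, and the proxy's own gradient step is omitted. In an actual run the excess losses would be recomputed from the proxy at every step.

```python
import numpy as np

def update_domain_weights(weights, excess_loss, step_size=1.0, smoothing=1e-3):
    """One exponentiated-gradient ascent step on the domain weights."""
    # Multiplicative update, done in log space for numerical stability.
    log_w = np.log(weights) + step_size * np.asarray(excess_loss, dtype=float)
    log_w -= log_w.max()
    new_w = np.exp(log_w)
    new_w /= new_w.sum()                      # renormalize onto the simplex
    uniform = np.full_like(new_w, 1.0 / len(new_w))
    # Mix in a little uniform mass so no domain's weight reaches zero.
    return (1.0 - smoothing) * new_w + smoothing * uniform

def average_weights(trajectory):
    """Final mixture: the mean of the weights visited during training."""
    return np.mean(trajectory, axis=0)

# Toy run (hypothetical): three domains, excess losses held fixed.
weights = np.full(3, 1.0 / 3.0)
toy_excess = np.array([0.8, 0.1, 0.0])
trajectory = []
for _ in range(5):
    weights = update_domain_weights(weights, toy_excess)
    trajectory.append(weights)
final_mixture = average_weights(trajectory)
```

Averaging the trajectory, rather than taking the last iterate, keeps the output from depending on wherever the adversary happened to be when training stopped.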
In Stage 3, the averaged domain weights are used to resample the corpus, and the large target model is trained once on this reweighted data with an ordinary objective. Because the mixture is a fixed list of proportions, it carries no extra cost at the large scale beyond the one-time price of the two small runs in Stages 1 and 2 [1].
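One simple way to apply a fixed mixture is to draw the source domain of each training example from the weight vector. The sketch below does this with NumPy; the function name `sample_domain_schedule`, the three-domain weights, and the seed are hypothetical, and production data loaders typically implement resampling differently.

```python
import numpy as np

def sample_domain_schedule(domain_weights, num_examples, seed=0):
    """Draw a domain index for each training example according to the mixture."""
    rng = np.random.default_rng(seed)
    p = np.asarray(domain_weights, dtype=float)
    p = p / p.sum()  # tolerate weights that do not sum exactly to 1
    return rng.choice(len(p), size=num_examples, p=p)

# Toy mixture (hypothetical): three domains.
mixture = [0.6, 0.3, 0.1]
schedule = sample_domain_schedule(mixture, num_examples=10_000)
empirical = np.bincount(schedule, minlength=len(mixture)) / len(schedule)
```

The empirical proportions in `empirical` should sit close to the requested mixture, and the schedule can be computed once because the weights do not change during the large run.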
On The Pile, DoReMi's optimized weights departed sharply from the defaults. The proxy moved a large amount of weight onto general web text and away from several narrow technical sources, while still giving proportionally large boosts to small underrepresented domains such as YouTubeSubtitles and PhilPapers [1]. Selected weights from the 280M proxy run:
| Pile domain | Baseline weight | DoReMi weight | Direction |
|---|---|---|---|
| Pile-CC (filtered web text) | 0.112 | 0.606 | strong increase |
| ArXiv | 0.105 | 0.004 | strong decrease |
| PubMed Central | 0.107 | 0.005 | strong decrease |
| StackExchange | 0.093 | 0.015 | decrease |
The most striking finding is that the reweighting was not a simple trade of one domain's quality for another's. The 8B model trained on DoReMi weights improved log-perplexity on all 22 Pile domains, including the technical domains whose weights were cut by more than an order of magnitude [1]. In other words, training on far fewer ArXiv or PubMed tokens still lowered perplexity on ArXiv and PubMed, because a better overall mixture transfers across domains. Alongside the perplexity gains, the model improved average one-shot downstream accuracy by 6.5 percentage points over the baseline and matched the baseline's accuracy in 2.6 times fewer steps [1].
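The size of these shifts can be checked directly from the rounded values in the table above. The short script below is only an arithmetic aid; the dictionary layout and variable names are assumptions, and because the inputs are rounded to three decimals the ratios are approximate.

```python
# (baseline weight, DoReMi weight), copied from the table of selected Pile domains.
selected_weights = {
    "Pile-CC": (0.112, 0.606),
    "ArXiv": (0.105, 0.004),
    "PubMed Central": (0.107, 0.005),
    "StackExchange": (0.093, 0.015),
}

# Ratio of the DoReMi weight to the baseline weight for each domain.
ratios = {name: doremi / baseline
          for name, (baseline, doremi) in selected_weights.items()}

# Domains whose weight fell by more than a factor of ten.
cut_over_10x = [name for name, r in ratios.items() if r < 0.1]

for name, r in ratios.items():
    direction = "increase" if r >= 1 else "decrease"
    factor = r if r >= 1 else 1 / r
    print(f"{name}: roughly {factor:.0f}x {direction}")
```

On these rounded figures, ArXiv and PubMed Central are the two listed domains cut by more than an order of magnitude, while Pile-CC grows by a factor of about five.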
DoReMi was also tested on the proprietary GLaM dataset, which has 8 domains [1][4]. There the authors ran DoReMi for a second iteration, using the first round's weights as the reference domain weights for the next round. The resulting mixture matched the performance of weights the GLaM team had hand-tuned using downstream tasks, even though DoReMi never observed any downstream task during optimization [1][4]. This task-agnostic property is one of the method's main selling points.
DoReMi sits within the broader field of data curation for pretraining, but it operates at the level of whole domains rather than individual examples. It assumes a fixed partition of the corpus into domains and only adjusts their proportions; it does not filter, deduplicate, or score individual documents. This distinguishes it from example-level data selection methods such as DSIR (Data Selection via Importance Resampling), introduced by Xie and several of the same collaborators, which chooses individual documents to match a target distribution [5]. In a full pipeline the two are complementary: quality filtering and selection decide which documents enter each domain, while DoReMi decides how much of each domain to sample.
DoReMi's use of Group DRO is itself a repurposing. Group DRO was created to make a single model robust to worst-case subpopulation shift [3]; DoReMi instead treats the adversary's averaged weighting as a recipe for training a different, larger model. The method helped popularize the idea of using cheap proxy models to optimize the data mixture, and several later approaches built on or competed with it. RegMix casts mixture selection as a regression problem, training many tiny models on varied mixtures and fitting a predictor of the best one, and reports matching or exceeding DoReMi while using roughly a tenth of the compute, though it optimizes against target tasks rather than being task-agnostic [6]. Data Mixing Laws fit functional forms that predict performance directly from mixture proportions, allowing the optimum to be estimated from small-scale runs [7]. Other lines of work, such as DoGE, estimate each domain's contribution without a separate reference model, aiming to simplify the DoReMi recipe [6][7].
DoReMi has several practical caveats. It requires two extra training runs (the reference and the proxy) before the main run; the authors argue this overhead is small relative to a large pretraining job and is a one-time, reusable cost, but it is still nontrivial, and the per-run cost grows with the number of domains [1]. The method depends on a sensible predefined domain partition: it can reweight the groups it is given but cannot discover better groupings on its own, which motivated later work on regrouping data before reweighting.
The most discussed limitation is sensitivity to proxy scale. The mixture DoReMi recovers can change substantially with the size of the proxy model, so weights tuned at one scale may not be optimal at another. Later analyses reported, for example, that the weight placed on general web text can swing widely between a smaller and a larger proxy, undercutting the assumption that a single small proxy yields a universally good mixture [6]. Consistent with this scale dependence, the authors' own public reimplementation, using 120M rather than 280M models, reported improved or comparable perplexity on 15 of the 22 Pile domains rather than all 22 [8]. Finally, because DoReMi is deliberately task-agnostic, it cannot exploit knowledge of a specific downstream application; when the target tasks are known in advance, methods that optimize the mixture against those tasks can do better [6].