Selective Language Modeling (Rho-1)
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,735 words
Selective Language Modeling (SLM) is a pretraining objective for language models that applies the training loss to a chosen subset of tokens rather than to every token in the corpus. Standard practice minimizes next-token prediction error uniformly across all positions; SLM instead scores each token with the help of a separate, small reference model and back-propagates loss only through the highest-scoring tokens. The method was introduced in the 2024 paper "Rho-1: Not All Tokens Are What You Need," by Zhenghao Lin, Zhibin Gou, and co-authors from Microsoft Research together with collaborators at Xiamen University and Tsinghua University [1][3]. The paper's subtitle doubles as the slogan for the idea: not all tokens are what you need.
The models trained with the technique are called Rho-1 (written with the Greek letter rho, and released as the "rho-math" model series). Continually pretrained on a mathematics corpus of roughly 15 billion tokens, Rho-1 matched the math performance of much larger pretraining runs while using a small fraction of the tokens, and it produced what the authors reported as the first 1B-parameter model to exceed 40 percent accuracy on the MATH benchmark [1]. The paper was selected for an oral presentation and named an Outstanding Main Track Runner-Up at NeurIPS 2024 [5].
A standard causal language model is trained with a uniform next-token cross-entropy loss: every token in the corpus contributes equally to the gradient, and the model is optimized to predict each one in turn. The Rho-1 authors questioned the implicit assumption behind this objective, namely that every token is equally worth learning [1].
To test that assumption, they tracked the per-token loss of a model over the course of a normal pretraining run and sorted tokens by how their loss changed from early to late training. They found four distinct patterns, in roughly the following proportions [1]:
| Category | Share of tokens | Loss behavior during training | Interpretation |
|---|---|---|---|
| H to L | about 26 percent | starts high, falls to low | genuinely learned; the valuable signal |
| L to L | about 51 percent | already low throughout | already known; little new signal |
| H to H | about 11 percent | stays high | hard or noisy; rarely fit |
| L to H | about 12 percent | rises during training | unstable, often noise |
The implication is that the majority of tokens fall into the "L to L" bucket and are already predictable, while a meaningful fraction are noisy ("H to H" and "L to H") and may actively hurt training. Only about a quarter of tokens show the desirable trajectory of being learned during training. Spending equal optimization effort on already-learned and noisy tokens is, on this view, wasteful at best and harmful at worst. SLM is the proposed fix: concentrate the loss on the tokens that are both learnable and not yet learned.
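The four-way split can be illustrated with a minimal sketch. It assumes that "high" and "low" are decided by a single fixed loss threshold applied to an early and a late checkpoint; that rule, the threshold value, the function names, and the toy loss pairs are all assumptions made for illustration and are not the paper's actual classification procedure.

```python
def classify_trajectory(early_loss, late_loss, threshold=2.0):
    """Label a token by whether its loss is high (H) or low (L) early and late in training.

    The fixed threshold is a simplifying assumption for this sketch.
    """
    start = "H" if early_loss >= threshold else "L"
    end = "H" if late_loss >= threshold else "L"
    return f"{start} to {end}"


def category_shares(trajectories, threshold=2.0):
    """Return the fraction of tokens that falls into each of the four categories."""
    if not trajectories:
        raise ValueError("need at least one (early_loss, late_loss) pair")
    counts = {"H to L": 0, "L to L": 0, "H to H": 0, "L to H": 0}
    for early, late in trajectories:
        counts[classify_trajectory(early, late, threshold)] += 1
    total = len(trajectories)
    return {name: count / total for name, count in counts.items()}


# Toy (early_loss, late_loss) pairs, invented for illustration only.
toy_trajectories = [
    (4.1, 0.6), (0.4, 0.3), (0.5, 0.2), (5.0, 4.8),
    (0.7, 3.9), (3.2, 0.9), (0.3, 0.3), (0.2, 0.1),
]
shares = category_shares(toy_trajectories)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

On real data the shares would be computed from per-token losses logged at several checkpoints; the toy pairs here are chosen only so that each category appears at least once.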
This token-level perspective is distinct from document-level data curation. Conventional curation filters or reweights whole documents before training. SLM keeps the documents but selects which tokens within them drive the loss, recognizing that even a high-quality document contains many trivial or distracting tokens, and even a noisy document may contain useful ones.
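The contrast between the two granularities can be sketched as follows. The per-token scores, thresholds, and function names are hypothetical; in SLM itself the score is the excess loss and selection is a top-k percent ranking within a batch, so this is only an illustration of document-level versus token-level selection, not the method's selection rule.

```python
def document_level_filter(docs, min_mean_score):
    """Keep or drop whole documents based on their mean token score."""
    kept = []
    for doc in docs:
        mean_score = sum(score for _, score in doc) / len(doc)
        if mean_score >= min_mean_score:
            kept.append(doc)
    return kept


def token_level_mask(docs, min_score):
    """Keep every document, but flag which tokens would contribute to the loss."""
    return [[(token, score >= min_score) for token, score in doc] for doc in docs]


# Each document is a list of (token, hypothetical usefulness score) pairs.
clean_doc = [("the", 0.1), ("integral", 0.9), ("of", 0.1), ("x", 0.8)]
noisy_doc = [("click", 0.0), ("here", 0.0), ("2+2=4", 0.9), ("!!!", 0.0)]
docs = [clean_doc, noisy_doc]

kept = document_level_filter(docs, min_mean_score=0.4)
masks = token_level_mask(docs, min_score=0.5)

print(len(kept))   # the noisy document is dropped entirely
print(masks[1])    # token-level selection still keeps its one useful token
```

In this toy setup the document-level filter discards the useful token inside the noisy document and trains on the trivial tokens inside the clean one, which is the gap that token-level selection is meant to close.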
SLM is organized as a three-step pipeline [1].
1. Train a reference model. A small reference model is trained (or fine-tuned) on a modest amount of high-quality, curated data that reflects the distribution the final model should target. In the math experiments the reference model was trained on about 0.5 billion tokens of curated and synthetic mathematics; for the general-domain experiments it used about 1.9 billion tokens drawn from open instruction and curated sources such as Tulu-v2 and OpenHermes-2.5 [1]. The reference model encodes a notion of which tokens a competent model on the target distribution finds predictable.
2. Score every token by excess loss. The reference model's per-token loss is computed once over the entire pretraining corpus and stored. During the main training run, each token receives a score called its excess loss, defined as the difference between the loss of the model being trained and the loss of the reference model on that token:
L_excess(x_i) = L_theta(x_i) - L_ref(x_i)
Here L_theta is the current training model's loss at position i and L_ref is the stored reference loss. A high excess loss means the reference model predicts the token well while the model being trained does not, signaling a token that is both learnable and not yet learned. Crucially, this single quantity filters out both unwanted groups at once: already-learned tokens have low loss under the training model and so a low score, and noisy tokens have high loss under the reference model too, which cancels in the difference and again yields a low score.
3. Train selectively on the top tokens. Within each training batch the tokens are ranked by excess loss, and the cross-entropy loss is applied only to the top k percent. The selective objective can be written as
L_SLM(theta) = - (1 / (N * k%)) * sum_i I_k%(x_i) * log P(x_i | x_<i ; theta)
where I_k% is an indicator that is 1 for tokens in the top k percent by excess loss and 0 otherwise. The default selection ratio was 60 percent for the 1B model and 70 percent for the 7B model [1]. An important and easily missed point is that selection is applied to the loss, not to the input: the full token sequence is still fed through the transformer so that every position has its complete left context, and only the loss and gradient are masked down to the selected tokens. SLM is therefore a token-level loss-masking scheme (equivalently, a 0/1 weighting of the per-token losses) rather than a token-dropping or data-filtering scheme.
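Steps 2 and 3 can be sketched in plain Python on a single toy batch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the per-token losses are invented, and the per-token losses of both models are taken as given rather than computed from a real forward pass.

```python
def excess_losses(train_losses, ref_losses):
    """Per-token excess loss: training-model loss minus stored reference loss."""
    return [l_theta - l_ref for l_theta, l_ref in zip(train_losses, ref_losses)]


def select_top_k(scores, k_fraction):
    """Return a 0/1 mask that marks the top k_fraction of tokens by score."""
    n_keep = max(1, round(len(scores) * k_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]


def slm_loss(train_losses, ref_losses, k_fraction=0.6):
    """Mean cross-entropy over the selected tokens only.

    Dividing by the number of selected tokens plays the role of the
    1 / (N * k%) normalizer in the selective objective.
    """
    mask = select_top_k(excess_losses(train_losses, ref_losses), k_fraction)
    n_selected = sum(mask)
    loss = sum(m * l for m, l in zip(mask, train_losses)) / n_selected
    return loss, mask


# Invented per-token losses for a five-token batch.
# index 0, 3, 4: reference predicts well, training model does not (learnable, not yet learned)
# index 1:       both losses low (already learned)
# index 2:       both losses high (noisy), so the difference is small
train_losses = [3.0, 0.2, 4.0, 2.5, 1.8]
ref_losses = [0.5, 0.1, 4.2, 1.0, 0.6]

loss, mask = slm_loss(train_losses, ref_losses, k_fraction=0.6)
print(mask)
print(loss)
```

In an actual training loop the per-token losses would typically come from an unreduced cross-entropy (for example `reduction="none"` in PyTorch's `cross_entropy`), and the mask would be treated as a constant so that gradients flow only through the selected tokens' losses. All five tokens still pass through the forward computation; only the averaging step ignores the unselected ones.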
The technique was validated mainly through continual pretraining on mathematics. Starting from TinyLlama-1.1B and from Mistral 7B, the authors continually pretrained on the OpenWebMath corpus (roughly 14 billion to 15 billion tokens), with SLM keeping about 9 billion tokens for the 1B model and about 10.5 billion for the 7B model [1].
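The selected-token counts follow directly from the selection ratios. The short check below assumes a 15 billion token corpus (the upper end of the range quoted above) and the default ratios of 60 and 70 percent; the function name is hypothetical.

```python
def selected_tokens_billion(corpus_tokens_billion, selection_percent):
    """Tokens (in billions) that contribute to the loss at a given selection ratio."""
    return corpus_tokens_billion * selection_percent / 100


selected_1b = selected_tokens_billion(15, 60)  # 1B model, 60 percent ratio
selected_7b = selected_tokens_billion(15, 70)  # 7B model, 70 percent ratio
print(selected_1b, selected_7b)
```

Both values agree with the roughly 9 billion and 10.5 billion selected tokens reported for the two model sizes.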
Applying SLM raised average few-shot accuracy across nine math tasks by about 16.5 percent for the 1B model and about 10.4 percent for the 7B model relative to ordinary continual pretraining, with gains on individual tasks reaching as high as 30 percent. The authors reported reaching the baseline's accuracy roughly 5 to 10 times faster in terms of tokens consumed [1].
The headline comparison was against DeepSeekMath, a contemporaneous math model pretrained on a far larger math corpus (on the order of hundreds of billions of tokens). Rho-1-7B came close to DeepSeekMath-7B's MATH accuracy (31.0 versus 34.2 few-shot, and 51.8 versus 52.0 with tool integration) while, per the paper, using only about 3 percent as many pretraining tokens [1]. Reported few-shot chain-of-thought and tool-integrated ("interpreter") accuracies are shown below [1][4].
| Model | MATH (%) | GSM8K (%) | Setting |
|---|---|---|---|
| Rho-Math-1B | 15.6 | 36.2 | few-shot CoT |
| Rho-Math-7B | 31.0 | 66.9 | few-shot CoT |
| Rho-Math-1B-Interpreter | 40.6 | 59.4 | tool-integrated |
| Rho-Math-7B-Interpreter | 51.8 | 81.3 | tool-integrated |
| DeepSeekMath-7B | 34.2 | 64.1 | few-shot CoT |
| DeepSeekMath-7B | 52.0 | 79.8 | tool-integrated |
After supervised fine-tuning with a code interpreter, Rho-Math-1B reached 40.6 percent on MATH, which the authors described as the first 1B-parameter model to surpass 40 percent on that benchmark, and Rho-Math-7B reached 51.8 percent, essentially matching the much more heavily pretrained DeepSeekMath-7B [1][4]. Microsoft released the rho-math-1b-v0.1, rho-math-7b-v0.1, and their interpreter variants under an MIT license [4].
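The size of the gap between the two 7B models can be read off the reported scores. The snippet below only restates the table's numbers and computes Rho-Math-7B minus DeepSeekMath-7B in percentage points; the dictionary layout is an illustrative choice.

```python
# (Rho-Math-7B, DeepSeekMath-7B) accuracies in percent, copied from the table above.
reported = {
    ("few-shot CoT", "MATH"): (31.0, 34.2),
    ("few-shot CoT", "GSM8K"): (66.9, 64.1),
    ("tool-integrated", "MATH"): (51.8, 52.0),
    ("tool-integrated", "GSM8K"): (81.3, 79.8),
}

# Positive values mean Rho-Math-7B scored higher.
gaps = {key: round(rho - dsm, 1) for key, (rho, dsm) in reported.items()}
for (setting, benchmark), gap in gaps.items():
    print(f"{setting:16s} {benchmark:6s} {gap:+.1f}")
```

On these figures Rho-Math-7B trails on MATH in both settings (by a wider margin few-shot than with tool integration) and leads on GSM8K in both.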
To show the method was not specific to mathematics, the authors also continually pretrained TinyLlama-1.1B on 80 billion general tokens with SLM. This produced an average improvement of about 6.8 percent across 15 diverse benchmarks, with the largest gains, exceeding 10 percent, on code and math tasks [1].
SLM reframes data quality as a token-level property and shows that a simple per-token weighting, driven by an inexpensive reference model, can substantially improve the sample efficiency of pretraining. It is conceptually related to curriculum learning and to the broader literature on data selection, but it operates at a finer granularity than document filtering and adapts to the model's own learning state through the excess-loss signal. Because the selection mechanism is a thin wrapper around the ordinary cross-entropy loss, it is straightforward to add to existing training pipelines. The recognition at NeurIPS 2024 reflected the result that a 1B and a 7B model could reach math performance previously associated with far larger token budgets [1][5].
Several limitations and caveats apply. First, the method depends entirely on the reference model and the curated data used to train it: the selected tokens inherit whatever the reference treats as "high value," so a narrow or biased reference will skew selection, and defining a representative target distribution is harder for broad, general-purpose pretraining than for a focused domain such as mathematics. Second, SLM adds an extra reference-model training stage and a one-time scoring pass over the corpus, and it introduces the selection ratio k as a hyperparameter that must be tuned. Third, because the full sequence is still processed in the forward pass, SLM does not reduce the compute cost per token; its benefit is fewer tokens and steps to reach a given quality, that is, better data efficiency rather than lower per-step cost. Finally, the strongest demonstrations were on continual pretraining, especially for math; while the paper showed general-domain gains, the behavior of SLM in large-scale, from-scratch web pretraining was less fully explored at the time of publication. These caveats notwithstanding, "not all tokens are what you need" has become a frequently cited framing for token-level data selection in large language model training [1][2].