Selective Language Modeling (Rho-1)

Updated Jun 8, 2026

Overview

Selective Language Modeling (SLM) is a pretraining objective for language models that applies the training loss to only a chosen subset of tokens rather than to every token in the corpus. Instead of minimizing next-token prediction error uniformly across all positions, SLM scores each token in the training data with the help of a separate, small reference model and back-propagates loss only through the highest-scoring tokens. The method was introduced in the 2024 paper "Rho-1: Not All Tokens Are What You Need," by Zhenghao Lin, Zhibin Gou, and co-authors from Microsoft Research together with collaborators at Xiamen University and Tsinghua University ^[1]^[3]. The paper's title doubles as the slogan for the idea: not all tokens are what you need.

The models trained with the technique are called Rho-1 (written with the Greek letter rho, and released as the "rho-math" model series). Continually pretrained on a roughly 15 billion token mathematics corpus, Rho-1 matched the math performance of models pretrained on far larger token budgets, and after fine-tuning with a code interpreter it produced what the authors reported as the first 1B-parameter model to exceed 40 percent accuracy on the MATH benchmark ^[1]. The paper was selected for an oral presentation and named an Outstanding Main Track Runner-Up at NeurIPS 2024 ^[5].

Motivation

A standard causal language model is trained with a uniform next-token cross-entropy loss: every token in the corpus contributes equally to the gradient, and the model is optimized to predict each one in turn. The Rho-1 authors questioned the implicit assumption behind this objective, namely that every token is equally worth learning ^[1].

To test that assumption, they tracked the per-token loss of a model over the course of a normal pretraining run and sorted tokens by how their loss changed from early to late training. They found four distinct patterns, in roughly the following proportions ^[1]:

Category	Share of tokens	Loss behavior during training	Interpretation
H to L	about 26 percent	starts high, falls to low	genuinely learned; the valuable signal
L to L	about 51 percent	already low throughout	already known; little new signal
H to H	about 11 percent	stays high	hard or noisy; rarely fit
L to H	about 12 percent	rises during training	unstable, often noise

The implication is that the majority of tokens fall into the "L to L" bucket and are already predictable, while a meaningful fraction are noisy ("H to H" and "L to H") and may actively hurt training. Only about a quarter of tokens show the desirable trajectory of being learned during training. Spending equal optimization effort on already-learned and noisy tokens is, on this view, wasteful at best and harmful at worst. SLM is the proposed fix: concentrate the loss on the tokens that are both learnable and not yet learned.
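The four-way split can be illustrated by thresholding each token's loss early and late in training. This is a minimal sketch for intuition only: the function name `classify_trajectory`, the threshold value, and the toy losses are assumptions made here, and the paper's actual labelling procedure may differ.

```python
def classify_trajectory(early_loss, late_loss, threshold=1.0):
    """Label a token by whether its loss is high (H) or low (L) early and late.

    `threshold` is a hypothetical cut-off separating "high" from "low" loss.
    """
    start = "H" if early_loss > threshold else "L"
    end = "H" if late_loss > threshold else "L"
    return f"{start} to {end}"


# Toy (early_loss, late_loss) pairs, invented for illustration.
toy_tokens = {
    "learned": (3.2, 0.4),
    "already_known": (0.3, 0.2),
    "persistently_hard": (4.1, 3.9),
    "degrading": (0.5, 2.7),
}

labels = {name: classify_trajectory(e, l) for name, (e, l) in toy_tokens.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Under these assumptions, only the "H to L" token carries the signal that SLM tries to concentrate on.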

This token-level perspective is distinct from document-level data curation. Conventional curation filters or reweights whole documents before training. SLM keeps the documents but selects which tokens within them drive the loss, recognizing that even a high-quality document contains many trivial or distracting tokens, and even a noisy document may contain useful ones.

How Selective Language Modeling works

SLM is organized as a three-step pipeline ^[1].

1. Train a reference model. A small reference model is trained (or fine-tuned) on a modest amount of high-quality, curated data that reflects the distribution the final model should target. In the math experiments the reference model was trained on about 0.5 billion tokens of curated and synthetic mathematics; for the general-domain experiments it used about 1.9 billion tokens drawn from open instruction and curated sources such as Tulu-v2 and OpenHermes-2.5 ^[1]. The reference model encodes a notion of which tokens a competent model on the target distribution finds predictable.

2. Score every token by excess loss. The reference model's per-token loss is computed once over the entire pretraining corpus and stored. During the main training run, each token receives a score called its excess loss, defined as the difference between the loss of the model being trained and the loss of the reference model on that token:

L_excess(x_i) = L_theta(x_i) - L_ref(x_i)

Here L_theta is the current training model's loss at position i and L_ref is the stored reference loss. A high excess loss means the reference model predicts the token well while the model being trained does not, signaling a token that is both learnable and not yet learned. Crucially, this single quantity filters out both unwanted groups at once: already-learned tokens have low loss under the training model and so a low score, and noisy tokens have high loss under the reference model too, which cancels in the difference and again yields a low score.
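A small numeric illustration may make the cancellation concrete. The loss values below are invented, and `excess_loss` is a hypothetical helper that simply implements the subtraction in the formula above.

```python
def excess_loss(train_loss, ref_loss):
    """Excess loss: training-model loss minus reference-model loss."""
    return train_loss - ref_loss


# Toy per-token losses as (train_loss, ref_loss), invented for illustration.
cases = {
    "learnable, not yet learned": (3.0, 0.5),
    "already learned": (0.3, 0.2),
    "noisy": (4.0, 3.8),
}

scores = {name: excess_loss(t, r) for name, (t, r) in cases.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.1f}")
```

In this toy setting the noisy token has the highest raw training loss, yet its excess loss is small because the reference model finds it hard too.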

3. Train selectively on the top tokens. Within each training batch the tokens are ranked by excess loss, and the cross-entropy loss is applied only to the top k percent. The selective objective can be written as

L_SLM(theta) = - (1 / (N * k%)) * sum_i I_k%(x_i) * log P(x_i | x_<i ; theta)

where I_k% is an indicator that is 1 for tokens in the top k percent by excess loss and 0 otherwise. The default selection ratio was 60 percent for the 1B model and 70 percent for the 7B model ^[1]. Selection is applied to the loss, not to the input: the full token sequence is still fed through the transformer so that every position has its complete left context, and only the loss and gradient are masked down to the selected tokens. SLM is therefore a token-level loss-masking scheme (a binary weighting of the per-token loss) rather than a token-dropping or data-filtering scheme.
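The selective objective can be sketched in plain Python on precomputed per-token losses. This is an illustration under stated assumptions, not the authors' implementation: the name `slm_loss`, the use of `round` to turn the ratio into a token count, and the toy batch are all choices made here. In an autodiff framework the same mask would multiply the per-token loss so that gradients flow only through selected positions.

```python
def slm_loss(train_losses, ref_losses, k=0.6):
    """Mean cross-entropy over the top-k fraction of tokens by excess loss.

    train_losses[i] is -log P(x_i | x_<i; theta) for the model being trained;
    ref_losses[i] is the stored reference-model loss for the same token.
    Returns (loss, mask), where mask[i] is 1 for selected tokens and 0 otherwise.
    """
    n = len(train_losses)
    n_keep = max(1, round(n * k))  # number of tokens that receive loss
    excess = [t - r for t, r in zip(train_losses, ref_losses)]
    # Indices of the n_keep tokens with the largest excess loss.
    top = sorted(range(n), key=lambda i: excess[i], reverse=True)[:n_keep]
    mask = [0] * n
    for i in top:
        mask[i] = 1
    # Every token was scored by a full forward pass; only selected ones count.
    loss = sum(m * t for m, t in zip(mask, train_losses)) / n_keep
    return loss, mask


# Toy batch of five tokens, invented for illustration.
train = [3.0, 0.3, 4.0, 2.0, 1.5]
ref = [0.5, 0.2, 3.8, 0.4, 0.3]
loss, mask = slm_loss(train, ref, k=0.6)
print(mask, round(loss, 3))
```

With k = 0.6, three of the five toy tokens are kept; the already-learned token (index 1) and the noisy token (index 2) contribute nothing to the loss even though both were part of the input sequence.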

Rho-1 results

The technique was validated mainly through continual pretraining on mathematics. Starting from TinyLlama-1.1B and from Mistral 7B, the authors continually pretrained on the OpenWebMath corpus (roughly 14 billion to 15 billion tokens), with SLM keeping about 9 billion tokens for the 1B model and about 10.5 billion for the 7B model ^[1].
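The kept-token figures are consistent with applying the reported selection ratios (60 percent and 70 percent) to the corpus size. A quick check, assuming a corpus of exactly 15 billion tokens (the text above gives only a rough range):

```python
corpus_tokens = 15e9  # assumed corpus size; the text says roughly 14-15 billion

# Selection ratios reported for each model size.
selection_ratio = {"1B": 0.60, "7B": 0.70}

kept = {size: corpus_tokens * ratio for size, ratio in selection_ratio.items()}
for size, tokens in kept.items():
    print(f"{size}: about {tokens / 1e9:.1f} billion tokens receive loss")
```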

Applying SLM raised average few-shot accuracy across nine math tasks by about 16.5 percent for the 1B model and about 10.4 percent for the 7B model relative to ordinary continual pretraining, with gains on individual tasks reaching as high as 30 percent in absolute terms. The authors reported reaching the baseline's accuracy roughly 5 to 10 times faster in terms of tokens consumed ^[1].

The headline comparison was against DeepSeekMath, a contemporaneous math model pretrained on a far larger math corpus (on the order of hundreds of billions of tokens). Rho-1-7B matched DeepSeekMath-7B's MATH accuracy while, per the paper, using only about 3 percent as many pretraining tokens ^[1]. Reported few-shot chain-of-thought and tool-integrated ("interpreter") accuracies are shown below ^[1]^[4].

Model	MATH	GSM8K	Setting
Rho-Math-1B	15.6	36.2	few-shot CoT
Rho-Math-7B	31.0	66.9	few-shot CoT
Rho-Math-1B-Interpreter	40.6	59.4	tool-integrated
Rho-Math-7B-Interpreter	51.8	81.3	tool-integrated
DeepSeekMath-7B	34.2	64.1	few-shot CoT
DeepSeekMath-7B	52.0	79.8	tool-integrated

After supervised fine-tuning with a code interpreter, Rho-Math-1B reached 40.6 percent on MATH, which the authors described as the first 1B-parameter model to surpass 40 percent on that benchmark, and Rho-Math-7B reached 51.8 percent, essentially matching the much more heavily pretrained DeepSeekMath-7B ^[1]^[4]. Microsoft released rho-math-1b-v0.1, rho-math-7b-v0.1, and their interpreter variants under an MIT license ^[4].

To show the method was not specific to mathematics, the authors also continually pretrained TinyLlama-1.1B on 80 billion general tokens with SLM. This produced an average improvement of about 6.8 percent across 15 diverse benchmarks, with the largest gains, exceeding 10 percent, on code and math tasks ^[1].

Significance and limitations

SLM reframes data quality as a token-level property and shows that a simple per-token weighting, driven by an inexpensive reference model, can substantially improve the sample efficiency of pretraining. It is conceptually related to curriculum learning and to the broader literature on data selection, but it operates at a finer granularity than document filtering and adapts to the model's own learning state through the excess-loss signal. Because the selection mechanism is a thin wrapper around the ordinary cross-entropy loss, it is straightforward to add to existing training pipelines. The recognition at NeurIPS 2024 reflected the result that a 1B and a 7B model could reach math performance previously associated with far larger token budgets ^[1]^[5].

Several limitations and caveats apply. First, the method depends entirely on the reference model and the curated data used to train it: the selected tokens inherit whatever the reference treats as "high value," so a narrow or biased reference will skew selection, and defining a representative target distribution is harder for broad, general-purpose pretraining than for a focused domain such as mathematics. Second, SLM adds an extra reference-model training stage and a one-time scoring pass over the corpus, and it introduces the selection ratio k as a hyperparameter that must be tuned. Third, because the full sequence is still processed in the forward pass, SLM does not reduce the compute cost per token; its benefit is fewer tokens and steps to reach a given quality, that is, better data efficiency rather than lower per-step cost. Finally, the strongest demonstrations were on continual pretraining, especially for math; while the paper showed general-domain gains, the behavior of SLM in large-scale, from-scratch web pretraining was less fully explored at the time of publication. These caveats notwithstanding, "not all tokens are what you need" has become a frequently cited framing for token-level data selection in large language model training ^[1]^[2].

References

1. Lin, Z., Gou, Z., Gong, Y., Liu, X., Shen, Y., Xu, R., Lin, C., Yang, Y., Jiao, J., Duan, N., and Chen, W. "Rho-1: Not All Tokens Are What You Need." arXiv:2404.07965, 2024. https://arxiv.org/abs/2404.07965
2. "Not All Tokens Are What You Need for Pretraining." NeurIPS 2024 (OpenReview). https://openreview.net/forum?id=0NMzBwqaAJ
3. "Not All Tokens Are What You Need for Pretraining." Microsoft Research publication page. https://www.microsoft.com/en-us/research/publication/not-all-tokens-are-what-you-need-for-pretraining/
4. microsoft/rho. GitHub repository (Rho-1 code and released rho-math models). https://github.com/microsoft/rho
5. "Congratulations to the NeurIPS 2024 award winners." AIhub, 11 December 2024. https://aihub.org/2024/12/11/congratulations-to-the-neurips2024-award-winners/
6. Shao, Z., et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300, 2024. https://arxiv.org/abs/2402.03300
