ReST / ReST-EM (Reinforced Self-Training)

Updated Jun 8, 2026

Overview

ReST (Reinforced Self-Training) is a family of self-training algorithms for large language models that improve a model by fine-tuning it on its own filtered outputs instead of on additional human-written data. The shared recipe alternates two phases. A "Grow" phase samples many candidate outputs from the current model, and an "Improve" phase keeps only the outputs that earn a high reward and trains the model on those survivors. Repeating the loop lets the model bootstrap from a fixed pool of inputs plus a reward signal, steadily increasing the probability it places on outputs that score well. ^[1]^[2]

The name covers two related methods from Google DeepMind. The original ReST, introduced by Caglar Gulcehre and colleagues in August 2023, frames the loop as growing-batch reinforcement learning and applies it to machine translation, filtering samples with a learned reward model trained on human preferences. ^[1] ReST-EM, written ReST^EM in the paper after the expectation-maximization (EM) algorithm, was introduced by Avi Singh and colleagues in December 2023 and published in Transactions on Machine Learning Research in 2024. It recasts the same loop as an EM procedure driven by binary correctness rewards and demonstrates it on competition mathematics and programming. ^[2] ReST-EM is the variant most associated with the reasoning-model line of work, and it is closely related to the earlier STaR method. ^[3]

ReST (original)

Gulcehre et al. cast alignment as a growing-batch reinforcement learning problem and proposed ReST as a sample-efficient, mostly offline alternative to online RLHF. ^[1] A standard online method such as PPO repeatedly samples fresh outputs from the policy as it updates, which is expensive because every gradient step needs new generations and reward-model scores. ReST instead separates generation from learning. It samples a large batch of outputs once, scores them, and then reuses that fixed dataset for several rounds of offline policy improvement, so the cost of generation is amortized across many cheap fine-tuning passes.

Concretely, ReST starts from a supervised policy, which for translation is a trained neural machine-translation model. The outer Grow loop uses the current policy to generate multiple candidate translations for every source sentence and adds them to a growing dataset. The inner Improve loop scores each candidate with a learned reward model and filters with a threshold, keeping samples whose reward exceeds a cutoff, then fine-tunes the policy on the kept subset. The Improve loop is run several times over the same grown dataset using an increasing sequence of thresholds tau1 < tau2 < ... < tauN, so each pass trains on a smaller but higher-quality subset. ^[1]
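The inner Improve loop can be illustrated with a minimal sketch. The function name `improve_passes`, the `fine_tune` callback, and the toy reward values are hypothetical and chosen only for illustration; the paper's actual fine-tuning step is a full offline training pass, which is stubbed out here.

```python
from typing import Callable, List, Tuple

# One grown sample: (source sentence, candidate translation, reward-model score).
Sample = Tuple[str, str, float]


def improve_passes(
    grown: List[Sample],
    thresholds: List[float],
    fine_tune: Callable[[List[Sample]], None],
) -> List[int]:
    """Run one Improve pass per threshold over the same grown dataset.

    Returns the size of the kept subset at each pass.
    """
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    kept_sizes = []
    for tau in thresholds:
        # Filter: keep only samples whose reward exceeds the current cutoff.
        kept = [s for s in grown if s[2] > tau]
        # Hypothetical stand-in for offline fine-tuning on the kept subset.
        fine_tune(kept)
        kept_sizes.append(len(kept))
    return kept_sizes


# Toy grown dataset with made-up rewards.
grown = [
    ("src", "cand_a", 0.20),
    ("src", "cand_b", 0.55),
    ("src", "cand_c", 0.80),
    ("src", "cand_d", 0.95),
]
sizes = improve_passes(grown, [0.0, 0.5, 0.9], fine_tune=lambda kept: None)
print(sizes)  # [4, 3, 1]
```

Each pass sees a smaller subset than the one before it, which is the "smaller but higher-quality" behavior the threshold schedule is meant to produce.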

The authors tested several offline objectives for the Improve step, including behavioral cloning (standard negative log-likelihood on the filtered data), GOLD, and reward-weighted variants, and reported that the simple behavioral-cloning loss on filtered data was a strong and stable choice. ReST was evaluated on machine-translation benchmarks including IWSLT 2014 German-to-English and WMT 2020 Chinese-to-English, plus an internal web-domain dataset, using a learned, reference-free translation-quality metric as the reward model. Across automatic metrics and human evaluation, all ReST variants improved translation quality over the supervised baseline while using less compute than online reinforcement learning. ^[1]
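As a rough illustration of the behavioral-cloning objective on filtered data, the sketch below computes a negative log-likelihood over the samples that pass a reward cutoff. The function name `filtered_bc_loss`, the per-sequence averaging, and the toy log-probabilities are assumptions made for this example; the paper does not prescribe this exact normalization.

```python
from typing import List


def filtered_bc_loss(
    token_logprobs: List[List[float]],
    rewards: List[float],
    tau: float,
) -> float:
    """Mean negative log-likelihood of the samples whose reward exceeds tau.

    token_logprobs[i] holds the model's log-probability of each token in
    sample i; summing them gives the sequence log-likelihood.
    """
    kept = [lp for lp, r in zip(token_logprobs, rewards) if r > tau]
    if not kept:
        return 0.0  # nothing survived the filter, so there is nothing to fit
    return -sum(sum(lp) for lp in kept) / len(kept)


# Three toy samples; the second is filtered out by the cutoff tau = 0.5.
logprobs = [[-0.1, -0.2], [-1.0, -1.0], [-0.3]]
rewards = [0.9, 0.2, 0.7]
loss = filtered_bc_loss(logprobs, rewards, tau=0.5)
print(round(loss, 3))  # 0.3
```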

ReST-EM

Singh et al. asked whether a model can improve beyond human-generated data when an automatic check on correctness is available, as it is for mathematics and code. ^[2] Their method, ReST-EM, keeps the Grow and Improve structure but makes two changes suited to verifiable reasoning. First, the reward is binary: a sample earns reward 1 if its final answer is correct (on the MATH benchmark) or if it passes the hidden unit tests (on APPS), and 0 otherwise. There is no learned reward model and no threshold schedule; filtering simply keeps the correct samples. Second, each Improve step restarts fine-tuning from the base pretrained model rather than continuing from the previous iteration's checkpoint, a choice the authors adopt explicitly to limit task-specific overfitting and drift from the base model. Restarting from the base model matches STaR and differs from the original ReST, which warm-starts from the previous policy. ^[1]^[2]^[3]
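A binary reward of this kind can be sketched as follows. Both functions are simplified, hypothetical stand-ins: real MATH grading has to recognize mathematically equivalent answers, and real APPS grading executes submitted programs against test inputs in a sandbox. Here answers are compared as normalized strings and a "program" is just a Python callable.

```python
from typing import Any, Callable, Iterable, Tuple


def math_reward(predicted: str, reference: str) -> int:
    """Return 1 if the final answers match after light normalization, else 0."""
    def normalize(s: str) -> str:
        return s.strip().replace(" ", "").lower()
    return int(normalize(predicted) == normalize(reference))


def code_reward(
    candidate: Callable[..., Any],
    tests: Iterable[Tuple[tuple, Any]],
) -> int:
    """Return 1 only if the candidate passes every (args, expected) test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return 0
        except Exception:
            return 0  # a crash counts as a failed test
    return 1


print(math_reward(" 3/4 ", "3/4"))                        # 1
print(math_reward("0.75", "3/4"))                         # 0 (no equivalence check)
print(code_reward(lambda a, b: a + b, [((1, 2), 3)]))     # 1
print(code_reward(lambda a, b: a / b, [((1, 0), 0)]))     # 0 (raises ZeroDivisionError)
```

The second call shows why a naive checker under-rewards: a production verifier would need to treat `0.75` and `3/4` as the same answer.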

The "EM" in the name refers to a derivation of the loop as expectation-maximization. Introduce a binary optimality variable that indicates a high-reward output. Maximizing the log-likelihood of observing optimality leads, through a standard evidence-lower-bound argument, to two alternating steps. The E-step (Generate) draws samples from the current policy and weights them by reward, which for a binary reward means simply collecting the correct samples; this is the Grow phase. The M-step (Improve) maximizes the reward-weighted log-likelihood J(theta) = E[ r(x, y) log p(y | x; theta) ] over the dataset, which, again for binary rewards, reduces to ordinary supervised fine-tuning on the filtered correct samples. Iterating the E-step and M-step is therefore the Grow and Improve loop, now with a clean probabilistic justification. ^[2]
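The derivation can be written out as a short sketch. The notation is an assumption layered on the text above: O is the binary optimality variable, q(y | x) is a variational distribution over outputs, theta^t denotes the parameters at iteration t, and the likelihood of optimality is taken to be proportional to a non-negative reward, p(O = 1 | x, y) ∝ r(x, y).

```latex
\begin{align*}
\log p(O = 1 \mid x)
  &= \log \sum_{y} p(y \mid x; \theta)\, p(O = 1 \mid x, y) \\
  &\ge \mathbb{E}_{q(y \mid x)}\!\left[
        \log \frac{p(O = 1 \mid x, y)\, p(y \mid x; \theta)}{q(y \mid x)}
      \right] \\
  &= \mathbb{E}_{q(y \mid x)}\!\left[ \log p(O = 1 \mid x, y) \right]
     - \mathrm{KL}\!\left[ q(y \mid x) \,\|\, p(y \mid x; \theta) \right] \\[6pt]
\text{E-step:}\quad
  q^{t+1}(y \mid x) &\propto p(O = 1 \mid x, y)\, p(y \mid x; \theta^{t}) \\
\text{M-step:}\quad
  \theta^{t+1} &= \arg\max_{\theta} \sum_{y} q^{t+1}(y \mid x)\, \log p(y \mid x; \theta)
\end{align*}
```

Substituting the E-step solution into the M-step and using p(O = 1 | x, y) ∝ r(x, y) gives, up to a constant factor, the expectation of r(x, y) log p(y | x; theta) under samples y drawn from p(y | x; theta^t), which is the objective J(theta) stated above. With r in {0, 1}, only correct samples contribute to the sum.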

ReST-EM was run on PaLM 2 models of several sizes (PaLM 2-S, the code-capable PaLM 2-S*, and the large PaLM 2-L), using the roughly 7,500 training problems of the MATH benchmark and the 2,342 introductory problems of APPS. ^[2]^[7]^[8]^[9]

The Grow and Improve loop

A single iteration of either method has the same shape. ^[1]^[2]

1. Grow (E-step). For each input x in the training set, sample K outputs from the current policy, typically with temperature sampling to encourage diversity.
2. Score. Assign each sample a reward: a learned reward-model score in the original ReST, or a 0/1 correctness check in ReST-EM.
3. Filter. Keep the high-reward samples, meaning those above the current threshold in ReST or the correct ones in ReST-EM.
4. Improve (M-step). Fine-tune on the kept samples by maximizing the reward-weighted log-likelihood.
5. Repeat. Use the improved model to generate the next batch, and continue until performance saturates.
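The steps above can be sketched as a toy simulation in the ReST-EM style (binary reward, restart from the base model each round). Everything here is a hypothetical stand-in: the "policy" is a table of sampling weights over a few candidate answers per problem, and "fine-tuning" simply adds weight to kept answers. It shows the control flow only, not how a language model is actually trained.

```python
import random
from typing import Dict, List

# Toy policy: problem -> {candidate answer: unnormalized sampling weight}.
Policy = Dict[str, Dict[str, float]]


def sample(policy: Policy, x: str, k: int, rng: random.Random) -> List[str]:
    """Grow / E-step: draw K candidate answers for input x."""
    answers = list(policy[x])
    weights = [policy[x][a] for a in answers]
    return rng.choices(answers, weights=weights, k=k)


def fine_tune(base: Policy, kept: Dict[str, List[str]]) -> Policy:
    """Toy Improve / M-step: restart from the base weights and add one
    unit of weight per kept sample."""
    tuned = {x: dict(weights) for x, weights in base.items()}
    for x, answers in kept.items():
        for y in answers:
            tuned[x][y] += 1.0
    return tuned


def grow_improve(base: Policy, gold: Dict[str, str],
                 iterations: int, k: int, seed: int = 0) -> Policy:
    rng = random.Random(seed)
    policy = base
    for _ in range(iterations):
        # Grow, score, and filter: keep only samples with reward 1 (correct).
        kept = {
            x: [y for y in sample(policy, x, k, rng) if y == gold[x]]
            for x in base
        }
        # Improve: fine-tune from the base model on the kept samples.
        policy = fine_tune(base, kept)
    return policy


def p_correct(policy: Policy, gold: Dict[str, str], x: str) -> float:
    weights = policy[x]
    return weights[gold[x]] / sum(weights.values())


base = {"2+2": {"4": 1.0, "5": 3.0}, "3*3": {"9": 1.0, "6": 1.0, "8": 2.0}}
gold = {"2+2": "4", "3*3": "9"}
trained = grow_improve(base, gold, iterations=3, k=16)
for x in base:
    print(x, round(p_correct(base, gold, x), 3), "->",
          round(p_correct(trained, gold, x), 3))
```

Because only correct samples are kept, the probability of the correct answer in this toy can never drop below its base value; it rises whenever at least one of the K samples is correct.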

Sampling many candidates per input (a large K) raises the chance of finding at least one high-reward output for hard inputs, which is what lets the model manufacture training signal for problems it cannot yet solve reliably. The two methods differ mainly in bookkeeping, as summarized below.

Aspect	ReST (Gulcehre et al., 2023)	ReST-EM (Singh et al., 2023 to 2024)
Framing	Growing-batch reinforcement learning	Expectation-maximization
Reward	Learned reward model, continuous score	Binary correctness verifier (0 or 1)
Filtering	Increasing thresholds tau1 < ... < tauN	Single rule: keep reward = 1
Loop structure	One Grow, several Improve steps	Alternate one Generate and one Improve
Fine-tune start	Warm-start from previous policy	Restart from base model each round
Domain	Machine translation	Math (MATH) and code (APPS)
Base model	Supervised translation model	PaLM 2-S, PaLM 2-S*, PaLM 2-L

Relationship to other methods

ReST-EM sits in a cluster of "generate, filter, then fine-tune" techniques that approximate reinforcement learning with a reward by reweighting the model's own samples. ^[2]^[3]

STaR (Zelikman et al., 2022) is the direct predecessor: it generates chain-of-thought rationales, keeps those that reach the correct answer, and fine-tunes on them. ReST-EM differs by using temperature sampling of many candidates rather than greedy decoding, by dropping STaR's "rationalization" step (which the authors note inflates false-positive solutions that reach the right answer through faulty reasoning), and by giving the loop an explicit EM derivation. ^[2]^[3]
Rejection sampling fine-tuning (RFT, Yuan et al., 2023) is essentially a single iteration of the loop: sample, keep correct solutions, and fine-tune once. ReST-EM generalizes it to multiple iterations. ^[2]^[4]
RAFT (Reward rAnked FineTuning, Dong et al., 2023) ranks samples by reward and trains on the top ones; for binary rewards it is an instantiation of the same reweighted-likelihood update. ^[2]^[5]
Expert iteration (ExIt, Anthony et al., 2017), the template behind AlphaZero-style systems, uses an "expert" such as tree search to produce improved targets onto which a policy is distilled. ReST-EM replaces search with temperature sampling against a verifier, and its authors flag stronger search in the E-step as future work. ^[2]^[6]
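The filtering rules that distinguish these methods can be placed side by side in a small sketch. The function names are hypothetical, and the top-n rule is a simplified reading of reward ranking rather than RAFT's exact procedure. With binary rewards, and n set to the number of correct samples, ranking and correctness filtering select the same items.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (sample text, reward)


def keep_above_threshold(samples: List[Scored], tau: float) -> List[Scored]:
    """ReST-style filter: keep samples whose reward exceeds a cutoff."""
    return [s for s in samples if s[1] > tau]


def keep_correct(samples: List[Scored]) -> List[Scored]:
    """ReST-EM / RFT-style filter: keep samples with binary reward 1."""
    return [s for s in samples if s[1] == 1.0]


def keep_top_n(samples: List[Scored], n: int) -> List[Scored]:
    """Ranking-style filter: keep the n highest-reward samples."""
    return sorted(samples, key=lambda s: s[1], reverse=True)[:n]


samples = [("a", 1.0), ("b", 0.0), ("c", 1.0), ("d", 0.0)]
print(keep_correct(samples))               # [('a', 1.0), ('c', 1.0)]
print(keep_top_n(samples, 2))              # [('a', 1.0), ('c', 1.0)]
print(keep_above_threshold(samples, 0.5))  # [('a', 1.0), ('c', 1.0)]
```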

The common thread, including with rejection sampling and best-of-N distillation, is treating the model as its own data generator and using a reward to decide what to learn from. ^[2]

Results

The headline finding is that ReST-EM outperforms supervised fine-tuning on human-written solutions and scales favorably with model size: larger models benefit at least as much as smaller ones. ^[2] On MATH, PaLM 2-L test accuracy rose by roughly six percentage points over the base and human-data baselines, and the gains on APPS were similar. The table below gives approximate figures read from the paper; most of the improvement arrives in the first one or two iterations. ^[2]

Benchmark (PaLM 2-L, approximate)	Base / human-data SFT	After ReST-EM
MATH, test accuracy	about 35%	about 41%
APPS Introductory	about 32%	about 38%

The self-trained models also transferred well to tasks they were not trained on. Fine-tuning on MATH improved GSM8K performance (for example, majority voting over 64 samples rose from about 44% to about 49%). A PaLM 2-L model trained with ReST-EM on MATH scored strongly on a held-out Hungarian high-school mathematics exam, behind only GPT-4 among the models the authors compared. There was also no meaningful degradation across the Big-Bench Hard suite under chain-of-thought evaluation. ^[2] These transfer results suggest that the method improves general problem-solving ability rather than merely fitting the training set.
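Majority voting, as used in the evaluation above, can be sketched in a few lines: sample several solutions, extract each final answer, and report the most frequent one. The function name and the toy answers are hypothetical; extracting the final answer from a full solution is assumed to have happened already.

```python
from collections import Counter
from typing import Sequence


def majority_vote(final_answers: Sequence[str]) -> str:
    """Return the most common final answer among the sampled solutions.

    Ties go to the answer that appeared first in the input.
    """
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    return Counter(final_answers).most_common(1)[0][0]


# Five sampled solutions to one problem; three agree on "12".
print(majority_vote(["12", "7", "12", "13", "12"]))  # 12
```

The voted answer is then scored against the reference exactly like a single sample, so majority-vote accuracy is the fraction of problems where the most common answer is correct.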

The original ReST, for its part, improved machine-translation quality on both automatic metrics and human evaluation while being more compute-efficient than online RLHF, establishing the Grow and Improve template that ReST-EM later specialized to reasoning. ^[1]

Limitations

Both methods share the limits of reward-filtered self-training. ^[1]^[2]^[3]

A reward signal is required. ReST-EM needs an automatic, reliable check of correctness, which exists cleanly only for verifiable domains such as mathematics, code, or formal tasks; the original ReST needs a learned reward model, which can be gamed (reward hacking) when it is imperfect.
A pool of inputs is required. Each new task needs a moderately sized set of problems or prompts to grow data from.
Bounded by the base model. The loop can only learn from outputs the model sometimes produces correctly, so gains are capped by the model's pass@K coverage. ReST-EM may not close the gap to pass@K for large K, and very hard problems the model never solves contribute no signal.
Diminishing returns and overfitting. Improvement typically saturates after one to three iterations, and on small datasets further iterations can hurt: APPS performance regressed in the second iteration as the model overfit the 2,342 training problems. Restarting each M-step from the base model mitigates but does not eliminate this.
Right answer, wrong reasoning. Because filtering checks only the final outcome, a binary reward accepts solutions that reach the correct answer through invalid steps, which can reinforce spurious reasoning. This is the motivation ReST-EM cites for avoiding STaR-style rationalization.
Compute cost and diversity. Each iteration's generate-and-fine-tune cycle is expensive, and repeatedly training on filtered self-generated data can narrow the model's output diversity.
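The pass@K ceiling mentioned above can be made concrete with a short sketch. It uses the unbiased estimator that is common in code-generation evaluation, 1 - C(n - c, k) / C(n, k), where n samples were drawn and c of them were correct. The article's sources do not say how pass@K was computed, so treat this as one reasonable way to measure it rather than the papers' procedure.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n drawn samples were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=1, k=1))  # 0.1
print(pass_at_k(n=10, c=0, k=5))  # 0.0
print(round(pass_at_k(n=4, c=2, k=2), 4))  # 0.8333
```

A problem with c = 0 yields pass@K of zero for every K, which is the case where the filter keeps nothing and the loop receives no training signal.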

References

1. Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas. "Reinforced Self-Training (ReST) for Language Modeling." arXiv:2308.08998, August 2023. https://arxiv.org/abs/2308.08998
2. Avi Singh, John D. Co-Reyes, Rishabh Agarwal, et al. "Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models." Transactions on Machine Learning Research, 2024. arXiv:2312.06585. https://arxiv.org/abs/2312.06585
3. Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman. "STaR: Bootstrapping Reasoning with Reasoning." Advances in Neural Information Processing Systems 35 (NeurIPS 2022). arXiv:2203.14465. https://arxiv.org/abs/2203.14465
4. Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, Jingren Zhou. "Scaling Relationship on Learning Mathematical Reasoning with Large Language Models" (rejection sampling fine-tuning). arXiv:2308.01825, 2023. https://arxiv.org/abs/2308.01825
5. Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang. "RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment." Transactions on Machine Learning Research, 2023. arXiv:2304.06767. https://arxiv.org/abs/2304.06767
6. Thomas Anthony, Zheng Tian, David Barber. "Thinking Fast and Slow with Deep Learning and Tree Search" (expert iteration). NeurIPS 2017. arXiv:1705.08439. https://arxiv.org/abs/1705.08439
7. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt. "Measuring Mathematical Problem Solving With the MATH Dataset." NeurIPS 2021. arXiv:2103.03874. https://arxiv.org/abs/2103.03874
8. Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt. "Measuring Coding Challenge Competence With APPS." NeurIPS 2021. arXiv:2105.09938. https://arxiv.org/abs/2105.09938
9. Rohan Anil, Andrew M. Dai, Orhan Firat, et al. (Google). "PaLM 2 Technical Report." arXiv:2305.10403, 2023. https://arxiv.org/abs/2305.10403
