ReST / ReST-EM (Reinforced Self-Training)
Last reviewed
Jun 8, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 2,109 words
ReST (Reinforced Self-Training) is a family of self-training algorithms for large language models that improve a model by fine-tuning it on its own filtered outputs instead of on additional human-written data. The shared recipe alternates two phases. A "Grow" phase samples many candidate outputs from the current model, and an "Improve" phase keeps only the outputs that earn a high reward and trains the model on those survivors. Repeating the loop lets the model bootstrap from a fixed pool of inputs plus a reward signal, steadily increasing the probability it places on outputs that score well. [1][2]
The name covers two related methods from Google DeepMind. The original ReST, introduced by Caglar Gulcehre and colleagues in August 2023, frames the loop as growing-batch reinforcement learning and applies it to machine translation, filtering samples with a learned reward model trained on human preferences. [1] ReST-EM, written ReST^EM in the paper after the expectation-maximization (EM) algorithm, was introduced by Avi Singh and colleagues in December 2023 and published in Transactions on Machine Learning Research in 2024. It recasts the same loop as an EM procedure driven by binary correctness rewards and demonstrates it on competition mathematics and programming. [2] ReST-EM is the variant most associated with the reasoning-model line of work, and it is closely related to the earlier STaR method. [3]
Gulcehre et al. cast alignment as a growing-batch reinforcement learning problem and proposed ReST as a sample-efficient, mostly offline alternative to online RLHF. [1] A standard online method such as PPO repeatedly samples fresh outputs from the policy as it updates, which is expensive because every gradient step needs new generations and reward-model scores. ReST instead separates generation from learning. It samples a large batch of outputs once, scores them, and then reuses that fixed dataset for several rounds of offline policy improvement, so the cost of generation is amortized across many cheap fine-tuning passes.
Concretely, ReST starts from a supervised policy, which for translation is a trained neural machine-translation model. The outer Grow loop uses the current policy to generate multiple candidate translations for every source sentence and adds them to a growing dataset. The inner Improve loop scores each candidate with a learned reward model and filters with a threshold, keeping samples whose reward exceeds a cutoff, then fine-tunes the policy on the kept subset. The Improve loop is run several times over the same grown dataset using an increasing sequence of thresholds tau1 < tau2 < ... < tauN, so each pass trains on a smaller but higher-quality subset. [1]
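The one-Grow, several-Improve structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `grow`, `improve`, the toy sampler, the random reward, and the `fine_tune` callback are all hypothetical stand-ins for a real translation model, a learned reward model, and an offline fine-tuning pass.

```python
import random
from typing import Callable, List, Sequence, Tuple

Scored = Tuple[str, str, float]  # (source sentence, candidate output, reward)


def grow(sample: Callable[[str], str],
         reward: Callable[[str, str], float],
         sources: Sequence[str],
         k: int) -> List[Scored]:
    """Grow: draw k candidates per source from the current policy and score each once."""
    dataset: List[Scored] = []
    for x in sources:
        for _ in range(k):
            y = sample(x)
            dataset.append((x, y, reward(x, y)))
    return dataset


def improve(dataset: Sequence[Scored],
            thresholds: Sequence[float],
            fine_tune: Callable[[List[Tuple[str, str]]], None]) -> List[int]:
    """Improve: reuse the same grown dataset with a strictly increasing cutoff."""
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    kept_sizes = []
    for tau in thresholds:
        # Keep only samples whose reward exceeds the current cutoff.
        kept = [(x, y) for x, y, r in dataset if r > tau]
        fine_tune(kept)  # hypothetical offline fine-tuning pass on the survivors
        kept_sizes.append(len(kept))
    return kept_sizes


# Toy stand-ins (assumptions): a random "policy" and a random "reward model".
rng = random.Random(0)
passes: List[List[Tuple[str, str]]] = []  # what each pass would have trained on
dataset = grow(lambda x: f"{x}-cand-{rng.randint(0, 999)}",
               lambda x, y: rng.random(),
               ["s1", "s2", "s3"], k=8)
kept_sizes = improve(dataset, [0.0, 0.5, 0.8], passes.append)
print(kept_sizes)  # sizes never increase as the threshold rises
```

Because generation happens only inside `grow`, every pass in `improve` is a cheap filter-and-train step over data that already exists, which is the amortization argument made for ReST.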
The authors tested several offline objectives for the Improve step, including behavioral cloning (standard negative log-likelihood on the filtered data), GOLD, and reward-weighted variants, and reported that the simple behavioral-cloning loss on filtered data was a strong and stable choice. ReST was evaluated on machine-translation benchmarks including IWSLT 2014 German-to-English and WMT 2020 Chinese-to-English, plus an internal web-domain dataset, using a learned, reference-free translation-quality metric as the reward model. Across automatic metrics and human evaluation, all ReST variants improved translation quality over the supervised baseline while using less compute than online reinforcement learning. [1]
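To make the difference between the objectives concrete, the sketch below computes a behavioral-cloning loss on filtered data and one possible reward-weighted loss from per-sample log-probabilities. The function names, the sample format, and the normalization of the reward-weighted loss are assumptions made for illustration; the paper's exact variants (and GOLD in particular) are not reproduced here.

```python
import math
from typing import Sequence, Tuple

# Each sample is (log-probability of the output under the policy, reward).
Sample = Tuple[float, float]


def filtered_bc_loss(samples: Sequence[Sample], tau: float) -> float:
    """Mean negative log-likelihood over samples whose reward exceeds tau."""
    kept = [logp for logp, r in samples if r > tau]
    if not kept:
        raise ValueError("no sample exceeds the threshold")
    return -sum(kept) / len(kept)


def reward_weighted_loss(samples: Sequence[Sample]) -> float:
    """Negative log-likelihood with each sample weighted by its reward.

    Normalizing by the total reward is an illustrative choice, not the paper's.
    """
    total = sum(r for _, r in samples)
    if total <= 0:
        raise ValueError("rewards must sum to a positive value")
    return -sum(r * logp for logp, r in samples) / total


samples = [(math.log(0.5), 0.9), (math.log(0.25), 0.2), (math.log(0.1), 0.7)]
print(filtered_bc_loss(samples, tau=0.5))  # averages the two samples with reward > 0.5
print(reward_weighted_loss(samples))       # every sample contributes, scaled by reward
```

The filtered loss ignores low-reward samples entirely, while the weighted loss still learns a little from them; the source reports that the simpler filtered form was a strong and stable choice.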
Singh et al. asked whether a model can improve beyond human-generated data when an automatic check on correctness is available, as it is for mathematics and code. [2] Their method, ReST-EM, keeps the Grow and Improve structure but makes two changes suited to verifiable reasoning. First, the reward is binary: a sample earns reward 1 if its final answer is correct (on the MATH benchmark) or if it passes the hidden unit tests (on APPS), and 0 otherwise. There is no learned reward model and no threshold schedule; filtering simply keeps the correct samples. Second, each Improve step restarts fine-tuning from the original base (pretrained) model rather than continuing from the previous iteration's checkpoint, a choice the authors adopt explicitly to limit task-specific overfitting and drift from the base model. Restarting from the base model matches STaR and differs from the original ReST, which warm-starts from the previous policy. [1][2][3]
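The two changes, binary filtering and restarting from the base model, can be illustrated with a toy loop. Everything here is a hypothetical stand-in: the "model" is a table of answer weights per problem, and `fine_tune_from_base` mimics supervised fine-tuning by adding weight to the kept answers. It is a sketch of the control flow only, not of real language-model training.

```python
import random
from typing import Dict, List, Tuple

Model = Dict[str, Dict[str, float]]  # problem -> {candidate answer: sampling weight}


def generate(model: Model, problems: List[str], k: int,
             rng: random.Random) -> List[Tuple[str, str]]:
    """Generate: sample k answers per problem from the current model."""
    samples = []
    for x in problems:
        candidates = list(model[x])
        weights = [model[x][c] for c in candidates]
        for _ in range(k):
            samples.append((x, rng.choices(candidates, weights=weights)[0]))
    return samples


def binary_reward(x: str, y: str, answers: Dict[str, str]) -> int:
    """Reward is 1 for a correct final answer and 0 otherwise."""
    return 1 if answers[x] == y else 0


def fine_tune_from_base(base: Model, kept: List[Tuple[str, str]]) -> Model:
    """Toy stand-in for SFT: copy the base model, then upweight each kept answer."""
    model = {x: dict(weights) for x, weights in base.items()}
    for x, y in kept:
        model[x][y] += 1.0
    return model


def rest_em(base: Model, answers: Dict[str, str], iterations: int,
            k: int, seed: int = 0) -> Model:
    rng = random.Random(seed)
    model = base
    for _ in range(iterations):
        samples = generate(model, list(base), k, rng)  # sample from the current model
        kept = [(x, y) for x, y in samples if binary_reward(x, y, answers) == 1]
        model = fine_tune_from_base(base, kept)        # always restart from base
    return model


base_model: Model = {"2+2": {"4": 1.0, "5": 3.0}, "3*3": {"9": 1.0, "6": 3.0}}
answers = {"2+2": "4", "3*3": "9"}
tuned = rest_em(base_model, answers, iterations=3, k=16)
print(tuned)
```

Note that each iteration samples from the most recent model but trains a fresh copy of `base_model`, so the base model itself is never modified and errors from earlier rounds are not carried forward in the weights.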
The "EM" in the name refers to a derivation of the loop as expectation-maximization. Introduce a binary optimality variable that indicates a high-reward output. Maximizing the log-likelihood of observing optimality leads, through a standard evidence-lower-bound argument, to two alternating steps. The E-step (Generate) draws samples from the current policy and weights them by reward, which for a binary reward means simply collecting the correct samples; this is the Grow phase. The M-step (Improve) maximizes the reward-weighted log-likelihood J(theta) = E[ r(x, y) log p(y | x; theta) ] over the dataset, which, again for binary rewards, reduces to ordinary supervised fine-tuning on the filtered correct samples. Iterating the E-step and M-step is therefore the Grow and Improve loop, now with a clean probabilistic justification. [2]
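The evidence-lower-bound argument can be written out as follows. This is a sketch in notation assumed here rather than taken from the text: O is the binary optimality variable, q(y | x) is a variational distribution over outputs, theta_t denotes the parameters at iteration t, and the likelihood of optimality is assumed proportional to a non-negative reward.

```latex
\begin{aligned}
\log p(O = 1 \mid x)
  &= \log \sum_{y} p(y \mid x; \theta)\, p(O = 1 \mid x, y) \\
  &\ge \sum_{y} q(y \mid x)\,
     \log \frac{p(y \mid x; \theta)\, p(O = 1 \mid x, y)}{q(y \mid x)}
     \quad \text{(Jensen's inequality)} \\[6pt]
\text{E-step:} \quad
q_{t+1}(y \mid x)
  &\propto p(y \mid x; \theta_t)\, p(O = 1 \mid x, y)
   \;\propto\; p(y \mid x; \theta_t)\, r(x, y) \\[6pt]
\text{M-step:} \quad
\theta_{t+1}
  &= \arg\max_{\theta}\; \mathbb{E}_{x}\,
     \mathbb{E}_{y \sim q_{t+1}(\cdot \mid x)}
     \big[ \log p(y \mid x; \theta) \big]
\end{aligned}
```

Substituting the E-step into the M-step and dropping the per-input normalizer of q_{t+1} gives the reward-weighted objective J(theta) = E[ r(x, y) log p(y | x; theta) ] with y drawn from the previous policy. With r in {0, 1}, the weighted expectation is simply an average over the correct samples.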
ReST-EM was run on PaLM 2 models of several sizes (PaLM 2-S, the code-capable PaLM 2-S*, and the large PaLM 2-L), using the roughly 7,500 training problems of the MATH benchmark and the roughly 2,342 introductory problems of APPS. [2][7][8][9]
A single iteration of either method has the same shape. [1][2]
Sampling many candidates per input (a large K) raises the chance of finding at least one high-reward output for hard inputs, which is what lets the model manufacture training signal for problems it cannot yet solve reliably. The two methods differ mainly in bookkeeping, as summarized below.
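The effect of K can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that the K samples are independent and that each one is correct with the same probability p; real samples from one model are correlated, so the numbers are only indicative. The function name is hypothetical.

```python
def prob_at_least_one_correct(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct.

    Assumes each sample succeeds independently with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


# A problem the model solves only 5% of the time per sample.
for k in (1, 8, 32, 64):
    print(k, round(prob_at_least_one_correct(0.05, k), 3))
```

Under these assumptions, even a low per-sample success rate yields a usable training example for most hard problems once K is in the tens.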
| Aspect | ReST (Gulcehre et al., 2023) | ReST-EM (Singh et al., 2023; TMLR 2024) |
|---|---|---|
| Framing | Growing-batch reinforcement learning | Expectation-maximization |
| Reward | Learned reward model, continuous score | Binary correctness verifier (0 or 1) |
| Filtering | Increasing thresholds tau1 < ... < tauN | Single rule: keep reward = 1 |
| Loop structure | One Grow, several Improve steps | Alternate one Generate and one Improve |
| Fine-tune start | Warm-start from previous policy | Restart from base model each round |
| Domain | Machine translation | Math (MATH) and code (APPS) |
| Base model | Supervised translation model | PaLM 2-S, PaLM 2-S*, PaLM 2-L |
ReST-EM sits in a cluster of "generate, filter, then fine-tune" techniques that approximate reinforcement learning with a reward by reweighting the model's own samples. [2][3]
The common thread, including with rejection sampling and best-of-N distillation, is treating the model as its own data generator and using a reward to decide what to learn from. [2]
The headline finding is that ReST-EM outperforms supervised fine-tuning on human-written solutions, and that the advantage does not shrink as models get larger: larger models benefit at least as much as smaller ones. [2] On MATH, PaLM 2-L test accuracy rose by roughly six percentage points over the base and human-data baselines; on APPS the gains were similar. The approximate figures read from the paper appear below. Most of the improvement arrives in the first one or two iterations. [2]
| Benchmark (PaLM 2-L, approximate) | Base / human-data SFT | After ReST-EM |
|---|---|---|
| MATH, test accuracy | about 35% | about 41% |
| APPS Introductory | about 32% | about 38% |
The self-trained models also transferred well to tasks they were not trained on. Fine-tuning on MATH improved GSM8K performance (for example, majority voting over 64 samples rose from about 44% to about 49%); a PaLM 2-L model trained with ReST-EM on MATH scored strongly on a held-out Hungarian high-school mathematics exam, behind only GPT-4 among the models the authors compared; and there was no meaningful degradation across the Big-Bench Hard suite under chain-of-thought evaluation. [2] These transfer results suggest that the method improves general problem-solving ability rather than merely memorizing the training set.
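Majority voting, the evaluation protocol mentioned for GSM8K, takes the most common final answer among several sampled solutions. The sketch below is a minimal illustration; the function name, the use of `None` for samples with no parseable answer, and the tie-breaking rule (first answer seen wins) are assumptions, not details from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(final_answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent final answer, ignoring samples with no answer.

    Ties go to the answer that appeared first among the tied ones.
    """
    counts = Counter(a for a in final_answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Six sampled solutions to one problem; one failed to produce an answer.
sampled = ["12", "15", "12", None, "12", "15"]
print(majority_vote(sampled))  # "12" appears three times, "15" twice
```

Voting rewards a model whose correct answers are consistent across samples, so it measures something slightly different from single-sample accuracy.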
By contrast, the original ReST improved machine-translation quality on both automatic metrics and human evaluation while being more compute-efficient than online RLHF, establishing the Grow and Improve template that ReST-EM later specialized to reasoning. [1]
Both methods share the limits of reward-filtered self-training. [1][2][3]