SPIN (Self-Play Fine-Tuning)
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,637 words
SPIN (Self-Play fIne-tuNing) is a post-training method for large language models introduced by researchers at the University of California, Los Angeles (UCLA) in January 2024. SPIN iteratively improves a supervised fine-tuning (SFT) checkpoint by framing alignment as a two-player game between the model being trained and a frozen copy of an earlier iteration of itself. The current model learns to distinguish human-written ground-truth responses (treated as "chosen" samples) from responses generated by the previous iteration (treated as "rejected" samples), and the resulting objective can be optimized with the same logistic-loss machinery used by Direct Preference Optimization (DPO).[^1]
The method was proposed in the paper "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models" by Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu, posted to arXiv as 2401.01335 on 2 January 2024 and accepted at the International Conference on Machine Learning (ICML) 2024.[^1][^2] SPIN demonstrated that an existing SFT checkpoint, the publicly available zephyr-7b-sft-full derived from Mistral 7B, can be significantly improved on the HuggingFace Open LLM Leaderboard, MT-Bench, and BIG-Bench using only the existing SFT dataset (a 50k-prompt subset of UltraChat 200K) plus self-generated synthetic completions, without collecting any additional human preference data.[^1][^3]
By late 2023, the dominant recipes for aligning open-source LLMs involved either Reinforcement Learning from Human Feedback (RLHF), as popularized for InstructGPT and ChatGPT, or Direct Preference Optimization (DPO), which trains directly on preference pairs without an explicit reward model.[^4][^5] Both approaches share a hard requirement: they need preference data, typically expensive human annotations or, in distillation pipelines, judgements from a stronger model such as GPT-4. The question SPIN addresses is whether a weak SFT model can be pushed substantially further using only the SFT data already on hand.[^1]
The SPIN paper answers this question affirmatively. Starting from the Zephyr 7B SFT checkpoint trained on UltraChat 200K, self-play through iteration 3 raised the average score on the six tasks of the HuggingFace Open LLM Leaderboard (ARC-Challenge, HellaSwag, MMLU, TruthfulQA, WinoGrande, GSM8K) from 58.14 to 63.16, with the largest absolute gains on GSM8K (26.76 to 38.97) and TruthfulQA (43.73 to 54.90).[^1][^6] The authors also report that SPIN at iteration 1 already surpasses a baseline DPO run on the Zephyr beta preference data (62k pairs of GPT-4-judged completions) on most leaderboard tasks.[^1]
SPIN has been compared to a long tradition of self-play in reinforcement learning, in particular AlphaGo and AlphaZero, as well as to generative adversarial networks (GANs), which similarly pit a generator and a discriminator against one another. A key conceptual difference is that in SPIN both "players" are instantiations of the same language model at different training steps, so no separate discriminator network is required.[^1]
Modern open-source chat models are typically built in three stages. Pretraining yields a base language model. Supervised fine-tuning on instruction-following demonstrations adapts the base model to the chat format. A preference optimization stage, originally implemented with PPO-based RLHF and later with DPO, refines the model on pairs of "chosen" and "rejected" responses.[^4][^5][^7]
RLHF was popularized by the InstructGPT paper of Ouyang et al. (2022), which used a learned reward model trained on human comparison data to optimize a policy via Proximal Policy Optimization (PPO).[^4] DPO, proposed by Rafailov et al. (2023), removes the explicit reward model by reparameterizing the optimal RLHF policy in terms of a closed-form log-ratio and training the language model directly with a binary classification loss on chosen/rejected pairs.[^5] DPO has become a standard component of post-training pipelines such as the Zephyr recipe by Tunstall et al. (2023) at Hugging Face, which reached strong MT-Bench scores by combining SFT on UltraChat with DPO on the GPT-4-judged UltraFeedback dataset.[^7]
A persistent limitation of both RLHF and DPO is the cost and availability of preference data. Constructing UltraFeedback-scale datasets requires either human annotators or judgements from frontier proprietary models, and the resulting preference signal is by construction "human-level" in quality.[^7][^8] Methods that try to reduce or eliminate this cost include Constitutional AI and its RLAIF variant, which substitute AI-generated critiques for human preferences,[^9] and self-training approaches such as ReST^EM by Singh, Co-Reyes and colleagues, which iteratively fine-tunes on filtered self-generated solutions to problems with verifiable scalar feedback.[^10] SPIN takes a different angle: it does not generate additional preference labels at all, but rather treats existing ground-truth responses as positives and self-generated continuations as negatives.[^1]
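SPIN operates iteratively. Let $\{(x_i, y_i)\}_{i=1}^N$ denote an SFT dataset of prompt–response pairs, where each $y_i$ is the human-written ground-truth completion for prompt $x_i$. Let $p_{\theta_t}$ denote the LLM at iteration $t$, with $\theta_0$ initialized from the SFT checkpoint. Each iteration has two steps: first, for every prompt $x_i$, a synthetic response $y'_i$ is sampled from the frozen model $p_{\theta_t}(\cdot \mid x_i)$; second, $\theta_{t+1}$ is obtained by minimizing the SPIN loss $L_\text{SPIN}(\theta, \theta_t)$ over the pairs $(y_i, y'_i)$.[^1]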
Intuitively, the main player $p_\theta$ (the current model being trained) is rewarded for raising the log-probability ratio it assigns to the ground-truth response $y$ relative to the frozen previous iteration $p_{\theta_t}$, while simultaneously lowering the same ratio for the self-generated response $y'$.[^1]
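The outer loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_response` and `minimize_spin_loss` are hypothetical callables standing in for decoding and for the per-iteration optimizer, and the `prompt`, `response`, `chosen`, and `rejected` field names are assumptions made for the example.

```python
from typing import Callable, Dict, List, TypeVar

Params = TypeVar("Params")  # placeholder type for model weights


def build_pairs(
    sft_data: List[Dict[str, str]],
    opponent: Params,
    sample_response: Callable[[Params, str], str],
) -> List[Dict[str, str]]:
    """Pair each ground-truth response (chosen) with one opponent sample (rejected)."""
    return [
        {
            "prompt": ex["prompt"],
            "chosen": ex["response"],                             # y: human-written
            "rejected": sample_response(opponent, ex["prompt"]),  # y': self-generated
        }
        for ex in sft_data
    ]


def spin(
    sft_data: List[Dict[str, str]],
    theta_0: Params,
    sample_response: Callable[[Params, str], str],
    minimize_spin_loss: Callable[[Params, List[Dict[str, str]]], Params],
    num_iterations: int,
) -> Params:
    """Run the SPIN outer loop; each round uses the previous weights as the frozen opponent."""
    theta_t = theta_0
    for _ in range(num_iterations):
        pairs = build_pairs(sft_data, theta_t, sample_response)
        # theta_t serves as the frozen reference; the optimizer returns theta_{t+1}.
        theta_t = minimize_spin_loss(theta_t, pairs)
    return theta_t
```

The negatives are regenerated at every iteration from the latest weights, which is what distinguishes this loop from repeated epochs over one fixed set of pairs.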
The authors derive the loss from a game-theoretic formulation in which a "main player" $f$ (a real-valued function of $(x, y)$) tries to maximize an Integral Probability Metric (IPM) between the target data distribution $p_\text{data}$ and the opponent distribution $p_{\theta_t}$, while the opponent (the previous LLM iteration) tries to match the data distribution. Restricting the function class of the main player to log-density-ratio functions parameterized by $f(x,y) = \lambda \log [p_\theta(y\mid x) / p_{\theta_t}(y\mid x)]$ yields the SPIN objective. Each iteration thus alternates between two steps of a saddle-point optimization, which the paper analyzes as a fixed-point iteration on the space of LLM policies.[^1]
In the published experiments, the synthetic prompts are 50,000 random prompts drawn from the HuggingFace ultrachat_200k dataset, the same SFT corpus on which zephyr-7b-sft-full was trained.[^1][^3] At each iteration, the model generates one synthetic response per prompt with sampling, and the loss is computed on the pair (ground-truth, self-generated). The released codebase at github.com/uclaml/SPIN provides four checkpoints, iter0 through iter3, along with the corresponding parquet datasets.[^3]
The SPIN paper provides both a fixed-point characterization of the iterative procedure and an explicit closed-form expression for the per-iteration update under logistic loss.
The central theoretical result is a fixed-point theorem stating, informally, that the SPIN training objective is globally minimized if and only if the model's conditional distribution matches the target data distribution. Concretely, if $p_{\theta_t}(\cdot \mid x) = p_\text{data}(\cdot \mid x)$ then $\theta_t$ is a global minimizer of $L_\text{SPIN}(\theta, \theta_t)$ for any $\lambda \ge 0$; conversely, if $p_{\theta_t}(\cdot \mid x) \ne p_\text{data}(\cdot \mid x)$ then there exists a finite $\lambda$ such that $\theta_t$ is not a global minimizer.[^1] The optimization process therefore "naturally stops" once the model reproduces the empirical data distribution.[^1]
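The first direction can be seen with a short convexity argument. The following is a sketch of the intuition under the logistic loss $\ell(t) = \log(1 + e^{-t})$, not the paper's proof; it writes $f_\theta(x,y) = \lambda \log [p_\theta(y\mid x) / p_{\theta_t}(y\mid x)]$ and assumes the expectations are finite. When $p_{\theta_t} = p_\text{data}$, the responses $y$ and $y'$ are identically distributed given $x$, so the expected margin is zero and Jensen's inequality for the convex function $\ell$ gives

$$L_\text{SPIN}(\theta, \theta_t) = \mathbb{E}\big[\ell\big(f_\theta(x,y) - f_\theta(x,y')\big)\big] \;\ge\; \ell\big(\mathbb{E}[f_\theta(x,y) - f_\theta(x,y')]\big) = \ell(0) = \log 2.$$

Setting $\theta = \theta_t$ makes $f_\theta \equiv 0$ and attains the bound $\log 2$, so $\theta_t$ is a global minimizer in this case.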
Under logistic loss and an additional regularity condition, the paper derives an idealized expression for the next-iterate distribution: $$p_{\theta_{t+1}}(y \mid x) \;\propto\; p_{\theta_t}(y \mid x) \left( \frac{p_\text{data}(y \mid x)}{p_{\theta_t}(y \mid x)} \right)^{1/\lambda}.$$[^1] For $\lambda \ge 1$ this is a geometric interpolation between the previous policy and the data distribution: as $\lambda \to \infty$ the exponent vanishes and $p_{\theta_{t+1}} \to p_{\theta_t}$, while at $\lambda = 1$ the right-hand side reduces to $p_\text{data}$ exactly. For $\lambda < 1$ the update overshoots, and as $\lambda \to 0^+$ it concentrates on the responses with the largest ratio $p_\text{data}/p_{\theta_t}$. The expression makes precise the sense in which each SPIN step "moves toward" the data distribution, with $1/\lambda$ acting as a step size.
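A toy numerical check of this expression may help. The sketch below applies the update to a hypothetical three-response distribution (the numbers in `p_prev` and `p_data` are made up for illustration) and assumes every probability in `p_prev` is strictly positive.

```python
from typing import List


def spin_ideal_update(p_prev: List[float], p_data: List[float], lam: float) -> List[float]:
    """Return p_next(y) proportional to p_prev(y) * (p_data(y) / p_prev(y)) ** (1 / lam)."""
    unnormalized = [q * (d / q) ** (1.0 / lam) for q, d in zip(p_prev, p_data)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


p_prev = [0.7, 0.2, 0.1]  # toy previous-iteration distribution over three responses
p_data = [0.2, 0.3, 0.5]  # toy target distribution

for lam in (100.0, 2.0, 1.0):
    updated = spin_ideal_update(p_prev, p_data, lam)
    print(f"lambda={lam}: {[round(v, 3) for v in updated]}")
```

With `lam=1.0` the update returns `p_data` (up to floating-point error), and with a very large `lam` it stays close to `p_prev`; intermediate values land in between.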
The two-player formulation of SPIN echoes the generator/discriminator structure of GANs.[^1][^11] However, SPIN's "discriminator" is constrained to the parametric form of a log density ratio, and both players are the same LLM at different iterations rather than independent networks. This connection has been examined in follow-up work that introduces explicit Kullback-Leibler regularization, equivalent to mixing the previous policy with a base policy ("fictitious play"), to stabilize the self-play dynamics.[^12]
The base model is zephyr-7b-sft-full, the SFT-only checkpoint of HuggingFace's Zephyr recipe, itself initialized from Mistral 7B and fine-tuned on UltraChat 200K.[^1][^7] All SPIN experiments share this starting point. Training uses DeepSpeed ZeRO-3, FlashAttention-2, the RMSProp optimizer, bfloat16 precision, and a global batch size of 64 on 8x A100 GPUs, with each iteration running two epochs of supervised optimization over a 50k synthetic dataset.[^3][^6]
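For orientation, the stated settings can be collected into a small configuration and used for back-of-the-envelope step counts. The key names below are illustrative and are not the schema of the released repository; the per-GPU batch size assumes no gradient accumulation, and the step count assumes the final partial batch is kept, neither of which the text specifies.

```python
import math

# Settings as stated in the text; key names are hypothetical.
spin_config = {
    "base_model": "zephyr-7b-sft-full",
    "optimizer": "RMSProp",
    "precision": "bfloat16",
    "parallelism": "DeepSpeed ZeRO-3",
    "attention": "FlashAttention-2",
    "global_batch_size": 64,
    "num_gpus": 8,
    "epochs_per_iteration": 2,
    "pairs_per_iteration": 50_000,
}

# Assumption: no gradient accumulation, so the global batch splits evenly across GPUs.
per_gpu_batch = spin_config["global_batch_size"] // spin_config["num_gpus"]

# Assumption: the last, partially filled batch of each epoch is not dropped.
steps_per_epoch = math.ceil(
    spin_config["pairs_per_iteration"] / spin_config["global_batch_size"]
)
steps_per_iteration = steps_per_epoch * spin_config["epochs_per_iteration"]

print(per_gpu_batch, steps_per_epoch, steps_per_iteration)
```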
Evaluation uses the standard lm-evaluation-harness setup for the HuggingFace Open LLM Leaderboard, which averages six tasks: ARC-Challenge, HellaSwag, MMLU, TruthfulQA, WinoGrande, and GSM8K.[^1] The paper reports the following averages and per-task numbers (in percent accuracy, except TruthfulQA which is mc2 and GSM8K which is exact-match):[^1][^6]
| Model | ARC | TruthfulQA | WinoGrande | GSM8K | HellaSwag | MMLU | Average |
|---|---|---|---|---|---|---|---|
| Baseline (zephyr-7b-sft-full) | 60.41 | 43.73 | 74.19 | 26.76 | 82.85 | 60.92 | 58.14 |
| SPIN iter 0 | 63.40 | 49.18 | 72.69 | 35.10 | 84.38 | 60.03 | 60.80 |
| SPIN iter 1 | 65.19 | 55.17 | 72.30 | 35.78 | 84.96 | 59.34 | 62.12 |
| SPIN iter 3 | 65.87 | 54.90 | 73.72 | 38.97 | 85.54 | 59.99 | 63.16 |
The largest single-iteration gain occurs at iteration 0, with GSM8K improving by 8.34 absolute points and TruthfulQA by 5.45. By iteration 3, the cumulative improvement in the average is 5.02 absolute points, with GSM8K (+12.21) and TruthfulQA (+11.17) carrying most of the gain. Improvements in the average taper after iteration 1 but do not regress; individual tasks are less uniform, with WinoGrande (-0.47) and MMLU (-0.93) finishing slightly below the SFT baseline.[^1]
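The averages and deltas quoted here can be recomputed directly from the table rows. The snippet below is only an arithmetic check on the reported numbers; the dictionary layout is an assumption of the example.

```python
TASKS = ["ARC", "TruthfulQA", "WinoGrande", "GSM8K", "HellaSwag", "MMLU"]

# Per-task scores copied from the table, in the order given by TASKS.
scores = {
    "baseline": [60.41, 43.73, 74.19, 26.76, 82.85, 60.92],
    "iter0": [63.40, 49.18, 72.69, 35.10, 84.38, 60.03],
    "iter1": [65.19, 55.17, 72.30, 35.78, 84.96, 59.34],
    "iter3": [65.87, 54.90, 73.72, 38.97, 85.54, 59.99],
}

# Unweighted mean over the six tasks for each row.
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Change from the SFT baseline to iteration 3, per task.
deltas_vs_baseline = {
    task: final - base
    for task, base, final in zip(TASKS, scores["baseline"], scores["iter3"])
}

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
for task, delta in deltas_vs_baseline.items():
    print(f"{task}: {delta:+.2f}")
```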
On MT-Bench, the multi-turn LLM-as-judge benchmark of Zheng et al. (2023), the paper reports the SPIN-trained model improving from a baseline of 5.94 to 6.78 by iteration 2, an improvement comparable in magnitude to that obtained by DPO training of the Zephyr beta model.[^1][^13] Selected BIG-Bench tasks similarly improve across iterations; for instance, Causal Judgment, Formal Fallacies, and Sports Understanding all gain several points relative to the SFT baseline.[^1]
The most striking experimental claim is that SPIN at iteration 0, using only synthetic negatives, is competitive with DPO trained on the 62k UltraFeedback preference pairs used for zephyr-7b-beta, and that SPIN at iteration 1 surpasses DPO on most leaderboard tasks. The paper interprets this as evidence that "iterative training is a necessary component in SPIN as it breaks the limit of multi-epoch training" on a single static preference dataset.[^1] An ablation shows that running DPO for additional epochs on the same 62k preference set does not produce comparable gains.[^1]
A widely cited observation is that the SPIN per-iteration objective, when instantiated with logistic loss, has the same functional form as the DPO loss with the previous iteration $p_{\theta_t}$ serving as the reference policy and the chosen/rejected roles played by ground-truth and self-generated responses respectively.[^1][^14] Concretely, the SPIN loss becomes $$L_\text{SPIN}(\theta, \theta_t) = - \mathbb{E}\Big[ \log \sigma\big( \lambda \log \tfrac{p_\theta(y\mid x)}{p_{\theta_t}(y\mid x)} - \lambda \log \tfrac{p_\theta(y'\mid x)}{p_{\theta_t}(y'\mid x)} \big) \Big],$$ which is exactly the DPO loss with $\beta = \lambda$, reference policy $\pi_\text{ref} = p_{\theta_t}$, chosen response $y_w = y$, and rejected response $y_l = y'$.[^1][^14]
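A per-example version of this loss can be written in a few lines. The sketch below works on precomputed sequence log-probabilities rather than on a real model; the function and argument names are assumptions of the example, and the softplus form is used only for numerical stability.

```python
import math
from typing import List, Tuple


def spin_logistic_loss(
    logp_chosen: float,       # log p_theta(y | x), ground-truth response
    logp_chosen_ref: float,   # log p_theta_t(y | x), frozen previous iteration
    logp_rejected: float,     # log p_theta(y' | x), self-generated response
    logp_rejected_ref: float, # log p_theta_t(y' | x)
    lam: float,
) -> float:
    """Return -log(sigmoid(margin)) for one (chosen, rejected) pair."""
    margin = lam * (
        (logp_chosen - logp_chosen_ref) - (logp_rejected - logp_rejected_ref)
    )
    # -log(sigmoid(m)) equals softplus(-m) = max(-m, 0) + log(1 + exp(-|m|)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def batch_spin_loss(batch: List[Tuple[float, float, float, float]], lam: float) -> float:
    """Average the per-pair loss over a batch of log-probability tuples."""
    return sum(spin_logistic_loss(*row, lam) for row in batch) / len(batch)
```

At the start of an iteration the trained model equals the reference, so every margin is zero and the loss is $\log 2$; it falls as the model raises the ratio on $y$ and lowers it on $y'$.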
Despite this equivalence in functional form, the paper emphasizes three differences from standard DPO. First, DPO is typically run as a single offline pass; SPIN runs several iterations (iter0 through iter3 in the published experiments), each of which generates its own negatives. Second, DPO requires preference data, whereas SPIN consumes only an SFT dataset and self-generated samples. Third, the SPIN framework permits losses other than logistic, although in practice only the logistic instantiation has been explored experimentally.[^1] The verl reinforcement learning library by ByteDance implements SPIN under this DPO-equivalent view, sharing infrastructure with online DPO.[^14]
Several contemporaneous methods generalize DPO to an iterative or online setting. Snorkel AI's Snorkel-Mistral-PairRM-DPO recipe, released in early 2024, performs three rounds of self-generation followed by DPO using PairRM as a preference oracle.[^15] Methods such as Iterative DPO (Xu et al., 2023; Xiong et al., 2023) and Iterative Reasoning Preference Optimization extend this idea further, using external reward models or rule-based filtering to label self-generated pairs.[^16] SPIN sits within this family as the variant that uses the SFT ground-truth itself, rather than any external preference model, as the positive signal.[^1]
A contemporaneous method, "Self-Rewarding Language Models" by Yuan et al. (2024) from Meta, posted to arXiv on 18 January 2024 (sixteen days after SPIN) and accepted at ICML 2024, also performs iterative DPO without external preference data. Self-Rewarding LMs instead use the model itself in an LLM-as-a-Judge role to score its own samples, producing preference pairs that are then used for DPO. Three iterations of self-rewarding on Llama 2 70B reportedly surpass Claude 2, Gemini Pro, and GPT-4-0613 on the AlpacaEval 2.0 benchmark.[^17] SPIN and Self-Rewarding LMs are often discussed together as the two principal "self-improvement via iterative DPO" papers of January 2024; the methods differ in how the rejected response is obtained (a previous iteration of the model, versus a same-iteration sample judged worse by the model-as-judge).[^17]
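The difference in how pairs are built can be made concrete with a short sketch. Everything here is illustrative: `generate` and `judge_score` are hypothetical callables, and picking the best- and worst-scored candidates is a simplification chosen for this example rather than a claim about either paper's exact procedure.

```python
from typing import Callable, Dict


def spin_pair(
    prompt: str, ground_truth: str, generate: Callable[[str], str]
) -> Dict[str, str]:
    """SPIN: the SFT ground truth is chosen; the model's own sample is rejected."""
    return {"prompt": prompt, "chosen": ground_truth, "rejected": generate(prompt)}


def self_rewarding_pair(
    prompt: str,
    generate: Callable[[str], str],
    judge_score: Callable[[str, str], float],
    num_candidates: int = 4,
) -> Dict[str, str]:
    """Self-rewarding style: both sides are model samples, ranked by a judge score."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    ranked = sorted(candidates, key=lambda c: judge_score(prompt, c))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}
```

In the first function no ground-truth response is ever replaced by a model output, which is why SPIN's target stays fixed at the SFT data; in the second, no ground-truth response is needed at all.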
Constitutional AI and RLAIF, introduced by Bai et al. (2022) at Anthropic, similarly aim to reduce reliance on human preference labels by having a model critique and revise its own outputs according to a written constitution, then training a reward model on the resulting AI-generated preferences.[^9] SPIN is more minimal: there is no constitution, no critique, and no separate reward model, and the "preference" is implicit in the contrast between human-written and self-generated text.[^1]
ReST^EM, proposed by Singh, Co-Reyes and collaborators at Google DeepMind in December 2023, performs expectation-maximization-style self-training. The model samples candidate solutions, filters them by a binary correctness signal (for example, whether a math solution matches the gold answer or whether code passes test cases), and fine-tunes on the filtered positives.[^10] Unlike SPIN, ReST^EM requires a verifier and discards rather than penalizes incorrect samples, but both methods iteratively bootstrap an LLM from a fixed seed dataset without additional human labels.[^10]
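The sample-and-filter step described here can be sketched as below. This is a schematic of the idea only: `sample_solutions` and `is_correct` are hypothetical stand-ins for the sampler and the verifier, and the fine-tuning step on the returned examples is omitted.

```python
from typing import Callable, Dict, List


def filter_correct_samples(
    problems: List[str],
    sample_solutions: Callable[[str], List[str]],
    is_correct: Callable[[str, str], bool],
) -> List[Dict[str, str]]:
    """Keep only the self-generated solutions that pass the binary verifier."""
    kept = []
    for problem in problems:
        for solution in sample_solutions(problem):
            if is_correct(problem, solution):
                kept.append({"prompt": problem, "response": solution})
            # Incorrect samples are dropped here, not used as negatives.
    return kept
```

The dropped branch is the point of contrast with SPIN, where self-generated text always enters the loss as the rejected side of a pair.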
The first half of 2024 saw a burst of work on iterative self-improvement of LLMs. Beyond SPIN and Self-Rewarding LMs, notable contemporaneous methods include:
Iterative DPO recipes built on mistral-7b-instruct, along with various academic instantiations, use a fixed reward model (PairRM, UltraRM) to label self-generated pairs across multiple DPO rounds.[^15]
SPPO (Self-Play Preference Optimization): on Mistral-7B-Instruct-v0.2, SPPO reaches a 28.53% length-controlled win rate against GPT-4 Turbo on AlpacaEval 2.0 using only 60k UltraFeedback prompts and the PairRM 0.4B preference model; on Llama-3-8B-Instruct it reaches 38.77%.[^18] SPPO shares authors with SPIN (Yuan, Ji, Gu) and is often viewed as the preference-optimization analogue of SPIN.[^18]
Within months of release, SPIN attracted both empirical analyses and methodological extensions.
Alami, Abubaker, Achab, Seddik, and Lahlou (2024) examined SPIN's training stability and proposed KL-regularized variants and fictitious-play schemes to dampen oscillations across iterations. They show that KL regularization is equivalent to replacing the previous policy with a geometric mixture of the base and previous policies, and report stabilized MT-Bench and Open LLM Leaderboard gains.[^12]
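The geometric-mixture reference can be illustrated on a toy categorical distribution. The sketch below is an assumption-laden illustration, not the follow-up paper's code: `alpha` is a hypothetical mixing weight, and explicit normalization is only feasible because the toy distributions have three outcomes.

```python
import math
from typing import List


def geometric_mixture(p_base: List[float], p_prev: List[float], alpha: float) -> List[float]:
    """Return the normalized distribution proportional to p_base**alpha * p_prev**(1 - alpha)."""
    log_unnorm = [
        alpha * math.log(b) + (1.0 - alpha) * math.log(q)
        for b, q in zip(p_base, p_prev)
    ]
    shift = max(log_unnorm)  # subtract the max before exponentiating for stability
    weights = [math.exp(v - shift) for v in log_unnorm]
    z = sum(weights)
    return [w / z for w in weights]


p_base = [0.5, 0.3, 0.2]  # toy base (SFT) policy
p_prev = [0.2, 0.3, 0.5]  # toy previous-iteration policy
print(geometric_mixture(p_base, p_prev, alpha=0.5))
```

For full sequences the normalizer depends only on the prompt, so in a DPO-style loss it cancels in the difference between the chosen and rejected log-ratios and the unnormalized weighted sum of log-probabilities suffices.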
SPIN-Diffusion ports the self-play formulation to text-to-image diffusion models, where the "policy" is the score network and "samples" are images.[^19] The same authorial team has also extended self-play ideas to other generative modeling settings.
SPIN has been adapted to specialized tasks where self-generated data quality varies. ExSPIN extends SPIN to text-to-SQL parsing with explicit execution feedback, addressing instability in domains where the SFT ground-truth is itself noisy.[^20] Reports also note that on text-to-SQL benchmarks such as SPIDER and BIRD, vanilla SPIN can degrade performance relative to SFT, motivating these task-specific variants.[^20]
The official uclaml/SPIN repository releases the data-generation, training, and evaluation pipelines along with checkpoints for iter0 through iter3 of the Zephyr 7B reproduction.[^3] HuggingFace community members and the verl team have integrated SPIN-style training into general-purpose post-training libraries, where it appears as an instantiation of online DPO with a particular choice of negative sampler.[^14]
The SPIN paper and its follow-ups identify several limitations.
Fixed target distribution. SPIN converges, by construction, when $p_\theta = p_\text{data}$. The performance ceiling is therefore the distribution of the SFT corpus. If the SFT data is itself imperfect or limited, the SPIN-trained model can be no better than that data on the metrics implicitly captured by the SFT responses.[^1] Self-Rewarding LMs and similar methods attempt to break this ceiling by allowing the target distribution to evolve via self-judging.[^17]
Iteration-count plateau. The paper reports the largest gains at iteration 0 and diminishing returns thereafter, with most experiments stopping at iteration 3.[^1] Subsequent analyses report that performance can degrade past a peak iteration on some datasets, motivating regularization or stopping heuristics.[^12]
Sensitivity to the base model. SPIN relies on the SFT model being strong enough to produce useful synthetic negatives. On weaker base models or on tasks far from the SFT distribution, the synthetic responses can be too low-quality to provide a useful learning signal, and the iterative procedure can produce negative gains; reported failures include SPIN reducing accuracy by up to 12 percentage points relative to SFT on the SPIDER text-to-SQL benchmark when applied to DeepSeek-Coder 6.7B.[^20]
Computational overhead. Each iteration requires generating one synthetic completion per prompt, then performing two epochs of fine-tuning. The compute cost scales linearly with the number of iterations, and the published recipe uses 8x A100 GPUs for the 7B-scale experiments.[^3][^6]
Distinction from "true" self-play. Unlike AlphaGo or AlphaZero, where self-play interacts with an environment that supplies an objective reward, SPIN has no environment and no reward signal. The "win condition" is reproducing the SFT data, which is qualitatively different from improving on an open-ended task.[^1]
Risk of mode collapse. The two-player formulation, like other adversarial training paradigms, can suffer instability and oscillation between iterations.[^12]