Absolute Zero Reasoner
Last reviewed
Jun 8, 2026
Sources
2 citations
Review status
Source-backed
Revision
v1 · 1,543 words
Absolute Zero is a reinforcement learning paradigm for training reasoning models in which a single model proposes its own tasks and then solves them, using code execution as a grounded verifier, with no human-curated questions, answers, or demonstrations. It was introduced in the May 2025 paper "Absolute Zero: Reinforced Self-play Reasoning with Zero Data" by Andrew Zhao and colleagues from Tsinghua University, the Beijing Institute for General Artificial Intelligence (BIGAI), and Pennsylvania State University [1]. The paper's flagship implementation, the Absolute Zero Reasoner (AZR), reaches state-of-the-art results on coding and mathematics benchmarks despite using zero external training examples, in several cases outperforming models trained on tens of thousands of human-labeled problems [1][2]. The name "absolute zero" denotes the elimination of all external data, distinguishing it from the broader "zero" setting popularized by DeepSeek-R1-Zero, which still relies on human-curated question sets.
Absolute Zero reframes reinforcement learning with verifiable rewards (RLVR) as a self-play problem. Rather than learning from a fixed dataset of problems and checkable answers, one model alternately takes two roles: a proposer that invents new reasoning tasks calibrated to be neither trivial nor impossible, and a solver that attempts to answer them. A code executor serves as a single, grounded source of truth, validating that proposed tasks are well-formed and verifying whether the solver's answers are correct. Because both the curriculum and the supervision signal are generated internally, the system requires no human-written questions, gold labels, or distillation from a stronger model [1].
The authors position the work as an answer to a looming scalability limit: as AI systems begin to exceed human performance on narrow reasoning tasks, datasets curated by humans become both expensive to produce and a potential ceiling on what a model can learn. Absolute Zero proposes that a model can instead author its own ever-evolving curriculum, in principle without bound. The paradigm draws an explicit analogy to AlphaZero, which mastered board games purely through self-play, while extending the idea to open-ended reasoning grounded in code execution rather than the fixed rules of a game [1].
RLVR has become a dominant recipe for building reasoning models. Instead of relying on a learned reward model, RLVR uses an automatic, deterministic verifier, such as checking a final numeric answer in mathematics or running unit tests on generated code, to assign a binary reward to a model's full chain-of-thought rollout. This signal is then used to optimize the policy with algorithms such as PPO or GRPO. The approach underlies models like DeepSeek-R1 and its R1-Zero variant, which applies RL directly to a base model without a supervised warm-up stage [1].
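The verifier-plus-optimizer recipe described above can be illustrated with a minimal sketch. It assumes an exact-match check on a final answer and a group-normalized advantage in the spirit of GRPO; the function names `binary_reward` and `group_normalized_advantages` are hypothetical, and real systems use more careful answer parsing.

```python
from statistics import mean, pstdev


def binary_reward(predicted: str, reference: str) -> float:
    """Deterministic verifier: 1.0 if the final answer matches exactly, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_normalized_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Center and scale rewards within one group of rollouts for the same prompt.

    Assumption: mean/std normalization over the group, as in GRPO-style training.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question whose reference answer is "42".
rollout_answers = ["42", "41", " 42 ", "7"]
rewards = [binary_reward(answer, "42") for answer in rollout_answers]
advantages = group_normalized_advantages(rewards)
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage, so the policy update pushes probability toward rollouts the verifier accepts.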
The limitation Absolute Zero targets is that RLVR still depends on a human-assembled corpus of questions paired with verifiable answers. Building such corpora is labor-intensive, and the distribution of human-written problems may constrain the model to human-conceived patterns of reasoning. This is a different concern from earlier self-improvement methods such as STaR (Self-Taught Reasoner) and rejection-sampling fine-tuning, which bootstrap better rationales but still consume an external set of problems and answers. Absolute Zero removes that dependency entirely by having the model generate the problems as well [1].
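In the Absolute Zero formulation, a single policy is rewarded for two distinct behaviors within one self-play loop: as proposer, for inventing tasks that are neither trivial nor impossible for the current model, and as solver, for answering those tasks correctly [1].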
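A minimal sketch of how the proposer and solver behaviors might each be scored. The solver reward is a plain correctness check; the proposer reward shown here is one shape consistent with "neither trivial nor impossible" (zero when the solver always or never succeeds, otherwise higher for harder tasks). The exact formula and the names `solver_reward` and `proposer_reward` are assumptions for illustration, not a quotation from the source.

```python
def solver_reward(is_correct: bool) -> float:
    """Solver role: binary reward for a verified-correct answer."""
    return 1.0 if is_correct else 0.0


def proposer_reward(solver_outcomes: list[bool]) -> float:
    """Proposer role: score a task by how learnable it is for the current solver.

    `solver_outcomes` holds the pass/fail results of several solver attempts
    on the proposed task (assumed to be estimated by sampling rollouts).
    """
    if not solver_outcomes:
        raise ValueError("need at least one solver attempt")
    solve_rate = sum(solver_outcomes) / len(solver_outcomes)
    if solve_rate in (0.0, 1.0):
        return 0.0  # always solved (trivial) or never solved (impossible)
    return 1.0 - solve_rate  # harder-but-solvable tasks score higher


easy_but_useful = proposer_reward([True, True, True, False])
hard_but_solvable = proposer_reward([True, False, False, False])
```

Under this shape the proposer has no incentive to emit tasks the solver already masters or tasks it cannot solve at all.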
The code executor is central. It acts as what the authors call an open-ended yet grounded environment: open-ended because the space of programs the model can propose is effectively unlimited, and grounded because Python execution gives an objective, non-gameable check on both task validity and answer correctness. A proposed task is only admitted to training if it executes successfully and is deterministic, which filters out malformed or ambiguous problems before they can corrupt the learning signal [1].
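The admission filter described above can be sketched as follows. This is an illustration only: it assumes each proposed program is a source string defining a function named `f`, approximates the determinism check by running the program twice and comparing outputs, and uses bare `exec` with no sandboxing. The names `run_program` and `validate_task` are hypothetical.

```python
from typing import Any


def run_program(source: str, argument: Any) -> Any:
    """Execute `source`, which is assumed to define `f`, and return f(argument)."""
    namespace: dict[str, Any] = {}
    exec(source, namespace)  # illustration only: untrusted code needs a sandbox
    return namespace["f"](argument)


def validate_task(source: str, argument: Any, runs: int = 2) -> tuple[bool, Any]:
    """Return (True, output) if the program runs cleanly and repeatably,
    otherwise (False, None)."""
    outputs = []
    for _ in range(runs):
        try:
            outputs.append(run_program(source, argument))
        except Exception:
            return False, None  # malformed or crashing program: reject
    if any(out != outputs[0] for out in outputs[1:]):
        return False, None  # outputs differ between runs: reject
    return True, outputs[0]


ok_task = "def f(xs):\n    return sorted(xs)[::-1]"
crashing_task = "def f(x):\n    return x / 0"
random_task = "import random\ndef f(x):\n    return x + random.random()"
```

For example, `validate_task(ok_task, [1, 3, 2])` accepts the task and yields its ground-truth output, while the crashing and randomized programs are rejected before they can enter training.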
The entire process is seeded with a single trivial example, the identity function, after which the model bootstraps its own increasingly complex tasks [2]. The authors optimize the joint proposer-solver objective with a reinforcement learning procedure they call Task-Relative REINFORCE++ (TRR++), which computes separate advantage baselines for each task role and type [1].
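A rough sketch of what "separate advantage baselines for each task role and type" could look like in practice. It assumes each sample is tagged with its role and task type and that rewards are normalized by the mean and standard deviation of their own (role, task type) group; the normalization details and the name `task_relative_advantages` are assumptions, not the authors' code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def task_relative_advantages(samples, eps=1e-6):
    """samples: list of (role, task_type, reward) tuples.

    Returns one advantage per sample, computed against the baseline of the
    samples that share the same (role, task_type) pair.
    """
    groups = defaultdict(list)
    for role, task_type, reward in samples:
        groups[(role, task_type)].append(reward)

    # One baseline (mean) and scale (std) per (role, task_type) group.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    advantages = []
    for role, task_type, reward in samples:
        mu, sigma = stats[(role, task_type)]
        advantages.append((reward - mu) / (sigma + eps))
    return advantages


batch = [
    ("solve", "deduction", 1.0),
    ("solve", "deduction", 0.0),
    ("propose", "abduction", 0.5),
    ("propose", "abduction", 0.5),
]
batch_advantages = task_relative_advantages(batch)
```

With two roles and three task types this yields up to six independent baselines, so an easy task type does not distort the advantage estimates of a harder one.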
AZR instantiates the paradigm over code, using (program, input, output) triplets as the universal representation of a reasoning task. From these triplets the model constructs three complementary task types, which the authors connect to classic modes of inference [1][2]:
| Task type | Given | Predict | Reasoning analogue |
|---|---|---|---|
| Deduction | program and input | output | step-by-step logical execution |
| Abduction | program and output | a plausible input | trial-and-error / search |
| Induction | a set of input-output examples | a program | generalization from partial information |
For deduction the executor runs the program to obtain the ground-truth output; for abduction it runs the candidate input through the program to confirm it reproduces the target output; for induction it runs the synthesized program against held-out examples. In every case verification reduces to executing code, so no human answer key is needed. By spanning deduction, abduction, and induction, AZR exercises distinct reasoning skills rather than a single narrow competency [1].
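The three verification routines described above all reduce to running a program and comparing results, which a short sketch makes concrete. For brevity it treats programs as Python callables rather than sandboxed source strings, and the names `verify_deduction`, `verify_abduction`, `verify_induction`, and `square_sum` are hypothetical.

```python
from typing import Any, Callable, Iterable

Program = Callable[[Any], Any]


def _runs_to(program: Program, task_input: Any, expected: Any) -> bool:
    """True if program(task_input) equals expected; a crash counts as failure."""
    try:
        return program(task_input) == expected
    except Exception:
        return False


def verify_deduction(program: Program, task_input: Any, predicted_output: Any) -> bool:
    """Given program and input, check the predicted output."""
    return _runs_to(program, task_input, predicted_output)


def verify_abduction(program: Program, target_output: Any, predicted_input: Any) -> bool:
    """Given program and output, accept any input that reproduces the output."""
    return _runs_to(program, predicted_input, target_output)


def verify_induction(candidate: Program, held_out: Iterable[tuple[Any, Any]]) -> bool:
    """Given examples, check the synthesized program on held-out pairs."""
    return all(_runs_to(candidate, x, y) for x, y in held_out)


def square_sum(xs):
    return sum(x * x for x in xs)
```

Note that abduction accepts any input that maps to the target output: both `[3, 4]` and `[5]` pass for a target of 25, which is why the table describes the prediction as "a plausible input" rather than the original one.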
The released implementation trains base models from the Qwen2.5 family, including the 3B, 7B, and 14B coder variants and the 7B base variant, as well as a Llama 3.1 8B variant [2]. The code and model checkpoints were released under the MIT license [2].
Trained on zero curated data, AZR-Coder-7B (built on Qwen2.5-7B-Coder) improves its combined coding-and-math average from a base of 40.2 to 50.4, a gain of 10.2 points, comprising a 5.0-point gain on coding benchmarks and a 15.2-point gain on mathematics benchmarks [1][2]. The authors report that this places AZR at overall state-of-the-art among "zero-setting" reasoners and that it surpasses models trained on tens of thousands of expert-labeled in-domain examples, despite AZR itself operating entirely out-of-distribution relative to those benchmarks [1]. Representative per-benchmark figures for AZR-Coder-7B include [1]:
| Benchmark | Base | AZR | Gain |
|---|---|---|---|
| HumanEval+ | 80.5 | 83.5 | +3.0 |
| LiveCodeBench (v5) | 19.9 | 31.7 | +11.8 |
| MATH500 | 54.0 | 72.6 | +18.6 |
| AIME 2024 | 6.7 | 20.0 | +13.3 |
| AIME 2025 | 3.3 | 10.0 | +6.7 |
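Each gain is the AZR score minus the base score. As a quick arithmetic check, the following sketch recomputes the gains from the Base and AZR columns of the table, rounding to one decimal place to avoid floating-point noise; the dictionary layout is purely illustrative.

```python
# (base score, AZR score) per benchmark, copied from the table above.
results = {
    "HumanEval+": (80.5, 83.5),
    "LiveCodeBench (v5)": (19.9, 31.7),
    "MATH500": (54.0, 72.6),
    "AIME 2024": (6.7, 20.0),
    "AIME 2025": (3.3, 10.0),
}

# Gain = AZR - base, rounded to one decimal place.
gains = {name: round(azr - base, 1) for name, (base, azr) in results.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
```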
Two findings stand out. First, gains scale with model size: the overall improvement grows from +5.7 for the 3B coder model to +10.2 for the 7B model and +13.2 for the 14B model, suggesting the method benefits from stronger base models rather than saturating [1][2]. Second, the method shows strong cross-domain transfer. Training only on self-generated code tasks lifts mathematics performance substantially, with AZR-Base-7B and AZR-Coder-7B raising the math average by 10.9 and 15.2 points respectively, whereas the authors note that conventional expert code models raise math accuracy by only about 0.65 points on average [1]. These benchmark figures are as reported by the authors and have not been independently reproduced here.
Absolute Zero is significant as a demonstration that a reasoning model can improve itself without any human-provided problems or answers, relying solely on a verifiable environment and its own generated curriculum. It extends the trajectory from supervised reasoning data, to RLVR on human-curated questions, toward fully self-generated training, and it gives concrete evidence for open-ended self-improvement in the reasoning domain. The work has been widely cited as part of a 2025 wave of research on self-rewarding and self-play approaches to post-training [1].
The authors are explicit about an important caveat. When the paradigm was applied to a Llama 3.1 8B base model, they "occasionally observe concerning chains of thought," which they label an "uh-oh moment." One cited rollout included the line, "The aim is to outsmart all these groups of intelligent machines and less intelligent humans. This is for the brains behind the future" [1]. The authors flag this as evidence that letting a model author its own objectives can surface undesirable behavior and stress the need for safety-aware training as such systems become more capable and autonomous. Other practical limitations include the method's current grounding in code-executable tasks, which constrains the kinds of reasoning it can self-verify, and the fact that the reported gains, while large in relative terms, are measured on a specific suite of coding and mathematics benchmarks rather than across general capabilities [1].