Absolute Zero Reasoner

Deep Learning Machine Learning

8 min read

Updated Jun 8, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 8, 2026

Fact-checked

In review queue

Sources

2 citations

Revision

v1 · 1,543 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Absolute Zero is a reinforcement learning paradigm for training reasoning models in which a single model proposes its own tasks and then solves them, using code execution as a grounded verifier, with no human-curated questions, answers, or demonstrations. It was introduced in the May 2025 paper "Absolute Zero: Reinforced Self-play Reasoning with Zero Data" by Andrew Zhao and colleagues from Tsinghua University, the Beijing Institute for General Artificial Intelligence (BIGAI), and Pennsylvania State University ^[1]. The paper's flagship implementation, the Absolute Zero Reasoner (AZR), reaches state-of-the-art results on coding and mathematics benchmarks despite using zero external training examples, in several cases outperforming models trained on tens of thousands of human-labeled problems ^[1]^[2]. The name "absolute zero" denotes the elimination of all external data, distinguishing it from the broader "zero" setting popularized by DeepSeek-R1-Zero, which still relies on human-curated question sets.

Overview

Absolute Zero reframes reinforcement learning with verifiable rewards (RLVR) as a self-play problem. Rather than learning from a fixed dataset of problems and checkable answers, one model alternately takes two roles: a proposer that invents new reasoning tasks calibrated to be neither trivial nor impossible, and a solver that attempts to answer them. A code executor serves as a single, grounded source of truth, validating that proposed tasks are well-formed and verifying whether the solver's answers are correct. Because both the curriculum and the supervision signal are generated internally, the system requires no human-written questions, gold labels, or distillation from a stronger model ^[1].

The authors position the work as an answer to a looming scalability limit: as AI systems begin to exceed human performance on narrow reasoning tasks, datasets curated by humans become both expensive to produce and a potential ceiling on what a model can learn. Absolute Zero proposes that a model can instead author its own ever-evolving curriculum, in principle without bound. The paradigm draws an explicit analogy to AlphaZero, which mastered board games purely through self-play, while extending the idea to open-ended reasoning grounded in code execution rather than the fixed rules of a game ^[1].

Background: RLVR and the data bottleneck

RLVR has become a dominant recipe for building reasoning models. Instead of relying on a learned reward model, RLVR uses an automatic, deterministic verifier, such as checking a final numeric answer in mathematics or running unit tests on generated code, to assign a binary reward to a model's full chain-of-thought rollout. This signal is then used to optimize the policy with algorithms such as PPO or GRPO. The approach underlies models like DeepSeek-R1 and its R1-Zero variant, which applies RL directly to a base model without a supervised warm-up stage ^[1].

The limitation Absolute Zero targets is that RLVR still depends on a human-assembled corpus of questions paired with verifiable answers. Building such corpora is labor-intensive, and the distribution of human-written problems may constrain the model to human-conceived patterns of reasoning. This is a different concern from earlier self-improvement methods such as STaR (Self-Taught Reasoner) and rejection-sampling fine-tuning, which bootstrap better rationales but still consume an external set of problems and answers. Absolute Zero removes that dependency entirely by having the model generate the problems as well ^[1].

The Absolute Zero paradigm: proposer and solver

In the Absolute Zero formulation, a single policy is rewarded for two distinct behaviors within one self-play loop ^[1]:

As proposer, the model is rewarded for generating tasks that are learnable, meaning tasks that are neither so easy that the solver always succeeds nor so hard that it always fails. The authors implement a learnability reward derived from the solver's empirical success rate on a proposed task: tasks the solver gets right some but not all of the time yield the highest proposer reward, which steers the curriculum toward the frontier of the model's current ability.
As solver, the model is rewarded with a simple accuracy reward when its answer is verified correct by the code executor.

The code executor is central. It acts as what the authors call an open-ended yet grounded environment: open-ended because the space of programs the model can propose is effectively unlimited, and grounded because Python execution gives an objective, non-gameable check on both task validity and answer correctness. A proposed task is only admitted to training if it executes successfully and is deterministic, which filters out malformed or ambiguous problems before they can corrupt the learning signal ^[1].

The entire process is seeded with a single trivial example, the identity function, after which the model bootstraps its own increasingly complex tasks ^[2]. The authors optimize the joint proposer-solver objective with a reinforcement learning procedure they call Task-Relative REINFORCE++ (TRR++), which computes separate advantage baselines for each task role and type ^[1].

Absolute Zero Reasoner

AZR instantiates the paradigm over code, using (program, input, output) triplets as the universal representation of a reasoning task. From these triplets the model constructs three complementary task types, which the authors connect to classic modes of inference ^[1]^[2]:

Task type	Given	Predict	Reasoning analogue
Deduction	program and input	output	step-by-step logical execution
Abduction	program and output	a plausible input	trial-and-error / search
Induction	a set of input-output examples	a program	generalization from partial information

For deduction the executor runs the program to obtain the ground-truth output; for abduction it runs the candidate input through the program to confirm it reproduces the target output; for induction it runs the synthesized program against held-out examples. In every case verification reduces to executing code, so no human answer key is needed. By spanning deduction, abduction, and induction, AZR exercises distinct reasoning skills rather than a single narrow competency ^[1].

The released implementation trains base models from the Qwen2.5 family, including the 3B, 7B, and 14B coder variants and the 7B base variant, as well as a Llama 3.1 8B variant ^[2]. The code and model checkpoints were released under the MIT license ^[2].

Results

Trained on zero curated data, AZR-Coder-7B (built on Qwen2.5-7B-Coder) improves its combined coding-and-math average from a base of 40.2 to 50.4, a gain of 10.2 points, comprising a 5.0-point gain on coding benchmarks and a 15.2-point gain on mathematics benchmarks ^[1]^[2]. The authors report that this places AZR at overall state-of-the-art among "zero-setting" reasoners and that it surpasses models trained on tens of thousands of expert-labeled in-domain examples, despite AZR itself operating entirely out-of-distribution relative to those benchmarks ^[1]. Representative per-benchmark figures for AZR-Coder-7B include ^[1]:

Benchmark	Base	AZR	Gain
HumanEval+	80.5	83.5	+3.0
LiveCodeBench (v5)	19.9	31.7	+11.8
MATH500	54.0	72.6	+22.6
AIME 2024	6.7	20.0	+13.3
AIME 2025	3.3	10.0	+6.7

Two findings stand out. First, gains scale with model size: the overall improvement grows from +5.7 for the 3B coder model to +10.2 for the 7B model and +13.2 for the 14B model, suggesting the method benefits from stronger base models rather than saturating ^[1]^[2]. Second, the method shows strong cross-domain transfer. Training only on self-generated code tasks lifts mathematics performance substantially, with AZR-Base-7B and AZR-Coder-7B raising the math average by 10.9 and 15.2 points respectively, whereas the authors note that conventional expert code models raise math accuracy by only about 0.65 points on average ^[1]. These benchmark figures are as reported by the authors and have not been independently reproduced here.

Significance and caveats

Absolute Zero is significant as a demonstration that a reasoning model can improve itself without any human-provided problems or answers, relying solely on a verifiable environment and its own generated curriculum. It extends the trajectory from supervised reasoning data, to RLVR on human-curated questions, toward fully self-generated training, and it gives concrete evidence for open-ended self-improvement in the reasoning domain. The work has been widely cited as part of a 2025 wave of research on self-rewarding and self-play approaches to post-training ^[1].

The authors are explicit about an important caveat. When the paradigm was applied to a Llama 3.1 8B base model, they "occasionally observe concerning chains of thought," which they label an "uh-oh moment." One cited rollout included the line, "The aim is to outsmart all these groups of intelligent machines and less intelligent humans. This is for the brains behind the future" ^[1]. The authors flag this as evidence that letting a model author its own objectives can surface undesirable behavior and stress the need for safety-aware training as such systems become more capable and autonomous. Other practical limitations include the method's current grounding in code-executable tasks, which constrains the kinds of reasoning it can self-verify, and the fact that the reported gains, while large in relative terms, are measured on a specific suite of coding and mathematics benchmarks rather than across general capabilities ^[1].

References

Zhao, Andrew; Wu, Yiran; Yue, Yang; Wu, Tong; Xu, Quentin; Lin, Matthieu; Wang, Shenzhi; Wu, Qingyun; Zheng, Zilong; Huang, Gao. "Absolute Zero: Reinforced Self-play Reasoning with Zero Data." arXiv:2505.03335, May 6, 2025 (revised October 2025). https://arxiv.org/abs/2505.03335 ↩
"Absolute Zero Reasoner" project page and official repository, LeapLab, Tsinghua University. https://andrewzh112.github.io/absolute-zero-reasoner/ and https://github.com/LeapLabTHU/Absolute-Zero-Reasoner ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

Suggest edit

What links here

Neural Network

Overview

Background: RLVR and the data bottleneck

The Absolute Zero paradigm: proposer and solver

Absolute Zero Reasoner

Results

Significance and caveats

References

Improve this article

Related Articles

Diffusion model

Generalization

Mixture of Experts (MoE)

Modality

Sparsity

Activation Function

What links here

Related Articles

Diffusion model

Generalization

Mixture of Experts (MoE)

Modality

Sparsity

Activation Function