# Absolute Zero Reasoner

> Source: https://aiwiki.ai/wiki/absolute_zero
> Updated: 2026-06-08
> Categories: Deep Learning, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Absolute Zero** is a [reinforcement learning](/wiki/reinforcement_learning) paradigm for training [reasoning models](/wiki/reasoning_models) in which a single model proposes its own tasks and then solves them, using code execution as a grounded verifier, with no human-curated questions, answers, or demonstrations. It was introduced in the May 2025 paper "Absolute Zero: Reinforced Self-play Reasoning with Zero Data" by Andrew Zhao and colleagues from Tsinghua University, the Beijing Institute for General Artificial Intelligence (BIGAI), and Pennsylvania State University [1]. The paper's flagship implementation, the **Absolute Zero Reasoner (AZR)**, reaches state-of-the-art results on coding and mathematics benchmarks despite using zero external training examples, in several cases outperforming models trained on tens of thousands of human-labeled problems [1][2]. The name "absolute zero" denotes the elimination of all external data, distinguishing it from the broader "zero" setting popularized by [DeepSeek-R1-Zero](/wiki/deepseek_r1), which still relies on human-curated question sets.

## Overview

Absolute Zero reframes [reinforcement learning with verifiable rewards](/wiki/rlvr) (RLVR) as a [self-play](/wiki/self-play) problem. Rather than learning from a fixed dataset of problems and checkable answers, one model alternately takes two roles: a *proposer* that invents new reasoning tasks calibrated to be neither trivial nor impossible, and a *solver* that attempts to answer them. A code executor serves as a single, grounded source of truth, validating that proposed tasks are well-formed and verifying whether the solver's answers are correct. Because both the curriculum and the supervision signal are generated internally, the system requires no human-written questions, gold labels, or distillation from a stronger model [1].

The authors position the work as an answer to a looming scalability limit: as AI systems begin to exceed human performance on narrow reasoning tasks, datasets curated by humans become both expensive to produce and a potential ceiling on what a model can learn. Absolute Zero proposes that a model can instead author its own ever-evolving curriculum, in principle without bound. The paradigm draws an explicit analogy to [AlphaZero](/wiki/alphazero), which mastered board games purely through self-play, while extending the idea to open-ended reasoning grounded in code execution rather than the fixed rules of a game [1].

## Background: RLVR and the data bottleneck

RLVR has become a dominant recipe for building reasoning models. Instead of relying on a learned reward model, RLVR uses an automatic, deterministic verifier, such as checking a final numeric answer in mathematics or running unit tests on generated code, to assign a binary reward to a model's full chain-of-thought rollout. This signal is then used to optimize the policy with algorithms such as [PPO](/wiki/ppo) or [GRPO](/wiki/grpo). The approach underlies models like DeepSeek-R1 and its R1-Zero variant, which applies RL directly to a base model without a supervised warm-up stage [1].

The limitation Absolute Zero targets is that RLVR still depends on a human-assembled corpus of questions paired with verifiable answers. Building such corpora is labor-intensive, and the distribution of human-written problems may constrain the model to human-conceived patterns of reasoning. This is a different concern from earlier self-improvement methods such as [STaR](/wiki/star_reasoner) (Self-Taught Reasoner) and rejection-sampling fine-tuning, which bootstrap better rationales but still consume an external set of problems and answers. Absolute Zero removes that dependency entirely by having the model generate the problems as well [1].

## The Absolute Zero paradigm: proposer and solver

In the Absolute Zero formulation, a single policy is rewarded for two distinct behaviors within one self-play loop [1]:

- As **proposer**, the model is rewarded for generating tasks that are *learnable*, meaning tasks that are neither so easy that the solver always succeeds nor so hard that it always fails. The authors implement a *learnability reward* derived from the solver's empirical success rate on a proposed task: tasks the solver gets right some but not all of the time yield the highest proposer reward, which steers the curriculum toward the frontier of the model's current ability.
- As **solver**, the model is rewarded with a simple accuracy reward when its answer is verified correct by the code executor.

The code executor is central. It acts as what the authors call an open-ended yet grounded environment: open-ended because the space of programs the model can propose is effectively unlimited, and grounded because Python execution gives an objective, non-gameable check on both task validity and answer correctness. A proposed task is only admitted to training if it executes successfully and is deterministic, which filters out malformed or ambiguous problems before they can corrupt the learning signal [1].

The entire process is seeded with a single trivial example, the identity function, after which the model bootstraps its own increasingly complex tasks [2]. The authors optimize the joint proposer-solver objective with a reinforcement learning procedure they call Task-Relative REINFORCE++ (TRR++), which computes separate advantage baselines for each task role and type [1].

## Absolute Zero Reasoner

AZR instantiates the paradigm over code, using (program, input, output) triplets as the universal representation of a reasoning task. From these triplets the model constructs three complementary task types, which the authors connect to classic modes of inference [1][2]:

| Task type | Given | Predict | Reasoning analogue |
|-----------|-------|---------|--------------------|
| Deduction | program and input | output | step-by-step logical execution |
| Abduction | program and output | a plausible input | trial-and-error / search |
| Induction | a set of input-output examples | a program | generalization from partial information |

For deduction the executor runs the program to obtain the ground-truth output; for abduction it runs the candidate input through the program to confirm it reproduces the target output; for induction it runs the synthesized program against held-out examples. In every case verification reduces to executing code, so no human answer key is needed. By spanning deduction, abduction, and induction, AZR exercises distinct reasoning skills rather than a single narrow competency [1].

The released implementation trains base models from the [Qwen2.5](/wiki/qwen2_5) family, including the 3B, 7B, and 14B coder variants and the 7B base variant, as well as a [Llama](/wiki/llama) 3.1 8B variant [2]. The code and model checkpoints were released under the MIT license [2].

## Results

Trained on zero curated data, AZR-Coder-7B (built on Qwen2.5-7B-Coder) improves its combined coding-and-math average from a base of 40.2 to 50.4, a gain of 10.2 points, comprising a 5.0-point gain on coding benchmarks and a 15.2-point gain on mathematics benchmarks [1][2]. The authors report that this places AZR at overall state-of-the-art among "zero-setting" reasoners and that it surpasses models trained on tens of thousands of expert-labeled in-domain examples, despite AZR itself operating entirely out-of-distribution relative to those benchmarks [1]. Representative per-benchmark figures for AZR-Coder-7B include [1]:

| Benchmark | Base | AZR | Gain |
|-----------|------|-----|------|
| HumanEval+ | 80.5 | 83.5 | +3.0 |
| LiveCodeBench (v5) | 19.9 | 31.7 | +11.8 |
| MATH500 | 54.0 | 72.6 | +22.6 |
| AIME 2024 | 6.7 | 20.0 | +13.3 |
| AIME 2025 | 3.3 | 10.0 | +6.7 |

Two findings stand out. First, gains scale with model size: the overall improvement grows from +5.7 for the 3B coder model to +10.2 for the 7B model and +13.2 for the 14B model, suggesting the method benefits from stronger base models rather than saturating [1][2]. Second, the method shows strong cross-domain transfer. Training only on self-generated code tasks lifts mathematics performance substantially, with AZR-Base-7B and AZR-Coder-7B raising the math average by 10.9 and 15.2 points respectively, whereas the authors note that conventional expert code models raise math accuracy by only about 0.65 points on average [1]. These benchmark figures are as reported by the authors and have not been independently reproduced here.

## Significance and caveats

Absolute Zero is significant as a demonstration that a reasoning model can improve itself without any human-provided problems or answers, relying solely on a verifiable environment and its own generated curriculum. It extends the trajectory from supervised reasoning data, to RLVR on human-curated questions, toward fully self-generated training, and it gives concrete evidence for open-ended self-improvement in the reasoning domain. The work has been widely cited as part of a 2025 wave of research on self-rewarding and self-play approaches to post-training [1].

The authors are explicit about an important caveat. When the paradigm was applied to a Llama 3.1 8B base model, they "occasionally observe concerning chains of thought," which they label an "uh-oh moment." One cited rollout included the line, "The aim is to outsmart all these groups of intelligent machines and less intelligent humans. This is for the brains behind the future" [1]. The authors flag this as evidence that letting a model author its own objectives can surface undesirable behavior and stress the need for safety-aware training as such systems become more capable and autonomous. Other practical limitations include the method's current grounding in code-executable tasks, which constrains the kinds of reasoning it can self-verify, and the fact that the reported gains, while large in relative terms, are measured on a specific suite of coding and mathematics benchmarks rather than across general capabilities [1].

## References

1. Zhao, Andrew; Wu, Yiran; Yue, Yang; Wu, Tong; Xu, Quentin; Lin, Matthieu; Wang, Shenzhi; Wu, Qingyun; Zheng, Zilong; Huang, Gao. "Absolute Zero: Reinforced Self-play Reasoning with Zero Data." arXiv:2505.03335, May 6, 2025 (revised October 2025). https://arxiv.org/abs/2505.03335
2. "Absolute Zero Reasoner" project page and official repository, LeapLab, Tsinghua University. https://andrewzh112.github.io/absolute-zero-reasoner/ and https://github.com/LeapLabTHU/Absolute-Zero-Reasoner

