Tülu 3
Last reviewed
May 16, 2026
Sources
17 citations
Review status
Source-backed
Revision
v1 · 4,064 words
Tülu 3 is a fully open post-training recipe and a corresponding family of instruction-tuned language models released by the Allen Institute for AI (Ai2) on November 21, 2024. The 8B and 70B variants were released on that date; the technical report was posted to arXiv as preprint 2411.15124 on November 22, 2024, and a much larger 405B variant followed on January 30, 2025. Tülu 3 is built on top of Meta's Llama 3.1 base models and ships with three things at once: a set of post-trained model checkpoints, the complete training data and code used to produce them, and a four-stage alignment recipe that Ai2 documents end to end. The pipeline introduced Reinforcement Learning with Verifiable Rewards (RLVR) as its named final training stage, in which the reward signal comes from deterministic checkers rather than a learned reward model.
The project is often discussed as the moment when verifiable-reward reinforcement learning entered the open language model toolkit. Two months after Tülu 3 was released, DeepSeek's R1 paper used the same general idea at much larger scale and with a different optimisation algorithm, GRPO, to produce a reasoning model that drew global attention. Ai2's own retrospective on the 405B run, published days after R1, explicitly compared scaling behaviour between the two systems and noted that the larger gains from math-only RLVR at this scale matched a similar observation in the DeepSeek-R1 report. For this reason Tülu 3 is frequently cited as the first widely reproduced open-recipe demonstration of RLVR-style training on a frontier-sized model.
The entire pipeline, including 939,344 training prompts, evaluation harnesses, reward functions, and roughly 50 derived model checkpoints, was released under permissive terms. Code is Apache 2.0 through the Open Instruct repository, while the model weights themselves inherit Meta's Llama 3.1 Community License because Tülu 3 finetunes Llama 3.1 base.
Tülu is Ai2's running line of open instruction-tuned models, named after the tülu, a hybrid camelid. The first Tülu paper appeared in mid-2023 and explored how to recover ChatGPT-style helpfulness on top of open base models using mixed instruction datasets. Tülu 2, released in November 2023, was the first version to ship a 70B variant and the first to use DPO as a preference learning stage on top of supervised fine-tuning. The Tülu 2 DPO 70B model was for several months the strongest open-weight chat model that did not require a separate reward model at training time, and it became a common research baseline for preference optimisation work through 2024.
A mid-2024 update called Tülu 2.5 added PPO-trained models alongside the original DPO models, with publicly released reward models and value functions. It was framed as an investigation into when preference optimisation algorithms diverge in practice, foreshadowing the heavier reinforcement learning component that would appear in Tülu 3. By late 2024 the headline open-weight instruction models (Llama 3.1 Instruct, Qwen 2.5 Instruct, Mistral Instruct, Nemotron) all used closed post-training recipes whose data and reward functions were not public. Ai2 framed Tülu 3 as an attempt to close that gap at the recipe level, not just at the weights level.
The Tülu 3 paper has 22 listed authors, with lead authorship by Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, and Hamish Ivison, and senior authorship by Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi. All authors are affiliated with Ai2, with several also holding joint appointments at the University of Washington.
The Tülu 3 recipe is structured as a four-stage pipeline with an explicit decontamination and evaluation step added at the end. The November 2024 paper describes the four stages as data, SFT, DPO, and RLVR. Ai2 sometimes counts evaluation as a fifth pipeline element when explaining the recipe externally, since it treats decontamination as part of training rather than as a separate audit.
| Stage | Purpose | Data scale |
|---|---|---|
| 1. Prompt curation | Collect and synthesise prompts targeting core skills | 939,344 prompts |
| 2. SFT | Teach response format, basic capabilities, refusal behaviour | Hundreds of thousands of prompt-response pairs |
| 3. DPO | Sharpen preferred answers using on- and off-policy preferences | On-policy preference pairs generated from SFT checkpoints |
| 4. RLVR | Push verifiable skills (math, code, instruction following) above the DPO ceiling | Prompts with deterministic verifiers; PPO with verifier reward |
| 5. Eval and decontamination | Standardise scoring and audit train-eval overlap | OLMES harness; n-gram overlap checks |
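The n-gram overlap check named in the last row can be illustrated with a minimal sketch. The function names, the 8-token window, and the 0.5 threshold below are assumptions chosen for illustration; they are not taken from Ai2's decontamination code.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(train_prompt: str, eval_prompt: str, n: int = 8) -> float:
    """Fraction of the eval prompt's n-grams that also occur in the training prompt."""
    eval_grams = ngrams(eval_prompt, n)
    if not eval_grams:
        return 0.0  # eval prompt shorter than n tokens: nothing to compare
    return len(eval_grams & ngrams(train_prompt, n)) / len(eval_grams)


def is_contaminated(train_prompt: str, eval_prompt: str,
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a training prompt whose overlap with an eval prompt meets the (assumed) threshold."""
    return overlap_ratio(train_prompt, eval_prompt, n) >= threshold
```

In a check of this kind, flagged training prompts would be dropped or reviewed before training so that benchmark scores are not inflated by memorised test items.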
The initial prompt set contains 939,344 prompts split roughly 57 percent public and 43 percent synthetic. The public portion combines existing instruction datasets including FLAN, OpenAssistant, ShareGPT exports, NoRobots, real user chats from WildChat, and competition mathematics from the MATH and GSM8K training splits. The synthetic portion was generated using a persona-conditioned methodology in which frontier models produce prompts from specific persona descriptions. Responses for the SFT data were generated by GPT-4o or Claude 3.5 Sonnet, with the choice depending on the skill being targeted. This use of proprietary model outputs is why individual data subsets carry their own additional terms even though the headline release is openly licensed.
Ai2 organised the data around eleven core skills including reasoning, mathematics, coding, instruction following, knowledge recall, multilinguality, and safety. For each skill the team built a target evaluation, a training data mixture, and a held out development set, an approach intended to avoid the common pitfall of letting one capability quietly degrade while another improves.
SFT is the first model training stage and uses the curated prompt-response pairs to teach the base Llama 3.1 model the conversational format, refusal behaviour, and baseline competence at each core skill. The SFT mixture combines the per skill subsets into a single training set with hyperparameter tuning for the mixture weights. Ai2 reports SFT runs of roughly two epochs at a learning rate around 5e-6 for the 8B, with longer runs and lower rates for the 70B and 405B.
The second training stage applies DPO to a mix of on-policy and off-policy preference data. On-policy pairs are constructed by sampling multiple completions from the SFT checkpoint, then using GPT-4o or Claude 3.5 Sonnet as a preference judge. Off-policy pairs come from existing preference datasets including UltraFeedback and HelpSteer. The paper describes this mix as critical: on-policy pairs provide a gradient direction the model can actually move toward, while off-policy pairs guard against reward hacking on the judge's stylistic preferences.
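The on-policy pair construction can be sketched roughly as follows. Here `judge_score` is a hypothetical stand-in for a call to an LLM judge, and pairing the highest-scored completion against the lowest-scored one is a simplifying assumption, not necessarily the selection rule Ai2 used.

```python
from typing import Callable, Optional


def build_preference_pair(
    prompt: str,
    completions: list[str],
    judge_score: Callable[[str, str], float],
) -> Optional[dict[str, str]]:
    """Turn several sampled completions into one (chosen, rejected) pair.

    `judge_score(prompt, completion)` is a hypothetical judge interface.
    Returns None when there are too few candidates or the judge ties them.
    """
    if len(completions) < 2:
        return None
    # Score each completion once, then sort ascending by score.
    scored = sorted((judge_score(prompt, c), c) for c in completions)
    worst_score, worst = scored[0]
    best_score, best = scored[-1]
    if best_score == worst_score:
        return None  # a tie carries no preference signal
    return {"prompt": prompt, "chosen": best, "rejected": worst}
```

The resulting dictionaries have the prompt, chosen, and rejected fields that DPO training consumes; off-policy pairs from existing datasets would be mixed in separately.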
The final training stage, RLVR, is the recipe's signature contribution. It uses PPO with a deterministic verifier as the reward signal rather than a learned reward model. The verifier is a Python function that consumes a prompt and a model response and returns a numerical reward, typically binary. For mathematics the verifier extracts a final answer and compares it to ground truth via symbolic equality. For code it compiles and runs the response against a test suite. For instruction following it checks structural constraints such as whether the response is in JSON, contains an exact phrase, or has the requested number of bullet points.
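A minimal sketch of what such verifiers might look like is given below. All function names are hypothetical, and the math check compares the last number in the response numerically rather than performing the symbolic equality test described above; it is meant only to show the shape of a binary, deterministic reward function, not the code in Open Instruct.

```python
import json
import re
from fractions import Fraction
from typing import Optional

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the last number in the response, or None if there is none."""
    # Drop thousands separators such as the comma in "1,250".
    cleaned = re.sub(r"(?<=\d),(?=\d{3})", "", response)
    matches = _NUMBER.findall(cleaned)
    return matches[-1] if matches else None


def math_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer equals the numeric ground truth."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if Fraction(answer) == Fraction(ground_truth) else 0.0


def json_reward(response: str) -> float:
    """Binary reward: 1.0 if the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def bullet_count_reward(response: str, expected: int) -> float:
    """Binary reward: 1.0 if the response has exactly `expected` markdown bullets."""
    bullets = [line for line in response.splitlines()
               if line.lstrip().startswith(("- ", "* "))]
    return 1.0 if len(bullets) == expected else 0.0
```

For example, `math_reward("18 + 54 = 72, so the answer is 72.", "72")` returns `1.0`, while a response with no number in it returns `0.0`. Because each function is pure, the same prompt and response always yield the same reward.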
The RLVR phase uses only prompts whose verifiers exist. The published RLVR mixture allenai/RLVR-GSM-MATH-IF-Mixed-Constraints contains roughly 30,000 prompts spanning GSM8K, MATH, and instruction following with structural constraints. Ai2 found that this relatively small prompt budget was sufficient to lift performance noticeably above the DPO checkpoint while avoiding overfitting on the verifier function.
RLVR was named and formalised in the Tülu 3 paper. The basic idea is that for any skill where the correctness of a response can be checked by a program, the noisy and expensive reward model used in conventional RLHF can be replaced by a deterministic checker. The policy is trained with PPO to maximise the verifier reward subject to a KL penalty against the SFT or DPO reference, the same objective form used in standard RLHF. The only difference is that the reward function is a callable verifier rather than a transformer based reward model.
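Written out, the objective described above takes the familiar KL-regularised form. The notation here is an assumption introduced for illustration: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the SFT or DPO reference, $\beta$ the KL coefficient, $v(x, y)$ the verifier reward, and $\alpha$ the constant awarded to a verified response.

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D}} \Bigl[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \bigl[ v(x, y) \bigr]
  - \beta \, \mathbb{D}_{\mathrm{KL}} \bigl[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr]
\Bigr],
\qquad
v(x, y) =
\begin{cases}
  \alpha & \text{if the verifier accepts } y \text{ for prompt } x, \\
  0 & \text{otherwise.}
\end{cases}
```

Replacing $v$ with a learned reward model $r_\phi(x, y)$ recovers the standard RLHF objective, which is the sense in which only the reward function differs.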
The verifier reward formulation has three immediate consequences. First, the reward function is auditable; anyone can inspect the Python code that produces the training signal. Second, the verifier never goes out of distribution; it returns the same answer for the same input. Third, there is no reward model training cost, which removes one of the more delicate components of the RLHF pipeline. The trade-off is that RLVR can only be applied where verifiers exist, which currently means mathematics, code, formal logic, and constrained generation. Conversation quality and helpfulness still rely on the SFT and DPO stages.
The Tülu 3 paper reports that RLVR delivered improvements of 1.7 points on MATH, 3.3 points on GSM8K, and 1.3 points on IFEval over the DPO checkpoint, with smaller spillover gains on tasks not directly trained. The gains came without obvious degradation elsewhere, which had been a recurring problem in earlier PPO based recipes. Ai2 attributes this stability to the determinism of the reward and a relatively short training horizon of 100,000 episodes at the 8B and 70B scales.
On the 405B model the training horizon was extended to 300,000 episodes and the mixture was simplified to mathematics only. Ai2 reported that at this scale the math-only mix worked better than the diverse mix that helped the smaller models, and that math gains were larger in absolute terms than at 8B or 70B. Ai2 noted that this pattern, in which RLVR pays off more at larger model sizes, parallels an observation in the DeepSeek-R1 report.
Ai2 released three primary model sizes, each accompanied by a separately released SFT and DPO checkpoint so that researchers can study the contribution of each pipeline stage in isolation.
| Variant | Parameters | Release date | Base model | Final stage | HF repository |
|---|---|---|---|---|---|
| Llama 3.1 Tülu 3 8B | 8 billion | November 21, 2024 | meta-llama/Llama-3.1-8B | RLVR (PPO) | allenai/Llama-3.1-Tulu-3-8B |
| Llama 3.1 Tülu 3 70B | 70 billion | November 21, 2024 | meta-llama/Llama-3.1-70B | RLVR (PPO) | allenai/Llama-3.1-Tulu-3-70B |
| Llama 3.1 Tülu 3 405B | 405 billion | January 30, 2025 | meta-llama/Llama-3.1-405B | RLVR (PPO) | allenai/Llama-3.1-Tulu-3-405B |
Each final model has matching SFT and DPO checkpoints published separately, for example allenai/Llama-3.1-Tulu-3-8B-SFT and allenai/Llama-3.1-Tulu-3-8B-DPO. There are also reward model and value model artefacts released to support reproducibility of the RLVR stage. Together with smaller derivative releases used in OLMo 2 post training, the Tülu 3 collection on Hugging Face exceeds 50 separate checkpoints.
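The naming pattern makes it straightforward to pull a specific stage for comparison. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the `-SFT` and `-DPO` suffixes shown for the 8B model apply to the other sizes as well. Loading relies on the Hugging Face `transformers` library, network access, and enough memory for the chosen size.

```python
STAGE_SUFFIX = {"sft": "-SFT", "dpo": "-DPO", "final": ""}


def tulu3_repo_id(size: str, stage: str = "final") -> str:
    """Build a Hugging Face repo ID following the naming pattern described above."""
    if size not in {"8B", "70B", "405B"}:
        raise ValueError(f"unknown size: {size}")
    if stage not in STAGE_SUFFIX:
        raise ValueError(f"unknown stage: {stage}")
    return f"allenai/Llama-3.1-Tulu-3-{size}{STAGE_SUFFIX[stage]}"


def load_checkpoint(size: str, stage: str = "final"):
    """Download and load one checkpoint together with its tokenizer."""
    # Imported lazily so the naming helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = tulu3_repo_id(size, stage)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model
```

Calling `load_checkpoint("8B", "sft")` and `load_checkpoint("8B", "dpo")` in turn would allow the same prompts to be run through successive stages of the pipeline.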
| Model | Learning rate | KL beta | Effective batch size | Total episodes | Max response length |
|---|---|---|---|---|---|
| Tülu 3 8B | 3e-7 | 0.05 | 224 | 100,000 | 2,048 |
| Tülu 3 70B | 1e-7 | 0.07 | 640 | 100,000 | 2,048 |
| Tülu 3 405B | 1e-7 | 0.05 | 1,856 | 300,000 | 2,048 |
The 405B run used 32 nodes (256 H100 GPUs in total): the policy was served for inference through vLLM with 16-way tensor parallelism, and the remaining 240 GPUs handled training. Each RLVR iteration took roughly 550 seconds of inference, 25 seconds of weight transfer, and 1,500 seconds of training, for a wall-clock cycle of about 35 minutes per update. Ai2 noted that 405B training was constrained more by inference throughput than by the optimiser.
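The cycle time follows directly from the three reported phase durations. The short calculation below assumes the phases run back to back with no overlap, which is an assumption of this sketch; the variable and function names are illustrative.

```python
# Per-iteration timings reported for the 405B RLVR run, in seconds.
PHASE_SECONDS = {"inference": 550, "weight_transfer": 25, "training": 1_500}


def cycle_minutes(phase_seconds: dict[str, int]) -> float:
    """Wall-clock minutes per update, assuming the phases run sequentially."""
    return sum(phase_seconds.values()) / 60


# 550 + 25 + 1500 = 2075 seconds, i.e. about 34.6 minutes.
print(f"{cycle_minutes(PHASE_SECONDS):.1f} minutes per update")
```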
Ai2's evaluations were run through OLMES, the in-house Open Language Model Evaluation System used by the OLMo team. The headline tables in the Tülu 3 paper compare each variant against the Llama 3.1 Instruct model of the same size, and against the strongest peer at that scale. The 8B comparison focuses on Llama 3.1 8B Instruct and Qwen 2.5 7B Instruct, the 70B comparison on Llama 3.1 70B Instruct, and the 405B comparison on Llama 3.1 405B Instruct, DeepSeek V3, and GPT-4o.
| Benchmark | Tülu 3 8B | Llama 3.1 8B Instruct | Qwen 2.5 7B Instruct |
|---|---|---|---|
| Average | 64.8 | 62.2 | 57.8 |
| MMLU (0-shot) | 68.2 | 71.2 | 76.6 |
| GSM8K (8-shot) | 87.6 | 83.4 | 83.8 |
| MATH (4-shot) | 43.7 | 42.5 | 14.8 |
| IFEval | 82.4 | 80.6 | 74.7 |
| HumanEval (pass@10) | 83.9 | 86.3 | 93.1 |
| AlpacaEval 2 | 34.5 | 24.2 | 29.0 |
The 8B is strongest on mathematics, instruction following, and AlpacaEval 2 chat preference. It trails Qwen 2.5 on MMLU and HumanEval, a gap Ai2 attributes to differences in the base models rather than the post-training recipe, since Qwen 2.5 base is itself stronger on these tasks than Llama 3.1 8B base.
| Benchmark | Tülu 3 70B | Llama 3.1 70B Instruct |
|---|---|---|
| Average | 76.0 | 73.4 |
| MMLU (0-shot, CoT) | 83.1 | 85.3 |
| GSM8K (8-shot, CoT) | 93.5 | 93.7 |
| MATH (4-shot, Flex) | 63.0 | 56.4 |
| IFEval (prompt loose) | 83.2 | 88.0 |
| HumanEval (pass@10) | 92.4 | 93.6 |
| Safety (6-task average) | 88.3 | 76.5 |
The 70B model recovers most of the gap to Llama 3.1 70B Instruct on general capabilities while overtaking it on mathematics and on Ai2's six task safety suite. Ai2 frames the safety improvement as a direct consequence of the explicit safety prompts included in both the SFT and DPO mixtures.
| Benchmark | Tülu 3 405B | Llama 3.1 405B Instruct | DeepSeek V3 | GPT-4o |
|---|---|---|---|---|
| Average (without safety) | 80.0 | 78.1 | 79.0 | 80.5 |
| Average (with safety) | 80.7 | 79.0 | 75.9 | 81.6 |
| MMLU (5-shot) | 87.0 | 88.0 | 82.1 | 87.9 |
| GSM8K (8-shot) | 95.5 | 95.4 | 94.1 | 91.7 |
| MATH (4-shot) | 67.3 | 66.6 | 72.5 | 68.8 |
| HumanEval (pass@10) | 95.9 | 95.9 | 94.6 | 97.0 |
| IFEval (prompt loose) | 86.0 | 88.4 | 88.0 | 84.8 |
| AlpacaEval 2 | 51.4 | 38.5 | 53.5 | 65.0 |
| Safety (6-task average) | 86.7 | 86.8 | 72.2 | 90.9 |
At the 405B scale Tülu 3 leads Llama 3.1 Instruct and DeepSeek V3 on the average score with safety included, and lands within striking distance of GPT-4o. The DeepSeek V3 numbers used in this comparison come from Ai2's own evaluations through OLMES and may differ from the figures DeepSeek publishes in its own reports.
The Tülu 3 release uses the term "fully open" in the same restrictive sense that Ai2 applies to the OLMo 2 family. A fully open post-training release in this vocabulary includes the model weights, the entire training data, the per stage code, the reward functions, the evaluation harness, and the recipe documentation including hyperparameter choices and dataset mixture weights. The intent is that a third party with sufficient compute could re-run the pipeline against a different base model or different data, and obtain a numerically comparable result.
The artefacts published with Tülu 3 cover this scope. The 939,344 prompt collection is on Hugging Face under allenai/tulu-3-sft-mixture. The reward functions used during RLVR are in the Open Instruct repository under open_instruct/ground_truth_utils.py. The OLMES evaluation harness is at allenai/olmes. The intermediate SFT and DPO checkpoints are released as separate Hugging Face models. Even the persona descriptions used to generate the synthetic prompts are published, alongside the prompt templates used to call GPT-4o and Claude 3.5 Sonnet.
This level of openness has real costs. The collection occupies several terabytes of storage, and the licensing surface is more complex than for a single weights only release. Subsets of the prompt data carry additional terms inherited from the upstream sources (FLAN, OpenAssistant, WildChat, and so on), and the responses generated by GPT-4o and Claude 3.5 Sonnet are bound by the relevant terms of service for synthetic data use. Ai2 documents these per-subset terms in the dataset cards.
Tülu 3 model weights are released under Meta's Llama 3.1 Community License, inherited from the Llama 3.1 base. This is permissive for most users but imposes an acceptable use policy and a 700 million monthly active user threshold above which a separate commercial agreement is required. Use of the models also requires acceptance of the Gemma Terms of Use and the Qwen License Agreement, because some synthetic training data was generated using those models.
The training and evaluation code in Open Instruct and OLMES is released under Apache 2.0. The datasets carry per-subset licensing because they aggregate material from multiple upstream sources. Ai2's general stance is that the recipe and code should be as openly licensed as possible to maximise reproducibility, while the weights inherit whatever constraint the base model imposes.
Tülu 3 sits in a 2024 to 2025 open weight instruction-tuning landscape where the main reference points are Llama 3.1 Instruct, Qwen 2.5 Instruct, NVIDIA's Nemotron 70B, Mistral Instruct, and Nous Hermes 3.
| Model | Parameters | Released | Post-training code public | RL stage |
|---|---|---|---|---|
| Tülu 3 8B and 70B | 8B, 70B | November 21, 2024 | Yes (Open Instruct) | RLVR (PPO) |
| Tülu 3 405B | 405B | January 30, 2025 | Yes (Open Instruct) | RLVR (PPO) |
| Llama 3.1 Instruct | 8B, 70B, 405B | July 2024 | No | DPO + Rejection sampling |
| Qwen 2.5 Instruct | 0.5B to 72B | September 2024 | No | DPO + GRPO |
| Nemotron 70B Instruct | 70B | October 2024 | Partial | RLHF |
| Nous Hermes 3 405B | 405B | August 2024 | No | DPO |
At the 8B scale Tülu 3 is the strongest open recipe model on mathematics and instruction following, while Qwen 2.5 Instruct is stronger on general knowledge and coding. At the 70B scale Tülu 3 closes most of the gap to Llama 3.1 70B Instruct and leads on math and safety. At the 405B scale Tülu 3 leads Llama 3.1 405B Instruct on Ai2's aggregate including safety, and is competitive with DeepSeek V3, though direct comparisons depend on the evaluation harness used.
The Tülu 3 paper introduced RLVR as a named pipeline stage on November 22, 2024. The DeepSeek R1 paper followed in January 2025 and used a similar idea, training with verifiable rewards on mathematics and coding to produce a long reasoning model with strong scores on competition benchmarks. DeepSeek used GRPO, a variant of PPO that drops the value function critic and instead computes a baseline from a group of sampled completions, rather than the standard PPO used in Tülu 3. This makes the DeepSeek approach more memory efficient at large scale, since the value model has been removed.
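The group baseline that replaces the critic can be sketched in a few lines. This is a minimal illustration of the group-relative advantage only, not of the full GRPO update; the function name is hypothetical, and the use of the population standard deviation with a small epsilon is an assumption, since implementations differ on this detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise each reward against the mean and spread of its own sampling group.

    `rewards` holds the verifier rewards for several completions of ONE prompt.
    No learned value function is involved: the group mean serves as the baseline.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps  # eps avoids division by zero when all rewards match
    return [(r - baseline) / scale for r in rewards]
```

With binary verifier rewards, correct completions in a mixed group receive positive advantages and incorrect ones negative advantages, while a group that is all correct or all incorrect contributes zero advantage.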
The two projects share the verifiable reward idea but emphasise different things. Tülu 3 frames RLVR as one stage in a broader general purpose post-training pipeline whose primary goal is matching Llama 3.1 Instruct across the board. DeepSeek R1 frames RLVR as the entire training story for a specialised reasoning model. The DeepSeek R1 paper does not cite Tülu 3 by name, but the Ai2 team observed that the math-only RLVR scaling behaviour they had documented for the 405B run matched the same observation in the R1 report.
In the months that followed, the broader open language model ecosystem converged on the RLVR-plus-GRPO combination as the default reinforcement learning stage for reasoning oriented training. Ai2's own OLMo 2 32B Instruct model switched from PPO based RLVR to GRPO for this reason. Ai2 and DeepSeek are usually credited together for popularising verifiable-reward RL, with Tülu 3 cited for naming the technique and shipping a reproducible recipe, and DeepSeek R1 for demonstrating how far the technique can be pushed when scaled.
The November 2024 release was covered by VentureBeat, GeekWire, and MarkTechPost, which led with the claim that Ai2 had matched or beaten the closed Llama 3.1 Instruct line using an entirely open recipe. Coverage emphasised three points: the named introduction of RLVR, the full release of training data and reward functions, and OLMES as a public evaluation harness that other groups could rerun. The release was discussed extensively on the Interconnects newsletter run by Nathan Lambert, one of the lead authors.
The 405B release on January 30, 2025 attracted more attention because it landed days after DeepSeek R1 had become the dominant story in open AI. GeekWire ran the headline that Ai2 was "challenging DeepSeek on key benchmarks." VentureBeat described Tülu 3 405B as the first fully open post-training recipe to beat DeepSeek V3 on aggregate benchmarks including safety. The Ai2 blog post discussed inference throughput as the binding constraint on RLVR iteration time at the 405B scale.
Ai2 used Tülu 3 as the canonical post training recipe for subsequent OLMo releases. The OLMo 2 7B and 13B Instruct models used Tülu 3 directly, while the OLMo 2 32B Instruct used a revised Tülu 3.1 mixture with GRPO replacing PPO. The recipe was also picked up by external groups for finetuning runs on Qwen, Mistral, and Phi base models.
Criticisms were narrower than the praise. Some commentators noted that Tülu 3 still trailed Llama 3.1 Instruct on MMLU and HumanEval at the 8B and 70B scales, suggesting that an open recipe did not always reach capability parity with Meta's Instruct models, whose own post-training remains undocumented. Others observed that the openness of the recipe was bounded by the upstream Llama license, so the model could not be used commercially without accepting Meta's acceptable use policy. Ai2 addressed both critiques in subsequent releases, by applying Tülu 3 to the fully open OLMo 2 base models and by continuing to strengthen the recipe's competitive position with the Tülu 3.1 update.