Tülu 3

Updated Jun 23, 2026

Tülu 3 is a fully open post-training recipe and a corresponding family of instruction-tuned language models released by the Allen Institute for AI (Ai2) on November 21, 2024.^[2] Ai2 describes it as "a family of open state-of-the-art post-trained models, alongside all of the data, data mixes, recipes, code, infrastructure, and evaluation framework."^[2] Built on top of Meta's Llama 3.1 base models, Tülu 3 ships three things at once: a set of post-trained model checkpoints, the complete training data and code used to produce them, and a four-stage alignment recipe that Ai2 documents end to end.^[1] The 8B and 70B variants shipped on November 21, 2024, and the technical report was posted to arXiv as preprint 2411.15124 on November 22, 2024; a much larger 405B variant followed on January 30, 2025.^[1]^[3] The pipeline introduced Reinforcement Learning with Verifiable Rewards (RLVR) as a named stage, the third of its three training stages, in which the reward signal comes from deterministic checkers rather than a learned reward model.^[1]

The project is often discussed as the moment when verifiable-reward reinforcement learning entered the open language model toolkit. Two months after Tülu 3 was released, DeepSeek's R1 paper used the same general idea at much larger scale and with a different optimisation algorithm, GRPO, to produce a reasoning model that drew global attention.^[16] Ai2's own retrospective on the 405B run, published on January 30, 2025, about a week after the R1 paper, explicitly compared the two systems and noted that its finding that a math-only RLVR mixture worked best at scale matched a similar observation in the DeepSeek-R1 report.^[3] For this reason Tülu 3 is frequently cited as the first widely reproduced open-recipe demonstration of RLVR-style training on a frontier-sized model.

The entire pipeline, including 939,344 training prompts, evaluation harnesses, reward functions, and roughly 50 derived model checkpoints, was released under permissive terms.^[1]^[2] Code is Apache 2.0 through the Open Instruct repository, while the model weights themselves inherit Meta's Llama 3.1 Community License because Tülu 3 finetunes Llama 3.1 base.^[8]^[4]

When was Tülu 3 released?

Tülu is Ai2's running line of open instruction-tuned models, named after the tülu, a hybrid camelid. The first Tülu paper appeared in mid 2023 and explored how to recover ChatGPT style helpfulness on top of open base models using mixed instruction datasets. Tülu 2, released in November 2023, was the first version to ship a 70B variant and the first to use DPO as a preference learning stage on top of supervised fine tuning. The Tülu 2 DPO 70B model was for several months the strongest open weight chat model that did not require a separate reward model at training time, and it became a common research baseline for preference optimisation work through 2024.^[17]

A mid 2024 update called Tülu 2.5 added PPO trained models alongside the original DPO models, with publicly released reward models and value functions. It was framed as an investigation into when preference optimisation algorithms diverge in practice, foreshadowing the heavier reinforcement learning component that would appear in Tülu 3. By late 2024 the headline open weight instruction models (Llama 3.1 Instruct, Qwen 2.5 Instruct, Mistral Instruct, Nemotron) all used closed post-training recipes whose data and reward functions were not public. Ai2 framed Tülu 3 as an attempt to close that gap at the recipe level, not just at the weights level.^[2]

Tülu 3 itself arrived in two waves: the 8B and 70B models plus the arXiv technical report in November 2024, and the 405B model on January 30, 2025.^[1]^[3] The Tülu 3 paper has 22 listed authors, with lead authorship by Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, and Hamish Ivison, and senior authorship by Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi. All authors are affiliated with Ai2, with several also holding joint appointments at the University of Washington.^[1]

Tülu 3 pipeline

The Tülu 3 recipe is structured as a four-stage pipeline with an explicit decontamination and evaluation step bolted onto the end. The four stages were described in the November 2024 paper as data, SFT, DPO, and RLVR.^[1] Ai2 sometimes counts the evaluation stage as a fifth pipeline element when explaining the recipe externally, since it treats decontamination as part of training rather than a separate audit.^[2]

Stage	Purpose	Data scale
1. Prompt curation	Collect and synthesise prompts targeting core skills	939,344 prompts
2. SFT	Teach response format, basic capabilities, refusal behaviour	Hundreds of thousands of prompt-response pairs
3. DPO	Sharpen preferred answers using on- and off-policy preferences	On-policy preference pairs generated from SFT checkpoints
4. RLVR	Push verifiable skills (math, code, instruction following) above the DPO ceiling	Prompts with deterministic verifiers; PPO with verifier reward
5. Eval and decontamination	Standardise scoring and audit train-eval overlap	OLMES harness; n-gram overlap checks
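
The n-gram overlap check named in the last row can be illustrated with a minimal sketch. This is not Ai2's decontamination code: the function names, the 8-token window, the whitespace tokenisation, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(train_prompt: str, eval_prompt: str, n: int = 8) -> float:
    """Fraction of the eval prompt's n-grams that also occur in the train prompt."""
    eval_grams = ngrams(eval_prompt, n)
    if not eval_grams:
        return 0.0  # eval prompt shorter than n tokens: nothing to compare
    return len(eval_grams & ngrams(train_prompt, n)) / len(eval_grams)


def flag_contaminated(train_prompts, eval_prompts, n=8, threshold=0.5):
    """Return indices of training prompts that overlap any eval prompt.

    n and threshold are hypothetical defaults, not values from the paper.
    """
    flagged = []
    for idx, train in enumerate(train_prompts):
        if any(overlap_fraction(train, ev, n) >= threshold for ev in eval_prompts):
            flagged.append(idx)
    return flagged
```

Flagged indices would then be dropped from the training mixture before SFT. A production check would normally use a proper tokenizer and an index rather than this quadratic scan.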

Prompt curation

The initial prompt set contains 939,344 prompts split roughly 57 percent public and 43 percent synthetic.^[1] The public portion combines existing instruction datasets including FLAN, OpenAssistant, ShareGPT exports, NoRobots, the WildChat real user chats, and competition mathematics from the MATH and GSM8K training splits.^[1] The synthetic portion was generated using a persona-conditioned methodology in which prompts are produced by frontier models given specific persona descriptions.^[1] Responses were generated by GPT-4o or Claude 3.5 Sonnet for SFT data, with the choice depending on the skill being targeted.^[1] This use of proprietary model outputs is why individual data subsets carry their own additional terms even though the headline release is openly licensed.

Ai2 organised the data around eleven core skills including reasoning, mathematics, coding, instruction following, knowledge recall, multilinguality, and safety.^[1] For each skill the team built a target evaluation, a training data mixture, and a held out development set, an approach intended to avoid the common pitfall of letting one capability quietly degrade while another improves.^[1]

Supervised fine tuning

SFT is the first model training stage and uses the curated prompt-response pairs to teach the base Llama 3.1 model the conversational format, refusal behaviour, and baseline competence at each core skill.^[1] The SFT mixture combines the per skill subsets into a single training set with hyperparameter tuning for the mixture weights.^[1] Ai2 reports SFT runs of roughly two epochs at a learning rate around 5e-6 for the 8B, with longer runs and lower rates for the 70B and 405B.^[1]

Direct preference optimisation

The second training stage applies DPO to a mix of on-policy and off-policy preference data.^[1] On-policy pairs are constructed by sampling multiple completions from the SFT checkpoint, then using GPT-4o or Claude 3.5 Sonnet as a preference judge.^[1] Off-policy pairs come from existing preference datasets including UltraFeedback and HelpSteer.^[1] The paper describes this mix as critical, with on-policy pairs providing the gradient direction the model can actually move toward while off-policy pairs prevent reward hacking on the judge's stylistic preferences.^[1]
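
For readers unfamiliar with the objective, the following is a minimal sketch of the standard DPO loss for a single preference pair. It illustrates the general technique and is not necessarily the exact variant or the hyperparameters used for Tülu 3; the function name and the `beta=0.1` default are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # Loss is -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and reference agree, the loss equals log 2; it falls as the policy shifts probability toward the chosen response relative to the reference.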

Reinforcement learning with verifiable rewards

The final training stage, RLVR, is the recipe's signature contribution.^[1] It uses PPO with a deterministic verifier as the reward signal rather than a learned reward model.^[1] The verifier is a Python function that consumes a prompt and a model response and returns a numerical reward, typically binary.^[1] For mathematics the verifier extracts a final answer and compares it to ground truth via symbolic equality.^[1] For code it compiles and runs the response against a test suite.^[1] For instruction following it checks structural constraints such as whether the response is in JSON, contains an exact phrase, or has the requested number of bullet points.^[1]
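
The verifier types described above can be sketched as plain Python functions. These are illustrative stand-ins, not the functions in Open Instruct: the names are hypothetical, the math check uses simple numeric comparison where the text describes symbolic equality, and the code-execution verifier is omitted because it needs a sandbox.

```python
import json
import re


def extract_final_answer(response: str):
    """Return the last number in the response as a string, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return matches[-1] if matches else None


def verify_math(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final number equals the ground truth.

    Assumes ground_truth is a numeric string.
    """
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(ground_truth) else 0.0


def verify_json(response: str) -> float:
    """Binary reward: 1.0 if the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def verify_bullet_count(response: str, expected: int) -> float:
    """Binary reward: 1.0 if the response has exactly `expected` bullet lines."""
    bullets = [line for line in response.splitlines()
               if line.lstrip().startswith(("-", "*"))]
    return 1.0 if len(bullets) == expected else 0.0
```

Because each function is deterministic, the same response always earns the same reward, which is the property the recipe relies on.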

The RLVR phase uses only prompts whose verifiers exist.^[1] The published RLVR mixture allenai/RLVR-GSM-MATH-IF-Mixed-Constraints contains roughly 30,000 prompts spanning GSM8K, MATH, and instruction following with structural constraints.^[7] Ai2 found that this relatively small prompt budget was sufficient to lift performance noticeably above the DPO checkpoint while avoiding overfitting on the verifier function.^[1]

How does RLVR work?

RLVR (Reinforcement Learning with Verifiable Rewards) was named and formalised in the Tülu 3 paper.^[1] The basic idea is that for any skill where the correctness of a response can be checked by a program, the noisy and expensive reward model used in conventional RLHF can be replaced by a deterministic checker.^[1] This contrasts with RLAIF, where the reward signal still comes from a model (an AI judge) rather than a ground-truth verifier. The policy is trained with PPO to maximise the verifier reward subject to a KL penalty against the SFT or DPO reference, the same objective form used in standard RLHF.^[1] The only difference is that the reward function is a callable verifier rather than a transformer based reward model.^[1]
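
The KL-penalised objective can be made concrete with a small sketch of the shaped reward for one sampled response: the verifier reward minus beta times a log-probability-ratio estimate of the KL term. The function name is hypothetical, and collapsing the penalty to the sequence level is a simplification; PPO implementations commonly apply it per token.

```python
def kl_shaped_reward(verifier_reward, policy_logprobs, ref_logprobs, beta=0.05):
    """Verifier reward minus a KL penalty against the reference model.

    policy_logprobs and ref_logprobs are per-token log-probabilities of the
    same sampled response under the policy and the reference model.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample estimate of KL(policy || reference) for this response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return verifier_reward - beta * kl_estimate
```

With identical policy and reference log-probabilities the penalty vanishes and the shaped reward equals the verifier reward.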

The verifier reward formulation has three immediate consequences. First, the reward function is auditable; anyone can inspect the Python code that produces the training signal.^[8] Second, the verifier never goes out of distribution; it returns the same answer for the same input. Third, there is no reward model training cost, which removes one of the more delicate components of the RLHF pipeline. The trade off is that RLVR can only be applied where verifiers exist, which currently means mathematics, code, formal logic, and constrained generation.^[1] Conversation quality and helpfulness still rely on the SFT and DPO stages.

The Tülu 3 paper reports that RLVR delivered improvements of 1.7 points on MATH, 3.3 points on GSM8K, and 1.3 points on IFEval over the DPO checkpoint, with smaller spillover gains on tasks not directly trained.^[1] The gains came without obvious degradation elsewhere, which had been a recurring problem in earlier PPO based recipes. Ai2 attributes this stability to the determinism of the reward and a relatively short training horizon of 100,000 episodes at the 8B and 70B scales.^[1]

On the 405B model the training horizon was extended to 300,000 episodes and the mixture was simplified to mathematics only.^[3] Ai2 reported that at this scale the math-only mix worked better than the diverse mix that helped the smaller models, and that math gains were larger in absolute terms than at 8B or 70B.^[3] Ai2 noted that this pattern, where RLVR pays off more at larger model sizes, parallels an observation in the DeepSeek-R1 report.^[3]^[16]

What model sizes does Tülu 3 come in?

Ai2 released three primary model sizes, each accompanied by a separately released SFT and DPO checkpoint so that researchers can study the contribution of each pipeline stage in isolation.^[4]^[5]^[6]

Variant	Parameters	Release date	Base model	Final stage	HF repository
Llama 3.1 Tülu 3 8B	8 billion	November 21, 2024	meta-llama/Llama-3.1-8B	RLVR (PPO)	`allenai/Llama-3.1-Tulu-3-8B`
Llama 3.1 Tülu 3 70B	70 billion	November 21, 2024	meta-llama/Llama-3.1-70B	RLVR (PPO)	`allenai/Llama-3.1-Tulu-3-70B`
Llama 3.1 Tülu 3 405B	405 billion	January 30, 2025	meta-llama/Llama-3.1-405B	RLVR (PPO)	`allenai/Llama-3.1-Tulu-3-405B`

Each final model has matching SFT and DPO checkpoints published separately, for example allenai/Llama-3.1-Tulu-3-8B-SFT and allenai/Llama-3.1-Tulu-3-8B-DPO.^[4] There are also reward model and value model artefacts released to support reproducibility of the RLVR stage.^[4] Together with smaller derivative releases used in OLMo 2 post training, the Tülu 3 collection on Hugging Face exceeds 50 separate checkpoints.
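
The naming pattern in the examples above can be captured in a small helper. This is a convenience sketch, not an official API: the function name is hypothetical, and it assumes the 70B and 405B stage checkpoints follow the same `-SFT` and `-DPO` suffix pattern as the 8B examples quoted in the text.

```python
def tulu3_repo_id(size: str, stage: str = "final") -> str:
    """Build a Hugging Face repository ID for a Tülu 3 checkpoint.

    Assumes every size uses the suffix pattern shown for the 8B model.
    """
    suffixes = {"final": "", "sft": "-SFT", "dpo": "-DPO"}
    if size not in {"8B", "70B", "405B"}:
        raise ValueError(f"unknown size: {size}")
    if stage not in suffixes:
        raise ValueError(f"unknown stage: {stage}")
    return f"allenai/Llama-3.1-Tulu-3-{size}{suffixes[stage]}"
```

Comparing the three stages of one size on the same benchmark is then a matter of iterating over `("sft", "dpo", "final")`.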

Per model RLVR hyperparameters

Model	Learning rate	KL beta	Effective batch size	Total episodes	Max response length
Tülu 3 8B	3e-7	0.05	224	100,000	2,048
Tülu 3 70B	1e-7	0.07	640	100,000	2,048
Tülu 3 405B	1e-7	0.05	1,856	300,000	2,048

The 405B run used 32 nodes (256 H100 GPUs) in total: the policy was served through vLLM with 16-way tensor parallelism for inference, and the remaining 240 GPUs handled training.^[3] Each RLVR iteration took roughly 550 seconds of inference, 25 seconds of weight transfer, and 1,500 seconds of training, for a wall clock cycle of about 35 minutes per update.^[3] Ai2 noted that 405B training was constrained more by inference throughput than by the optimiser.^[3] Ai2 described the milestone as "the first application of fully open post-training recipes to the largest open-weight models."^[3]
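
The per-iteration figure follows from adding the three reported phase times. The short check below uses only the numbers quoted above; the variable and function names are illustrative.

```python
# Reported per-iteration phase times for the 405B RLVR run, in seconds.
PHASE_SECONDS = {"inference": 550, "weight_transfer": 25, "training": 1500}


def cycle_time(phases):
    """Return (total seconds, total minutes) for one RLVR iteration."""
    total = sum(phases.values())
    return total, total / 60


total_s, total_min = cycle_time(PHASE_SECONDS)
print(f"{total_s} s per iteration, about {total_min:.1f} minutes")
```

The sum is 2,075 seconds, or roughly 34.6 minutes, which rounds to the 35 minutes cited.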

Benchmark performance

Ai2's evaluations were run through OLMES, the in-house Open Language Model Evaluation System used by the OLMo team.^[9] The headline tables in the Tülu 3 paper compare each variant against the Llama 3.1 Instruct model of the same size, and against the strongest peer at that scale.^[1] The 8B comparison focuses on Llama 3.1 8B Instruct and Qwen 2.5 7B Instruct, the 70B comparison on Llama 3.1 70B Instruct, and the 405B comparison on Llama 3.1 405B Instruct, DeepSeek V3, and GPT-4o.^[1]^[3]

Tülu 3 8B

Benchmark	Tülu 3 8B	Llama 3.1 8B Instruct	Qwen 2.5 7B Instruct
Average	64.8	62.2	57.8
MMLU (0-shot)	68.2	71.2	76.6
GSM8K (8-shot)	87.6	83.4	83.8
MATH (4-shot)	43.7	42.5	14.8
IFEval	82.4	80.6	74.7
HumanEval (pass@10)	83.9	86.3	93.1
AlpacaEval 2	34.5	24.2	29.0

The 8B is strongest on mathematics, instruction following, and AlpacaEval 2 chat preference.^[1] It trails Qwen 2.5 on MMLU and HumanEval, a gap Ai2 attributes to differences in the base models rather than the post-training recipe, since Qwen 2.5 base is itself stronger on these tasks than Llama 3.1 8B base.^[1]

Tülu 3 70B

Benchmark	Tülu 3 70B	Llama 3.1 70B Instruct
Average	76.0	73.4
MMLU (0-shot, CoT)	83.1	85.3
GSM8K (8-shot, CoT)	93.5	93.7
MATH (4-shot, Flex)	63.0	56.4
IFEval (prompt loose)	83.2	88.0
HumanEval (pass@10)	92.4	93.6
Safety (6-task average)	88.3	76.5

The 70B model recovers most of the gap to Llama 3.1 70B Instruct on general capabilities while overtaking it on mathematics and on Ai2's six task safety suite.^[1] Ai2 frames the safety improvement as a direct consequence of the explicit safety prompts included in both the SFT and DPO mixtures.^[1]

Tülu 3 405B

Benchmark	Tülu 3 405B	Llama 3.1 405B Instruct	DeepSeek V3	GPT-4o
Average (without safety)	80.0	78.1	79.0	80.5
Average (with safety)	80.7	79.0	75.9	81.6
MMLU (5-shot)	87.0	88.0	82.1	87.9
GSM8K (8-shot)	95.5	95.4	94.1	91.7
MATH (4-shot)	67.3	66.6	72.5	68.8
HumanEval (pass@10)	95.9	95.9	94.6	97.0
IFEval (prompt loose)	86.0	88.4	88.0	84.8
AlpacaEval 2	51.4	38.5	53.5	65.0
Safety (6-task average)	86.7	86.8	72.2	90.9

At the 405B scale Tülu 3 leads Llama 3.1 Instruct and DeepSeek V3 on the average score with safety included, posting 80.7 against DeepSeek V3's 75.9, and lands within striking distance of GPT-4o's 81.6.^[3]^[18] The DeepSeek V3 numbers used in this comparison come from Ai2's own evaluations through OLMES and may differ from the figures DeepSeek publishes in its own reports.^[3]

Is Tülu 3 open source?

The Tülu 3 release uses the term "fully open" in the same restrictive sense that Ai2 applies to the OLMo 2 family.^[2] A fully open post-training release in this vocabulary includes the model weights, the entire training data, the per stage code, the reward functions, the evaluation harness, and the recipe documentation including hyperparameter choices and dataset mixture weights.^[2] The intent is that a third party with sufficient compute could re-run the pipeline against a different base model or different data, and obtain a numerically comparable result.^[2] TechCrunch noted that the 405B model's components "can be freely accessed and replicated," distinguishing it from the more restrictively licensed GPT-4o and DeepSeek V3.^[18]

The artefacts published with Tülu 3 cover this scope. The 939,344 prompt collection is on Hugging Face under allenai/tulu-3-sft-mixture.^[1] The reward functions used during RLVR are in the Open Instruct repository under open_instruct/ground_truth_utils.py.^[8] The OLMES evaluation harness is at allenai/olmes.^[9] The intermediate SFT and DPO checkpoints are released as separate Hugging Face models.^[4] Even the persona descriptions used to generate the synthetic prompts are published, alongside the prompt templates used to call GPT-4o and Claude 3.5 Sonnet.^[1]

This level of openness has real costs. The collection occupies several terabytes of storage, and the licensing surface is more complex than for a single weights only release. Subsets of the prompt data carry additional terms inherited from the upstream sources (FLAN, OpenAssistant, WildChat, and so on), and the responses generated by GPT-4o and Claude 3.5 Sonnet are bound by the relevant terms of service for synthetic data use.^[1] Ai2 documents these per-subset terms in the dataset cards.^[1]

Licensing

Tülu 3 model weights are released under Meta's Llama 3.1 Community License, inherited from the Llama 3.1 base.^[4] This is permissive for most users but imposes an acceptable use policy and a 700 million monthly active user threshold above which a separate commercial agreement is required.^[4] The models are also subject to the Gemma Terms of Use and the Qwen License Agreement, because some synthetic training data was generated using those models.^[4]

The training and evaluation code in Open Instruct and OLMES is released under Apache 2.0.^[8]^[9] The datasets carry per-subset licensing because they aggregate material from multiple upstream sources.^[1] Ai2's general stance is that the recipe and code should be as openly licensed as possible to maximise reproducibility, while the weights inherit whatever constraint the base model imposes.^[2]

How does Tülu 3 compare to peer models?

Tülu 3 sits in a 2024 to 2025 open weight instruction-tuning landscape where the main reference points are Llama 3.1 Instruct, Qwen 2.5 Instruct, NVIDIA's Nemotron 70B, Mistral Instruct, and Nous Hermes 3.

Model	Parameters	Released	Post-training code public	RL stage
Tülu 3 8B and 70B	8B, 70B	November 21, 2024	Yes (Open Instruct)	RLVR (PPO)
Tülu 3 405B	405B	January 30, 2025	Yes (Open Instruct)	RLVR (PPO)
Llama 3.1 Instruct	8B, 70B, 405B	July 2024	No	DPO + Rejection sampling
Qwen 2.5 Instruct	7B to 72B	September 2024	No	DPO + PPO
Nemotron 70B Instruct	70B	October 2024	Partial	RLHF
Nous Hermes 3 405B	405B	November 2024	No	DPO

At the 8B scale Tülu 3 is the strongest open recipe model on mathematics and instruction following, while Qwen 2.5 Instruct is stronger on general knowledge and coding.^[1] At the 70B scale Tülu 3 closes most of the gap to Llama 3.1 70B Instruct and leads on math and safety.^[1] At the 405B scale Tülu 3 leads Llama 3.1 405B Instruct on Ai2's aggregate including safety, and is competitive with DeepSeek V3, though direct comparisons depend on the evaluation harness used.^[3]

Relationship to DeepSeek R1 and GRPO

The Tülu 3 paper introduced RLVR as a named pipeline stage on November 22, 2024.^[1] The DeepSeek R1 paper followed in January 2025 and used a similar idea, training with verifiable rewards on mathematics and coding to produce a long reasoning model with strong scores on competition benchmarks.^[16] Where Tülu 3 used standard PPO, DeepSeek used GRPO, a PPO variant that drops the value-function critic and instead computes a baseline from a group of completions sampled for the same prompt.^[16] Removing the value model makes the DeepSeek approach more memory efficient at large scale.^[16]
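
The group baseline that replaces the critic can be sketched in a few lines. This shows only the advantage computation as commonly described for GRPO, not a full training loop; the function name, the `eps` term, and the choice of population standard deviation are assumptions, and implementations differ on such details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for a group of completions sampled from one prompt.

    Each completion's reward is centred on the group mean and scaled by
    the group standard deviation, so no learned value model is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary verifier rewards, correct completions in a mixed group receive positive advantages and incorrect ones negative. A group in which every completion earns the same reward yields all-zero advantages and therefore no gradient signal.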

The two projects share the verifiable reward idea but emphasise different things. Tülu 3 frames RLVR as one stage in a broader general purpose post-training pipeline whose primary goal is matching Llama 3.1 Instruct across the board.^[1] DeepSeek R1 frames RLVR as the entire training story for a specialised reasoning model.^[16] The DeepSeek R1 paper does not cite Tülu 3 by name, but the Ai2 team observed that the math-only RLVR scaling behaviour it had documented for the 405B run was consistent with an observation in the R1 report.^[3]

In the months that followed, the broader open language model ecosystem converged on the RLVR-plus-GRPO combination as the default reinforcement learning stage for reasoning oriented training. Ai2's own OLMo 2 32B Instruct model switched from PPO based RLVR to GRPO for this reason. Ai2 and DeepSeek are usually credited together for popularising verifiable-reward RL, with Tülu 3 cited for naming the technique and shipping a reproducible recipe, and DeepSeek R1 for demonstrating how far the technique can be pushed when scaled.

Reception

The November 2024 release was covered by VentureBeat, GeekWire, and MarkTechPost, which led with the claim that Ai2 had matched or beaten the closed Llama 3.1 Instruct line using an entirely open recipe.^[12]^[14] Coverage emphasised three points: the named introduction of RLVR, the full release of training data and reward functions, and OLMES as a public evaluation harness that other groups could rerun.^[12]^[14] The release was discussed extensively on the Interconnects newsletter run by Nathan Lambert, one of the lead authors.

The 405B release on January 30, 2025 attracted more attention because it landed days after DeepSeek R1 had become the dominant story in open AI. GeekWire ran the headline that Ai2 was "challenging DeepSeek on key benchmarks."^[13] VentureBeat described Tülu 3 405B as the first fully open post-training recipe to beat DeepSeek V3 on aggregate benchmarks including safety.^[11] TechCrunch reported that "Ai2 says its new AI model beats one of DeepSeek's best."^[18] The Ai2 blog post discussed inference throughput as the binding constraint on RLVR iteration time at the 405B scale.^[3]

Ai2 used Tülu 3 as the canonical post training recipe for subsequent OLMo releases.^[10] The OLMo 2 7B and 13B Instruct models used Tülu 3 directly, while the OLMo 2 32B Instruct used a revised Tülu 3.1 mixture with GRPO replacing PPO. The recipe was also picked up by external groups for finetuning runs on Qwen, Mistral, and Phi base models.

Criticisms were narrower than the praise. Some commentators noted that Tülu 3 still trailed Llama 3.1 Instruct on MMLU and HumanEval at the 8B and 70B scales, suggesting that recipe parity did not always translate to capability parity when the competing Instruct model retained advantages from its own undocumented post-training.^[1] Others observed that the openness of the recipe was bounded by the upstream Llama license, so the model could not be used commercially without accepting Meta's acceptable use policy.^[4] Ai2 addressed both critiques in subsequent releases, by applying Tülu 3 on top of the fully open OLMo 2 base models and by continuing to push the recipe's competitive position with the Tülu 3.1 update.

References

1. Lambert, Nathan et al. "Tulu 3: Pushing Frontiers in Open Language Model Post-Training." arXiv preprint 2411.15124, November 22, 2024. https://arxiv.org/abs/2411.15124
2. Allen Institute for AI. "Tülu 3 opens language model post-training up to more tasks and more people." November 21, 2024. https://allenai.org/blog/tulu-3
3. Allen Institute for AI. "Scaling the Tülu 3 post-training recipes to surpass the performance of DeepSeek V3." January 30, 2025. https://allenai.org/blog/tulu-3-405B
4. Allen Institute for AI. "allenai/Llama-3.1-Tulu-3-8B." Hugging Face model card. https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B
5. Allen Institute for AI. "allenai/Llama-3.1-Tulu-3-70B." Hugging Face model card. https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B
6. Allen Institute for AI. "allenai/Llama-3.1-Tulu-3-405B." Hugging Face model card. https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B
7. Allen Institute for AI. "allenai/RLVR-GSM-MATH-IF-Mixed-Constraints." Hugging Face dataset card. https://huggingface.co/datasets/allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
8. Allen Institute for AI GitHub. "allenai/open-instruct." https://github.com/allenai/open-instruct
9. Allen Institute for AI GitHub. "allenai/olmes." https://github.com/allenai/olmes
10. Allen Institute for AI. "Tulu hub page." https://allenai.org/tulu
11. Franzen, Carl. "Ai2 releases Tülu 3, a fully open-source model that bests DeepSeek v3, GPT-4o with novel post-training approach." VentureBeat, January 30, 2025. https://venturebeat.com/ai/ai2-releases-tulu-3-a-fully-open-source-model-that-bests-deepseek-v3-gpt-4o-with-novel-post-training-approach
12. Bishop, Todd. "Ai2's new Tulu 3 model rivals tech giants in breakthrough for open-source AI post-training." GeekWire, November 21, 2024. https://www.geekwire.com/2024/ai2s-new-tulu-3-model-rivals-tech-giants-in-breakthrough-for-open-source-ai-post-training/
13. Bishop, Todd. "Allen Institute for AI challenges DeepSeek on key benchmarks with big new open-source AI model." GeekWire, January 30, 2025. https://www.geekwire.com/2025/allen-institute-for-ai-challenges-deepseek-on-key-benchmarks-with-big-new-open-source-ai-model/
14. MarkTechPost. "The Allen Institute for AI (AI2) Releases Tülu 3: A Set of State-of-the-Art Instruct Models with Fully Open Data, Eval Code, and Training Algorithms." November 21, 2024. https://www.marktechpost.com/2024/11/21/the-allen-institute-for-ai-ai2-releases-tulu-3-a-set-of-state-of-the-art-instruct-models-with-fully-open-data-eval-code-and-training-algorithms/
15. MarkTechPost. "The Allen Institute for AI (AI2) Releases Tülu 3 405B." January 31, 2025. https://www.marktechpost.com/2025/01/31/the-allen-institute-for-ai-ai2-releases-tulu-3-405b-scaling-open-weight-post-training-with-reinforcement-learning-from-verifiable-rewards-rlvr-to-surpass-deepseek-v3-and-gpt-4o-in-key-benchmarks/
16. DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv preprint 2501.12948, January 22, 2025. https://arxiv.org/abs/2501.12948
17. Allen Institute for AI. "allenai/tulu-2-dpo-70b." Hugging Face model card. https://huggingface.co/allenai/tulu-2-dpo-70b
18. Wiggers, Kyle. "Ai2 says its new AI model beats one of DeepSeek's best." TechCrunch, January 30, 2025. https://techcrunch.com/2025/01/30/ai2-says-its-new-ai-model-beats-one-of-deepseeks-best/
