OLMo 3
Last reviewed
May 16, 2026
Sources
17 citations
Review status
Source-backed
Revision
v1 · 3,862 words
OLMo 3 is the third generation of fully open language models released by the Allen Institute for AI (Ai2). The family was announced on November 20, 2025 and consists of dense decoder only Transformer models at two parameter scales, 7 billion and 32 billion, in four variants: the pretrained Base model and the post trained Think, Instruct, and RL Zero models. Ai2 markets the release under the banner of the "model flow," by which it means the full chain of artefacts needed to reproduce the work, including pretraining corpora, mid training and long context data, post training mixtures, intermediate checkpoints, training logs, evaluation suites, and the source code for both the model and the surrounding infrastructure. Everything is released under the Apache 2.0 licence.
The headline claim from Ai2 is that OLMo 3 produces the strongest fully open 32B reasoning model available at launch, with the OLMo 3 Think 32B variant matching Qwen 3 32B on a battery of mathematics, coding, and reasoning benchmarks while training on roughly six times fewer tokens. A second claim is operational: by rebuilding the post training stack around an in house reinforcement learning system called OlmoRL, the team cut the wall clock training time of OLMo 3 RL Think from over 15 days to around 6 days, a roughly 2.5x improvement that the team attributes to in flight weight updates, continuous batching, and active refilling during asynchronous RL. The Ai2 team frames OLMo 3 as the project where the fully open release format catches up to the closed and open weight frontier on reasoning, rather than trailing it by a generation as the earlier OLMo and OLMo 2 releases had.
Ai2 has positioned the OLMo programme as a counterweight to what it sees as the gradual closing of the open weight ecosystem. By late 2025 the dominant open weight families, Meta Llama, Alibaba Qwen, Mistral, and Google Gemma, were still releasing weights under various community licences but were generally not publishing their training data or recipes in a form that allowed outside researchers to reproduce or ablate the runs. The OLMo lineage has consistently rejected that posture and shipped the data and recipe alongside the weights.
The original OLMo arrived in February 2024 with 1B and 7B base models trained on the Dolma corpus. OLMo 2 followed with 7B and 13B models in November 2024 and a 32B flagship in March 2025, all built on the Tulu 3 post training stack and the Dolma 2 data mixture. The OLMo 2 32B Instruct model was the first fully open model to outperform OpenAI's GPT-3.5 Turbo and GPT-4o mini on Ai2's aggregate of academic benchmarks, but it stopped short of the reasoning capabilities seen in DeepSeek R1, Qwen 3, and other 2025 models that used dedicated reasoning training pipelines.
OLMo 3 was conceived as the generation where the open release format would tackle reasoning head on. Through the first three quarters of 2025 Ai2 rebuilt several parts of its stack in parallel: a new pretraining corpus called Dolma 3, a new post training data suite called Dolci, a new reinforcement learning framework called OlmoRL, an open evaluation system called OLMES, and a provenance tracing tool called OlmoTrace. The November release bundled all of these into a single coordinated launch and used the term "model flow" to describe the resulting end to end pipeline.
The OLMo 3 family is organised as a grid with two parameter scales on one axis and four variants on the other. The Base models are the pure pretrained checkpoints, Instruct and Think are the two main user facing chat variants, and RL Zero is a research only family of checkpoints designed to make reinforcement learning ablations tractable. Not every cell of the grid was filled at launch: as the table below shows, RL Zero was released only at 7B, and the 32B Instruct model arrived with the December 3.1 update.
| Variant | Parameters | Initial release | Purpose | HF repository |
|---|---|---|---|---|
| OLMo 3 Base 7B | 7 billion | November 20, 2025 | Foundation pretrained model | allenai/Olmo-3-1125-7B |
| OLMo 3 Base 32B | 32 billion | November 20, 2025 | Foundation pretrained model | allenai/Olmo-3-1125-32B |
| OLMo 3 Think 7B | 7 billion | November 20, 2025 | Reasoning with visible chain of thought | allenai/Olmo-3-1125-7B-Think |
| OLMo 3 Think 32B | 32 billion | November 20, 2025 | Flagship reasoning model | allenai/Olmo-3-1125-32B-Think |
| OLMo 3 Instruct 7B | 7 billion | November 20, 2025 | Chat, tool use, multi turn dialogue | allenai/Olmo-3-1125-7B-Instruct |
| OLMo 3 RL Zero 7B | 7 billion | November 20, 2025 | Four research checkpoints for math, code, IF, chat | allenai/Olmo-3-7B-RL-Zero-* |
| OLMo 3.1 Think 32B | 32 billion | December 12, 2025 | Updated reasoning checkpoint | allenai/Olmo-3.1-32B-Think |
| OLMo 3.1 Instruct 32B | 32 billion | December 12, 2025 | Updated chat model at 32B | allenai/Olmo-3.1-32B-Instruct |
The Base variant is intended as a foundation for further fine tuning. The Think variant produces an explicit reasoning trace before the final answer, in the style popularised by DeepSeek R1 and OpenAI's o1, and is the variant that Ai2 highlights against Qwen 3 in its benchmark tables. The Instruct variant is tuned for short, direct chat answers without an extended thinking phase, oriented towards production assistants and synthetic data generation. The RL Zero variant is a set of four checkpoints, each trained from the OLMo 3 Base with reinforcement learning on a single skill domain (mathematics, code generation, instruction following, and general chat), released specifically so that academics studying RL dynamics can ablate one component at a time without having to construct comparable starting points themselves.
A December 12, 2025 update bumped the 32B Think and Instruct models to the 3.1 designation. The update was reportedly the result of extended reinforcement learning training runs that Ai2 said produced gains of more than 5 points on AIME, more than 4 points on ZebraLogic, more than 4 points on IFEval, and more than 20 points on IFBench relative to the original November checkpoints, without changing the architecture or the pretrained base.
OLMo 3 uses a standard decoder only Transformer with pre normalisation and grouped query attention. The architecture is similar in spirit to OLMo 2 but extends the context length from 4,096 to 65,536 tokens, a sixteenfold increase, increases the per layer attention head count, and adopts a tokeniser refresh and an updated positional encoding configuration suitable for long context training. The published model cards list the following hyperparameters for the two scales.
| Spec | OLMo 3 7B | OLMo 3 32B |
|---|---|---|
| Layers | 32 | 64 |
| Hidden size | 4096 | 5120 |
| Query attention heads | 32 | 40 |
| Key value heads | 8 | 8 |
| Context length | 65,536 tokens | 65,536 tokens |
| Parameter dtype | BF16 | BF16 |
| Tokeniser | byte pair encoding, ~100k vocab | byte pair encoding, ~100k vocab |
| Pretraining data cutoff | December 2024 | December 2024 |
The move from 4,096 token context in OLMo 2 to 65,536 tokens in OLMo 3 was one of the most requested changes from outside users. Ai2 implements the long context with a dedicated training stage rather than a post hoc position interpolation trick, which adds complexity to the data pipeline but produces a model that holds its accuracy on retrieval and reasoning tasks well into the 32k to 64k range, according to the team's own evaluation. The team reports that the 7B model trains at roughly 7,700 tokens per device per second on a 1,024 H100 cluster.
Grouped query attention is used at a query to key value ratio of 4 to 1 in the 7B and 5 to 1 in the 32B, which keeps the inference key value cache compact for long context serving. The team retained the QK normalisation and RMSNorm choices from OLMo 2 and did not change the activation function or the embedding initialisation.
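The effect of grouping on cache size can be worked out from the table above. The following is a minimal sketch, assuming that the per head dimension equals the hidden size divided by the number of query heads (the tables do not state it) and that the cache is stored in BF16; the `AttentionSpec` class and its method names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional

BYTES_PER_BF16 = 2  # BF16 stores each value in two bytes


@dataclass(frozen=True)
class AttentionSpec:
    name: str
    layers: int
    hidden_size: int
    query_heads: int
    kv_heads: int

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / query_heads (not given in the tables).
        return self.hidden_size // self.query_heads

    def kv_cache_bytes(self, tokens: int, kv_heads: Optional[int] = None) -> int:
        """Bytes needed to cache keys and values for one sequence of `tokens` positions."""
        heads = self.kv_heads if kv_heads is None else kv_heads
        # The factor 2 covers the separate key and value tensors in every layer.
        return 2 * self.layers * heads * self.head_dim * BYTES_PER_BF16 * tokens


SPECS = [
    AttentionSpec("OLMo 3 7B", layers=32, hidden_size=4096, query_heads=32, kv_heads=8),
    AttentionSpec("OLMo 3 32B", layers=64, hidden_size=5120, query_heads=40, kv_heads=8),
]

CONTEXT = 65_536
GIB = 1024 ** 3

for spec in SPECS:
    grouped = spec.kv_cache_bytes(CONTEXT)
    # Hypothetical baseline: one key value head per query head (no grouping).
    ungrouped = spec.kv_cache_bytes(CONTEXT, kv_heads=spec.query_heads)
    print(f"{spec.name}: {grouped / GIB:.0f} GiB grouped, {ungrouped / GIB:.0f} GiB ungrouped")
```

Under these assumptions a single full length sequence needs about 8 GiB of cache for the 7B model and 16 GiB for the 32B model, against 32 GiB and 80 GiB respectively without grouping, which mirrors the 4 to 1 and 5 to 1 ratios quoted above.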
OLMo 3 uses a three stage pretraining curriculum followed by a three stage post training pipeline. The structure is conceptually similar to OLMo 2 but each stage uses a distinct, separately named data mixture.
The primary pretraining stage uses Dolma 3 Mix, a roughly 5.9 trillion token corpus drawn from web pages, scientific PDFs processed with Ai2's own olmOCR pipeline, code repositories, and mathematics. Dolma 3 Mix is sampled from a larger pool, the full Dolma 3 corpus of about 9.3 trillion tokens, which Ai2 published alongside the trained mix and described as a superset that future runs can draw from. Compared with the OLMo Mix used for OLMo 2, Dolma 3 Mix has a noticeably higher fraction of code and mathematics, a deliberate choice intended to improve downstream reasoning performance.
A second pretraining stage uses Dolma 3 Dolmino, a roughly 100 billion token mid training mixture sampled from a 2.2 trillion token pool. Dolmino concentrates on academic content, question answering data, instruction style examples, and mathematics, and the learning rate is annealed down to zero during this stage. The construction follows the same model souping approach as OLMo 2: multiple parallel runs are trained on different slices of Dolmino and then merged to produce the final base weights.
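The merging step can be illustrated with a toy example. This is a sketch of one common form of model souping, a uniform average of parameters that share a name; the text does not say which merging rule Ai2 used, so the uniform weighting, the `soup` function, and the toy checkpoints are all assumptions made for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def soup(checkpoints: List[Weights]) -> Weights:
    """Uniformly average parameters that share a name across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name in names:
        # Group the i-th value of this parameter from every checkpoint, then average.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(values) / len(checkpoints) for values in columns]
    return merged


# Hypothetical toy checkpoints standing in for runs trained on different Dolmino slices.
run_a = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
run_b = {"layer0.weight": [3.0, 4.0], "layer0.bias": [1.0]}
merged = soup([run_a, run_b])
print(merged)  # {'layer0.weight': [2.0, 3.0], 'layer0.bias': [0.5]}
```

Averaging only makes sense when the runs start from the same checkpoint and share an architecture, which is the situation described above.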
The third stage is the long context extension, which uses Dolma 3 Longmino. The 7B model is trained on roughly 50 billion tokens drawn from a 639 billion token pool of long documents, while the 32B model uses roughly 100 billion tokens for the same purpose. Longmino is heavily weighted towards documents that are themselves long enough to fill the 65k token window, including books, long technical PDFs, code repositories with many files, and concatenated documentation.
The full pretraining stack was run on up to 1,024 NVIDIA H100 GPUs, with mid training using 128 and post training using 256 H100s. Total training tokens for the OLMo 3 32B Base reach approximately 5.5 trillion in stage 1, 200 billion in stage 2, and 100 billion in stage 3, with the date cutoff at December 2024.
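The token budget and throughput figures quoted above can be combined in a few lines of arithmetic. This is a back of the envelope sketch: the stage totals and the per device rate come from the text, while the final estimate assumes a hypothetical 7B run that consumes the full 5.9 trillion token mix at constant throughput, which the source does not state.

```python
# Figures quoted in the text above.
STAGE_TOKENS_32B = {
    "stage 1 (Dolma 3 Mix)": 5.5e12,
    "stage 2 (Dolmino)": 200e9,
    "stage 3 (Longmino)": 100e9,
}
TOKENS_PER_DEVICE_PER_SECOND_7B = 7_700
DEVICES = 1_024
SECONDS_PER_DAY = 86_400

total_32b = sum(STAGE_TOKENS_32B.values())
print(f"32B Base total: {total_32b / 1e12:.1f}T tokens")
for stage, tokens in STAGE_TOKENS_32B.items():
    print(f"  {stage}: {100 * tokens / total_32b:.1f}% of the budget")

cluster_rate_7b = TOKENS_PER_DEVICE_PER_SECOND_7B * DEVICES
# Assumption: a hypothetical 7B pass over the 5.9T token mix at constant throughput.
days = 5.9e12 / cluster_rate_7b / SECONDS_PER_DAY
print(f"7B cluster rate: {cluster_rate_7b:,} tokens/s, roughly {days:.1f} days for 5.9T tokens")
```

The first stage accounts for nearly 95 percent of the 32B token budget, so the mid training and long context stages are small in compute terms even though they shape the final model's behaviour.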
The Instruct and Think variants are built on a three stage post training pipeline that follows the Tulu 3 blueprint developed for OLMo 2. The three stages are supervised fine tuning, direct preference optimisation, and reinforcement learning with verifiable rewards, but the data mixtures and the RL engine are new for this release. The RL Zero checkpoints skip the first two stages and apply reinforcement learning directly to the Base model.
The Dolci post training data suite contains separate mixtures for each stage. Dolci SFT and Dolci DPO are general mixtures used for the Instruct variants, while Dolci Think SFT, Dolci Think DPO, and Dolci Think RLVR are reasoning specific mixtures used for the Think variants. The mixtures combine curated public datasets, persona based synthetic data generated with the OLMo 3 Base models, competition mathematics problems, and instruction following examples constructed against deterministic constraint checkers. Like the pretraining data, the Dolci mixtures are released openly on Hugging Face without licence restrictions.
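A deterministic constraint checker of the kind mentioned above can double as a verifiable reward. The sketch below is illustrative only: the `Constraint` type, the two example constraints, and the all or nothing scoring are assumptions, not the checkers or reward shaping used in Dolci.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Constraint:
    description: str
    check: Callable[[str], bool]


def max_words(limit: int) -> Constraint:
    """Constraint that passes when the response has at most `limit` words."""
    return Constraint(f"at most {limit} words", lambda text: len(text.split()) <= limit)


def must_include(keyword: str) -> Constraint:
    """Constraint that passes when the response mentions `keyword`, ignoring case."""
    return Constraint(f"mentions '{keyword}'", lambda text: keyword.lower() in text.lower())


def verifiable_reward(response: str, constraints: List[Constraint]) -> float:
    """Return 1.0 only when every constraint passes (binary scoring is an assumption)."""
    return 1.0 if all(c.check(response) for c in constraints) else 0.0


constraints = [max_words(12), must_include("Dolma")]
print(verifiable_reward("Dolma 3 is the pretraining corpus.", constraints))  # 1.0
print(verifiable_reward("The corpus is large.", constraints))  # 0.0
```

Because the check is a pure function of the response text, the same prompt and response always receive the same reward, which is what makes the signal verifiable rather than learned.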
Reinforcement learning is handled by OlmoRL, a new open infrastructure that Ai2 says brings a series of engineering and algorithmic improvements over the off the shelf RL trainers used for OLMo 2. The team highlights three changes: in flight weight updates that keep the generation actors in sync with the learner during long generation steps, continuous batching that prevents shorter rollouts from waiting on longer ones, and active refilling that maintains a constant generation flow during asynchronous RL training. Together, these changes are reported to reduce the wall clock RL time for OLMo 3 RL Think from over 15 days to around 6 days on the same hardware budget, and to increase the throughput of supervised fine tuning by roughly 8x compared with the OLMo 2 stack.
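The benefit of continuous batching with refilling can be seen in a toy scheduling model. This sketch is not OlmoRL code: it assumes every generation slot produces one token per decode step at equal cost, and the function names and rollout lengths are hypothetical.

```python
import heapq
from typing import List


def static_batch_steps(lengths: List[int], slots: int) -> int:
    """Each batch waits for its longest rollout before the next batch starts."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batch_steps(lengths: List[int], slots: int) -> int:
    """A slot takes the next queued rollout as soon as its current one finishes."""
    free_at = [0] * slots  # decode step at which each slot next becomes free
    heapq.heapify(free_at)
    for length in lengths:
        start = heapq.heappop(free_at)  # earliest available slot
        heapq.heappush(free_at, start + length)
    return max(free_at)


# Hypothetical rollout lengths in decode steps: two long reasoning traces among short ones.
rollouts = [100, 10, 10, 10, 20, 10, 90, 10]
print(static_batch_steps(rollouts, slots=4))      # 190
print(continuous_batch_steps(rollouts, slots=4))  # 100
```

In the static schedule three of the four slots sit idle while the longest rollout in each batch finishes, whereas the refilled schedule is bounded by the single longest rollout. The gap widens as rollout lengths become more uneven, which is typical of reasoning traces.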
A distinctive feature of the OLMo 3 release is OlmoTrace, a tool that lets users highlight any span of model output in the Ai2 Playground and trace it back to specific documents in the pretraining corpus. OlmoTrace is intended to support several use cases: auditing apparent hallucinations against the actual training data, distinguishing reasoning from memorisation by looking for verbatim sources, detecting evaluation contamination, and studying how specific capabilities emerge from particular training material. Ai2 frames OlmoTrace as the practical payoff of releasing the full training corpus, since the same tool is not implementable for closed or open weight models whose training data is not published.
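The basic idea of tracing output back to training text can be shown with a toy verbatim matcher. This is an illustration of the concept only and makes no claim about how OlmoTrace is implemented: it scans a small in memory corpus with a greedy left to right search, and the function name, the `min_words` threshold, and the example documents are all hypothetical.

```python
from typing import Dict, List, Tuple


def find_verbatim_spans(
    output: str, corpus: Dict[str, str], min_words: int = 4
) -> List[Tuple[str, List[str]]]:
    """Greedily find output spans of at least `min_words` words that occur verbatim in the corpus."""
    words = output.lower().split()
    # Normalise whitespace and pad with spaces so matches align to whole words.
    docs = {doc_id: " " + " ".join(text.lower().split()) + " " for doc_id, text in corpus.items()}
    spans: List[Tuple[str, List[str]]] = []
    start = 0
    while start < len(words):
        best_end, best_docs = start, []
        for end in range(start + min_words, len(words) + 1):
            needle = " " + " ".join(words[start:end]) + " "
            hits = [doc_id for doc_id, text in docs.items() if needle in text]
            if not hits:
                break  # a longer span cannot match once a shorter prefix fails
            best_end, best_docs = end, hits
        if best_docs:
            spans.append((" ".join(words[start:best_end]), best_docs))
            start = best_end
        else:
            start += 1
    return spans


# Hypothetical miniature corpus and model output.
corpus = {
    "doc-1": "The quick brown fox jumps over the lazy dog near the river bank",
    "doc-2": "Grouped query attention keeps the key value cache compact",
}
output = "I read that the quick brown fox jumps over a fence"
print(find_verbatim_spans(output, corpus))
```

A linear scan like this would not scale to a corpus of trillions of tokens, so a production system needs a purpose built index; the sketch only shows why access to the training text is a precondition for this kind of audit.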
Ai2 evaluates OLMo 3 on its own OLMES suite plus the wider community standard reasoning and instruction benchmarks. Ai2 reports the following headline numbers for the base 32B model, which is the largest pretrained checkpoint in the family.
| Benchmark | Category | OLMo 3 Base 32B |
|---|---|---|
| GSM8k | Math | 80.5 |
| MATH | Math | 43.4 |
| HumanEval | Code | 66.5 |
| MBPP | Code | 60.2 |
| MMLU STEM | STEM knowledge | 70.8 |
| SQuAD | Reading | 98.2 |
| DROP | Reading | 81.0 |
| BBH | Reasoning | 77.6 |
The headline reasoning benchmarks for OLMo 3 Think 32B and the chat benchmarks for OLMo 3 Instruct 7B are reported separately, with the Think numbers measured with an extended thinking budget.
| Benchmark | Category | OLMo 3 Think 32B | OLMo 3 Instruct 7B |
|---|---|---|---|
| MATH | Math | 96.1 | 87.3 |
| AIME 2024 | Math | 76.8 | not reported |
| AIME 2025 | Math | 72.5 | not reported |
| BigBenchHard | Reasoning | 89.8 | 71.2 |
| HumanEvalPlus | Code | 91.4 | 77.2 |
| IFEval | Instruction following | 89.0 | 85.6 |
| MMLU | Knowledge | 85.4 | not reported |
| SimpleQA | Knowledge | not reported | 79.3 |
The December 12 update to OLMo 3.1 Think 32B improved several of these numbers. Ai2 reports gains of more than 5 points on AIME, more than 4 points on ZebraLogic, more than 4 points on IFEval, and more than 20 points on IFBench compared with the November checkpoint, attributed to a longer reinforcement learning run on the existing base rather than any architectural change. The RL Zero 7B Code and Math checkpoints were also updated as part of the 3.1 release with what Ai2 describes as longer and more stable training runs.
These results should be read with the usual caveats about evaluation harness, prompt formatting, and contamination, and Ai2 acknowledges in its release blog that benchmark scores for reasoning models are unusually sensitive to thinking budget, sampling temperature, and stop conditions. The team publishes its evaluation scripts as part of OLMES and a separate tool called OlmoBaseEval for pretraining era comparisons.
The OLMo 3 family enters a late 2025 landscape that includes Qwen 3 from Alibaba, Meta Llama 3.1 and Llama 3.3, Google Gemma 3, DeepSeek R1 distilled variants, and a small set of fully open competitors such as Marin and Apertus. Ai2 publishes side by side comparisons in its release blog and the OLMo 3 technical report.
| Model | Parameters | Released | Licence | Training data open | Reasoning variant |
|---|---|---|---|---|---|
| OLMo 3 Base and Think | 7B and 32B | November 2025 | Apache 2.0 | Yes (Dolma 3) | Yes (Think) |
| OLMo 2 7B, 13B, 32B | 7B to 32B | November 2024 to March 2025 | Apache 2.0 | Yes (Dolma 2) | No |
| Qwen 3 8B and 32B | 8B and 32B | 2025 | Apache 2.0 | No | Yes |
| Llama 3.1 8B and 70B | 8B and 70B | July 2024 | Llama 3.1 Community | No | No |
| Llama 3.3 70B | 70B | December 2024 | Llama 3.3 Community | No | No |
| Gemma 3 27B | 27B | 2025 | Gemma Terms of Use | No | No |
| DeepSeek R1 Distill 32B | 32B | January 2025 | MIT | No (R1 base not open) | Yes |
| Marin 32B | 32B | 2025 | Apache 2.0 | Yes | No |
| Apertus 70B | 70B | 2025 | Apache 2.0 | Yes | No |
At the base model level, Ai2 reports that OLMo 3 Base 32B outperforms the fully open Marin 32B and Apertus 70B, sits competitively with Qwen 2.5 32B and Gemma 3 27B on aggregate scores, and trails Llama 3.1 70B on several individual benchmarks. The Think 32B variant ties or exceeds Qwen 3 32B on several reasoning benchmarks, matches Qwen 3 VL 32B Thinking on the OMEGA suite, and outperforms the DeepSeek R1 Distill 32B on specific instruction following tasks. The Instruct 7B variant is described by Ai2 as matching or outperforming Qwen 2.5 7B, Gemma 3 7B, and Llama 3.1 8B at similar scale, with the gap to Qwen 3 8B reported as around 1 to 2 points overall.
The efficiency comparison is a recurring talking point. Ai2 reports that the OLMo 3 Base 32B was trained on roughly six times fewer tokens than Qwen 3 32B and that the OLMo 3 Base 7B is approximately 2.5 times more efficient to train than Meta's Llama 3.1 8B, measured in GPU hours per token. These ratios depend on the exact comparison points but were widely repeated in coverage of the release.
All OLMo 3 weights, datasets, and source code are released under the Apache 2.0 licence, the same permissive licence used for the earlier OLMo and OLMo 2 generations. Apache 2.0 allows commercial use, modification, and redistribution provided the licence notice and any patent grants are preserved. There are no per user thresholds, acceptable use restrictions, or commercial gate triggers of the kind found in Meta's Llama Community Licence or Google's Gemma Terms of Use.
Ai2 publishes a separate Responsible Use Guidelines document that asks users to apply common sense restrictions on misuse, but the document is advisory and not part of the licence. The Dolma 3 corpus is released without additional licence restrictions, although Ai2 notes that individual documents within the corpus retain their original copyright terms and that users are responsible for compliance with those terms when training or distributing derivative models. The Dolci post training mixtures are similarly released as raw downloads without further restriction.
The choice of Apache 2.0 across the entire stack is part of Ai2's broader policy argument for open source AI. The team has used the OLMo lineage in submissions to the United States National Telecommunications and Information Administration, the United Kingdom AI Safety Institute, and the European Union AI Office as a concrete example of how high quality models can be released under fully permissive terms without producing models that pose obvious uplift to misuse.
Reception in the AI research community was broadly positive, with attention focused on three themes. The first was the reasoning result. OLMo 3 Think 32B was described by Nathan Lambert, the former Ai2 staffer who runs the Interconnects AI newsletter, as the first fully open reasoning model that was credibly competitive with Qwen 3 on academic mathematics and code benchmarks. Cameron Wolfe's open language model survey called the release the most significant entry in what he termed the open LLM renaissance of 2025. The Hacker News thread on the Ai2 blog post drew several hundred comments, with multiple commenters singling out OlmoTrace as the feature that distinguished OLMo 3 from open weight competitors.
The second theme was the operational story around OlmoRL and the model flow concept. The roughly 2.5x reduction in RL training wall clock time, combined with the open release of the OlmoRL code, was seen as a contribution to the open reinforcement learning literature in its own right. Simon Willison's blog post described OLMo 3 as "a fully open LLM" and emphasised the practical implications of being able to inspect and modify every stage of the pipeline.
The third theme was the December 3.1 update, which arrived less than four weeks after the original release. Several commenters interpreted the rapid cadence as evidence that the OlmoRL stack made post training cheap enough that Ai2 could iterate on reasoning quality on a roughly monthly basis, rather than the six to nine month cadence common for closed lab releases. The update itself was generally well received, with the IFBench gain of more than 20 points highlighted as an unusually large improvement for a point release.
Criticism was targeted. The 32B Think model was acknowledged as still trailing the strongest closed reasoning systems including OpenAI's o3 and Gemini 3 Pro on the harder Olympiad mathematics and ARC-AGI suites. The Instruct 7B model was viewed as a step behind Qwen 3 8B on tool use specific benchmarks. Several reviewers also noted that the 65k token context length, while a substantial improvement over OLMo 2, still trailed the 128k and longer windows that had become standard in closed models by late 2025. Ai2 acknowledged these gaps in the release blog and pointed to the model flow architecture as the venue where they would be addressed in subsequent OLMo 3.x and OLMo 4 releases.