OLMo 2
Last reviewed
May 16, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 3,643 words
OLMo 2 (sometimes stylised 2 OLMo 2 Furious after the technical report of the same name) is the second generation of fully open large language models released by the Allen Institute for AI (Ai2). The family was first announced on November 26, 2024 with 7B and 13B parameter base and instruction-tuned variants, and was joined in March 2025 by a 32B flagship model that Ai2 described as the first fully open language model to outperform OpenAI's GPT-3.5 Turbo and GPT-4o mini on a broad academic benchmark suite. Like its predecessor OLMo, the OLMo 2 series ships every artefact needed to reproduce the work, including weights, training data, training code, evaluation harnesses, hundreds of intermediate checkpoints, and detailed training logs, all released under the permissive Apache 2.0 licence.
Ai2 pitched OLMo 2 as a response to the gradual closing of the so called open weight ecosystem. Llama, Qwen, Mistral and Gemma had become the default reference models for academic work by 2024, but their training data and recipes remained proprietary, which made it impossible to ablate dataset decisions or audit the provenance of training material. OLMo 2 attempts to close that gap at competitive performance levels: the 7B and 13B variants match or beat Meta's Llama 3.1 8B and Alibaba's Qwen 2.5 7B on several evaluation suites despite using fewer training FLOPs, and the 32B model trails only larger open weight models such as Llama 3.3 70B and frontier scale closed models on Ai2's aggregate academic benchmarks.
The original OLMo, released in February 2024, established Ai2's commitment to releasing models with their full training recipe. That first generation included 1B and 7B base models trained on the Dolma corpus, an open 3 trillion token web mixture also published by Ai2. Even at launch the 7B was competitive with Meta's Llama 2 7B on most academic benchmarks, but it lagged the newer Llama 3, Mistral 7B and Qwen 1.5 7B that arrived in the following months. The original models also showed instability during long training runs, with intermittent loss spikes that forced engineers to manually intervene and restart from earlier checkpoints. A mid year refresh called OLMo 1.7 7B added a 24 point improvement on the MMLU benchmark by upgrading to a revised dataset (Dolma 1.7) and a longer 2 trillion token training run, but the architecture, optimiser and training hyperparameters were largely unchanged from the original release.
OLMo 2 was conceived as a clean redesign. The Ai2 OLMo team had spent most of 2024 cataloguing the sources of instability and inefficiency in the first generation, and used the new release to apply a coordinated set of fixes drawn from the wider open weights literature. The headline result was a family of base models that the team claimed sat on the Pareto frontier of performance versus training compute, beating contemporaneous open weight models at equivalent parameter counts while using one third to one half the training FLOPs.
| Variant | Parameters | Release date | Training tokens | HF repository |
|---|---|---|---|---|
| OLMo 2 7B | 7 billion | November 26, 2024 | 4 trillion | allenai/OLMo-2-1124-7B |
| OLMo 2 13B | 13 billion | November 26, 2024 | 5 trillion | allenai/OLMo-2-1124-13B |
| OLMo 2 32B | 32 billion | March 13, 2025 | 6 trillion | allenai/OLMo-2-0325-32B |
Each base model is accompanied by an SFT checkpoint (supervised fine tuning only), a DPO checkpoint, and a final Instruct release that adds reinforcement learning. The 7B and 13B Instruct models use PPO with verifiable rewards, while the 32B Instruct model uses Group Relative Policy Optimisation (GRPO), the algorithm popularised by DeepSeek in early 2025. Ai2 also released the reward models used in training, for example allenai/OLMo-2-1124-7B-RM, so that researchers can replicate or vary the preference tuning stage independently. The naming convention encodes the release date in the suffix, so 1124 refers to November 2024 and 0325 to March 2025.
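The naming convention can be made concrete with a small parser. This is a minimal sketch, not an Ai2 utility: the function name `parse_olmo2_repo` and the regular expression are hypothetical, and the pattern assumes every repository follows the `allenai/OLMo-2-MMYY-SIZE[-STAGE]` shape seen in the identifiers quoted above.

```python
import re

# Hypothetical pattern inferred from the repository ids quoted in this article.
_REPO_PATTERN = re.compile(
    r"^allenai/OLMo-2-(?P<mm>\d{2})(?P<yy>\d{2})-(?P<size>\d+B)(?:-(?P<stage>.+))?$"
)


def parse_olmo2_repo(repo_id):
    """Split an OLMo 2 repository id into release month, size, and stage."""
    match = _REPO_PATTERN.match(repo_id)
    if match is None:
        raise ValueError(f"not an OLMo 2 repository id: {repo_id!r}")
    if not 1 <= int(match["mm"]) <= 12:
        raise ValueError(f"invalid month in suffix: {match['mm']}")
    return {
        "release": f"20{match['yy']}-{match['mm']}",  # e.g. 1124 -> 2024-11
        "size": match["size"],
        # A missing stage suffix is treated as the base model (assumption).
        "stage": match["stage"] or "base",
    }


if __name__ == "__main__":
    for repo in ("allenai/OLMo-2-1124-7B", "allenai/OLMo-2-0325-32B"):
        print(repo, parse_olmo2_repo(repo))
```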
The OLMo 2 base architecture is a decoder only Transformer in the now standard pre normalisation configuration, but the team made several focused changes intended to improve training stability and per token efficiency relative to the first OLMo generation.
| Spec | OLMo 2 7B | OLMo 2 13B | OLMo 2 32B |
|---|---|---|---|
| Layers | 32 | 40 | 64 |
| Hidden size | 4096 | 5120 | 5120 |
| Attention heads | 32 | 40 | 40 |
| Context length | 4096 tokens | 4096 tokens | 4096 tokens |
| Training FLOPs | not officially disclosed | 4.6 x 10^23 | 1.3 x 10^24 |
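
The Training FLOPs row can be sanity checked with the common rule of thumb that a dense Transformer costs about 6 FLOPs per parameter per training token. The sketch below is an approximation, not Ai2's accounting: it uses the nominal parameter counts and token budgets from the tables above and ignores attention FLOPs, so it should land somewhat below the reported figures. The 7B value is only an estimate, since no official number is given.

```python
def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost for a dense Transformer: 6 * N * D."""
    return 6.0 * n_params * n_tokens


# Nominal (rounded) parameter counts and token budgets from the variant table.
MODELS = {
    "OLMo 2 7B": (7e9, 4e12),
    "OLMo 2 13B": (13e9, 5e12),
    "OLMo 2 32B": (32e9, 6e12),
}

ESTIMATES = {
    name: approx_training_flops(params, tokens)
    for name, (params, tokens) in MODELS.items()
}

for name, flops in ESTIMATES.items():
    print(f"{name}: ~{flops:.2e} FLOPs")
```

With these inputs the 13B and 32B estimates come out roughly 10 to 15 percent under the reported 4.6 x 10^23 and 1.3 x 10^24, which is the direction expected from rounded parameter counts and uncounted attention cost.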
The most important architectural change was the switch from the non parametric layer normalisation used in OLMo 1 to RMSNorm, a simpler and slightly cheaper alternative that omits the mean centering step. Ai2 also added QK normalisation, which applies RMSNorm to the query and key projections inside each attention head before computing attention scores. This combination was reported in the OLMo 2 technical report to substantially reduce the frequency of loss spikes during long training runs, which had been a recurring operational headache during OLMo 1 training.
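The two normalisation changes can be illustrated in a few lines of plain Python. This is a minimal sketch for a single query and a single head, not the OLMo 2 implementation: the function names, the `eps` value, and the unit gain default are assumptions made for illustration.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """Divide x by its root mean square; unlike LayerNorm, no mean is subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if gain is None:
        gain = [1.0] * len(x)  # assumption: unit gain for illustration
    return [g * v / rms for g, v in zip(gain, x)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def qk_norm_attention_weights(query, keys):
    """Attention weights of one query over several keys inside one head,
    with RMSNorm applied to the query and each key before the dot product."""
    q = rms_norm(query)
    scale = math.sqrt(len(q))
    scores = []
    for key in keys:
        k = rms_norm(key)
        scores.append(sum(a * b for a, b in zip(q, k)) / scale)
    return softmax(scores)


if __name__ == "__main__":
    keys = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
    print(qk_norm_attention_weights([1.0, 2.0, 3.0, 4.0], keys))
    # Scaling the query by 100 leaves the weights essentially unchanged.
    print(qk_norm_attention_weights([100.0, 200.0, 300.0, 400.0], keys))
```

With unit gain, each normalised vector has length close to the square root of the head dimension, so the scaled dot product cannot exceed that value in magnitude. Bounding the attention logits in this way is one plausible reading of why QK normalisation helps with loss spikes.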
The positional encoding scheme was upgraded to rotary positional embeddings, replacing the absolute learned positional embeddings used in OLMo 1. Z loss regularisation was added to discourage the final softmax from saturating, a trick borrowed from the PaLM family. The team also revisited weight initialisation, choosing a scheme that preserves activation and gradient magnitudes across the depth of the network. Tokenisation continued to use the same byte pair encoding tokeniser as the original OLMo with a vocabulary of roughly 100k tokens, but the team retrained the embedding and output projection matrices from scratch rather than warm starting from OLMo 1.
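The Z loss term is simple enough to write out. The sketch below follows the form described for PaLM, an auxiliary penalty proportional to the squared log of the softmax normaliser; the coefficient `z_coeff=1e-4` is an assumed value for illustration, and the function names are hypothetical rather than taken from the OLMo codebase.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log of the softmax normaliser Z."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(v - peak) for v in logits))


def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    """Return (total, cross_entropy, z_loss) for one token.

    z_loss = z_coeff * log(Z)^2 pulls log(Z) towards zero, which keeps the
    output logits from drifting to large magnitudes.
    """
    log_z = log_sum_exp(logits)
    cross_entropy = log_z - logits[target]
    z_loss = z_coeff * log_z ** 2
    return cross_entropy + z_loss, cross_entropy, z_loss


if __name__ == "__main__":
    # Shifting every logit by a constant leaves cross entropy unchanged
    # but increases the Z loss penalty.
    print(cross_entropy_with_z_loss([0.0, 0.0, 0.0, 0.0], target=0))
    print(cross_entropy_with_z_loss([10.0, 10.0, 10.0, 10.0], target=0))
```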
Context length was held at 4096 tokens across all three variants for the initial release, a deliberately conservative choice that reflects Ai2's focus on reproducibility and stable evaluation rather than chasing long context benchmarks. Subsequent OLMo 2 derivatives released in 2025 by third parties extended this to 32k or 65k tokens using YaRN or similar position interpolation techniques.
OLMo 2 introduced a two stage pretraining curriculum that became one of the most discussed contributions of the project. The first stage uses a broad web heavy mixture called OLMo Mix 1124, which comprises roughly 3.9 trillion tokens drawn from the DataComp Language Model (DCLM) corpus, Dolma, the StarCoder code corpus, and Proof Pile II for mathematics. This stage consumes more than 90 percent of the total training budget and is run for slightly more than a single epoch for the 7B, 1.2 epochs for the 13B, and around 1.5 epochs for the 32B.
The second stage applies an annealing curriculum on Dolmino Mix 1124, a roughly 843 billion token corpus composed of about 50 percent high quality filtered web data combined with academic content, question answering datasets, instruction style examples, and mathematics. During this stage the learning rate is annealed from its first stage value down to zero. Ai2 trained multiple parallel runs on different slices of Dolmino Mix (the 7B used a single 50 billion token mix, while the 13B and 32B used three 100 billion token mixes plus one 300 billion token mix) and then merged the resulting checkpoints, a technique sometimes called model souping, to produce the final base weights. The Ai2 team argues this curriculum captures the benefit of staged training (sharpening the model on high quality data near the end) while keeping the bulk of the run on diverse web data where the optimisation dynamics are well understood.
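The two mechanical pieces of the second stage, the decay to zero and the checkpoint merge, can be sketched as follows. This is an illustration under stated assumptions: the decay is taken to be linear, the merge is taken to be a uniform average, and checkpoints are represented as plain dictionaries of flat float lists rather than real tensors. The names `linear_anneal` and `soup` are hypothetical.

```python
def linear_anneal(step, total_steps, start_lr):
    """Learning rate falling linearly from start_lr at step 0 to zero at total_steps."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return start_lr * (1.0 - step / total_steps)


def soup(checkpoints):
    """Uniformly average parameters across checkpoints that share the same keys."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = set(checkpoints[0])
    if any(set(ckpt) != names for ckpt in checkpoints):
        raise ValueError("checkpoints must have identical parameter names")
    merged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(col) / len(checkpoints) for col in columns]
    return merged


if __name__ == "__main__":
    # Three toy "runs" annealed on different data slices, then merged.
    runs = [
        {"w": [1.0, 2.0], "b": [0.0]},
        {"w": [3.0, 4.0], "b": [0.3]},
        {"w": [2.0, 3.0], "b": [0.6]},
    ]
    print(soup(runs))
    print([linear_anneal(s, 4, 3e-4) for s in range(5)])
```

Averaging only makes sense here because every annealing run starts from the same first stage checkpoint, so the merged weights stay in a shared region of parameter space.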
The pretraining corpora were released alongside the model weights under the same Apache 2.0 licence, continuing the practice established by Dolma and the original OLMo. OLMo Mix 1124 supersedes the older Dolma 1.7 corpus and is often referred to colloquially as Dolma 2, though Ai2's official naming uses the dated 1124 suffix. The mix changes the balance of sources relative to Dolma 1.7, with a larger share of high quality filtered web data and a smaller share of patents, peS2o academic papers, and Project Gutenberg books. Ai2 published per source token counts, deduplication scripts, and the exact filtering pipelines, including the Quality Classifier model used to rank web pages.
The 7B and 13B models were trained on a mix of Ai2 internal clusters and partner clusters. The 32B was trained on the Augusta cluster, a 160 node deployment of 8 GPU H100 nodes on Google Cloud Engine connected by GPUDirect TCPXO. Ai2 reported a sustained training throughput of more than 1800 tokens per second per GPU for the 32B run, corresponding to roughly 38 percent model FLOPs utilisation. The team has stated publicly that the full 32B run consumed approximately one third of the training compute used for the comparable Qwen 2.5 32B, a comparison that became a recurring talking point in coverage of the release.
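The throughput and utilisation figures can be related with a back of the envelope calculation. The sketch below is an approximation under loud assumptions: it uses the 6 FLOPs per parameter per token rule of thumb, a nominal 32 billion parameters, and an assumed H100 dense BF16 peak of 989 TFLOP/s, none of which come from Ai2's own accounting. The wall clock estimate further assumes all 1,280 GPUs run at the quoted rate for the whole 6 trillion token budget with no downtime.

```python
PARAMS = 32e9                      # nominal parameter count
TOKENS_PER_SEC_PER_GPU = 1800.0    # throughput quoted above
NUM_GPUS = 160 * 8                 # 160 nodes of 8 GPUs
TOTAL_TOKENS = 6e12                # token budget from the variant table
ASSUMED_PEAK_FLOPS = 989e12        # assumption: H100 dense BF16 peak per GPU


def model_flops_utilisation(params, tokens_per_sec, peak_flops):
    """Fraction of peak achieved, counting 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens_per_sec / peak_flops


def training_days(total_tokens, tokens_per_sec_per_gpu, num_gpus):
    """Idealised wall clock time in days at a constant cluster throughput."""
    seconds = total_tokens / (tokens_per_sec_per_gpu * num_gpus)
    return seconds / 86400.0


MFU = model_flops_utilisation(PARAMS, TOKENS_PER_SEC_PER_GPU, ASSUMED_PEAK_FLOPS)
DAYS = training_days(TOTAL_TOKENS, TOKENS_PER_SEC_PER_GPU, NUM_GPUS)

print(f"approximate MFU: {MFU:.1%}")
print(f"idealised training time: {DAYS:.1f} days")
```

Under these assumptions the estimate lands in the mid 30s percent, a little under the quoted 38 percent. The gap is in the direction expected when attention FLOPs and the exact parameter count are left out, and since the quoted throughput is a lower bound ("more than 1800") the real figure would be at least this high.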
The instruction tuned and aligned variants of OLMo 2 were produced by applying the Tulu 3 post training recipe, also developed and released by Ai2. Tulu 3 is a permissively licensed alignment pipeline built around three stages: supervised fine tuning on a curated instruction mixture, preference optimisation with DPO, and a final reinforcement learning stage that Ai2 calls Reinforcement Learning with Verifiable Rewards (RLVR). RLVR replaces the learned reward model used in conventional RLHF with deterministic checkers for tasks where correctness can be verified programmatically, such as mathematical problem solving against ground truth answers and instruction following against format constraints.
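A verifiable reward is just a deterministic function from a completion to a score. The sketch below shows the idea for the two task types mentioned above; it is not the Tulu 3 code, and the function names, the "last number in the text" answer extraction, and the specific format constraints are all assumptions chosen for illustration.

```python
import re


def math_reward(completion, ground_truth):
    """1.0 if the last number in the completion equals the ground truth, else 0.0."""
    # Strip thousands separators, then collect integers and decimals.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(ground_truth) else 0.0


def format_reward(completion, max_words=None, must_end_with=None):
    """1.0 if the completion satisfies every requested format constraint, else 0.0."""
    text = completion.strip()
    if max_words is not None and len(text.split()) > max_words:
        return 0.0
    if must_end_with is not None and not text.endswith(must_end_with):
        return 0.0
    return 1.0


if __name__ == "__main__":
    print(math_reward("12 boxes of 100 plus 34 gives 1,234", "1234"))
    print(format_reward("Short answer. Done", max_words=5, must_end_with="Done"))
```

Because the score depends only on the text and a fixed rule, there is no learned reward model for the policy to exploit, although a checker this simple can still be gamed by outputs that happen to end in the right number.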
For the 7B and 13B Instruct models, RLVR was implemented using PPO. The 32B Instruct model switched to GRPO, which removes the value function critic used by PPO and replaces it with a baseline computed from a group of sampled completions. The 32B Instruct release also used a revised post training mixture called Tulu 3.1 with improvements to the preference data and the RLVR reward functions.
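The group baseline that replaces the critic can be written in a few lines. This is a minimal sketch of the advantage computation only, assuming the commonly described form in which each reward is centred on the group mean and divided by the group standard deviation; the function name, the use of the population standard deviation, and the `eps` constant are illustrative choices, not details confirmed for the OLMo 2 run.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for completions sampled from the same prompt.

    Each reward is compared with the group mean, so no learned value
    function is needed to provide a baseline.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled completions scored by a verifiable reward (1 = correct).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # If every completion gets the same reward, no completion is preferred.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```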
The Tulu 3 SFT mixture combines public datasets such as the FLAN collection, OpenAssistant conversations, Tulu's own persona based synthetic data, and competition mathematics datasets. Ai2 released variant SFT mixtures specifically constructed for OLMo 2 base models, distinct from the original Tulu 3 mixtures, because the strongest mixture turned out to depend on the base model's pretraining distribution.
The headline benchmark numbers for the base models, reported by Ai2 in the OLMo 2 technical report and its accompanying blog posts, place the 7B and 13B variants ahead of similarly sized contemporaries on several core academic suites.
| Benchmark | OLMo 2 7B base | OLMo 2 13B base | OLMo 1 7B | OLMo 1.7 7B |
|---|---|---|---|---|
| MMLU | 63.7 | 67.5 | 28.3 | 54.0 |
| AGIEval | 50.4 | 54.2 | not reported | not reported |
| GSM8K | not reported | 75.1 | not reported | not reported |
For the Instruct variants, Ai2's published comparison tables include side by side scores against Llama 3.1, Qwen 2.5, Gemma 2, and Ai2's own Tulu 3 8B baseline. The OLMo 2 7B Instruct numbers come from the model card on Hugging Face.
| Model | MMLU | GSM8K | MATH | IFEval | BBH | AlpacaEval 2 LC | Average |
|---|---|---|---|---|---|---|---|
| OLMo 2 7B Instruct | 61.3 | 85.1 | 32.5 | 72.3 | 46.6 | 29.1 | 54.8 |
| Tulu 3 8B | 68.2 | 87.6 | 43.7 | 82.4 | 66.0 | 34.0 | 60.4 |
| Llama 3.1 8B Instruct | 71.3 | 83.4 | 42.5 | 80.6 | 69.7 | 25.8 | 58.9 |
| Qwen 2.5 7B Instruct | 76.6 | 83.8 | 69.9 | 74.7 | 25.3 | 29.7 | 57.1 |
| Gemma 2 9B Instruct | 69.1 | 79.7 | 29.8 | 69.9 | 2.5 | 43.7 | 51.9 |
The 32B Instruct model targets a different weight class and is compared against larger or stronger reference models, including OpenAI's GPT-3.5 Turbo and GPT-4o mini.
| Model | MMLU | GSM8K | MATH | IFEval | AlpacaEval 2 LC | Average |
|---|---|---|---|---|---|---|
| OLMo 2 32B Instruct | 77.3 | 87.6 | 49.7 | 85.6 | 42.8 | 68.8 |
| GPT-3.5 Turbo 0125 | 70.2 | 74.3 | 41.2 | 66.9 | 38.7 | 59.6 |
| GPT-4o mini 2024-07-18 | 82.2 | 83.0 | 67.9 | 83.5 | 49.7 | 65.7 |
| Qwen 2.5 32B | 84.7 | 87.5 | 77.9 | 82.4 | 39.1 | 66.5 |
| Llama 3.1 70B | 85.2 | 94.5 | 56.2 | 88.0 | 32.9 | 70.0 |
| Llama 3.3 70B | 85.9 | 93.6 | 71.8 | 90.8 | 36.5 | 73.0 |
These numbers should be read with the usual caveats about contamination, prompting format, and the choice of evaluation harness. The Average column in both Instruct tables is reproduced as published and is not the mean of the columns shown, which suggests it is computed over a wider set of tasks. Ai2 evaluates on its own Open Language Model Evaluation System (OLMES), a 20 task harness specifically designed for the OLMo development cycle. OLMES distinguishes between a smaller pool of development tasks tracked during training and a larger pool of held out tasks used only for final reporting, in an attempt to limit overfitting to the eval suite.
The term "fully open" is used by Ai2 in a specific and narrow sense. A fully open release in the OLMo 2 vocabulary includes the model weights, the complete training data, the training code, the training recipe (hyperparameters, mixture weights, schedule), training logs, and intermediate checkpoints. The intention is that an outside party with sufficient compute could reproduce the entire run and obtain a numerically similar model, or substitute in different data and rerun the recipe to ablate a specific choice. This standard sits noticeably above the "open weights" tier occupied by Llama, Qwen, Mistral and Gemma, all of which release weights but withhold the training data and the precise recipe.
For OLMo 2 the artefact list is unusually complete. The team published the full set of intermediate checkpoints saved every 1000 training steps, the merged Dolmino Mix files used in the annealing stage, the per source filtering scripts, the OLMES evaluation code, and the reward models used for preference tuning. Even relatively minor artefacts such as the configuration files for the merging step were released.
This approach has trade offs. The data and checkpoint releases consume tens of terabytes of storage on Hugging Face, and Ai2 has occasionally had to ask users to clone specific subfolders rather than the entire repository. The training data also exposes Ai2 to copyright and provenance questions that the open weights model providers can sidestep by keeping their corpora private, although the team has been explicit that nothing in OLMo Mix 1124 was scraped from sources that prohibit such use in their robots.txt.
All OLMo 2 weights, datasets, and source code are released under the Apache 2.0 licence. This is a permissive licence that allows commercial use, modification, and redistribution provided the licence notice is preserved. It places no restrictions on the use case or downstream user, unlike Meta's Llama Community Licence which imposes acceptable use policies and a 700 million monthly active user threshold above which the licensee must request a separate commercial agreement.
The Apache 2.0 choice was deliberate and is part of Ai2's broader policy argument for open source AI. In congressional testimony and in policy briefs, Ai2 has used OLMo and OLMo 2 as concrete examples of how high quality language models can be developed and released without the per user licensing terms common among other large labs. The 32B release in particular was framed as proof that fully open models could match closed system performance at academic benchmark scales.
The OLMo 2 family slots into a 2024 to 2025 open model landscape where the main competition came from Meta, Alibaba, Google DeepMind, and the smaller Mistral and Cohere labs. At roughly equivalent scale the closest peers are listed below.
| Model | Parameters | Released | Licence | Training data published | Training code published |
|---|---|---|---|---|---|
| OLMo 2 7B and 13B | 7B, 13B | Nov 2024 | Apache 2.0 | Yes (OLMo Mix 1124) | Yes |
| OLMo 2 32B | 32B | Mar 2025 | Apache 2.0 | Yes (OLMo Mix 1124) | Yes |
| Llama 3.1 8B and 70B | 8B, 70B | Jul 2024 | Llama 3.1 Community | No | No |
| Qwen 2.5 7B, 14B, 32B | 7B, 14B, 32B | Sep 2024 | Apache 2.0 (most) | No | No |
| Gemma 2 9B and 27B | 9B, 27B | Jun 2024 | Gemma Terms of Use | No | No |
| Mistral Small 3 | 24B | Jan 2025 | Apache 2.0 | No | No |
At the 7B parameter level the OLMo 2 7B is broadly competitive with Llama 3.1 8B on general knowledge benchmarks and ahead on GSM8K, but trails Qwen 2.5 7B on mathematics and MMLU. At 13B the OLMo 2 13B is the strongest of the three reference 13B class models on the OLMES suite, though Qwen 2.5 14B catches up on mathematics. The 32B Instruct model is the only fully open model in its weight class as of mid 2025: Llama 3 and Qwen 2.5 release weights without training data or recipes, and the Gemma 2 family tops out at 27B.
In terms of pure benchmark headroom, OLMo 2 32B Instruct lands above the GPT-3.5 Turbo and GPT-4o mini line on aggregate scores but below Llama 3.3 70B and the 2025 frontier reasoning models. The Ai2 framing is that this is the expected position for the first fully open run at the 32B scale, and that the gap should close with later iterations such as OLMo 3.
Reception in the AI research community was generally positive, with particular attention given to the two stage curriculum, the open release of intermediate checkpoints, and the QK normalisation result. The 32B release in March 2025 drew the most coverage because it crossed a symbolic threshold by beating OpenAI's GPT-3.5 Turbo and GPT-4o mini on Ai2's chosen aggregate, while using roughly one third the training compute of Qwen 2.5 32B.
The technical report "2 OLMo 2 Furious" became one of the more widely cited open language model papers of early 2025. Cameron Wolfe's open language model survey, the Interconnects AI policy newsletter run by former Ai2 staffer Nathan Lambert, and the Marginalia and Hugging Face blogs all gave the release favourable writeups. A common observation was that the OLMo 2 13B was the first fully open model that could plausibly be used as a research baseline by labs that had previously relied on Llama 2 or Llama 3 for ablations, because it ships with enough surrounding artefacts to make controlled experiments tractable.
Criticisms were targeted rather than sweeping. The 4096 token context length was widely seen as too short for late 2024, when most other open models had moved to at least 32k tokens. Coding performance lagged the Qwen 2.5 Coder family, and the 7B Instruct trailed Llama 3.1 8B Instruct on MMLU and IFEval. The 32B Instruct was below Qwen 2.5 32B on the harder mathematics benchmarks despite leading on GSM8K. Ai2 acknowledged most of these gaps and pointed to subsequent releases, including the OLMoE mixture of experts model and the later OLMo 3 reasoning family, as the venues where they would be addressed.
OLMo 2 also became a focal point in the policy discussion around open source AI through 2025. Ai2 cited it in submissions to the United States National Telecommunications and Information Administration and the United Kingdom AI Safety Institute as evidence that fully open releases at the 30 billion parameter scale were feasible without producing models that posed obvious uplift to misuse. Whether that framing is correct is a separate debate, but the existence of the OLMo 2 release made it possible to have the debate on the basis of an actually existing artefact rather than a hypothetical one.