OLMo 2

Updated Jun 28, 2026

OLMo 2 is the second generation of fully open large language models released by the Allen Institute for AI (Ai2), spanning 7B, 13B, and 32B parameter sizes. Unlike open weight models such as Llama or Qwen, OLMo 2 is "fully open": Ai2 publishes not just the weights but the complete training data, training code, recipe (hyperparameters, mixture weights, schedule), training logs, and hundreds of intermediate checkpoints, so the entire model can be inspected and reproduced ^[1]^[2]. The 7B and 13B variants were announced on November 26, 2024, and a 32B flagship followed on March 13, 2025; Ai2 describes the 32B as the first fully open model to outperform OpenAI's GPT-3.5 Turbo and GPT-4o mini on a broad academic benchmark suite ^[2]^[3].

The series (sometimes stylised 2 OLMo 2 Furious after the technical report of the same name) ships every artefact needed to reproduce the work under the permissive Apache 2.0 licence ^[1]^[4]. Ai2 defines its openness tier precisely as "models released with weights, training data, code, and evaluation in full, and thus can be fully inspected and reproduced" ^[2]. The 7B and 13B base models were trained on up to 5 trillion tokens and the 32B on up to 6 trillion tokens, with post-training built on Ai2's own Tulu 3 recipe ^[2]^[3].

Ai2 pitched OLMo 2 as a response to the gradual closing of the so-called open-weight ecosystem. Llama, Qwen, Mistral and Gemma had become the default reference models for academic work by 2024, but their training data and recipes remained proprietary, which made it impossible to ablate dataset decisions or audit the provenance of training material. OLMo 2 attempts to close that gap at competitive performance levels: the 7B and 13B variants match or beat Meta's Llama 3.1 8B and Alibaba's Qwen 2.5 7B on several evaluation suites despite using fewer training FLOPs, and the 32B model trails mainly larger open-weight models such as Llama 3.3 70B and frontier-scale closed models on aggregate academic benchmarks ^[1]^[2]^[3].

ELI5: what is OLMo 2 in simple terms?

Most AI labs hand you a finished cake (the model weights) but keep the recipe secret. Ai2 hands you the cake, the recipe, every ingredient, the oven settings, and photos of the cake at every stage of baking. OLMo 2 is that fully open cake: anyone with enough computers can bake an identical one or swap an ingredient to see what changes. The biggest version, OLMo 2 32B, was the first such open cake good enough to beat some of OpenAI's older paid models on school-test-style questions ^[3].

What is OLMo 2 and how did it come about?

The original OLMo, released in February 2024, established Ai2's commitment to releasing models with their full training recipe. That first generation included 1B and 7B base models trained on the Dolma corpus, an open 3-trillion-token web mixture also published by Ai2. Even at launch the 7B was competitive with Meta's Llama 2 7B on most academic benchmarks, but it lagged the newer Llama 3, Mistral 7B and Qwen 1.5 7B that arrived in the following months. The original models also showed instability during long training runs, with intermittent loss spikes that forced engineers to intervene manually and restart from earlier checkpoints. A mid-year refresh called OLMo 1.7 7B delivered a 24-point improvement on the MMLU benchmark by upgrading to a revised dataset (Dolma 1.7) and a 2-trillion-token training run, but the architecture, optimiser and training hyperparameters were largely unchanged from the original release.

OLMo 2 was conceived as a clean redesign. The Ai2 OLMo team had spent most of 2024 cataloguing the sources of instability and inefficiency in the first generation, and used the new release to apply a coordinated set of fixes drawn from the wider open-weight literature. The headline result was a family of base models that the team claimed sat on the Pareto frontier of performance versus training compute, beating contemporaneous open-weight models at equivalent parameter counts while using one third to one half of the training FLOPs ^[1].

Which models are in the OLMo 2 family?

Variant	Parameters	Release date	Training tokens	HF repository
OLMo 2 7B	7 billion	November 26, 2024	up to 5 trillion	`allenai/OLMo-2-1124-7B`
OLMo 2 13B	13 billion	November 26, 2024	up to 5 trillion	`allenai/OLMo-2-1124-13B`
OLMo 2 32B	32 billion	March 13, 2025	up to 6 trillion	`allenai/OLMo-2-0325-32B`

Each base model is accompanied by an SFT checkpoint (supervised fine-tuning only), a DPO checkpoint, and a final Instruct release that adds reinforcement learning. The 7B and 13B Instruct models use PPO with verifiable rewards, while the 32B Instruct model uses Group Relative Policy Optimisation (GRPO), the algorithm popularised by DeepSeek earlier in 2025. Ai2 also released the reward models used in training, for example `allenai/OLMo-2-1124-7B-RM`, so that researchers can replicate or vary the preference-tuning stage independently ^[7]. The naming convention encodes the release month as an MMYY tag in the repository name, so 1124 refers to November 2024 and 0325 to March 2025.
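
The naming convention can be illustrated with a short sketch. The helper `parse_olmo2_repo` and its regular expression are hypothetical: the pattern is inferred only from the repository names quoted in this article, not from any official Ai2 specification, and the day of the month is a placeholder because the tag carries only month and year.

```python
import re
from datetime import date

# Hypothetical pattern inferred from the repository names listed in this
# article: allenai/OLMo-2-<MMYY>-<size>[-<variant>]
_REPO_PATTERN = re.compile(r"^allenai/OLMo-2-(\d{2})(\d{2})-(\d+B)(?:-(.+))?$")


def parse_olmo2_repo(repo_id: str) -> dict:
    """Split an OLMo 2 repository id into release month, size, and variant."""
    match = _REPO_PATTERN.match(repo_id)
    if match is None:
        raise ValueError(f"not an OLMo 2 repository id: {repo_id!r}")
    month, year, size, variant = match.groups()
    return {
        # The tag only encodes month and year; day 1 is a placeholder.
        "released": date(2000 + int(year), int(month), 1),
        "size": size,
        "variant": variant or "base",
    }


for repo in (
    "allenai/OLMo-2-1124-7B",
    "allenai/OLMo-2-0325-32B",
    "allenai/OLMo-2-1124-7B-RM",
):
    print(repo, parse_olmo2_repo(repo))
```

A repository without a trailing variant is treated as the base model, which matches the three base repositories in the table above.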

What architecture does OLMo 2 use?

The OLMo 2 base architecture is a decoder-only Transformer. Rather than the now-standard pre-normalisation configuration, it applies normalisation to the outputs of the attention and feed-forward sublayers, and the team made several other focused changes intended to improve training stability and per-token efficiency relative to the first OLMo generation ^[1].

Spec	OLMo 2 7B	OLMo 2 13B	OLMo 2 32B
Layers	32	40	64
Hidden size	4096	5120	5120
Attention heads	32	40	40
Context length	4096 tokens	4096 tokens	4096 tokens
Training FLOPs	not officially disclosed	4.6 x 10^23	1.3 x 10^24

The most important architectural change was the switch from the non-parametric layer normalisation used in OLMo 1 to RMSNorm, a simpler and slightly cheaper alternative that omits the mean-centring step. Ai2 also added QK normalisation, which applies RMSNorm to the query and key projections inside each attention head before computing attention scores. The OLMo 2 technical report credits this combination with substantially reducing the frequency of loss spikes during long training runs, which had been a recurring operational headache during OLMo 1 training ^[1].
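
A minimal pure-Python sketch shows the difference between the two normalisers and the effect of QK normalisation on a single attention score. It is illustrative only: the function names are hypothetical, the learned gain defaults to ones, and it works on one query and key vector rather than reproducing Ai2's batched implementation.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """RMSNorm: rescale by the root mean square, with no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    weight = weight or [1.0] * len(x)  # learned gain, assumed to be ones here
    return [w * v / rms for w, v in zip(weight, x)]


def layer_norm(x, eps=1e-6):
    """Non-parametric LayerNorm: subtract the mean, then divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def attention_score(q, k, qk_norm=True):
    """Scaled dot-product score for one query/key pair, optionally QK-normalised."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


q = [10.0, -20.0, 30.0, 40.0]
k = [15.0, -25.0, 35.0, 45.0]
print(attention_score(q, k, qk_norm=False))  # scales with the magnitude of q and k
print(attention_score(q, k, qk_norm=True))   # at most sqrt(len(q)) in absolute value
```

With unit gains, each normalised vector has length close to the square root of its dimension, so the score stays bounded however large the raw projections grow. That bound is one plausible reading of why the technique helps stability.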

OLMo 2 keeps the rotary positional embeddings (RoPE) already used in OLMo 1 but raises the RoPE base frequency θ from 10,000 to 500,000. Z-loss regularisation, a trick borrowed from the PaLM family, was added to keep the output logits from growing too large. The team also revisited weight initialisation, choosing a scheme that preserves activation and gradient magnitudes across the depth of the network. Tokenisation moved from the modified GPT-NeoX-20B tokeniser of the original OLMo to a byte-pair-encoding vocabulary of roughly 100k tokens, so the embedding and output projection matrices were trained from scratch rather than warm-started from OLMo 1.
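
The z-loss idea can be sketched for a single token as follows. This is a generic formulation of the auxiliary term, a coefficient times the squared log of the softmax normaliser, and the coefficient `1e-4` is an assumed illustrative value rather than a setting confirmed by this article.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log of the softmax normaliser Z."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(v - peak) for v in logits))


def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    """Return (cross entropy, z-loss penalty) for one token.

    z_coeff is an assumed illustrative value.
    """
    log_z = log_sum_exp(logits)
    cross_entropy = log_z - logits[target]
    z_penalty = z_coeff * log_z ** 2  # grows as log Z drifts away from zero
    return cross_entropy, z_penalty


small = [2.0, 1.0, 0.0]
shifted = [v + 50.0 for v in small]  # same softmax, much larger logits

print(cross_entropy_with_z_loss(small, target=0))
print(cross_entropy_with_z_loss(shifted, target=0))
```

Adding a constant to every logit leaves the cross entropy unchanged but increases the penalty, which is how the term discourages logit drift without altering the predicted distribution.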

Context length was held at 4096 tokens across all three variants for the initial release, a deliberately conservative choice that reflects Ai2's focus on reproducibility and stable evaluation rather than chasing long-context benchmarks. Subsequent OLMo 2 derivatives released in 2025 by third parties extended this to 32k or 65k tokens using YaRN or similar position-interpolation techniques.

How was OLMo 2 trained?

Two-stage curriculum

OLMo 2 introduced a two-stage pretraining curriculum that became one of the most discussed contributions of the project. The first stage uses a broad, web-heavy mixture called OLMo Mix 1124, which comprises roughly 3.9 trillion tokens drawn from the DataComp Language Model (DCLM) corpus, Dolma, the StarCoder code corpus, and Proof Pile II for mathematics ^[1]^[9]. This stage consumes more than 90 percent of the total training budget and runs for slightly more than a single epoch for the 7B, 1.2 epochs for the 13B, and around 1.5 epochs for the 32B.
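
The epoch figures can be cross-checked against the quoted token budgets with a few lines of arithmetic. The sketch below treats the "up to 5 trillion" and "up to 6 trillion" figures from this article as total budgets, which is an assumption; the 7B is left out because no exact epoch count is given for it.

```python
STAGE_ONE_CORPUS_TOKENS = 3.9e12  # OLMo Mix 1124, "roughly 3.9 trillion tokens"


def stage_one_tokens(epochs: float) -> float:
    """Tokens consumed in stage one after the given number of passes over the mix."""
    return epochs * STAGE_ONE_CORPUS_TOKENS


# (epochs, assumed total budget) using the figures quoted in this article.
runs = {"13B": (1.2, 5e12), "32B": (1.5, 6e12)}

for name, (epochs, budget) in runs.items():
    tokens = stage_one_tokens(epochs)
    share = tokens / budget
    print(f"{name}: {tokens / 1e12:.2f}T stage-one tokens, {share:.1%} of the assumed budget")
```

Both shares come out above 90 percent, which is consistent with the statement that stage one dominates the training budget.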

The second stage applies an annealing curriculum on Dolmino Mix 1124, a roughly 843-billion-token corpus composed of about 50 percent high-quality filtered web data combined with academic content, question-answering datasets, instruction-style examples, and mathematics ^[1]^[10]. During this stage the learning rate is annealed from its first-stage value down to zero. Ai2 trained multiple parallel runs on different slices of Dolmino Mix (the 7B used a single 50-billion-token mix, while the 13B and 32B used three 100-billion-token mixes plus one 300-billion-token mix) and then merged the resulting checkpoints, a technique sometimes called model souping, to produce the final base weights. The Ai2 team argues that this curriculum captures the benefit of staged training (sharpening the model on high-quality data near the end) while keeping the bulk of the run on diverse web data where the optimisation dynamics are well understood.
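
Checkpoint merging can be sketched as a uniform average of matching parameters. This is a toy illustration under stated assumptions: checkpoints are plain dictionaries of flat float lists, the function name is hypothetical, and uniform weighting is assumed rather than taken from Ai2's released merging configuration.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters (here: flat lists of floats) by name."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    if any(ckpt.keys() != names for ckpt in checkpoints):
        raise ValueError("checkpoints must share the same parameter names")
    count = len(checkpoints)
    return {
        name: [
            sum(values) / count
            for values in zip(*(ckpt[name] for ckpt in checkpoints))
        ]
        for name in names
    }


# Three toy "annealing runs" that started from the same stage-one weights.
run_a = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
run_b = {"layer0.weight": [3.0, 4.0], "layer0.bias": [0.3]}
run_c = {"layer0.weight": [5.0, 6.0], "layer0.bias": [0.6]}

print(average_checkpoints([run_a, run_b, run_c]))
```

Averaging is only meaningful when the runs share an architecture and a common starting point, which is the situation described above: every annealing run branches from the same stage-one checkpoint.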

Dolma 2 and the open data story

The pretraining corpora were released alongside the model weights under the same Apache 2.0 licence, continuing the practice established by Dolma and the original OLMo ^[9]^[10]. OLMo Mix 1124 supersedes the older Dolma 1.7 corpus and is often referred to colloquially as Dolma 2, though Ai2's official naming uses the dated 1124 suffix. The mix changes the balance of sources relative to Dolma 1.7, with a larger share of high quality filtered web data and a smaller share of patents, peS2o academic papers, and Project Gutenberg books. Ai2 published per source token counts, deduplication scripts, and the exact filtering pipelines, including the Quality Classifier model used to rank web pages.

Training infrastructure

The 7B and 13B models were trained on a mix of Ai2 internal clusters and partner clusters. The 32B was trained on the Augusta cluster, a deployment of 160 nodes with eight H100 GPUs each on Google Cloud, connected by GPUDirect-TCPXO ^[1]. Ai2 reported a sustained training throughput of more than 1800 tokens per second per GPU for the 32B run, corresponding to roughly 38 percent model FLOPs utilisation. The team has stated publicly that "OLMo 2 32B takes only one third of the cost of training Qwen 2.5 32B while reaching similar performance," a comparison that became a recurring talking point in coverage of the release ^[3].
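
The utilisation figure can be sanity-checked with the common rule of thumb of six FLOPs per parameter per training token. The peak throughput of 989 TFLOP/s (dense BF16 on an H100 SXM) is an assumption that does not appear in this article, and the rule of thumb ignores attention FLOPs, so the result is a rough lower estimate rather than a reproduction of Ai2's number.

```python
PARAMS = 32e9                  # nominal parameter count of OLMo 2 32B
TOKENS_PER_SEC_PER_GPU = 1800  # throughput quoted above
H100_PEAK_FLOPS = 989e12       # assumed dense BF16 peak, not stated in the article
GPUS = 160 * 8                 # 160 nodes with 8 GPUs each


def approx_mfu(params, tokens_per_sec, peak_flops):
    """MFU under the 6 * N FLOPs-per-token approximation (attention FLOPs ignored)."""
    return 6 * params * tokens_per_sec / peak_flops


mfu = approx_mfu(PARAMS, TOKENS_PER_SEC_PER_GPU, H100_PEAK_FLOPS)
cluster_tokens_per_sec = GPUS * TOKENS_PER_SEC_PER_GPU

print(f"approximate MFU: {mfu:.1%}")
print(f"cluster throughput: {cluster_tokens_per_sec:,} tokens/s on {GPUS} GPUs")
```

The estimate lands in the mid-30s percent, a little under the reported 38 percent. The gap is plausibly explained by the omitted attention FLOPs and the rounded parameter count.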

How does Tulu 3 post-training work?

The instruction-tuned and aligned variants of OLMo 2 were produced by applying the Tulu 3 post-training recipe, also developed and released by Ai2. Tulu 3 is a permissively licensed alignment pipeline built around three stages: supervised fine-tuning on a curated instruction mixture, preference optimisation with DPO, and a final reinforcement learning stage that Ai2 calls Reinforcement Learning with Verifiable Rewards (RLVR) ^[11]. RLVR replaces the learned reward model used in conventional RLHF with deterministic checkers for tasks where correctness can be verified programmatically, such as mathematical problem solving against ground-truth answers and instruction following against format constraints.
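
Two toy checkers illustrate what a verifiable reward looks like in practice. Both functions are hypothetical stand-ins, not the checkers shipped with Tulu 3: one compares the last number in a completion with a ground-truth answer, and the other enforces a simple sentence-count constraint.

```python
import re


def math_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the last number in the completion equals the reference answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(ground_truth) else 0.0


def format_reward(completion: str, max_sentences: int) -> float:
    """Return 1.0 if the completion has at most max_sentences sentences."""
    sentences = [s for s in re.split(r"[.!?]+", completion) if s.strip()]
    return 1.0 if len(sentences) <= max_sentences else 0.0


print(math_reward("Adding both parts, the total is 1,234.", "1234"))
print(math_reward("The answer is 41", "42"))
print(format_reward("One. Two. Three.", max_sentences=2))
```

Because the checkers are deterministic, the same completion always earns the same reward. That property is what separates this setup from a learned reward model, whose scores can drift or be exploited.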

For the 7B and 13B Instruct models, RLVR was implemented using PPO. The 32B Instruct model switched to GRPO, which removes the value-function critic used by PPO and replaces it with a baseline computed from a group of sampled completions. The 32B Instruct release also used a revised post-training mixture called Tulu 3.1, with improvements to the preference data and the RLVR reward functions ^[3].

The Tulu 3 SFT mixture combines public datasets such as the FLAN collection, OpenAssistant conversations, Tulu's own persona-based synthetic data, and competition mathematics datasets. Ai2 released variant SFT mixtures specifically constructed for OLMo 2 base models, distinct from the original Tulu 3 mixtures, because the strongest mixture turned out to depend on the base model's pretraining distribution.
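
The group baseline can be sketched in a few lines. This follows the commonly described GRPO formulation, in which each reward is standardised against the mean and standard deviation of its own group of samples; whether Ai2's implementation uses exactly this normalisation is an assumption, and the function name is hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise each reward against its own sample group (no critic needed)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions sampled for the same prompt, scored by a verifiable checker.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))

# If every sample earns the same reward, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Completions that beat their group average receive a positive advantage and the rest a negative one, so the policy update needs no separately trained value network.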

How does OLMo 2 perform on benchmarks?

The headline benchmark numbers for the base models, reported by Ai2 in the OLMo 2 technical report and its accompanying blog posts, place the 7B and 13B variants ahead of similarly sized contemporaries on several core academic suites ^[1]^[2].

Benchmark	OLMo 2 7B base	OLMo 2 13B base	OLMo 1 7B	OLMo 1.7 7B
MMLU	63.7	67.5	28.3 (original)	54.0
AGIEval	50.4	54.2	not reported	not reported
GSM8K	not officially listed for base	75.1 (base)	not reported	not reported

For the Instruct variants, Ai2's published comparison tables include side by side scores against Llama 3.1, Qwen 2.5, Gemma 2, and Ai2's own Tulu 3 8B baseline. The OLMo 2 7B Instruct numbers come from the model card on Hugging Face ^[7].

Model	MMLU	GSM8K	MATH	IFEval	BBH	AlpacaEval 2 LC	Average
OLMo 2 7B Instruct	61.3	85.1	32.5	72.3	46.6	29.1	54.8
Tulu 3 8B	68.2	87.6	43.7	82.4	66.0	34.0	60.4
Llama 3.1 8B Instruct	71.3	83.4	42.5	80.6	69.7	25.8	58.9
Qwen 2.5 7B Instruct	76.6	83.8	69.9	74.7	25.3	29.7	57.1
Gemma 2 9B Instruct	69.1	79.7	29.8	69.9	2.5	43.7	51.9

The 32B Instruct model targets a different weight class and is compared against larger or stronger reference models, including OpenAI's GPT-3.5 Turbo and GPT-4o mini ^[3].

Model	MMLU	GSM8K	MATH	IFEval	AlpacaEval 2 LC	Average
OLMo 2 32B Instruct	77.3	87.6	49.7	85.6	42.8	68.8
GPT-3.5 Turbo 0125	70.2	74.3	41.2	66.9	38.7	59.6
GPT-4o mini 2024-07-18	82.2	83.0	67.9	83.5	49.7	65.7
Qwen 2.5 32B	84.7	87.5	77.9	82.4	39.1	66.5
Llama 3.1 70B	85.2	94.5	56.2	88.0	32.9	70.0
Llama 3.3 70B	85.9	93.6	71.8	90.8	36.5	73.0

According to Ai2, "OLMo 2 32B is the first fully-open model (all data, code, weights, and details are freely available) to outperform GPT3.5-Turbo and GPT-4o mini" on a suite of multi-skill academic benchmarks ^[3]. These numbers should be read with the usual caveats about contamination, prompting format, and the choice of evaluation harness. Ai2 evaluates on its own Open Language Model Evaluation System (OLMES), a 20-task harness specifically designed for the OLMo development cycle. OLMES distinguishes between a smaller pool of development tasks tracked during training and a larger pool of held-out tasks used only for final reporting, in an attempt to limit overfitting to the evaluation suite.

How open is OLMo 2?

The term "fully open" is used by Ai2 in a specific and narrow sense. Ai2 reserves it for "models released with weights, training data, code, and evaluation in full, and thus can be fully inspected and reproduced" ^[2]. In the OLMo 2 vocabulary that includes the model weights, the complete training data, the training code, the training recipe (hyperparameters, mixture weights, schedule), training logs, and intermediate checkpoints. The intention is that an outside party with sufficient compute could reproduce the entire run and obtain a numerically similar model, or substitute in different data and rerun the recipe to ablate a specific choice. This standard sits noticeably above the "open weights" tier occupied by Llama, Qwen, Mistral and Gemma, all of which release weights but withhold the training data and the precise recipe.

For OLMo 2 the artefact list is unusually complete. The team published the full set of intermediate checkpoints saved every 1000 training steps, the merged Dolmino Mix files used in the annealing stage, the per source filtering scripts, the OLMES evaluation code, and the reward models used for preference tuning ^[1]^[9]^[10]. Even relatively minor artefacts such as the configuration files for the merging step were released.

This approach has trade-offs. The data and checkpoint releases consume tens of terabytes of storage on Hugging Face, and Ai2 has occasionally had to ask users to clone specific subfolders rather than the entire repository. The training data also exposes Ai2 to copyright and provenance questions that open-weight model providers can sidestep by keeping their corpora private, although the team has been explicit that nothing in OLMo Mix 1124 was scraped from sources that prohibit such use in their robots.txt.

Is OLMo 2 open source, and what licence does it use?

All OLMo 2 weights, datasets, and source code are released under the Apache 2.0 licence ^[1]^[4]. This is a permissive licence that allows commercial use, modification, and redistribution provided the licence notice is preserved. It places no restrictions on the use case or downstream user, unlike Meta's Llama Community Licence which imposes acceptable use policies and a 700 million monthly active user threshold above which the licensee must request a separate commercial agreement.

The Apache 2.0 choice was deliberate and is part of Ai2's broader policy argument for open source AI. In congressional testimony and in policy briefs, Ai2 has used OLMo and OLMo 2 as concrete examples of how high quality language models can be developed and released without the per user licensing terms common among other large labs. The 32B release in particular was framed as proof that fully open models could match closed system performance at academic benchmark scales ^[3].

How does OLMo 2 compare to Llama, Qwen, and Gemma?

The OLMo 2 family slots into a 2024 to 2025 open model landscape where the main competition came from Meta, Alibaba, Google DeepMind, and the smaller Mistral and Cohere labs. At roughly equivalent scale the closest peers are listed below.

Model	Parameters	Released	Licence	Training data published	Training code published
OLMo 2 7B and 13B	7B, 13B	Nov 2024	Apache 2.0	Yes (OLMo Mix 1124)	Yes
OLMo 2 32B	32B	Mar 2025	Apache 2.0	Yes (OLMo Mix 1124)	Yes
Llama 3.1 8B and 70B	8B, 70B	Jul 2024	Llama 3.1 Community	No	No
Qwen 2.5 7B, 14B, 32B	7B, 14B, 32B	Sep 2024	Apache 2.0 (most)	No	No
Gemma 2 9B and 27B	9B, 27B	Jun 2024	Gemma Terms of Use	No	No
Mistral Small 3	24B	Jan 2025	Apache 2.0	No	No

At the 7B parameter level, OLMo 2 7B is broadly competitive with Llama 3.1 8B on general-knowledge benchmarks and ahead on GSM8K, but trails Qwen 2.5 7B on mathematics and MMLU. OLMo 2 13B is the strongest of the three reference 13B-class models on the OLMES suite, though Qwen 2.5 14B catches up on mathematics. The 32B Instruct model is the only fully open model in its weight class as of mid-2025: Llama 3 and Qwen 2.5 release weights only, while Gemma never released a model above 27B in this generation.

In terms of pure benchmark headroom, OLMo 2 32B Instruct lands above the GPT-3.5 Turbo and GPT-4o mini line on aggregate scores but below Llama 3.3 70B and the 2025 frontier reasoning models ^[3]. The Ai2 framing is that this is the expected position for the first fully open run at the 32B scale, and that the gap should close with later iterations such as OLMo 3.

How was OLMo 2 received?

Reception in the AI research community was generally positive, with particular attention given to the two stage curriculum, the open release of intermediate checkpoints, and the QK normalisation result. The 32B release in March 2025 drew the most coverage because it crossed a symbolic threshold by beating OpenAI's GPT-3.5 Turbo and GPT-4o mini on Ai2's chosen aggregate, while using roughly one third the training compute of Qwen 2.5 32B ^[3]^[14].

The technical report "2 OLMo 2 Furious" became one of the more widely cited open language model papers of early 2025. Cameron Wolfe's open language model survey, the Interconnects AI policy newsletter run by former Ai2 staffer Nathan Lambert, and the Marginalia and Hugging Face blogs all gave the release favourable write-ups. A common observation was that OLMo 2 13B was the first fully open model that could plausibly be used as a research baseline by labs that had previously relied on Llama 2 or Llama 3 for ablations, because it ships with enough surrounding artefacts to make controlled experiments tractable.

Criticisms were targeted rather than sweeping. The 4096-token context length was widely seen as too short for late 2024, when most other open models had moved to at least 32k tokens. Coding performance lagged the Qwen 2.5 Coder family, and the 7B Instruct trailed Llama 3.1 8B Instruct on MMLU and IFEval. The 32B Instruct was below Qwen 2.5 32B on the harder mathematics benchmarks despite leading on GSM8K. Ai2 acknowledged most of these gaps and pointed to other releases in the family, including the OLMoE mixture-of-experts model and the later OLMo 3 reasoning family, as the venues where they would be addressed.

OLMo 2 also became a focal point in the policy discussion around open source AI through 2025. Ai2 cited it in submissions to the United States National Telecommunications and Information Administration and the United Kingdom AI Safety Institute as evidence that fully open releases at the 30 billion parameter scale were feasible without producing models that posed obvious uplift to misuse. Whether that framing is correct is a separate debate, but the existence of the OLMo 2 release made it possible to have the debate on the basis of an actually existing artefact rather than a hypothetical one.

References

1. OLMo Team. "2 OLMo 2 Furious." arXiv preprint 2501.00656, January 2025. https://arxiv.org/abs/2501.00656
2. Allen Institute for AI. "OLMo 2: The best fully open language model to date." November 26, 2024. https://allenai.org/blog/olmo2
3. Allen Institute for AI. "OLMo 2 32B: First fully open model to outperform GPT 3.5 and GPT 4o mini." March 13, 2025. https://allenai.org/blog/olmo2-32B
4. Allen Institute for AI. "allenai/OLMo-2-1124-7B." Hugging Face model card. https://huggingface.co/allenai/OLMo-2-1124-7B
5. Allen Institute for AI. "allenai/OLMo-2-1124-13B." Hugging Face model card. https://huggingface.co/allenai/OLMo-2-1124-13B
6. Allen Institute for AI. "allenai/OLMo-2-0325-32B." Hugging Face model card. https://huggingface.co/allenai/OLMo-2-0325-32B
7. Allen Institute for AI. "allenai/OLMo-2-1124-7B-Instruct." Hugging Face model card. https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct
8. Allen Institute for AI. "allenai/OLMo-2-0325-32B-Instruct." Hugging Face model card. https://huggingface.co/allenai/OLMo-2-0325-32B-Instruct
9. Allen Institute for AI. "OLMo-Mix-1124 dataset." Hugging Face. https://huggingface.co/datasets/allenai/olmo-mix-1124
10. Allen Institute for AI. "Dolmino-Mix-1124 dataset." Hugging Face. https://huggingface.co/datasets/allenai/dolmino-mix-1124
11. Lambert, Nathan et al. "Tulu 3: Pushing Frontiers in Open Language Model Post-Training." arXiv preprint 2411.15124, November 2024. https://arxiv.org/abs/2411.15124
12. Allen Institute for AI. "OLMo release notes." https://allenai.org/olmo/release-notes
13. Allen Institute for AI GitHub. "allenai/OLMo." https://github.com/allenai/OLMo
14. MarkTechPost. "Allen Institute for AI (AI2) Releases OLMo 32B." March 14, 2025. https://www.marktechpost.com/2025/03/14/allen-institute-for-ai-ai2-releases-olmo-32b-a-fully-open-model-to-beat-gpt-3-5-and-gpt-4o-mini-on-a-suite-of-multi-skill-benchmarks/
