OLMo
Last reviewed
May 2, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 3,243 words
OLMo (Open Language Model) is a family of fully open large language models developed by the Allen Institute for AI (Ai2), first released in February 2024. Unlike most so-called "open" language models that only release weights, OLMo ships with the complete pretraining data, training code, intermediate checkpoints, training logs, and evaluation suites under a permissive Apache 2.0 license. The project's goal is to make the science of language modeling reproducible, with releases scaling from 1B parameters in early 2024 up to a 32B model that, as of March 2025, was the first fully open model to outperform GPT-3.5 Turbo and GPT-4o mini on a suite of academic benchmarks. The third generation, OLMo 3, was announced on November 20, 2025 and added 7B and 32B reasoning, instruct, and base variants trained on the Dolma 3 corpus.
| Field | Value |
|---|---|
| Developer | Allen Institute for AI (Ai2) |
| Initial release | February 1, 2024 |
| Latest version | OLMo 3 (November 20, 2025); OLMo 3.1 (December 12, 2025) |
| Parameter sizes | 1B, 7B, 13B, 32B (dense); 1B active / 7B total (MoE) |
| Type | Decoder-only transformer language model |
| License | Apache 2.0 |
| Training data | Dolma (v1, v1.5, v1.7), OLMo-Mix-1124, Dolmino-Mix-1124, Dolma 3 |
| Tokenizer | Modified BPE based on GPT-NeoX-20B |
| Source | github.com/allenai/OLMo |
| Weights | huggingface.co/allenai |
The Allen Institute for AI announced OLMo on February 1, 2024 with a paper titled "OLMo: Accelerating the Science of Language Models," led by Dirk Groeneveld and a team of 43 authors. The first release included a 1B model and four 7B variants trained on at least 2 trillion tokens of the Dolma corpus. The release was unusual because it included not only model weights but also the complete training data, training and evaluation code, intermediate checkpoints from every 1,000 training steps, and per-step training logs. This made OLMo one of the few language models that genuinely satisfied the Open Source Initiative's definition of openness when applied to AI systems.
On April 17, 2024, Ai2 released OLMo 1.7-7B (later renamed OLMo 7B 0424). The update extended the context window from 2,048 to 4,096 tokens, switched to the new Dolma 1.7 dataset (2.3 trillion tokens), and used a two-stage training curriculum that decayed the learning rate to zero over a final 50 billion token "annealing" phase on a curated subset. The result was a 24-point jump in MMLU compared to the original OLMo 7B, putting the new model ahead of Llama 2 7B on MMLU and ahead of Llama 2 13B on GSM8K.
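The annealing phase described above can be sketched as a schedule that starts from whatever learning rate the first stage ended on and ramps it to zero over the final 50 billion tokens. This is a minimal sketch, not Ai2's training code: the linear shape of the decay, the function name `annealed_lr`, and the example starting rate are assumptions made for illustration. Only the 50 billion token window comes from the text.

```python
ANNEAL_TOKENS = 50_000_000_000  # length of the final annealing phase (from the text)


def annealed_lr(tokens_into_anneal: float, lr_at_anneal_start: float,
                anneal_tokens: float = ANNEAL_TOKENS) -> float:
    """Learning rate after `tokens_into_anneal` tokens of the final phase.

    Assumes a linear ramp to zero; the text only says the rate decays to zero.
    """
    if tokens_into_anneal < 0:
        raise ValueError("tokens_into_anneal must be non-negative")
    # Fraction of the annealing window still ahead, clamped at zero.
    remaining = max(0.0, 1.0 - tokens_into_anneal / anneal_tokens)
    return lr_at_anneal_start * remaining


if __name__ == "__main__":
    start_lr = 3e-4  # hypothetical value, chosen only for the demo
    for tokens in (0, 25_000_000_000, 50_000_000_000):
        print(tokens, annealed_lr(tokens, start_lr))
```

Expressing the schedule in tokens rather than optimizer steps keeps the sketch independent of batch size, which the text does not give for this release.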
A further refresh followed on July 31, 2024 as OLMo 7B 0724, scoring 52 on MMLU and improving GSM8K and code benchmarks through additional data mixing and post-training work. On September 4, 2024, Ai2 and Contextual AI released OLMoE (1B active / 7B total), the first sparse mixture-of-experts entry in the family.
The second generation, OLMo 2, launched on November 26, 2024 with 7B and 13B base and instruct models trained on up to 5 trillion tokens. The 32B variant arrived on March 13, 2025 and the 1B model on May 1, 2025. OLMo 3 followed on November 20, 2025 with 7B and 32B Base, Think, Instruct, and RL Zero variants, plus a 65K token context window, sixteen times the context length of OLMo 2. An incremental OLMo 3.1 update appeared on December 12, 2025 with extended reinforcement learning training for the 32B Think and Instruct models.
The distinguishing feature of OLMo is what the team calls "truly open" rather than merely "open weight." Most widely cited "open source" models, including Llama 2 and 3 from Meta and the Mistral base models, release weights under custom licenses without disclosing the training data, the data filtering code, or the training infrastructure. OLMo releases all of those, plus the optimizer state at intermediate checkpoints and the Weights and Biases training logs that record loss curves, gradient norms, and hyperparameter settings step by step.
This matters for two reasons. First, it lets outside researchers reproduce the work or run scientific experiments that need the full pipeline, for example studying how data composition affects emergent capabilities or auditing the training corpus for memorized personally identifiable information. Second, it puts OLMo on the right side of the Open Source Initiative's draft Open Source AI Definition, which requires release of "data information" sufficient for a skilled person to recreate a substantially equivalent system. Most models marketed as open fail that test.
The Apache 2.0 license applied to all OLMo artifacts is permissive enough to allow commercial use, redistribution, and derivative works, with no field-of-use restrictions of the kind that appear in the Llama Community License.
| Model | Release | Params | Training tokens | Context | License | Notes |
|---|---|---|---|---|---|---|
| OLMo 1B | Feb 2024 | 1B | 2T | 2,048 | Apache 2.0 | Initial release |
| OLMo 7B | Feb 2024 | 7B | 2.46T | 2,048 | Apache 2.0 | Four variants on different hardware |
| OLMo 7B Twin 2T | Feb 2024 | 7B | 2T | 2,048 | Apache 2.0 | Reproducibility experiment |
| OLMo 7B 0424 (1.7) | Apr 2024 | 7B | 2.05T (Dolma 1.7) | 4,096 | Apache 2.0 | Two stage curriculum, 24 point MMLU gain |
| OLMo 7B 0724 | Jul 2024 | 7B | ~2.75T | 4,096 | Apache 2.0 | MMLU 52, improved fine tunes |
| OLMoE 1B-7B-0924 | Sep 2024 | 1B active / 7B total | 5.1T | 4,096 | Apache 2.0 | Mixture of experts, 64 experts, 8 active |
| OLMo 2 7B | Nov 2024 | 7B | ~4T | 4,096 | Apache 2.0 | Two stage training on OLMo-Mix and Dolmino |
| OLMo 2 13B | Nov 2024 | 13B | ~5T | 4,096 | Apache 2.0 | Outperforms Qwen 2.5 14B Instruct |
| OLMo 2 32B | Mar 2025 | 32B | 6T | 4,096 | Apache 2.0 | First fully open model to beat GPT-3.5 and GPT-4o mini |
| OLMo 2 1B | May 2025 | 1B | 4T | 4,096 | Apache 2.0 | Smallest OLMo 2; beats Gemma 3 1B and Llama 3.2 1B |
| OLMo 3 7B / 32B | Nov 2025 | 7B, 32B | ~5.9T (Dolma 3 Mix) | 65,536 | Apache 2.0 | Base, Think, Instruct, RL Zero variants |
| OLMo 3.1 32B | Dec 2025 | 32B | (extended RL) | 65,536 | Apache 2.0 | Reasoning gains on AIME, IFBench |
Each release also ships matching SFT, DPO, RM, and Instruct (RLVR) checkpoints under the same license.
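The OLMoE row in the table above lists 64 experts with 8 active per token. A minimal sketch of what that sparsity means at the router: score every expert, keep the top 8, and weight their outputs by the router probabilities. This is generic top-k routing, not OLMoE's actual implementation; the function name `route_token` and the choice to leave the kept weights unnormalized are assumptions.

```python
import math
import random

NUM_EXPERTS = 64  # from the OLMoE table row
TOP_K = 8         # experts active per token, from the same row


def route_token(router_logits: list[float], top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts; the rest are never evaluated.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    chosen = route_token(logits)
    print(f"{len(chosen)} of {NUM_EXPERTS} experts run for this token")
```

Because only the selected experts run, the compute per token tracks the active parameter count rather than the total, which is the sense in which the model is "1B active / 7B total".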
OLMo is a decoder-only transformer in the same family as GPT-2 and Llama, with several specific design choices documented in the original paper. The 7B base model has 32 layers, a hidden size of 4,096, and 32 attention heads. The 1B uses 16 layers, a hidden size of 2,048, and 16 heads. Across sizes the architecture removes all bias terms (a choice borrowed from PaLM that is said to improve training stability), uses non-parametric layer normalization, applies the SwiGLU activation in the feed-forward network, and replaces absolute positional embeddings with RoPE (rotary positional embeddings). The vocabulary is 50,280 tokens, drawn from a modified BPE tokenizer based on GPT-NeoX-20B that masks personally identifiable information.
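The design choices above make the parameter count easy to estimate, since a bias-free block with non-parametric layer norm and RoPE has weights only in its attention and feed-forward projections. The sketch below is an illustration under stated assumptions: the layer counts, hidden sizes, and vocabulary come from the text, while the feed-forward inner widths (11,008 and 8,192) and the weight-tying choices are assumptions not stated in this article.

```python
def count_params(n_layers: int, d_model: int, ffn_dim: int,
                 vocab_size: int, tied_embeddings: bool) -> int:
    """Weight count for a bias-free decoder with SwiGLU and non-parametric layer norm."""
    attention = 4 * d_model * d_model      # Q, K, V and output projections
    feed_forward = 3 * d_model * ffn_dim   # SwiGLU: gate, up and down projections
    per_layer = attention + feed_forward   # no biases, no layer-norm weights
    embeddings = vocab_size * d_model
    output_head = 0 if tied_embeddings else vocab_size * d_model
    return n_layers * per_layer + embeddings + output_head


# Layers, hidden size and vocabulary are from the text;
# ffn_dim and tied_embeddings are assumptions for illustration.
OLMO_7B = count_params(n_layers=32, d_model=4096, ffn_dim=11008,
                       vocab_size=50280, tied_embeddings=False)
OLMO_1B = count_params(n_layers=16, d_model=2048, ffn_dim=8192,
                       vocab_size=50280, tied_embeddings=True)

if __name__ == "__main__":
    print(f"7B estimate: {OLMO_7B / 1e9:.2f}B parameters")
    print(f"1B estimate: {OLMO_1B / 1e9:.2f}B parameters")
```

Under these assumptions the estimates come out at roughly 6.9 billion and 1.2 billion weights, consistent with the model names. The number of attention heads does not appear in the formula because splitting the hidden size across heads does not change the size of the projection matrices.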
OLMo 2 reworked several stability features after the original models showed loss spikes during long training runs. The team replaced non-parametric layer norm with RMSNorm, reordered the layer norm position relative to the residual stream, added QK-Norm to the attention computation, and added a Z-loss regularization term on the output logits. Initialization scales for activations and gradients were retuned. None of these are headline architectural innovations on their own, but the bundle made it possible to train the 13B and 32B models without instability, something the earlier recipe had not managed at that scale.
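Three of the stability changes named above are small enough to write out. The following is a plain-Python sketch of the standard definitions of RMSNorm, QK-Norm, and Z-loss, not code from the OLMo repository. The epsilon, the Z-loss coefficient of 1e-4, and the use of RMSNorm inside QK-Norm are assumptions chosen for illustration.

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """Scale x to unit root-mean-square, then apply a learned per-channel gain."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def qk_norm_score(q: list[float], k: list[float],
                  q_gain: list[float], k_gain: list[float]) -> float:
    """Attention logit computed from normalized query and key vectors."""
    qn = rms_norm(q, q_gain)
    kn = rms_norm(k, k_gain)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))


def z_loss(logits: list[float], coeff: float = 1e-4) -> float:
    """Penalty on the squared log-partition function of the output logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return coeff * log_z * log_z


if __name__ == "__main__":
    ones = [1.0, 1.0]
    print(rms_norm([3.0, 4.0], ones))
    print(qk_norm_score([100.0, 0.0], [100.0, 0.0], ones, ones))
    print(z_loss([2.0, 1.0, 0.5]))
```

Normalizing queries and keys bounds the size of the attention logits regardless of how large the raw activations grow, and the Z-loss term discourages the output logits from drifting to large magnitudes. Both target the kind of loss spikes the paragraph describes.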
OLMo 3 keeps the dense decoder-only design but adds long-context support up to 65,536 tokens through extended position embedding scaling and continued pretraining on long documents. The Think variants add chain-of-thought-style reasoning traces during post-training rather than at the architecture level.
Dolma is Ai2's open pretraining corpus and the source of every OLMo model's pretraining data. The first public Dolma release (August 2023) contained roughly 3 trillion tokens drawn from seven sources, curated down from about 200 TB of raw text to roughly 11 TB after filtering. The dominant source is web text from Common Crawl, supplemented by C4, with code from The Stack (permissively licensed GitHub repositories), academic papers from peS2o (a Semantic Scholar derivative), books from Project Gutenberg, social media from Reddit, and encyclopedic content from English Wikipedia and Wikibooks.
| Source | Documents | Tokens |
|---|---|---|
| Common Crawl (web) | 3.4B | 2,006B |
| The Stack (code) | 210M | 411B |
| C4 (web) | 365M | 174B |
| Reddit (social media) | 377M | 80B |
| peS2o (papers) | 38.8M | 57B |
| Project Gutenberg | 0.05M | 5.3B |
| Wikipedia / Wikibooks | 6.2M | 3.7B |
Dolma 1.7, used for OLMo 7B 0424, expanded to 2.3 trillion tokens with additional sources including Refined Web, OpenWebMath, Stack Exchange, and StarCoder. The OLMo 2 7B and 13B models trained on OLMo-Mix-1124 (3.9T tokens) for the bulk of training, then on Dolmino-Mix-1124 (843B tokens) for a final stage that mixes 50 percent high-quality web data with academic content, math, instruction-style data, and other curated sources. OLMo 3 trains on Dolma 3 Mix (about 5.9T tokens) drawn from the larger 9.3-trillion-token Dolma 3 corpus, with heavier representation of code, math, and scientific PDFs.
The training pipeline that produced these mixes is itself open source as the dolma toolkit on GitHub, including the language identification, quality classifiers, deduplication code, and content filters. This is unusual; most labs treat their data filtering pipeline as a trade secret.
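To make the stages named above concrete, here is a toy pipeline with one stand-in for each kind of step: a word-count heuristic in place of a quality classifier, exact-match hashing in place of deduplication, and an email regex in place of a content filter. It does not use the dolma toolkit's API and is far simpler than a production pipeline. The function names, the threshold, and the `<EMAIL>` placeholder are all hypothetical.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def filter_corpus(docs: list[str], min_words: int = 5) -> list[str]:
    """Drop short and duplicate documents, then mask email addresses."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        # Stand-in quality filter: very short documents are discarded.
        if len(doc.split()) < min_words:
            continue
        # Stand-in deduplication: exact match on the normalized text.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Stand-in content filter: replace email addresses with a placeholder.
        kept.append(EMAIL_RE.sub("<EMAIL>", doc))
    return kept


if __name__ == "__main__":
    sample = [
        "Too short",
        "Contact me at a@b.com for the full data set",
        "contact me at  A@B.com for the full data set",
        "A second document with enough words in it",
    ]
    for line in filter_corpus(sample):
        print(line)
```

Ordering the cheap checks first means the hash and regex only run on documents that survive the length filter, a pattern that matters more as the corpus grows toward the terabyte scale described earlier.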
The original OLMo 7B was trained in two parallel runs on different hardware to test reproducibility. One run used up to 256 nodes of the LUMI supercomputer in Finland, each node containing 4 AMD MI250X GPUs (128 GB memory each). The other used 27 nodes of an NVIDIA A100 (40 GB) cluster provided by MosaicML. Both runs used PyTorch's Fully Sharded Data Parallel framework with the ZeRO optimizer strategy and bfloat16 mixed precision. Batch sizes were around 4 million tokens. The fact that the two runs produced models with similar quality is one of the demonstrations the OLMo paper offers for the reproducibility claim.
Later releases moved to NVIDIA H100 hardware. OLMoE was trained on 256 H100 GPUs. OLMo 2 32B was trained on Augusta, a 160 node Google Cloud AI Hypercomputer with 8 H100 GPUs per node connected via GPUDirect-TCPXO, achieving about 1,800 tokens per second per GPU and roughly 38 percent model FLOPs utilization. The team reports that OLMo 2 32B reached comparable performance to Qwen 2.5 32B at roughly one third of the training compute cost.
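The throughput and utilization figures above can be related with a back-of-the-envelope calculation. The sketch below uses the common approximation of 6 FLOPs per parameter per training token and an assumed dense BF16 peak of 989 TFLOP/s for an H100. Neither the approximation nor the peak figure comes from the article, and the function name is hypothetical.

```python
def model_flops_utilization(n_params: float, tokens_per_sec_per_gpu: float,
                            peak_flops_per_gpu: float) -> float:
    """MFU under the approximation of 6 FLOPs per parameter per training token."""
    achieved = 6.0 * n_params * tokens_per_sec_per_gpu
    return achieved / peak_flops_per_gpu


H100_PEAK_BF16 = 989e12  # assumed dense BF16 peak for one H100, in FLOP/s

# Parameter count and per-GPU throughput are the figures quoted in the text.
mfu = model_flops_utilization(n_params=32e9, tokens_per_sec_per_gpu=1800,
                              peak_flops_per_gpu=H100_PEAK_BF16)

# Nodes x GPUs per node x per-GPU throughput, all from the text.
cluster_tokens_per_sec = 160 * 8 * 1800

if __name__ == "__main__":
    print(f"estimated MFU: {mfu:.1%}")
    print(f"cluster throughput: {cluster_tokens_per_sec:,} tokens/s")
```

With these inputs the estimate is about 35 percent, a little under the reported 38 percent. The 6-FLOPs-per-parameter rule leaves out the attention score computation, which is one reason a simple estimate can land below a reported figure.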
Post-training (instruction tuning, preference optimization, and reinforcement learning) for OLMo is handled by the Tulu family, also from Ai2. The current standard is Tulu 3, released in November 2024 alongside the OLMo 2 base models. Tulu 3 introduced a recipe with three stages: supervised fine-tuning on a curated 939,344-prompt mix (57 percent public, 43 percent synthetic with persona conditioning), Direct Preference Optimization (DPO) on a preference pair dataset, and a novel Reinforcement Learning with Verifiable Rewards (RLVR) stage for tasks like math, code, and instruction following where correctness can be checked automatically.
The RLVR stage specifically improved GSM8K, MATH, and IFEval scores for the 7B and 13B Instruct models. A 405B Tulu 3 model was released in January 2025 as a demonstration that the recipe scales. For OLMo 3 the post-training corpus was renamed Dolci and split into separate mixes for SFT, DPO, and RLVR. Tulu 3 also serves as a standalone post-training recipe that can be applied to other base models, and the codebase (open-instruct) is widely used by researchers who do not want to start their post-training pipeline from scratch.
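The second and third stages of the recipe each reduce to a small scoring function. The sketch below shows the standard DPO loss for a single preference pair and a toy verifiable reward that checks the final number in a math answer. It is not taken from open-instruct: the function names, the `beta` value of 0.1, the reward of 1.0, and the answer-extraction regex are assumptions for illustration.

```python
import math
import re


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed log-probabilities."""
    # How much more the policy prefers the chosen response than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    # Not guarded against overflow; adequate for the small inputs of a demo.
    return math.log1p(math.exp(-beta * margin))


def verifiable_reward(completion: str, gold_answer: str, reward: float = 1.0) -> float:
    """Return `reward` if the last number in the completion equals the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return reward if float(numbers[-1]) == float(gold_answer) else 0.0


if __name__ == "__main__":
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
    print(verifiable_reward("Adding both gives a total of 1,234.", "1234"))
    print(verifiable_reward("The answer is 7", "8"))
```

The contrast between the two functions is the point of RLVR: DPO needs a dataset of human or model preferences, while the verifiable reward needs only a reference answer and a checker, so it applies wherever correctness can be tested automatically.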
Reported benchmark numbers from the OLMo 2 7B model card give a sense of how the family performs on standard academic evaluations. These are zero-shot or few-shot scores on the base models, except for the 32B column, which reports the Instruct model:
| Benchmark | OLMo 2 7B | OLMo 2 13B | OLMo 2 32B (Instruct) |
|---|---|---|---|
| MMLU | 63.7 | 67.5 | 78.0 (approximate) |
| GSM8K | 67.5 | 75.1 | 78.4 |
| ARC Challenge | 79.8 | 83.5 | (not reported) |
| HellaSwag | 83.8 | 86.4 | (not reported) |
| DROP | 60.8 | 63.4 | (not reported) |
| Natural Questions | 36.9 | 46.5 | (not reported) |
| TriviaQA | 78.0 | 81.9 | (not reported) |
| AGIEval | 50.4 | 54.2 | (not reported) |
| MMLU-Pro | 31.0 | 35.1 | 53.9 |
For OLMo 3 Think 32B, Ai2 reports MATH 96.1, AIME 2024 76.8, BigBenchHard 89.8, HumanEval+ 91.4, and IFEval 89.0, putting the model in the same ballpark as much larger frontier reasoning models.
The historically interesting data point is the original OLMo 7B from February 2024, which scored only 28.3 on MMLU. The 24-point jump to roughly 52 in OLMo 7B 0424 came almost entirely from data and training improvements rather than architectural changes, and is often cited as evidence that data quality dominates capability gains at the 7B scale.
| Model | Params | License | Open data | Open code | Open checkpoints | MMLU |
|---|---|---|---|---|---|---|
| OLMo 2 7B | 7B | Apache 2.0 | Yes | Yes | Yes (every 1k steps) | 63.7 |
| OLMo 2 13B | 13B | Apache 2.0 | Yes | Yes | Yes | 67.5 |
| OLMo 2 32B | 32B | Apache 2.0 | Yes | Yes | Yes | ~78 |
| Llama 3.1 8B | 8B | Llama 3.1 Community License | No | Partial | No | 61.8 |
| Mistral 7B | 7B | Apache 2.0 | No | No | No | 60.1 |
| Mistral Nemo 12B | 12B | Apache 2.0 | No | No | No | 66.9 |
| Pythia 6.9B | 6.9B | Apache 2.0 | Yes (The Pile) | Yes | Yes | 25.7 |
| Qwen 2.5 7B | 7B | Apache 2.0 / Tongyi Qianwen | No | Partial | No | 74.2 |
| Falcon 7B | 7B | Apache 2.0 | Partial | Partial | No | 27.8 |
Of the well-known open-weight families, only Pythia from EleutherAI matches OLMo on the openness axis. Pythia was released in 2023 specifically as a research artifact and tops out at 12B parameters, so OLMo 2 32B is the largest fully open language model. Llama and Mistral release weights and inference code but treat their training data and training code as proprietary, which is why they show "No" or "Partial" in the open data, open code, and open checkpoints columns of the comparison.
OLMo has been treated by parts of the research community as the open language model the field actually needs, even when its raw benchmarks lag behind closed or weight-only competitors. The original release in February 2024 was covered as a milestone by GeekWire and the academic press, partly because it was published just as Meta was preparing the Llama 3 launch and partly because Ai2 explicitly framed the release in terms of openness rather than benchmark records.
The Open Source Initiative's Open Source AI Definition, finalized in late 2024, made the contrast between OLMo-style "data information available" releases and Llama-style "weights only" releases central to the policy debate over what "open AI" should mean. OLMo, Pythia, and a handful of other releases were the main worked examples of compliance.
For practitioners, OLMo's value is not always the model itself but the artifacts around it. The Tulu 3 post-training recipe and the open-instruct codebase have been adopted by other groups doing post-training on third-party base models. The Dolma toolkit is used as a reference implementation for data filtering pipelines. The training logs and intermediate checkpoints have appeared in dozens of follow-up papers studying loss curves, emergent capabilities, and data efficiency.
OLMo 2 32B, released in March 2025, was widely noted as the first fully open model to outperform GPT-3.5 Turbo and GPT-4o mini on a multi-skill academic benchmark suite. The release got coverage in MarkTechPost, AIwire, and the Ai2 blog as a proof point that an open lab on a comparatively modest budget could reach the capability tier that closed labs occupied two years earlier.
Funding for the project comes from Ai2, the nonprofit research institute founded in 2014 by Microsoft cofounder Paul Allen and now headquartered in Seattle. The institute's broader mandate is open research in AI for the common good, and OLMo is the language model leg of a portfolio that also includes the Molmo multimodal models, the Aristo science reasoning system, and the Semantic Scholar academic search engine.