OLMoE
Last reviewed
May 16, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 3,616 words
OLMoE (Open Mixture-of-Experts) is a sparse Mixture of Experts large language model released by the Allen Institute for AI (Ai2) on September 3, 2024. The first release, OLMoE-1B-7B-0924, has 6.9 billion total parameters but activates only about 1.3 billion of them on any given token, by routing through 8 of 64 small experts in each transformer layer. Ai2 described it as the first fully open mixture of experts model, meaning the release includes not only the weights but also the pretraining data mixture, training and evaluation code, the Weights and Biases training logs, hundreds of intermediate checkpoints, and the post training data and scripts, all under the Apache 2.0 license. At launch the base model matched or beat all other open models in its active parameter weight class on standard academic benchmarks, and the Instruct variant tied or surpassed several much larger dense and sparse models including Llama 2 13B Chat and DeepSeekMoE 16B Chat.
The model arrived during a year in which sparse MoE designs had become the dominant approach for frontier language models, with Mixtral 8x7B from Mistral, DBRX from Databricks, Snowflake Arctic, Qwen 1.5 MoE, DeepSeek V2, and Grok 1 from xAI all using some variant of the architecture. None of those releases included the full training recipe or the underlying pretraining corpus, which made controlled scientific work on MoE design choices effectively impossible outside the labs that owned the models. OLMoE was framed as a deliberate counter to that pattern, an attempt to produce a competitive sparse model that could also serve as a research artifact for the wider community.
The Allen Institute for AI began the OLMo program in early 2024 with the goal of publishing fully reproducible large language models. The original OLMo release in February 2024 included a 1 billion and a 7 billion parameter dense decoder only transformer trained on the Dolma corpus, alongside training code, evaluation harnesses, and intermediate checkpoints. A refresh called OLMo 1.7-7B (April 2024) and a further update called OLMo 7B 0724 (July 2024) tightened the architecture and added more training tokens, lifting the 7B model's MMLU score from 28 at launch to 54.9 by July.
By mid 2024, sparse mixture of experts models had become the standard recipe for high capacity language models without proportional inference cost. Mixtral 8x7B in December 2023 was the first widely available open weight MoE, followed in 2024 by DBRX (132 billion total, 36 billion active), Snowflake Arctic (480 billion total, 17 billion active), Qwen 1.5 MoE A2.7B, DeepSeek V2 (236 billion total, 21 billion active), and Grok 1 (314 billion total, 86 billion active). In each case the released artifact was either the weights alone, under Apache 2.0 or a custom license, or, in DeepSeek's case, weights and a high level model card but no training data, no exact reproduction recipe, and limited training infrastructure detail. Several research groups working on MoE routing and load balancing complained openly that this state of affairs made replication studies impossible without industrial scale compute.
OLMoE was conceived to fill that gap. The project was led by Niklas Muennighoff during his time at Ai2, in collaboration with Contextual AI and the wider OLMo team, including Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Nathan Lambert, and Hannaneh Hajishirzi. The technical report, posted to arXiv as 2409.02060 on September 3, 2024, described both the trained model and a long series of ablations on routing strategies, expert granularity, load balancing losses, and shared versus routed expert layouts, with the goal of producing a recipe other groups could use.
| Field | Value |
|---|---|
| Developer | Allen Institute for AI with Contextual AI |
| Initial release | September 3, 2024 |
| Model name | OLMoE-1B-7B-0924 |
| Architecture | Sparse Mixture of Experts decoder only transformer |
| Total parameters | 6.9 billion |
| Active parameters | approximately 1.3 billion per token |
| Experts per layer | 64 |
| Active experts per token | 8 (top-8 routing) |
| Layers | 16 |
| Hidden size | 2048 |
| Attention heads | 16 |
| Context length | 4096 tokens |
| Vocabulary | 50,304 (modified GPT-NeoX BPE) |
| Training tokens | approximately 5.1 trillion |
| Training data | OLMoE-mix-0924 (DCLM Baseline plus Dolma 1.7 components) |
| License | Apache 2.0 |
| Paper | arXiv:2409.02060 |
| Repository | github.com/allenai/OLMoE |
OLMoE-1B-7B is a 16 layer decoder only transformer in the pre normalization configuration. Each layer carries a multi head self attention block with 16 attention heads and a hidden size of 2048, then replaces the standard dense feed forward block with a routed mixture of experts module. There are 64 expert FFNs per layer, each with an intermediate dimension of 1024, and a learned router picks the top 8 experts for each token. The token's representation is passed through those 8 experts, and their outputs are combined in a sum weighted by the router scores.
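The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not Ai2's implementation: the function names are hypothetical, the experts are stand-in callables rather than real FFNs, and the sketch assumes the softmax weights of the selected experts are used as is, without renormalizing them to sum to one.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of router logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=8):
    """Return [(expert_index, weight), ...] for the k highest scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in chosen]

def moe_layer(token, router_logits, experts, k=8):
    """Weighted sum of the outputs of the k selected experts.

    `experts` is a list of callables mapping a vector to a vector.
    Assumption: weights are not renormalized after the top-k cut.
    """
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Toy usage: 64 stand-in "experts" that simply scale the input by their index.
random.seed(0)
experts = [lambda v, s=i: [s * x for x in v] for i in range(64)]
logits = [random.gauss(0.0, 1.0) for _ in range(64)]
token = [1.0, -2.0, 0.5]

selected = route_top_k(logits, k=8)
output = moe_layer(token, logits, experts, k=8)
print("selected experts:", [i for i, _ in selected])
```

Only the 8 selected callables are ever invoked for a token, which is the source of the compute saving; the other 56 experts contribute nothing to that token's output.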
The sparse activation pattern is what produces the 1B active versus 7B total split. A dense forward pass through one layer touches the full FFN, but in OLMoE only 8 of the 64 expert FFNs participate, so the FFN parameters used per token are one eighth of the total. Together with the attention and embedding parameters, which every token uses, across 16 layers at a hidden size of 2048, this yields about 1.3 billion parameters touched per token versus 6.9 billion stored on disk. Per token compute is therefore far lower than for a dense model with the same 6.9 billion total parameters, though the full parameter set must still be held in memory.
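The active versus total split can be checked with back of the envelope arithmetic from the shape figures quoted in this article. The sketch below rests on assumptions the article does not state: each expert is a gated FFN with three weight matrices, attention uses four square projection matrices per layer, input and output embeddings are untied, and normalization weights and biases are ignored.

```python
# Shape figures quoted in this article.
D_MODEL = 2048      # hidden size
N_LAYERS = 16
N_EXPERTS = 64      # experts per layer
TOP_K = 8           # experts used per token
D_EXPERT = 1024     # expert intermediate dimension
VOCAB = 50_304

# Assumptions (not stated in the article): gated FFN with three matrices,
# four square attention projections, untied embeddings, no norms or biases.
expert = 3 * D_MODEL * D_EXPERT
attention = 4 * D_MODEL * D_MODEL
router = D_MODEL * N_EXPERTS
embeddings = 2 * VOCAB * D_MODEL

def param_count(experts_used):
    """Parameters touched when `experts_used` experts run in every layer."""
    per_layer = attention + router + experts_used * expert
    return N_LAYERS * per_layer + embeddings

total = param_count(N_EXPERTS)   # every expert counted: stored parameters
active = param_count(TOP_K)      # only the routed experts: per token parameters

print(f"total  ~ {total / 1e9:.2f}B")
print(f"active ~ {active / 1e9:.2f}B")
```

Under these assumptions the estimate comes to roughly 6.92 billion total and 1.28 billion active parameters, in line with the 6.9 billion and 1.3 billion figures quoted for the model.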
The MoE design follows a then current best practice of using many small experts rather than a few large ones, sometimes called fine grained MoE. The 64 expert layout is much finer grained than the 8 experts used by Mixtral 8x7B and in line with the 64 routed experts plus 2 shared experts of DeepSeekMoE 16B, released earlier in 2024. The Ai2 team ran an extensive ablation campaign to settle on these numbers, comparing 8, 16, 32, and 64 experts and different top-k values; the report concluded that 64 experts with top-8 routing gave the best balance between sparsity, load balance, and downstream accuracy at the 1B active scale.
Routing uses standard softmax gated top-k selection without shared experts. Load balancing is encouraged by a router auxiliary loss weighted at 0.01 in the training objective, which penalizes uneven token distribution across experts. The model uses rotary position embeddings with a base frequency of 10,000, SiLU activations in the expert FFNs, and the same modified GPT-NeoX BPE tokenizer used by previous OLMo releases, with a vocabulary of 50,304 tokens. Maximum context length is 4096 tokens, the same as OLMo 7B 0724.
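To make the load balancing term concrete, here is a minimal sketch of one common formulation: the number of experts multiplied by the sum, over experts, of the fraction of routing assignments an expert receives times its mean router probability. This specific formula is an assumption for illustration; the article only states that an auxiliary loss weighted at 0.01 penalizes uneven token distribution, and the function name is hypothetical.

```python
import math

AUX_WEIGHT = 0.01  # weight of the auxiliary loss quoted in this article

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(all_logits, k=8):
    """Assumed formulation: n_experts * sum_i f_i * p_i.

    f_i = share of the (tokens * k) routing assignments sent to expert i
    p_i = mean router probability for expert i over the batch
    """
    n_tokens = len(all_logits)
    n_experts = len(all_logits[0])
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in all_logits:
        probs = softmax(logits)
        top = sorted(range(n_experts), key=lambda i: probs[i], reverse=True)[:k]
        for i in top:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    f = [c / (n_tokens * k) for c in counts]
    p_mean = [s / n_tokens for s in prob_sums]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p_mean))

# Balanced toy batch: token t prefers its own block of 8 experts.
balanced = [[20.0 if i // 8 == t else 0.0 for i in range(64)] for t in range(8)]
# Collapsed toy batch: every token prefers the same 8 experts.
collapsed = [[20.0 if i < 8 else 0.0 for i in range(64)] for _ in range(8)]

balanced_loss = load_balancing_loss(balanced)
collapsed_loss = load_balancing_loss(collapsed)
print(AUX_WEIGHT * balanced_loss, AUX_WEIGHT * collapsed_loss)
```

In this toy setup the balanced batch scores about 1 and the collapsed batch about 8 (64 experts divided by 8 active), so the weighted term added to the training objective grows as routing concentrates on fewer experts.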
OLMoE was trained on a dataset called OLMoE-mix-0924, a roughly 5.1 trillion token corpus that combines DCLM Baseline (Apple and the DataComp-LM team's curated web mix) with the higher quality components of Dolma 1.7. The mix leans heavily on filtered web pages but also includes code from the StarCoder corpus, mathematics from Proof Pile II and OpenWebMath, academic papers from peS2o, and books from Project Gutenberg and similar sources. Ai2 published per source token counts, the filtering and deduplication scripts, and the exact mixing ratios alongside the model. This dataset choice was a departure from earlier OLMo releases, which had used pure Dolma corpora; the team has been explicit that adding DCLM Baseline accounted for a substantial share of the MMLU gain over OLMo 7B 0724.
The main pretraining run used 128 NVIDIA H100 GPUs and processed 5.133 trillion tokens, corresponding to roughly 1.3 epochs over OLMoE-mix-0924. After this run, the model was further annealed for 100 billion additional tokens on a curated subset, with the learning rate decayed to zero. The annealing phase used a higher concentration of code, mathematics, and academic text than the main mix, mirroring the late stage curriculum strategy that would later be used systematically in OLMo 2. The released main branch on Hugging Face corresponds to the post anneal checkpoint, while earlier checkpoints were preserved for ablation work.
Training throughput was reported in the technical report as roughly two times faster per active parameter than a dense model of equivalent capability, a figure that mostly reflects the cheaper MoE forward pass at fixed active parameter count. Total training compute was approximately 3 x 10^22 FLOPs, several orders of magnitude below frontier scale 2024 closed models but consistent with the small to medium open model category.
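A rough cross check on the compute figure is the common rule of thumb of about 6 FLOPs per active parameter per training token. The rule itself is an assumption here, not necessarily the method used in the technical report, and it ignores attention cost and routing overhead.

```python
def approx_training_flops(active_params, tokens):
    """Rule of thumb estimate: ~6 FLOPs per active parameter per token."""
    return 6 * active_params * tokens

TOKENS = 5.133e12          # training tokens quoted in this article
ACTIVE_PARAMS = 1.3e9      # active parameters per token
TOTAL_PARAMS = 6.9e9       # total parameters

moe_flops = approx_training_flops(ACTIVE_PARAMS, TOKENS)
# Hypothetical dense model with as many parameters as OLMoE stores in total.
dense_flops = approx_training_flops(TOTAL_PARAMS, TOKENS)

print(f"MoE estimate:   {moe_flops:.2e} FLOPs")
print(f"Dense estimate: {dense_flops:.2e} FLOPs")
print(f"Dense / MoE:    {dense_flops / moe_flops:.1f}x")
```

The estimate lands at about 4 × 10^22 FLOPs, the same order of magnitude as the figure quoted above, and suggests that a dense model with 6.9 billion parameters trained on the same tokens would need roughly five times the compute.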
A distinctive feature of the OLMoE release was the volume of published ablation work. The technical report includes more than 30 controlled experiments on MoE design choices, run at smaller scale and shorter token budgets to keep cost manageable. The ablations cover: number of experts and top-k value, the choice between shared and routed expert layouts, dropless versus token dropping routing, the magnitude of the load balancing loss, the use of router z loss, the effect of expert dropout, and the impact of upcycling a dense checkpoint versus training the MoE from scratch.
The upcycling experiment in particular drew attention because it directly contradicted the then prevailing assumption that initialising MoE experts from a pretrained dense checkpoint produced better final models. Ai2 reported that, at the scales they tested, training the MoE from scratch matched or beat upcycling on downstream benchmarks, with the gap closing only at very small token budgets where the initialization advantage had not yet been overwritten by gradient updates.
The headline benchmark numbers reported by Ai2 in the OLMoE technical report and the Hugging Face model card place the base model at or near the top of the 1 billion active parameter category, and competitive with much larger dense models. Pretraining benchmarks are reported zero shot except for MMLU, which uses the standard 5 shot prompt.
| Benchmark | OLMoE-1B-7B base | OLMo 7B 0724 | Llama 2 7B | DeepSeekMoE 16B | TinyLlama 1.1B |
|---|---|---|---|---|---|
| MMLU (5 shot) | 54.1 | 54.9 | 46.2 | 45.0 | 25.8 |
| HellaSwag | 80.0 | 80.5 | 78.9 | 79.8 | 60.3 |
| ARC Easy | 84.2 | 85.7 | 76.6 | 81.0 | 55.3 |
| ARC Challenge | 62.1 | 68.0 | 54.2 | 53.2 | 33.8 |
| PIQA | 79.8 | 79.3 | 77.5 | 80.4 | 73.3 |
| WinoGrande | 70.2 | 73.2 | 71.7 | 70.5 | 59.4 |
On the six tasks above, the base model trails the dense OLMo 7B 0724 on five and edges it on PIQA, with gaps of 3 points or less everywhere except ARC Challenge (5.9 points), despite activating roughly one fifth as many parameters per token. It outperforms every other MoE with a comparable active parameter count that Ai2 benchmarked, including JetMoE (2B active) and Qwen 1.5 MoE A2.7B. Against dense baselines such as Llama 2 7B and Pythia 1B, OLMoE is a clear win overall; in the table, Llama 2 7B leads only on WinoGrande.
For the Instruct variant, Ai2 published a comparison against widely used chat models with similar or larger active parameter counts.
| Model | Active params | Average over Tulu eval suite |
|---|---|---|
| OLMoE-1B-7B Instruct | 1.3B | 57.7 |
| OLMo 7B Instruct (0724) | 7B | 50.1 |
| Llama 2 13B Chat | 13B | 53.3 |
| DeepSeekMoE 16B Chat | 2.8B | 51.1 |
| Qwen 1.5 1.8B Chat | 1.8B | 46.5 |
| Gemma 2 2B IT | 2B | 50.4 |
The Instruct number reported by Ai2 is an average over the post training evaluation suite used for the Tulu 2 line of models, including MMLU, GSM8K, BBH, TydiQA, Codex, AlpacaEval, and IFEval. The headline claim was that a 1.3 billion active parameter model could match or beat dense 13B chat models from the previous generation, a result attributed jointly to the strong base model and the DPO based post training pipeline.
Results outside the Ai2 suite are more mixed. Independent evaluation by the Hugging Face Open LLM Leaderboard placed OLMoE-1B-7B somewhat below Llama 3.1 8B and Gemma 2 9B on aggregate, which is consistent with the expectation that a much smaller active model cannot match the largest available dense weights on hard reasoning tasks. The OLMoE positioning is therefore better understood as state of the art for a given inference budget rather than state of the art absolutely.
The term "fully open" in the OLMoE release has the same narrow technical meaning as elsewhere in the OLMo program. A fully open release in this vocabulary requires the model weights, the complete pretraining data, the training code, the training recipe with all hyperparameters and mixture weights, the training logs, and a meaningful set of intermediate checkpoints. The intention is that a third party with enough compute could reproduce the run from scratch, ablate any single choice, and obtain a numerically similar model.
For OLMoE specifically, the released artifact set includes: model weights for the base, SFT, and Instruct variants; the OLMoE-mix-0924 corpus; the OLMo MoE training code on GitHub; the post training scripts and reward modeling code; Weights and Biases training logs with per step loss curves, gradient norms, expert load balance statistics, and routing entropy; intermediate checkpoints at regular intervals during the main pretraining run; and the full ablation experiment configurations. The Weights and Biases project link is published on the model card.
This level of release is unusual for any MoE model. As of the September 2024 release, no other publicly available sparse mixture of experts model included the training data in any form, much less the per source filtering scripts and the routing telemetry. Mixtral published only the weights and a technical report; DBRX published weights and a high level description; DeepSeek V2 published weights and a longer technical report but no data. OLMoE was the first model in the category that satisfied the Open Source Initiative's draft Open Source AI Definition, which requires release of "data information" sufficient for a skilled person to recreate a substantially equivalent system.
All OLMoE artifacts are released under the Apache 2.0 license. This includes the model weights, the SFT and Instruct variants, the OLMoE-mix-0924 dataset, the OLMo MoE training code, and the post training scripts. Apache 2.0 permits commercial use, redistribution, modification, and the creation of derivative works, subject to preservation of the license notice and attribution. There are no field of use restrictions, acceptable use clauses, or user count thresholds, in contrast to Meta's Llama Community License which gates very large commercial users behind a separate agreement.
The license choice was deliberate and consistent with Ai2's broader policy posture. In policy submissions and congressional testimony through 2024 and 2025, Ai2 has repeatedly used OLMo and OLMoE as evidence that high quality language models can be developed and released without the per user terms common at other large labs. The Apache 2.0 framing for the dataset is particularly notable because most pretraining corpora are released under more restrictive research only licenses or simply not released at all.
OLMoE entered a 2024 open mixture of experts landscape dominated by a few earlier releases, mostly far larger in total parameter count. The table below summarises the main contemporaries.
| Model | Active params | Total params | Experts | Active experts | Released | License | Data published |
|---|---|---|---|---|---|---|---|
| OLMoE-1B-7B | 1.3B | 6.9B | 64 | 8 | Sep 2024 | Apache 2.0 | Yes |
| Mixtral 8x7B | 13B | 47B | 8 | 2 | Dec 2023 | Apache 2.0 | No |
| Mixtral 8x22B | 39B | 141B | 8 | 2 | Apr 2024 | Apache 2.0 | No |
| DBRX | 36B | 132B | 16 | 4 | Mar 2024 | DBRX OS Licence | No |
| Qwen 1.5 MoE A2.7B | 2.7B | 14.3B | 60 + 4 shared | 4 + 4 shared | Mar 2024 | Tongyi Qianwen | No |
| DeepSeekMoE 16B | 2.8B | 16.4B | 64 + 2 shared | 6 + 2 shared | Jan 2024 | DeepSeek Licence | No |
| DeepSeek V2 | 21B | 236B | 160 + 2 shared | 6 + 2 shared | May 2024 | DeepSeek Licence | No |
| Snowflake Arctic | 17B | 480B | 128 | 2 | Apr 2024 | Apache 2.0 | No |
| Grok 1 | 86B | 314B | 8 | 2 | Mar 2024 | Apache 2.0 | No |
OLMoE is by some distance the smallest model in this set on both active and total parameter counts. It is also the only entry that ships with the training data and a complete reproduction recipe. On benchmarks the natural points of comparison are DeepSeekMoE 16B and Qwen 1.5 MoE A2.7B, both of which OLMoE beats on the Ai2 evaluation suite while activating fewer parameters per token. At larger weight classes Mixtral 8x7B and DeepSeek V2 are clearly stronger in absolute terms but operate at roughly ten and sixteen times the active parameter count respectively.
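The active parameter comparisons can be recomputed directly from the table. The sketch below copies the active and total counts for a few rows and derives two illustrative quantities, the share of parameters active per token and the active count relative to OLMoE; the helper names are hypothetical.

```python
# (active, total) parameter counts in billions, copied from the table above.
MODELS = {
    "OLMoE-1B-7B": (1.3, 6.9),
    "Mixtral 8x7B": (13, 47),
    "Qwen 1.5 MoE A2.7B": (2.7, 14.3),
    "DeepSeekMoE 16B": (2.8, 16.4),
    "DeepSeek V2": (21, 236),
}

def active_fraction(name):
    """Share of the stored parameters that a single token touches."""
    active, total = MODELS[name]
    return active / total

def active_ratio_vs_olmoe(name):
    """Active parameter count relative to OLMoE-1B-7B."""
    return MODELS[name][0] / MODELS["OLMoE-1B-7B"][0]

for name in MODELS:
    print(f"{name:20s} active share {active_fraction(name):5.1%}  "
          f"vs OLMoE {active_ratio_vs_olmoe(name):5.1f}x")
```

Because serving cost tracks the active count more closely than the total, the ratio column is the more relevant one when reading the benchmark comparisons.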
Against dense baselines the picture is the standard MoE story. OLMoE-1B-7B trails the strongest dense models in the 8B to 9B range, such as Llama 3.1 8B and Gemma 2 9B, on aggregate benchmarks, but does so while using a fraction of the inference compute. It is well ahead of dense 1B to 3B models including Pythia 1B, TinyLlama 1.1B, and Gemma 2 2B, which is the comparison Ai2 prefers because the active parameter count is the relevant proxy for serving cost.
Reception in the open language model community was strongly positive. The release was widely covered in the AI research press, including coverage from MarkTechPost, Hugging Face's blog, the Interconnects newsletter, and Sebastian Raschka's open model survey. The two most cited points of praise were the unusual completeness of the release (data, code, logs, checkpoints, ablations) and the volume of MoE specific ablation work, which several reviewers described as the most thorough public study of MoE design choices to date.
The model became a frequent reference baseline for academic work on mixture of experts routing, load balancing, and upcycling. The OLMoE technical report has been cited in the OLMo 2 report ("2 OLMo 2 Furious," January 2025), in the OLMo 3 release notes (November 2025), and in subsequent Ai2 work on MoE based reasoning models. Niklas Muennighoff's later work at Stanford and at other institutions has continued the line of research started in OLMoE, and the released ablation configurations have been reused in third party MoE studies. The Hugging Face Transformers library added a dedicated OlmoeForCausalLM model class in version 4.43, ahead of the public release, which gave OLMoE first class support in the most widely used open source inference stack from day one.
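For readers who want to try the model through Transformers, a minimal loading sketch follows. It assumes an installed `transformers` release that includes the OLMoE classes, an installed `torch`, and that the base checkpoint is published on the Hugging Face Hub under the ID `allenai/OLMoE-1B-7B-0924`; the `generate` helper is hypothetical, and calling it downloads several gigabytes of weights.

```python
MODEL_ID = "allenai/OLMoE-1B-7B-0924"  # assumed Hub ID for the base model

def generate(prompt, max_new_tokens=32):
    """Load the base model and return a short greedy continuation of `prompt`."""
    # Imported lazily so this file can be loaded without the heavy dependencies.
    import torch
    from transformers import AutoTokenizer, OlmoeForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # bfloat16 halves memory use; all 6.9B parameters are loaded even though
    # only about 1.3B are used per token.
    model = OlmoeForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `generate("Mixture of experts models are")` returns the prompt followed by the model's continuation. Since this is the base model rather than the Instruct variant, it continues text and does not follow chat style instructions.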
Criticism focused on a few specific weaknesses. The 4096 token context length was already short by September 2024 standards, when many competing open models had moved to 32k or longer. Coding performance lagged dedicated code MoEs and even dense baselines from the Qwen 2 family. The Instruct model's reliance on the older Tulu 2 post training pipeline (DPO without the verifiable rewards stage later introduced in Tulu 3) meant it was eclipsed within months by stronger post training on the same base, including community fine tunes that swapped in newer SFT mixtures.
Within Ai2 the project served as an architectural and tooling prototype for subsequent open MoE work. The OLMo MoE codebase forked from the main OLMo repository became the basis for later sparse experiments, and the OLMoE-mix dataset informed the design of OLMo Mix 1124 used in OLMo 2. Although Ai2 did not release a direct OLMoE successor in 2025, the team has indicated in public talks that a refreshed open MoE is on the OLMo program roadmap, building on the OLMoE recipe with longer context, updated data, and the verifiable rewards post training stack used in Tulu 3.