OLMoE

Updated Jun 27, 2026

OLMoE (Open Mixture-of-Experts) is a fully open sparse mixture of experts large language model released by the Allen Institute for AI (Ai2) on September 3, 2024 ^[1]^[2]. The first release, OLMoE-1B-7B-0924, has 6.9 billion total parameters but activates only about 1.3 billion of them on any given token, by routing each token through 8 of 64 small experts in every transformer layer ^[1]^[2]. In the words of the paper, OLMoE is "a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE)," one whose models "outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B" ^[1]. Ai2 described it as the first fully open mixture of experts model, meaning the release includes not only the weights but also the pretraining data mixture, the training and evaluation code, the Weights and Biases training logs, hundreds of intermediate checkpoints, and the post training data and scripts, all under the Apache 2.0 license ^[1]^[2]^[15].

At launch the base model matched or beat all other open models in its active parameter weight class on standard academic benchmarks, and the Instruct variant tied or surpassed several much larger dense and sparse models including Llama 2 13B Chat and DeepSeekMoE 16B Chat ^[1]. The Ai2 model card frames the trade off directly: OLMoE "yields state-of-the-art performance among models with a similar cost (1B) and is competitive with much larger models like Llama2-13B" ^[2].

The model arrived during a year in which sparse MoE designs had become the dominant approach for frontier language models, with Mixtral 8x7B from Mistral, DBRX from Databricks, Snowflake Arctic, Qwen 1.5 MoE, DeepSeek V2, and Grok 1 from xAI all using some variant of the architecture. None of those releases included the full training recipe or the underlying pretraining corpus, which made controlled scientific work on MoE design choices effectively impossible outside the labs that owned the models. OLMoE was framed as a deliberate counter to that pattern, an attempt to produce a competitive sparse model that could also serve as a research artifact for the wider community ^[1].

What is the background to OLMoE?

The Allen Institute for AI began the OLMo program in early 2024 with the goal of publishing fully reproducible large language models. The original OLMo release in February 2024 included a 1 billion and a 7 billion parameter dense decoder only transformer trained on the Dolma corpus, alongside training code, evaluation harnesses, and intermediate checkpoints. A refresh called OLMo 1.7-7B (April 2024) and a further update called OLMo 7B 0724 (July 2024) tightened the architecture and added more training tokens, lifting the 7B model's MMLU score from 28 at launch to 54.9 by July.

By mid 2024, sparse mixture of experts models had become the standard recipe for building high capacity language models without a proportional increase in inference cost. Mixtral 8x7B in December 2023 was the first widely available open weight MoE, followed in 2024 by DBRX (132 billion total, 36 billion active), Snowflake Arctic (480 billion total, 17 billion active), Qwen 1.5 MoE A2.7B, DeepSeek V2 (236 billion total, 21 billion active), and Grok 1 (314 billion total, 86 billion active). In each case the released artifact was either the weights alone, under Apache 2.0 or a custom license, or, in DeepSeek's case, weights and a high level model card; none came with training data or an exact reproduction recipe, and detail on training infrastructure was limited. Several research groups working on MoE routing and load balancing complained openly that this state of affairs made replication studies impossible without industrial scale compute.

OLMoE was conceived to fill that gap. The project was led by Niklas Muennighoff during his time at Ai2, in collaboration with Contextual AI and the wider OLMo team, including Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Nathan Lambert, and Hannaneh Hajishirzi ^[1]. The technical report, posted to arXiv as 2409.02060 on September 3, 2024, described both the trained model and a long series of ablations on routing strategies, expert granularity, load balancing losses, and shared versus routed expert layouts, with the goal of producing a recipe other groups could use ^[1].

Infobox

Field	Value
Developer	Allen Institute for AI with Contextual AI
Initial release	September 3, 2024
Model name	OLMoE-1B-7B-0924
Architecture	Sparse Mixture of Experts decoder only transformer
Total parameters	6.9 billion
Active parameters	approximately 1.3 billion per token
Experts per layer	64
Active experts per token	8 (top-8 routing)
Layers	16
Hidden size	2048
Attention heads	16
Context length	4096 tokens
Vocabulary	50,304 (modified GPT-NeoX BPE)
Training tokens	approximately 5.1 trillion
Training hardware	256 NVIDIA H100 GPUs, about 10 days
Training data	OLMoE-mix-0924 (DCLM Baseline plus Dolma 1.7 components)
License	Apache 2.0
Paper	arXiv:2409.02060
Repository	github.com/allenai/OLMoE

How does OLMoE use mixture of experts?

OLMoE-1B-7B is a 16 layer decoder only transformer in the pre normalization configuration. Each layer carries a multi head self attention block with 16 attention heads and a hidden size of 2048, then replaces the standard dense feed forward block with a routed mixture of experts module. There are 64 expert FFNs per layer, each with an intermediate dimension of 1024, and a learned router picks the top 8 experts for each token ^[1]^[2]. The token's representation is passed through each of those 8 experts, and their outputs are combined as a sum weighted by the router scores.
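The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Ai2's implementation: the function names, the toy experts, and the random logits are all hypothetical, and the sketch assumes the router weights come from a softmax over all 64 experts and are used without renormalising over the selected 8.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k=8):
    """Return (expert_index, weight) pairs for the k highest scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


def moe_forward(x, router_logits, experts, k=8):
    """Weighted sum of the selected experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for idx, weight in route_token(router_logits, k):
        y = experts[idx](x)  # only the selected experts are ever evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def make_toy_expert(scale):
    """Hypothetical stand-in for an expert FFN: scales its input."""
    def expert(x):
        return [scale * xi for xi in x]
    return expert


NUM_EXPERTS, TOP_K = 64, 8
random.seed(0)
experts = [make_toy_expert(i + 1) for i in range(NUM_EXPERTS)]
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(logits, TOP_K)
output = moe_forward([1.0, 2.0], logits, experts, TOP_K)
print(len(selected), "experts used out of", NUM_EXPERTS)
```

The other 56 experts are never called for this token, which is where the compute saving comes from.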

The sparse activation pattern is what produces the roughly 1.3B active versus 6.9B total split. A dense forward pass through one layer touches the full FFN, but in OLMoE only 8 of the 64 expert FFNs participate, so the effective FFN width per token is one eighth of the total. Combined with the 16 layer depth and the 2048 hidden size, this yields about 1.3 billion parameters touched per token versus 6.9 billion stored on disk ^[1]^[2]. Inference is correspondingly cheaper than a dense 7B model of equivalent total capacity, though the full 7B set must still be held in memory.
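The 1.3B versus 6.9B split can be roughly reproduced from the figures quoted in this article. The sketch below is a back-of-envelope count, not an official breakdown. It assumes, beyond what the article states, that each expert is a gated FFN with three weight matrices, that attention uses four square projection matrices without biases, that input and output embeddings are untied, and that normalization parameters are negligible.

```python
HIDDEN = 2048       # hidden size
LAYERS = 16         # transformer layers
EXPERTS = 64        # experts per layer
TOP_K = 8           # experts active per token
EXPERT_DIM = 1024   # intermediate dimension of each expert
VOCAB = 50_304      # vocabulary size

# Assumption: gated expert FFN with three HIDDEN x EXPERT_DIM matrices.
expert_params = 3 * HIDDEN * EXPERT_DIM
# Assumption: Q, K, V and output projections, no biases.
attn_params = 4 * HIDDEN * HIDDEN
# One router logit per expert.
router_params = HIDDEN * EXPERTS
# Assumption: separate (untied) input and output embedding matrices.
embed_params = 2 * VOCAB * HIDDEN


def model_params(experts_counted):
    """Parameter count when `experts_counted` experts per layer are included."""
    per_layer = attn_params + router_params + experts_counted * expert_params
    return LAYERS * per_layer + embed_params


total_params = model_params(EXPERTS)
active_params = model_params(TOP_K)

print(f"total  = {total_params / 1e9:.2f}B")   # total  = 6.92B
print(f"active = {active_params / 1e9:.2f}B")  # active = 1.28B
```

Under these assumptions the expert FFNs hold over 90 percent of the stored weights, which is why activating one eighth of them cuts the per token count so sharply.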

The MoE design follows a then current best practice of using many small experts rather than a few large ones, sometimes called fine grained MoE. The 64 expert layout is far finer grained than the 8 experts used by Mixtral 8x7B and comparable to the 64 routed experts plus 2 shared experts of DeepSeekMoE 16B, released earlier in 2024 ^[11]^[12]. The Ai2 team ran an extensive ablation campaign to settle on these numbers, comparing 8, 16, 32, and 64 experts along with different top-k values; the report concludes that 64 experts with top-8 routing gave the best balance between sparsity, load balance, and downstream accuracy at the 1B active scale ^[1].

Routing uses standard softmax gated top-k selection without shared experts. Load balancing is encouraged by a router auxiliary loss weighted at 0.01 in the training objective, which penalizes uneven token distribution across experts ^[1]. The model uses rotary position embeddings with a base frequency of 10,000, SiLU activations in the expert FFNs, and the same modified GPT-NeoX BPE tokenizer used by previous OLMo releases, with a vocabulary of 50,304 tokens. Maximum context length is 4096 tokens, the same as OLMo 7B 0724.
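To make the load balancing term concrete, here is a minimal sketch of one common formulation, the Switch Transformer style balance loss. The article only states the 0.01 weight, so the exact formula, the function names, and the toy batches below are assumptions for illustration.

```python
import math

AUX_WEIGHT = 0.01  # weight of the auxiliary loss, as quoted above


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(batch_logits, k):
    """num_experts * sum_i f_i * p_i over a batch of router logits.

    f_i is the share of routing slots assigned to expert i and p_i is the
    mean router probability for expert i. The value is about 1.0 when load
    is spread evenly and grows as routing collapses onto a few experts.
    """
    num_tokens = len(batch_logits)
    num_experts = len(batch_logits[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for logits in batch_logits:
        probs = softmax(logits)
        top = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:k]
        for i in top:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    f = [c / (num_tokens * k) for c in counts]
    p = [s / num_tokens for s in prob_sums]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


def make_batch(groups, num_experts=64, k=8, high=5.0):
    """Toy router logits: each token strongly prefers one block of k experts."""
    batch = []
    for g in groups:
        logits = [0.0] * num_experts
        for i in range(g * k, g * k + k):
            logits[i] = high
        batch.append(logits)
    return batch


balanced_loss = load_balancing_loss(make_batch(range(8)), k=8)   # every expert used once
collapsed_loss = load_balancing_loss(make_batch([0] * 8), k=8)   # same 8 experts every time

print(f"balanced:  {balanced_loss:.2f}")
print(f"collapsed: {collapsed_loss:.2f}")
# In training the term would be added as: total = lm_loss + AUX_WEIGHT * balance_loss
```

The collapsed batch scores several times higher than the balanced one, so minimising the weighted term nudges the router toward spreading tokens across all 64 experts.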

How was OLMoE trained?

What data was OLMoE trained on?

OLMoE was trained on a dataset called OLMoE-mix-0924, which combines DCLM Baseline (Apple and the DataComp-LM team's curated web mix) with the higher quality components of Dolma 1.7 ^[5]^[14]. The pretraining run consumed roughly 5.1 trillion tokens from this mix, about 1.3 passes over the corpus. The mix leans heavily on filtered web pages but also includes code from the StarCoder corpus, mathematics from Proof Pile II and OpenWebMath, academic papers from peS2o, and books from Project Gutenberg and similar sources. Ai2 published per source token counts, the filtering and deduplication scripts, and the exact mixing ratios alongside the model ^[5]. This dataset choice was a departure from earlier OLMo releases, which had used pure Dolma corpora; the team has been explicit that adding DCLM Baseline accounted for a substantial share of the MMLU gain over OLMo 7B 0724 ^[1].

How much compute did OLMoE use?

The main pretraining run used 256 NVIDIA H100 GPUs for about 10 days with PyTorch FSDP and mixed precision training, processing 5.133 trillion tokens, corresponding to roughly 1.3 epochs over OLMoE-mix-0924 ^[16]. After this run, the model was further annealed for 100 billion additional tokens on a curated subset, with the learning rate decayed to zero ^[1]. The annealing phase used a higher concentration of code, mathematics, and academic text than the main mix, mirroring the late stage curriculum strategy that would later be used systematically in OLMo 2. The released main branch on Hugging Face corresponds to the post anneal checkpoint, while earlier checkpoints were preserved for ablation work ^[2].

Training was reported in the technical report as roughly two times faster than for a dense model with the same number of active parameters, in the sense that the MoE reached comparable quality in about half the training time ^[1]. Total training compute was approximately 3 x 10^22 FLOPs, several orders of magnitude below frontier scale 2024 closed models but consistent with the small to medium open model category.
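The compute figure can be sanity checked with the common rule of thumb that training costs about 6 FLOPs per active parameter per token. That rule is an assumption here, not something the article attributes to Ai2, and the GPU hour figure simply multiplies the quoted hardware count by the approximate run length.

```python
TOKENS = 5.133e12          # pretraining tokens quoted above
GPUS = 256                 # H100 count quoted above
DAYS = 10                  # approximate run length quoted above


def train_flops(active_params, tokens=TOKENS):
    """Rule-of-thumb estimate: 6 FLOPs per active parameter per token."""
    return 6 * active_params * tokens


flops_at_1b = train_flops(1.0e9)     # using the nominal "1B" active size
flops_at_1p3b = train_flops(1.3e9)   # using the ~1.3B active count
gpu_hours = GPUS * DAYS * 24

print(f"6ND with 1.0B active: {flops_at_1b:.2e} FLOPs")
print(f"6ND with 1.3B active: {flops_at_1p3b:.2e} FLOPs")
print(f"approximate GPU hours: {gpu_hours}")
```

Both estimates land between 3 x 10^22 and 4 x 10^22 FLOPs, in line with the figure quoted above; the lower end corresponds to rounding the active parameter count down to 1B.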

What did the OLMoE ablations show?

A distinctive feature of the OLMoE release was the volume of published ablation work. The technical report includes more than 30 controlled experiments on MoE design choices, run at smaller scale and shorter token budgets to keep cost manageable ^[1]. The ablations cover: number of experts and top-k value, the choice between shared and routed expert layouts, dropless versus token dropping routing, the magnitude of the load balancing loss, the use of router z loss, the effect of expert dropout, and the impact of upcycling a dense checkpoint versus training the MoE from scratch.

The upcycling experiment in particular drew attention because it directly contradicted the then prevailing assumption that initialising MoE experts from a pretrained dense checkpoint produced better final models. Ai2 reported that, at the scales they tested, training the MoE from scratch matched or beat upcycling on downstream benchmarks; upcycling held an advantage only at small token budgets, where its head start from initialization had not yet been overtaken by from-scratch training ^[1].
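For readers unfamiliar with the term, upcycling in its simplest form copies one pretrained dense FFN into every expert slot and adds a freshly initialised router. The sketch below illustrates that idea only; the data layout, function name, and initialisation scale are hypothetical and are not taken from the OLMoE codebase.

```python
import copy
import random


def upcycle_dense_ffn(dense_ffn, num_experts, hidden_size, seed=0):
    """Build MoE layer parameters by copying one dense FFN into every expert.

    dense_ffn is a dict of weight matrices stored as nested lists.
    """
    rng = random.Random(seed)
    # Every expert starts as an independent copy of the dense weights.
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    # The router has no dense counterpart, so it starts from small random values.
    router = [
        [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
        for _ in range(num_experts)
    ]
    return {"experts": experts, "router": router}


# Toy dense FFN with 2 x 2 weight matrices (illustrative values only).
dense_ffn = {
    "w_in": [[0.1, 0.2], [0.3, 0.4]],
    "w_out": [[0.5, 0.6], [0.7, 0.8]],
}
moe_layer = upcycle_dense_ffn(dense_ffn, num_experts=4, hidden_size=2)
print(len(moe_layer["experts"]), "experts initialised from one dense FFN")
```

Because all experts begin identical, they only diverge as training proceeds, which is one intuition for why the head start matters less as the token budget grows.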

How does OLMoE perform on benchmarks?

The headline benchmark numbers reported by Ai2 in the OLMoE technical report and the Hugging Face model card place the base model at or near the top of the 1 billion active parameter category, and competitive with much larger dense models ^[1]^[2]. Pretraining benchmarks are reported zero shot except for MMLU, which uses the standard 5 shot prompt.

Benchmark	OLMoE-1B-7B base	OLMo 7B 0724	Llama 2 7B	DeepSeekMoE 16B	TinyLlama 1.1B
MMLU (5 shot)	54.1	54.9	46.2	45.0	25.8
HellaSwag	80.0	80.5	78.9	79.8	60.3
ARC Easy	84.2	85.7	76.6	81.0	55.3
ARC Challenge	62.1	68.0	54.2	53.2	33.8
PIQA	79.8	79.3	77.5	80.4	73.3
WinoGrande	70.2	73.2	71.7	70.5	59.4

The base model trails the dense OLMo 7B 0724 by a small margin on most tasks despite activating roughly one fifth as many parameters per token, and outperforms every other MoE with a comparable active parameter count that Ai2 benchmarked, including JetMoE (about 2B active) and Qwen 1.5 MoE A2.7B ^[1]^[2]. Against the dense Llama 2 7B, OLMoE leads on every benchmark in the table except WinoGrande, and it is well ahead of 1B class dense models such as Pythia 1B and TinyLlama 1.1B.

For the Instruct variant, Ai2 published a comparison against widely used chat models with similar or larger active parameter counts.

Model	Active params	Average over Tulu eval suite
OLMoE-1B-7B Instruct	1.3B	57.7
OLMo 7B Instruct (0724)	7B	50.1
Llama 2 13B Chat	13B	53.3
DeepSeekMoE 16B Chat	2.8B	51.1
Qwen 1.5 1.8B Chat	1.8B	46.5
Gemma 2 2B IT	2B	50.4

The Instruct number reported by Ai2 is an average over the post training evaluation suite used for the Tulu 2 line of models, including MMLU, GSM8K, BBH, TydiQA, Codex, AlpacaEval, and IFEval ^[1]^[3]. The headline claim was that a 1.3 billion active parameter model could match or beat dense 13B chat models from the previous generation, a result attributed jointly to the strong base model and the DPO based post training pipeline ^[1].

Results outside the Ai2 suite are more mixed. Independent evaluation by the Hugging Face Open LLM Leaderboard placed OLMoE-1B-7B somewhat below Llama 3.1 8B and Gemma 2 9B on aggregate, which is consistent with the expectation that a model with far fewer active parameters cannot match newer, larger dense models on hard reasoning tasks. The OLMoE positioning is therefore better understood as state of the art for a given inference budget rather than state of the art in absolute terms ^[2].

How open is OLMoE?

The term "fully open" in the OLMoE release has the same narrow technical meaning as elsewhere in the OLMo program. A fully open release in this vocabulary requires the model weights, the complete pretraining data, the training code, the training recipe with all hyperparameters and mixture weights, the training logs, and a meaningful set of intermediate checkpoints. The intention is that a third party with enough compute could reproduce the run from scratch, ablate any single choice, and obtain a numerically similar model ^[1]. Ai2 summarised the achievement as making OLMoE "the first model to be on the Pareto frontier of performance and size, while also being released with open data, code, evaluations, logs, and intermediate training checkpoints" ^[15].

For OLMoE specifically, the released artifact set includes: model weights for the base, SFT, and Instruct variants; the OLMoE-mix-0924 corpus; the OLMo MoE training code on GitHub; the post training scripts and reward modeling code; Weights and Biases training logs with per step loss curves, gradient norms, expert load balance statistics, and routing entropy; intermediate checkpoints at regular intervals during the main pretraining run; and the full ablation experiment configurations ^[1]^[5]^[6]^[7]. The Weights and Biases project link is published on the model card ^[7].

This level of release is unusual for any MoE model. As of the September 2024 release, no other publicly available sparse mixture of experts model included the training data in any form, much less the per source filtering scripts and the routing telemetry. Mixtral published only the weights and a technical report; DBRX published weights and a high level description; DeepSeek V2 published weights and a longer technical report but no data ^[11]^[12]. OLMoE was the first model in the category that satisfied the Open Source Initiative's draft Open Source AI Definition, which requires release of "data information" sufficient for a skilled person to recreate a substantially equivalent system.

Is OLMoE open source and what is its license?

All OLMoE artifacts are released under the Apache 2.0 license. This includes the model weights, the SFT and Instruct variants, the OLMoE-mix-0924 dataset, the OLMo MoE training code, and the post training scripts ^[1]^[2]^[5]. Apache 2.0 permits commercial use, redistribution, modification, and the creation of derivative works, subject to preservation of the license notice and attribution. There are no field of use restrictions, acceptable use clauses, or user count thresholds, in contrast to Meta's Llama Community License which gates very large commercial users behind a separate agreement.

The license choice was deliberate and consistent with Ai2's broader policy posture. In policy submissions and congressional testimony through 2024 and 2025, Ai2 has repeatedly used OLMo and OLMoE as evidence that high quality language models can be developed and released without the per user terms common at other large labs. The Apache 2.0 framing for the dataset is particularly notable because most pretraining corpora are released under more restrictive research only licenses or simply not released at all.

How does OLMoE compare to other mixture of experts models?

OLMoE entered a 2024 open mixture of experts landscape dominated by a few earlier releases, mostly far larger in total parameter count. The table below summarises the main contemporaries.

Model	Active params	Total params	Experts	Active experts	Released	License	Data published
OLMoE-1B-7B	1.3B	6.9B	64	8	Sep 2024	Apache 2.0	Yes
Mixtral 8x7B	13B	47B	8	2	Dec 2023	Apache 2.0	No
Mixtral 8x22B	39B	141B	8	2	Apr 2024	Apache 2.0	No
DBRX	36B	132B	16	4	Mar 2024	Databricks Open Model License	No
Qwen 1.5 MoE A2.7B	2.7B	14.3B	60 + 4 shared	4 + 4 shared	Mar 2024	Tongyi Qianwen	No
DeepSeekMoE 16B	2.8B	16.4B	64 + 2 shared	6 + 2 shared	Jan 2024	DeepSeek License	No
DeepSeek V2	21B	236B	160 + 2 shared	6 + 2 shared	May 2024	DeepSeek License	No
Snowflake Arctic	17B	480B	128	2	Apr 2024	Apache 2.0	No
Grok 1	86B	314B	8	2	Mar 2024	Apache 2.0	No

OLMoE is by some distance the smallest model in this set on both active and total parameter counts. It is also the only entry that ships with the training data and a complete reproduction recipe ^[1]. On benchmarks the natural points of comparison are DeepSeekMoE 16B and Qwen 1.5 MoE A2.7B, both of which OLMoE beats on the Ai2 evaluation suite while activating fewer parameters per token. In larger weight classes Mixtral 8x7B and DeepSeek V2 are clearly stronger in absolute terms but operate at roughly ten and sixteen times the active parameter count respectively.
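The size relationships in this comparison follow directly from the table. The short sketch below recomputes them from the table's own figures for a subset of the models; it introduces no new data, and the variable names are illustrative.

```python
# (active, total) parameter counts in billions, copied from the table above.
MODELS = {
    "OLMoE-1B-7B": (1.3, 6.9),
    "Qwen 1.5 MoE A2.7B": (2.7, 14.3),
    "DeepSeekMoE 16B": (2.8, 16.4),
    "Mixtral 8x7B": (13, 47),
    "DeepSeek V2": (21, 236),
}

base_active = MODELS["OLMoE-1B-7B"][0]

# How many times more parameters each model activates per token than OLMoE.
active_ratio = {name: active / base_active for name, (active, _) in MODELS.items()}
# Share of each model's stored parameters that is active for a given token.
active_share = {name: active / total for name, (active, total) in MODELS.items()}

for name in MODELS:
    print(f"{name:20s} {active_ratio[name]:5.1f}x OLMoE active, "
          f"{100 * active_share[name]:4.1f}% of weights active")
```

Active parameter ratio is the relevant lens for serving cost, while the active share shows how sparse each design is relative to what must be held in memory.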

Against dense baselines the picture is the standard MoE story. OLMoE-1B-7B trails the strongest dense 8B models such as Llama 3.1 8B and Gemma 2 9B on aggregate benchmarks, but does so while using a fraction of the inference compute. It is well ahead of dense 1B to 3B models including Pythia 1B, TinyLlama 1.1B, and Gemma 2 2B, which is the comparison Ai2 prefers because the active parameter count is the relevant proxy for serving cost ^[2].

How was OLMoE received?

Reception in the open language model community was strongly positive. The release was widely covered in the AI research press, including coverage from MarkTechPost, Hugging Face's blog, the Interconnects newsletter, and Sebastian Raschka's open model survey ^[10]. The two most cited points of praise were the unusual completeness of the release (data, code, logs, checkpoints, ablations) and the volume of MoE specific ablation work, which several reviewers described as the most thorough public study of MoE design choices to date.

The model became a frequent reference baseline for academic work on mixture of experts routing, load balancing, and upcycling. The OLMoE technical report has been cited in the OLMo 2 report ("2 OLMo 2 Furious," January 2025), in the OLMo 3 release notes (November 2025), and in subsequent Ai2 work on MoE based reasoning models ^[8]. Niklas Muennighoff's later work at Stanford and at other institutions has continued the line of research started in OLMoE, and the released ablation configurations have been reused in third party MoE studies. The Hugging Face Transformers library added a dedicated OlmoeForCausalLM model class around the time of the release, which gave OLMoE first class support in the most widely used open source inference stack ^[9].

Criticism focused on a few specific limitations. The 4096 token context length was already short by September 2024 standards, when many competing open models had moved to 32k or longer. Coding performance lagged dedicated code MoEs and even dense baselines from the Qwen 2 family. The Instruct model's reliance on the older Tulu 2 post training pipeline (DPO without the verifiable rewards stage later introduced in Tulu 3) meant it was eclipsed within months by stronger post training on the same base, including community fine tunes that swapped in newer SFT mixtures.

Within Ai2 the project served as an architectural and tooling prototype for subsequent open MoE work. The OLMo MoE codebase forked from the main OLMo repository became the basis for later sparse experiments, and the OLMoE-mix dataset informed the design of OLMo Mix 1124 used in OLMo 2. Although Ai2 did not release a direct OLMoE successor in 2025, the team has indicated in public talks that a refreshed open MoE is on the OLMo program roadmap, building on the OLMoE recipe with longer context, updated data, and the verifiable rewards post training stack used in Tulu 3.

ELI5

Imagine a big team of 64 specialists sitting in a room, but for any single question you only call on the 8 who know that topic best. That is what OLMoE does. The model stores 64 small "experts" in each layer, but for each word it reads it asks only 8 of them to do work. So even though the whole model is about 7 billion parameters big, only about 1.3 billion of them actually run for any given word, which makes it cheap and fast to use while still being smart. The other special thing is that the team at Ai2 gave away everything: not just the finished brain, but the exact books it read, the recipe, the training notebooks, and snapshots taken along the way, so anyone can study how it learned or rebuild it from scratch.

References

1. Muennighoff, Niklas; Soldaini, Luca; Groeneveld, Dirk; Lo, Kyle; Morrison, Jacob; Min, Sewon; Shi, Weijia; Walsh, Pete; Tafjord, Oyvind; Lambert, Nathan; Gu, Yuling; Arora, Shane; Bhagia, Akshita; Schwenk, Dustin; Wadden, David; Wettig, Alexander; Hui, Binyuan; Dettmers, Tim; Kiela, Douwe; Farhadi, Ali; Smith, Noah A.; Koh, Pang Wei; Singh, Amanpreet; Hajishirzi, Hannaneh. "OLMoE: Open Mixture-of-Experts Language Models." arXiv:2409.02060, September 3, 2024. https://arxiv.org/abs/2409.02060
2. Allen Institute for AI. "allenai/OLMoE-1B-7B-0924." Hugging Face model card. https://huggingface.co/allenai/OLMoE-1B-7B-0924
3. Allen Institute for AI. "allenai/OLMoE-1B-7B-0924-Instruct." Hugging Face model card. https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct
4. Allen Institute for AI. "allenai/OLMoE-1B-7B-0924-SFT." Hugging Face model card. https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT
5. Allen Institute for AI. "OLMoE-mix-0924 dataset." Hugging Face. https://huggingface.co/datasets/allenai/OLMoE-mix-0924
6. Allen Institute for AI GitHub. "allenai/OLMoE." https://github.com/allenai/OLMoE
7. Allen Institute for AI. "OLMoE-1B-7B-0924 training run." Weights and Biases report. https://wandb.ai/ai2-llm/olmoe/reports/OLMoE-1B-7B-0924--Vmlldzo4OTcyMjU3
8. OLMo Team. "2 OLMo 2 Furious." arXiv:2501.00656, January 2025. https://arxiv.org/abs/2501.00656
9. Hugging Face Transformers documentation. "OLMoE." https://huggingface.co/docs/transformers/main/en/model_doc/olmoe
10. MarkTechPost. "Allen AI Releases OLMoE: A Fully Open-Source Mixture-of-Experts Language Model." September 2024. https://www.marktechpost.com/2024/09/04/allen-ai-releases-olmoe-a-fully-open-source-mixture-of-experts-language-model/
11. DeepSeek-AI. "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models." arXiv:2401.06066, January 2024. https://arxiv.org/abs/2401.06066
12. Mistral AI. "Mixtral of Experts." arXiv:2401.04088, January 2024. https://arxiv.org/abs/2401.04088
13. Soldaini, Luca et al. "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." arXiv:2402.00159, February 2024. https://arxiv.org/abs/2402.00159
14. Li, Jeffrey et al. "DataComp-LM: In search of the next generation of training sets for language models." arXiv:2406.11794, June 2024. https://arxiv.org/abs/2406.11794
15. Allen Institute for AI. "OLMoE: An open, small, and state-of-the-art mixture-of-experts model." Ai2 blog, September 3, 2024. https://allenai.org/blog/olmoe-an-open-small-and-state-of-the-art-mixture-of-experts-model-c258432d0514
16. Contextual AI. "Introducing OLMoE: fully open source Mixture of Experts LLM." September 2024. https://contextual.ai/research/olmoe-mixture-of-experts
