Mixtral is a family of open-weight Sparse Mixture of Experts (SMoE) large language models developed by Mistral AI, a French artificial intelligence company founded in April 2023. The family contains two main models: Mixtral 8x7B, released on December 11, 2023, and Mixtral 8x22B, released on April 10, 2024. Both are decoder-only Transformer networks where each feedforward block is replaced by a set of eight expert subnetworks, with a learned router selecting two experts per token at every layer. The design lets the model carry a large total parameter count for capacity while activating only a fraction of those parameters during inference, so it runs much faster than a dense model of comparable quality and is cheaper to serve.
Mixtral 8x7B was the first open-weight Mixture of Experts model that clearly matched or beat much larger dense systems such as LLaMA 2 70B and GPT-3.5 Turbo on standard benchmarks, and its release under the Apache 2.0 license showed that sparse MoE could be a practical, fully open recipe rather than a research curiosity confined to internal Google papers. Mixtral 8x22B extended the same recipe with a larger network, a 64K context window, and stronger reasoning, math, and multilingual scores. Together the two releases helped trigger the wave of open MoE models that followed in 2024 (DBRX, Snowflake Arctic, Qwen 1.5 MoE, DeepSeek-V2 and later DeepSeek-V3) and made sparse routing one of the dominant designs in modern open-weight language models.
The Mixtral name is a portmanteau of Mistral and "mixture," which signals its lineage. Mixtral 8x7B reuses the Mistral 7B attention stack (sliding window attention, grouped query attention, RoPE, the SentencePiece tokenizer) almost verbatim, then swaps each dense feedforward layer for an MoE block of eight expert feedforward networks. That architectural continuity, plus the public weights and permissive license, made Mixtral easy to study, fine-tune, quantize, and integrate into existing inference engines such as llama.cpp, Ollama, and vLLM within days of release.
The Mixture of Experts idea predates modern deep learning. The 1991 paper "Adaptive Mixtures of Local Experts" by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton proposed splitting a network into specialist subnetworks managed by a gating function. The modern, deep learning version arrived in 2017 with Noam Shazeer's "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," which scaled language models to 137 billion parameters by routing each token to a small number of experts.
Google then ported MoE into Transformer-based language models. GShard (2020) trained a 600-billion-parameter multilingual translation model that replaced every other dense feedforward layer with a top-2 MoE block. The Switch Transformer (2021) simplified routing further to top-1 and pushed past one trillion parameters. GLaM (2021) reached 1.2 trillion parameters with similar techniques while activating only a fraction during inference. None of these models were released publicly. They lived inside Google's clusters, were described in papers, and the open-source community had no way to reproduce them or run them on its own hardware.
Mistral AI was founded in April 2023 by Arthur Mensch, Guillaume Lample, and Timothee Lacroix, who came from Google DeepMind and Meta AI. The company's first model, Mistral 7B, shipped in September 2023. It was a 7.3-billion-parameter dense Transformer released as a torrent under Apache 2.0, and it noticeably outperformed Llama 2 13B on most benchmarks while being roughly half the size. Mistral 7B introduced two architecture choices that would carry over directly into Mixtral: sliding window attention, which limits each token's attention to a moving window of recent context, and grouped query attention, which shares key and value projections across multiple query heads to cut the memory bandwidth needed during inference. When Mistral built Mixtral, it kept the Mistral 7B attention stack intact and only swapped the feedforward layers for MoE blocks.
Mistral AI announced Mixtral 8x7B on December 11, 2023, with a blog post titled "Mixtral of Experts." The actual model weights had appeared three days earlier, on December 8, 2023, when Mistral's official X (Twitter) account posted a single tweet containing only a magnet link. The torrent contained the raw model weights without code, documentation, or benchmarks. This unconventional drop became a recurring Mistral marketing pattern (the company had used the same magnet-link teaser for Mistral 7B in September 2023 and would repeat it for Mixtral 8x22B in April 2024) and gave the release a viral, hacker-culture flavor that contrasted sharply with the carefully staged announcements typical of large AI labs.
The official blog post on December 11 introduced Mixtral 8x7B as "a high-quality sparse mixture of experts model (SMoE) with open weights," claimed it outperformed Llama 2 70B on most benchmarks with 6x faster inference, and stated that it matched or exceeded GPT-3.5 on standard benchmarks. The instruction-tuned variant, Mixtral 8x7B Instruct v0.1, was released on the same day. Both the base and Instruct variants ship under the Apache 2.0 license, which permits commercial use, redistribution, fine-tuning, and modification with no field-of-use restrictions. Hugging Face uploaded the weights on December 11 and added official transformers library support, including a MixtralForCausalLM class, by December 12.
The technical paper followed a month later. Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lelio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Theophile Gervet, Thibaut Lavril, Thomas Wang, Timothee Lacroix, and William El Sayed published "Mixtral of Experts" on arXiv on January 8, 2024 (arXiv:2401.04088). The paper was the first detailed account of the model's architecture, training, expert routing behavior, and benchmark numbers.
Mixtral 8x7B is a decoder-only Transformer with the same backbone as Mistral 7B except that each feedforward layer is replaced by a sparse MoE layer containing eight experts. Each expert is itself a SwiGLU feedforward block of the same shape as the dense FFN in Mistral 7B. The architectural numbers are:
| Parameter | Value |
|---|---|
| Hidden dimension (d_model) | 4,096 |
| Number of layers | 32 |
| Attention heads | 32 |
| Key-value heads | 8 (Grouped-Query Attention) |
| Head dimension | 128 |
| Intermediate (FFN) size per expert | 14,336 |
| Vocabulary size | 32,000 |
| Context length | 32,768 tokens |
| Number of experts per layer | 8 |
| Active experts per token | 2 (top-2 routing) |
| Total parameters | ~46.7 billion |
| Active parameters per token | ~12.9 billion |
| Activation function | SiLU (within SwiGLU) |
| Positional encoding | Rotary Position Embedding (RoPE), theta = 1,000,000 |
| Tokenizer | SentencePiece BPE, 32K vocab (shared with Mistral 7B) |
| Precision | bfloat16 |
Why 46.7 billion total parameters and not 56 billion (8 x 7B)? The "8x7B" name is a marketing convenience, not a literal multiplication. Only the feedforward layers are replicated eight times; the token embedding, attention layers, layer norms, and output projection are shared across experts. Each expert FFN contributes roughly 5.6 billion parameters summed over the 32 layers, so the eight experts together account for about 45 billion, and the remaining ~1.7 billion are the shared attention, embedding, and normalization stack. Active parameters per token (about 12.9 billion) are that shared stack plus the two experts selected by the router at each layer.
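As a rough sanity check, these counts can be reproduced from the table above with a few lines of arithmetic (a sketch that ignores layer norms; the per-matrix shapes follow the standard Mistral/Mixtral layout):

```python
# Approximate parameter accounting for Mixtral 8x7B, using the sizes listed above.
d_model, d_ff, n_layers, n_kv_heads, head_dim, vocab, n_experts = (
    4096, 14336, 32, 8, 128, 32000, 8)

expert_ffn = 3 * d_model * d_ff                  # w1, w2, w3 of one SwiGLU expert (one layer)
experts_total = n_experts * expert_ffn * n_layers            # ~45.1B across all layers
attn_per_layer = 2 * d_model * d_model + 2 * d_model * n_kv_heads * head_dim  # Q,O + K,V (GQA)
shared = attn_per_layer * n_layers + 2 * vocab * d_model     # attention + embedding + LM head
router = n_layers * d_model * n_experts                      # ~1M, negligible

total = experts_total + shared + router                      # ~46.7B
active = 2 * expert_ffn * n_layers + shared + router         # ~12.9B (two experts per layer)
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e9:.1f}B")
```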
At every MoE layer, for every input token, the router selects two of the eight experts and combines their outputs as a weighted sum. The router is a single learned linear layer of shape (d_model x 8) followed by a softmax. Concretely, given an input vector x at an MoE layer:

y = sum over the two selected experts i of softmax(top2(x · W_g))_i * SwiGLU_i(x)

where W_g is the (d_model x 8) router weight matrix, top2 keeps the two largest router logits and masks the rest, the softmax renormalizes those two logits into weights, and SwiGLU_i is the i-th expert feedforward network.
This is the same top-2 routing introduced in GShard. Unlike GShard, which placed an MoE block on every other feedforward layer, Mixtral applies MoE on every feedforward layer in the model. Routing is independent at each layer, so a single token can be processed by 64 distinct experts on its way through 32 layers (two per layer). Because the router operates per token, neighboring tokens can take very different paths through the network, which is why MoE inference benefits enormously from large batch sizes that smooth out per-expert load.
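A minimal PyTorch sketch of such a block is shown below (illustrative only: production implementations such as those in transformers, vLLM, or Megablocks group tokens by expert and use fused kernels rather than looping):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtralStyleMoEBlock(nn.Module):
    """Top-2 sparse MoE feedforward block in the style described above (simplified)."""
    def __init__(self, d_model: int = 4096, d_ff: int = 14336,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # the router W_g
        self.experts = nn.ModuleList([
            nn.ModuleDict({
                "w1": nn.Linear(d_model, d_ff, bias=False),      # SwiGLU gate projection
                "w3": nn.Linear(d_model, d_ff, bias=False),      # SwiGLU up projection
                "w2": nn.Linear(d_ff, d_model, bias=False),      # down projection
            })
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); routing is independent for every token.
        logits = self.gate(x)                                    # (n_tokens, n_experts)
        top_vals, top_idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)                    # weights over the 2 chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = top_idx[:, k] == e                        # tokens whose k-th choice is expert e
                if mask.any():
                    xe = x[mask]
                    h = F.silu(expert["w1"](xe)) * expert["w3"](xe)   # SwiGLU
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert["w2"](h)
        return out
```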
A naive sparse MoE layer is unstable: nothing prevents the router from collapsing onto one or two favored experts and ignoring the rest. Mixtral uses an auxiliary load balancing loss originally introduced in the Switch Transformer paper. The loss for one MoE layer is:
L_balance = alpha * N * sum over experts i of (f_i * P_i)
where N is the number of experts (8), f_i is the fraction of tokens in the batch routed to expert i, P_i is the average router probability for expert i across the batch, and alpha is a small coefficient. In Mixtral 8x7B the coefficient was 0.02; for Mixtral 8x22B it was reduced to 0.001. The loss is added to the standard cross-entropy training objective. Minimizing L_balance pushes both the routing distribution f and the router probabilities P toward uniform, which keeps all eight experts trained.
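A sketch of this loss for one layer, directly following the formula (hypothetical helper; only the 0.02 coefficient for Mixtral 8x7B is taken from the text, and f_i is counted over top-2 routing slots):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2,
                        alpha: float = 0.02) -> torch.Tensor:
    """L_balance = alpha * N * sum_i f_i * P_i for a single MoE layer.

    router_logits: (n_tokens, n_experts) pre-softmax router outputs.
    f_i is the fraction of routing slots (token, top-k choice) assigned to expert i;
    P_i is the mean router probability for expert i across the batch.
    """
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                    # (n_tokens, n_experts)
    _, selected = torch.topk(probs, top_k, dim=-1)              # (n_tokens, top_k)
    f = F.one_hot(selected, n_experts).float().mean(dim=(0, 1)) # (n_experts,)
    p = probs.mean(dim=0)                                       # (n_experts,)
    return alpha * n_experts * torch.sum(f * p)
```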
The key practical property of Mixtral is that compute scales with active parameters while memory scales with total parameters. A forward pass over one token activates roughly 12.9 billion parameters, so latency per token is similar to a dense 13B model. But all 46.7 billion parameters must be resident in VRAM (or fast-loadable from CPU) because any token at any layer might pick any pair of experts. This gives Mixtral the inference cost profile of a 13B dense model and the memory footprint of a 47B dense model. The full bfloat16 checkpoint is roughly 93 GB, too large for any single consumer GPU and slightly too large even for a single A100 80GB or H100 80GB; serving at full precision takes at least two 80 GB accelerators, while the quantized builds discussed below fit on far smaller hardware.
Mistral has disclosed very little about Mixtral's training corpus, total token count, or hardware. The arXiv paper says only that the model was "pretrained on data extracted from the open Web" and that the experts and the router were trained jointly, end to end. There is no published number for tokens consumed, no breakdown of language or domain mixture, and no description of the GPU cluster used. The company has cited competitive sensitivity for the silence, a stance shared by OpenAI, Anthropic, and other frontier labs.
The Instruct variant (Mixtral 8x7B Instruct v0.1) was created with two-stage post-training: supervised fine-tuning on instruction-response pairs followed by Direct Preference Optimization (DPO) on a paired feedback dataset. DPO optimizes directly on preference pairs through a closed-form reparameterization of the RLHF objective, so no separate reward model or reinforcement learning loop is needed, and Mixtral was one of the earliest widely deployed Instruct models trained with DPO rather than PPO. The combination produced an Instruct model that scored 8.30 on MT-Bench, the highest score for any openly available model at the time.
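For illustration, the DPO objective can be written in a few lines. This is the published DPO loss in generic form, not Mistral's disclosed training code, and beta = 0.1 is merely a typical value:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is the summed log-probability of a response under the policy
    being trained or under the frozen SFT reference model. Because the implicit
    reward is beta * (log pi - log pi_ref), no separate reward model is needed.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```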
The Mixtral paper compared Mixtral 8x7B against LLaMA 2 70B and GPT-3.5 across a wide range of standard evaluations. The headline result was that Mixtral matched or beat LLaMA 2 70B on most benchmarks while activating roughly six times fewer parameters per token.
| Benchmark | Mixtral 8x7B | LLaMA 2 70B | GPT-3.5 |
|---|---|---|---|
| MMLU (5-shot) | 70.6% | 69.9% | 70.0% |
| HellaSwag (10-shot) | 86.7% | 87.1% | 85.5% |
| WinoGrande (5-shot) | 81.2% | 83.2% | 81.6% |
| ARC Challenge (25-shot) | 85.8% | 85.1% | 85.2% |
| PIQA (0-shot) | 83.6% | 82.6% | -- |
| ARC-Easy (0-shot) | 83.1% | 79.9% | -- |
| NaturalQuestions (5-shot) | 30.6% | 25.4% | -- |
| TriviaQA (5-shot) | 71.5% | 73.0% | -- |
| HumanEval (pass@1) | 40.2% | 29.3% | -- |
| MBPP (pass@1) | 60.7% | 49.8% | 52.2% |
| MATH (4-shot, maj@4) | 28.4% | 13.8% | -- |
| GSM8K (5-shot) | 58.4% | 53.6% | 57.1% |
| MT-Bench (Instruct) | 8.30 | 6.86 | 8.32 |
The gains were largest on code and math: Mixtral beat LLaMA 2 70B by roughly 11 points on HumanEval and more than doubled its score on the MATH dataset. On MMLU and most other knowledge benchmarks the margin was smaller, around half a point to a few points. Mixtral 8x7B Instruct's MT-Bench score of 8.30 placed it nearly even with GPT-3.5 Turbo (8.32) and well ahead of Llama 2 70B Chat (6.86), making it the strongest open-weight chat model on MT-Bench at release.
On the LMSYS Chatbot Arena leaderboard, Mixtral 8x7B Instruct reached an Elo of about 1121 in early 2024, ahead of Claude 2.1 (1117), GPT-3.5 Turbo (1117), and Gemini Pro (1111). For several months it was the highest-ranked openly available model on the Arena.
Multilingual evaluation showed that Mixtral substantially outperformed LLaMA 2 70B on French, German, Spanish, and Italian versions of HellaSwag, ARC Challenge, and MMLU. On the Bias Benchmark for QA (BBQ), Mixtral scored 56.0% versus LLaMA 2 70B's 51.5%, indicating somewhat lower social bias.
A natural question for any MoE model is whether the experts specialize. The Mixtral paper investigated this by tracking which experts were selected for tokens drawn from different domains: Python code, English Wikipedia, mathematics, French, German, and so on. The result was the opposite of what many expected. The router does not learn topical specialization. There is no "math expert" or "code expert." Instead, expert selection correlates strongly with token syntax and position. Adjacent tokens are often routed to the same expert pair, and certain experts dominate at the start of sentences, after punctuation, or for specific subword prefixes. Some weak topical bias does appear (slightly more activation of certain experts on Python tokens than on English prose tokens), but it never approaches strict partitioning. The routing pattern in middle layers is more diffuse than in the first or last layers.
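This kind of analysis is easy to reproduce with the public weights. The sketch below assumes a recent transformers version in which the Mixtral forward pass can return per-layer router logits via output_router_logits (the same tensors used for the auxiliary loss); counting top-2 selections per token gives each layer's expert-usage histogram:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto")

text = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"
inputs = tok(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# out.router_logits: one (seq_len, 8) tensor per MoE layer (batch size 1 here).
for layer, logits in enumerate(out.router_logits):
    top2 = logits.topk(2, dim=-1).indices
    usage = torch.bincount(top2.flatten(), minlength=8)
    print(f"layer {layer:2d} expert usage: {usage.tolist()}")
```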
This finding has been repeatedly reproduced in subsequent MoE work and is now widely cited as the standard observation: token-level routers tend to discover syntactic, positional, or low-level lexical patterns rather than human-interpretable topics. Later models such as DeepSeek-MoE attacked the problem with finer-grained experts and shared "always-on" experts to encourage more specialization, while still tolerating the diffuse routing behavior that Mixtral demonstrated.
Mixtral 8x7B has a native 32,768-token context window. The Mixtral paper reports a passkey retrieval test in which a random short passkey is inserted at a random position inside a long prompt. Mixtral 8x7B achieved 100% retrieval accuracy across the full 32K window, which means the model can locate and copy a string from anywhere in its context in this synthetic test. Real-world long-context performance (multi-document QA, long-form summarization) is somewhat weaker than the synthetic passkey result suggests, a pattern observed in essentially every long-context model evaluated since 2023.
Mistral released Mixtral 8x22B on April 10, 2024, again as a magnet link posted to its official X account. The torrent contained 281 GB of raw weights with no code or documentation. Hugging Face uploads followed within hours. Mistral published the official blog post ("Cheaper, Better, Faster, Stronger") on April 17, 2024, alongside the Instruct variant Mixtral 8x22B Instruct v0.1. As before, both base and Instruct were released under Apache 2.0.
The release landed in a busy window for open models. Meta released Llama 3 8B and 70B on April 18, 2024, eight days after Mixtral 8x22B's torrent and one day after Mistral's blog post. Snowflake announced its Arctic 480B MoE on April 24. April 2024 ended up being one of the most concentrated months in open-weight LLM history, and Mixtral 8x22B briefly held the title of best open MoE before Llama 3 70B's MMLU score of 82.0 took the lead.
Mixtral 8x22B keeps the same sparse MoE recipe but scales every dimension. It still uses eight experts per layer and top-2 routing, so the headline 8x22B name reflects expert-FFN size, not a deeper change in design.
| Parameter | Mixtral 8x7B | Mixtral 8x22B |
|---|---|---|
| Hidden dimension | 4,096 | 6,144 |
| Number of layers | 32 | 56 |
| Attention heads | 32 | 48 |
| Key-value heads | 8 | 8 |
| Head dimension | 128 | 128 |
| Intermediate (FFN) size per expert | 14,336 | 16,384 |
| Vocabulary size | 32,000 | 32,768 |
| Context length | 32,768 | 65,536 |
| Experts per layer | 8 | 8 |
| Active experts per token | 2 | 2 |
| Total parameters | ~46.7B | ~141B |
| Active parameters per token | ~12.9B | ~39B |
| RoPE theta | 1,000,000 | 1,000,000 |
| Load balancing coefficient | 0.02 | 0.001 |
The move to 56 layers (up from 32), a 6,144 hidden dimension (up from 4,096), 48 attention heads (up from 32), and an expanded 32,768-token vocabulary brings the total parameter count to about 141 billion, with about 39 billion activated per token. The context window doubled to 65,536 tokens. Grouped query attention is preserved at the same 8 key-value heads, which now serve 48 query heads (a 6:1 ratio).
Mistral's announcement highlighted four main capabilities: fluency in English, French, Italian, German, and Spanish; strong code and math performance; native function calling support in the Instruct variant; and constrained-output / structured-generation support. Function calling and JSON-mode generation made the Instruct variant directly usable as a tool-using agent without external scaffolding, an important practical feature in 2024 when most open chat models still required prompt engineering to emit clean JSON.
The Mistral announcement and community evaluations report the following key numbers for Mixtral 8x22B:
| Benchmark | Mixtral 8x22B (Base) |
|---|---|
| MMLU (5-shot) | 77.7% |
| HellaSwag (0-shot, acc_norm) | 86.2% |
| ARC Challenge (0-shot, acc_norm) | 63.7% |
| WinoGrande (0-shot) | 79.8% |
| GSM8K (8-shot) | 76.5% |
| TriviaQA | 82.1% |
| HumanEval (pass@1) | 45.1% |
| MBPP (pass@1) | 71.2% |
| MATH (4-shot) | 41.8% |
For the Instruct variant, Mistral reported an MT-Bench score of 8.66, GSM8K with majority voting (maj@8) of 90.8%, and MATH (maj@4) of 44.6%. Compared to Mixtral 8x7B, the 8x22B improved MMLU by about 7 points, more than doubled the base MATH score, and pulled significantly ahead on code generation.
Llama 3 70B (also April 2024) reported MMLU 82.0%, GSM8K 93.0%, and HumanEval 81.7% (for the Instruct variant). On raw benchmark scores Llama 3 70B beat Mixtral 8x22B in most categories despite using a fully dense architecture with 70 billion parameters. The trade-off was that Llama 3 70B activates all 70B parameters per token while Mixtral 8x22B activates only 39B, so Mixtral retained an inference-speed advantage. The two models occupied different points on the same Pareto frontier: Llama 3 70B for raw quality, Mixtral 8x22B for efficiency at scale. At the time, Llama 3's much larger pretraining corpus (15 trillion tokens, versus an undisclosed but presumably smaller corpus for Mixtral 8x22B) was widely credited with the quality gap.
The table below sets Mixtral against the major open-weight and proprietary models of late 2023 and 2024.
| Model | Developer | Release | Total params | Active params | Context | Architecture | MMLU | License |
|---|---|---|---|---|---|---|---|---|
| Mixtral 8x7B | Mistral AI | Dec 2023 | 46.7B | 12.9B | 32K | Sparse MoE (top-2 / 8) | 70.6% | Apache 2.0 |
| Mixtral 8x22B | Mistral AI | Apr 2024 | 141B | 39B | 65K | Sparse MoE (top-2 / 8) | 77.7% | Apache 2.0 |
| LLaMA 2 70B | Meta AI | Jul 2023 | 70B | 70B | 4K | Dense | 69.9% | Meta Community |
| Llama 3 70B | Meta AI | Apr 2024 | 70B | 70B | 8K | Dense | 82.0% | Meta Community |
| Falcon 180B | TII | Sep 2023 | 180B | 180B | 2K | Dense | 68.7% | Falcon TII |
| Qwen 72B | Alibaba | Nov 2023 | 72B | 72B | 32K | Dense | 74.4% | Qwen License |
| Qwen 1.5 MoE A2.7B | Alibaba | Mar 2024 | 14.3B | 2.7B | 32K | Sparse MoE (top-4 / 60) | ~62% | Apache 2.0 |
| DBRX Instruct | Databricks | Mar 2024 | 132B | 36B | 32K | Sparse MoE (top-4 / 16) | 73.7% | DBRX Open |
| Grok-1 | xAI | Mar 2024 | 314B | 86B | 8K | Sparse MoE (top-2 / 8) | 73.0% | Apache 2.0 |
| Snowflake Arctic | Snowflake | Apr 2024 | 480B | 17B | 4K | Dense + Sparse MoE (top-2 / 128) | 67.3% | Apache 2.0 |
| DeepSeek-V2 | DeepSeek | May 2024 | 236B | 21B | 128K | Sparse MoE (top-6 / 162) | 78.5% | DeepSeek Custom |
| DeepSeek-V3 | DeepSeek | Dec 2024 | 671B | 37B | 128K | Sparse MoE (top-8 / 256 + shared) | 88.5% | DeepSeek Open |
| GPT-3.5 Turbo | OpenAI | Mar 2023 | undisclosed | undisclosed | 16K | undisclosed | 70.0% | Proprietary |
A few patterns stand out. First, the basic Mixtral recipe (eight experts, top-2) was simple compared to what came later. Snowflake Arctic uses 128 experts with top-2; DeepSeek-V3 uses 256 fine-grained experts with top-8 plus shared always-on experts; Qwen 1.5 MoE uses 60 experts with top-4. The trend after Mixtral was clearly toward more, smaller experts and finer-grained routing. Second, Mixtral's permissive Apache 2.0 license set a baseline that most of the post-Mixtral open MoE models adopted (Llama's Community License remains a notable exception). Third, the gap in raw quality between dense and sparse models at fixed total parameter budget closed through 2024 and reversed by 2025: DeepSeek-V3 at 88.5 MMLU is higher than any dense open model of comparable cost.
For a model with 47 billion (or 141 billion) parameters that is intended for the open-source community, the inference story matters as much as the model itself. Mixtral was unusually well supported from day one because it shared its tokenizer and attention stack with Mistral 7B.
Hugging Face Transformers added a MixtralForCausalLM class on December 12, 2023, the day after the public announcement. The implementation uses standard PyTorch matrix multiplications for the experts, gathers token activations per expert, and is therefore not particularly fast for batch size 1, but it is correct and easy to fine-tune.
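A minimal generation call with that support looks as follows (assuming enough GPU memory for the bfloat16 weights, or a quantized load as described below):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain top-2 expert routing in two sentences."}]
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tok.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```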
vLLM added Mixtral support in version 0.2.7 (December 2023), using fused MoE kernels initially borrowed from Megablocks. vLLM's PagedAttention combined with batched expert dispatch lets a single A100 80GB or H100 80GB serve Mixtral 8x7B at hundreds of tokens per second per request, and even higher aggregate throughput across many concurrent requests. As of 2025 vLLM remains the most common production server for Mixtral.
llama.cpp added Mixtral support on December 13, 2023, two days after release, including GGUF quantization. The community user TheBloke published quantized GGUF files within days. A 4-bit quantized Mixtral 8x7B (Q4_K_M) is approximately 26 GB and runs on a single 24 GB consumer GPU with some CPU offload, on dual 24 GB GPUs without offload, or fully on CPU at a few tokens per second using AVX-512 or Apple Silicon. Ollama packages a Mixtral 8x7B model that wraps the llama.cpp backend and is one of the simplest ways for a developer to run Mixtral locally.
Quantization has been particularly important for Mixtral because of the unusual memory profile. Full bfloat16 Mixtral 8x7B is roughly 93 GB; 4-bit quantization brings it to about 24 to 26 GB; 2-bit experimental quantizations have been demonstrated as low as 14 GB at significant quality cost. Mixtral 8x22B at full precision needs roughly 280 GB; 4-bit Q4_K_M is approximately 80 GB and requires multiple consumer GPUs. Community experiments found that MoE models tolerate aggressive quantization slightly better than dense models because rarely-activated experts contribute less to any given token's prediction, so quantization error in those experts is partly hidden.
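As an example of the 4-bit path through transformers and bitsandbytes (one common recipe; llama.cpp/GGUF and AWQ/GPTQ achieve similar footprints by different means):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # activations/compute stay in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",                      # spreads layers across available GPUs/CPU
)
```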
Other engines. TensorRT-LLM, MLC-LLM, ExLlamaV2, AutoAWQ, GPTQ-for-LLaMA, and Apple's MLX framework all added Mixtral support during 2024. AWS released Mixtral 8x22B as part of Amazon SageMaker JumpStart on April 23, 2024, and Together AI, Anyscale, Replicate, Fireworks, and Groq all offered hosted inference, in many cases at lower per-token cost than GPT-3.5 Turbo.
Within Mistral's own product line, Mixtral 8x7B originally powered the mistral-small endpoint on La Plateforme (Mistral's developer API), priced as the company's mid-tier model, and was later exposed as open-mixtral-8x7b once the named tiers moved to proprietary models. Mixtral 8x22B was served as open-mixtral-8x22b from April 2024, positioned below the proprietary Mistral Large and, from July 2024, Mistral Large 2 at the top of the lineup.
Le Chat, Mistral's consumer chatbot launched in beta in February 2024, used Mixtral 8x7B as its default model for free users at launch and offered Mixtral 8x22B and the proprietary Mistral Next on a paid tier. Le Chat later moved to newer Mistral models (Mistral Large, then Mistral Large 2, then Mistral 3) but Mixtral remained the default for several months.
Le Chat Enterprise, the business version aimed at corporate deployments, also surfaced Mixtral 8x22B as a self-hosted option for customers who wanted on-premises inference under Apache 2.0 without depending on Mistral's hosted API. This open-weight option turned out to be one of the main commercial draws of the Mixtral family for European enterprises with strict data residency rules.
As of 2025 and 2026, Mistral has progressively deprecated Mixtral on La Plateforme in favor of Mistral Large 2, Mistral Medium 3, Ministral, and the Mistral Small 3 family. Both Mixtral models remain freely downloadable, fine-tunable, and self-hostable; the company has stated that Apache models continue to be available via the mistral-inference and mistral-finetune SDKs even after the corresponding hosted endpoints are turned off.
Mixtral 8x7B is widely credited with starting the open-source mixture-of-experts era. Before December 2023, the strongest publicly described MoE language models (GShard, Switch Transformer, GLaM) were all internal Google projects. Mixtral was the first MoE language model with strong benchmark scores, full open weights, a permissive license, and same-week support in the standard inference stack. Within five months of Mixtral 8x7B's release, the open community shipped:

- Qwen 1.5 MoE A2.7B (Alibaba, March 2024), a small 14.3B-total / 2.7B-active MoE under Apache 2.0
- DBRX (Databricks, March 2024), a 132B-total / 36B-active MoE with 16 experts and top-4 routing
- Grok-1 (xAI, March 2024), whose 314B-parameter top-2 MoE weights were released under Apache 2.0
- Snowflake Arctic (April 2024), a 480B hybrid dense-plus-MoE model with 128 experts
- Mixtral 8x22B itself (April 2024)
- DeepSeek-V2 (May 2024), a 236B-total / 21B-active MoE with fine-grained experts
And by the end of 2024 and into 2025, DeepSeek-V3 (671B / 37B active) and the Mistral / Llama / Qwen successors all converged on sparse MoE as the default architecture for the largest open-weight models. DeepSeek-V3's technical report cites Mixtral as the proof of concept that motivated the company's MoE work. Llama 4 (2025) also moved Meta's flagship line to MoE.
Mixtral 8x7B became one of the most actively fine-tuned base models of 2024. Notable community-released variants include Nous Research's Nous Hermes 2 Mixtral 8x7B (in SFT and DPO flavors), the Dolphin Mixtral series from Cognitive Computations, and hundreds of further fine-tunes and merges published on the Hugging Face Hub.
The permissive Apache 2.0 license (no field-of-use restriction, no monthly active user threshold like Llama's) made Mixtral particularly attractive for downstream commercial use. Several startups built products directly on Mixtral fine-tunes during 2024, including coding tools, translation services, and chat applications.
Mixtral inspired a new category of model engineering called MoE merging, sometimes nicknamed "FrankenMoE" or "MoErge." The MergeKit library by Charles Goddard at Arcee AI added support for assembling custom MoE models out of multiple existing fine-tuned dense Mistral 7B models. The recipe is to take the attention and norm weights from a base model, plug different fine-tuned models in as the eight experts, and then either initialize the router from positive and negative example prompts or train it on a small dataset. The result is a Mixtral-architecture model whose experts have known specializations (one fine-tuned on math, one on code, one on roleplay, etc.). FrankenMoE models such as Beyonder and Mixtral_AI_Cyber became popular community releases through 2024 and demonstrated that Mixtral's architecture was useful even without retraining from scratch.
In August 2024, MLCommons added Mixtral 8x7B as the official MoE workload in MLPerf Inference v4.1. The choice cemented Mixtral as the reference model for MoE inference hardware comparisons. By v5.0, every major hardware vendor (NVIDIA, AMD, Intel, Google, Cerebras, SambaNova, Untether) had submitted Mixtral 8x7B numbers.
The routing pattern at the heart of MoE makes small-batch inference inefficient. A single token at a single layer activates only two experts; the other six experts contribute nothing to that forward pass but still occupy memory and bandwidth. At batch size one, the active compute fraction can drop below 25% of theoretical peak FLOPs because each matrix multiply per expert is a small operation that fails to saturate the GPU. As batch size grows, more tokens reach each expert, the matrix multiplies get larger, and utilization rises. Mixtral therefore reaches its full speed advantage in serving scenarios with high concurrency, not in single-user chat sessions. This is one reason production Mixtral deployments use vLLM or TensorRT-LLM with continuous batching; without it, the speedup over a dense 13B model is much smaller.
Despite the load balancing loss, real Mixtral checkpoints show some imbalance. Some experts in some layers are selected slightly more often than others. Expert pruning experiments have shown that removing the least-used expert from a few middle layers degrades quality only modestly, suggesting that those experts are partly redundant with their neighbors. Conversely, removing the most-used expert from the same layers causes a much larger quality drop. This kind of post-hoc analysis has informed the design of follow-up MoE models such as DeepSeek-MoE, which use shared experts that are always active to absorb the parts of the computation that benefit from a single shared specialist.
Researchers have argued about how to count parameters for MoE models in scaling-law contexts. The classic Chinchilla scaling law was derived for dense models and uses total parameters. For MoE, it is not obvious whether to substitute total parameters, active parameters, or some compromise. Empirical results from Mixtral, DBRX, and DeepSeek suggest that active parameters underestimate effective capacity (a Mixtral 8x7B at 12.9B active is clearly stronger than a dense 13B), but total parameters overestimate it (Mixtral 8x7B at 47B total is weaker than a dense 47B would be). A common rule of thumb is to compute the geometric mean of active and total parameters as an effective dense-equivalent size, which puts Mixtral 8x7B around 25 billion equivalents and Mixtral 8x22B around 75 billion equivalents.
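Under that heuristic (a community rule of thumb, not a derived law), the figures quoted above follow directly:

```python
import math

for name, active, total in [("Mixtral 8x7B", 12.9e9, 46.7e9),
                            ("Mixtral 8x22B", 39e9, 141e9)]:
    # geometric mean of active and total parameters as a dense-equivalent size
    print(f"{name}: ~{math.sqrt(active * total) / 1e9:.0f}B dense-equivalent")
```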
Mixtral has several recognized limitations:

- The entire parameter set must be held in memory even though only a fraction is active per token, so the VRAM requirement matches a ~47B (or ~141B) dense model.
- Single-user, small-batch inference underutilizes the hardware; the speed advantage over dense models appears mainly in high-concurrency serving.
- Officially supported languages are limited to English, French, German, Spanish, and Italian, with weaker coverage elsewhere.
- Real-world long-context performance is weaker than the synthetic passkey retrieval result suggests.
- The training data, token count, and compute budget are undisclosed, which limits reproducibility and auditing.
- Expert routing tracks syntax and position rather than interpretable topics, and some experts remain partly redundant despite the load balancing loss.
After the Mixtral releases, Mistral pursued a parallel track of dense and MoE models, mostly under proprietary licenses for the largest releases.
| Model | Date | Architecture | Params (active / total) | License |
|---|---|---|---|---|
| Mistral Large | Feb 2024 | Dense | undisclosed | Proprietary |
| Mistral Large 2 (2407) | Jul 2024 | Dense | 123B | Mistral Research / Commercial |
| Mistral NeMo | Jul 2024 | Dense | 12B | Apache 2.0 |
| Codestral 22B | May 2024 | Dense | 22B | Mistral Non-Production |
| Codestral Mamba | Jul 2024 | Mamba (state space) | 7.3B | Apache 2.0 |
| Mathstral 7B | Jul 2024 | Dense | 7.3B | Apache 2.0 |
| Pixtral 12B | Sep 2024 | Multimodal dense | 12B | Apache 2.0 |
| Ministral 3B / 8B | Oct 2024 | Dense | 3B / 8B | Mistral Research / Commercial |
| Mistral Small 3 | Jan 2025 | Dense | 24B | Apache 2.0 |
| Mistral Large 3 | 2025 | Dense | undisclosed | Proprietary |
Mistral has not released a direct "Mixtral 3" successor in the same naming line. The company appears to use sparse MoE inside its proprietary frontier models (Mistral Large 3 and Mistral Medium 3 reportedly use MoE techniques internally) but has not released another Apache-licensed MoE checkpoint as of early 2026. Mixtral 8x7B and 8x22B therefore remain the most recent Mistral MoE models with public weights.