Mixtral 8x22B
Last reviewed: May 16, 2026
Sources: 14 citations
Review status: Source-backed
Revision: v1 · 3,594 words
Mixtral 8x22B is a sparse mixture-of-experts large language model released by the French AI company Mistral AI in April 2024. It is the second and largest member of the Mixtral family of mixture-of-experts models, following the smaller Mixtral 8x7B released in December 2023. The model has approximately 141 billion total parameters but uses only about 39 billion per token at inference, thanks to a sparse routing design that activates two of eight experts for each token. Mistral published the weights for both a base checkpoint and an instruction-tuned checkpoint under the Apache 2.0 license, making Mixtral 8x22B one of the most permissively licensed frontier-class open-weight models available at the time of release.
Mistral first made the model available on April 10, 2024 by posting a torrent magnet link on the social media platform X, a distribution method the company had used previously for Mixtral 8x7B. A formal launch blog post followed on April 17, 2024, accompanied by the instruction-tuned variant and an official Hugging Face release. The base model is published as mistralai/Mixtral-8x22B-v0.1 and the instruct version as mistralai/Mixtral-8x22B-Instruct-v0.1. The context window is 64K (65,536) tokens, the model is fluent in five languages (English, French, Italian, German, and Spanish), and the instruct version was trained natively for function calling.
At launch, Mixtral 8x22B was positioned as the strongest open-weight model on coding and mathematics benchmarks, with reported scores of about 77.3 percent on MMLU and 88.9 percent on HellaSwag for the base model, and 90.8 percent on GSM8K (maj@8) for the instruction-tuned version. The model competed directly with other large open-weight releases from the same few weeks: Meta's Llama 3 70B, Databricks' DBRX (released in late March), and Cohere's Command R+. Mistral titled the launch blog post 'Cheaper, Better, Faster, Stronger,' emphasizing the inference-cost advantage of a sparse MoE over a dense model of similar quality.
Mistral AI was founded in April 2023 in Paris by Arthur Mensch, Guillaume Lample, and Timothée Lacroix, three former research scientists from Google DeepMind and Meta AI. The company quickly became the public face of European foundation-model development, building a reputation for releasing capable open-weight models on tighter compute budgets than its U.S. peers. Mistral's first major release was the dense 7-billion-parameter Mistral 7B in September 2023, distributed via torrent under Apache 2.0. The release set the template that would define Mistral's first year: surprise distribution through a magnet link on social media, open weights under a permissive license, and a focus on raw performance per parameter.
In December 2023, Mistral released Mixtral 8x7B, the first open-weight sparse mixture-of-experts model to attract widespread attention. Mixtral 8x7B paired eight experts per layer with top-two routing, for roughly 47 billion total parameters and about 13 billion active per token (less than the naive 8 × 7 = 56 billion, because only the feed-forward blocks are replicated across experts). The model matched or exceeded Llama 2 70B and the original GPT-3.5 on most public benchmarks while requiring much less inference compute, and it became one of the most widely deployed open-weight models in the first half of 2024. The Mixtral 8x7B release validated the mixture-of-experts approach as a practical path to capable open-weight models, and several other labs subsequently shipped their own sparse-MoE flagships.
Mixtral 8x22B follows directly from that success. Where Mixtral 8x7B had roughly matched the active-parameter count of Llama 2 13B while delivering near-70B performance, Mixtral 8x22B scaled the total parameter count up substantially while keeping the same eight-experts-with-top-two-routing architecture. The intent was to deliver performance competitive with the strongest closed models of the time while keeping inference costs in the same range as a dense 39-billion-parameter model. The release also arrived in a remarkably crowded period: Meta's Llama 3 8B and 70B shipped on April 18, 2024, one day after Mistral's announcement, and Databricks had released DBRX in late March. Cohere's Command R+ had launched on April 4, 2024 with a 104-billion-parameter dense architecture and similar enterprise positioning.
Mixtral 8x22B uses the same general design as Mixtral 8x7B but scaled to a much larger expert size. The model is a decoder-only transformer with sparse mixture-of-experts feed-forward blocks substituted for the dense FFN layers found in most transformer language models. There are 56 transformer layers in total, and each MoE feed-forward block contains eight experts, with a router selecting the top two experts for each token. The '8x22B' name refers to eight experts at a nominal scale of about 22 billion parameters each, but the total parameter count is not eight times 22 billion. Only the feed-forward blocks are replicated per expert; the attention layers, embeddings, and normalization weights are shared by all experts, which keeps the total at roughly 141 billion rather than the naive 176 billion.
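The routing step described above can be illustrated with a toy example. The following is a minimal sketch, not Mistral's implementation: it assumes the common formulation in which the router's two largest logits are passed through a softmax and used to weight the outputs of the two selected experts. The function names and the toy experts are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top2 = order[:2]
    # Assumption: weights are a softmax over the two selected logits only.
    weights = softmax([router_logits[i] for i in top2])
    return list(zip(top2, weights))

def moe_forward(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs for one token."""
    out = [0.0] * len(x)
    for idx, w in top2_route(router_logits):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy stand-ins for the eight expert FFNs: expert k scales its input by (k + 1).
experts = [lambda v, k=k: [(k + 1) * vi for vi in v] for k in range(8)]

logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]  # one logit per expert
print(top2_route(logits))                # experts 1 and 4 are selected
print(moe_forward([1.0, 2.0], logits, experts))
```

Only two of the eight expert functions are ever called for a given token, which is the source of the gap between total and active parameter counts.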
The model uses grouped-query attention with 48 query heads and 8 key-value heads, rotary position embeddings (RoPE), RMSNorm for normalization, and SwiGLU activations in the expert feed-forward networks. The published configuration on Hugging Face lists a hidden dimension of 6,144 and an FFN intermediate size of 16,384 per expert. The vocabulary is 32,000 tokens and uses the same SentencePiece-based BPE tokenizer family (with byte fallback) as earlier Mistral and Mixtral models, accessed through the mistral-common library or the Hugging Face Tokenizers integration. Two experts are selected per token, so the active parameter count for any single forward pass is about 39 billion rather than the 141 billion that would be touched in a dense model of the same nominal size.
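The total and active parameter counts can be roughly reproduced from the configuration figures quoted above. This is a back-of-the-envelope sketch under two stated assumptions that the text does not confirm: the head dimension equals the hidden size divided by the number of query heads (128), and the input and output embedding matrices are untied. Biases are assumed absent.

```python
# Configuration figures quoted in the text.
HIDDEN = 6144
FFN = 16384
LAYERS = 56
EXPERTS = 8
ACTIVE_EXPERTS = 2
Q_HEADS = 48
KV_HEADS = 8
VOCAB = 32000

HEAD_DIM = HIDDEN // Q_HEADS  # assumption: 6144 / 48 = 128

def attention_params():
    q = HIDDEN * Q_HEADS * HEAD_DIM
    kv = 2 * HIDDEN * KV_HEADS * HEAD_DIM  # grouped-query: fewer K/V heads
    o = Q_HEADS * HEAD_DIM * HIDDEN
    return q + kv + o

def expert_params():
    # SwiGLU feed-forward: gate, up, and down projections.
    return 3 * HIDDEN * FFN

def count_params(experts_used):
    router = HIDDEN * EXPERTS
    norms = 2 * HIDDEN  # two RMSNorm weight vectors per layer
    per_layer = attention_params() + experts_used * expert_params() + router + norms
    embeddings = 2 * VOCAB * HIDDEN  # assumption: untied input/output embeddings
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm

total = count_params(EXPERTS)
active = count_params(ACTIVE_EXPERTS)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the estimate comes to about 140.6B total and 39.2B active, consistent with the approximate 141B and 39B figures. Nearly all of the total sits in the expert feed-forward blocks, while attention, embeddings, and norms are counted once regardless of how many experts are used.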
The context window is 65,536 tokens (64K), which was substantially larger than that of most open-weight models in April 2024. Unlike the original Mistral 7B, Mixtral 8x22B did not initially use sliding-window attention. The model is distributed in BF16 by default, and the full checkpoint weighs in at roughly 281 gigabytes, putting it out of reach of single-GPU serving but well within the capacity of a multi-GPU node with adequate memory. Several quantized community releases in Q4, Q5, and Q8 GGUF formats appeared on Hugging Face within days of the initial torrent drop.
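The checkpoint size follows directly from the parameter count and the bits stored per weight. The sketch below is a rough estimate only: it uses the rounded 141B total, treats Q8, Q5, and Q4 as exactly 8, 5, and 4 bits per weight (real GGUF files carry extra scale metadata and run somewhat larger), and assumes a hypothetical 80 GB accelerator while ignoring KV cache and activations.

```python
import math

TOTAL_PARAMS = 141e9  # rounded total parameter count

# Nominal bits per weight; quantization overhead is ignored (assumption).
FORMATS = {"BF16": 16, "Q8": 8, "Q5": 5, "Q4": 4}

def weight_size_gb(params, bits_per_weight):
    """Size of the raw weights in gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

def min_gpus(size_gb, gpu_memory_gb=80):
    """Lower bound on GPUs needed to hold the weights alone."""
    return math.ceil(size_gb / gpu_memory_gb)

sizes = {name: weight_size_gb(TOTAL_PARAMS, bits) for name, bits in FORMATS.items()}
for name, gb in sizes.items():
    print(f"{name:>4}: {gb:6.1f} GB, at least {min_gpus(gb)} x 80 GB GPU(s)")
```

With the rounded parameter count, the BF16 estimate is 282 GB, close to the roughly 281 GB cited; the small difference comes from rounding 141B.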
The instruction-tuned variant added special tokens for tool use, including [TOOL_CALLS], [AVAILABLE_TOOLS], [/AVAILABLE_TOOLS], [TOOL_RESULTS], and [/TOOL_RESULTS]. Mistral explicitly designed the model to be natively capable of function calling without a separate fine-tuning stage on top, which differentiated it from many earlier open-weight models that required community fine-tunes to handle structured output and tool selection reliably.
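To make the role of these tokens concrete, the following sketch shows how tool definitions, tool calls, and tool results might be delimited. It is illustrative only and operates on plain strings: the exact ordering, whitespace, and JSON schema are defined by the model's official chat template and the mistral-common library, and in real use the markers are encoded as dedicated control tokens rather than ordinary text. The helper names and the `get_weather` tool are hypothetical.

```python
import json

def render_tools(tools):
    """Wrap JSON tool schemas in the available-tools markers."""
    return f"[AVAILABLE_TOOLS]{json.dumps(tools)}[/AVAILABLE_TOOLS]"

def render_tool_result(result):
    """Wrap a tool's return value in the tool-results markers."""
    return f"[TOOL_RESULTS]{json.dumps(result)}[/TOOL_RESULTS]"

def parse_tool_calls(completion):
    """Return the list of requested calls, or None for a plain-text reply."""
    marker = "[TOOL_CALLS]"
    text = completion.strip()
    if not text.startswith(marker):
        return None
    return json.loads(text[len(marker):])

# Hypothetical tool schema and model output, for illustration only.
tools = [{"type": "function",
          "function": {"name": "get_weather",
                       "parameters": {"type": "object",
                                      "properties": {"city": {"type": "string"}}}}}]
completion = '[TOOL_CALLS] [{"name": "get_weather", "arguments": {"city": "Paris"}}]'

print(render_tools(tools))
print(parse_tool_calls(completion))
print(render_tool_result({"temperature_c": 18}))
```

For production use, the official tokenizer tooling should build and parse these sequences rather than string concatenation like this.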
Mistral has historically been more restrained about disclosing training details than several of its open-weight peers, and Mixtral 8x22B is no exception. The launch blog post does not specify the pretraining token count, the data mixture, the optimizer settings, or the compute budget. The Hugging Face model card likewise reports only the architectural configuration and licensing terms, and Mistral has not published a research paper or detailed technical report for the model. The available public information is therefore narrow, and most discussions of the training procedure draw inferences from architecture details and from what is known about earlier Mixtral models.
The broad outlines that Mistral has confirmed include that the model is a pretrained base checkpoint with a separately released instruction-tuned variant. The instruction tuning targeted general conversational ability, function calling, and structured output rather than long chain-of-thought reasoning. Mistral has stated that pretraining covered the five primary supported languages (English, French, Italian, German, and Spanish), with sufficient coverage to outperform Llama 2 70B on translated versions of HellaSwag, ARC Challenge, and MMLU in French, German, Spanish, and Italian. Code data was included in the pretraining mixture, which is consistent with the model's strong performance on HumanEval and MBPP.
The model is released as version 0.1 in both base and instruct forms, with the file naming Mixtral-8x22B-v0.1 and Mixtral-8x22B-Instruct-v0.1. The v0.1 designation is consistent with Mistral's release versioning practice for its earlier models and does not imply a particularly early stage of training; it is the standard initial public release.
Mixtral 8x22B was evaluated on the standard suite of open-weight LLM benchmarks at launch, and Mistral published a set of comparison plots showing the model's performance against other open models in its class. The table below summarizes the most widely cited numbers, drawn from Mistral's launch blog post and from the community benchmark thread on Hugging Face. Scores are for the base model unless noted.
| Benchmark | Score | Notes |
|---|---|---|
| MMLU (5-shot) | ~77.3% | Massive Multitask Language Understanding |
| HellaSwag | 88.9% | Commonsense reasoning, acc_norm |
| ARC Challenge | 70.5% | AI2 Reasoning Challenge, acc_norm |
| Winogrande | 79.8% | Commonsense reasoning |
| PIQA | 84.9% | Physical commonsense |
| BoolQ | 87.8% | Yes/no question answering |
| GSM8K (maj@8) | 90.8% | Grade-school math word problems, instruct version |
| MATH (maj@4) | 44.6% | Competition math, instruct version |
| HumanEval | Not tabulated (chart only) | Python code generation, pass@1; Mistral reported a lead over other open models |
| MBPP | Not tabulated (chart only) | Python code generation; Mistral reported a lead over other open models |
| MT-Bench | 8.66 | Instruct version, multi-turn dialogue |
At launch, Mistral framed the model as the strongest open-weight model on coding and mathematics benchmarks. On GSM8K with majority voting at 8 samples, the instruct version reached 90.8 percent, which was a clear lead over Llama 2 70B and most other open-weight peers available at the time. On HumanEval and MBPP for code generation, Mistral reported that Mixtral 8x22B outperformed all other open models in its evaluation set, although exact pass@1 numbers were given as bars on a chart rather than as a tabular score in the blog post. On the MATH benchmark with majority voting at 4 samples, the model reached 44.6 percent, which was likewise a leading number among open-weight models in April 2024.
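The maj@k metric used for GSM8K and MATH samples k solutions per problem and scores the most frequent final answer. The sketch below shows the general idea under stated assumptions; it is not Mistral's evaluation harness. Answers are compared as exact strings, unparseable samples are represented as `None`, and ties go to the answer seen first. The sample data is made up for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Most common final answer among the sampled solutions (None if all failed)."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def maj_at_k(samples_per_problem, references):
    """Fraction of problems whose majority answer matches the reference."""
    correct = sum(majority_vote(samples) == ref
                  for samples, ref in zip(samples_per_problem, references))
    return correct / len(references)

# Hypothetical final answers from 8 samples for each of three problems.
samples = [
    ["18", "18", "20", "18", "18", "17", "18", "18"],
    ["5", "7", "7", "5", "5", "7", "7", None],
    ["3", "3", "4", "3", "4", "3", "3", "4"],
]
references = ["18", "7", "4"]

print(maj_at_k(samples, references))  # 2 of 3 problems correct
```

Because voting can recover from individual wrong samples, maj@8 scores are typically higher than single-sample accuracy and are not directly comparable to it.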
On the multilingual suite, Mistral published scores for HellaSwag, ARC Challenge, and MMLU translated into French, German, Spanish, and Italian, showing that Mixtral 8x22B outperformed Llama 2 70B on every language-benchmark pair. This multilingual headroom was a deliberate result of the pretraining data mixture and was one of the model's clearer differentiators compared with the contemporaneous Llama 3 release, which had a strongly English-centric pretraining recipe.
For the instruction-tuned variant, the most widely cited public score is the MT-Bench result of 8.66, reported in the community benchmark thread. That figure placed Mixtral-8x22B-Instruct-v0.1 below Claude 3 Opus (9.43), GPT-4-1106-Preview (9.32), Claude 3 Sonnet (9.18), and Microsoft's fine-tune WizardLM-2 8x22B (9.12), but ahead of most other open-weight chat models at the time. Mixtral-8x22B-v0.1 was also the top-ranked pretrained model on the Open LLM Leaderboard at the time of release.
Where the model trailed contemporaries was on English-centric knowledge and coding benchmarks. Llama 3 70B, released the day after Mixtral 8x22B, posted higher MMLU and HumanEval numbers, particularly in its instruction-tuned form. Mixtral 8x22B's value proposition therefore rested on cost efficiency, multilingual coverage, and open license rather than on outright benchmark leadership in English.
Mixtral 8x22B is released under the Apache 2.0 license, which is one of the most permissive open-source licenses in widespread use. Apache 2.0 permits commercial use, modification, redistribution, private use, and sublicensing without royalty obligations, and it does not require derivative works to be released under the same license. The license also includes an explicit patent grant from contributors, which provides some protection against patent litigation for users of the model.
This was a notable choice at the time. The other big open-weight release of April 2024, Llama 3, came with a custom Meta Llama 3 license that imposed several use restrictions, including a clause requiring large platforms (over 700 million monthly active users) to obtain a separate commercial license from Meta. DBRX was released under the Databricks Open Model License, which similarly carried use restrictions. Cohere's Command R+ was released under a non-commercial CC-BY-NC-4.0 license for research use only. Against this comparison set, Mixtral 8x22B's Apache 2.0 release was the most permissive commercially available option in its capability tier.
Mistral framed the Apache 2.0 choice as continuing its commitment to open-source software. The company has used Apache 2.0 for most of its open-weight releases; the notable exceptions are Mistral Large 2 in 2024 and a handful of other models released under the Mistral Research License. The Apache 2.0 designation made Mixtral 8x22B usable in enterprise settings without separate commercial licensing arrangements, which was a meaningful advantage for production deployment.
Both the base and instruct weights are subject to the same license, and Mistral has been clear that fine-tuning and continued pretraining on the published weights is permitted. Several third-party fine-tunes appeared on Hugging Face within days of the release, including Microsoft's WizardLM-2 8x22B (which posted an MT-Bench score above the official Mistral instruct version), Nous Research's Hermes fine-tunes, and several domain-specific adaptations for coding, math, and biomedical text.
April 2024 was an unusually busy month for big open-weight model releases, and Mixtral 8x22B shipped into direct competition with several models in roughly the same capability tier. The table below collects headline specifications for the most relevant peers. Some figures are approximate, and benchmark scores are for the instruction-tuned variants where available.
| Model | Developer | Architecture | Total / active params | Context window | License | Release |
|---|---|---|---|---|---|---|
| Mixtral 8x22B | Mistral AI | Sparse MoE | 141B / 39B | 64K | Apache 2.0 | April 2024 |
| Llama 3 70B | Meta | Dense | 70B / 70B | 8K (initial) | Meta Llama 3 License | April 2024 |
| DBRX | Databricks | Fine-grained MoE | 132B / 36B | 32K | Databricks Open Model License | March 2024 |
| Command R+ | Cohere | Dense | 104B / 104B | 128K | CC-BY-NC-4.0 | April 2024 |
| Mixtral 8x7B | Mistral AI | Sparse MoE | 47B / 13B | 32K | Apache 2.0 | December 2023 |
Against Llama 3 70B, Mixtral 8x22B traded raw English benchmark scores for inference efficiency and multilingual coverage. Llama 3 70B led on MMLU, HumanEval, and several other English-centric suites in the instruction-tuned comparison, while Mixtral 8x22B led on French, German, Spanish, and Italian benchmarks. Llama 3 70B's initial 8K context window was substantially smaller than Mixtral 8x22B's 64K, although Meta extended the context length in later variants. On inference cost, the sparse MoE design gave Mixtral 8x22B a clear advantage per token at comparable batch sizes: decoding touches about 39B active parameters rather than 70B, which is meaningfully cheaper when memory bandwidth is the binding constraint. The trade-off is capacity, since all 141B parameters must still be resident in memory.
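The bandwidth argument can be made concrete with simple arithmetic. The sketch below is a simplified model, assuming BF16 weights, a single sequence being decoded (batch size 1), and no KV-cache or activation traffic. At larger batch sizes, different tokens route to different experts, so more of the expert weights are read per step and the advantage shrinks.

```python
BYTES_PER_PARAM = 2  # BF16 weights

# Approximate parameter counts from the comparison table.
MODELS = {
    "Mixtral 8x22B": {"total": 141e9, "active": 39e9},
    "Llama 3 70B": {"total": 70e9, "active": 70e9},
}

def resident_gb(model):
    """Memory needed just to hold all weights."""
    return model["total"] * BYTES_PER_PARAM / 1e9

def read_per_token_gb(model):
    """Weight bytes read to decode one token for one sequence."""
    return model["active"] * BYTES_PER_PARAM / 1e9

for name, model in MODELS.items():
    print(f"{name}: {resident_gb(model):.0f} GB resident, "
          f"{read_per_token_gb(model):.0f} GB read per token")
```

Under these assumptions Mixtral 8x22B reads 78 GB of weights per decoded token against 140 GB for the dense 70B model, but needs about twice as much memory to hold the full checkpoint (282 GB versus 140 GB).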
Against DBRX, the comparison was closer in architectural terms. DBRX uses a fine-grained mixture-of-experts design with 16 experts and 4 active per token, which contrasts with Mixtral 8x22B's 8 experts and 2 active. DBRX has about 132 billion total parameters and 36 billion active, slightly smaller than Mixtral 8x22B in both dimensions. Public benchmark comparisons typically gave Mixtral 8x22B a small lead on most reasoning and coding benchmarks, although the gap narrowed on math and on long-context tasks where DBRX's specific training mixture helped. The licensing comparison was more favorable to Mixtral 8x22B; the Apache 2.0 release removed the use restrictions present in the Databricks Open Model License.
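One way to quantify the difference between the two routing schemes is to count how many distinct sets of experts a token can be sent to in each MoE layer. This is a small combinatorial illustration, not a measure of model quality.

```python
from math import comb

mixtral_combos = comb(8, 2)   # choose 2 of 8 experts
dbrx_combos = comb(16, 4)     # choose 4 of 16 experts

print(mixtral_combos)                 # 28
print(dbrx_combos)                    # 1820
print(dbrx_combos // mixtral_combos)  # 65
```

The fine-grained design therefore offers 65 times as many possible expert combinations per layer, at the cost of routing each token through more, smaller experts.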
Against Command R+, the design choice was the most divergent. Command R+ was a dense 104B model with a 128K context window, optimized for enterprise RAG and tool use rather than for raw benchmark leadership. Command R+'s longer context window was an advantage for long-document tasks, but its dense architecture made inference more expensive per token than Mixtral 8x22B's sparse design. The CC-BY-NC-4.0 license restricted Command R+ to research and non-commercial use, while Mixtral 8x22B's Apache 2.0 release was usable in production. Most independent comparisons placed the two models in roughly the same capability tier on standard benchmarks, with the choice between them often driven by license and context-length requirements rather than by quality.
Against its smaller sibling Mixtral 8x7B, the new model offered a step up in both quality and inference cost. The 39B active parameter count was three times the 13B active count of Mixtral 8x7B, which translated to higher serving costs but also to substantially better performance on reasoning, math, and code benchmarks. Many deployments that did not need the additional headroom continued to run Mixtral 8x7B for its lower inference footprint, while Mixtral 8x22B was reserved for higher-quality use cases. Both models share the same eight-experts-with-top-two-routing design, which made it relatively easy to migrate code and serving infrastructure between them.
The initial reception of Mixtral 8x22B was strongly positive across the open-source community. The April 10 magnet link drop generated considerable excitement on X and Reddit, with developers downloading and quantizing the weights within hours of the initial post. The formal April 17 launch consolidated this momentum with a detailed blog post and the simultaneous release of the instruction-tuned variant, which made the model immediately usable for production-style evaluation without waiting for community fine-tunes.
Developer-focused outlets generally framed the release as a successful continuation of the open-source MoE story that Mistral had started with Mixtral 8x7B. Analytics Vidhya, DataCamp, NVIDIA Technical Blog, and AWS Machine Learning Blog all covered the release within the first month, with consistent emphasis on the cost-efficiency story and the Apache 2.0 license. NVIDIA highlighted Mixtral 8x22B's integration into the company's NIM and API platform, and AWS announced availability of the model on SageMaker JumpStart in May 2024. The model was also added to most major inference platforms within days of release, including Together AI, Fireworks, OpenRouter, and Mistral's own La Plateforme.
Reception in the academic and research community focused on the architectural choices and the multilingual benchmark headroom. Several researchers noted that the eight-experts-with-top-two-routing design had now been validated at two distinct scales (Mixtral 8x7B and Mixtral 8x22B), which provided useful evidence for designers of subsequent sparse-MoE systems. The multilingual benchmark scores on French, German, Spanish, and Italian were widely cited as a counterexample to the assumption that strong multilingual performance required dedicated multilingual pretraining recipes.
The most pointed criticism was that Mistral's launch framing overstated the model's lead: once Llama 3 70B became available the following day, the claimed advantage on English-only benchmarks largely disappeared. Independent reviewers running head-to-head comparisons on instruction-following benchmarks often found Llama 3 70B Instruct slightly ahead on most English-centric tasks, although the gap was generally small and varied by benchmark. The narrative of Mixtral 8x22B as the unambiguous open-weight leader did not hold for long, and the model settled into a more nuanced position as one of several strong open-weight options, distinguished primarily by multilingual coverage, inference efficiency, and license terms.
A secondary criticism concerned the lack of technical documentation. Mistral did not publish a research paper for Mixtral 8x22B, did not disclose the training data mixture, did not specify the compute budget, and did not provide a detailed architecture description beyond the high-level launch blog post. Several researchers compared this unfavorably with the more detailed technical reports that accompanied DBRX and Llama 3. The criticism mirrored similar complaints about earlier Mistral releases and reflected an ongoing tension in the open-weight community between Mistral's pragmatic 'release weights and let the community figure it out' approach and the more documentation-heavy practices of some peers.
Within the broader European AI conversation, the release was framed as a continuing signal that Mistral was capable of competing with the largest U.S. labs at frontier scale. Mistral's announcement of a $640 million Series B funding round in June 2024 cited Mixtral 8x22B and the company's open-source flywheel as central to the investment thesis. The model also became one of the most downloaded large open-weight checkpoints on Hugging Face in the months following its release, with community fine-tunes accounting for a significant share of the additional downloads.
Mistral subsequently shifted its flagship release strategy toward dense models with the launch of Mistral Large 2 in July 2024, and later returned to MoE at much larger scale with Mistral Large 3 in December 2025. Mixtral 8x22B itself remained the strongest open-weight MoE model in Mistral's lineup until that point and continued to see deployment in production settings well into 2025. Mistral documentation lists a retirement date of March 30, 2025 for the hosted Mixtral 8x22B API endpoint, with Mistral Small as the recommended replacement, although the open weights remain available for self-hosting under Apache 2.0 indefinitely.