Mixtral 8x22B
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,823 words
Mixtral 8x22B is a sparse mixture-of-experts (MoE) large language model released by the French AI company Mistral AI in April 2024. It is the second and largest member of the Mixtral family of mixture-of-experts models, following the smaller Mixtral 8x7B released in December 2023.[1][2] The model has approximately 141 billion total parameters but activates only about 39 billion per token at inference, thanks to a sparse routing design in which a learned router selects two of eight expert feed-forward networks for each token.[1] Mistral published the weights for both a base checkpoint and an instruction-tuned checkpoint under the Apache 2.0 license, making Mixtral 8x22B one of the most permissively licensed frontier-class open-weight models available at the time of release.[3][4]
Mistral first made the model available on April 10, 2024 by posting a torrent magnet link to the social media platform X, a distribution method the company had used previously for Mistral 7B and Mixtral 8x7B.[5][6] A formal launch blog post titled "Cheaper, Better, Faster, Stronger" followed on April 17, 2024, accompanied by the instruction-tuned variant Mixtral-8x22B-Instruct-v0.1 and an official Hugging Face release.[1] The base model is published as mistralai/Mixtral-8x22B-v0.1 and the instruct version as mistralai/Mixtral-8x22B-Instruct-v0.1.[3][4] The context window is 65,536 tokens, the model natively supports five languages (English, French, Italian, German, and Spanish), and the instruct version was natively trained for function calling.[1][4]
At launch, Mistral positioned Mixtral 8x22B as the strongest open-weight model on coding and mathematics benchmarks, with reported scores of about 77 percent on MMLU, 88.9 percent on HellaSwag (as reported in the blog post), and 90.8 percent on GSM8K (maj@8) for the instruction-tuned version.[1] The model competed directly with three other major open-weight releases from the same few weeks: Databricks' DBRX from late March, Cohere's Command R+ from early April, and Meta's Llama 3 70B, which arrived one day after Mistral's formal announcement. Mistral's framing of the release emphasized the inference-cost advantage of a sparse MoE relative to a dense model of similar quality.[1]
Mistral AI was founded in April 2023 in Paris by Arthur Mensch, Guillaume Lample, and Timothée Lacroix, three former research scientists from Google DeepMind and Meta AI. The company rapidly became the public face of European foundation-model development, building a reputation for releasing capable open-weight models on tighter compute budgets than its U.S. peers. Mistral's first major release was the dense 7-billion-parameter Mistral 7B in September 2023, distributed via torrent under Apache 2.0. That release set the template that would define Mistral's first year of activity: surprise distribution through a magnet link on social media, open weights under a permissive license, and a focus on raw performance per parameter.
In December 2023, Mistral released Mixtral 8x7B, the first open-weight sparse mixture-of-experts model to attract widespread attention. Mixtral 8x7B combined eight 7-billion-parameter experts with top-two routing to deliver roughly 47 billion total parameters and about 13 billion active per token. The model matched or exceeded Llama 2 70B and the original GPT-3.5 on most public benchmarks while requiring substantially less inference compute, and it became one of the most widely deployed open-weight models in the first half of 2024.[7] The Mixtral 8x7B release validated the mixture-of-experts approach as a practical path to capable open-source models, and several other labs subsequently shipped their own sparse-MoE flagships.
Mixtral 8x22B follows directly from that success. Where Mixtral 8x7B had paired an active-parameter count close to that of Llama 2 13B with near-70B performance, Mixtral 8x22B scaled up to a much larger total parameter count while keeping the same eight-experts-with-top-two-routing architecture.[1][2] The intent was to deliver performance competitive with the strongest closed dense models of the time while keeping inference costs in the same range as a dense 39-billion-parameter model.[1] The release also arrived in a remarkably crowded stretch: Meta's Llama 3 8B and 70B shipped on April 18, 2024, one day after Mistral's formal announcement; Databricks had released DBRX in late March; and Cohere's Command R+ had launched on April 4, 2024 with a 104-billion-parameter dense architecture and similar enterprise positioning.
Mistral's release of Mixtral 8x22B followed the company's now-established pattern of a low-ceremony weights drop followed by a curated marketing rollout. On April 10, 2024, the company posted a torrent magnet link to its X account, allowing the open-source community to begin downloading the raw base checkpoint immediately.[5][6][8] The torrent payload was approximately 281 gigabytes of safetensors files in BF16 precision, and quantized GGUF community releases in 4-bit, 5-bit, and 8-bit formats appeared on Hugging Face within hours of the initial post.[9]
On April 17, 2024, Mistral published its formal launch blog, "Cheaper, Better, Faster, Stronger," and simultaneously released the instruction-tuned Mixtral-8x22B-Instruct-v0.1 on Hugging Face.[1][4] The blog presented the model as the result of an effort to "push the frontier of AI and make it accessible to all," and highlighted four properties: native fluency in English, French, Italian, German, and Spanish; native function-calling capability with a 64K context window; the best cost-performance ratio among open-weight models in its class; and a permissive Apache 2.0 license.[1] The official model card on Hugging Face for the base model was uploaded the same day, formalizing the previously torrent-only release.[3]
The two-step release format gave community fine-tuners a head start. Microsoft's WizardLM team shipped a fine-tune named WizardLM-2-8x22B around the time of the official announcement, and Nous Research published Hermes Pro variants shortly afterward, so the official instruct version faced competing instruction-tuned offshoots almost from the start.[10]
Mixtral 8x22B uses the same general design as Mixtral 8x7B but scaled to a much larger expert size. The model is a decoder-only transformer with sparse mixture-of-experts feed-forward blocks substituted for the dense FFN layers found in standard transformer language models.[11] The published config.json lists 56 transformer layers, and each MoE feed-forward block contains eight expert sub-networks, with a router selecting the top two experts for each token (num_local_experts: 8, num_experts_per_tok: 2).[11] The "8x22B" name suggests eight experts of roughly 22 billion parameters each, but the total parameter count is not eight times 22 billion: only the feed-forward blocks are replicated per expert, while the attention layers, embeddings, and normalization weights are shared by all experts, which keeps the total at roughly 141 billion rather than the naive 176 billion.[1][11]
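The top-two routing described above can be illustrated with a small pure-Python sketch. It is not Mistral's implementation: the toy experts, the example logits, and the choice to normalize weights with a softmax over only the selected logits are assumptions made for illustration.

```python
import math

NUM_EXPERTS = 8  # num_local_experts in the published config
TOP_K = 2        # num_experts_per_tok in the published config


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only,
    so they always sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(token, router_logits, experts):
    """Mix the outputs of the selected experts; the other experts never run."""
    output = [0.0] * len(token)
    for index, weight in route_token(router_logits):
        expert_out = experts[index](token)
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output


# Hypothetical toy experts: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

# Hypothetical router scores for one token; experts 1 and 4 tie for the top.
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
selection = route_token(logits)
mixed = moe_forward([1.0, -1.0], logits, experts)
```

Only two of the eight expert functions are called per token, which is the source of the gap between total and active parameter counts.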
The model uses grouped-query attention (GQA) with 48 query heads and 8 key-value heads (num_attention_heads: 48, num_key_value_heads: 8), rotary position embeddings (RoPE) with theta set to 1,000,000, RMSNorm for normalization, and SwiGLU activations in the expert feed-forward networks.[11] The published configuration lists a hidden dimension of 6,144 and an FFN intermediate size of 16,384 per expert.[11] The vocabulary size is 32,000 tokens, and the model uses the same BPE tokenizer family (SentencePiece-based, with byte fallback) as earlier Mistral and Mixtral models, accessed through the mistral-common library or the Hugging Face Tokenizers integration. Because two experts are selected per token, the active parameter count for any single forward pass is approximately 39 billion rather than the 141 billion that would be touched in a dense model of the same nominal size.[1]
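The total and active parameter counts can be roughly reproduced from the configuration values quoted above. The sketch below is a back-of-the-envelope estimate, not an official breakdown: it assumes bias-free projections, a head dimension equal to the hidden size divided by the number of query heads, and separate input and output embedding matrices.

```python
# Values quoted in the text from the published config.json.
HIDDEN = 6144
FFN = 16384
LAYERS = 56
Q_HEADS = 48
KV_HEADS = 8
VOCAB = 32000
EXPERTS = 8
ACTIVE_EXPERTS = 2

# Assumption: head_dim = hidden_size / num_attention_heads.
HEAD_DIM = HIDDEN // Q_HEADS


def count_params(experts_counted):
    """Estimate parameters when `experts_counted` experts per layer are included."""
    # SwiGLU expert: gate, up, and down projections (assumed bias-free).
    expert = 3 * HIDDEN * FFN
    # Router: one score per expert for each hidden vector.
    router = HIDDEN * EXPERTS
    # GQA: full-width Q and O projections, narrower K and V projections.
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * HEAD_DIM
    # Two RMSNorm weight vectors per layer.
    norms = 2 * HIDDEN
    per_layer = experts_counted * expert + router + attention + norms
    # Assumption: separate input embedding and output head, plus a final norm.
    embeddings = 2 * VOCAB * HIDDEN + HIDDEN
    return LAYERS * per_layer + embeddings


total = count_params(EXPERTS)
active = count_params(ACTIVE_EXPERTS)
print(f"total  ~ {total / 1e9:.1f}B")
print(f"active ~ {active / 1e9:.1f}B")
```

Under these assumptions the estimate comes to about 140.6 billion total and 39.2 billion active parameters, consistent with the rounded figures of 141 billion and 39 billion.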
The context window is 65,536 tokens (max_position_embeddings: 65536), which was substantially larger than that of most open-weight models in April 2024.[11] Unlike the original Mistral 7B, Mixtral 8x22B does not use sliding-window attention (sliding_window: null in the config), so attention spans the full 64K sequence.[11] The model is distributed in BF16 by default, and the full checkpoint weighs in at roughly 281 gigabytes, putting it out of reach of single-GPU serving but well within the capacity of a multi-GPU node with adequate memory.[4] Several quantized community releases in Q4, Q5, and Q8 GGUF formats appeared on Hugging Face within days of the initial torrent drop, alongside AWQ and GPTQ variants for the vLLM and TensorRT-LLM serving stacks.[9]
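A long context window also has a memory cost in the key-value cache, which is where the 8 key-value heads of grouped-query attention matter. The following sketch estimates the cache size for one full-length sequence; it assumes a head dimension of 128 (hidden size divided by query heads) and an unquantized BF16 cache, neither of which is stated in the sources above.

```python
LAYERS = 56
Q_HEADS = 48
KV_HEADS = 8
HEAD_DIM = 6144 // Q_HEADS  # assumption: hidden_size / num_attention_heads
BYTES_BF16 = 2              # assumption: cache kept in BF16
CONTEXT = 65536


def kv_cache_bytes(num_kv_heads, tokens=CONTEXT):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * LAYERS * num_kv_heads * HEAD_DIM * BYTES_BF16 * tokens


gqa = kv_cache_bytes(KV_HEADS)  # grouped-query attention, as configured
mha = kv_cache_bytes(Q_HEADS)   # hypothetical: one KV head per query head

print(f"GQA cache: {gqa / 2**30:.0f} GiB")
print(f"MHA cache: {mha / 2**30:.0f} GiB")
```

With these assumptions a single 64K-token sequence needs about 14 GiB of cache, against 84 GiB if every query head had its own key-value head.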
The instruction-tuned variant added special tokens for tool use, including [TOOL_CALLS], [AVAILABLE_TOOLS], [/AVAILABLE_TOOLS], [TOOL_RESULTS], and [/TOOL_RESULTS].[4] Mistral explicitly designed the model to be natively capable of function calling without a separate fine-tuning stage on top, which differentiated it from many earlier open-weight models that required community fine-tunes to handle structured output and tool selection reliably.[1][4] Mistral published example code in the model card showing how to wire the special tokens into the transformers chat-template machinery, with support starting in transformers v4.42.0.[4]
Mistral has historically been more restrained about disclosing training details than several of its open-weight peers, and Mixtral 8x22B is no exception. The launch blog post does not specify the pretraining token count, the data mixture, the optimizer settings, or the compute budget.[1] The Hugging Face model card likewise reports only the architectural configuration and licensing terms, and Mistral has not published a research paper or detailed technical report for the model.[3][4] The available public information is therefore narrow, and most discussions of the training procedure draw inferences from architecture details and from what is known about earlier Mixtral models.
Mistral has confirmed only the broad outlines: the model is a pretrained base checkpoint with a separately released instruction-tuned variant.[1] The instruction tuning targeted general conversational ability, function calling, and structured output rather than long chain-of-thought reasoning. Mistral has stated that pretraining covered the five primary supported languages (English, French, Italian, German, and Spanish), with sufficient coverage to outperform Llama 2 70B on translated versions of HellaSwag, ARC Challenge, and MMLU in French, German, Spanish, and Italian.[1] Code data was included in the pretraining mixture, which is consistent with the model's strong performance on HumanEval and MBPP.[1]
The model is released as version 0.1 in both base and instruct forms, with the file naming Mixtral-8x22B-v0.1 and Mixtral-8x22B-Instruct-v0.1.[3][4] The v0.1 designation is consistent with Mistral's release versioning practice for its earlier models and does not imply a particularly early stage of training; it is the standard initial public release. Mistral has not released subsequent point updates to the Mixtral 8x22B weights, in contrast to several other Mistral models, such as Mistral 7B (v0.2 and v0.3) and Mixtral 8x7B (a later refresh of v0.1), that received small post-launch revisions.
Mixtral 8x22B was evaluated on the standard suite of open-weight LLM benchmarks at launch, and Mistral published a set of comparison plots showing the model's performance against other open models in its class. The table below summarizes the most widely cited numbers, drawn from Mistral's launch blog post and from the community benchmark thread on Hugging Face. Scores are for the base model unless noted; instruct-version numbers are clearly marked.
| Benchmark | Score | Notes |
|---|---|---|
| MMLU (5-shot) | ~77% | Massive Multitask Language Understanding[1] |
| HellaSwag (10-shot) | 88.9% | Commonsense reasoning, as reported by Mistral[1] |
| HellaSwag (0-shot) | 86.17% | acc_norm, community measurement[10] |
| ARC Challenge (0-shot) | 63.65% | acc_norm, community measurement[10] |
| ARC Easy (0-shot) | 84.01% | acc_norm, community measurement[10] |
| Winogrande | 79.8% | Commonsense reasoning[10] |
| PIQA | 84.87% | Physical commonsense, acc_norm[10] |
| BoolQ | 87.80% | Yes/no question answering[10] |
| OpenBookQA | 49.60% | acc_norm[10] |
| AGIEval (avg) | 52.23% | Multi-subtask average[10] |
| GSM8K (maj@8) | 90.8% | Grade-school math, instruct version[1] |
| MATH (maj@4) | 44.6% | Competition math, instruct version[1] |
| HumanEval | leading open-source result | Python code, pass@1[1] |
| MBPP | leading open-source result | Python code generation[1] |
| MT-Bench | 8.66 | Instruct version, multi-turn dialogue[10] |
At launch, Mistral framed the model as the strongest open-weight model on coding and mathematics benchmarks. On GSM8K with majority voting at 8 samples, the instruct version reached 90.8 percent, which was a clear lead over Llama 2 70B and most other open-weight peers available at the time.[1] On HumanEval and MBPP for code generation, Mistral reported that Mixtral 8x22B outperformed all other open models in its evaluation set, although exact pass@1 numbers were given as bars on a chart rather than as a tabular score in the blog post.[1] On the MATH benchmark with majority voting at 4 samples, the model reached 44.6 percent, which was likewise a leading number among open-weight models in April 2024.[1]
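The maj@8 and maj@4 figures refer to majority voting: several completions are sampled per problem, and the most frequent final answer is the one that gets scored. A minimal sketch of that scoring rule follows; the sample answers are invented for illustration, and Mistral's actual evaluation harness, prompts, and answer-extraction rules are not public.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer among sampled completions.

    Ties go to the answer that appeared first (Counter preserves
    insertion order for equal counts).
    """
    counts = Counter(a.strip() for a in answers)
    return counts.most_common(1)[0][0]


def maj_at_k_accuracy(samples_per_problem, references):
    """Fraction of problems whose majority answer matches the reference."""
    correct = sum(
        majority_vote(samples) == reference
        for samples, reference in zip(samples_per_problem, references)
    )
    return correct / len(references)


# Hypothetical final answers extracted from 8 samples for each of 2 problems.
samples = [
    ["18", "18", "20", "18", "17", "18", "18", "20"],
    ["5", "7", "7", "5", "5", "7", "7", "7"],
]
references = ["18", "5"]

accuracy = maj_at_k_accuracy(samples, references)
```

In this toy case the first problem is scored correct and the second incorrect, because the wrong answer "7" wins the vote even though the right answer appears three times. Majority-vote scores are therefore not directly comparable with single-sample accuracy.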
On the multilingual suite, Mistral published scores for HellaSwag, ARC Challenge, and MMLU translated into French, German, Spanish, and Italian, showing that Mixtral 8x22B outperformed Llama 2 70B on every language-benchmark pair.[1] This multilingual headroom was a deliberate result of the pretraining data mixture and was one of the model's clearer differentiators compared with the contemporaneous Llama 3 release, whose pretraining recipe was strongly English-centric.
For the instruction-tuned variant, the most widely cited public score is the MT-Bench result of 8.66, reported in the community benchmark thread.[10] That figure placed Mixtral-8x22B-Instruct-v0.1 below Claude 3 Opus (9.43), GPT-4-1106-Preview (9.32), and Claude 3 Sonnet (9.18), but ahead of most other open-weight chat models at the time.[10] The base Mixtral-8x22B-v0.1 also briefly achieved top-performing pretrained-model status on the Open LLM Leaderboard at the time of release.[10] Note that the community thread later updated the ARC Challenge measurement from an initial 70.5 percent (reported in some early summaries) to the more rigorous acc_norm value of 63.65 percent after methodology issues with the original evaluation were caught.[10]
Where the model trailed contemporaries was on raw English-only knowledge benchmarks. Llama 3 70B Instruct, released the day after Mixtral 8x22B's formal announcement, posted higher MMLU and HumanEval numbers in head-to-head comparisons. Mixtral 8x22B's value proposition therefore landed on cost efficiency, multilingual coverage, and open license rather than on outright benchmark leadership in English.
Mixtral 8x22B was natively pretrained on a multilingual mixture covering English, French, Italian, German, and Spanish, and Mistral's launch blog presented the five-language coverage as a first-class feature rather than as a secondary capability layered on after English-only pretraining.[1] On Mistral's published multilingual benchmark plots, the model outperformed Llama 2 70B on translated versions of HellaSwag, ARC Challenge, and MMLU in all four non-English supported languages.[1] The gap was largest on the European Romance languages (French, Italian, Spanish), possibly reflecting both the geographic origin of Mistral's training data sources and the linguistic proximity of those languages to English.
The five-language design was a deliberate continuation of Mistral's positioning as a European AI champion. The selection of supported languages corresponded approximately to the major economies of Western Europe, and the company has consistently emphasized European-language coverage in its product marketing. In practice, Mixtral 8x22B also handles many other languages with varying quality due to incidental coverage in web-scale training data, but the five-language list represents the set for which Mistral makes explicit fluency claims.[1] Subsequent Mistral models including Mistral NeMo and Mistral Small would expand multilingual coverage further; Mistral NeMo, released in July 2024 in collaboration with NVIDIA, shipped with a tokenizer trained on more than 100 languages and emphasized multilingual performance well beyond the five-language European core.
The instruction-tuned variant Mixtral-8x22B-Instruct-v0.1 was natively trained to call functions and produce structured outputs, a capability Mistral built into the model rather than left to community fine-tunes.[1] The model card lists a dedicated set of special tokens for tool use, including [TOOL_CALLS], [AVAILABLE_TOOLS], [/AVAILABLE_TOOLS], [TOOL_RESULTS], and [/TOOL_RESULTS], with sample code showing how to wire them through the Hugging Face transformers chat template starting in version 4.42.0.[4]
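On the application side, a caller has to detect a tool call in the raw generation and decode it. The sketch below assumes, for illustration only, that the model emits the [TOOL_CALLS] token followed by a JSON array of objects with "name" and "arguments" keys; the exact layout should be checked against the model card, and the get_weather tool is hypothetical.

```python
import json

TOOL_CALLS = "[TOOL_CALLS]"  # special token listed in the model card


def parse_tool_calls(generated_text):
    """Return the list of tool-call dicts in `generated_text`, or [] if none.

    Assumption: the token is followed by a JSON array; anything after
    the array (for example an end-of-sequence marker) is ignored.
    """
    marker_at = generated_text.find(TOOL_CALLS)
    if marker_at == -1:
        return []  # ordinary text reply, no tool requested
    payload = generated_text[marker_at + len(TOOL_CALLS):].lstrip()
    try:
        calls, _ = json.JSONDecoder().raw_decode(payload)
    except json.JSONDecodeError:
        return []  # malformed call; the caller may retry or fall back
    return calls if isinstance(calls, list) else [calls]


# Hypothetical raw generation requesting one tool.
raw = '[TOOL_CALLS] [{"name": "get_weather", "arguments": {"city": "Paris"}}]</s>'
calls = parse_tool_calls(raw)
plain = parse_tool_calls("The weather in Paris is mild today.")
```

Returning an empty list for malformed output keeps the failure path explicit, which matters because the model itself offers no strict JSON validation.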
Function-calling support meant Mixtral 8x22B could be slotted into agentic systems and retrieval-augmented generation pipelines without an intermediary fine-tuning stage. LangChain, LlamaIndex, and Mistral's own mistral-inference and mistral-common libraries added first-class support shortly after launch. The design also influenced subsequent Mistral models, including Mistral Large, where Mistral standardized the tool-use special-token vocabulary across the lineup. Independent reviewers found tool-use on Mixtral 8x22B reliable for single-tool selection and moderately reliable for multi-tool scenarios, although it lacked the JSON-mode strict-validation features that OpenAI and Anthropic offered through their hosted APIs at the time.
Mixtral 8x22B is released under the Apache 2.0 license, which is one of the most permissive open-source licenses in widespread use.[3][4] Apache 2.0 permits commercial use, modification, redistribution, private use, and sublicensing without royalty obligations, and it does not require derivative works to be released under the same license. The license also includes an explicit patent grant from contributors, which provides some protection against patent litigation for users of the model.
This was a notable choice at the time. The other big open-weight release of April 2024, Meta's Llama 3, came with a custom Meta Llama 3 license that imposed several use restrictions, including a clause requiring large platforms (over 700 million monthly active users) to obtain a separate commercial license from Meta. Databricks' DBRX was released under the Databricks Open Model License, which similarly carried use restrictions. Cohere's Command R+ was released under a non-commercial CC-BY-NC-4.0 license for research use only. Against this comparison set, Mixtral 8x22B's Apache 2.0 release was the most permissive commercially available option in its capability tier.[1]
Mistral framed the Apache 2.0 choice as continuing its commitment to open-source software.[1] The company has used Apache 2.0 for most of its open-weight releases through mid-2024, with the notable exception of Mistral Large 2 (released in July 2024 under the more restrictive Mistral Research License) and a handful of other models. The Apache 2.0 designation made Mixtral 8x22B usable in enterprise settings without separate commercial licensing arrangements, which was a meaningful advantage for production deployment.
Both the base and instruct weights are subject to the same license, and Mistral has been clear that fine-tuning and continued pretraining on the published weights is permitted.[4] Several third-party fine-tunes appeared on Hugging Face within days of the release, including Microsoft's WizardLM-2 8x22B (which posted an MT-Bench score above the official Mistral instruct version), Nous Research's Hermes fine-tunes, and several domain-specific adaptations for coding, math, and biomedical text.[10]
Mixtral 8x22B was distributed across multiple channels from launch. The primary distribution route was Hugging Face, where the base checkpoint at mistralai/Mixtral-8x22B-v0.1 and the instruct checkpoint at mistralai/Mixtral-8x22B-Instruct-v0.1 were available for direct download under Apache 2.0.[3][4] The initial torrent magnet link on X served as a parallel distribution channel for the base model in advance of the formal Hugging Face upload.[5]
Mistral also offered Mixtral 8x22B through its own hosted inference platform, La Plateforme, with the model name open-mixtral-8x22b for the base and open-mixtral-8x22b-instruct for the instruction-tuned variant.[12] Third-party inference providers added support within days of the release. Together AI, Fireworks, OpenRouter, Anyscale Endpoints, and Perplexity all offered hosted Mixtral 8x22B endpoints by the end of April 2024, with per-token pricing in the range of $0.65–$1.20 per million tokens at the time of launch (substantially cheaper than equivalent dense-model offerings due to the sparse MoE design).[13][14]
Cloud providers followed shortly after. NVIDIA integrated Mixtral 8x22B into its NIM and API platform,[15] Amazon Web Services added the model to SageMaker JumpStart in May 2024,[16] and Microsoft Azure AI Studio and Google Cloud Vertex AI Model Garden subsequently listed the model in their catalogs. For local deployment, the model required approximately 281 GB of memory in BF16; community 4-bit GGUF quantizations reduced this to roughly 70 GB.[9]
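The memory figures above follow from simple arithmetic on the parameter count. The sketch below uses the rounded 141 billion total and counts only the weights; real quantization formats also store scales and other metadata, so actual files are somewhat larger, and the KV cache and activations need additional memory.

```python
TOTAL_PARAMS = 141e9  # rounded total parameter count cited in the text


def weight_gigabytes(bits_per_weight, params=TOTAL_PARAMS):
    """Decimal gigabytes needed to hold the weights alone."""
    return params * bits_per_weight / 8 / 1e9


for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_gigabytes(bits):.1f} GB")
```

This gives about 282 GB for BF16 and 70.5 GB at 4 bits per weight, in line with the rounded 281 GB and 70 GB figures.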
The initial reception of Mixtral 8x22B was strongly positive across the open-source community. The April 10 magnet-link drop generated considerable excitement on X and Reddit, with developers downloading and quantizing the weights within hours of the initial post.[5][6] The formal April 17 launch consolidated this momentum with a detailed blog post and the simultaneous release of the instruction-tuned variant, which made the model immediately usable for production-style evaluation without waiting for community fine-tunes.[1]
Developer-focused outlets framed the release as a successful continuation of the open-source MoE story that Mistral had started with Mixtral 8x7B. Analytics Vidhya, DataCamp, NVIDIA's developer blog, and AWS's machine learning blog all covered the release within the first month, with consistent emphasis on the cost-efficiency story and the Apache 2.0 license.[13][15][16][17] Reception in the research community focused on the architectural choices and the multilingual benchmark headroom: the eight-experts-with-top-two-routing design had now been validated at two distinct scales, providing useful evidence for designers of subsequent sparse-MoE systems.
The most pointed criticism was that Mixtral 8x22B's lead on English-only benchmarks proved narrower than Mistral's launch framing suggested once Meta's Llama 3 70B became available the following day. Independent reviewers running head-to-head comparisons on instruction-following benchmarks often found Llama 3 70B Instruct slightly ahead on most English-centric tasks. The narrative of Mixtral 8x22B as the unambiguous open-weight leader did not hold for long, and the model settled into a more nuanced position as one of several strong options in the open-source frontier, distinguished primarily by multilingual coverage, inference efficiency, and license terms. A secondary criticism concerned the lack of technical documentation: Mistral did not publish a research paper, did not disclose the training data mixture, did not specify the compute budget, and did not provide a detailed architecture description beyond the high-level launch blog post.[1] Several researchers compared this unfavorably with the more detailed technical reports that accompanied DBRX and Llama 3.
Within the broader European AI conversation, the release was framed as a continuing signal that Mistral was capable of competing with the largest U.S. labs at frontier scale. Mistral's announcement of a $640 million Series B funding round in June 2024 cited Mixtral 8x22B and the company's open-source flywheel as central to the investment thesis. The model became one of the most downloaded large open-weight checkpoints on Hugging Face in the months following its release.[4]
April 2024 was an unusually busy month for big open-weight model releases, and Mixtral 8x22B shipped into direct competition with several models in roughly the same capability tier. The table below collects headline specifications for the most relevant peers. Some figures are approximate, and benchmark scores are for the instruction-tuned variants where available.
| Model | Developer | Architecture | Total / active params | Context window | License | Release |
|---|---|---|---|---|---|---|
| Mixtral 8x22B | Mistral AI | Sparse MoE | 141B / 39B | 64K | Apache 2.0 | April 2024 |
| Llama 3 70B | Meta | Dense | 70B / 70B | 8K (initial) | Meta Llama 3 License | April 2024 |
| DBRX | Databricks | Fine-grained MoE | 132B / 36B | 32K | Databricks Open Model License | March 2024 |
| Command R+ | Cohere | Dense | 104B / 104B | 128K | CC-BY-NC-4.0 | April 2024 |
| Mixtral 8x7B | Mistral AI | Sparse MoE | 47B / 13B | 32K | Apache 2.0 | December 2023 |
Against Llama 3 70B, Mixtral 8x22B traded raw English benchmark scores for inference efficiency and multilingual coverage. Llama 3 70B led on MMLU, HumanEval, and several other English-centric suites in the instruction-tuned comparison, while Mixtral 8x22B led on French, German, Spanish, and Italian benchmarks. Llama 3 70B's initial 8K context window was substantially smaller than Mixtral 8x22B's 64K, although Meta later extended the context length in subsequent variants. The sparse MoE design gave Mixtral 8x22B an inference-cost advantage in per-token compute and, at small batch sizes, in weight traffic: when memory bandwidth is the binding constraint, reading roughly 39B active parameters per token is meaningfully cheaper than reading all 70B of a dense model, although all 141B parameters must still be held in memory.
Against DBRX, the comparison was closer architecturally. DBRX uses a fine-grained MoE design with 16 experts and 4 active per token, for about 132 billion total and 36 billion active parameters, in contrast with Mixtral 8x22B's 8 experts and 2 active. Public benchmark comparisons typically gave Mixtral 8x22B a small lead on most reasoning and coding tasks, although DBRX often narrowed the gap on math. The licensing comparison was more favorable to Mixtral 8x22B; the Apache 2.0 release removed the use restrictions present in the Databricks Open Model License. Against Command R+, the design choice was the most divergent: Command R+ was a dense 104B model with a 128K context window optimized for enterprise RAG, with a longer-context advantage but a more expensive dense forward pass and a non-commercial license.
Against its smaller sibling Mixtral 8x7B, the new model offered a step up in quality at a higher inference cost. The 39B active parameter count was three times the 13B active count of Mixtral 8x7B, translating to higher serving costs but substantially better performance on reasoning, math, and code benchmarks. Many deployments that did not need the additional headroom continued to run Mixtral 8x7B, while Mixtral 8x22B was reserved for higher-quality use cases. Both models share the same eight-experts-with-top-two-routing design, which made it straightforward to migrate code and serving infrastructure between them.
Mistral's release strategy shifted in the months following Mixtral 8x22B. The company's flagship effort moved toward dense models with the launch of Mistral Large 2 in July 2024 (a 123B dense model released under the Mistral Research License rather than Apache 2.0), followed by a series of further proprietary frontier checkpoints. In parallel, Mistral introduced Mistral NeMo in July 2024 in collaboration with NVIDIA, a 12B dense model trained on NVIDIA infrastructure and released under Apache 2.0, which served as the recommended replacement for Mistral 7B in many deployments.
The Mistral Small family launched later in 2024 and early 2025 with dense architectures and a focus on cost-effective serving for general-purpose workloads. Through several subsequent point releases, Mistral Small became the company's recommended open-weight replacement for many of the hosted API endpoints from the earlier Mixtral generation, including Mixtral 8x22B itself.[12] Mistral has not released a refresh of the Mixtral 8x22B weights, and the v0.1 base and instruct checkpoints remain the only official versions of the model.
Mistral subsequently returned to MoE designs at much larger scale in later releases, but the eight-experts-with-top-two-routing topology specific to Mixtral 8x22B was not directly repeated. Several other labs shipped their own open-weight sparse-MoE models, including Snowflake's Arctic and various Chinese open-weight MoE families, and the template established by the Mixtral models influenced the broader sparse-MoE landscape through 2024 and 2025.
Mistral documentation indicates that the hosted Mixtral 8x22B API endpoint on La Plateforme was deprecated on November 30, 2024, with a retirement date of March 30, 2025, after which the hosted endpoint was no longer accessible.[12] Mistral Small 3.2 was designated as the recommended replacement for general-purpose workloads on the hosted API.[12] The open weights remain available for self-hosting under Apache 2.0 indefinitely; Hugging Face continues to serve both the base and instruct checkpoints, and the license terms permit ongoing commercial use without renewal.[3][4]
For cost-conscious deployments, Mixtral 8x22B remained a competitive option through 2024 and into 2025, particularly for workloads where the 64K context window and native function-calling support fit the application, where the Apache 2.0 license simplified compliance review, and where multilingual coverage across the five supported European languages was a hard requirement. Several third-party hosting providers continued to offer Mixtral 8x22B endpoints after Mistral's own hosted retirement, including Together AI, Fireworks, and OpenRouter, with pricing typically in the same range as smaller dense models due to the active-parameter advantage of the sparse design.[14]
In the longer historical view, Mixtral 8x22B represents a high-water mark for the first-generation open-weight MoE design template. Its eight-experts-with-top-two-routing topology was the standard for sparse MoE releases through mid-2024, after which fine-grained MoE designs with many more experts (DBRX, Snowflake Arctic, DeepSeek-V2 and -V3) became more common. The model also marked the last open-weight frontier release from Mistral under the company's original "Apache 2.0 by default" policy; subsequent flagship releases moved to more restrictive licensing as Mistral expanded its commercial product line. Mixtral 8x22B thus occupies a transitional position in the Mistral lineup: the final permissively licensed frontier-class checkpoint from the company, and a benchmark against which subsequent open-weight MoE releases were measured.