Mixtral is a family of open-weight Sparse Mixture of Experts (SMoE) large language models developed by Mistral AI, a French artificial intelligence company. The family includes Mixtral 8x7B, released in December 2023, and Mixtral 8x22B, released in April 2024. Both models use a decoder-only Transformer architecture where each feedforward network (FFN) layer is replaced by a set of eight expert networks, with a learned router selecting two experts per token at each layer. This design allows the models to maintain a large total parameter count for capacity while activating only a fraction of those parameters during inference, resulting in faster processing and lower computational cost compared to equivalently sized dense models.
Mixtral 8x7B was one of the first open-weight MoE models to match or exceed the performance of much larger dense models such as LLaMA 2 70B and GPT-3.5 Turbo on standard benchmarks. Its release under the permissive Apache 2.0 license demonstrated that Sparse MoE architectures could be both practical and competitive in the open-source ecosystem. Mixtral 8x22B extended this approach with a larger architecture, supporting a 65,536-token context window and delivering improved reasoning and multilingual performance. Both models have had a significant influence on the open-source AI community, inspiring extensive fine-tuning, quantization, and model-merging experimentation.
The Mixture of Experts concept has a long history in machine learning, dating back to the 1991 paper "Adaptive Mixtures of Local Experts" by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton. The core idea involves dividing a neural network into multiple specialized sub-networks ("experts") managed by a gating network that routes inputs to the most relevant experts. In 2017, Noam Shazeer and colleagues at Google proposed applying sparsely-gated MoE layers to recurrent neural network language models, scaling to 137 billion parameters while keeping computation manageable.
Subsequent work at Google scaled MoE to Transformer-based models. GShard (2020) introduced a 600-billion-parameter multilingual model that replaced every other feedforward network layer with a top-2 MoE layer. The Switch Transformer (2021) simplified routing to top-1 selection and scaled to over one trillion parameters. GLaM (2021) reached 1.2 trillion parameters with a similar approach. However, these models were not released publicly, limiting community experimentation.
Mistral AI, founded in April 2023 by former researchers from Google DeepMind and Meta AI, first released Mistral 7B in September 2023. That 7.3-billion-parameter dense model introduced several architectural innovations including Sliding Window Attention and Grouped-Query Attention. Mixtral 8x7B built directly on the Mistral 7B architecture, replacing each dense FFN layer with an MoE layer containing eight experts.
Mistral AI initially released Mixtral 8x7B on December 8, 2023, posting the model weights via a torrent link on social media, a method the company had also used for Mistral 7B. The official blog post and technical paper followed on January 8, 2024, with the paper published on arXiv as "Mixtral of Experts" (arXiv:2401.04088). Both the base model and an instruction-tuned variant (Mixtral 8x7B Instruct v0.1) were released under the Apache 2.0 license, making them freely available for commercial and research use without restrictions.
Mixtral 8x7B uses a decoder-only Transformer architecture that is identical to Mistral 7B except that each FFN layer is replaced by a Sparse Mixture of Experts layer. The model's architectural dimensions are as follows:
| Parameter | Value |
|---|---|
| Hidden dimension | 4,096 |
| Number of layers | 32 |
| Attention heads | 32 |
| Key-value heads | 8 (Grouped-Query Attention) |
| Head dimension | 128 |
| Intermediate (FFN) size | 14,336 |
| Vocabulary size | 32,000 |
| Context length | 32,768 tokens |
| Number of experts per layer | 8 |
| Active experts per token | 2 (top-2 routing) |
| Total parameters | ~46.7 billion |
| Active parameters per token | ~12.9 billion |
| Activation function | SiLU |
| Positional encoding | Rotary Position Embedding (RoPE) |
At each layer, for every input token, a router network (a learned linear layer followed by a softmax) computes a probability distribution over the eight experts and selects the top two. The outputs of these two experts are combined as a weighted sum, with the weights determined by the router's softmax probabilities. Mathematically, for a given input x at layer l, the output y is:
y = Σ_{i ∈ Top2(x)} G(x)_i · E_i(x)
where G(x)_i is the gating weight for expert i and E_i(x) is the output of expert i applied to x. This top-2 routing strategy follows the approach used by GShard, which Mistral AI cited as the primary inspiration for their MoE implementation. However, unlike GShard, which applied MoE to every other FFN layer, Mixtral applies MoE to every FFN layer in the model.
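The routing step described above can be sketched in a few lines. The toy function below performs a single-token forward pass through one MoE layer; the function names are illustrative, and the renormalization of the softmax over only the two selected logits follows the formulation in the Mixtral paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_w, experts, k=2):
    """Toy top-k sparse MoE forward pass for a single token.

    x: (d,) token hidden state; router_w: (d, n_experts) router weights;
    experts: list of callables mapping (d,) -> (d,).
    """
    logits = x @ router_w                      # (n_experts,) router scores
    top = np.argsort(logits)[-k:]              # indices of the top-k experts
    gates = softmax(logits[top])               # renormalize over the selected logits
    # Weighted sum of the k selected experts' outputs
    return sum(g * experts[i](x) for g, i in zip(gates, top))
```

Only the two selected expert FFNs are evaluated; the remaining six contribute nothing to the forward pass, which is the source of the compute savings.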
The model uses Grouped-Query Attention (GQA) with 32 attention heads and 8 key-value heads, reducing memory bandwidth requirements during inference. It also uses a byte-pair encoding (BPE) tokenizer with a vocabulary of 32,000 tokens and supports RoPE (Rotary Position Embeddings) for positional encoding with a theta value of 1,000,000.
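The head-sharing pattern of GQA can be illustrated with shapes alone: 32 query heads share 8 key-value heads, so each KV head serves a group of 4 query heads. The sketch below is illustrative (random weights, causal mask omitted), not Mixtral's actual implementation:

```python
import numpy as np

# Grouped-Query Attention shape sketch: 32 query heads, 8 shared KV heads.
n_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 16
group = n_heads // n_kv_heads  # 4 query heads per KV head

q = np.random.randn(n_heads, seq, head_dim)
k = np.random.randn(n_kv_heads, seq, head_dim)
v = np.random.randn(n_kv_heads, seq, head_dim)

# Expand K/V so each query head attends against its group's shared KV head.
# Only n_kv_heads KV projections are stored, shrinking the KV cache 4x.
k_exp = np.repeat(k, group, axis=0)   # (32, seq, head_dim)
v_exp = np.repeat(v, group, axis=0)

scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)   # (32, seq, seq)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                   # softmax per row
out = weights @ v_exp                                       # (32, seq, head_dim)
```

The memory-bandwidth saving comes from the KV cache: only 8 heads' worth of keys and values are stored and streamed per layer, while all 32 query heads still attend.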
Because only 2 of the 8 experts are activated per token, the computational cost of each forward pass is equivalent to that of a dense model with roughly 12.9 to 14 billion parameters, even though the total parameter count is approximately 46.7 billion. However, all parameters must still be loaded into memory (or distributed across devices), so the VRAM requirement is similar to that of a 47-billion-parameter dense model.
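The total and active counts in the table can be reproduced from the architectural dimensions. The back-of-the-envelope calculation below assumes SwiGLU FFNs (three weight matrices per expert) and untied input/output embeddings, and ignores small terms such as norms and router weights:

```python
# Parameter count estimate for Mixtral 8x7B from its published dimensions.
d_model, d_ff, n_layers, vocab = 4096, 14336, 32, 32000
n_heads, n_kv_heads, head_dim = 32, 8, 128
n_experts, top_k = 8, 2

expert = 3 * d_model * d_ff                      # gate, up, down projections
attn = (d_model * n_heads * head_dim             # Q projection
        + 2 * d_model * n_kv_heads * head_dim    # K, V (GQA: fewer KV heads)
        + n_heads * head_dim * d_model)          # output projection
embed = 2 * vocab * d_model                      # input + output embeddings

total = n_layers * (n_experts * expert + attn) + embed
active = n_layers * (top_k * expert + attn) + embed

print(f"total  ≈ {total / 1e9:.1f}B")   # ≈ 46.7B
print(f"active ≈ {active / 1e9:.1f}B")  # ≈ 12.9B
```

Note that attention and embedding parameters are always active; only the expert FFNs are sparsely used, which is why the active count (12.9B) is more than 2/8 of the total (46.7B).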
Mistral AI has disclosed limited information about the training procedure for Mixtral 8x7B. The paper states that the model was pre-trained on data extracted from the open web, but the company has not revealed the specific datasets, total token count, or hardware configuration used for training. The model was trained using bfloat16 precision. The router auxiliary loss coefficient was set to 0.02, which helps balance expert utilization during training by encouraging the gating network to distribute tokens more evenly across experts.
The Mixtral 8x7B Instruct variant was created by applying supervised fine-tuning (SFT) on instruction-response pairs, followed by Direct Preference Optimization (DPO) on a paired feedback dataset. Mistral AI did not specify the exact datasets used for SFT and DPO, though community investigation suggested the use of curated conversational data.
The Mixtral paper reported results on a wide range of standard benchmarks, comparing Mixtral 8x7B against LLaMA 2 70B and GPT-3.5 Turbo. The following table reproduces key results from the paper:
Table: Mixtral 8x7B vs. LLaMA 2 70B and GPT-3.5 (from the Mixtral paper, Table 2 and Table 3)
| Benchmark | Mixtral 8x7B | LLaMA 2 70B | GPT-3.5 |
|---|---|---|---|
| MMLU (5-shot) | 70.6% | 69.9% | 70.0% |
| HellaSwag (10-shot) | 86.7% | 87.1% | 85.5% |
| WinoGrande (5-shot) | 81.2% | 83.2% | 81.6% |
| ARC Challenge (25-shot) | 85.8% | 85.1% | 85.2% |
| PIQA (0-shot) | 83.6% | 82.6% | -- |
| ARC-Easy (0-shot) | 83.1% | 79.9% | -- |
| NaturalQuestions (5-shot) | 30.6% | 25.4% | -- |
| TriviaQA (5-shot) | 71.5% | 73.0% | -- |
| HumanEval (pass@1) | 40.2% | 29.3% | -- |
| MBPP (pass@1) | 60.7% | 49.8% | 52.2% |
| MATH (4-shot, maj@4) | 28.4% | 13.8% | -- |
| GSM8K (5-shot) | 58.4% | 53.6% | 57.1% |
| MT-Bench | 8.30 | 6.86 | 8.32 |
Several patterns emerge from these results. Mixtral 8x7B matched or exceeded LLaMA 2 70B on the majority of benchmarks while using roughly 5 to 6 times fewer active parameters during inference. The improvements were especially pronounced in code generation (HumanEval: 40.2% vs. 29.3%) and mathematical reasoning (MATH: 28.4% vs. 13.8%; GSM8K: 58.4% vs. 53.6%). On MMLU, Mixtral achieved 70.6% compared to LLaMA 2 70B's 69.9%, a modest but notable margin given that Mixtral activates far fewer parameters per token.
Compared to GPT-3.5, Mixtral 8x7B performed similarly on MMLU (70.6% vs. 70.0%) and GSM8K (58.4% vs. 57.1%), while outperforming it on MBPP (60.7% vs. 52.2%). The Mixtral 8x7B Instruct model achieved an MT-Bench score of 8.30, nearly matching GPT-3.5's 8.32 and significantly exceeding LLaMA 2 70B Chat's 6.86. At the time, this made Mixtral 8x7B Instruct the highest-scoring openly available model on MT-Bench.
Mixtral also demonstrated strong multilingual performance. The paper reported that it significantly outperformed LLaMA 2 70B on HellaSwag, ARC Challenge, and MMLU across French, German, Spanish, and Italian. The model also showed reduced bias compared to LLaMA 2 on the Bias Benchmark for QA (BBQ), scoring 56.0% accuracy versus LLaMA 2's 51.5% (higher BBQ accuracy indicates less biased behavior).
The Mixtral paper analyzed expert assignment patterns and found that while experts do show some degree of specialization, the pattern is not strictly domain-based. The analysis revealed that the router's expert selection depends more on token syntax and position in the sentence than on the topic of the text. For instance, certain experts tend to be selected more frequently for tokens at the beginning of sentences or for specific grammatical constructs, rather than being assigned exclusively to, say, mathematics or code. However, the paper also noted that some layers do exhibit more specialized routing patterns than others.
The authors visualized expert assignments across different text domains (Python code, English prose, mathematics, and multilingual text) and observed that adjacent tokens are often assigned to the same expert pair, suggesting that the router learns to recognize local syntactic patterns. Interestingly, while no single expert was exclusively dedicated to a particular domain, certain experts appeared with higher frequency in code-related tokens or mathematical expressions. This partial specialization without strict partitioning is consistent with findings in other MoE research, where experts tend to develop overlapping but loosely specialized roles.
The Mixtral paper also evaluated the model's ability to retrieve information from long contexts using a passkey retrieval task. In this test, a random passkey (a sequence of digits) is inserted at a random position within a long prompt, and the model must extract and return it. Mixtral 8x7B achieved 100% retrieval accuracy across its full 32,768-token context window, demonstrating effective utilization of its entire context length for information retrieval tasks.
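A passkey prompt of this kind is simple to construct; the helper below builds a toy version (the filler sentence and prompt format are illustrative, not the exact prompt used in the Mixtral paper):

```python
import random

def make_passkey_prompt(filler_sentences=100, seed=0):
    """Build a toy passkey-retrieval prompt: repeated filler text with a
    random digit sequence hidden at a random position."""
    rng = random.Random(seed)
    passkey = "".join(str(rng.randint(0, 9)) for _ in range(5))
    filler = ["The grass is green. The sky is blue."] * filler_sentences
    pos = rng.randrange(len(filler))
    filler.insert(pos, f"The pass key is {passkey}. Remember it.")
    prompt = " ".join(filler) + " What is the pass key?"
    return prompt, passkey
```

Scaling the filler until the prompt approaches the 32,768-token limit, and varying the insertion position, yields the accuracy-versus-position grid reported in the paper.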
Mistral AI released Mixtral 8x22B on April 10, 2024, initially as raw model weights, followed by the official announcement and the Instruct variant on April 17, 2024. Like its predecessor, the model was released under the Apache 2.0 license.
Mixtral 8x22B scales up the same Sparse MoE design with significantly larger dimensions:
| Parameter | Mixtral 8x7B | Mixtral 8x22B |
|---|---|---|
| Hidden dimension | 4,096 | 6,144 |
| Number of layers | 32 | 56 |
| Attention heads | 32 | 48 |
| Key-value heads | 8 | 8 |
| Head dimension | 128 | 128 |
| Intermediate (FFN) size | 14,336 | 16,384 |
| Vocabulary size | 32,000 | 32,000 |
| Context length | 32,768 | 65,536 |
| Number of experts per layer | 8 | 8 |
| Active experts per token | 2 | 2 |
| Total parameters | ~46.7B | ~141B |
| Active parameters per token | ~12.9B | ~39B |
| RoPE theta | 1,000,000 | 1,000,000 |
The 8x22B variant retains the same top-2 routing mechanism and 8 experts per layer. Its major scaling comes from increasing the number of layers from 32 to 56, expanding the hidden dimension from 4,096 to 6,144, and increasing the number of attention heads from 32 to 48. This results in a total parameter count of approximately 141 billion, with roughly 39 billion active parameters per token. The context window was doubled from 32,768 to 65,536 tokens, enabling the model to process significantly longer documents.
The model continues to use Grouped-Query Attention with 8 key-value heads shared across 48 query heads (a 6:1 ratio), the same BPE tokenizer with a 32,000-token vocabulary, and bfloat16 precision.
Mistral AI highlighted several capabilities for Mixtral 8x22B in their announcement, including fluency in English, French, Italian, German, and Spanish; strong mathematics and coding performance; native function-calling support; and the 64K-token context window.
The following table summarizes available benchmark results for Mixtral 8x22B:
| Benchmark | Mixtral 8x22B (Base) |
|---|---|
| MMLU | 77.3% |
| HellaSwag (0-shot, acc_norm) | 86.2% |
| ARC Challenge (0-shot, acc_norm) | 63.7% |
| WinoGrande (0-shot) | 79.8% |
| GSM8K (base) | 76.5% |
| MT-Bench (Instruct) | 8.66 |
| GSM8K maj@8 (Instruct) | 90.8% |
| MATH maj@4 (Instruct) | 44.6% |
Compared to Mixtral 8x7B, the 8x22B model showed substantial improvements across most benchmarks. The MMLU score increased from 70.6% to 77.3%. The Instruct variant's GSM8K score with majority voting (maj@8) reached 90.8%, and its MATH score (maj@4) reached 44.6%, representing a large improvement in mathematical reasoning.
The MT-Bench score for the Instruct version of 8x22B was 8.66, compared to 8.30 for the 8x7B Instruct version, indicating improved instruction-following ability.
The following table compares the Mixtral models with other prominent open-weight and proprietary models from the same time period:
| Model | Developer | Release | Total Params | Active Params | Context | Architecture | MMLU | License |
|---|---|---|---|---|---|---|---|---|
| Mixtral 8x7B | Mistral AI | Dec 2023 | 46.7B | 12.9B | 32K | Sparse MoE | 70.6% | Apache 2.0 |
| Mixtral 8x22B | Mistral AI | Apr 2024 | 141B | 39B | 65K | Sparse MoE | 77.3% | Apache 2.0 |
| LLaMA 2 70B | Meta AI | Jul 2023 | 70B | 70B (dense) | 4K | Dense | 69.9% | Meta Community |
| Falcon 180B | TII | Sep 2023 | 180B | 180B (dense) | 2K | Dense | 68.7% | Falcon TII |
| Qwen 72B | Alibaba | Nov 2023 | 72B | 72B (dense) | 32K | Dense | 74.4% | Qwen License |
| GPT-3.5 Turbo | OpenAI | Mar 2023 | Undisclosed | Undisclosed | 16K | Undisclosed | 70.0% | Proprietary |
This comparison highlights several key points. Mixtral 8x7B achieved MMLU performance comparable to LLaMA 2 70B and GPT-3.5 while activating only 12.9 billion parameters per token, roughly one-fifth of LLaMA 2 70B's 70 billion. Falcon 180B, despite having nearly four times more parameters, scored lower on MMLU (68.7% vs. 70.6%) and required significantly more computational resources. The original Qwen 72B outperformed Mixtral 8x7B on MMLU (74.4% vs. 70.6%) but used a fully dense architecture requiring all 72 billion parameters during inference.
Mixtral 8x22B, with 77.3% on MMLU and 39 billion active parameters, delivered performance competitive with much larger dense models while maintaining the computational efficiency advantages of the MoE approach.
Mistral AI has been notably opaque about its training methodology. For both Mixtral 8x7B and 8x22B, the company has confirmed that the models were pre-trained on data extracted from the open web but has declined to disclose specific dataset compositions, total training tokens, or the hardware used for pre-training, citing competitive considerations.
What is known is limited: both models were pre-trained in bfloat16 precision, and for Mixtral 8x7B the router auxiliary load-balancing loss coefficient was set to 0.02 to encourage balanced expert utilization during training.
The Mixtral 8x7B Instruct variant was released alongside the base model in December 2023. It was fine-tuned using supervised fine-tuning (SFT) on curated instruction-response pairs, followed by Direct Preference Optimization (DPO) on paired human preference data. The model achieved an MT-Bench score of 8.30, above Claude 2.1 (8.18). On the LMSYS Chatbot Arena Elo leaderboard, it peaked at a rating of 1121, above Claude 2.1 (1117), GPT-3.5 Turbo (1117), and Gemini Pro (1111).
The Mixtral 8x22B Instruct variant was released on April 17, 2024. In addition to the standard SFT and DPO pipeline, this model included native function calling support, enabling it to generate structured tool-use outputs. The Instruct variant showed markedly improved math performance, achieving 90.8% on GSM8K with majority voting (maj@8) and 44.6% on MATH (maj@4).
Mixtral 8x7B is widely recognized as the model that proved open-source Sparse MoE was viable and practical. Before its release, the most prominent MoE language models (GShard, Switch Transformer, GLaM) were internal projects at Google that were never publicly released. Mixtral demonstrated that an MoE model could be released as open weights, run on consumer and enterprise hardware (with appropriate quantization), and achieve competitive performance against both proprietary models and dense open-source alternatives.
The release triggered a wave of interest in MoE architectures across the open-source community and the broader AI industry. Subsequent open MoE models, including DeepSeek-MoE (January 2024), DBRX by Databricks (March 2024), Grok-1 by xAI (March 2024), Qwen 1.5 MoE (March 2024), and later DeepSeek-V2 (May 2024), were all released in the months following Mixtral. While each of these models introduced its own innovations (DeepSeek-MoE used finer-grained experts with shared expert isolation; DBRX used 16 experts with top-4 routing), Mixtral's success in demonstrating competitive open-weight MoE performance helped catalyze this broader trend.
The Apache 2.0 licensing was also significant. Previous large open models like LLaMA 2 used Meta's Community License, which imposed restrictions on commercial use for applications with over 700 million monthly active users. Falcon 180B used a custom TII license with its own limitations. Mixtral's fully permissive Apache 2.0 license set a standard that many subsequent open-weight releases followed.
Mixtral became one of the most actively fine-tuned model families in the open-source community. Notable community-created variants include Nous Hermes 2 Mixtral 8x7B from Nous Research and the Dolphin Mixtral series from Cognitive Computations, among many others.
These community efforts benefited from the Apache 2.0 license, which imposes no restrictions on derivative works or commercial use.
Due to the large total parameter count (46.7B for 8x7B and 141B for 8x22B), quantization has been particularly important for making Mixtral models accessible on consumer hardware. The community produced numerous quantized variants in formats including GGUF (for llama.cpp and Ollama), GPTQ, AWQ, and EXL2. The 8x7B model at full bfloat16 precision requires approximately 87 GB of VRAM, but when quantized to 4-bit precision, it can fit in approximately 24-26 GB of VRAM, making it runnable on high-end consumer GPUs such as the NVIDIA RTX 4090 (24 GB) or dual RTX 3090s. The 8x22B model at full precision requires roughly 262 GB of VRAM, and even at 4-bit quantization it needs around 80 GB, necessitating multi-GPU configurations or specialized inference hardware.
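The memory figures above follow directly from bytes-per-parameter arithmetic. The sketch below estimates weight storage only; activations, the KV cache, and quantization scale overhead (which pushes real 4-bit footprints above the raw estimate) are excluded:

```python
# Rough VRAM estimates for model weights alone.
# bf16 = 2 bytes/param; 4-bit ≈ 0.5 bytes/param before scale overhead.
GIB = 2**30

def weight_gib(n_params, bytes_per_param):
    """Weight memory in GiB for a given parameter count and precision."""
    return n_params * bytes_per_param / GIB

for name, params in [("8x7B", 46.7e9), ("8x22B", 141e9)]:
    print(f"{name}: bf16 ≈ {weight_gib(params, 2.0):.0f} GiB, "
          f"4-bit ≈ {weight_gib(params, 0.5):.0f} GiB")
```

For 8x7B this gives roughly 87 GiB at bf16 and about 22 GiB of raw 4-bit weights, consistent with the 24-26 GB practical footprints once scales and runtime buffers are included.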
MoE models present unique quantization considerations. Because each expert's weights are accessed less frequently than weights in a dense model (only when that expert is selected by the router), quantization errors in rarely-activated experts have less overall impact on output quality. Several community experiments found that Mixtral tolerated aggressive quantization (down to 2-3 bits for some layers) better than dense models of comparable size.
Mixtral also catalyzed innovation in model merging. The MergeKit library by Arcee AI added support for creating custom MoE models (sometimes called "FrankenMoEs" or "MoErges") by combining expert layers from different fine-tuned models. This technique involves taking the attention and normalization layers from a base model while mixing FFN layers from different specialized fine-tunes as individual experts, then training or initializing a router to select among them. This approach allowed practitioners to create specialized MoE models without training from scratch.
In August 2024, MLCommons added Mixtral 8x7B as an official benchmark in the MLPerf Inference suite, recognizing it as a representative MoE model for measuring inference hardware performance. This inclusion further cemented Mixtral's role as a standard reference model in the AI ecosystem.
A defining characteristic of Sparse MoE models like Mixtral is the separation between memory requirements and computational cost. While Mixtral 8x7B activates only 12.9 billion parameters per token, all 46.7 billion parameters must reside in memory (or be available for quick loading). In practice, the model has the memory footprint of a roughly 47-billion-parameter dense model but the per-token compute and latency profile of a roughly 13-billion-parameter one, so deployment hardware must be provisioned for the total parameter count even though inference speed reflects only the active count.
For Mixtral 8x22B, these trade-offs are even more pronounced: 39 billion active parameters provide the compute profile of a large but not enormous model, while the full 141 billion parameters require substantial memory resources.
A known challenge with MoE models is ensuring that tokens are distributed evenly across experts. If certain experts receive disproportionately more tokens ("expert collapse"), training becomes inefficient and model quality degrades. Mixtral addresses this through an auxiliary load-balancing loss added to the training objective. This loss penalizes uneven expert utilization, encouraging the router to distribute tokens more uniformly.
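Mistral AI has not published its exact loss formulation, but a common form of this auxiliary loss (introduced in Switch Transformer and GShard) multiplies, per expert, the fraction of tokens dispatched to it by the mean routing probability it receives. A minimal sketch, shown top-1 for simplicity (Mixtral routes top-2):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignments, n_experts):
    """Switch-Transformer-style auxiliary load-balancing loss
    (illustrative; Mixtral's exact formulation is not published).

    router_probs: (tokens, n_experts) softmax outputs of the router.
    expert_assignments: (tokens,) chosen expert index per token.
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_assignments, minlength=n_experts) / len(expert_assignments)
    # P_i: mean routing probability assigned to expert i
    P = router_probs.mean(axis=0)
    # Minimized (value 1.0) when both distributions are uniform
    return n_experts * float(f @ P)
```

The loss equals 1.0 under perfectly uniform routing and grows as tokens concentrate on few experts; scaled by a small coefficient (0.02 for Mixtral 8x7B), it is added to the language-modeling objective.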
Sparse MoE models can experience reduced hardware utilization at small batch sizes because the expert routing creates irregular computation patterns. At larger batch sizes, the workload across experts becomes more balanced, leading to better GPU utilization and higher throughput. This means that Mixtral models tend to be most efficient in serving scenarios with high concurrent request volumes.
Despite its strong benchmark performance, Mixtral has several recognized limitations: the full parameter set must fit in memory despite the low active-parameter count, its strongest multilingual performance is concentrated in a handful of European languages, and the undisclosed training data makes independent auditing and contamination analysis difficult.
Following the Mixtral releases, Mistral AI continued developing both dense and MoE architectures, including the proprietary Mistral Large (February 2024), the code-focused Codestral (May 2024), Mistral NeMo 12B (July 2024, developed with NVIDIA), and Mistral Large 2 (July 2024).
Mistral AI has not released a direct successor to the Mixtral MoE line as of early 2025, though the company's proprietary API offerings may incorporate MoE techniques internally.