Mistral Large is a family of large language models developed by Mistral AI, the Paris-based AI company founded in 2023. The family comprises the original Mistral Large (released February 2024), Mistral Large 2 (July 2024), Mistral Large 2.1 (November 2024), and the closely related Mistral Medium 3 (May 2025). Throughout its iterations, the family has been positioned as Mistral AI's flagship commercial and enterprise offering, targeting high-complexity tasks such as reasoning, multilingual document analysis, code generation, and agentic workflows.
Mistral Large 2, released on July 24, 2024, is the most widely referenced version of the family. It has 123 billion parameters, a 128,000-token context window, and open weights released under the Mistral Research License. At launch, Mistral AI positioned it as competitive with GPT-4o and Claude 3 Opus on coding and reasoning benchmarks. The model weights are freely downloadable for research and non-commercial use, while commercial API access is available through Mistral's own La Plateforme as well as Microsoft Azure, Amazon Web Services, Google Cloud Vertex AI, and IBM watsonx.
Mistral AI was founded in April 2023 by Arthur Mensch, Guillaume Lample, and Timothée Lacroix, all former researchers from DeepMind and Meta AI. The company's early reputation was built on a series of efficient, open-weight models: Mistral 7B (September 2023) demonstrated that a 7-billion-parameter model could outperform Llama 2 13B on most benchmarks, largely due to architectural choices like grouped-query attention (GQA) and sliding window attention (SWA). Mixtral 8x7B followed in December 2023 using a sparse mixture-of-experts (MoE) design that activated only 12.9 billion parameters per forward pass despite having 46.7 billion total parameters.
Those models established a pattern of releasing competitive open-weight models that outperformed much larger alternatives in both quality and inference cost. Mistral Large represented a departure from that pattern in one important respect: unlike Mistral 7B and Mixtral 8x7B, the original Mistral Large (February 2024) did not have open weights at launch. It was released as a closed-source API model, positioned as a commercial flagship to compete directly against GPT-4 and Claude 2.
The transition from that closed-source approach began with Mistral Large 2 in July 2024, which released model weights under the Mistral Research License. That move placed the weights in a middle ground: freely available for research and inspection, but requiring a separate commercial license from Mistral AI for any deployment that charges for access.
Mistral Large 1, also known by its model identifier mistral-large-2402, was announced on February 26, 2024, alongside the formation of a distribution partnership with Microsoft. The partnership made Mistral Large the first non-OpenAI commercial language model available through Azure AI Studio and Azure Machine Learning, and Microsoft accompanied the announcement with a reported $16 million investment in Mistral AI.
The model was introduced with a 32,000-token context window, native fluency in English, French, Spanish, German, and Italian, and built-in function calling and JSON mode for structured outputs. Mistral AI described it as the world's second-best generally available model at the time of launch, trailing only GPT-4 on the MMLU benchmark with a reported score of 81.2%. On reasoning benchmarks including HellaSwag, WinoGrande, ARC Challenge, and TriviaQA, the model substantially outperformed LLaMA 2 70B, and it did so in all five languages it natively supported.
Mistral AI simultaneously released Mistral Small, a lower-latency model positioned below Mistral Large in capability but above Mixtral 8x7B in performance. Together, the two models formed the first explicit tier structure in Mistral's commercial lineup.
The original Mistral Large was a closed model. Its weights were not released publicly, and access was available only through La Plateforme and Azure. That approach reflected Mistral AI's effort to establish a commercial revenue stream alongside its open-weight research models.
Pricing for Mistral Large 1 at launch was set at $4.00 per million input tokens and $12.00 per million output tokens on La Plateforme, placing it in the range of GPT-4 Turbo pricing at the time.
Mistral Large 2, designated mistral-large-2407, was announced on July 24, 2024. It represented a significant step forward in almost every measurable dimension compared to its predecessor: the parameter count, undisclosed for the original model, was published at 123 billion; the context window expanded from 32k to 128k tokens; and the model weights were made publicly available for the first time under the Mistral Research License.
Mistral AI described the design goal as creating a model capable of running at scale on a single node, a practical consideration for enterprises operating private inference clusters. The model uses a standard dense decoder-only Transformer architecture, distinguishing it from Mixtral's sparse MoE design; its key architectural choices are detailed below.
At launch, benchmark scores on the base pretrained model included 84.0% on MMLU (English). The instruct variant showed HumanEval scores of 92%, HumanEval Plus at 87%, MBPP Base at 80%, and MBPP Plus at 69%. On mathematics, the model scored 93% on GSM8K and 71.5% on MATH (0-shot, chain-of-thought). Instruction-following benchmarks placed it at 8.63 on MT Bench, 56.3 on Wild Bench, and 73.2 on Arena Hard. The MT Bench score represented a meaningful improvement over the previous generation, which itself trailed GPT-4 noticeably.
Multilingual MMLU scores across languages were:
| Language | MMLU Score |
|---|---|
| French | 82.8% |
| German | 81.6% |
| Spanish | 82.7% |
| Italian | 82.7% |
| Portuguese | 81.6% |
| Dutch | 80.7% |
| Russian | 79.0% |
| Japanese | 78.8% |
| Chinese | 74.8% |
| Korean | 60.1% |
Code support was listed at 80+ programming languages, including Python, Java, C, C++, JavaScript, Bash, Swift, and Fortran. Mistral AI noted that Mistral Large 2 was trained to acknowledge uncertainty rather than generate plausible but incorrect responses, a behavior improvement described in the context of making the model more suitable for retrieval-augmented generation (RAG) pipelines.
Function calling was improved with support for parallel and sequential tool execution. The model handles multi-turn conversations with better context retention than Mistral Large 1, and Mistral AI added fine-tuning capability for the model on La Plateforme at launch.
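Tool definitions for the function calling interface follow the JSON-schema convention used by the Mistral chat API. A minimal sketch of building one; the `get_order_status` function and its fields are hypothetical examples, not drawn from Mistral's documentation:

```python
import json

def make_tool(name, description, properties, required):
    # Wrap a JSON-schema parameter description in the tool envelope the
    # chat API expects alongside the messages list.
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }

# Hypothetical tool: look up an order's shipping status by ID.
order_tool = make_tool(
    "get_order_status",
    "Look up the shipping status of an order by its ID.",
    {"order_id": {"type": "string", "description": "Internal order identifier"}},
    ["order_id"],
)

print(json.dumps(order_tool, indent=2))
```

With parallel tool execution, the model may return several such tool calls in a single assistant turn, each of which the orchestrating code resolves before replying.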
Weights were released on Hugging Face under the repository mistralai/Mistral-Large-Instruct-2407. Deployment requires over 300 GB of GPU VRAM in bf16 precision (approximately 8 GPUs), or approximately 75 GB in fp4 precision. At launch, Mistral AI set API pricing at $3.00 per million input tokens and $9.00 per million output tokens, later adjusted to $2.00 input and $6.00 output.
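The VRAM figures can be sanity-checked with a weights-only estimate (1 GB = 10^9 bytes); the quoted 300 GB and 75 GB totals additionally include KV cache, activations, and runtime overhead:

```python
def weight_footprint_gb(n_params: float, bits_per_param: float) -> float:
    """Weights-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# 123B parameters, as published for Mistral Large 2
bf16 = weight_footprint_gb(123e9, 16)  # 246 GB for the weights alone
fp4 = weight_footprint_gb(123e9, 4)    # 61.5 GB for the weights alone
print(f"bf16: {bf16:.0f} GB, fp4: {fp4:.1f} GB")
```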
Mistral Large 2.1, model identifier mistral-large-2411, was released on November 19, 2024. It uses the same 123-billion-parameter architecture as Mistral Large 2 but was trained with specific focus on three areas that had drawn attention from enterprise users: long context handling, function calling reliability, and system prompt adherence.
According to the model card, Mistral-Large-Instruct-2411 extends Mistral-Large-Instruct-2407 with better long context, function calling, and system prompt support. The context window remained at 128,000 tokens. Language and code coverage remained identical to the 2407 version.
The instruct template was updated to v7, introducing explicit system prompt delimiters:
```
<s>[SYSTEM_PROMPT] <system prompt>[/SYSTEM_PROMPT][INST] <user message>[/INST] <assistant response></s>
```
This template change reflected enterprise feedback about the difficulty of reliably separating system-level instructions from user input in agentic deployments. Mistral AI also improved the model's performance on RAG tasks, where precise adherence to retrieved context (rather than default model priors) is a recurring pain point.
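Assuming the v7 template exactly as shown above (the official tokenizer may handle whitespace and special tokens slightly differently), prompt assembly reduces to string formatting:

```python
def build_v7_prompt(system: str, user: str) -> str:
    # Follows the v7 delimiters from the model card; treat this as an
    # illustration only, since the reference implementation lives in
    # Mistral's own tokenizer.
    return (
        f"<s>[SYSTEM_PROMPT] {system}[/SYSTEM_PROMPT]"
        f"[INST] {user}[/INST]"
    )

prompt = build_v7_prompt(
    "Answer only from the provided context.",
    "Summarise the attached contract.",
)
```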
The November 2024 release coincided with the announcement of Pixtral Large, a multimodal variant that shares Mistral Large 2's text backbone but adds vision capabilities.
Mistral's documentation lists Mistral Large 2.1 as the scheduled-for-deprecation successor to Large 2, with a retirement date of February 27, 2026, at which point Mistral Large 3 (released December 2025 as a 675-billion-parameter mixture-of-experts model) becomes the recommended replacement.
Pricing remained at $2.00 per million input tokens and $6.00 per million output tokens.
Mistral Medium 3 was announced on May 7, 2025, and introduced at a substantially lower price point than the Large family while targeting comparable performance for many workloads. Mistral AI's announcement described the model as achieving approximately 90% of Claude Sonnet 3.7's performance across benchmarks at $0.40 per million input tokens and $2.00 per million output tokens.
Although Medium 3 sits below the Large designation in the naming hierarchy, it is relevant to the Mistral Large family because it was positioned as the practical replacement for many Mistral Large 2 deployments where cost or latency was the primary constraint. The model is multimodal, supporting both text and image input. Self-hosted deployment requires a minimum of four GPUs, substantially lower than the eight-plus GPUs needed for Mistral Large 2.
The announcement noted that Medium 3 outperforms Llama 4 Maverick and Cohere Command A on benchmark evaluations, and performs particularly well on coding and STEM tasks. Availability at launch was through La Plateforme and Amazon SageMaker, with Google Cloud Vertex AI, Azure AI Foundry, IBM watsonx, and NVIDIA NIM listed as upcoming distribution channels.
Mistral Large 2 and 2.1 are dense decoder-only Transformers. Unlike Mixtral 8x7B and Mixtral 8x22B, which used sparse mixture-of-experts routing, all 123 billion parameters in Mistral Large 2 are active during every forward pass. This design choice was deliberate: Mistral AI noted that dense models offer more predictable per-request compute costs and simpler deployment on single-node hardware.
The attention mechanism uses grouped-query attention (GQA), a technique introduced by Ainslie et al. (2023) and previously used in Mistral 7B. GQA partitions query heads into groups that share key-value projections, reducing the size of the KV cache during inference. In Mistral Large 2, 48 query heads are grouped to share 8 KV heads. This significantly reduces memory bandwidth requirements during inference compared to standard multi-head attention at similar parameter counts.
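The saving can be illustrated with a back-of-envelope KV cache calculation. The 48-to-8 head grouping is from the figures above; the layer count and head dimension below are illustrative placeholders, since Mistral AI published no technical report:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # One K and one V tensor per layer, cached for every generated token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

LAYERS, HEAD_DIM, SEQ = 80, 128, 128_000  # illustrative, not published figures
mha = kv_cache_bytes(SEQ, LAYERS, 48, HEAD_DIM)  # if every query head kept its own KV
gqa = kv_cache_bytes(SEQ, LAYERS, 8, HEAD_DIM)   # 48 query heads sharing 8 KV heads
print(f"GQA shrinks the KV cache by {mha / gqa:.0f}x")
```

Whatever the true layer count, the cache shrinks by the query-to-KV head ratio, here 48/8 = 6x.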
RoPE (Rotary Position Embedding) is used for position encoding. RoPE encodes position information directly into the attention computation rather than adding positional embeddings to token representations, and it generalizes better to sequence lengths beyond those seen during training.
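RoPE's defining property can be verified in a few lines: after rotating query and key vectors by their absolute positions, their dot product depends only on the relative offset between them. A minimal pure-Python sketch:

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [0.3, -1.2, 0.8, 0.5], [1.1, 0.4, -0.7, 0.2]
# Both pairs of positions differ by 4, so the attention scores match:
s1 = dot(rope(q, 3), rope(k, 7))
s2 = dot(rope(q, 103), rope(k, 107))
assert math.isclose(s1, s2)
```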
The feed-forward network uses SwiGLU activation, which combines a gating mechanism with a smooth nonlinearity and has become standard in post-2022 large model training following its demonstrated improvements over ReLU variants.
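A minimal sketch of the gating computation: here `a` and `b` stand in for the gate and up projections of the input, whose weight matrices a real FFN would supply.

```python
import math

def silu(x):
    # SiLU (swish): x * sigmoid(x), the smooth nonlinearity in SwiGLU
    return x / (1.0 + math.exp(-x))

def swiglu_gate(a, b):
    """Elementwise SwiGLU gate: SiLU(a) * b, where a = W_gate @ x and
    b = W_up @ x in a full feed-forward block (projections omitted here)."""
    return [silu(ai) * bi for ai, bi in zip(a, b)]

out = swiglu_gate([0.0, 20.0, -20.0], [5.0, 1.0, 1.0])
```

The gate passes large positive activations almost unchanged and suppresses large negative ones, while remaining smooth near zero, unlike ReLU's hard cutoff.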
RMS normalization (rather than LayerNorm) is applied before each sub-layer. RMSNorm simplifies the normalization computation by omitting the mean-centering step, which has been found to produce equivalent results in practice.
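The omitted mean-centering is visible in a direct implementation:

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    # No mean subtraction, unlike LayerNorm: scale by the root-mean-square only,
    # then apply the learned per-channel gain.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

out = rmsnorm([3.0, -4.0], [1.0, 1.0])  # output has unit RMS
```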
Mistral AI did not publish a formal technical report or arXiv preprint for Mistral Large 2, so some architectural details are drawn from the model configuration files in the Hugging Face repository and third-party analysis.
The Mistral Large family has operated under two licensing regimes across its versions.
Mistral Large 1 (February 2024) was released as a fully proprietary closed-source model. No weights were published. Access required either a Mistral API account or an Azure subscription, and there was no mechanism for self-hosting outside of a commercial contract with Mistral AI.
Mistral Large 2 (July 2024) and Mistral Large 2.1 (November 2024) are released under the Mistral Research License (MRL). The MRL permits downloading, inspecting, modifying, and using the weights for research and other non-commercial purposes, and prohibits production or commercial use without a separate agreement with Mistral AI.
For any commercial use that involves charging for access, deploying the model in a production product, or embedding the model in a service sold to third parties, organizations must contact Mistral AI to obtain a separate commercial license. Mistral AI has indicated that such licenses are evaluated on a case-by-case basis.
This dual-track approach has drawn some comparison to Meta's approach with the Llama family, though the Llama 2 and Llama 3 licenses use different terms. In practice, the MRL is more restrictive than the Llama 3 community license, which permits commercial use without a separate agreement for organizations below certain user thresholds.
API access through La Plateforme, Azure, AWS, and other cloud providers constitutes commercial licensing through Mistral AI's distribution agreements, so users accessing the model via API do not need to negotiate a separate license.
Mistral Large models are available through multiple distribution channels.
La Plateforme is Mistral AI's direct API service, accessible at console.mistral.ai. It provides access to all current Mistral models under a pay-as-you-go token pricing model. La Plateforme also hosts fine-tuning capabilities and the Mistral Agents API, which simplifies building multi-step agentic applications with Mistral Large as the orchestrating model.
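As an illustration, a chat request to La Plateforme's chat completions endpoint (POST https://api.mistral.ai/v1/chat/completions, bearer-token auth) is a JSON payload of the following shape; this is a payload sketch only, without the HTTP call or API key handling:

```python
import json

payload = {
    "model": "mistral-large-2411",  # model identifier from the 2.1 release
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise grouped-query attention in one sentence."},
    ],
    "temperature": 0.3,
}
body = json.dumps(payload)  # serialized request body
```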
Microsoft Azure was the first distribution partner for Mistral Large, announced simultaneously with the model's February 2024 launch. Mistral models on Azure are available through Azure AI Foundry Models as a Service (MaaS), which offers both pay-as-you-go and provisioned throughput billing. Azure distributes all major Mistral models, including Mistral Large 2 and Mistral Large 2.1.
Amazon Web Services added Mistral Large 2 to Amazon Bedrock in 2024. AWS published a reference architecture showing Mistral Large 2 in agentic RAG configurations with LlamaIndex, and Amazon SageMaker supports self-hosted deployment.
Google Cloud Vertex AI added Mistral Large 24.11 (the November 2024 version) alongside Codestral 25.01 in early 2025. Mistral models on Vertex AI are accessed through the model garden and support both the Mistral API format and the Vertex AI unified inference interface.
IBM watsonx became an additional distribution channel as part of an enterprise partnership, with IBM taking on customer licensing responsibilities for its own customer base. Mistral Large 2 was described as the first third-party model available under IBM's own customer license agreements.
Model weights for Mistral Large 2 and 2.1 are also available for direct download from Hugging Face at mistralai/Mistral-Large-Instruct-2407 and mistralai/Mistral-Large-Instruct-2411 respectively.
The following table shows pricing across major platforms for the Mistral Large family (as of 2025). Prices are per million tokens.
| Model | Platform | Input ($/M) | Output ($/M) |
|---|---|---|---|
| Mistral Large 1 (2402) | La Plateforme | $4.00 | $12.00 |
| Mistral Large 2 (2407) | La Plateforme | $2.00 | $6.00 |
| Mistral Large 2.1 (2411) | La Plateforme | $2.00 | $6.00 |
| Mistral Large 2.1 (2411) | Azure AI Foundry | $2.00 | $6.00 |
| Mistral Large 2.1 (2411) | Amazon Bedrock | $2.00 | $6.00 |
| Mistral Medium 3 | La Plateforme | $0.40 | $2.00 |
Mistral Large 1's original $4.00/$12.00 pricing was first cut to $3.00/$9.00 at Mistral Large 2's launch and later to $2.00/$6.00, a cumulative reduction of 50% on both input and output tokens. Combined with the expanded context window and open weights, the lower pricing contributed to the broader reception of Mistral Large 2 as a significant value improvement.
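The per-request impact of these rates is straightforward arithmetic; the workload size below is illustrative:

```python
def request_cost(in_toks, out_toks, in_price_per_m, out_price_per_m):
    # Token pricing is quoted per million tokens on each side of the request.
    return in_toks / 1e6 * in_price_per_m + out_toks / 1e6 * out_price_per_m

# A 10k-token prompt with a 1k-token reply, at the table's La Plateforme rates:
large1 = request_cost(10_000, 1_000, 4.00, 12.00)   # Mistral Large 1
large2 = request_cost(10_000, 1_000, 2.00, 6.00)    # Mistral Large 2 / 2.1
medium3 = request_cost(10_000, 1_000, 0.40, 2.00)   # Mistral Medium 3
print(f"${large1:.3f} vs ${large2:.3f} vs ${medium3:.3f}")
```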
The following table shows Mistral Large 2 (2407/2411) performance on standard benchmarks alongside selected comparable models at the time of launch.
| Benchmark | Mistral Large 2 | GPT-4o | Claude 3 Opus | Llama 3.1 405B |
|---|---|---|---|---|
| MMLU (English) | 84.0% | 88.7% | 86.8% | 88.6% |
| HumanEval | 92% | 90.2% | 84.9% | 89.0% |
| HumanEval Plus | 87% | -- | -- | -- |
| MBPP Base | 80% | -- | -- | -- |
| GSM8K | 93% | 95.5% | 95.0% | 96.8% |
| MATH (instruct, CoT) | 71.5% | 76.6% | 60.1% | -- |
| MT Bench | 8.63 | -- | 9.0 | -- |
| Arena Hard | 73.2 | 79.3 | 63.6 | 69.3 |
| Wild Bench | 56.3 | 60.1 | 50.6 | -- |
On function calling benchmarks, Mistral AI reported that Mistral Large 2 outperformed GPT-4o and Claude 3.5 Sonnet. This result was notable given that function calling is a particularly important capability for agentic applications.
Multilingual MMLU scores showed Mistral Large 2 consistently ranked second behind Llama 3.1 405B across most languages, with particularly strong results in French, Spanish, Italian, and Portuguese (all above 81%), and weaker performance in Korean (60.1%).
The following table shows a qualitative comparison of Mistral Large 2.1 against its primary competitors at the time of its November 2024 release.
| Dimension | Mistral Large 2.1 | GPT-4o (Nov 2024) | Claude 3.5 Sonnet (Oct 2024) |
|---|---|---|---|
| Parameters | 123B (dense) | Undisclosed | Undisclosed |
| Context window | 128K tokens | 128K tokens | 200K tokens |
| Open weights | Yes (MRL) | No | No |
| MMLU | 84.0% | 88.7% | 88.7% |
| HumanEval | 92% | 90.2% | 92.0% |
| GSM8K | 93% | 95.5% | 96.4% |
| Function calling | Strong | Strong | Strong |
| Input price ($/M tokens) | $2.00 | $2.50 | $3.00 |
| Output price ($/M tokens) | $6.00 | $10.00 | $15.00 |
| Self-hosting | Yes | No | No |
Mistral Large 2.1's most distinctive advantage over GPT-4o and Claude 3.5 Sonnet is the availability of open weights under the MRL. For enterprises that require air-gapped or on-premises deployment, particularly in regulated industries like finance and healthcare, the ability to self-host a 123-billion-parameter flagship model is a practical differentiator. Neither OpenAI nor Anthropic offered self-hosted versions of their flagship models in this period.
On raw benchmark performance, GPT-4o and Claude 3.5 Sonnet generally score higher than Mistral Large 2.1 on knowledge-heavy evaluations like MMLU, but the gap on coding (HumanEval) is much narrower. On Arena Hard, a benchmark based on human preference ratings, Claude 3.5 Sonnet scored lower than Mistral Large 2, while GPT-4o scored higher.
Pricing favors Mistral Large 2.1 substantially on output tokens, where it was roughly 1.7x cheaper than GPT-4o and 2.5x cheaper than Claude 3.5 Sonnet at listed rates.
Mistral Large 2 and 2.1 are used primarily for tasks that require a combination of long-context understanding, multilingual capability, and structured output.
Code generation is a core deployment area. With a 92% score on HumanEval and training coverage across 80+ programming languages, Mistral Large 2 handles code completion, test generation, debugging, and documentation. Developers frequently pair it with Codestral, Mistral's dedicated code model, using Mistral Large 2 for higher-level reasoning and architecture tasks while Codestral handles fill-in-the-middle completions in editors.
Retrieval-augmented generation (RAG) is another significant use case. Mistral AI trained the model to acknowledge uncertainty explicitly rather than hallucinate, which enterprise users have noted makes it more reliable for knowledge-base query pipelines where precision matters. The 128k context window allows full documents to be included in the prompt without chunking.
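A RAG prompt for this pattern can be as simple as concatenating retrieved passages with an instruction to defer to them; the wording below is illustrative, not taken from Mistral documentation:

```python
def build_rag_prompt(question, passages):
    # Number the retrieved passages so answers can cite them, and instruct
    # the model to admit when the context lacks the answer -- leaning on
    # Large 2's training to acknowledge uncertainty rather than guess.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below. If the context does not "
        "contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_rag_prompt(
    "What is the warranty period?",
    ["Section 4: The warranty period is 24 months from delivery."],
)
```

With the 128k window, entire documents can be passed as passages without chunking, at the cost of higher per-request token spend.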
Agentic workflows using function calling and the Mistral Agents API are increasingly common. The improved parallel and sequential tool execution in Mistral Large 2, and the further improvements to function calling reliability in Large 2.1, make the model well-suited as an orchestrator in multi-step workflows that call external APIs and databases.
Multilingual document processing is a common enterprise application, particularly for European organizations where French, German, Spanish, Italian, Portuguese, and Dutch coverage is needed at high quality.
Data sovereignty-sensitive deployments represent a use case category that the open weights specifically address. Financial services, healthcare, and government organizations that cannot send proprietary data to a third-party API can download and run Mistral Large 2 internally while meeting regulatory requirements.
Codestral is a separate model family from Mistral AI, specifically designed for code generation. Released in May 2024, Codestral is a 22-billion-parameter model with a fill-in-the-middle capability that integrates with code editors through the Mistral API and through plugins for VS Code and JetBrains products.
Mistral Large 2 and Codestral are complementary rather than redundant. Mistral Large 2 handles reasoning-heavy tasks such as architecture design, code review, and complex debugging that require understanding the broader context of a codebase. Codestral handles high-frequency, latency-sensitive completions at a fraction of the inference cost. In production systems, the two models are often used in the same pipeline, with Codestral handling inline completions and Mistral Large 2 handling longer reasoning tasks.
The original Mistral Large received positive coverage at launch in February 2024, primarily because its arrival on Azure represented a direct challenge to OpenAI's previously exclusive position on Microsoft's platform. The Microsoft investment was also noted as a sign of confidence in a European AI company competing at the frontier level.
Mistral Large 2, released in July 2024, received substantially more detailed technical attention because of the weight release. Analysts noted that 84.0% on MMLU, combined with 92% on HumanEval, placed it genuinely competitive with models from OpenAI and Anthropic rather than merely adjacent to them. IBM's announcement of Mistral Large 2 as an available model on watsonx, under IBM's own customer license terms, was read as a signal that regulated-industry enterprises were willing to accept Mistral AI as a tier-one vendor.
Criticism focused on a few areas. The Mistral Research License was viewed as more restrictive than Llama 3's terms for commercial deployment, and developers accustomed to the Apache 2.0 terms of Mistral's earlier open models found the MRL's commercial restrictions inconvenient. The model's inference speed, measured at approximately 32 tokens per second in third-party evaluations, was notably slower than the roughly 63-token-per-second median for comparable open-weight models, which affected cost-performance comparisons in high-throughput production settings.
Mistral AI's decision not to publish an arXiv technical report for Mistral Large 2 drew some comment. Mistral 7B was accompanied by a detailed paper (arXiv 2310.06825), and the absence of equivalent documentation for the larger model made independent architectural analysis more difficult.
Several limitations affect Mistral Large 2 in practice.
Deployment scale requirements are substantial. Running the model in bf16 precision requires over 300 GB of GPU VRAM, which translates to at least four 80 GB accelerators (such as the A100 80GB or H100) or three 141 GB H200 units. Organizations running inference at scale typically need eight or more GPUs to achieve reasonable throughput. This limits self-hosted deployment to larger enterprises with existing ML infrastructure.
Korean language performance lags behind other supported languages. The 60.1% MMLU score for Korean compares unfavorably to the 81-82% range for major European languages, and reflects the training data distribution of the model.
Multimodal inputs are not supported in Mistral Large 2 or 2.1. Image understanding requires a different model (Pixtral Large, which shares the text backbone but adds a vision encoder). For teams that need both text and vision capabilities, a separate model endpoint is required.
The model does not include a dedicated long-context retrieval mechanism. While the 128k token window is large, performance on tasks that require precise citation from text near the middle of a long context is lower than performance on text near the beginning or end, a pattern observed in most transformer architectures of this generation.
Commercial licensing terms add friction for startups and smaller development teams. Developers who want to build a commercial product on self-hosted Mistral Large 2 weights must negotiate a separate agreement with Mistral AI, rather than simply accepting a published license. This process introduces uncertainty for early-stage products.