Mistral 7B
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 6,283 words
Mistral 7B is a 7.3-billion-parameter, decoder-only large language model released by Mistral AI on September 27, 2023, under the Apache 2.0 license. It was the company's first publicly released model and one of the first 7B-class systems to outperform Meta's Llama 2 13B across most standard benchmarks at the time of release, while also matching or beating the much larger LLaMA 1 34B on reasoning, math and code tasks.[1][2] The launch made a simple but consequential point: with the right architectural choices and a careful training mix, a 7B model could match a 13B competitor on most evaluations while costing far less to serve. Mistral 7B established Mistral AI as a serious player in foundation-model research less than five months after the company was founded, and it set the template that most subsequent dense open-weights LLMs would follow.[1][3]
The model shipped under the Apache 2.0 license, with weights distributed both through Hugging Face and through a direct BitTorrent magnet link that Mistral posted on X (formerly Twitter) the day before the official blog announcement.[2][3][4] That magnet link became something of a meme in open-source AI circles, partly because Llama 2's license at the time included acceptable-use restrictions and a 700-million-monthly-active-user clause that some saw as not quite "open." Mistral 7B contained no such restrictions.[5]
| Field | Value |
|---|---|
| Developer | Mistral AI |
| Initial release | September 27, 2023[2] |
| Latest version | Mistral 7B v0.3 / Instruct v0.3 (May 22, 2024)[6] |
| Parameter count | ~7.24 billion (rounded to "7B" in the name; ~7.3 billion as quoted in the announcement)[1][2] |
| Architecture | Decoder-only Transformer with GQA + SWA, RoPE, RMSNorm, SwiGLU[1] |
| Context length | 8,192 tokens (v0.1); 32,768 tokens (v0.2, v0.3)[7][8] |
| Vocabulary | 32,000 (v0.1) / 32,768 (v0.3)[1][6] |
| Tokenizer | SentencePiece byte-fallback BPE (v3 in v0.3)[1][6] |
| License | Apache License 2.0[2][3] |
| Paper | arXiv:2310.06825 (October 10, 2023)[1] |
Mistral AI was founded in April 2023 in Paris by Arthur Mensch, Guillaume Lample, and Timothée Lacroix.[9] The three co-founders had originally met as students at the École Polytechnique outside Paris. Mensch had been a research scientist at Google DeepMind, where he was one of the lead authors on the Chinchilla scaling-laws paper. Lample and Lacroix had been research scientists at Meta AI, where they were among the lead authors of the original LLaMA paper. Mensch took the CEO role, Lample became Chief Scientist, and Lacroix became Chief Technology Officer.[9]
The new company raised a roughly €105 million ($113 million) seed round in June 2023, led by Lightspeed Venture Partners, with participation from Xavier Niel, JCDecaux Holding, Eric Schmidt, Bpifrance, Rodolphe Saadé, and others.[10][11] Reports at the time framed it as the largest seed round in European history, valuing the four-week-old company at roughly €240 million (around $260 million).[10][11] The fundraise was widely cited as evidence that European investors were now willing to put nine-figure cheques behind frontier-AI research; the founders pitched the company as building open, sovereign foundation models as an alternative to closed US labs.
The first model was promised within months of the company's founding. Internally Mistral AI was building toward something larger (the mixture-of-experts model that would eventually ship as Mixtral 8x7B), but the team wanted an open release out the door first.[12] That release was Mistral 7B.
A few things made Mistral 7B more than just another open-weights checkpoint:
For a company that had existed for under five months at the time of the release, all of this was unusually self-confident. It worked.
The Mistral 7B technical report was posted to arXiv on October 10, 2023 as arXiv:2310.06825.[1] The eighteen listed authors are Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed.[1] Many of them carried over experience from DeepMind, Meta AI, and Hugging Face, where Le Scao had led the BigScience BLOOM project.
The blog post announcing the release went up on mistral.ai on September 27, 2023, with the headline "Mistral 7B, the best 7B model to date."[2] It claimed three months of development from the founding of the company to the release of the model.[2]
Mistral 7B is a decoder-only transformer in the same broad family as LLaMA and Llama 2. It keeps the now-standard combination of pre-normalisation with RMSNorm, SwiGLU feed-forward layers, and rotary position embeddings (RoPE) on the queries and keys.[1][13] The notable choices are at the attention level, where Mistral pairs grouped-query attention (GQA) with sliding window attention (SWA).[1]
The full configuration from Table 1 of the paper:[1]
| Parameter | Value |
|---|---|
| Total parameters | ~7.24 billion |
| Layers (n_layers) | 32 |
| Model dimension (dim) | 4096 |
| Feed-forward hidden dimension (hidden_dim) | 14,336 |
| Attention heads (n_heads) | 32 |
| Key-value heads (n_kv_heads) | 8 |
| Head dimension (head_dim) | 128 |
| Vocabulary size (vocab_size) | 32,000 (byte-fallback BPE, Llama-style) |
| Sliding window (window_size) | 4096 tokens |
| Context length (context_len) | 8192 tokens |
| Positional encoding | RoPE |
| Normalisation | RMSNorm (pre-norm) |
| Activation | SwiGLU |
| Tokenizer | SentencePiece, Llama-style byte-fallback BPE |
The 32-to-8 ratio of query heads to key-value heads is the GQA factor, and it cuts the size of the KV cache by 4x compared to standard multi-head attention with no measurable drop in quality once the model is trained from scratch with that configuration.[1][14] This is the single most important change for inference cost on long contexts.
Mistral 7B did not introduce any individually new components. RMSNorm comes from Zhang and Sennrich's 2019 root-mean-square normalisation paper, SwiGLU from Shazeer's GLU variants work, and RoPE from Su et al. 2021.[13][15] The combination is the one LLaMA popularised in early 2023, and Mistral 7B inherits it almost wholesale, swapping only the attention layer. The pre-norm placement (normalisation before each sub-block instead of after) is the standard transformer recipe that has dominated training-stability practice since the GPT-3 era.[13]
The MLP follows the standard SwiGLU geometry: a projection up to hidden_dim = 14,336, a gated SiLU non-linearity, and a projection back down to dim = 4096. The 14,336 figure is exactly 3.5 times the model dimension, noticeably wider than the roughly 8/3 (about 2.67) multiplier that became the open-source convention for SwiGLU MLPs after LLaMA.[15]
Grouped-Query Attention was introduced in the GQA paper by Joshua Ainslie and colleagues at Google in May 2023 (arXiv:2305.13245).[14] The idea is a middle ground between vanilla multi-head attention, where every query head has its own key and value projections, and multi-query attention (MQA), where all query heads share a single set of key and value projections. GQA partitions the query heads into a smaller number of groups and gives each group its own key and value projection. Mistral 7B uses 32 query heads grouped into 8 KV heads, so each group of 4 query heads shares one set of K and V matrices.[1][14]
The practical effect is that the KV cache, which dominates memory during autoregressive generation at long sequence lengths, shrinks by the group factor. That makes batched serving cheaper and lets the model fit longer contexts in the same memory budget. The GQA paper showed that uptraining a multi-head model into a GQA configuration recovers nearly all of the original quality, and Mistral 7B confirmed that training from scratch with GQA works just as well.[14]
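The grouping and the resulting cache saving can be sketched with a little arithmetic. The snippet below is a minimal illustration, not code from the Mistral release: the function names are hypothetical, contiguous grouping of query heads is assumed, and the cache is assumed to store 2-byte (fp16/bf16) values.

```python
# Sketch of GQA bookkeeping using the Table 1 configuration.
N_LAYERS = 32
N_HEADS = 32        # query heads
N_KV_HEADS = 8      # shared key/value heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumption: fp16/bf16 cache entries


def kv_head_for(query_head: int) -> int:
    """Index of the K/V head shared by a query head (contiguous groups assumed)."""
    group_size = N_HEADS // N_KV_HEADS  # 4 query heads per K/V head
    return query_head // group_size


def kv_cache_bytes_per_token(n_kv_heads: int) -> int:
    """Bytes of keys plus values stored per token, summed over all layers."""
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES_PER_VALUE


mha_bytes = kv_cache_bytes_per_token(N_HEADS)     # full multi-head attention
gqa_bytes = kv_cache_bytes_per_token(N_KV_HEADS)  # grouped-query attention

print(mha_bytes, gqa_bytes, mha_bytes // gqa_bytes)  # 524288 131072 4
```

Under these assumptions each cached token costs 512 KiB with full multi-head attention and 128 KiB with GQA, which is the 4x group factor described above.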
Llama 2 had already adopted GQA at the 34B and 70B sizes but kept full multi-head attention for the 7B and 13B variants.[16] Mistral 7B was one of the first widely released sub-10B open models to ship with GQA, and the pattern was picked up almost immediately by the rest of the field. Within a year, GQA was the default for new dense decoder LLMs in roughly the 1B to 100B range, including the Llama 3, Gemma, and Qwen families.[17][18]
The second architectural choice is sliding window attention, originally introduced for the Longformer model by Iz Beltagy, Matthew Peters, and Arman Cohan in April 2020 (arXiv:2004.05150).[19] In a sliding window of size W, each token only attends to the previous W tokens rather than to the entire history. The cost of attention drops from O(n²) to O(n·W), and the receptive field grows linearly with depth: with 32 layers and a window of 4096, the effective receptive field reaches 32 × 4096 = 131,072 tokens, far beyond the nominal 8192 context length.[1][2]
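One rough way to see the O(n²) versus O(n·W) difference is to count query-key pairs. The sketch below is illustrative only: the function names are made up for this example, and it assumes each query scores at most W keys with its own position counted inside the window.

```python
# Count attention score entries for causal full attention vs. a sliding window.

def causal_pairs(n: int) -> int:
    """Query-key pairs under ordinary causal attention over n tokens."""
    return n * (n + 1) // 2


def sliding_pairs(n: int, window: int) -> int:
    """Query-key pairs when each query sees at most `window` keys (assumption:
    the current position counts towards the window)."""
    return sum(min(i + 1, window) for i in range(n))


n, window = 16_384, 4_096
full = causal_pairs(n)
windowed = sliding_pairs(n, window)
print(full, windowed, round(full / windowed, 2))  # 134225920 58722304 2.29
```

For a 16k-token sequence and a 4k window, the windowed variant scores a bit under half as many pairs, which is in the same range as the roughly 2x speed-up Mistral reported for that setting.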
Mistral pairs sliding-window attention with a rolling KV-cache buffer. At position i, only the keys and values for positions i − W to i − 1 are kept in memory; older entries are overwritten in place inside a fixed-size circular buffer of size W. The cache index at timestep i is simply i mod W. The result is that memory per layer stays constant once the prompt passes the window size, regardless of how long the actual prompt is.[1] On a 32k-token sequence, the paper reports the rolling buffer reduces cache memory usage by 8x relative to full attention without hurting quality, and the launch blog highlighted a 2x speed improvement over standard attention for a 16k-token sequence at a 4k window, on top of the GQA savings.[1][2]
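The rolling buffer is essentially a circular array indexed by i mod W. The following is a minimal sketch of that idea with a hypothetical `RollingKVCache` class that stores opaque per-token entries; a real implementation would hold key and value tensors per layer.

```python
# Minimal circular-buffer sketch of a rolling KV cache (illustrative only).

class RollingKVCache:
    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window  # fixed-size storage, never grows
        self.length = 0               # number of tokens seen so far

    def append(self, entry) -> None:
        """Store the entry for position `self.length` at slot (position mod W)."""
        self.slots[self.length % self.window] = entry
        self.length += 1

    def visible(self) -> list:
        """Cached entries in chronological order (at most `window` of them)."""
        if self.length <= self.window:
            return self.slots[: self.length]
        start = self.length % self.window  # slot holding the oldest entry
        return self.slots[start:] + self.slots[:start]


cache = RollingKVCache(window=4)
for position in range(10):
    cache.append(position)  # positions stand in for (key, value) pairs

print(cache.visible())  # [6, 7, 8, 9]
```

After ten tokens with W = 4, only the last four positions remain, and storage has stayed at four slots throughout.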
For very long prompts, the paper also describes pre-fill chunking: split the prompt into chunks of size W, process them sequentially using a causal mask within each chunk and a sliding-window mask against cached prior chunks, and let the rolling cache accumulate the relevant state.[1] In the original release Mistral promoted an "effective" context of 32K thanks to SWA plus rolling buffer, although in practice quality at very long contexts depended heavily on the use case.[2] The v0.2 instruct model later increased the nominal context window to 32,768 tokens and dropped sliding-window attention from the default configuration, signalling that full attention with a longer base context had become the more common pattern across the field.[7][20]
The receptive-field calculation deserves a closer look because it is one of the easier-to-misread numbers in the paper. At layer 1, a token sees the previous W = 4096 tokens through one attention operation. At layer 2, each of those 4096 tokens has already aggregated information from a 4096-token window of its own, so the layer-2 query effectively reaches back roughly 2W tokens. At layer k the reachable span is approximately k·W. With k = 32 layers and W = 4096, the theoretical span is 32 × 4096 = 131,072 tokens.[1]
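The layer-by-layer argument can be checked with a toy propagation. This is an idealised sketch: `earliest_reachable` is a hypothetical helper, each layer is assumed to let a position draw on positions up to W steps earlier, and attenuation is ignored.

```python
# Idealised reachability under stacked sliding-window attention.

def earliest_reachable(position: int, layers: int, window: int) -> int:
    """Earliest input position whose information can reach `position`
    after `layers` attention layers, each looking back `window` steps."""
    earliest = position
    for _ in range(layers):
        earliest = max(0, earliest - window)  # one more hop back per layer
    return earliest


position = 200_000
span = position - earliest_reachable(position, layers=32, window=4096)
print(span)  # 131072
```

The printed span equals 32 × 4096, matching the theoretical figure; the helper clamps at position 0 for shorter sequences.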
The practical span is smaller, since information attenuates as it has to be re-aggregated layer by layer, but the construction explains how a model with a 4k attention window and only 8k positional embeddings can carry useful long-range signal much further than naïve attention would suggest. The Mistral team's reported FlashAttention modifications also yielded a 2x speed boost for 16k-token sequences over the vanilla attention baseline.[1]
The original v0.1 tokenizer was a LLaMA-style byte-fallback BPE trained with SentencePiece, with a vocabulary size of 32,000 tokens.[1][3] Byte-fallback BPE means that any character that is not covered by the learned merge vocabulary is encoded as a sequence of raw UTF-8 bytes; the tokenizer therefore never fails on unfamiliar characters or scripts. Mistral 7B v0.3 extended the vocabulary to 32,768 entries to make room for new control tokens and to improve efficiency on certain scripts.[6] The v0.3 update introduced the "v3 tokenizer" packaged via the mistral_common library; later Mistral releases (NeMo, Pixtral, Mistral Large 2) used yet newer tokenizers, including the Tekken tokenizer derived from OpenAI's tiktoken.[21]
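Byte fallback can be illustrated with a toy encoder. The sketch below is not the SentencePiece algorithm: it uses a character-level vocabulary instead of learned BPE merges, the function name is hypothetical, and the `<0xNN>` spelling of byte pieces is used for illustration.

```python
# Toy byte-fallback encoder: unknown characters become raw UTF-8 byte pieces.

def byte_fallback_pieces(text: str, vocab: set) -> list:
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)  # covered by the (toy) vocabulary
        else:
            # Fall back to one piece per UTF-8 byte, so encoding never fails.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


toy_vocab = {"a", "b", "c"}  # assumption: tiny character vocabulary
print(byte_fallback_pieces("a€", toy_vocab))
# ['a', '<0xE2>', '<0x82>', '<0xAC>']
```

The euro sign is outside the toy vocabulary, so it is emitted as its three UTF-8 bytes rather than as an unknown-token placeholder.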
Mistral AI has not published a complete account of the training data or compute budget for Mistral 7B. The paper notes that the model was pretrained on data extracted from "the open Web" and emphasises that the model is a base model with no built-in moderation, leaving safety alignment to downstream fine-tuners.[1] Total parameter count is approximately 7.24 billion when summed across embeddings, attention projections, and MLP layers, hence the "7B" name; the announcement blog rounds this to "7.3 billion."[1][2]
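The ~7.24 billion figure can be reproduced from the Table 1 configuration. The tally below is a sketch under stated assumptions: no bias terms, separate (untied) input embedding and output projection matrices, and two RMSNorm weight vectors per layer plus one final norm.

```python
# Parameter tally from the Table 1 configuration (assumptions noted inline).
DIM = 4096
N_LAYERS = 32
HIDDEN_DIM = 14_336
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
VOCAB_SIZE = 32_000

# Attention: Q and output projections are full width; K and V use 8 heads (GQA).
attention = (
    DIM * N_HEADS * HEAD_DIM         # query projection
    + 2 * DIM * N_KV_HEADS * HEAD_DIM  # key and value projections
    + N_HEADS * HEAD_DIM * DIM       # output projection
)
mlp = 3 * DIM * HIDDEN_DIM  # SwiGLU: gate, up, and down projections
norms = 2 * DIM             # assumption: two RMSNorm vectors per layer

per_layer = attention + mlp + norms
embeddings = 2 * VOCAB_SIZE * DIM  # assumption: untied input/output matrices
total = N_LAYERS * per_layer + embeddings + DIM  # + final RMSNorm

print(f"{total:,}")  # 7,241,732,096
```

About 96% of the total sits in the 32 transformer blocks, with the MLP projections accounting for most of each block.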
Hardware and exact token counts have not been disclosed in print. What is documented is the architectural recipe (the nine-entry config in Table 1) and the published evaluation numbers. The instruct variant was trained via supervised fine-tuning on publicly available instruction-following datasets, without RLHF or DPO in the v0.1 release.[1] The v0.1 paper additionally describes a content-moderation experiment in which the model was prompted to self-classify its own outputs into categories such as illegal activities, hateful content, and unqualified advice, with the authors reporting 99.4% precision and 95.6% recall on a curated adversarial test set.[1]
The Mistral 7B paper benchmarks the base model against LLaMA 1 (7B, 13B, 33B), Llama 2 (7B, 13B), and Code Llama 7B across a standard suite of evaluations. The headline numbers from Table 2 of the paper (Mistral 7B vs the closest competitor, Llama 2 13B):[1]
| Benchmark | Mistral 7B | Llama 2 13B | Llama 2 7B | Code-Llama 7B |
|---|---|---|---|---|
| MMLU (5-shot) | 60.1% | 55.6% | 44.4% | 36.9% |
| HellaSwag (0-shot) | 81.3% | 80.7% | 77.1% | 62.9% |
| WinoGrande (0-shot) | 75.3% | 72.9% | 69.5% | 62.3% |
| PIQA (0-shot) | 83.0% | 80.8% | 77.9% | 72.8% |
| Arc-Easy | 80.0% | 75.2% | 68.7% | 59.4% |
| Arc-Challenge | 55.5% | 48.8% | 43.2% | 34.5% |
| NaturalQuestions | 28.8% | 29.0% | 24.7% | 11.0% |
| TriviaQA | 69.9% | 69.6% | 63.8% | 34.9% |
| HumanEval (pass@1) | 30.5% | 18.9% | 11.6% | 31.1% |
| MBPP | 47.5% | 35.4% | 26.1% | 52.5% |
| MATH | 13.1% | 6.0% | 3.9% | 5.2% |
| GSM8K (8-shot, maj@8) | 52.2% | 34.3% | 16.0% | 20.8% |
Mistral 7B beat Llama 2 13B on every benchmark in the table except NaturalQuestions, where the two were within a percentage point. On MMLU the gap was about 4.5 points, on GSM8K it was about 18 points, and on HumanEval it was about 11.6 points.[1] The math and reasoning gaps were big enough that Mistral 7B was also competitive with or better than the much larger Llama 1 33B on those tasks, a comparison the launch blog turned into one of its headline framings.[1][2]
The paper additionally reports an MT-Bench score of 6.84 ± 0.07 for Mistral-7B-Instruct-v0.1, ahead of Llama 2 13B Chat at 6.65 and ahead of all other 7B chat models at the time of publication.[1] A side-by-side human preference test reported in the paper showed Mistral preferred 5,020 times versus Llama 2 13B Chat preferred 4,143 times in the assessed sample on llmboxing.com/leaderboard.[1] On MMLU specifically, Mistral 7B Instruct v0.1 scored 56.3%, which is several points below the base model's 60.1%, a typical pattern for early instruction-tuned 7B models, where the tuning data was not designed to preserve knowledge benchmarks.[22]
The headline framing in the release blog was that Mistral 7B "performs equivalently to a Llama 2 that would be more than 3x its size" on reasoning and reading comprehension.[2] That framing was marketing-flavoured, but the underlying numbers held up to independent scrutiny on Hugging Face's Open LLM Leaderboard, where Mistral 7B sat near the top of its weight class for most of late 2023.[23]
Mistral has shipped several iterations under the Mistral 7B name. The headline differences are tokenizer changes, instruction-following data, function-calling support, and the move from 8K to 32K context.
| Variant | Release | Notes |
|---|---|---|
| Mistral-7B-v0.1 (base) | Sept 27, 2023 | Original base model. 8K context, 32k vocab, GQA + SWA.[1][2] |
| Mistral-7B-Instruct-v0.1 | Sept 27, 2023 | First instruct version, supervised fine-tune on public instruction data; MT-Bench 6.84.[1][22] |
| Mistral-7B-Instruct-v0.2 | Dec 11, 2023 | Improved instruction following. 32K context (RoPE θ = 1e6), SWA disabled.[7][20] |
| Mistral-7B-v0.2 (base) | March 23, 2024 | Base release matching v0.2 instruct architecture, posted during a hackathon at SHACK15 in San Francisco co-hosted with Cerebral Valley.[24][25] |
| Mistral-7B-v0.3 (base) | May 22, 2024 | Vocabulary extended to 32,768 entries; v3 tokenizer.[6] |
| Mistral-7B-Instruct-v0.3 | May 22, 2024 | v3 tokenizer, function calling via [TOOL_CALLS], [AVAILABLE_TOOLS], [TOOL_RESULTS] control tokens.[6][26] |
v0.1 used a strict 8K context window and 4K sliding window. v0.2, released as an instruct fine-tune on December 11, 2023, raised the nominal context to 32,768 tokens, removed sliding-window attention from the default configuration, and increased the RoPE base frequency to θ = 1 × 10⁶ to better support long-context extrapolation.[7][20] In config.json terms, sliding_window was set to null, max_position_embeddings to 32,768, and rope_theta from 10,000.0 to 1,000,000.0.[7][20]
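The effect of the larger RoPE base can be seen in the rotation frequencies themselves. The sketch below uses the standard RoPE formula, in which pair j of a head rotates at frequency θ^(−2j/d); the helper names are hypothetical, and treating the longest wavelength as a proxy for long-context behaviour is a simplification.

```python
import math

HEAD_DIM = 128  # per-head dimension from the configuration table


def rope_inverse_frequencies(theta: float, head_dim: int = HEAD_DIM) -> list:
    """Standard RoPE frequencies: theta ** (-2j / d) for j = 0 .. d/2 - 1."""
    return [theta ** (-2 * j / head_dim) for j in range(head_dim // 2)]


def longest_wavelength(theta: float, head_dim: int = HEAD_DIM) -> float:
    """Token distance after which the slowest-rotating pair completes a turn."""
    slowest = rope_inverse_frequencies(theta, head_dim)[-1]
    return 2 * math.pi / slowest


for theta in (10_000.0, 1_000_000.0):
    print(f"theta={theta:>11,.0f}  longest wavelength ~ {longest_wavelength(theta):,.0f} tokens")
```

With θ = 10,000 the slowest pair wraps around after roughly 54,000 tokens; with θ = 1,000,000 that stretches to roughly 5 million, so positions across a 32,768-token window occupy a much smaller fraction of the slowest rotations.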
The matching base model was released about three months later in March 2024 at Mistral's hackathon at SHACK15 in San Francisco, co-hosted with the Cerebral Valley community.[24][25] These were the first non-instruct v0.2 weights to be officially distributed by Mistral. Because the official mistralai organisation on Hugging Face did not initially host the v0.2 base weights, the early redistribution lived at mistral-community/Mistral-7B-v0.2 and alpindale/Mistral-7B-v0.2-hf.[25] v0.2 became the workhorse for fine-tuning experiments throughout 2024 because it kept the 7.3B parameter count and Apache 2.0 license but added the longer context that downstream applications had started to expect.
The v0.3 generation extended the vocabulary from 32,000 to 32,768 entries to add three new control tokens, [TOOL_CALLS], [AVAILABLE_TOOLS], and [TOOL_RESULTS], used by the structured function-calling format.[6][26] Function calls are issued by the model emitting a JSON payload between [TOOL_CALLS] boundaries, and tool results are returned inside [TOOL_RESULTS] boundaries; tool-call IDs are constrained to exactly nine alphanumeric characters.[6] The v0.3 release accompanied a broader push by Mistral to support agentic workloads alongside the Mixtral 8x7B and Mistral Large product lines.[26]
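The nine-character constraint on tool-call IDs is easy to check mechanically. The helpers below are a hypothetical sketch, not part of mistral_common: they only encode the "exactly nine alphanumeric characters" rule stated above and assume ASCII letters and digits.

```python
import re
import secrets
import string

# Assumption: "alphanumeric" means ASCII letters and digits only.
_ID_PATTERN = re.compile(r"[A-Za-z0-9]{9}")
_ALPHABET = string.ascii_letters + string.digits


def is_valid_tool_call_id(candidate: str) -> bool:
    """True if the ID is exactly nine ASCII alphanumeric characters."""
    return _ID_PATTERN.fullmatch(candidate) is not None


def new_tool_call_id() -> str:
    """Generate a random ID that satisfies the nine-character rule."""
    return "".join(secrets.choice(_ALPHABET) for _ in range(9))


print(is_valid_tool_call_id("abc123XYZ"))   # True
print(is_valid_tool_call_id("call_1"))      # False
```

A check like this is useful when wiring tool results back to calls, since an ID of the wrong shape may be rejected by stricter chat-template implementations.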
The instruct variants use a chat template centred on the [INST] and [/INST] control tokens. The very first user instruction is preceded by the <s> begin-of-sentence token; subsequent instructions are not. Assistant generation ends with the </s> end-of-sentence token. A typical multi-turn sequence looks like:
<s>[INST] What is your favourite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice...</s>
[INST] Do you have mayonnaise recipes? [/INST]
The template is built into the tokenizer.apply_chat_template method in Hugging Face Transformers, which handles the formatting automatically when supplied with a list of {"role": ..., "content": ...} messages.[7][22]
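To make the token layout concrete, the template can be approximated in a few lines of plain Python. This is a hand-rolled sketch with a hypothetical `format_mistral_chat` helper; the exact whitespace around the control tokens is an assumption, and in real use the tokenizer's own apply_chat_template output should be treated as authoritative.

```python
# Approximate the [INST] chat layout for a list of role/content messages.

def format_mistral_chat(messages: list) -> str:
    prompt = "<s>"  # emitted once, before the first user instruction
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            prompt += f"[INST] {content} [/INST]"
        elif role == "assistant":
            prompt += f" {content}</s>"  # each assistant turn ends with </s>
        else:
            raise ValueError(f"unsupported role: {role}")
    return prompt


conversation = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Fresh lemon juice."},
    {"role": "user", "content": "Do you have mayonnaise recipes?"},
]
print(format_mistral_chat(conversation))
```

The result is a single string in which only the first instruction follows `<s>`, and the final `[/INST]` is left open for the model to continue.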
Mistral 7B was downloaded heavily within hours of release and immediately became the base model for a wave of community fine-tunes. A few of the most influential:
zephyr-7b-sft-full and applied a re-binarised version of UltraFeedback using preference ratings rather than the original critique scores. It reached MT-Bench 7.30 and AlpacaEval 91.42%, slightly above Zephyr Beta on the latter.[30]
The architectural pattern of GQA plus sliding-window attention plus RoPE plus RMSNorm plus SwiGLU, with a roughly 4x query-to-KV-head ratio, became the default recipe for new dense open-weights LLMs in the 2024 to 2026 period. Models from Alibaba's Qwen line, Google's Gemma line, Meta's Llama 3 line, and several others adopted the same general blueprint, with variations on window size or whether to keep SWA at all.[17][18]
The release pattern (weights first, paper later, no application form) has also stuck. Within a year, the default community expectation for a serious open release was Apache 2.0 or a similarly permissive license, weights on Hugging Face, day-one support in popular inference engines, and at most a brief blog post. Anything more restrictive started to look defensive.[4][5]
For Mistral AI itself, the success of the 7B release set up a sequence of larger funding rounds:
The company became one of the most-cited examples of European AI capacity in policy discussions about sovereignty and competitiveness. Existing investors Nvidia, DST Global, Andreessen Horowitz, Bpifrance, General Catalyst, Index Ventures, and Lightspeed also participated in the Series C alongside ASML.[38][39]
The wider ecosystem of fine-tunes built on Mistral 7B is hard to count precisely. As of 2026, the Hugging Face hub lists thousands of derivative models, including instruction-tuned variants in dozens of languages, role-play and uncensored models, code-focused fine-tunes, retrieval-augmented setups, and small-scale reasoning models. Many of the early "Mistral" community fine-tunes were the first widely used non-Meta open-weights chat models that people felt they could deploy commercially without legal review.[5][17]
Part of the reason Mistral 7B took off so quickly is that the inference story was extremely friendly. Day-one support landed in vLLM, text-generation-inference (TGI), and llama.cpp.[2][3] Within a week there were quantised GGUF, GGML, GPTQ, AWQ, and EXL2 builds on Hugging Face from community contributors, several of which fit comfortably on a single 8 GB consumer GPU.[40]
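A back-of-the-envelope estimate shows why quantised builds fit on small GPUs. The figures below are a rough sketch: they count weight storage only, using the ~7.24 billion parameter total, and ignore quantisation metadata, the KV cache, and activations, all of which add overhead in practice.

```python
# Rough weight-memory estimate at different precisions (weights only).
N_PARAMS = 7_241_732_096  # approximate total parameter count


def weight_gib(bits_per_weight: float) -> float:
    """Weight storage in GiB, ignoring scales, zero-points, and runtime buffers."""
    return N_PARAMS * bits_per_weight / 8 / 2**30


for label, bits in (("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label:>9}: ~{weight_gib(bits):.1f} GiB")
```

At 16 bits the weights alone need roughly 13.5 GiB, while a 4-bit build needs roughly 3.4 GiB, leaving headroom on an 8 GB card for the cache and activations.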
Concrete deployment numbers worth noting:
Tooling support spread quickly. Ollama added a pre-packaged Mistral 7B build very early, and the model became one of the most-downloaded entries in the Ollama library through 2024 and 2025.[41] LM Studio, Jan, GPT4All, and the major commercial inference hosts (Together AI, Anyscale, Fireworks, Replicate, OpenRouter, and others) all offered hosted Mistral 7B endpoints within weeks of release. The official mistralai/Mistral-7B-v0.1 repository logged over 500,000 downloads in its first month of release and well above 900,000 monthly downloads for stretches of 2024 to 2025, putting it among the most-downloaded open-weights causal-LM repositories on the platform.[42][43]
Because v0.1 was released under Apache 2.0 with no acceptable-use clause, the post-launch fine-tuning ecosystem was unusually wide. The most commonly used wrappers for Mistral 7B fine-tuning include Hugging Face's Transformers plus TRL (TRL ships built-in SFT and DPO trainers), PEFT for parameter-efficient training, LoRA and QLoRA for low-cost adapter fine-tunes, and Mistral's own mistral-finetune repository released alongside v0.3.[44] LLaMA-Factory also added Mistral support among its first batch of non-LLaMA architectures.
The release used two distribution channels at once. The official Hugging Face repository at mistralai/Mistral-7B-v0.1 (and the corresponding instruct variants) hosted the SafeTensors weights.[3] Separately, the Mistral team posted a BitTorrent magnet link on social media a day before the blog post went live. The torrent contained the same weights plus a sample inference script.[4][45]
The license is the Apache License, version 2.0, with no acceptable-use addendum, no platform-size restrictions, no separate research-only clause, and no requirement to identify model outputs.[2][3] That is among the most permissive licenses in use for foundation models. By contrast, Llama 2's "Community License" at the time included a 700-million-monthly-active-user restriction, an acceptable-use policy, and a requirement to attribute outputs as Llama-derived.[16][46]
The combination of a recognised permissive license, a clean state-of-the-art claim at the 7B size, and a low barrier to actually running the thing was the trifecta that drove adoption.[4][5]
Mistral 7B was the first in what has become a wide line of releases. The most relevant follow-ups for understanding its place in the family:
| Model | Released | Notes |
|---|---|---|
| Mistral 7B | Sept 27, 2023 | Dense 7.3B, Apache 2.0.[2] |
| Mixtral 8x7B | Dec 11, 2023 | Sparse mixture-of-experts: 8 experts of ~7B each, 2 routed per token. About 46.7B total parameters and ~13B active. Apache 2.0.[12] |
| Mistral Medium | Dec 2023 | First proprietary commercial model (closed weights).[9] |
| Mistral Large | Feb 26, 2024 | Closed-weights commercial flagship, first hosted on Azure via Microsoft partnership.[35][36] |
| Mixtral 8x22B | April 2024 | Bigger MoE successor to Mixtral 8x7B. Apache 2.0.[9] |
| Mistral 7B v0.3 | May 22, 2024 | Updated tokenizer (32,768-entry vocab), function calling.[6] |
| Codestral 22B | May 29, 2024 | Code-focused dense model under the Mistral Non-Production License.[47] |
| Codestral Mamba 7B | July 16, 2024 | First Mistral model using the Mamba state-space architecture.[9] |
| Mathstral 7B | July 16, 2024 | Math-focused fine-tune.[9] |
| Mistral NeMo 12B | July 18, 2024 | 12B model built with NVIDIA, 128K context, Tekken tokenizer.[21] |
| Pixtral 12B | September 2024 | First multimodal Mistral release; based on the NeMo 12B text backbone.[48] |
| Ministral 3B / 8B | October 2024 | Smaller models for edge use.[9] |
| Mistral Small 3 (24B) | January 30, 2025 | 24B dense, Apache 2.0, ~81% MMLU, 32K context.[49] |
| Mistral Small 3.1 / 3.2 | March 2025 / June 2025 | Successive updates to the 24B Small line.[9] |
| Mistral Medium 3 | May 2025 | Enterprise-grade dense model.[9] |
| Magistral Small / Medium | June 2025 | Reasoning-focused models.[9] |
| Mistral Large 3 | December 2, 2025 | Flagship dense/MoE successor with 675B total / 41B active parameters; commercial.[50] |
By the time Mistral Large 3 shipped in late 2025, the original 7B was no longer the company's headline product, but it had not been retired. The base v0.3 weights remained one of the most heavily downloaded checkpoints on Hugging Face and stayed in active use for fine-tuning, distillation, and edge deployment.[43][50]
Mistral 7B is, by 2026 standards, a small model. There are clear limits.
What it remains useful for: a strong, well-documented, permissively licensed baseline for fine-tuning research and a standard reference architecture for understanding the GQA-plus-SWA design pattern.
In 2025 and into 2026 Mistral 7B continues to show up as the default starting point for academic fine-tuning papers, for university courses on LLM internals, and for production deployments where a small, locally hosted, permissively licensed model is the right fit. Mistral AI has not deprecated it. The v0.3 weights are still served from the official Hugging Face organisation, and Ollama, llama.cpp, vLLM, and TGI all maintain support.[41][43]
Mistral AI itself has shifted its public emphasis toward larger commercial models (Mistral Large 3, Mistral Medium 3, Magistral Medium) and toward the Mixtral MoE line. The September 2025 partnership and €1.3 billion investment from ASML, which gave the Dutch lithography company an 11% stake and made it Mistral's biggest single shareholder, signalled that the company is positioning itself as a long-term European AI champion with deep ties to the European semiconductor industry.[38][39] In late 2025 and early 2026 Mistral also broke ground on data centres near Paris and in Sweden, supported by a roughly $830 million infrastructure round, the first dedicated computing build-out of that scale for a European AI lab.[9]
The original 7B sits in the company's history the same way LLaMA 1 sits in Meta's: the first one out the door, the proof of concept, the model that made everything afterward easier to ship.