Mistral 7B
Last reviewed
May 1, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 3,374 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 1, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 3,374 words
Add missing citations, update stale details, or suggest a clearer explanation.
Mistral 7B is a 7.3 billion parameter open-weights large language model released by Mistral AI on September 27, 2023. It was the company's first publicly released model and one of the first 7B-class models to outperform Meta's LLaMA 2 13B across most standard benchmarks at the time of release. The launch made a simple but consequential point: with the right architectural choices and a careful training mix, a 7B model could match or beat a 13B competitor on most evaluations while costing far less to serve. Mistral 7B established Mistral AI as a serious player in foundation-model research only four months after the company was founded, and it triggered a wave of open-weights work in Europe.
The model shipped under the Apache 2.0 license, with weights distributed both through Hugging Face and through a direct BitTorrent magnet link that Mistral posted on X (formerly Twitter) the day before the official blog announcement. That magnet link became something of a meme in open-source AI circles, partly because Llama 2's license at the time included acceptable-use restrictions and a 700-million-monthly-active-user clause that some saw as not quite "open." Mistral 7B contained no such restrictions.
Mistral AI was founded in late April 2023 in Paris by Arthur Mensch, Guillaume Lample, and Timothée Lacroix. Mensch had been a research scientist at DeepMind and worked on the Chinchilla scaling-laws paper. Lample and Lacroix had been at Meta AI, where they were among the lead authors of the original LLaMA paper. The trio raised a 105 million euro (roughly $113 million) seed round in June 2023, led by Lightspeed Venture Partners with participation from Xavier Niel, JCDecaux Holding, Eric Schmidt, and others. At the time it was widely reported as the largest seed round in European history, valuing the four-week-old company at around 240 million euros.
The Mistral 7B technical report was posted to arXiv on October 10, 2023 (arXiv:2310.06825). The eighteen listed authors are Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Many of them carried over experience from DeepMind, Meta AI, and Hugging Face, where Le Scao had led the BigScience BLOOM project.
The blog post announcing the release went up on mistral.ai on September 27, 2023, with the headline "Mistral 7B, the best 7B model to date." It claimed three months of development from the founding of the company to the release of the model.
A few things made Mistral 7B more than just another open-weights checkpoint:
For a company that had existed for under five months at the time of the release, all of this was unusually self-confident. It worked.
Mistral 7B is a decoder-only Transformer in the same broad family as LLaMA and LLaMA 2. It keeps the now-standard combination of pre-normalization with RMSNorm, SwiGLU feed-forward layers, and rotary position embeddings (RoPE) on the queries and keys. The notable choices are at the attention level, where Mistral pairs Grouped-Query Attention (GQA) with sliding-window attention.
The full configuration from Table 1 of the paper:
| Parameter | Value |
|---|---|
| Total parameters | 7.24 billion |
| Layers | 32 |
| Model dimension (d_model) | 4096 |
| Feed-forward hidden dimension | 14336 |
| Attention heads | 32 |
| Key-value heads | 8 |
| Head dimension | 128 |
| Vocabulary size | 32000 (byte-fallback BPE, Llama-style) |
| Sliding window size | 4096 tokens |
| Context length | 8192 tokens |
| Positional encoding | RoPE |
| Normalization | RMSNorm (pre-norm) |
| Activation | SwiGLU |
| Tokenizer | SentencePiece, Llama-style BPE |
The 32-to-8 ratio of query heads to key-value heads is the GQA factor, and it cuts the size of the KV cache by 4x compared to standard multi-head attention with no measurable drop in quality once trained from scratch with that configuration. This is the single most important change for inference cost on long contexts.
Grouped-Query Attention was introduced in the GQA paper by Joshua Ainslie and colleagues at Google in May 2023 (arXiv:2305.13245). The idea is a middle ground between vanilla multi-head attention, where every query head has its own key and value projections, and multi-query attention (MQA), where all query heads share a single set of key and value projections. GQA partitions the query heads into a smaller number of groups and gives each group its own key and value projection. Mistral 7B uses 32 query heads grouped into 8 KV heads, so each group of 4 query heads shares one set of K and V matrices.
The practical effect is that the KV cache, which dominates memory during autoregressive generation at long sequence lengths, shrinks by the group factor. That makes batched serving cheaper and lets the model fit longer contexts in the same memory budget. The GQA paper showed that uptraining a multi-head model into a GQA configuration recovers nearly all of the original quality, and Mistral 7B confirmed that training from scratch with GQA works just as well.
Llama 2 had already adopted GQA at the 34B and 70B sizes but kept full multi-head attention for the 7B and 13B variants. Mistral 7B was one of the first widely released sub-10B open models to ship with GQA, and the pattern was picked up almost immediately by the rest of the field. Within a year, GQA was the default for new dense decoder LLMs in roughly the 1B to 100B range.
The second architectural choice is sliding-window attention, originally introduced for the Longformer model by Iz Beltagy, Matthew Peters, and Arman Cohan in April 2020 (arXiv:2004.05150). In a sliding window of size W, each token only attends to the previous W tokens rather than to the entire history. The cost of attention drops from O(n^2) to O(nW), and the receptive field grows linearly with depth: with 32 layers and a window of 4096, the effective receptive field reaches 32 times 4096 = 131,072 tokens, far beyond the nominal 8192 context length.
Mistral pairs sliding-window attention with a rolling KV-cache buffer. At position i, only the keys and values for positions i minus W to i minus 1 are kept in memory; older entries are overwritten in place inside a fixed-size circular buffer. The result is that memory per layer stays constant once you pass the window size, regardless of how long the actual prompt is.
For very long prompts, the paper also describes pre-fill chunking: split the prompt into chunks of size W, process them sequentially, and let the rolling cache accumulate the relevant state. In the original release Mistral promoted an "effective" context of 32K thanks to SWA plus rolling buffer, although in practice quality at very long contexts depended heavily on the use case. The v0.2 instruct model later increased the nominal context window to 32,768 tokens and dropped sliding-window attention from the default configuration, signalling that full attention with a longer base context had become the more common pattern.
Mistral has shipped several iterations under the Mistral 7B name. The headline differences are tokenizer changes, instruction-following data, function-calling support, and the move from 8K to 32K context.
| Variant | Release | Notes |
|---|---|---|
| Mistral-7B-v0.1 | Sept 27, 2023 | Original base model. 8K context, 32k vocab, GQA + SWA. |
| Mistral-7B-Instruct-v0.1 | Sept 27, 2023 | First instruct version, supervised fine-tune on public instruction data. |
| Mistral-7B-Instruct-v0.2 | Dec 2023 | Improved instruction following. 32K context; SWA disabled. |
| Mistral-7B-v0.2 (base) | March 2024 | Base release matching v0.2 instruct architecture. |
| Mistral-7B-v0.3 | May 2024 | Vocabulary extended to 32,768; v3 tokenizer. |
| Mistral-7B-Instruct-v0.3 | May 22, 2024 | v3 tokenizer, function calling via TOOL_CALLS / AVAILABLE_TOOLS / TOOL_RESULTS tokens. |
The instruct lineage is what most people actually run. v0.2 became the workhorse for fine-tuning experiments throughout 2024, and v0.3 added the structured tool-use format that downstream agent frameworks were starting to expect.
The Mistral 7B paper benchmarks the base model against LLaMA 1 (7B, 13B, 33B), LLaMA 2 (7B, 13B), and Code Llama 7B across a standard suite of evaluations. Numbers from Table 2 of the paper:
| Benchmark | Mistral 7B | LLaMA 2 13B | LLaMA 2 7B | Code-Llama 7B |
|---|---|---|---|---|
| MMLU (5-shot) | 60.1% | 55.6% | 44.4% | 36.9% |
| HellaSwag (0-shot) | 81.3% | 80.7% | 77.1% | 62.9% |
| WinoGrande (0-shot) | 75.3% | 72.9% | 69.5% | 62.3% |
| PIQA (0-shot) | 83.0% | 80.8% | 77.9% | 72.8% |
| ARC-Easy | 80.0% | 75.2% | 68.7% | 59.4% |
| ARC-Challenge | 55.5% | 48.8% | 43.2% | 34.5% |
| NaturalQuestions | 28.8% | 29.0% | 24.7% | 11.0% |
| TriviaQA | 69.9% | 69.6% | 63.8% | 34.9% |
| HumanEval (pass@1) | 30.5% | 18.9% | 11.6% | 31.1% |
| MBPP | 47.5% | 35.4% | 26.1% | 52.5% |
| MATH | 13.1% | 6.0% | 3.9% | 5.2% |
| GSM8K (8-shot, maj@8) | 52.2% | 34.3% | 16.0% | 20.8% |
Mistral 7B beat LLaMA 2 13B on every benchmark in the table except NaturalQuestions, where the two were within a percentage point. On MMLU the gap was about 4.5 points, on GSM8K it was 18 points, and on HumanEval it was 11.6 points. The math and reasoning gaps were big enough that Mistral 7B was also competitive with or better than the much larger LLaMA 1 33B on those tasks.
For the instruct version, the paper reports an MT-Bench score of 6.84 for Mistral-7B-Instruct-v0.1, ahead of LLaMA 2 13B Chat at 6.65 and ahead of all other 7B chat models at the time. A side-by-side human preference test reported in the paper showed Mistral preferred 5,020 times against LLaMA 2 13B Chat preferred 4,143 times across the assessed sample.
The headline framing in the release blog was that Mistral 7B "performs equivalently to a Llama 2 that would be more than 3x its size" on reasoning and reading comprehension. That framing was somewhat marketing-flavored, but the underlying numbers held up to scrutiny on the Open LLM Leaderboard, where Mistral 7B sat near the top of its weight class for most of late 2023.
Part of the reason Mistral 7B took off so quickly is that the inference story was extremely friendly. Day-one support landed in vLLM, text-generation-inference, and llama.cpp. Within a week there were quantized GGUF, GPTQ, AWQ, and EXL2 builds on Hugging Face from community contributors, several of which fit comfortably on a single 8 GB consumer GPU.
Concrete numbers worth noting:
Tooling support spread quickly. Ollama added a pre-packaged Mistral 7B build very early, and the model became one of the most-downloaded entries in the Ollama library through 2024.
The release used two distribution channels at once. The official Hugging Face repository at mistralai/Mistral-7B-v0.1 (and the corresponding instruct variants) hosted the SafeTensors weights. Separately, the Mistral team posted a BitTorrent magnet link on social media a day before the blog post went live. The torrent contained the same weights plus a sample inference script.
The license is the Apache License 2.0, with no acceptable-use addendum, no platform-size restrictions, no separate research-only clause, and no requirement to identify model outputs. That is among the most permissive licenses in use for foundation models. By contrast, LLaMA 2's "Community License" at the time included a 700-million-MAU restriction, an acceptable-use policy, and a requirement to attribute outputs as Llama-derived.
The combination of a recognized permissive license, a clean state-of-the-art claim at the 7B size, and a low barrier to actually running the thing was the trifecta that drove adoption.
Mistral 7B was the first in what has become a wide line of releases. The most relevant follow-ups for understanding its place in the family:
| Model | Released | Notes |
|---|---|---|
| Mistral 7B | Sept 27, 2023 | Dense 7.3B, Apache 2.0. |
| Mixtral 8x7B | Dec 11, 2023 | Sparse mixture-of-experts: 8 experts of ~7B each, 2 routed per token. About 46.7B total parameters and ~13B active. Apache 2.0. |
| Mistral Medium | Dec 2023 | First proprietary commercial model. |
| Mistral Large | Feb 26, 2024 | Larger commercial model. |
| Mixtral 8x22B | April 2024 | Bigger MoE successor to Mixtral 8x7B. |
| Codestral 22B | May 29, 2024 | Code-focused dense model. |
| Mistral 7B v0.3 | May 2024 | Updated tokenizer, function calling. |
| Codestral Mamba 7B | July 16, 2024 | First Mistral model using the Mamba state-space architecture. |
| Mathstral 7B | July 16, 2024 | Math-focused fine-tune. |
| Mistral NeMo 12B | July 2024 | 12B model built with NVIDIA, with an extended Tekken tokenizer. |
| Pixtral 12B | September 2024 | First multimodal Mistral release. |
| Ministral 3B and 8B | October 2024 | Smaller models for edge use. |
| Mistral Small 3 | January 2025 | 24B dense model. |
| Mistral Large 2 / Large 3 | 2024 / Dec 2025 | Successive flagship dense models. |
| Magistral Small / Medium | June 2025 | Reasoning-focused models. |
By the time Mistral Large 3 shipped in late 2025, the original 7B was no longer the company's headline product, but it had not been retired. The base v0.3 weights remained one of the most heavily downloaded checkpoints on Hugging Face and stayed in active use for fine-tuning, distillation, and edge deployment.
The architectural pattern of GQA plus sliding-window attention plus RoPE plus RMSNorm plus SwiGLU, with a roughly 4x query-to-KV-head ratio, became the default recipe for new dense open-weights LLMs in the 2024 to 2026 period. Models from Alibaba's Qwen line, Google's Gemma line, Meta's LLaMA 3 line, and several others adopted the same general blueprint, with variations on window size or whether to keep SWA at all.
The release pattern (weights first, paper later, no application form) has also stuck. Within a year, the default community expectation for a serious open release was Apache 2.0 or a similarly permissive license, weights on Hugging Face, day-one support in popular inference engines, and at most a brief blog post. Anything more restrictive started to look defensive.
For Mistral AI itself, the success of the 7B release set up a sequence of larger funding rounds: a 385 million euro Series A in December 2023 at a 2 billion euro valuation, a 600 million euro round in June 2024 at roughly 5.8 billion euros, and a 2 billion euro round in September 2025 at around 12 billion euros (with ASML taking a notable strategic stake). The company became one of the most-cited examples of European AI capacity in policy discussions about sovereignty and competitiveness.
The wider ecosystem of fine-tunes built on Mistral 7B is hard to count precisely. As of 2026, the Hugging Face hub lists thousands of derivative models, including instruction-tuned variants in dozens of languages, role-play and uncensored models, code-focused fine-tunes, retrieval-augmented setups, and small-scale reasoning models. Many of the early "Mistral" community fine-tunes (OpenHermes-Mistral, Dolphin-Mistral, Zephyr-7B-Beta, Notus, and others) were the first widely used non-Meta open-weights chat models people felt they could deploy commercially without legal review.
Mistral 7B is, by 2026 standards, a small model. There are clear limits.
What it remains useful for: a strong, well-documented, permissively licensed baseline for fine-tuning research and a standard reference architecture for understanding the GQA-plus-SWA design pattern.
In 2025 and into 2026 Mistral 7B continues to show up as the default starting point for academic fine-tuning papers, for university courses on LLM internals, and for production deployments where a small, locally hosted, permissively licensed model is the right fit. Mistral AI has not deprecated it. The v0.3 weights are still served from the official Hugging Face organization, and Ollama, llama.cpp, vLLM, and TGI all maintain support.
Mistral AI itself has shifted its public emphasis toward larger commercial models (Mistral Large 3, Magistral Medium) and toward the Mixtral MoE line, but the original 7B sits in the company's history the same way Llama 1 sits in Meta's: the first one out the door, the proof of concept, the model that made everything afterward easier to ship.