Mistral 7B

Updated Jun 21, 2026

Mistral 7B is a 7.3-billion-parameter, decoder-only large language model released by Mistral AI on September 27, 2023, under the Apache 2.0 license. It was the company's first publicly released model and one of the first 7B-class systems to outperform Meta's Llama 2 13B across most standard benchmarks at the time of release, while also matching or beating the much larger Llama 1 34B on reasoning, math and code tasks.^[1]^[2] The launch made a simple but consequential point: with the right architectural choices and a careful training mix, a 7B model could match a 13B competitor on most evaluations while costing far less to serve. Mistral 7B established Mistral AI as a serious player in foundation-model research about five months after the company was founded, and it set the template that most subsequent dense open-weights LLMs would follow.^[1]^[3]

The paper states the claim in one sentence: "Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation."^[1] The model achieves this with two architectural levers, again from the abstract: it "leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost."^[1]

The model shipped under the Apache 2.0 license, with weights distributed both through Hugging Face and through a direct BitTorrent magnet link that Mistral posted on X (formerly Twitter) the day before the official blog announcement.^[2]^[3]^[4] That magnet link became something of a meme in open-source AI circles, partly because Llama 2's license at the time included acceptable-use restrictions and a 700-million-monthly-active-user clause that some saw as not quite "open." Mistral 7B contained no such restrictions. The launch blog put it plainly: "We're releasing Mistral 7B under the Apache 2.0 license, it can be used without restrictions."^[2]^[5]

Quick answer: what is Mistral 7B and why does it matter?

Mistral 7B is a 7.3-billion-parameter open-weights base language model, released September 27, 2023 as Mistral AI's first model and one of the first open releases from a major non-US foundation-model lab to ship frontier-tier weights at the 7B scale. It is decoder-only and built on the LLaMA-style recipe (RoPE, RMSNorm, SwiGLU), with two distinguishing efficiency features: grouped-query attention (32 query heads, 8 key-value heads) to shrink the KV cache by about 4x, and sliding window attention (a 4,096-token window in each of the 32 layers) to process long sequences cheaply. At launch it outperformed Llama 2 13B on every reported benchmark except NaturalQuestions while serving at roughly half the cost, and Mistral marketed it as performing "equivalently to a Llama 2 that would be more than 3x its size."^[1]^[2] Its permissive Apache 2.0 license made it the default starting point for a wave of commercial open-source fine-tunes such as Zephyr 7B, OpenHermes 2.5, and Starling-LM.^[27]^[29]^[31]

Infobox

Field	Value
Developer	Mistral AI
Initial release	September 27, 2023^[2]
Latest version	Mistral 7B v0.3 / Instruct v0.3 (May 22, 2024)^[6]
Parameter count	~7.24 billion (rounded to "7B" in the name; ~7.3 billion as quoted in the announcement)^[1]^[2]
Architecture	Decoder-only Transformer with GQA + SWA, RoPE, RMSNorm, SwiGLU^[1]
Context length	8,192 tokens (v0.1); 32,768 tokens (v0.2, v0.3)^[7]^[8]
Vocabulary	32,000 (v0.1) / 32,768 (v0.3)^[1]^[6]
Tokenizer	SentencePiece byte-fallback BPE (v3 in v0.3)^[1]^[6]
License	Apache License 2.0^[2]^[3]
Paper	arXiv:2310.06825 (October 10, 2023)^[1]

When was Mistral 7B released, and by whom?

Mistral AI was founded in April 2023 in Paris by Arthur Mensch, Guillaume Lample, and Timothée Lacroix.^[9] The three co-founders had originally met as students at the École Polytechnique outside Paris. Mensch had been a research scientist at Google DeepMind, where he was one of the lead authors on the Chinchilla scaling-laws paper. Lample and Lacroix had been research scientists at Meta AI, where they were among the lead authors of the original LLaMA paper. Mensch took the CEO role, Lample became Chief Scientist, and Lacroix became Chief Technology Officer.^[9]

The new company raised a roughly €105 million ($113 million) seed round in June 2023, led by Lightspeed Venture Partners, with participation from Xavier Niel, JCDecaux Holding, Eric Schmidt, Bpifrance, Rodolphe Saadé, and others.^[10]^[11] Reports at the time framed it as the largest seed round in European history, valuing the four-week-old company at roughly €240 million (around $260 million).^[10]^[11] The fundraise was widely cited as evidence that European investors were now willing to put nine-figure cheques behind frontier-AI research; the founders pitched the company as building open, sovereign foundation models as an alternative to closed US labs.

The first model was promised within months of the company's founding. Internally Mistral AI was building toward something larger (the mixture-of-experts model that would eventually ship as Mixtral 8x7B), but the team wanted an open release out the door first.^[12] That release was Mistral 7B, shipped on September 27, 2023, about five months after the company was founded and roughly three months after the seed round.^[2]

Why did the Mistral 7B release matter?

A few things made Mistral 7B more than just another open-weights checkpoint:

It demonstrated that a 7B-class model could be competitive with Llama 2 13B at roughly half the inference cost. That mattered for both consumer hardware and large-scale serving.^[1]^[2]
The Apache 2.0 license allowed full commercial use, modification, and redistribution, with no acceptable-use clause and no large-platform carve-outs. For many companies and researchers this was the first major release of an English-strong base model under such a permissive license.^[2]^[3]
Mistral AI was one of the first major non-US foundation-model labs to release frontier-tier open weights at scale. The release became part of a larger argument about European AI sovereignty.^[9]^[11]
The model shipped with day-one support in vLLM, Text Generation Inference (TGI), and llama.cpp, which meant that within hours people were running it on consumer GPUs, laptops, and cloud instances.^[2]^[3]
The torrent-link release style, which Mistral repeated for Mixtral 8x7B in December 2023, set the tone for a particular kind of "drop the weights, write the paper later" engineering culture.^[4]^[12]

For a company that had existed for under five months at the time of the release, all of this was unusually self-confident. It worked.

The paper and its authors

The Mistral 7B technical report was posted to arXiv on October 10, 2023 as arXiv:2310.06825.^[1] The eighteen listed authors are Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed.^[1] Many of them carried over experience from DeepMind, Meta AI, and Hugging Face, where Le Scao had led the BigScience BLOOM project. The paper's abstract opens with the model's design goal in the authors' own words: "We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency."^[1]

The blog post announcing the release went up on mistral.ai on September 27, 2023, with the headline "Mistral 7B, the best 7B model to date."^[2] It described the model as the result of three months of intense work.^[2]

Architecture

Mistral 7B is a decoder-only transformer in the same broad family as LLaMA and Llama 2. It keeps the now-standard combination of pre-normalisation with RMSNorm, SwiGLU feed-forward layers, and rotary position embeddings (RoPE) on the queries and keys.^[1]^[13] The notable choices are at the attention level, where Mistral pairs grouped-query attention (GQA) with sliding window attention (SWA).^[1]

The full configuration from Table 1 of the paper:^[1]

Parameter	Value
Total parameters	~7.24 billion
Layers (`n_layers`)	32
Model dimension (`dim`)	4096
Feed-forward hidden dimension (`hidden_dim`)	14,336
Attention heads (`n_heads`)	32
Key-value heads (`n_kv_heads`)	8
Head dimension (`head_dim`)	128
Vocabulary size (`vocab_size`)	32,000 (byte-fallback BPE, Llama-style)
Sliding window (`window_size`)	4096 tokens
Context length (`context_len`)	8192 tokens
Positional encoding	RoPE
Normalisation	RMSNorm (pre-norm)
Activation	SwiGLU
Tokenizer	SentencePiece, Llama-style byte-fallback BPE

The 32-to-8 ratio of query heads to key-value heads is the GQA factor, and it cuts the size of the KV cache by 4x compared to standard multi-head attention with no measurable drop in quality once the model is trained from scratch with that configuration.^[1]^[14] This is the single most important change for inference cost on long contexts.

Why these particular ingredients

Mistral 7B did not introduce any individually new components. RMSNorm comes from Zhang and Sennrich's 2019 root-mean-square normalisation paper, SwiGLU from Shazeer's GLU variants work, and RoPE from Su et al. 2021.^[13]^[15] The combination is the one LLaMA popularised in early 2023, and Mistral 7B inherits it almost wholesale, swapping only the attention layer. The pre-norm placement (normalisation applied before each sub-block instead of after) is the standard transformer recipe that has dominated training-stability practice since the GPT-3 era.^[13]

The MLP follows the standard SwiGLU geometry: gate and up projections to hidden_dim = 14,336, a SiLU-gated element-wise product, and a projection back down to dim = 4096. The 14,336 figure is exactly 3.5 times the model dimension, noticeably wider than the roughly 8/3 multiplier (11,008 for a 4,096-wide model) that LLaMA used for its SwiGLU MLPs.^[15]
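The ~7.24 billion total in the configuration table can be reproduced from the listed dimensions. The sketch below is a back-of-the-envelope count, not code from the paper; it assumes bias-free linear layers, an output head that is not tied to the input embedding, and one RMSNorm weight vector before each sub-block plus a final one. The `Config` class and `count_parameters` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Values taken from the configuration table in this article.
    dim: int = 4096
    n_layers: int = 32
    head_dim: int = 128
    hidden_dim: int = 14336
    n_heads: int = 32
    n_kv_heads: int = 8
    vocab_size: int = 32000


def count_parameters(cfg: Config) -> int:
    """Count weights, assuming no biases and an untied output head."""
    # Attention: full-width Q and O projections, narrower K and V (GQA).
    q_proj = cfg.dim * cfg.n_heads * cfg.head_dim
    k_proj = cfg.dim * cfg.n_kv_heads * cfg.head_dim
    v_proj = cfg.dim * cfg.n_kv_heads * cfg.head_dim
    o_proj = cfg.n_heads * cfg.head_dim * cfg.dim
    attention = q_proj + k_proj + v_proj + o_proj

    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * cfg.dim * cfg.hidden_dim

    # Two RMSNorm weight vectors per layer (pre-attention, pre-MLP).
    norms = 2 * cfg.dim

    per_layer = attention + mlp + norms
    embeddings = cfg.vocab_size * cfg.dim
    lm_head = cfg.vocab_size * cfg.dim
    final_norm = cfg.dim
    return cfg.n_layers * per_layer + embeddings + lm_head + final_norm


total = count_parameters(Config())
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the MLP accounts for roughly four fifths of each layer's weights, which is why the choice of hidden_dim matters so much for the total.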

What is grouped-query attention in Mistral 7B?

Grouped-Query Attention was introduced in the GQA paper by Joshua Ainslie and colleagues at Google in May 2023 (arXiv:2305.13245).^[14] The idea is a middle ground between vanilla multi-head attention, where every query head has its own key and value projections, and multi-query attention (MQA), where all query heads share a single set of key and value projections. GQA partitions the query heads into a smaller number of groups and gives each group its own key and value projection. Mistral 7B uses 32 query heads grouped into 8 KV heads, so each group of 4 query heads shares one set of K and V matrices.^[1]^[14]

The practical effect is that the KV cache, which dominates memory during autoregressive generation at long sequence lengths, shrinks by the group factor. That makes batched serving cheaper and lets the model fit longer contexts in the same memory budget. The GQA paper showed that uptraining a multi-head model into a GQA configuration recovers nearly all of the original quality, and Mistral 7B confirmed that training from scratch with GQA works just as well.^[14]
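The cache saving is easy to quantify. The following sketch estimates KV-cache size for a full-attention cache at 16-bit precision, comparing a hypothetical 32-KV-head multi-head layout with Mistral 7B's 8-KV-head layout. The function name and the choice of an 8,192-token sequence are assumptions for illustration.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # assumption: fp16/bf16 cache
) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (
        keys_and_values
        * n_layers
        * n_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_value
    )


GIB = 2**30
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

print(f"multi-head (32 KV heads):   {mha / GIB:.1f} GiB")
print(f"grouped-query (8 KV heads): {gqa / GIB:.1f} GiB")
print(f"reduction: {mha // gqa}x")
```

The reduction equals the ratio of query heads to KV heads, independent of sequence length or batch size.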

Llama 2 had already adopted GQA at the 34B and 70B sizes but kept full multi-head attention for the 7B and 13B variants.^[16] Mistral 7B was one of the first widely released sub-10B open models to ship with GQA, and the pattern was picked up almost immediately by the rest of the field. Within a year, GQA was the default for new dense decoder LLMs in roughly the 1B to 100B range, including the Llama 3, Gemma, and Qwen families.^[17]^[18]

How does sliding window attention work?

The second architectural choice is sliding window attention, originally introduced for the Longformer model by Iz Beltagy, Matthew Peters, and Arman Cohan in April 2020 (arXiv:2004.05150).^[19] In a sliding window of size W, each token only attends to the previous W tokens rather than to the entire history. The cost of attention drops from O(n²) to O(n·W), and the receptive field grows linearly with depth: with 32 layers and a window of 4096, the effective receptive field reaches 32 × 4096 = 131,072 tokens, far beyond the nominal 8192 context length.^[1]^[2] The launch blog summarised the mechanism this way: "each layer attends to the previous 4,096 hidden states."^[2]
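A toy mask makes the cost argument concrete. The sketch below builds a causal sliding-window mask in plain Python and counts the query-key pairs that survive. It assumes the window of W tokens includes the current token, which is one common convention; the function names and the tiny sizes are illustrative only.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumption: the window covers the current token plus the window - 1
    tokens before it, and attention is causal (no looking ahead).
    """
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


SEQ_LEN, WINDOW = 8, 3
mask = sliding_window_mask(SEQ_LEN, WINDOW)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

full_causal = attended_pairs(sliding_window_mask(SEQ_LEN, SEQ_LEN))
windowed = attended_pairs(mask)
print(f"full causal pairs: {full_causal}, windowed pairs: {windowed}")
```

Once the sequence is longer than the window, every additional token adds only W pairs, which is the O(n·W) behaviour described above.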

Mistral pairs sliding-window attention with a rolling KV-cache buffer. At position i, only the keys and values for positions i − W to i − 1 are kept in memory; older entries are overwritten in place inside a fixed-size circular buffer of size W. The cache index at timestep i is simply i mod W. The result is that memory per layer stays constant once the prompt passes the window size, regardless of how long the actual prompt is.^[1] On a 32k-token sequence, the paper reports the rolling buffer reduces cache memory usage by 8x relative to full attention without hurting quality, and the launch blog highlighted a 2x speed improvement over standard attention for a 16k-token sequence at a 4k window, on top of the GQA savings.^[1]^[2]
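The rolling buffer can be sketched in a few lines. This is a simplified illustration of the `i mod W` indexing described above, not Mistral's implementation: it stores placeholder strings instead of tensors, and the `RollingKVCache` class and its `positions` bookkeeping list are hypothetical.

```python
class RollingKVCache:
    """Fixed-size circular buffer holding the last `window` key/value entries."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.slots: list[object] = [None] * window
        # Hypothetical bookkeeping so the example can show what is cached.
        self.positions: list[int | None] = [None] * window

    def store(self, position: int, kv: object) -> None:
        slot = position % self.window  # overwrite the oldest entry in place
        self.slots[slot] = kv
        self.positions[slot] = position

    def cached_positions(self) -> list[int]:
        return sorted(p for p in self.positions if p is not None)


cache = RollingKVCache(window=4)
for pos in range(10):
    cache.store(pos, kv=f"kv{pos}")

print(cache.cached_positions())  # [6, 7, 8, 9]
print(len(cache.slots))          # 4

# Memory ratio for a 32k-token sequence with a 4k window.
print(32_768 // 4_096)           # 8
```

The buffer never grows past W entries, which is where the 8x figure for a 32k-token sequence comes from.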

For very long prompts, the paper also describes pre-fill chunking: split the prompt into chunks of size W, process them sequentially using a causal mask within each chunk and a sliding-window mask against cached prior chunks, and let the rolling cache accumulate the relevant state.^[1] In the original release Mistral promoted an "effective" context of 32K thanks to SWA plus rolling buffer, although in practice quality at very long contexts depended heavily on the use case.^[2] The v0.2 instruct model later increased the nominal context window to 32,768 tokens and dropped sliding-window attention from the default configuration, signalling that full attention with a longer base context had become the more common pattern across the field.^[7]^[20]
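The chunking schedule itself is simple to write down. The sketch below only computes which positions each chunk covers and which cached positions it can still see; it performs no attention. The `prefill_chunks` function is a hypothetical helper, and it assumes the cache retains at most W positions before the current chunk.

```python
from typing import Iterator


def prefill_chunks(prompt_len: int, window: int) -> Iterator[tuple[int, int, int]]:
    """Yield (chunk_start, chunk_end, cache_start) for each pre-fill chunk.

    Positions [chunk_start, chunk_end) are processed together with a causal
    mask. Positions [cache_start, chunk_start) are the earlier tokens still
    held in the rolling cache, attended to under the sliding-window mask.
    """
    for chunk_start in range(0, prompt_len, window):
        chunk_end = min(chunk_start + window, prompt_len)
        cache_start = max(0, chunk_start - window)
        yield chunk_start, chunk_end, cache_start


for chunk_start, chunk_end, cache_start in prefill_chunks(prompt_len=10, window=4):
    print(
        f"chunk [{chunk_start}, {chunk_end}) "
        f"sees cache [{cache_start}, {chunk_start})"
    )
```

Peak memory during pre-fill is then bounded by one chunk plus one window of cache, rather than by the full prompt length.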

The sliding window receptive-field math

The receptive-field calculation deserves a closer look because it is one of the easier-to-misread numbers in the paper. At layer 1, a token sees the previous W = 4096 tokens through one attention operation. At layer 2, each of those 4096 tokens has already aggregated information from a 4096-token window of its own, so the layer-2 query effectively reaches back roughly 2W tokens. At layer k the reachable span is approximately k·W. With k = 32 layers and W = 4096, the theoretical span is 32 × 4096 = 131,072 tokens.^[1]

The practical span is smaller, since information attenuates as it has to be re-aggregated layer by layer, but the construction explains how a model with a 4k attention window and an 8k nominal context can carry useful long-range signal much further than the window size alone would suggest. The Mistral team's reported FlashAttention modifications also yielded a 2x speed boost for 16k-token sequences over the vanilla attention baseline.^[1]

Tokenizer

The original v0.1 tokenizer was a LLaMA-style byte-fallback BPE trained with SentencePiece, with a vocabulary size of 32,000 tokens.^[1]^[3] Byte-fallback BPE means that any character that is not covered by the learned merge vocabulary is encoded as a sequence of raw UTF-8 bytes; the tokenizer therefore never fails on unfamiliar characters or scripts. Mistral 7B v0.3 extended the vocabulary to 32,768 entries to make room for new control tokens and to improve efficiency on certain scripts.^[6] The v0.3 update introduced the "v3 tokenizer" packaged via the mistral_common library; later Mistral releases (NeMo, Pixtral, Mistral Large 2) used yet newer tokenizers, including the Tekken tokenizer derived from OpenAI's tiktoken.^[21]

Training

Mistral AI has not published a complete account of the training data or compute budget for Mistral 7B. The paper notes that the model was pretrained on data extracted from "the open Web" and emphasises that the model is a base model with no built-in moderation, leaving safety alignment to downstream fine-tuners.^[1] Total parameter count is approximately 7.24 billion when summed across embeddings, attention projections, and MLP layers, hence the "7B" name; the announcement blog rounds this to "7.3 billion."^[1]^[2]

Hardware and exact token counts have not been publicly disclosed. What is documented is the architectural recipe (the configuration in Table 1) and the published evaluation numbers. The instruct variant was trained via supervised fine-tuning on publicly available instruction-following datasets, without RLHF or DPO in the v0.1 release.^[1] The v0.1 paper additionally describes a content-moderation experiment in which the model was prompted to self-classify its own outputs into categories such as illegal activities, hateful content, and unqualified advice, with the authors reporting 99.4% precision and 95.6% recall on a curated adversarial test set.^[1]

How does Mistral 7B compare to Llama 2 on benchmarks?

The Mistral 7B paper benchmarks the base model against Llama 2 (7B, 13B), Code Llama 7B, and Llama 1 34B across a standard suite of evaluations. The headline numbers from Table 2 of the paper, which sets Mistral 7B beside its closest competitor, Llama 2 13B, as well as Llama 2 7B and Code-Llama 7B:^[1]

Benchmark	Mistral 7B	Llama 2 13B	Llama 2 7B	Code-Llama 7B
MMLU (5-shot)	60.1%	55.6%	44.4%	36.9%
HellaSwag (0-shot)	81.3%	80.7%	77.1%	62.9%
WinoGrande (0-shot)	75.3%	72.9%	69.5%	62.3%
PIQA (0-shot)	83.0%	80.8%	77.9%	72.8%
Arc-Easy	80.0%	75.2%	68.7%	59.4%
Arc-Challenge	55.5%	48.8%	43.2%	34.5%
NaturalQuestions	28.8%	29.0%	24.7%	11.0%
TriviaQA	69.9%	69.6%	63.8%	34.9%
HumanEval (pass@1)	30.5%	18.9%	11.6%	31.1%
MBPP	47.5%	35.4%	26.1%	52.5%
MATH	13.1%	6.0%	3.9%	5.2%
GSM8K (8-shot, maj@8)	52.2%	34.3%	16.0%	20.8%

Mistral 7B beat Llama 2 13B on every benchmark in the table except NaturalQuestions, where Llama 2 13B led by 0.2 points. On MMLU the gap was about 4.5 points, on GSM8K about 18 points, and on HumanEval about 11.6 points.^[1] The code-specialised Code-Llama 7B still edged ahead on HumanEval and MBPP. The math and reasoning gaps were big enough that Mistral 7B was also competitive with or better than the much larger Llama 1 34B on those tasks, a comparison the launch blog turned into one of its headline framings.^[1]^[2]

The paper additionally reports an MT-Bench score of 6.84 ± 0.07 for Mistral-7B-Instruct-v0.1, ahead of Llama 2 13B Chat at 6.65 and ahead of all other 7B chat models at the time of publication.^[1] A side-by-side human preference test reported in the paper, run through llmboxing.com/leaderboard, had Mistral's output preferred 5,020 times versus 4,143 times for Llama 2 13B Chat.^[1] On MMLU specifically, Mistral 7B Instruct v0.1 scored 56.3%, several points below the base model's 60.1%. That drop was typical of early instruction-tuned 7B models, whose tuning data was not designed to preserve knowledge benchmarks.^[22]

The headline framing in the release blog was that Mistral 7B "performs equivalently to a Llama 2 that would be more than 3x its size" on reasoning and reading comprehension.^[2] That framing was marketing-flavoured, but the underlying numbers held up to independent scrutiny on Hugging Face's Open LLM Leaderboard, where Mistral 7B sat near the top of its weight class for most of late 2023.^[23]

Instruct variants and version history

Mistral has shipped several iterations under the Mistral 7B name. The headline differences are tokenizer changes, instruction-following data, function-calling support, and the move from 8K to 32K context.

Variant	Release	Notes
Mistral-7B-v0.1 (base)	Sept 27, 2023	Original base model. 8K context, 32k vocab, GQA + SWA.^[1]^[2]
Mistral-7B-Instruct-v0.1	Sept 27, 2023	First instruct version, supervised fine-tune on public instruction data; MT-Bench 6.84.^[1]^[22]
Mistral-7B-Instruct-v0.2	Dec 11, 2023	Improved instruction following. 32K context (RoPE θ = 1e6), SWA disabled.^[7]^[20]
Mistral-7B-v0.2 (base)	March 23, 2024	Base release matching v0.2 instruct architecture, posted during a hackathon at SHACK15 in San Francisco co-hosted with Cerebral Valley.^[24]^[25]
Mistral-7B-v0.3 (base)	May 22, 2024	Vocabulary extended to 32,768 entries; v3 tokenizer.^[6]
Mistral-7B-Instruct-v0.3	May 22, 2024	v3 tokenizer, function calling via `[TOOL_CALLS]`, `[AVAILABLE_TOOLS]`, `[TOOL_RESULTS]` control tokens.^[6]^[26]

v0.1 to v0.2

v0.1 used a strict 8K context window and 4K sliding window. v0.2, released as an instruct fine-tune on December 11, 2023, raised the nominal context to 32,768 tokens, removed sliding-window attention from the default configuration, and increased the RoPE base frequency to θ = 1 × 10⁶ to better support long-context extrapolation.^[7]^[20] In config.json terms, sliding_window is null, max_position_embeddings is 32,768, and rope_theta was raised from 10,000.0 to 1,000,000.0.^[7]^[20]
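The effect of the larger rope_theta can be seen in the rotary wavelengths. In the standard RoPE formulation, frequency pair i rotates by base^(-2i/d) radians per position, so a full rotation takes 2π · base^(2i/d) positions. The sketch below computes those wavelengths for both base values using the 128-dimensional heads from the configuration table; the `rope_wavelengths` function is an illustrative helper, not part of any Mistral library.

```python
import math


def rope_wavelengths(base: float, head_dim: int) -> list[float]:
    """Wavelength, in token positions, of each rotary frequency pair.

    Pair i rotates by base ** (-2 * i / head_dim) radians per position,
    so one full turn takes 2 * pi * base ** (2 * i / head_dim) positions.
    """
    return [
        2 * math.pi * base ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


HEAD_DIM = 128
for label, base in (("v0.1", 10_000.0), ("v0.2", 1_000_000.0)):
    waves = rope_wavelengths(base, HEAD_DIM)
    print(
        f"{label}: rope_theta={base:>11,.0f}  "
        f"slowest pair wavelength ~ {waves[-1]:,.0f} tokens"
    )
```

A larger base stretches every wavelength except the fastest one, so distant positions within a 32K window receive rotation angles that vary more slowly with distance.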

The matching base model was released about three months later in March 2024 at Mistral's hackathon at SHACK15 in San Francisco, co-hosted with the Cerebral Valley community.^[24]^[25] It was the first non-instruct v0.2 weights set to be officially distributed by Mistral. Because the official mistralai organisation on Hugging Face did not initially host the v0.2 base weights, the early redistribution lived at mistral-community/Mistral-7B-v0.2 and alpindale/Mistral-7B-v0.2-hf.^[25] v0.2 became the workhorse for fine-tuning experiments throughout 2024 because it kept the 7.3B parameter count and Apache 2.0 license but added the longer context that downstream applications had started to expect.

v0.3

The v0.3 generation extended the vocabulary from 32,000 to 32,768 entries, making room for new control tokens that include [TOOL_CALLS], [AVAILABLE_TOOLS], and [TOOL_RESULTS], the three used by the structured function-calling format.^[6]^[26] Function calls are issued by the model emitting a JSON payload between [TOOL_CALLS] boundaries, and tool results are returned inside [TOOL_RESULTS] boundaries; tool-call IDs are constrained to exactly nine alphanumeric characters.^[6] The v0.3 release accompanied a broader push by Mistral to support agentic workloads alongside the Mixtral 8x7B and Mistral Large product lines.^[26]
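Client code that round-trips tool calls has to respect the nine-character ID constraint. A minimal sketch of generating and validating such IDs follows. The helper names are hypothetical, and restricting the alphabet to ASCII letters and digits is an assumption about what "alphanumeric" means here.

```python
import secrets
import string

# Assumption: "alphanumeric" means ASCII letters and digits only.
ALPHANUMERIC = string.ascii_letters + string.digits
TOOL_CALL_ID_LENGTH = 9


def new_tool_call_id() -> str:
    """Return a random ID of exactly nine alphanumeric characters."""
    return "".join(secrets.choice(ALPHANUMERIC) for _ in range(TOOL_CALL_ID_LENGTH))


def is_valid_tool_call_id(candidate: str) -> bool:
    """Check length and alphabet; str.isalnum() would also accept non-ASCII."""
    return len(candidate) == TOOL_CALL_ID_LENGTH and all(
        char in ALPHANUMERIC for char in candidate
    )


call_id = new_tool_call_id()
print(call_id, is_valid_tool_call_id(call_id))
print(is_valid_tool_call_id("call_1"))      # False: too short, has underscore
print(is_valid_tool_call_id("abc-12345"))   # False: hyphen is not alphanumeric
```

Validating IDs before building a request avoids a class of template errors when IDs come from another provider's format.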

Chat templates

The instruct variants use a chat template centred on the [INST] and [/INST] control tokens. The very first user instruction is preceded by the <s> begin-of-sentence token; subsequent instructions are not. Assistant generation ends with the </s> end-of-sentence token. A typical multi-turn sequence looks like:

<s>[INST] What is your favourite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice...</s>
[INST] Do you have mayonnaise recipes? [/INST]

The template is built into the tokenizer.apply_chat_template method in Hugging Face Transformers, which handles the formatting automatically when supplied with a list of {"role": ..., "content": ...} messages.^[7]^[22]
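For illustration, the rules above can be written as a small string builder. This is a sketch only: the exact whitespace between segments is an assumption, the `format_instruct_prompt` name is hypothetical, and in practice tokenizer.apply_chat_template should be treated as the authoritative formatter.

```python
def format_instruct_prompt(messages: list[dict[str, str]]) -> str:
    """Build a Mistral-instruct style prompt from alternating chat messages.

    Assumptions: roles strictly alternate user/assistant starting with user,
    and segments are separated by single spaces.
    """
    if not messages:
        raise ValueError("at least one message is required")

    parts = ["<s>"]  # begin-of-sentence token appears once, at the very start
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError(f"message {index} should have role {expected_role!r}")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            # Each assistant turn is closed with the end-of-sentence token.
            parts.append(f" {message['content']}</s>")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to lemon juice."},
    {"role": "user", "content": "Do you have mayonnaise recipes?"},
]
print(format_instruct_prompt(conversation))
```

If a hand-built string like this is later passed through a tokenizer that adds its own begin-of-sentence token, the leading `<s>` would be duplicated, which is one reason to prefer the built-in template.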

Reception and downstream fine-tunes

Mistral 7B was downloaded heavily within hours of release and immediately became the base model for a wave of community fine-tunes. A few of the most influential:

Zephyr 7B Beta from Hugging Face H4 (released October 2023, paper "Zephyr: Direct Distillation of LM Alignment," arXiv:2310.16944) used Direct Preference Optimization (DPO) on UltraFeedback over an UltraChat-tuned Mistral 7B base. It reached an MT-Bench score of 7.34 and a 90.6% AlpacaEval win rate, the highest MT-Bench for a 7B open chat model at the time and even ahead of Llama 2 Chat 70B on chat tasks.^[27]^[28]
OpenHermes 2.5 Mistral 7B from Teknium (November 3, 2023) was fine-tuned on roughly 1 million examples of primarily GPT-4-generated instruction data plus code; it became one of the most downloaded community chat models of late 2023 and popularised the ChatML format on Mistral. Reported scores included HumanEval pass@1 of 50.7%, TruthfulQA 53.04%, AGI-Eval 43.07%, and a GPT4All average of 73.12.^[29]
Notus 7B from Argilla (November 2023) was a DPO fine-tune that started from Hugging Face H4's zephyr-7b-sft-full and applied a re-binarised version of UltraFeedback using preference ratings rather than the original critique scores. It reached MT-Bench 7.30 and AlpacaEval 91.42%, slightly above Zephyr Beta on the latter.^[30]
Starling-LM-7B-alpha from a UC Berkeley team (Banghua Zhu and colleagues, November 2023) used Reinforcement Learning from AI Feedback (RLAIF) with an Advantage-induced Policy Alignment (APA) algorithm and a reward model trained on the Nectar dataset, on top of OpenChat 3.5, which was itself a Mistral 7B fine-tune. Starling reached MT-Bench 8.09, behind only GPT-4 and GPT-4 Turbo at the time.^[31]
Dolphin-2.x-Mistral-7B from Eric Hartford and the Cognitive Computations community (December 2023 onwards) focused on producing an uncensored chat model on the Mistral base, widely used for role-play and creative applications. Dolphin-2.8-Mistral-7B-v02 (March 2024) was a full-weights fine-tune of Mistral 7B v0.2 with a 16K sequence length, trained on ten L40S GPUs over three days on Crusoe Cloud.^[32]

The architectural pattern of GQA plus sliding-window attention plus RoPE plus RMSNorm plus SwiGLU, with a roughly 4x query-to-KV-head ratio, became the default recipe for new dense open-weights LLMs in the 2024 to 2026 period. Models from Alibaba's Qwen line, Google's Gemma line, Meta's Llama 3 line, and several others adopted the same general blueprint, with variations on window size or whether to keep SWA at all.^[17]^[18]

The release pattern (weights first, paper later, no application form) has also stuck. Within a year, the default community expectation for a serious open release was Apache 2.0 or a similarly permissive license, weights on Hugging Face, day-one support in popular inference engines, and at most a brief blog post. Anything more restrictive started to look defensive.^[4]^[5]

Strategic and commercial impact

For Mistral AI itself, the success of the 7B release set up a sequence of larger funding rounds:

December 11, 2023: €385 million ($428 million) Series A at a roughly €2 billion valuation, led by Andreessen Horowitz with participation from Salesforce, BNP Paribas, Lightspeed, and others.^[33]^[34]
February 26, 2024: $16 million strategic investment from Microsoft alongside an Azure distribution partnership that made Mistral Large the second LLM hosted natively on Azure AI Studio after OpenAI's.^[35]^[36]
June 11, 2024: €600 million ($640 million) Series B at a €5.8 billion ($6 billion) valuation, led by General Catalyst, with €468 million in equity and €132 million in debt; participating investors included Nvidia, Andreessen Horowitz, Salesforce Ventures, IBM, Samsung Venture, Cisco, ServiceNow, and others.^[37]
September 9, 2025: €1.7 billion ($2 billion) Series C at an €11.7 billion ($13.7-14 billion) valuation in which the Dutch lithography giant ASML invested €1.3 billion and took an 11% stake on a fully diluted basis, becoming Mistral's largest single shareholder and lead investor.^[38]^[39]

The company became one of the most-cited examples of European AI capacity in policy discussions about sovereignty and competitiveness. Existing investors Nvidia, DST Global, Andreessen Horowitz, Bpifrance, General Catalyst, Index Ventures, and Lightspeed also participated in the Series C alongside ASML.^[38]^[39]

The wider ecosystem of fine-tunes built on Mistral 7B is hard to count precisely. As of 2026, the Hugging Face hub lists thousands of derivative models, including instruction-tuned variants in dozens of languages, role-play and uncensored models, code-focused fine-tunes, retrieval-augmented setups, and small-scale reasoning models. Many of the early "Mistral" community fine-tunes were the first widely used non-Meta open-weights chat models that people felt they could deploy commercially without legal review.^[5]^[17]

How do you run Mistral 7B locally?

Part of the reason Mistral 7B took off so quickly is that the inference story was extremely friendly. Day-one support landed in vLLM, Text Generation Inference (TGI), and llama.cpp.^[2]^[3] Within a week there were quantised GGUF, GGML, GPTQ, AWQ, and EXL2 builds on Hugging Face from community contributors, several of which fit comfortably on a single 8 GB consumer GPU.^[40]

Concrete deployment numbers worth noting:

In bf16, the model weights take roughly 14 GB. A single 16 GB consumer GPU like an RTX 4080 can serve it without offload.
4-bit GGUF quantization brings the file size to around 4 GB, which means it runs at acceptable speed on M1 and M2 MacBooks and on CPUs with enough RAM.^[40]
The KV cache savings from GQA mean that, for a given batch and context, Mistral 7B uses about a quarter of the cache memory of an equivalent multi-head 7B model.^[14]
The official launch blog cited roughly a 2x speed improvement over standard attention for a 16k-token sequence with a 4k sliding window.^[2]
The rolling buffer cache reported in the paper reduced cache memory usage by 8x on a 32k-token sequence relative to dense attention.^[1]
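The weight-memory figures in this list follow from simple arithmetic. The sketch below estimates the footprint of the weights alone at a few precisions, using the ~7.24 billion parameter count quoted earlier. The function name is illustrative, and the estimate ignores the KV cache, activations, and any per-block quantisation metadata.

```python
N_PARAMS = 7.24e9  # approximate total parameter count quoted in this article


def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Gigabytes (10**9 bytes) needed to store the weights alone."""
    return n_params * bits_per_weight / 8 / 1e9


for label, bits in (("bf16", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label:>5}: ~{weight_footprint_gb(N_PARAMS, bits):.1f} GB of weights")
```

Real 4-bit files come out somewhat larger than the idealised figure, closer to the 4 GB cited above, because quantised formats store scaling factors alongside the weights and typically keep some tensors at higher precision.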

Tooling support spread quickly. Ollama added a pre-packaged Mistral 7B build very early, and the model became one of the most-downloaded entries in the Ollama library through 2024 and 2025.^[41] LM Studio, Jan, GPT4All, and the major commercial inference hosts (Together AI, Anyscale, Fireworks, Replicate, OpenRouter, and others) all offered Mistral 7B support or hosted endpoints within weeks of release. The official mistralai/Mistral-7B-v0.1 repository logged over 500,000 downloads in its first month of release and well above 900,000 monthly downloads for stretches of 2024 to 2025, putting it among the most-downloaded open-weights causal-LM repositories on the platform.^[42]^[43]

Recommended fine-tuning tools

Because v0.1 was released under Apache 2.0 with no acceptable-use clause, the post-launch fine-tuning ecosystem was unusually wide. The most commonly used tools for Mistral 7B fine-tuning include Hugging Face's Transformers plus TRL (which ships built-in SFT and DPO trainers), PEFT for parameter-efficient methods such as LoRA and QLoRA adapter fine-tunes, and Mistral's own mistral-finetune repository released alongside v0.3.^[44] LLaMA-Factory also added Mistral support among its first batch of non-LLaMA architectures.

Is Mistral 7B open source?

Mistral 7B is released under the Apache License, version 2.0, one of the most permissive licenses in use for foundation models. The release used two distribution channels at once. The official Hugging Face repository at mistralai/Mistral-7B-v0.1 (and the corresponding instruct variants) hosted the SafeTensors weights.^[3] Separately, the Mistral team posted a BitTorrent magnet link on social media a day before the blog post went live. The torrent contained the same weights plus a sample inference script.^[4]^[45]

The license carries no acceptable-use addendum, no platform-size restrictions, no separate research-only clause, and no requirement to identify model outputs.^[2]^[3] By contrast, Llama 2's "Community License" at the time included a 700-million-monthly-active-user restriction, an acceptable-use policy, and a clause barring the use of Llama outputs to improve other large language models.^[16]^[46] Mistral's own framing in the launch blog was unambiguous: the model "can be used without restrictions."^[2]

The combination of a recognised permissive license, a clean state-of-the-art claim at the 7B size, and a low barrier to actually running the thing was the trifecta that drove adoption.^[4]^[5]

The Mistral model family after 7B

Mistral 7B was the first in what has become a wide line of releases. The most relevant follow-ups for understanding its place in the family:

Model	Released	Notes
Mistral 7B	September 27, 2023	Dense 7.3B, Apache 2.0.^[2]
Mixtral 8x7B	December 11, 2023	Sparse mixture-of-experts: 8 expert feed-forward blocks per layer, 2 routed per token. About 46.7B total parameters and ~13B active per token; the attention weights are shared across experts, so the total is well under 8 × 7B. Apache 2.0.^[12]
Mistral Medium	December 2023	First proprietary commercial model (closed weights).^[9]
Mistral Large	February 26, 2024	Closed-weights commercial flagship, first hosted on Azure via Microsoft partnership.^[35]^[36]
Mixtral 8x22B	April 2024	Bigger MoE successor to Mixtral 8x7B. Apache 2.0.^[9]
Mistral 7B v0.3	May 22, 2024	Updated tokenizer (32,768-entry vocab), function calling.^[6]
Codestral 22B	May 29, 2024	Code-focused dense model under the Mistral Non-Production License.^[47]
Codestral Mamba 7B	July 16, 2024	First Mistral model using the Mamba state-space architecture.^[9]
Mathstral 7B	July 16, 2024	Math-focused fine-tune.^[9]
Mistral NeMo 12B	July 18, 2024	12B model built with NVIDIA, 128K context, Tekken tokenizer.^[21]
Pixtral 12B	September 2024	First multimodal Mistral release; based on the NeMo 12B text backbone.^[48]
Ministral 3B / 8B	October 2024	Smaller models for edge use.^[9]
Mistral Small 3 (24B)	January 30, 2025	24B dense, Apache 2.0, ~81% MMLU, 32K context.^[49]
Mistral Small 3.1 / 3.2	March 2025 / June 2025	Successive updates to the 24B Small line.^[9]
Mistral Medium 3	May 2025	Enterprise-grade dense model.^[9]
Magistral Small / Medium	June 2025	Reasoning-focused models.^[9]
Mistral Large 3	December 2, 2025	Flagship sparse mixture-of-experts model with 675B total / 41B active parameters; commercial.^[50]
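
The gap between "total" and "active" parameters in the Mixtral row can be checked with a short count. The sketch below is a back-of-the-envelope estimate, not Mistral's own accounting: the layer dimensions are taken from the published Mistral 7B and Mixtral 8x7B configurations (they are not stated in this article), and the `Config` class and `count_params` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Defaults are the Mistral 7B / Mixtral 8x7B shapes (assumed from the
    # published model configurations).
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: int = 8
    head_dim: int = 128
    ffn_dim: int = 14336
    vocab: int = 32000
    n_experts: int = 1   # 1 means a dense model
    top_k: int = 1       # experts used per token


def count_params(cfg: Config) -> tuple[int, int]:
    """Return (total, active-per-token) parameter counts."""
    q_out = cfg.n_heads * cfg.head_dim
    kv_out = cfg.n_kv_heads * cfg.head_dim
    # Query, key, value, and output projections (grouped-query attention).
    attn = cfg.dim * q_out + 2 * cfg.dim * kv_out + q_out * cfg.dim
    # One SwiGLU feed-forward block: gate, up, and down matrices.
    expert = 3 * cfg.dim * cfg.ffn_dim
    # Router that scores the experts; absent in a dense model.
    router = cfg.dim * cfg.n_experts if cfg.n_experts > 1 else 0
    norms = 2 * cfg.dim
    shared = attn + router + norms
    # Input embedding, output projection, and the final norm.
    embed = 2 * cfg.vocab * cfg.dim + cfg.dim
    total = cfg.n_layers * (shared + cfg.n_experts * expert) + embed
    active = cfg.n_layers * (shared + cfg.top_k * expert) + embed
    return total, active


if __name__ == "__main__":
    for name, cfg in [("Mistral 7B", Config()),
                      ("Mixtral 8x7B", Config(n_experts=8, top_k=2))]:
        total, active = count_params(cfg)
        print(f"{name}: {total / 1e9:.2f}B total, {active / 1e9:.2f}B active")
```

Under these assumptions the dense model comes to about 7.24B parameters (slightly below the 7.3B figure usually quoted), and the 8-expert variant to about 46.7B total with roughly 12.9B touched per token. Only the feed-forward blocks are replicated per expert, which is why the total is far below 8 × 7B.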

By the time Mistral Large 3 shipped in late 2025, the original 7B was no longer the company's headline product, but it had not been retired. The base v0.3 weights remained one of the most heavily downloaded checkpoints on Hugging Face and stayed in active use for fine-tuning, distillation, and edge deployment.^[43]^[50]

What are the limitations of Mistral 7B?

Mistral 7B is, by 2026 standards, a small model. There are clear limits.

Knowledge is dated. The pretraining cutoff is roughly mid-2023. Anything after that has to come from retrieval, fine-tuning, or in-context examples.^[1]
It is strongest in English. Multilingual coverage is acceptable but not on par with later Mistral releases like NeMo 12B or with comparable multilingual-first models.^[21]
Short context. The 8K context of v0.1 is short by current standards. v0.2 and v0.3 raised this to 32K but still fall short of the very long contexts (200K+) now common in flagship proprietary models.^[7]^[20]
No built-in moderation. Mistral has been explicit that safety alignment is left to downstream users, which is part of the appeal for some and the criticism for others.^[1]^[3]
Surpassed by newer 7B to 9B-class models. Llama 3.1 8B, Qwen2 and Qwen2.5 7B, Gemma 2 9B, and several task-specific fine-tunes all surpass v0.1 on most public leaderboards. For pure capability per parameter, Mistral 7B is no longer state of the art.^[17]^[18]
Documented sliding-window quality cliffs. In v0.1, quality can degrade once the input runs past the 4K attention window; Mistral dropped SWA in v0.2 when it moved to a 32K context with full attention.^[7]^[20]
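
The context-length trade-off in the list above has a direct memory cost. The sketch below estimates the key/value cache size for a single sequence, assuming fp16 storage and the published Mistral 7B shapes (32 layers, 8 key/value heads, head dimension 128). It compares a cache capped at the 4K sliding window with a full 32K cache; the function name and the fp16 assumption are illustrative, and real serving stacks add their own overheads.

```python
# Mistral 7B attention shapes (assumed from the published configuration).
N_LAYERS = 32
N_KV_HEADS = 8      # grouped-query attention keeps only 8 key/value heads
HEAD_DIM = 128
BYTES_FP16 = 2      # assumption: cache stored in 16-bit floats


def kv_cache_bytes(cached_tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 covers keys plus values, stored for every layer and KV head.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16
    return per_token * cached_tokens


if __name__ == "__main__":
    for label, tokens in [("4K sliding window", 4096),
                          ("32K full attention", 32768)]:
        print(f"{label}: {kv_cache_bytes(tokens) / 2**30:.1f} GiB")
```

With these numbers each cached token costs 128 KiB, so a cache bounded by the 4K window stays at 0.5 GiB however long the sequence runs, while full attention over 32K tokens needs 4 GiB per sequence.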

What it remains useful for: a strong, well-documented, permissively licensed baseline for fine-tuning research and a standard reference architecture for understanding the GQA-plus-SWA design pattern.

Recent status (2025 to 2026)

In 2025 and into 2026 Mistral 7B continues to show up as the default starting point for academic fine-tuning papers, for university courses on LLM internals, and for production deployments where a small, locally hosted, permissively licensed model is the right fit. Mistral AI has not deprecated it. The v0.3 weights are still served from the official Hugging Face organisation, and Ollama, llama.cpp, vLLM, and TGI all maintain support.^[41]^[43]

Mistral AI itself has shifted its public emphasis toward larger commercial models (Mistral Large 3, Mistral Medium 3, Magistral Medium) and toward the Mixtral MoE line. In September 2025 ASML invested €1.3 billion in a partnership that gave the Dutch lithography company an 11% stake and made it Mistral's biggest single shareholder, a signal that the company is positioning itself as a long-term European AI champion with deep ties to the European semiconductor industry.^[38]^[39] In late 2025 and early 2026 Mistral also broke ground on data centres near Paris and in Sweden, supported by a roughly $830 million infrastructure round, the first dedicated computing build-out of that scale for a European AI lab.^[9]

The original 7B sits in the company's history the same way LLaMA 1 sits in Meta's: the first one out the door, the proof of concept, the model that made everything afterward easier to ship.

References

1. Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., El Sayed, W., "Mistral 7B", arXiv preprint, 2023-10-10. https://arxiv.org/abs/2310.06825. Accessed 2026-06-21.
2. Mistral AI, "Mistral 7B: The best 7B model to date, Apache 2.0", Mistral AI blog, 2023-09-27. https://mistral.ai/news/announcing-mistral-7b. Accessed 2026-06-21.
3. Hugging Face, "mistralai/Mistral-7B-v0.1 model card", Hugging Face Hub, 2023. https://huggingface.co/mistralai/Mistral-7B-v0.1. Accessed 2026-06-21.
4. Franzen, C., "Mistral AI bucks release trend by dropping torrent link to new open source LLM", VentureBeat, 2023-09-27. https://venturebeat.com/ai/mistral-ai-bucks-release-trend-by-dropping-torrent-link-to-new-open-source-llm. Accessed 2026-06-21.
5. Maiberg, E., "$260 Million AI Company Releases Undeletable Chatbot That Gives Detailed Instructions on Murder, Ethnic Cleansing", 404 Media, 2023-09-29. https://www.404media.co/260-million-ai-company-releases-chatbot-that-gives-detailed-instructions-on-murder-ethnic-cleansing/. Accessed 2026-06-21.
6. Hugging Face, "mistralai/Mistral-7B-Instruct-v0.3 model card", Hugging Face Hub, 2024-05-22. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3. Accessed 2026-06-21.
7. Hugging Face, "mistralai/Mistral-7B-Instruct-v0.2 model card", Hugging Face Hub, 2023-12-11. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2. Accessed 2026-06-21.
8. Together AI, "Mistral (7B) Instruct v0.2 API", Together AI model catalog, 2024. https://www.together.ai/models/mistral-7b-instruct-v0-2. Accessed 2026-06-21.
9. Wikipedia contributors, "Mistral AI", Wikipedia, 2026 revision. https://en.wikipedia.org/wiki/Mistral_AI. Accessed 2026-06-21.
10. Dillet, R., "France's Mistral AI blows in with a $113M seed round at a $260M valuation to take on OpenAI", TechCrunch, 2023-06-13. https://techcrunch.com/2023/06/13/frances-mistral-ai-blows-in-with-a-113m-seed-round-at-a-260m-valuation-to-take-on-openai/. Accessed 2026-06-21.
11. Tech.eu, "FantAstIque! French start-up Mistral AI raises a €105 million Seed round in its first month of existence", Tech.eu, 2023-06-14. https://tech.eu/2023/06/14/fantastique-french-start-up-mistral-ai-raises-a-105-million-seed-round-in-its-first-month-of-existence/. Accessed 2026-06-21.
12. Mistral AI, "Mixtral of experts: A high quality Sparse Mixture-of-Experts", Mistral AI blog, 2023-12-11. https://mistral.ai/news/mixtral-of-experts. Accessed 2026-06-21.
13. Touvron, H., Lavril, T., Izacard, G. et al., "LLaMA: Open and Efficient Foundation Language Models", arXiv:2302.13971, 2023-02-27. https://arxiv.org/abs/2302.13971. Accessed 2026-06-21.
14. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., Sanghai, S., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints", arXiv:2305.13245, 2023-05-22. https://arxiv.org/abs/2305.13245. Accessed 2026-06-21.
15. Shazeer, N., "GLU Variants Improve Transformer", arXiv:2002.05202, 2020-02-12. https://arxiv.org/abs/2002.05202. Accessed 2026-06-21.
16. Touvron, H., Martin, L., Stone, K. et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models", arXiv:2307.09288, 2023-07-18. https://arxiv.org/abs/2307.09288. Accessed 2026-06-21.
17. Llama Team, "The Llama 3 Herd of Models", arXiv:2407.21783, 2024-07-31. https://arxiv.org/abs/2407.21783. Accessed 2026-06-21.
18. Google DeepMind, "Gemma: Open Models Based on Gemini Research and Technology", arXiv:2403.08295, 2024-03-13. https://arxiv.org/abs/2403.08295. Accessed 2026-06-21.
19. Beltagy, I., Peters, M. E., Cohan, A., "Longformer: The Long-Document Transformer", arXiv:2004.05150, 2020-04-10. https://arxiv.org/abs/2004.05150. Accessed 2026-06-21.
20. Mistral AI Labs (@MistralAILabs), "New release: Mistral 7B v0.2 Base", X (Twitter), 2024-03-23. https://x.com/MistralAILabs/status/1771670765521281370. Accessed 2026-06-21.
21. Mistral AI, "Mistral NeMo", Mistral AI blog, 2024-07-18. https://mistral.ai/news/mistral-nemo. Accessed 2026-06-21.
22. Hugging Face, "mistralai/Mistral-7B-Instruct-v0.1 model card", Hugging Face Hub, 2023-09-27. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1. Accessed 2026-06-21.
23. Beeching, E. et al., "Open LLM Leaderboard", Hugging Face Spaces, 2023 to 2024. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard. Accessed 2026-06-21.
24. Cerebral Valley (@cerebral_valley), "@MistralAI just announced Mistral 7B v0.2 Base Model at our hackathon at @SHACK15sf", X (Twitter), 2024-03-23. https://x.com/cerebral_valley/status/1771630171679776900. Accessed 2026-06-21.
25. Hugging Face, "mistral-community/Mistral-7B-v0.2 (base, March 2024 release)", Hugging Face Hub, 2024-03. https://huggingface.co/mistral-community/Mistral-7B-v0.2. Accessed 2026-06-21.
26. MarkTechPost, "Mistral AI Team Releases the Mistral-7B-Instruct-v0.3", MarkTechPost, 2024-05-22. https://www.marktechpost.com/2024/05/22/mistral-ai-team-releases-the-mistral-7b-instruct-v0-3-an-instruct-fine-tuned-version-of-the-mistral-7b-v0-3/. Accessed 2026-06-21.
27. Hugging Face, "HuggingFaceH4/zephyr-7b-beta model card", Hugging Face Hub, 2023-10. https://huggingface.co/HuggingFaceH4/zephyr-7b-beta. Accessed 2026-06-21.
28. Tunstall, L., Beeching, E., Lambert, N. et al., "Zephyr: Direct Distillation of LM Alignment", arXiv:2310.16944, 2023-10-25. https://arxiv.org/abs/2310.16944. Accessed 2026-06-21.
29. Hugging Face, "teknium/OpenHermes-2.5-Mistral-7B model card", Hugging Face Hub, 2023-11-03. https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B. Accessed 2026-06-21.
30. Bartolomé, A. and Vila-Suero, D., "Introducing Notus: A DPO fine-tune of Zephyr with a focus on high-quality data", Hugging Face Blog, 2023-11-29. https://huggingface.co/blog/alvarobartt/notus-7b-v1. Accessed 2026-06-21.
31. Hugging Face, "berkeley-nest/Starling-LM-7B-alpha model card", Hugging Face Hub, 2023-11. https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha. Accessed 2026-06-21.
32. Hugging Face, "cognitivecomputations/dolphin-2.8-mistral-7b-v02 model card", Hugging Face Hub, 2024-03. https://huggingface.co/cognitivecomputations/dolphin-2.8-mistral-7b-v02. Accessed 2026-06-21.
33. Tech.eu, "Mistral AI confirms €385M Series A funding round", Tech.eu, 2023-12-11. https://tech.eu/2023/12/11/mistral-ai-confirms-385m-series-a-funding-round/. Accessed 2026-06-21.
34. Latham & Watkins, "Latham Represents a16z on €385 Million Series A Funding Round of Mistral AI", Latham & Watkins press release, 2023-12. https://www.lw.com/en/news/2023/12/latham-advises-a16z-on-mistral-ai-series-a-funding-round. Accessed 2026-06-21.
35. Microsoft Azure, "Microsoft and Mistral AI announce new partnership to accelerate AI innovation and introduce Mistral Large first on Azure", Microsoft Azure blog, 2024-02-26. https://azure.microsoft.com/en-us/blog/microsoft-and-mistral-ai-announce-new-partnership-to-accelerate-ai-innovation-and-introduce-mistral-large-first-on-azure/. Accessed 2026-06-21.
36. Browne, R., "Microsoft invests in Europe's Mistral AI to expand beyond OpenAI", CNBC, 2024-02-26. https://www.cnbc.com/2024/02/26/microsoft-invests-in-europes-mistral-ai-to-expand-beyond-openai.html. Accessed 2026-06-21.
37. Dillet, R., "Paris-based AI startup Mistral AI raises $640M", TechCrunch, 2024-06-11. https://techcrunch.com/2024/06/11/paris-based-ai-startup-mistral-ai-raises-640-million/. Accessed 2026-06-21.
38. CNBC, "AI firm Mistral valued at $14 billion as chip giant ASML takes major stake", CNBC, 2025-09-09. https://www.cnbc.com/2025/09/09/ai-firm-mistral-valued-at-14-billion-as-asml-takes-major-stake.html. Accessed 2026-06-21.
39. Mistral AI, "Mistral AI raises €1.7B to accelerate technological progress with AI", Mistral AI blog, 2025-09-09. https://mistral.ai/news/mistral-ai-raises-1-7-b-to-accelerate-technological-progress-with-ai. Accessed 2026-06-21.
40. Hugging Face, "TheBloke/Mistral-7B-v0.1-GGUF model card", Hugging Face Hub, 2023-09. https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF. Accessed 2026-06-21.
41. Ollama, "mistral", Ollama model library, 2023 to 2026. https://ollama.com/library/mistral. Accessed 2026-06-21.
42. Quantumrun Foresight, "Mistral 7B Statistics and User Trends", Quantumrun, 2025. https://www.quantumrun.com/consulting/mistral-7b-statistics/. Accessed 2026-06-21.
43. Hugging Face, "mistralai/Mistral-7B-v0.3 model card", Hugging Face Hub, 2024-05-22. https://huggingface.co/mistralai/Mistral-7B-v0.3. Accessed 2026-06-21.
44. Mistral AI, "mistral-finetune (GitHub repository)", GitHub, 2024. https://github.com/mistralai/mistral-finetune. Accessed 2026-06-21.
45. Slashdot, "$260 Million AI Startup Releases 'Unmoderated' Chatbot Via Torrent", Slashdot, 2023-09-29. https://slashdot.org/story/23/09/29/2024216/260-million-ai-startup-releases-unmoderated-chatbot-via-torrent. Accessed 2026-06-21.
46. Meta AI, "Llama 2 Community License Agreement", Meta AI, 2023-07. https://ai.meta.com/llama/license/. Accessed 2026-06-21.
47. Mistral AI, "Codestral: Hello, World!", Mistral AI blog, 2024-05-29. https://mistral.ai/news/codestral. Accessed 2026-06-21.
48. Wiggers, K., "Mistral releases Pixtral 12B, its first multimodal model", TechCrunch, 2024-09-11. https://techcrunch.com/2024/09/11/mistral-releases-pixtral-its-first-multimodal-model/. Accessed 2026-06-21.
49. Mistral AI, "Mistral Small 3", Mistral AI blog, 2025-01-30. https://mistral.ai/news/mistral-small-3. Accessed 2026-06-21.
50. Wikipedia contributors, "Mistral AI (model timeline)", Wikipedia, 2026 revision. https://en.wikipedia.org/wiki/Mistral_AI#Models. Accessed 2026-06-21.
