Codestral
Last reviewed
May 7, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 · 4,398 words
Codestral is a family of code-specialized large language models developed by Mistral AI. The family began with the release of Codestral 22B on May 29, 2024, making it Mistral AI's first model built specifically for software development tasks. Subsequent releases expanded the family to include a Mamba-architecture variant, multiple updated versions, a dedicated code embedding model, and an agentic coding model called Devstral. By 2025, Mistral had positioned the entire Codestral family as a complete coding stack for enterprise software development.
Mistral AI was founded in April 2023 by three former researchers: Arthur Mensch (previously at Google DeepMind) and Guillaume Lample and Timothée Lacroix (both previously at Meta AI). The Paris-based company built an early reputation on releasing capable open-weight models such as Mistral 7B and Mixtral 8x7B, distinguishing itself from competitors through a commitment to open weights and efficiency-focused architecture.
By 2024, demand for purpose-built code generation models had grown considerably. Projects like DeepSeek Coder and CodeLlama had demonstrated that models trained on large code corpora with coding-specific fine-tuning could substantially outperform general-purpose models on programming tasks. Fill-in-the-middle (FIM) capabilities, where a model completes code given both preceding and following context, had become a standard requirement for IDE integrations. Mistral entered this space with Codestral 22B, which it described as setting a new standard on the performance-to-latency tradeoff for code generation.
Mistral released Codestral 22B v0.1 on May 29, 2024. The model has 22.2 billion parameters and was trained on a dataset spanning more than 80 programming languages, including Python, Java, C, C++, JavaScript, TypeScript, Rust, Go, and Bash. The model uses BF16 tensor precision and was built on the same transformer architecture used across Mistral's model family, incorporating grouped-query attention (GQA) for faster inference.
One of the distinguishing features of the initial release was its 32,768-token context window. At the time, most competing open-source code models offered 4,096 to 16,384 tokens of context. The larger window allowed Codestral 22B to handle repository-level tasks more reliably, particularly for benchmarks that require reading across multiple files or long functions.
The model supports fill-in-the-middle completion, where a prompt specifies a prefix and suffix and the model generates the missing middle section. This mechanism is central to how IDE plugins like Continue and Tabnine use the model: the surrounding code in an open file provides both prefix and suffix context, and Codestral generates a completion that fits between them.
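As a rough illustration of the prefix and suffix split described above, the sketch below builds the JSON body an editor plugin might send for a FIM completion. The endpoint URL, the `codestral-latest` alias, and the `prompt`, `suffix`, and `max_tokens` field names are assumptions based on Mistral's public API conventions and should be checked against the current API reference; no network call is made here.

```python
import json

# Assumed endpoint and model alias; verify against Mistral's API reference before use.
FIM_URL = "https://api.mistral.ai/v1/fim/completions"


def build_fim_request(source: str, cursor: int, model: str = "codestral-latest",
                      max_tokens: int = 64) -> dict:
    """Split the editor buffer at the cursor and build a FIM request body."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor must lie inside the buffer")
    return {
        "model": model,
        "prompt": source[:cursor],   # code before the cursor (prefix)
        "suffix": source[cursor:],   # code after the cursor (suffix)
        "max_tokens": max_tokens,
    }


# Hypothetical buffer: the cursor sits at the end of the indented blank line.
buffer = "def add(a, b):\n    \n\nprint(add(1, 2))\n"
cursor = buffer.index("    ") + 4
body = build_fim_request(buffer, cursor)
print(json.dumps(body, indent=2))
```

The model's reply would then be inserted at the cursor position, between the prefix and the suffix.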
On the HumanEval benchmark, which tests Python function synthesis from docstrings, Codestral 22B scored 81.1%. It also outperformed all other models available at the time on RepoBench, a benchmark designed to measure cross-file code completion that requires understanding repository structure. Mistral attributed this lead specifically to the 32K context window, which allowed the model to incorporate more relevant code from other files in the same repository.
Mistral released the weights on Hugging Face under the Mistral AI Non-Production License (MNPL), a custom license that permits research and personal use but restricts commercial deployment of the model weights. API access through Mistral's La Plateforme was available for commercial use under standard API terms.
On July 16, 2024, Mistral released Codestral Mamba 7B (model card: Mamba-Codestral-7B-v0.1), the first model in their lineup built on the Mamba 2 architecture rather than a standard transformer. The release was notable as an experiment in applying state space model (SSM) architectures to code generation at a time when Mamba models were attracting significant research interest as potential transformer alternatives.
The Mamba architecture processes sequences through a selective state space mechanism rather than the attention mechanism used in transformers. This gives Mamba models a key property: inference time scales linearly with sequence length rather than quadratically. Transformer-based models require attention computations that grow with the square of the input length, which becomes a bottleneck for very long contexts. A Mamba model processes each new token in roughly constant time regardless of how long the preceding sequence is.
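The scaling difference can be made concrete with a simple count. The sketch below is a simplification that ignores constant factors such as head dimension and state size: it counts one unit of work per token-to-token interaction for causal attention and one unit per state update for a recurrent model.

```python
def attention_steps(n_tokens: int) -> int:
    """Token-to-token interactions when token t attends to all t tokens so far."""
    return n_tokens * (n_tokens + 1) // 2


def recurrent_steps(n_tokens: int) -> int:
    """State updates when each token touches a fixed-size state exactly once."""
    return n_tokens


for n in (1_000, 32_768, 256_000):
    ratio = attention_steps(n) / recurrent_steps(n)
    print(f"{n:>7} tokens: attention does ~{ratio:,.0f}x the work of a recurrent scan")
```

Under this counting the ratio is (n + 1) / 2, so the gap widens in direct proportion to sequence length.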
Mistral tested Codestral Mamba on in-context retrieval tasks up to 256,000 tokens and found it maintained competitive performance at those lengths. On HumanEval, the model scored 75.0%, and on MBPP (Mostly Basic Python Problems), it scored 68.5%. These results were competitive with other 7B-class code models at the time, including CodeLlama 7B and CodeGemma 7B, which scored lower on the same benchmarks.
Unlike the original Codestral 22B, Codestral Mamba was released under the Apache 2.0 license, allowing commercial use without restriction. This licensing difference reflected Mistral's different positioning of the two models: the 22B model was targeted at commercial API usage, while the 7B Mamba model was released more openly for research and experimentation.
The model was available on La Plateforme under the identifier codestral-mamba-2407 and could be deployed locally using Mistral's mistral-inference SDK, which relies on reference implementations from the Mamba GitHub repository.
Mistral announced Codestral 25.01 on January 13, 2025. The update brought substantial improvements in generation speed, which Mistral described as approximately twice as fast as the original 22B model. The speed increase came from a redesigned architecture and an improved tokenizer that reduces the number of tokens needed to represent code, particularly for programming languages with verbose syntax.
On the LMSys Copilot Arena leaderboard, which aggregates developer votes on which model produces better in-context code completions, Codestral 25.01 debuted at first place. The model continued to support fill-in-the-middle completion across 80+ languages and maintained focus on low-latency, high-frequency tasks suited to IDE integration.
The release expanded Codestral's distribution beyond La Plateforme. At launch, it became available on Google Cloud Vertex AI, with private preview access on Azure AI Foundry and forthcoming availability on Amazon Bedrock. Mistral also made on-premises deployment available for enterprises requiring data residency, allowing the model to run within a private VPC without sending data to Mistral's own infrastructure.
In terms of benchmark scores, Codestral 25.01 reached 86.6% on HumanEval, 80.2% on MBPP, and improved its RepoBench score to 38.0%. LiveCodeBench, a more recent benchmark that tests competitive programming problems, showed a score of 37.9%.
On August 1, 2025, Mistral released Codestral 25.08 alongside what it called a complete coding stack for enterprise development. This version brought several measurable improvements over 25.01:
The model supports a context window of 256,000 tokens, on par with the lengths tested for Codestral Mamba, which makes it practical to pass large portions of a repository to the model in a single call. Mistral reported a 30% increase in the rate at which developers accepted the model's code completions without modification, a 50% reduction in runaway generations (where the model produces irrelevant or excessively long output), and a 5% improvement on instruction-following benchmarks.
Codestral 25.08 can be deployed in the cloud, within a private virtual network, or on a company's own servers without architecture changes. The model maintains the same fill-in-the-middle interface and 80+ language support as previous versions.
The August 2025 release was paired with the formal launch of Mistral Code, a native IDE extension for VS Code and JetBrains IDEs that packages Codestral completions, Devstral agentic automation, and Codestral Embed semantic search into a single integrated tool.
Mistral released Codestral Embed on May 28, 2025, under the API identifier codestral-embed-2505. It is Mistral's first embedding model designed specifically for code rather than natural language.
Embedding models convert text or code into dense numerical vectors that can be used for semantic similarity search, clustering, and retrieval-augmented generation (RAG). General-purpose embedding models, such as OpenAI's text-embedding-3-large or Cohere Embed, are not optimized primarily for code and tend to underperform on tasks like finding semantically similar functions across a large codebase or mapping a natural language query to relevant source files.
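A minimal sketch of how such vectors are used for semantic search follows. The three-dimensional vectors and function names are hypothetical stand-ins; a real embedding model returns vectors with hundreds or thousands of dimensions, and production systems typically use a vector index rather than a full sort.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k entries whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]


# Hypothetical embeddings of three functions in a codebase.
index = {
    "parse_json": [0.9, 0.1, 0.0],
    "load_yaml": [0.7, 0.3, 0.1],
    "send_email": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]   # stands in for the embedding of "read a config file"
print(top_k(query, index))
```

The query never mentions either function name; the match comes entirely from vector proximity.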
Mistral benchmarked Codestral Embed against Voyage Code 3, Cohere Embed v4.0, and OpenAI's text-embedding-3-large on code-specific retrieval tasks including SWE-Bench repository search and Text2Code retrieval from GitHub. In both evaluations, Codestral Embed outperformed all three competitors. The model supports configurable output dimensions, allowing users to trade off between retrieval quality and storage cost. Mistral noted that even at 256 dimensions with int8 quantized precision, the model outperformed competing models running at higher precision and larger dimensions.
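The storage side of that trade-off is simple arithmetic: each vector costs its dimension count times the bytes per value. The sketch below compares 256-dimensional int8 vectors with a hypothetical 1,536-dimensional float32 configuration; the 1,536 figure and the one million chunk count are illustrative assumptions, not Codestral Embed specifications.

```python
BYTES_PER_VALUE = {"float32": 4, "int8": 1}


def index_size_bytes(n_vectors: int, dims: int, precision: str) -> int:
    """Raw storage for n_vectors embeddings, excluding index overhead."""
    return n_vectors * dims * BYTES_PER_VALUE[precision]


n = 1_000_000  # hypothetical number of indexed code chunks
small = index_size_bytes(n, 256, "int8")
large = index_size_bytes(n, 1536, "float32")  # illustrative dimension only
print(f"256-dim int8:     {small / 1e6:.0f} MB")
print(f"1536-dim float32: {large / 1e6:.0f} MB")
print(f"ratio: {large // small}x")
```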
Codestral Embed is priced at $0.15 per million tokens on La Plateforme, with a 50% discount for batch API usage. The primary use cases are code search in IDE plugins, semantic deduplication of code in large repositories, RAG pipelines for coding agents, and mapping pull request descriptions to relevant source files.
Devstral was released on May 21, 2025, built in collaboration between Mistral AI and All Hands AI, the organization behind the OpenHands agent framework. Where Codestral models are optimized for fast single-call code completion, Devstral is designed for multi-step agentic tasks that require planning, tool use, and modifying multiple files across a codebase.
The initial Devstral release scored 46.8% on SWE-Bench Verified, a benchmark that tests whether a model can resolve real GitHub issues in open-source repositories by applying code changes. This score placed it ahead of all open-source models at the time by more than six percentage points. The model was released under the Apache 2.0 license and has 24 billion parameters, making it small enough to run on a single Nvidia RTX 4090 or a Mac with 32 GB of RAM.
Devstral's API pricing at launch was $0.1 per million input tokens and $0.3 per million output tokens, making it one of the least expensive agentic coding models available.
A second version, Devstral 2, was released in December 2025. This version expanded to 123 billion parameters in the full variant and introduced a 24B small variant called Devstral Small 2. Devstral 2 (123B) achieved 72.2% on SWE-Bench Verified, while Devstral Small 2 (24B) reached 68.0%. Mistral claimed the larger model was up to seven times more cost-efficient than Claude Sonnet at equivalent real-world software engineering tasks. Devstral Small 2 was released under the Apache 2.0 license; the 123B model used a modified MIT license with additional terms.
Mistral also released Mistral Vibe CLI alongside Devstral 2: an open-source command-line coding assistant powered by Devstral that can explore, modify, and execute changes across a codebase using natural language instructions.
Codestral 22B and the 25.01 and 25.08 updates are built on a transformer decoder architecture consistent with the broader Mistral model family. The architecture incorporates grouped-query attention (GQA), which reduces the number of key-value heads compared to multi-head attention, lowering memory bandwidth requirements during inference without significantly degrading output quality. Sliding window attention, used in earlier Mistral 7B models, may be applied in some layers to handle long-context inputs efficiently.
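The head-sharing idea behind GQA can be sketched as an index mapping: consecutive query heads are grouped, and every head in a group reads the same key-value head. The head counts below are hypothetical values chosen for illustration and are not Codestral's published configuration.

```python
def kv_head_for(query_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Return the key-value head that a given query head reads from."""
    if n_heads % n_kv_heads:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads
    return query_head // group_size


# Hypothetical head counts for illustration only.
N_HEADS, N_KV_HEADS = 8, 2
mapping = [kv_head_for(h, N_HEADS, N_KV_HEADS) for h in range(N_HEADS)]
print(mapping)                 # query heads 0-3 share KV head 0, heads 4-7 share KV head 1
print(N_HEADS // N_KV_HEADS)   # factor by which the KV cache shrinks versus multi-head attention
```

Because only the key-value heads are cached during generation, the cache shrinks by the group size while the number of query heads stays unchanged.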
Codestral Mamba 7B departs from this architecture entirely. It uses the Mamba 2 selective state space model framework, where the model maintains a fixed-size hidden state that is updated for each new token rather than attending over all previous tokens. The result is linear-time inference: generating the thousandth token takes roughly the same amount of computation as generating the tenth. The trade-off is that SSM models can sometimes underperform transformers on tasks that require precise retrieval of information from early in a long sequence.
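The fixed-size state can be illustrated with a toy scalar recurrence. This is a deliberately simplified, non-selective linear recurrence and not the Mamba 2 algorithm: Mamba makes the update coefficients depend on the input and uses a much larger state. The coefficients `a` and `b` are arbitrary illustrative values.

```python
def scan(tokens: list[float], a: float = 0.9, b: float = 0.1) -> list[float]:
    """Run h_t = a * h_{t-1} + b * x_t and return the state after each token."""
    h = 0.0                      # the entire memory of the sequence: one number here
    outputs = []
    for x in tokens:
        h = a * h + b * x        # constant work per token, independent of position
        outputs.append(h)
    return outputs


# One informative token followed by 50 uninformative ones.
tokens = [1.0] + [0.0] * 50
outputs = scan(tokens)
print(outputs[0], outputs[-1])
```

Each step costs the same regardless of position, but the first token's contribution decays by a factor of `a` at every step. That decay is a small-scale picture of why recalling details from early in a long sequence can be harder for state space models than for attention, which keeps every past token directly addressable.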
All Codestral models use byte-pair encoding (BPE) tokenization. The 25.01 update introduced a revised tokenizer with improved handling of common code patterns, contributing to the speed improvements in that release.
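Codestral 25.08 uses a context window of 256,000 tokens. At a rough average of 4 bytes of source text per token (a rule of thumb that varies with language and tokenizer), a full window holds on the order of 1 MB of code in a single call. BF16 precision does not enter into this figure: it stores each weight and cached value in 2 bytes and determines memory use on the serving side. In practice, useful context is limited by the model's ability to attend over very long sequences and by the memory requirements of the KV cache.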
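A back-of-the-envelope estimate shows why the KV cache, rather than the input text, dominates memory at this length. The layer count, KV head count, and head dimension below are hypothetical, since Mistral has not published these values for Codestral 25.08; the 4 bytes of text per token is a common rule of thumb.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Keys and values (factor 2) stored for every layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


TOKENS = 256_000
text_bytes = TOKENS * 4   # rule of thumb: about 4 bytes of source text per token
# Hypothetical model shape, for illustration only.
cache = kv_cache_bytes(TOKENS, layers=56, kv_heads=8, head_dim=128)
print(f"input text: ~{text_bytes / 1e6:.1f} MB")
print(f"KV cache:   ~{cache / 1e9:.1f} GB")
```

Under these assumptions the cached keys and values are tens of thousands of times larger than the raw input, which is why techniques that shrink the cache matter for long-context serving.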
Mistral has not published detailed information about the training data composition for the Codestral family. The company states that the models are trained on datasets spanning more than 80 programming languages. Languages confirmed by Mistral to be included are Python, Java, C, C++, JavaScript, TypeScript, Rust, Go, Bash, SQL, and HTML/CSS.
The training methodology follows the standard approach for code models: pretraining on large corpora of source code from public repositories (primarily GitHub), followed by instruction fine-tuning on code generation tasks. FIM-specific training is applied to teach the model how to generate completions given both prefix and suffix context, which is necessary for the model to work with the FIM prompting format used in IDE integrations.
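One widely described way to build FIM training examples is to cut a document into prefix, middle, and suffix and rearrange the pieces so the middle comes last, letting an ordinary left-to-right model learn to produce it. The sketch below illustrates that general prefix-suffix-middle arrangement. It is not confirmed to be Mistral's recipe, and the sentinel strings are placeholders for the dedicated special tokens a real tokenizer would use.

```python
import random

# Placeholder sentinel strings; real tokenizers use dedicated special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def to_fim_example(doc: str, rng: random.Random) -> str:
    """Cut doc at two random points and emit prefix, suffix, then middle."""
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def from_fim_example(example: str) -> str:
    """Invert the transformation (assumes doc contains no sentinel strings)."""
    head, middle = example.split(MID, 1)
    prefix, suffix = head[len(PRE):].split(SUF, 1)
    return prefix + middle + suffix


doc = "def f(x):\n    return x + 1\n"
example = to_fim_example(doc, random.Random(0))
print(example)
print(from_fim_example(example) == doc)
```

At inference time the prompt ends right after the middle sentinel, so the model's continuation is the text that belongs between prefix and suffix.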
Mistral has indicated that the 25.01 and 25.08 versions benefited from refined data curation and additional fine-tuning on code completion and correction tasks, which accounts for part of the improvement in developer-accepted completion rates.
Licensing within the Codestral family is not uniform:
Codestral 22B v0.1 was released under the Mistral AI Non-Production License (MNPL), introduced alongside the model in May 2024 by Mistral CEO Arthur Mensch. The MNPL allows use for research, testing, and non-commercial personal projects. It prohibits commercial use of the weights, including using the model to power internal tools within a commercial organization. Developers who want to deploy the weights commercially must obtain a separate commercial license from Mistral. API access through La Plateforme is not subject to the MNPL and is available for commercial use.
Codestral Mamba 7B was released under the Apache 2.0 license, which imposes no restrictions on commercial use. This made it the first freely deployable model in the Codestral family.
Codestral 25.01 and 25.08 are available via API on La Plateforme and through cloud provider integrations. Mistral has offered on-premises deployment for enterprise customers, with terms negotiated separately.
Devstral (original, 24B) was released under the Apache 2.0 license.
Devstral Small 2 (24B) uses the Apache 2.0 license. Devstral 2 (123B) uses a modified MIT license with additional conditions.
The MNPL was a point of contention in the developer community when it was introduced. Some contributors noted that, despite Mistral marketing Codestral as an "open" model, the MNPL is more restrictive than standard open-source licenses such as Apache 2.0 or the MIT license and does not qualify as open source under the Open Source Initiative definition.
HumanEval is a benchmark introduced by OpenAI that consists of 164 Python programming problems, each specified by a function signature and docstring. The model must generate a function body that passes a set of hidden unit tests. Scores are reported as pass@1, the fraction of problems where the first generated solution passes all tests.
| Model | HumanEval (pass@1) |
|---|---|
| Codestral 22B v0.1 | 81.1% |
| Codestral Mamba 7B | 75.0% |
| Codestral 25.01 | 86.6% |
| CodeLlama 70B | ~72% |
| DeepSeek Coder 7B | ~73% |
| Qwen2.5-Coder 7B | ~88% |
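The pass@1 figures above are a special case of pass@k. When several samples are drawn per problem, the estimator introduced with HumanEval computes the probability that at least one of k samples passes, given that c of n generated samples passed. A minimal sketch follows; the per-problem results in the example are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed."""
    if n - c < k:
        return 1.0               # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, each given as (samples, passing samples)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Invented results for four problems, 10 samples each.
results = [(10, 10), (10, 5), (10, 0), (10, 2)]
print(f"pass@1 = {benchmark_score(results, k=1):.3f}")
print(f"pass@5 = {benchmark_score(results, k=5):.3f}")
```

With a single sample per problem (n = 1, k = 1) the estimate reduces to the plain fraction of problems solved on the first attempt.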
MBPP (Mostly Basic Python Problems) contains roughly 1,000 short, crowd-sourced Python programming tasks, of which about 500 form the test split commonly used for reporting scores. Like HumanEval, it is evaluated by executing generated code against test cases.
| Model | MBPP |
|---|---|
| Codestral 22B v0.1 | ~78% |
| Codestral Mamba 7B | 68.5% |
| Codestral 25.01 | 80.2% |
RepoBench measures a model's ability to complete code by retrieving relevant context from across a repository. Performance on RepoBench correlates strongly with context window size, because relevant context may be spread across many files. Codestral 22B's 32K context window gave it a significant advantage over contemporary 4K and 8K context models when this benchmark was run at launch. Codestral 25.01 scored 38.0% on RepoBench.
Codestral 22B achieved a FIM pass@1 score of 95.3% on SingleLineInfilling, a benchmark that tests single-line code completions in FIM format. Mistral cited this as the highest score among all models, including closed-source ones, at the time of the original release.
SWE-Bench Verified is a subset of the SWE-Bench benchmark consisting of real GitHub issues verified by human annotators. Models are given a repository and an issue description and must produce a patch that resolves the issue.
| Model | SWE-Bench Verified |
|---|---|
| Devstral (May 2025, 24B) | 46.8% |
| Devstral Small 1.1 | 53.6% |
| Devstral Medium | 61.6% |
| Devstral Small 2 (24B) | 68.0% |
| Devstral 2 (123B) | 72.2% |
The table below compares key models in the Codestral family with competing code-specialized models as of mid-2025.
| Model | Developer | Parameters | HumanEval | Context | License |
|---|---|---|---|---|---|
| Codestral 25.01 | Mistral AI | not disclosed | 86.6% | 256K | Proprietary API |
| Codestral 22B v0.1 | Mistral AI | 22B | 81.1% | 32K | MNPL (weights) |
| Codestral Mamba 7B | Mistral AI | 7.3B | 75.0% | 256K (tested) | Apache 2.0 |
| DeepSeek Coder V2 | DeepSeek | 236B (21B active) | ~90% | 128K | DeepSeek License |
| Qwen2.5-Coder 7B | Alibaba | 7B | ~88% | 128K | Apache 2.0 |
| Qwen2.5-Coder 32B | Alibaba | 32B | ~92% | 128K | Apache 2.0 |
| CodeLlama 70B | Meta | 70B | ~72% | 100K | Meta Llama License |
Codestral's principal competitive advantages over the alternatives listed above are its fill-in-the-middle performance, its first-place standing on the LMSys Copilot Arena leaderboard (which reflects preference in real developer use rather than synthetic benchmarks), and the tight integration of the full Codestral stack into Mistral Code. DeepSeek Coder V2 and Qwen2.5-Coder 32B score higher on HumanEval, but they are substantially larger models and, in the case of DeepSeek, subject to export restrictions that limit deployment in some enterprise environments.
On FIM benchmarks, which are more directly relevant to the IDE use case, Codestral 22B's 95.3% score substantially exceeded competitors at the time of release, reflecting the model's training emphasis on this specific capability.
Mistral makes the Codestral family available through La Plateforme, its API service at api.mistral.ai. As of 2025, pricing for the main models was:
| Model | Input (per million tokens) | Output (per million tokens) |
|---|---|---|
| Codestral 25.01 / 25.08 | $0.30 | $0.90 |
| Codestral Mamba 7B | Free (rate limited) | Free (rate limited) |
| Codestral Embed | $0.15 | N/A |
| Devstral (May 2025, 24B) | $0.10 | $0.30 |
La Plateforme offers a free tier with rate limits for experimentation. Paid usage is pay-as-you-go with no monthly minimum. Codestral is also available via Google Cloud Vertex AI, Azure AI Foundry (where pricing may differ), and Amazon Bedrock.
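To turn the per-million-token prices above into a budget figure, multiply token counts by the listed rates. In the sketch below the dictionary keys are informal labels rather than API model identifiers, and the token volumes are hypothetical. The 50% batch discount is the figure quoted for Codestral Embed.

```python
# USD per million tokens, copied from the pricing table above.
PRICES = {
    "codestral": {"input": 0.30, "output": 0.90},
    "codestral-embed": {"input": 0.15, "output": 0.0},
    "devstral": {"input": 0.10, "output": 0.30},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int = 0,
             batch_discount: float = 0.0) -> float:
    """Estimated charge for a given token volume, before taxes or free-tier credit."""
    p = PRICES[model]
    raw = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    return raw * (1.0 - batch_discount)


# Hypothetical monthly volumes.
completion_cost = cost_usd("codestral", input_tokens=50_000_000, output_tokens=5_000_000)
embed_cost = cost_usd("codestral-embed", input_tokens=100_000_000, batch_discount=0.5)
print(f"completions: ${completion_cost:.2f}")
print(f"batch embeddings: ${embed_cost:.2f}")
```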
For enterprise deployments requiring data residency, Mistral offers on-premises and private VPC deployment options, typically under separate commercial agreements.
Codestral was designed from the start to work inside development environments as an autocomplete and chat backend. The primary integrations are:
Continue: The Continue extension for VS Code and JetBrains IDEs supports Codestral for both FIM-based autocomplete and chat. Mistral offers a separate Codestral API endpoint for FIM that Continue uses directly, with authentication through a dedicated Codestral API key issued on La Plateforme. Continue version 0.8.33 or later is required for full Codestral support.
Tabnine: Tabnine's AI coding assistant added Codestral as one of the available backend models in its chat feature. Users running Tabnine in VS Code or JetBrains IDEs can select Codestral alongside other models.
Mistral Code: Mistral's own native IDE extension, launched alongside Codestral 25.08 in August 2025, provides VS Code and JetBrains integration with the complete coding stack: line-by-line Codestral completions, one-click Devstral agentic automation for multi-step tasks, and Codestral Embed-powered semantic search over the local codebase. Mistral Code is positioned as the primary enterprise integration point for the full stack.
Ollama and LM Studio: The Codestral weights (where available under permissive licenses) can be run locally through Ollama and LM Studio, two popular tools for self-hosting language models. This applies particularly to Codestral Mamba 7B and Devstral, both of which carry open licenses.
GitHub Models: Codestral 25.01 was made available in GitHub Models in January 2025, allowing developers to experiment with the model within the GitHub interface.
The Codestral family covers several distinct software development use cases depending on the model variant:
Inline code completion: Codestral 22B and the 25.01/25.08 updates are primarily used for inline autocomplete, where the model generates the next line or block of code as a developer types. The FIM mechanism allows the model to generate completions that fit between existing lines of code, which is the standard paradigm for modern IDE code suggestions.
Test generation: Codestral models can generate unit tests for existing functions. Given a function signature and body, the model generates a test suite with representative inputs and expected outputs.
Code explanation and chat: Using the chat interface on La Plateforme or within IDE plugins, developers can ask Codestral to explain code, suggest refactors, identify bugs, or translate code between languages.
Repository-level search: Codestral Embed enables semantic search over codebases, allowing developers to find functions or classes by describing their behavior in natural language rather than by exact name or keyword match.
Agentic software engineering: Devstral handles multi-step workflows such as implementing a GitHub issue across multiple files, generating tests and confirming they pass, or refactoring a module according to a high-level description. These tasks require the model to interact with tools (file read/write, shell execution, test runners) in a loop.
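The tool loop described for agentic workflows can be sketched in a few lines. Everything here is hypothetical: the tool names, the history format, and the scripted policy that stands in for the model. A real agent framework would call a language model at the point where `policy` is invoked and would execute real file and shell operations.

```python
from typing import Callable

Action = tuple[str, str]  # (tool name, argument)


def run_agent(policy: Callable[[list[str]], Action],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 10) -> list[str]:
    """Ask the policy for an action, run the tool, and feed the result back."""
    history: list[str] = []
    for _ in range(max_steps):
        tool, arg = policy(history)
        if tool == "finish":
            history.append(f"finish: {arg}")
            break
        observation = tools[tool](arg)
        history.append(f"{tool}({arg}) -> {observation}")
    return history


# Scripted stand-in for the model: picks the next action by step count.
def scripted_policy(history: list[str]) -> Action:
    script = [("read_file", "calc.py"), ("run_tests", ""), ("finish", "tests pass")]
    return script[len(history)]


# Stub tools that return canned observations instead of touching the filesystem.
tools = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda _: "1 passed",
}
trace = run_agent(scripted_policy, tools)
for line in trace:
    print(line)
```

The `max_steps` cap is a simple guard against an agent that never decides to finish.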
The original Codestral 22B launch attracted significant attention from the developer community. On Hacker News, the model's strong FIM performance and 32K context window were cited as distinguishing it from existing open-source alternatives. Practitioners who integrated it with Continue and similar tools reported favorable comparisons with GitHub Copilot for inline completion tasks.
JetBrains ran evaluation tests on Codestral using its Kotlin-HumanEval benchmark and found Codestral scoring 73.75, surpassing GPT-4-Turbo's 72.05 on that benchmark. The company subsequently integrated Codestral support into its AI coding tools.
The introduction of the Mistral AI Non-Production License with the original release generated criticism. Some open-source advocates argued that the MNPL contradicts the spirit of open-weight model releases and noted that the license terms effectively prohibit companies from using the model weights internally, even for developer productivity purposes unrelated to building AI products. Mistral addressed part of this concern over time by releasing Codestral Mamba and Devstral under the Apache 2.0 license.
Codestral 25.01's first-place position on the LMSys Copilot Arena leaderboard was widely reported as a validation of its practical usefulness for developers, since the leaderboard is based on real developer preference votes rather than automated benchmark scores.
VentureBeat and other technology publications covered the August 2025 Codestral 25.08 launch with particular interest in the combination of the full stack into a single product, noting that the simultaneous availability of completion, agentic, and embedding models under a unified enterprise offering positioned Mistral as a more direct competitor to GitHub Copilot and Cursor.
Users and reviewers have noted several consistent limitations across Codestral models:
Hardware requirements for local deployment: Running Codestral 22B locally requires hardware capable of loading a 22-billion-parameter model. In BF16 the weights alone occupy roughly 44 GB (22.2 billion parameters at 2 bytes each), so fitting the model on a single GPU with 24 GB of VRAM in practice means using a quantized build, alongside at least 32 GB of system RAM. This puts local deployment out of reach for developers on typical laptops. Codestral Mamba 7B is considerably more manageable. Devstral (24B) has requirements similar to Codestral 22B; Mistral describes it as small enough for a single Nvidia RTX 4090 or a Mac with 32 GB of RAM.
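The memory needed for weights alone can be estimated as parameter count times bits per parameter. The sketch below uses the parameter counts quoted in this article and idealized 16-, 8-, and 4-bit widths. Real quantization formats add some overhead, and activations and the KV cache need additional memory on top.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring activations and KV cache."""
    return params_billion * bits_per_param / 8


models = [("Codestral 22B", 22.2), ("Codestral Mamba 7B", 7.3), ("Devstral 24B", 24.0)]
for name, params in models:
    sizes = ", ".join(f"{bits}-bit: {weight_gb(params, bits):.1f} GB" for bits in (16, 8, 4))
    print(f"{name}: {sizes}")
```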
License restrictions on weights: The MNPL applied to Codestral 22B v0.1 restricts commercial use of the weights. This means developers who need to deploy the model on their own infrastructure for production use cannot do so freely, and must either use the API or negotiate a commercial license. The licensing terms have been a source of confusion because Mistral marketed the model as "open."
Code hallucination: Like all large language models, Codestral can generate code that looks plausible but is subtly incorrect. The model may generate function calls to APIs that do not exist, use deprecated syntax, or miss edge cases in logic. This may be less pronounced for code-specialized models such as Codestral than for general-purpose models on code tasks, but it remains a known limitation that requires careful review, particularly for security-sensitive code.
Limited transparency on training data: Mistral has not published detailed documentation about the training data composition, dataset size, or filtering methods used for any Codestral model. This makes it difficult to assess potential data quality issues or biases in the model's code generation, and limits researchers' ability to audit the model for potential training data license issues.
Benchmark saturation for simpler tasks: HumanEval and MBPP, the most commonly cited benchmarks for code models, are increasingly saturated. Newer models with fewer parameters can achieve high scores on these benchmarks while underperforming on more realistic tasks involving larger codebases, multiple files, or domain-specific libraries. RepoBench and SWE-Bench offer more realistic assessments, but fewer comparable results exist across competing models.