Voxtral
Last reviewed: May 31, 2026
Sources: 10 citations
Review status: Source-backed
Revision: v1 · 2,005 words
Voxtral is a family of open-weight speech-understanding models released by Mistral AI in July 2025. The family has two members. Voxtral Small is a 24-billion-parameter model aimed at production workloads. Voxtral Mini is built on a 3-billion-parameter-class language model, comes to about 4.7 billion parameters in total, and is meant to run on a single GPU or at the edge. Both are licensed under Apache 2.0, and Mistral also exposes a transcription-tuned variant called Voxtral Mini Transcribe through its hosted API. The models do more than turn speech into text. They handle multilingual speech recognition, translation, question answering over audio, and summarization of recordings, and they can call backend functions directly from a spoken request. [1][2]
Voxtral marked Mistral's first serious push into open audio models. Before it, the company was known mostly for its open and commercial text models. With Voxtral the same release pattern carried over to speech. The weights went up on Hugging Face, the API endpoints went live the same day, and a technical report landed on arXiv describing how the system was built and how it scored against existing models. [1][3]
Most open speech systems do one job well. OpenAI's Whisper, for example, is built for transcription and translation, and using it inside a larger application usually means bolting a separate language model on top to reason about what was said. Voxtral folds those two stages into one model. It listens to audio, and because its core is a Mistral language model, it can also answer questions about the audio, summarize it, or act on instructions spoken aloud. Mistral describes the result as a multimodal audio chat model rather than a transcription engine. [3]
That framing explains the parameter counts. At about 4.7 billion parameters, Voxtral Mini would be oversized for a pure transcription model: the largest Whisper checkpoint has around 1.5 billion parameters. Voxtral Mini is bigger because it carries a full small language model inside it. Voxtral Small is larger still, built around Mistral's 24-billion-parameter text model. The trade is straightforward: you get language reasoning over speech in a single model, at the cost of more weights to host. [3]
The two sizes target different deployments. Voxtral Small needs roughly 55 GB of GPU memory in bf16 or fp16, which puts it on data-center hardware. Voxtral Mini runs in under 10 GB, so it fits on a consumer card or a modest cloud instance. For teams that only want transcription and do not need the chat and reasoning layer, the Voxtral Mini Transcribe API endpoint strips the model down to its recognition behavior and is priced for high volume. [1][4][5]
Here is how the three offerings line up.
| Offering | Backbone | Total parameters | Form | Best for |
|---|---|---|---|---|
| Voxtral Small | Mistral Small 3.1 (24B) | 24.3B | Open weights and API | Production transcription plus audio understanding |
| Voxtral Mini | Ministral 3B | 4.7B | Open weights and API | Local and edge deployment |
| Voxtral Mini Transcribe | Voxtral Mini, transcription-tuned | 4.7B | API only | High-volume, low-cost transcription |
Voxtral Mini Transcribe is not a separate architecture. It is Voxtral Mini operated in a dedicated transcription mode and served through Mistral's API. Mistral positions it as a cheap, low-latency option that beats Whisper on accuracy while costing less than half as much per minute. The same dedicated transcription mode is available in the open weights, where it tends to maximize recognition quality and also resists instructions hidden inside the audio better than the open chat mode does. [1][6]
Voxtral splits into three stages. An audio encoder reads the sound, an adapter compresses the encoder output, and a Mistral language model consumes the result alongside any text in the prompt. [3]
The audio encoder is the Whisper large-v3 encoder, about 640 million parameters in both Voxtral sizes. It takes a log-Mel spectrogram with 128 frequency bins and a 160-sample hop length, the same front end Whisper uses. The encoder runs over the audio and produces a sequence of acoustic features. [3]
Because that feature sequence is long, an MLP adapter sits between the encoder and the language model and downsamples it by a factor of four. The adapter is small, about 25 million parameters in the Mini and 52 million in the Small. Its job is to turn audio frames into a token stream the language model can read as if it were text. [3]
The language backbone is where the two sizes differ. Voxtral Mini is built on Ministral 3B, giving a 3.6-billion-parameter decoder and a 4.7-billion-parameter model overall. Voxtral Small is built on Mistral Small 3.1, a 24-billion-parameter text model, for a 22.9-billion-parameter decoder and 24.3 billion parameters total. Both retain the text abilities of their backbone, so a Voxtral model can also answer a plain text question with no audio attached. [3][4][5]
Long audio fits because of the context window. Both models carry a 32,000-token context. In practice that lets Voxtral transcribe recordings up to about 30 minutes long and reason over audio up to about 40 minutes, while still leaving room for a multi-turn conversation in the same context. [1][3]
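The 30- and 40-minute limits can be sanity-checked with simple arithmetic. The sketch below assumes 16 kHz input audio and a stride-2 convolution in the encoder front end; both are properties of Whisper rather than figures given in this article, and the function names are illustrative. Under those assumptions each second of audio becomes 12.5 tokens for the language model.

```python
SAMPLE_RATE_HZ = 16_000   # assumption: Whisper's standard input sample rate
HOP_LENGTH = 160          # from the article: 160-sample hop length
ENCODER_STRIDE = 2        # assumption: Whisper's conv front end halves the frame rate
ADAPTER_DOWNSAMPLE = 4    # from the article: adapter downsamples by four
CONTEXT_TOKENS = 32_000   # from the article: 32,000-token context


def audio_tokens_per_second() -> float:
    """Tokens the language model sees per second of audio."""
    mel_frames = SAMPLE_RATE_HZ / HOP_LENGTH        # 100 spectrogram frames/s
    encoder_frames = mel_frames / ENCODER_STRIDE    # 50 encoder frames/s
    return encoder_frames / ADAPTER_DOWNSAMPLE      # 12.5 tokens/s


def audio_token_budget(minutes: float) -> tuple[int, int]:
    """Return (audio tokens used, context tokens left over)."""
    used = round(minutes * 60 * audio_tokens_per_second())
    return used, CONTEXT_TOKENS - used


if __name__ == "__main__":
    for minutes in (30, 40):
        used, left = audio_token_budget(minutes)
        print(f"{minutes} min of audio: {used} tokens used, {left} left")
```

On these assumptions, 30 minutes of audio takes 22,500 tokens and 40 minutes takes 30,000, so a 40-minute recording leaves roughly 2,000 tokens for the prompt and the answer.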
| Component | Voxtral Mini | Voxtral Small |
|---|---|---|
| Audio encoder | Whisper large-v3, 640M | Whisper large-v3, 640M |
| Adapter | MLP, 25M, 4x downsample | MLP, 52M, 4x downsample |
| Language backbone | Ministral 3B, 3.6B decoder | Mistral Small 3.1, 22.9B decoder |
| Total parameters | 4.7B | 24.3B |
| Context length | 32K tokens | 32K tokens |
| GPU memory (bf16/fp16) | ~9.5 GB | ~55 GB |
| Audio limit | 30 min transcribe, 40 min understand | 30 min transcribe, 40 min understand |
| License | Apache 2.0 | Apache 2.0 |
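
The GPU memory row can be related to the parameter totals with a rough weights-only estimate. bf16 and fp16 both store two bytes per parameter, so the sketch below multiplies the totals from the table by two. The function name is illustrative, and attributing the remaining gap to runtime overhead such as activations and the attention cache is an assumption, not a figure from the sources.

```python
BYTES_PER_PARAM_16BIT = 2  # bf16 and fp16 both use 2 bytes per parameter


def weight_memory_gb(total_params_billions: float) -> float:
    """Weights-only footprint in GB (10^9 bytes) at 16-bit precision."""
    return total_params_billions * BYTES_PER_PARAM_16BIT


mini_gb = weight_memory_gb(4.7)     # total parameters from the table
small_gb = weight_memory_gb(24.3)

if __name__ == "__main__":
    print(f"Voxtral Mini weights:  ~{mini_gb:.1f} GB")
    print(f"Voxtral Small weights: ~{small_gb:.1f} GB")
```

The weights alone come to about 9.4 GB and 48.6 GB, which sits just under the ~9.5 GB and ~55 GB quoted in the table.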
Transcription is the base capability. Voxtral detects the spoken language automatically and writes it out, or you can pin it to a dedicated transcription mode for the cleanest output. It is natively multilingual, with strong results in English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian, among others. [1][4]
Speech translation comes from the same model. You can ask Voxtral to render spoken audio in one language as text in another, and the technical report shows high translation scores across European language pairs. [1][3]
Audio question answering and summarization use the language backbone directly. Instead of transcribing a meeting and then feeding the transcript to a separate model, you hand Voxtral the recording and ask it questions or request a structured summary. The audio and the question travel through one model together. [1][3]
Function calling from voice is the capability that leans hardest on the chat backbone. A user can speak a request, and Voxtral can emit a structured function call to trigger a backend workflow or API, with no intermediate text step that you have to parse yourself. This makes the model usable as the front end of a voice agent. [1]
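On the application side, a structured function call still has to be routed to real code. The following is a minimal sketch of that routing step, not Mistral's API: the tool name `book_meeting`, the registry, and the shape of the call (a name plus JSON arguments) are all hypothetical and chosen only for illustration.

```python
import json
from typing import Any, Callable


def book_meeting(person: str, day: str) -> str:
    """Hypothetical backend function; a real one would call a calendar service."""
    return f"Meeting with {person} booked for {day}"


# Registry mapping tool names to the Python callables that implement them.
TOOLS: dict[str, Callable[..., Any]] = {"book_meeting": book_meeting}


def dispatch(tool_call: dict[str, Any]) -> Any:
    """Run the backend function named in a structured tool call."""
    name = tool_call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    arguments = tool_call["arguments"]
    # Assumption: arguments arrive either as a JSON string or as a dict.
    if isinstance(arguments, str):
        arguments = json.loads(arguments)
    return TOOLS[name](**arguments)


# Assumed shape of a call the model might emit for the spoken request
# "Book a meeting with Dana on Friday".
example_call = {
    "name": "book_meeting",
    "arguments": '{"person": "Dana", "day": "Friday"}',
}
result = dispatch(example_call)
```

Rejecting unknown tool names before running anything keeps the model from triggering code the application never registered.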
Text handling survives the audio training. Because the backbone is a full Mistral model, Voxtral keeps the text-understanding behavior of Mistral Small 3.1 or Ministral 3B and can be used on text-only inputs when needed. [1][4]
Mistral reports that Voxtral comprehensively outperforms Whisper large-v3 on transcription, and that it beats GPT-4o mini Transcribe and Gemini 2.5 Flash across the tasks they tested. The clearest single comparison is English short-form word error rate (WER), where lower is better. On that measure Voxtral Small leads, while Voxtral Mini sits within 0.02 points of Whisper large-v3. [1][3]
| Model | English short-form WER (%) |
|---|---|
| Voxtral Small | 1.53 |
| Whisper large-v3 | 1.84 |
| Voxtral Mini | 1.86 |
| GPT-4o mini Transcribe | 1.92 |
| Gemini 2.5 Flash | 2.97 |
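
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a plain implementation of that definition. It is not the evaluation code behind the table, and it skips the text normalization (casing, punctuation) that benchmark pipelines typically apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 1.53 percent, as in the first row, corresponds to roughly one or two word errors per hundred reference words.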
On the multilingual FLEURS benchmark, Voxtral Small averages around 4 percent word error rate across the languages Mistral evaluated. The per-language figures show where it is strongest and weakest. [3]
| FLEURS language | Voxtral Small WER (%) |
|---|---|
| Italian | 2.62 |
| Spanish | 2.72 |
| English | 3.35 |
| German | 3.38 |
| Portuguese | 3.79 |
| French | 4.03 |
| Dutch | 4.94 |
| Hindi | 7.69 |
| Arabic | 13.44 |
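
The "around 4 percent" average can be checked against the rows above. This short sketch takes unweighted means of the table values, once over all nine languages and once without Arabic. Reading the headline figure as the eight-language mean is an inference from the arithmetic, not something the sources state.

```python
from statistics import mean

# Voxtral Small FLEURS WER (%) copied from the table above.
fleurs_wer = {
    "Italian": 2.62,
    "Spanish": 2.72,
    "English": 3.35,
    "German": 3.38,
    "Portuguese": 3.79,
    "French": 4.03,
    "Dutch": 4.94,
    "Hindi": 7.69,
    "Arabic": 13.44,
}

mean_all = mean(list(fleurs_wer.values()))
mean_without_arabic = mean(
    [wer for language, wer in fleurs_wer.items() if language != "Arabic"]
)

if __name__ == "__main__":
    print(f"All nine languages: {mean_all:.2f}%")
    print(f"Without Arabic:     {mean_without_arabic:.2f}%")
```

The nine-language mean is about 5.1 percent, while the eight languages other than Arabic average about 4.1 percent, which is the closer match to the figure quoted in the text.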
For speech translation the report uses BLEU, where higher is better, and Voxtral Small posts strong scores on English-to-European-language pairs. [3]
| Translation pair | Voxtral Small BLEU |
|---|---|
| English to French | 57.3 |
| English to German | 47.0 |
| English to Spanish | 39.9 |
| English to Italian | 39.9 |
Audio understanding is measured with spoken-question variants of standard text benchmarks. Voxtral Small reaches 88.4 percent on OpenBook QA, 74.3 percent on MMLU, 71.7 percent on a Llama-style QA set, and 89.7 percent on GSM8K, which shows the backbone's reasoning carrying through to audio inputs. [3]
The report also gives a multilingual transcription summary. Averaged across FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech, Voxtral Small lands a mean word error rate near 6.6 percent and Voxtral Mini near 7.1 percent, both ahead of the open and closed baselines Mistral compared against. [3][4][5]
Both open models ship under Apache 2.0, which allows commercial use, modification, and redistribution. The weights for Voxtral Small and Voxtral Mini are on Hugging Face, and the same models are reachable through Mistral's API. [1][4][5]
On the hosted side, Mistral lists API pricing starting at $0.001 per minute of audio. The company frames Voxtral Mini Transcribe as costing less than half what comparable commercial transcription services charge while still beating Whisper on accuracy, and it says Voxtral Small matches ElevenLabs Scribe on quality, again for less than half the price. Voxtral is also wired into Le Chat, Mistral's assistant, through a voice mode on web and mobile. [1]
The split between the two sizes maps onto real deployments. Voxtral Mini and the Mini Transcribe endpoint suit high-volume transcription pipelines, call-center logging, captioning, and on-device or edge use where a small footprint matters. Voxtral Small fits applications that need to reason about audio, such as meeting summarization, voice search, spoken question answering over recordings, and voice agents that trigger actions through function calls. Because the models keep their text abilities, a single Voxtral deployment can also serve text requests, which simplifies a stack that would otherwise run separate speech and language models. [1][3]
Voxtral inherits the limits of its parts. Accuracy drops on lower-resource languages, as the Arabic and Hindi FLEURS figures show next to the European numbers. The audio length cap of roughly 30 to 40 minutes means very long recordings have to be chunked before processing, and the 32K context is shared between audio and any conversation, so heavy use of one leaves less room for the other. [3]
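Chunking a long recording is mostly bookkeeping. The sketch below splits a duration into spans no longer than 30 minutes, with a small overlap so that words cut at a boundary appear whole in at least one chunk. The function name and the 5-second overlap are illustrative choices, not values from Mistral's documentation.

```python
def chunk_spans(
    total_seconds: float,
    max_chunk_seconds: float = 30 * 60,
    overlap_seconds: float = 5.0,
) -> list[tuple[float, float]]:
    """Split [0, total_seconds] into (start, end) spans of bounded length."""
    if not 0 <= overlap_seconds < max_chunk_seconds:
        raise ValueError("overlap must be non-negative and shorter than a chunk")
    spans: list[tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        # Step back by the overlap so the next chunk repeats the boundary.
        start = end - overlap_seconds
    return spans


if __name__ == "__main__":
    # A 75-minute recording split into chunks of at most 30 minutes.
    for start, end in chunk_spans(75 * 60):
        print(f"{start:7.1f}s -> {end:7.1f}s")
```

Transcripts from overlapping chunks need deduplication at the seams, which this sketch leaves to the caller.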
There is also a security wrinkle. Because the open chat models follow instructions, they can follow instructions spoken inside the audio itself, which is a prompt-injection risk for applications that feed in untrusted recordings. Mistral's dedicated transcription mode and the Mini Transcribe endpoint are more resistant to this, since they are tuned to transcribe rather than to obey, so safety-sensitive transcription jobs are better served by those paths than by the open chat model. [3][6]