Voxtral

Updated Jun 28, 2026

Voxtral is a family of open-weight speech-understanding models released by Mistral AI on July 15, 2025, under the Apache 2.0 license. It ships in two sizes, Voxtral Small (about 24 billion parameters, for production) and Voxtral Mini (built on a 3-billion-parameter-class backbone, 4.7 billion parameters in total, for edge and local use), plus a transcription-tuned API endpoint called Voxtral Mini Transcribe. Unlike a pure transcription engine, Voxtral both turns speech into text and reasons over audio. It handles multilingual speech recognition, translation, summarization, spoken question answering, and function calling directly from a voice request. Its 32,000-token context covers audio up to roughly 40 minutes long. Mistral describes it as "open, affordable, and production-ready speech understanding for everyone," and lists hosted transcription from $0.001 per minute of audio. ^[1]^[3]

Voxtral marked Mistral's first major release of open audio models. Before it, the company was known mostly for its open and commercial text models, and with Voxtral the same release pattern carried over to speech: the weights went up on Hugging Face, the API endpoints went live the same day, and a technical report landed on arXiv describing how the system was built and how it scored against existing models. The report opens by stating, "We present Voxtral Mini and Voxtral Small, two multimodal audio chat models." ^[1]^[3]

What is Voxtral?

Most open speech systems do one job well. OpenAI's Whisper, for example, is built for transcription and translation, and using it inside a larger application usually means bolting a separate language model on top to reason about what was said. Voxtral folds those two stages into one model. It listens to audio, and because its core is a Mistral language model, it can also answer questions about the audio, summarize it, or act on instructions spoken aloud. Mistral describes the result as a multimodal audio chat model rather than a transcription engine. ^[3]

That framing matters for the parameter sizes. A 3-billion-parameter transcription model would be large by Whisper standards, where the biggest checkpoint is around 1.5 billion parameters. Voxtral Mini is bigger because it carries a full small language model inside it. Voxtral Small is larger still, built around Mistral's 24-billion-parameter text model. The trade-off is straightforward: you get language reasoning over speech from a single model, at the cost of more weights to host. ^[3]

The two sizes target different deployments. Voxtral Small needs roughly 55 GB of GPU memory in bf16 or fp16, which puts it on data-center hardware. Voxtral Mini runs in under 10 GB, so it fits on a consumer card or a modest cloud instance. For teams that only want transcription and do not need the chat and reasoning layer, the Voxtral Mini Transcribe API endpoint strips the model down to its recognition behavior and is priced for high volume. ^[1]^[4]^[5]
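The memory figures can be sanity-checked with a little arithmetic. The sketch below is only a rough estimate: it takes the total parameter counts quoted in this article (4.7 billion and 24.3 billion) and multiplies by the two bytes that a bf16 or fp16 weight occupies. The function name is made up for illustration, and the gap between the result and the quoted requirements is assumed, not confirmed by the sources, to come from activations, the attention cache, and runtime overhead.

```python
BYTES_PER_PARAM_16BIT = 2  # bf16 and fp16 both store each weight in two bytes


def weight_memory_gb(total_params: float) -> float:
    """Memory taken by the weights alone, in decimal gigabytes."""
    return total_params * BYTES_PER_PARAM_16BIT / 1e9


# Total parameter counts as quoted in this article.
voxtral_mini_gb = weight_memory_gb(4.7e9)
voxtral_small_gb = weight_memory_gb(24.3e9)

print(f"Voxtral Mini weights:  about {voxtral_mini_gb:.1f} GB")
print(f"Voxtral Small weights: about {voxtral_small_gb:.1f} GB")
```

Under these assumptions the weights alone come to roughly 9.4 GB and 48.6 GB, which is consistent with "under 10 GB" for Voxtral Mini and leaves a few gigabytes of headroom inside the roughly 55 GB quoted for Voxtral Small.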

What are the Voxtral model sizes and variants?

Voxtral comes in two open-weight sizes plus one hosted transcription endpoint. Here is how the three offerings line up.

Offering	Backbone	Total parameters	Form	Best for
Voxtral Small	Mistral Small 3.1 (24B)	24.3B	Open weights and API	Production transcription plus audio understanding
Voxtral Mini	Ministral 3B	4.7B	Open weights and API	Local and edge deployment
Voxtral Mini Transcribe	Voxtral Mini, transcription-tuned	4.7B	API only	High-volume, low-cost transcription

Voxtral Mini Transcribe is not a separate architecture. It is Voxtral Mini operated in a dedicated transcription mode and served through Mistral's API. Mistral positions it as a cheap, low-latency option that beats Whisper on accuracy while costing less than half as much per minute. The same dedicated transcription mode is available in the open weights, where it tends to maximize recognition quality and also resists instructions hidden inside the audio better than the open chat mode does. ^[1]^[6]

How is Voxtral built?

Voxtral splits into three stages. An audio encoder reads the sound, an adapter compresses the encoder output, and a Mistral language model consumes the result alongside any text in the prompt. ^[3]

The audio encoder is the Whisper large-v3 encoder, about 640 million parameters in both Voxtral sizes. It takes a log-Mel spectrogram with 128 frequency bins and a 160-sample hop length, the same front end Whisper uses. The encoder runs over the audio and produces a sequence of acoustic features. ^[3]

Because that feature sequence is long, an MLP adapter sits between the encoder and the language model and downsamples it by a factor of four. The adapter is small, about 25 million parameters in the Mini and 52 million in the Small. Its job is to turn audio frames into a token stream the language model can read as if it were text. ^[3]
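To see why the downsampling matters, it helps to follow the frame rate through the pipeline. The sketch below uses the 160-sample hop length and the 4x adapter factor from the text. Two values are assumptions taken from Whisper's usual setup and are not stated in this article: a 16 kHz input sampling rate, and a 2x temporal reduction inside the encoder.

```python
SAMPLE_RATE_HZ = 16_000   # assumption: Whisper's usual input sampling rate
HOP_LENGTH = 160          # from the text: samples between spectrogram frames
ENCODER_STRIDE = 2        # assumption: the Whisper encoder halves the frame rate
ADAPTER_DOWNSAMPLE = 4    # from the text: the MLP adapter downsamples by four


def frame_rates() -> dict:
    """Frames per second of audio at each stage of the pipeline."""
    mel = SAMPLE_RATE_HZ / HOP_LENGTH
    encoder = mel / ENCODER_STRIDE
    language_model = encoder / ADAPTER_DOWNSAMPLE
    return {"mel": mel, "encoder": encoder, "language_model": language_model}


rates = frame_rates()
for stage, hz in rates.items():
    print(f"{stage}: {hz} frames per second")
```

With these assumed values the language model sees 12.5 audio tokens per second instead of the 50 per second the encoder would otherwise hand over.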

The language backbone is where the two sizes differ. Voxtral Mini is built on Ministral 3B, giving a 3.6-billion-parameter decoder and a 4.7-billion-parameter model overall. Voxtral Small is built on Mistral Small 3.1, a 24-billion-parameter text model, for a 22.9-billion-parameter decoder and 24.3 billion parameters total. Both retain the text abilities of their backbone, so a Voxtral model can also answer a plain text question with no audio attached. ^[3]^[4]^[5]

Long audio fits because of the context window. Both models carry a 32,000-token context. In its technical report Mistral states that the "32K context window enables the model to handle audio files up to 40 minutes in duration." In practice that lets Voxtral transcribe recordings up to about 30 minutes long and reason over audio up to about 40 minutes, while still leaving room for a multi-turn conversation in the same context. ^[1]^[3]
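The 40-minute figure can be checked against the context size. The sketch below assumes a rate of 12.5 audio tokens per second, which is an assumption derived from Whisper's 50 Hz encoder output and the 4x adapter rather than a number stated in this article, and it treats the context as exactly 32,000 tokens. The helper names are hypothetical.

```python
import math

CONTEXT_TOKENS = 32_000          # context length as quoted in this article
AUDIO_TOKENS_PER_SECOND = 12.5   # assumption: 50 Hz encoder output / 4x adapter


def audio_tokens(minutes: float) -> int:
    """Approximate number of context tokens consumed by the audio."""
    return math.ceil(minutes * 60 * AUDIO_TOKENS_PER_SECOND)


def tokens_left_for_text(minutes: float) -> int:
    """Context remaining for prompts, answers, and conversation turns."""
    return CONTEXT_TOKENS - audio_tokens(minutes)


for minutes in (30, 40):
    print(minutes, "min:", audio_tokens(minutes), "audio tokens,",
          tokens_left_for_text(minutes), "left for text")
```

Under these assumptions, 40 minutes of audio uses about 30,000 tokens and leaves about 2,000 for text, while a 30-minute recording leaves about 9,500, which may be why the transcription limit is set lower than the understanding limit.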

Component	Voxtral Mini	Voxtral Small
Audio encoder	Whisper large-v3, 640M	Whisper large-v3, 640M
Adapter	MLP, 25M, 4x downsample	MLP, 52M, 4x downsample
Language backbone	Ministral 3B, 3.6B decoder	Mistral Small 3.1, 22.9B decoder
Total parameters	4.7B	24.3B
Context length	32K tokens	32K tokens
GPU memory (bf16/fp16)	~9.5 GB	~55 GB
Audio limit	30 min transcribe, 40 min understand	30 min transcribe, 40 min understand
License	Apache 2.0	Apache 2.0
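The component sizes in the table do not quite add up to the totals, and a short calculation shows by how much. The sketch below sums the encoder, adapter, and decoder figures from the table, in millions of parameters, and reports the remainder. Attributing that remainder to token embeddings and rounding is an assumption, since the table does not list them.

```python
# Component sizes from the table above, in millions of parameters.
COMPONENTS = {
    "Voxtral Mini": {"encoder": 640, "adapter": 25, "decoder": 3_600, "total": 4_700},
    "Voxtral Small": {"encoder": 640, "adapter": 52, "decoder": 22_900, "total": 24_300},
}


def unlisted_params_millions(model: str) -> int:
    """Total minus the listed components, in millions of parameters."""
    parts = COMPONENTS[model]
    listed = parts["encoder"] + parts["adapter"] + parts["decoder"]
    return parts["total"] - listed


for name in COMPONENTS:
    print(f"{name}: about {unlisted_params_millions(name)}M parameters not itemized")
```

The leftover is roughly 435 million parameters for Voxtral Mini and 708 million for Voxtral Small.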

What can Voxtral do?

Transcription is the base capability. Voxtral detects the spoken language automatically and writes it out, or you can pin it to a dedicated transcription mode for the cleanest output. It is natively multilingual, with strong results in English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian, among others. ^[1]^[4]

Speech translation comes from the same model. You can ask Voxtral to render spoken audio in one language as text in another, and the technical report shows high translation scores across European language pairs. ^[1]^[3]

Audio question answering and summarization use the language backbone directly. Instead of transcribing a meeting and then feeding the transcript to a separate model, you hand Voxtral the recording and ask it questions or request a structured summary. The audio and the question travel through one model together. ^[1]^[3]

Function calling from voice is the capability that leans hardest on the chat backbone. A user can speak a request, and Voxtral can emit a structured function call to trigger a backend workflow or API, with no intermediate text step that you have to parse yourself. This makes the model usable as the front end of a voice agent. ^[1]
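On the application side, a structured function call still has to be routed to real code. The sketch below shows one way that routing might look. Everything in it is hypothetical: the `set_timer` tool, the schema, and the shape of the call (a name plus a JSON string of arguments) are illustrative assumptions and not Mistral's documented response format.

```python
import json
from typing import Any, Callable


def set_timer(minutes: int) -> str:
    """Hypothetical backend action triggered by a spoken request."""
    return f"timer set for {minutes} min"


# Hypothetical tool description in JSON-schema style, as a model might be given.
SET_TIMER_SCHEMA = {
    "name": "set_timer",
    "description": "Start a countdown timer.",
    "parameters": {
        "type": "object",
        "properties": {"minutes": {"type": "integer"}},
        "required": ["minutes"],
    },
}

TOOLS: dict[str, Callable[..., Any]] = {"set_timer": set_timer}


def dispatch(call: dict) -> Any:
    """Run the tool named in a function call (assumed shape: name + JSON args)."""
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    arguments = json.loads(call["arguments"])
    return TOOLS[name](**arguments)


# What a call emitted for "set a timer for ten minutes" might look like.
example_call = {"name": "set_timer", "arguments": '{"minutes": 10}'}
print(dispatch(example_call))
```

Keeping an explicit allow-list of tools, as `TOOLS` does here, means the application decides what a spoken request is able to trigger.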

Text handling survives the audio training. Because the backbone is a full Mistral model, Voxtral keeps the text-understanding behavior of Mistral Small 3.1 or Ministral 3B and can be used on text-only inputs when needed. ^[1]^[4]

How does Voxtral perform on benchmarks?

Mistral reports that Voxtral comprehensively outperforms Whisper large-v3 on transcription, and that it beats GPT-4o mini Transcribe and Gemini 2.5 Flash across the tasks they tested. The clearest single comparison is English short-form word error rate (WER), where lower is better. In the figures below, Voxtral Small leads, while Voxtral Mini sits within 0.02 points of Whisper large-v3. ^[1]^[3]

Model	English short-form WER (%)
Voxtral Small	1.53
Whisper large-v3	1.84
Voxtral Mini	1.86
GPT-4o mini Transcribe	1.92
Gemini 2.5 Flash	2.97
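For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. The sketch below is a minimal implementation of that definition using a standard edit-distance table. It is not the evaluation code behind the figures above, and published benchmarks typically normalize casing and punctuation before scoring, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.1%}")
```

In this example one word is substituted and one is dropped, so two errors over six reference words give a WER of about 33 percent. A score of 1.53 percent corresponds to roughly one or two errors per hundred words.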

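On the multilingual FLEURS benchmark, Voxtral Small's word error rate ranges from about 2.6 percent on Italian to about 13.4 percent on Arabic among the languages listed here. The eight languages other than Arabic average around 4 percent. The per-language figures show where it is strongest and weakest. ^[3]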

FLEURS language	Voxtral Small WER (%)
Italian	2.62
Spanish	2.72
English	3.35
German	3.38
Portuguese	3.79
French	4.03
Dutch	4.94
Hindi	7.69
Arabic	13.44
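Averages over a table like this are sensitive to outliers, so it is worth computing them directly. The sketch below takes the nine figures from the table above and compares the mean over all of them with the mean excluding Arabic. It only summarizes the numbers quoted here and makes no claim about which languages Mistral included in its own averages.

```python
from statistics import mean

# Voxtral Small FLEURS word error rates (%) as listed in the table above.
FLEURS_WER = {
    "Italian": 2.62,
    "Spanish": 2.72,
    "English": 3.35,
    "German": 3.38,
    "Portuguese": 3.79,
    "French": 4.03,
    "Dutch": 4.94,
    "Hindi": 7.69,
    "Arabic": 13.44,
}

mean_all = mean(FLEURS_WER.values())
mean_without_arabic = mean(
    wer for language, wer in FLEURS_WER.items() if language != "Arabic"
)

print(f"Mean over all nine languages: {mean_all:.2f}%")
print(f"Mean without Arabic:          {mean_without_arabic:.2f}%")
```

The nine listed figures average about 5.1 percent, and dropping Arabic brings the mean down to about 4.1 percent.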

For speech translation the report uses BLEU, where higher is better, and Voxtral Small posts strong scores on English-to-European-language pairs. ^[3]

Translation pair	Voxtral Small BLEU
English to French	57.3
English to German	47.0
English to Spanish	39.9
English to Italian	39.9

Audio understanding is measured with spoken-question variants of standard text benchmarks. Voxtral Small reaches 88.4 percent on OpenBook QA, 74.3 percent on MMLU, 71.7 percent on a Llama-style QA set, and 89.7 percent on GSM8K, which shows the backbone's reasoning carrying through to audio inputs. ^[3]

The report also gives a multilingual transcription summary. Averaged across FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech, Voxtral Small lands a mean word error rate near 6.6 percent and Voxtral Mini near 7.1 percent, both ahead of the open and closed baselines Mistral compared against. ^[3]^[4]^[5]

Is Voxtral open source, and how much does it cost?

Both open models ship under Apache 2.0, which allows commercial use, modification, and redistribution. The weights for Voxtral Small and Voxtral Mini are on Hugging Face, and the same models are reachable through Mistral's API. ^[1]^[4]^[5]

On the hosted side, Mistral lists API pricing starting at $0.001 per minute of audio. The company frames Voxtral Mini Transcribe as costing less than half what comparable commercial transcription services charge while still beating Whisper on accuracy, and it says Voxtral Small matches ElevenLabs Scribe on quality, again for less than half the price. Voxtral is also wired into Le Chat, Mistral's assistant, through a voice mode on web and mobile. ^[1]
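The per-minute price is easier to judge at scale. The sketch below estimates a lower bound on the bill for a batch of audio at the $0.001 per minute starting price quoted above. Because that is a "starting at" figure, the actual price of a given endpoint may be higher, and the function name is hypothetical.

```python
from decimal import Decimal

# Starting price quoted in this article; individual endpoints may cost more.
STARTING_PRICE_PER_MINUTE_USD = Decimal("0.001")


def minimum_cost_usd(audio_minutes: int) -> Decimal:
    """Lower-bound cost of transcribing the given number of audio minutes."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    return STARTING_PRICE_PER_MINUTE_USD * audio_minutes


one_thousand_hours = 1_000 * 60  # minutes
print(f"1,000 hours of audio: at least ${minimum_cost_usd(one_thousand_hours):.2f}")
```

At the starting price, a thousand hours of audio comes to at least 60 US dollars.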

What is Voxtral used for?

The split between the two sizes maps onto real deployments. Voxtral Mini and the Mini Transcribe endpoint suit high-volume transcription pipelines, call-center logging, captioning, and on-device or edge use where a small footprint matters. Voxtral Small fits applications that need to reason about audio, such as meeting summarization, voice search, spoken question answering over recordings, and voice agents that trigger actions through function calls. Because the models keep their text abilities, a single Voxtral deployment can also serve text requests, which simplifies a stack that would otherwise run separate speech and language models. ^[1]^[3]

What are Voxtral's limitations?

Voxtral inherits the limits of its parts. Accuracy drops on lower-resource languages, as the Arabic and Hindi FLEURS figures show next to the European numbers. The audio length cap of roughly 30 to 40 minutes means very long recordings have to be chunked before processing, and the 32K context is shared between audio and any conversation, so heavy use of one leaves less room for the other. ^[3]
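Chunking a long recording is mostly bookkeeping. The sketch below splits a duration into spans no longer than the 30-minute transcription limit mentioned above, with a small overlap so that words at a boundary are not cut in half. The five-second overlap and the function name are illustrative assumptions, and the sketch only computes time spans without touching audio data.

```python
def chunk_spans(
    total_seconds: float,
    max_chunk_seconds: float = 30 * 60,  # transcription limit from this article
    overlap_seconds: float = 5.0,        # hypothetical overlap between chunks
) -> list:
    """Return (start, end) spans in seconds that cover the whole recording."""
    if overlap_seconds >= max_chunk_seconds:
        raise ValueError("overlap must be shorter than the chunk length")

    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        # Step back slightly so consecutive chunks share a short overlap.
        start = end - overlap_seconds
    return spans


two_hour_recording = chunk_spans(2 * 60 * 60)
for start, end in two_hour_recording:
    print(f"{start:8.1f} s to {end:8.1f} s")
```

With these defaults a two-hour recording becomes five spans. The overlapping text then has to be de-duplicated when the transcripts are stitched back together, which this sketch leaves out.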

There is also a security wrinkle. Because the open chat models follow instructions, they can follow instructions spoken inside the audio itself, which is a prompt-injection risk for applications that feed in untrusted recordings. Mistral's dedicated transcription mode and the Mini Transcribe endpoint are more resistant to this, since they are tuned to transcribe rather than to obey, so safety-sensitive transcription jobs are better served by those paths than by the open chat model. ^[3]^[6]

References

1. Mistral AI. "Voxtral." Mistral AI News, July 15, 2025. https://mistral.ai/news/voxtral
2. Mistral AI. "Voxtral models." Mistral AI Documentation. https://docs.mistral.ai/getting-started/models/models_overview/
3. Liu, Alexander H., et al. "Voxtral." arXiv preprint arXiv:2507.13264, July 2025. https://arxiv.org/abs/2507.13264
4. Mistral AI. "Voxtral-Small-24B-2507." Hugging Face. https://huggingface.co/mistralai/Voxtral-Small-24B-2507
5. Mistral AI. "Voxtral-Mini-3B-2507." Hugging Face. https://huggingface.co/mistralai/Voxtral-Mini-3B-2507
6. Willison, Simon. "Voxtral." Simon Willison's Weblog, July 16, 2025. https://simonwillison.net/2025/Jul/16/voxtral/
7. Mistral AI. "Voxtral technical report (HTML)." arXiv, July 2025. https://arxiv.org/html/2507.13264v1
8. OpenAI. "Introducing Whisper." OpenAI, September 21, 2022. https://openai.com/index/whisper/
9. Mistral AI. "Mistral Small 3." Mistral AI News, January 30, 2025. https://mistral.ai/news/mistral-small-3
10. Mistral AI. "Un Ministral, des Ministraux." Mistral AI News, October 16, 2024. https://mistral.ai/news/ministraux
