Whisper
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v10 · 7,004 words
Whisper is an open-source family of automatic speech recognition (ASR) models developed by OpenAI and first released on September 21, 2022.[^1] Built on an encoder-decoder Transformer architecture and trained on roughly 680,000 hours of multilingual, multitask supervised audio collected from the web, Whisper performs multilingual transcription, speech translation into English, language identification, and timestamp prediction within a single model.[^2] OpenAI released the model code and the pretrained weights under the permissive MIT License, making Whisper one of the few major OpenAI systems to be genuinely open: anyone may download, redistribute, fine-tune, or commercialize the weights.[^3]
The original September 2022 release shipped five model sizes (tiny at 39M parameters, base at 74M, small at 244M, medium at 769M, and large at 1,550M) with English-only .en variants for the smaller sizes.[^1] OpenAI subsequently released large-v2 (December 8, 2022) with improved training and regularization,[^4] large-v3 (November 6, 2023, at OpenAI DevDay) trained on roughly five million hours of audio and using 128 Mel frequency bins instead of 80,[^5] and large-v3-turbo (October 1, 2024), an 809M-parameter model with only four decoder layers that delivers up to roughly eight times faster transcription than large-v3 while remaining close to its accuracy.[^6]
Whisper also runs as a paid hosted service on the OpenAI API under the whisper-1 endpoint, distinct from the open-weight checkpoints.[^7] In March 2025, OpenAI introduced two successor speech-to-text endpoints, gpt-4o-transcribe and gpt-4o-mini-transcribe, built on the GPT-4o architecture and reporting lower word error rates and approximately 90% fewer hallucinations than Whisper large-v2.[^8] Despite this, the open-weight Whisper checkpoints remain dominant in self-hosted and on-device speech recognition through community projects such as whisper.cpp, faster-whisper, WhisperX, and Distil-Whisper.[^9][^10][^11][^12]
Whisper's most discussed limitation is a tendency to hallucinate fabricated text, particularly during silent or low-speech audio. Peer-reviewed work presented at ACM FAccT 2024 (Koenecke et al., "Careless Whisper"), an Associated Press investigation in October 2024, and follow-up academic research have documented hallucinations in medical-transcription deployments and prompted renewed scrutiny of high-stakes ASR use.[^13][^14][^15]
Speech recognition has been an active research area for over half a century, but practical systems prior to Whisper were typically built from carefully curated, hand-transcribed corpora and trained with task-specific objectives such as connectionist temporal classification (CTC) or hybrid hidden Markov model and deep neural network (HMM-DNN) pipelines.[^2] These models often performed well on the data distribution they were trained on (for example, the read-speech audiobook recordings of LibriSpeech) but generalized poorly to noisy or accented audio outside the training distribution.[^2]
By the early 2020s, large-scale self-supervised speech representation learning (wav2vec 2.0, HuBERT, WavLM) had become the dominant pretraining paradigm. These models used unlabeled audio to learn rich acoustic representations, then required a labeled fine-tuning stage on each downstream language and task. Whisper inverted this recipe: instead of self-supervised pretraining followed by supervised fine-tuning, Whisper used a single supervised objective on a vastly larger, more heterogeneous, and weakly labeled web-scale corpus.[^2]
The Whisper paper, "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever (arXiv:2212.04356), released on December 6, 2022 and later published in ICML 2023, formalized the approach.[^2] Rather than training on small, gold-standard datasets, the authors scraped roughly 680,000 hours of audio from the internet paired with whatever transcripts were available: closed captions, fan-made subtitles, and web text aligned with audio. The transcripts were not hand-verified; they were what the machine learning community calls weakly supervised labels.[^2] The bet was that scale and diversity would more than compensate for label noise, producing a model that was robust to real-world conditions even if it was not always the absolute best on any single benchmark.
This thesis turned out to be largely correct. On standard benchmarks like LibriSpeech test-clean, Whisper performed competitively but did not surpass narrowly specialized fine-tuned models. On out-of-distribution evaluations spanning conversational speech, accented English, technical dictation, and noisy recordings, however, Whisper's zero-shot performance was substantially more robust, averaging roughly half the error rates of supervised models specialized to a single corpus.[^2] The release immediately collapsed a previously fragmented landscape of per-language ASR APIs into a single permissively licensed checkpoint that downstream developers could run, modify, fine-tune, or redistribute.[^16]
The Whisper authors framed the work as a continuation of OpenAI's research line on broad pretraining (the GPT series for text, CLIP for image-text alignment, Whisper for audio-text alignment).[^2] In that sense Whisper is best read alongside CLIP: both were published with open weights, both treated the web as a noisy but enormous supervised dataset, and both demonstrated that scale plus weak labels can outperform smaller curated datasets for tasks where coverage matters more than label purity.
Whisper uses a standard encoder-decoder Transformer (Vaswani et al., 2017), adapted to consume audio rather than text.[^2] The architecture is conceptually similar to the sequence-to-sequence models used in neural machine translation. It can be divided into three stages: audio preprocessing, an audio encoder, and a text-generating autoregressive decoder.
Raw audio is resampled to 16,000 Hz mono and split into non-overlapping 30-second segments. Each 30-second segment is converted into a log-magnitude Mel spectrogram: the original five sizes and large-v2 use 80 Mel frequency bins, while large-v3 and large-v3-turbo use 128 bins for finer frequency resolution.[^5] The spectrogram is computed with 25-millisecond analysis windows and a 10-millisecond hop. For each 30-second segment, the resulting input tensor has shape (80, 3000), or (128, 3000) for large-v3-class models.[^2] Audio shorter than 30 seconds is padded with zeros so that every input is fixed-length, which simplifies batching at the cost of wasted compute on short clips. The input values are then globally scaled to lie roughly between -1 and 1 with approximately zero mean; in the Hugging Face transformers implementation this step is performed by WhisperFeatureExtractor.
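The fixed-length framing described above can be checked with a few lines of arithmetic. This is a minimal sketch, not the reference preprocessing code: the constant and function names are hypothetical, and it pads a plain Python list instead of computing a real spectrogram.

```python
# Framing constants taken from the description above.
SAMPLE_RATE = 16_000   # Hz, mono
CHUNK_SECONDS = 30     # every segment is exactly 30 seconds
WINDOW_MS = 25         # analysis window length
HOP_MS = 10            # hop between successive frames

N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS            # samples per segment
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_MS // 1000   # samples per window
HOP_SAMPLES = SAMPLE_RATE * HOP_MS // 1000         # samples per hop
N_FRAMES = N_SAMPLES // HOP_SAMPLES                # spectrogram frames per segment


def pad_or_trim(samples, length=N_SAMPLES):
    """Zero-pad short audio, or truncate long audio, to exactly `length` samples."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


def input_shape(n_mels):
    """Shape of the encoder input for a model with `n_mels` Mel bins."""
    return (n_mels, N_FRAMES)
```

With these constants a 30-second segment holds 480,000 samples and 3,000 frames, so `input_shape(80)` and `input_shape(128)` reproduce the two tensor shapes quoted above.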
The spectrogram first passes through a small convolutional stem consisting of two 1D convolution layers with a filter width of three, GELU activation, and a stride-2 second layer that downsamples the time dimension by half.[^2] Sinusoidal position embeddings are then added before the features enter a stack of Transformer encoder blocks. Each block contains multi-head self-attention and a feed-forward network with residual connections and pre-layer normalization. The encoder width and depth vary by model size, ranging from four blocks at width 384 in the tiny model up to 32 blocks at width 1,280 in the large model.[^2]
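The effect of the convolutional stem on sequence length can be worked out with the standard 1D convolution output-length formula. The sketch below assumes a padding of 1 on both layers, which is not stated in the text above; the helper name is hypothetical.

```python
def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a 1D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


frames = 3000  # spectrogram frames in one 30-second segment

# First layer: filter width 3, stride 1 (padding of 1 is an assumption).
after_conv1 = conv1d_out_len(frames, kernel=3, stride=1, padding=1)

# Second layer: filter width 3, stride 2, which halves the time dimension.
after_conv2 = conv1d_out_len(after_conv1, kernel=3, stride=2, padding=1)

# Each encoder position therefore spans this many milliseconds of audio.
ms_per_position = 30_000 / after_conv2
```

Under that assumption the encoder sees 1,500 positions per segment, each covering 20 ms of audio.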
The decoder is an autoregressive Transformer of the same width and depth as the encoder, with the notable exception of large-v3-turbo, which retains the 32-layer encoder but truncates the decoder to four layers.[^6] It uses learned positional embeddings and tied input/output token embeddings. Whisper's tokenizer is a byte-pair encoding (BPE) scheme: the English-only models reuse the GPT-2 tokenizer (50,257 tokens), while the multilingual models extend the vocabulary to 51,865 tokens to cover non-English text adequately.[^2] Cross-attention from the decoder to the encoder is performed once per decoder block; the encoder outputs are cached for the duration of decoding the segment.
A distinctive feature of Whisper is that a single model handles transcription, translation, language identification, voice activity detection, and timestamp prediction. This is achieved by prepending special control tokens to the decoder input:[^2]
- A language token (for example <|en|>, <|fr|>, <|zh|>), one for each of the 99 supported languages (100 in large-v3, which added Cantonese).
- A task token: <|transcribe|> for same-language transcription, or <|translate|> for translation into English.
- A timestamp control token: <|notimestamps|> suppresses timestamp prediction; when absent, the decoder emits timestamp tokens quantized to 20-millisecond intervals.
- A <|nospeech|> token used to indicate that a segment contains no detectable speech.[^2]

By selecting the appropriate prefix at inference time, an application can switch the same checkpoint between, for example, English-to-English transcription with timestamps and Spanish-to-English translation without timestamps. The model's vocabulary thus includes the BPE subword tokens, 99 (or 100) language tokens, 2 task tokens, 1 nospeech token, and roughly 1,500 timestamp tokens covering the 30-second window at 20 ms granularity.[^17]
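The prefix selection described above can be sketched as plain string manipulation. This is an illustration only: the function names are hypothetical, the tokens are kept as strings instead of vocabulary ids, and the leading <|startoftranscript|> marker is an assumption not mentioned in the text above.

```python
def build_decoder_prefix(language, task="transcribe", timestamps=True):
    """Return the control tokens placed at the start of the decoder input."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    # Assumed start marker, followed by the language and task tokens.
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


TIMESTAMP_STEP_MS = 20   # timestamp tokens are quantized to 20 ms
WINDOW_MS = 30_000       # one 30-second segment

# Timestamp positions from 0.00 s to 30.00 s inclusive.
N_TIMESTAMP_TOKENS = WINDOW_MS // TIMESTAMP_STEP_MS + 1


def timestamp_token_to_seconds(index):
    """Convert the index of a timestamp token into seconds within the window."""
    return index * TIMESTAMP_STEP_MS / 1000
```

For example, `build_decoder_prefix("es", task="translate", timestamps=False)` yields the prefix for Spanish-to-English translation without timestamps, and `build_decoder_prefix("en")` the prefix for English transcription with timestamps.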
Whisper's reference implementation uses a temperature-fallback decoder. The first attempt is greedy or beam-search decoding at temperature 0; if the resulting text fails either of two heuristic acceptance checks (a low average log-probability, or a high compression ratio that signals repetition), decoding is retried at progressively higher temperatures from 0.2 through 1.0. A high <|nospeech|> probability combined with a low average log-probability is instead treated as silence, and the segment is skipped instead of being retried.[^18] On the command-line interface the default best_of is 5, and the default beam size is also 5. Beam search and greedy decoding apply only at temperature 0; at non-zero temperatures the decoder samples stochastically, and best_of sets how many samples are drawn.[^18]
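The temperature-fallback loop can be sketched as follows. This is a simplified illustration, not the reference code: `decode` is a hypothetical callable standing in for one decoding pass, the compression-ratio threshold of 2.4 is an assumed default, and the sketch models only the log-probability and compression-ratio checks, leaving out the <|nospeech|> handling.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Attempt:
    text: str
    avg_logprob: float
    temperature: float


def compression_ratio(text):
    """Raw size over zlib-compressed size; repetitive text compresses well."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode,                                   # hypothetical: temperature -> Attempt
    temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold=2.4,          # assumed default
    logprob_threshold=-1.0,
):
    """Retry at higher temperatures until an attempt passes both checks."""
    attempt = None
    for temperature in temperatures:
        attempt = decode(temperature)
        too_repetitive = compression_ratio(attempt.text) > compression_ratio_threshold
        too_unlikely = attempt.avg_logprob < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break  # accepted: stop escalating the temperature
    return attempt  # the last attempt is returned even if every check failed
```

The design trades determinism for robustness: temperature 0 is tried first because it is reproducible, and sampling is used only when the deterministic output looks degenerate.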
For long-form audio the reference implementation chunks the input into successive 30-second windows, using the previous segment's text as a prompt to maintain context, and uses voice activity detection heuristics derived from the <|nospeech|> token (probability threshold of 0.6) combined with an average log-probability threshold of negative 1 to identify silent segments.[^17]
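A simplified sketch of this long-form loop is shown below. It assumes fixed, already-cut 30-second windows and a hypothetical `decode(chunk, prompt)` callable that returns the text, the <|nospeech|> probability, and the average log-probability for one window; the reference implementation is more involved than this.

```python
def is_silent(no_speech_prob, avg_logprob,
              no_speech_threshold=0.6, logprob_threshold=-1.0):
    """A window counts as silent when <|nospeech|> is likely AND the text is unlikely."""
    return no_speech_prob > no_speech_threshold and avg_logprob < logprob_threshold


def transcribe_long(windows, decode):
    """Decode successive windows, carrying the previous text forward as a prompt.

    `decode` is a hypothetical callable:
        decode(chunk, prompt) -> (text, no_speech_prob, avg_logprob)
    """
    pieces = []
    prompt = ""
    for chunk in windows:
        text, no_speech_prob, avg_logprob = decode(chunk, prompt)
        if is_silent(no_speech_prob, avg_logprob):
            continue  # drop the window; the prompt is left unchanged
        pieces.append(text)
        prompt = text  # context for the next window
    return " ".join(pieces)
```

Requiring both conditions means a confident transcript is kept even when the <|nospeech|> probability is high, which avoids discarding quiet but intelligible speech.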
Whisper ships in ten publicly distributed model configurations, counting the successive large releases as a single size: five base sizes, four .en English-only variants for the smaller sizes, and the later turbo model. The English-only variants share the multilingual architecture but are trained on English data alone and tend to outperform their multilingual counterparts on English audio at the same parameter count, especially for the tiny and base sizes.[^19] No .en variant exists for the large or turbo sizes; both ship as multilingual-only.
| Model | Parameters | Encoder layers | Decoder layers | Hidden dim | Heads | VRAM (FP16) | Relative speed | Languages |
|---|---|---|---|---|---|---|---|---|
| tiny / tiny.en | 39M | 4 | 4 | 384 | 6 | ~1 GB | ~10x faster | 99 / English only |
| base / base.en | 74M | 6 | 6 | 512 | 8 | ~1 GB | ~7x faster | 99 / English only |
| small / small.en | 244M | 12 | 12 | 768 | 12 | ~2 GB | ~4x faster | 99 / English only |
| medium / medium.en | 769M | 24 | 24 | 1024 | 16 | ~5 GB | ~2x faster | 99 / English only |
| large (v1, v2, v3) | 1,550M | 32 | 32 | 1280 | 20 | ~10 GB | 1x baseline | 99 (100 in v3) |
| large-v3-turbo | 809M | 32 | 4 | 1280 | 20 | ~6 GB | ~8x faster than large | 99 (transcription only) |
Parameter counts, layer counts, and VRAM/speed estimates come from the official OpenAI model card and code; OpenAI states that the tiny model runs roughly an order of magnitude faster than the large model in their reference implementation on the same hardware.[^1][^20]
The original Whisper models were trained on approximately 680,000 hours of audio paired with weakly supervised text labels collected from the web. OpenAI describes the dataset composition as approximately 65% English speech recognition (about 438,000 hours), 17% non-English speech recognition across 96 additional languages (about 117,000 hours), and 18% translation data going from non-English audio to English text (about 125,000 hours).[^2]
OpenAI did not release the training corpus. The paper describes the data curation pipeline at a high level: heuristics to filter out transcripts produced by other ASR systems (to avoid the model learning to reproduce ASR artifacts), automatic language identification to verify that the audio language matched the transcript language, fuzzy deduplication of near-identical clips, and a dedicated step to remove any audio that overlapped with known evaluation sets like LibriSpeech.[^2] A separate transcript quality filter penalized all-uppercase text, transcripts that lacked normal sentence structure, and transcripts whose character error rate (CER), measured against the output of an interim Whisper model, exceeded a threshold.
For Whisper large-v3, OpenAI expanded the training set to roughly five million hours: about one million hours of weakly labeled audio (similar in character to the original corpus) plus four million hours of pseudo-labeled audio. The pseudo-labels were generated by running Whisper large-v2 over unlabeled audio and using its outputs as targets, a form of large-scale self-training.[^5] The large-v3 release was trained for two epochs on this expanded dataset.
The Whisper paper trained models with the AdamW optimizer and a learning rate that warmed up linearly over the first 2,048 updates and then decayed linearly to zero. Training used FP16 with dynamic loss scaling, activation checkpointing, and a global batch size of 256 segments (each 30 seconds long).[^2] Models were trained for 2^20 (approximately one million) updates, which corresponds to two to three epochs over the 680,000-hour corpus depending on model size. The paper reports that no overfitting was observed at these training durations; the authors explicitly chose not to use data augmentation or regularization in the original release, attributing the model's generalization purely to dataset scale and diversity.[^2]
The December 2022 large-v2 release reversed this position. Large-v2 trained for 2.5 times more epochs than large-v1 with the same architecture, plus SpecAugment, stochastic depth, and BPE dropout as regularization. OpenAI reported about 5% relative WER reduction on English and about 10% on other languages compared to v1.[^4]
Whisper's multitask training enables four primary capabilities from one set of weights:
- The <|translate|> task token directs the decoder to produce English text from non-English audio. Translation is X-to-English only; Whisper does not translate English speech into other languages, nor does it directly translate between two non-English languages.[^2]
- Language identification, used by the transcribe() function for automatic language detection.[^2]
- The <|nospeech|> token serves as a coarse voice activity detector.[^2]

Whisper does not provide native speaker diarization (identifying who is speaking when) or fine-grained word-level alignment. Both are typically added by downstream pipelines such as WhisperX (which uses forced phoneme alignment against a wav2vec 2.0 model for word-level timestamps and pyannote.audio for diarization).[^11]
OpenAI has released a sequence of incremental upgrades to the large model:
In addition to these large-model releases, the original five base sizes and their .en variants have remained essentially unchanged since September 2022.
Whisper is evaluated primarily with word error rate (WER), the standard ASR metric defined as the edit distance between hypothesis and reference transcripts normalized by reference length.[^2]
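The definition can be made concrete with a short word-level edit-distance routine. This is a minimal sketch with a hypothetical function name; it splits on whitespace only and leaves out the text normalization that published evaluations typically apply before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors and the denominator is the reference length, a transcript padded with invented words can score above 100% WER.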
On the LibriSpeech test-clean benchmark, a relatively easy corpus of read audiobook speech, Whisper large-v3 achieves around 2.0 to 2.7% WER depending on the implementation and post-processing.[^5][^22] Earlier large-v2 sits at roughly 3% WER on the same set.[^4] These numbers are competitive but not state-of-the-art for that specific benchmark: narrowly trained models from NVIDIA (Parakeet, Canary) and academic groups have reported WERs below 2% on test-clean. The point of Whisper, however, is that it was not tuned to LibriSpeech.
Where Whisper substantially leads is in zero-shot generalization. The paper reports that on a basket of out-of-distribution English ASR benchmarks, Whisper achieves roughly 50% fewer errors than commercial and academic systems that match its performance on test-clean.[^2] On real-world recordings such as conversational meetings, call-center audio, and accented speech, Whisper large-v3 typically reports WER in the 7 to 8% range, and as high as 17 to 18% on heavily degraded call-center recordings.[^2] On the Hugging Face Open ASR Leaderboard, large-v3 records 2.01% WER on LibriSpeech clean, 3.91% on LibriSpeech other, 15.95% on AMI meetings, 11.29% on Earnings22, 10.02% on GigaSpeech, and 2.94% on SPGISpeech, for a mean WER of 7.44%.[^22]
Multilingual performance varies substantially by language. OpenAI publishes per-language WER on the FLEURS and Common Voice benchmarks; well-resourced European languages such as Spanish, Italian, French, German, and Portuguese routinely score below 10% WER on large-v3, while many low-resource African and Southeast Asian languages exceed 20% and in some cases 50% WER.[^5] On Common Voice 15, Whisper large-v3 reports approximately 5.6% WER on English audio and roughly 10% average WER across the supported languages, a 10 to 20% improvement over large-v2.[^21] On FLEURS, Whisper large-v3 averages around 10% WER across more than 100 languages, with the distribution heavily skewed toward better performance in high-resource languages.[^21]
OpenAI's published language list officially advertises only languages where large-v3 stays below 50% WER. Cantonese was added in the large-v3 release as a separate token, bringing the total supported language count to 100; the underlying language head still emits one of 99 multilingual tokens plus the dedicated Cantonese token.[^5]
For speech translation (X-to-English), Whisper large-v3 reports BLEU scores on the CoVoST-2 benchmark that are competitive with dedicated supervised translation systems for high-resource language pairs and substantially better than supervised baselines for low-resource pairs.[^2] Large-v3-turbo, which omitted translation data during fine-tuning, does not support the translation task.[^6]
The most-studied limitation of Whisper is its tendency to hallucinate, in the sense used in language model evaluation: it generates text that is not actually present in the audio. Because Whisper is an autoregressive sequence-to-sequence model trained on weakly labeled web data, it has learned strong priors over plausible text and will sometimes confabulate a fluent continuation when its acoustic evidence is weak, most notoriously during silent stretches, music-only passages, or noisy backgrounds.[^13][^14]
The peer-reviewed paper "Careless Whisper: Speech-to-Text Hallucination Harms" by Allison Koenecke (Cornell University), Anna Seo Gyeong Choi, Katelyn X. Mei, Hilke Schellmann (New York University), and Mona Sloane (University of Virginia), presented at the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24) in Rio de Janeiro on June 3 to 6, 2024, was the first systematic empirical study of the problem.[^15] Working with thousands of audio snippets from the TalkBank research repository hosted at Carnegie Mellon University, the authors found that roughly 1% of Whisper transcriptions contained entire hallucinated phrases or sentences absent from the underlying audio, and that 38% of these hallucinations were explicitly harmful, including perpetuating violence, making up inaccurate associations, or implying false authority.[^15] The paper documented that hallucinations disproportionately affected speakers with aphasia, a language disorder that produces longer non-vocal pauses, because silence appears to trigger Whisper's confabulation behavior.[^15]
In October 2024, the Associated Press published a follow-up investigation by Garance Burke and Hilke Schellmann reporting that Whisper "is plagued by hallucinations" and that the problem is particularly serious in medical settings, where Whisper-derived tools are widely deployed.[^13] The AP found one machine-learning engineer who detected hallucinations in roughly half of more than 100 hours of Whisper transcriptions he examined, and another developer who examined 26,000 transcripts and found hallucinations in nearly all of them.[^13] A University of Michigan researcher analyzing public-meeting recordings found hallucinations in eight out of every ten audio transcriptions inspected before he began trying to improve the model.[^13] Another study reported 187 hallucinations across more than 13,000 clear-audio snippets.[^13]
The AP also reported that Nabla, a company with operations in France and the US, deletes the source audio after transcription, leaving clinicians no way to audit the transcripts against ground truth. Its Whisper-derived medical transcription tool had been used in more than 7 million medical visits by more than 30,000 clinicians across more than 40 health systems, including Mankato Clinic in Minnesota and Children's Hospital Los Angeles.[^13][^14] OpenAI's own documentation explicitly warns against using Whisper in "decision-making contexts" or "high-risk domains."[^13][^23]
Coverage of the AP investigation extended across major outlets, with Fortune, TechCrunch, PBS NewsHour, the Healthcare Brew newsletter, Tom's Guide, Engadget, and others reporting on the implications for hospitals, courts, and accessibility services that depend on Whisper.[^13][^14][^24][^25][^26]
Subsequent research has begun to characterize where hallucinations originate inside the model. The "Calm-Whisper" paper (Wang et al., arXiv:2505.12969, accepted to INTERSPEECH 2025) benchmarked the contribution of each self-attention head in the Whisper-large-v3 decoder to non-speech hallucinations by performing a head-wise mask, finding that only 3 of the 20 heads account for over 75% of non-speech hallucinations on the UrbanSound dataset.[^27] Fine-tuning those three heads on non-speech audio reduced non-speech hallucination by more than 80% with less than 0.1% WER degradation on LibriSpeech test-clean and test-other.[^27] OpenAI's own successor gpt-4o-transcribe and gpt-4o-mini-transcribe models, released in March 2025, claim roughly 90% fewer hallucinations than Whisper large-v2 on an internal hallucination-with-noise evaluation.[^8]
Whisper supports 99 languages in principle, but performance varies dramatically across them. English audio dominates the training mix (about 65% of speech-recognition hours), and many of the listed languages have orders of magnitude less data and correspondingly higher WERs.[^2] For low-resource languages, Whisper's results can be unreliable. The 2024 to 2025 academic literature on Whisper fine-tuning for low-resource languages (Swiss German, Hausa, Vietnamese, Yoruba, and others) is now a sizable subfield, often relying on LoRA adapters and language-model rescoring to close the gap.[^28]
Autoregressive decoders can fall into repetition loops, emitting the same phrase over and over. Whisper exhibits this failure mode on degraded, very long, or silent audio, especially when the model is given a prompt that biases it toward repetitive output. The standard mitigation is to use beam search with a no-repeat-ngram constraint, or to chunk audio with voice activity detection so that silence is excluded.[^11][^18]
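A no-repeat-ngram constraint can be sketched independently of any particular decoder: before each step, the decoder computes which tokens would complete an n-gram that has already been generated and excludes them. The function below is an illustrative, hypothetical helper operating on token ids, not the implementation of any specific library.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would repeat an n-gram already present in `tokens`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(tokens) < n - 1:
        return set()  # not enough history to form an n-gram yet

    # The last n-1 tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()

    banned = set()
    for i in range(len(tokens) - n + 1):
        # If an earlier n-gram starts with the same prefix, its final token is banned.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned
```

A decoder would set the score of every banned token to negative infinity before choosing the next token, which makes an exact repetition loop of length n or more impossible.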
Whisper was designed for batch processing of fixed 30-second windows and does not natively support low-latency real-time streaming. Streaming Whisper systems exist in the community (whisper_streaming, WhisperLive, WhisperKit) but generally rely on overlapping windows, voice activity detection, and partial-hypothesis stabilization layered on top of the off-the-shelf model.[^11] OpenAI's own Realtime API addresses streaming use cases through the closed gpt-realtime-whisper and gpt-4o-mini-transcribe endpoints, not through the open Whisper weights.[^29]
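The overlapping-window strategy used by such streaming wrappers can be illustrated with a small generator. This sketch shows only how window boundaries are chosen; the function name is hypothetical, and the stitching of partial hypotheses across overlaps, which is the harder part, is left out.

```python
def overlapping_windows(n_samples, window, hop):
    """Yield (start, end) sample indices of windows that advance by `hop` samples.

    Consecutive windows overlap by `window - hop` samples; the final window
    is shortened to end exactly at `n_samples`.
    """
    if not 0 < hop <= window:
        raise ValueError("hop must satisfy 0 < hop <= window")
    start = 0
    while start < n_samples:
        yield start, min(start + window, n_samples)
        if start + window >= n_samples:
            break  # this window already reached the end of the audio
        start += hop
```

With 16 kHz audio, `overlapping_windows(n, window=30 * 16_000, hop=25 * 16_000)` would give 30-second windows with 5 seconds of shared context between neighbours; the 25-second hop is an arbitrary illustrative choice.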
The large models require substantial memory. The full-precision large-v3 needs roughly 10 GB of GPU VRAM; large-v3-turbo needs roughly 6 GB. The smaller models are practical on CPUs and even on phones; whisper.cpp ships builds that run the tiny and base models on a Raspberry Pi.[^9]
The MIT-licensed release of Whisper catalyzed one of the largest open-source ecosystems in modern AI. The main openai/whisper repository on GitHub has accumulated more than 100,000 stars by early 2026, and the openai/whisper-large-v3 model card on Hugging Face records roughly 5 million monthly downloads, making it one of the most-accessed open-source speech models ever published.[^20][^22]
whisper.cpp, maintained by Georgi Gerganov (also the creator of llama.cpp), is a plain C/C++ port of Whisper inference. It uses the GGML tensor library, has no Python or PyTorch dependency, supports integer quantization to reduce model file sizes (the full large model can run as a quantized 1 to 2 GB file, and supported modes include Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, and Q8_0), and runs on x86 CPUs, Apple Silicon (via Core ML and Accelerate), Raspberry Pi, NVIDIA CUDA, Vulkan, and other backends.[^9] whisper.cpp is the engine inside many desktop and mobile applications that run Whisper locally, including the macOS dictation app SuperWhisper and the iOS Voice Memos transcription pipeline used in third-party clients.
faster-whisper is a Python reimplementation built on CTranslate2, a fast C++ inference engine originally written for translation models by SYSTRAN. It applies INT8 and FP16 quantization, fused operations, and SIMD-optimized CPU kernels to deliver roughly 4x the throughput of the official OpenAI implementation while keeping accuracy essentially unchanged, and with batched inference and INT8 quantization it has been measured at nearly 8x faster than real time.[^10] faster-whisper is the default backend for many server-side transcription products and underlies WhisperX. It exposes a BatchedInferencePipeline interface that yields an additional 3 to 5x throughput improvement on top of the base speedup.[^10]
WhisperX, by Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman (Visual Geometry Group, University of Oxford), is a complete speech-processing pipeline that uses faster-whisper for transcription and then layers in additional capabilities: voice activity detection to remove silence (which both speeds inference and reduces hallucinations), forced phoneme alignment against a wav2vec 2.0 model to obtain word-level timestamps with roughly plus or minus 50 ms accuracy versus the vanilla Whisper segment-level precision of plus or minus 500 ms, and optional speaker diarization with pyannote.audio.[^11] WhisperX was published at INTERSPEECH 2023 (arXiv:2303.00747) and reports up to 70x real-time transcription with large-v2.[^11] It is the most cited paper to build on Whisper, with several hundred follow-up citations across speech and accessibility research.
Distil-Whisper, by the Hugging Face team (Sanchit Gandhi, Patrick von Platen, and Alexander M. Rush), applies knowledge distillation to produce dramatically smaller and faster Whisper variants. The technique copies the entire encoder from the teacher Whisper model and freezes it during distillation while reducing the decoder to as few as two layers, initialized from the first and last layers of the teacher. Training uses 22,000 hours of pseudo-labeled English audio (and up to 98,000 hours for the v3.5 release), filtered with a word error rate heuristic to retain only the highest-quality pseudo-labels.[^12] The published distilled checkpoints are 5.8 times faster than the teacher with 51% fewer parameters, while staying within 1% WER on out-of-distribution test data in zero-shot transfer.[^12]
| Model | Teacher | Training data | Speedup | WER gap vs. teacher |
|---|---|---|---|---|
| distil-large-v2 | Whisper large-v2 | ~22,000 hours | ~6x | <1% |
| distil-large-v3 | Whisper large-v3 | ~22,000 hours | ~6x | <1% |
| distil-large-v3.5 | Whisper large-v3 | ~98,000 hours | ~6x | <1% |
| distil-small.en | Whisper small | English subset | ~6x | <1% |
| distil-medium.en | Whisper medium | English subset | ~6x | <1% |
Distil-Whisper checkpoints are compatible with the faster-whisper library and can also serve as the draft model in speculative decoding against full Whisper, providing roughly 2x speedup with mathematically guaranteed identical outputs.[^12]
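The reason the outputs are identical can be seen in a toy version of greedy speculative decoding. The sketch below is an assumption-laden simplification: `target` and `draft` are hypothetical callables that map a token sequence to the next greedy token, and verification runs one token at a time where a real system would score the whole draft in a single batched forward pass of the large model.

```python
def greedy_decode(target, prompt, max_new, eos):
    """Baseline: the large model generates one token per call."""
    seq = list(prompt)
    for _ in range(max_new):
        token = target(seq)
        seq.append(token)
        if token == eos:
            break
    return seq


def speculative_decode(target, draft, prompt, max_new, eos, k=4):
    """Draft proposes up to k tokens; the target keeps the prefix it agrees with."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seq = list(prompt)
    limit = len(prompt) + max_new
    while len(seq) < limit:
        # 1. The small draft model guesses a short continuation.
        proposal, context = [], list(seq)
        for _ in range(min(k, limit - len(seq))):
            token = draft(context)
            proposal.append(token)
            context.append(token)
        # 2. The target checks each guess. Its own token is always the one
        #    appended, so a wrong guess costs speed but never changes the output.
        for proposed in proposal:
            expected = target(seq)
            seq.append(expected)
            if expected == eos:
                return seq
            if expected != proposed:
                break  # discard the rest of the draft and propose again
    return seq
```

Every appended token is the target model's own greedy choice, so the result matches `greedy_decode` exactly; the draft model only determines how many target tokens can be confirmed per verification pass.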
The Hugging Face transformers library ships an officially supported Whisper implementation with chunked long-form decoding, batch inference, and integration with the broader Hugging Face training and serving ecosystem.[^20]
Beyond the open-source ecosystem, Whisper underpins a large slice of commercial transcription and voice-AI products. Notable named deployments include:
The commercial transcription market in 2026 is essentially split between the open Whisper checkpoints (self-hosted via whisper.cpp, faster-whisper, WhisperX, Distil-Whisper, and others), the OpenAI Audio API hosted endpoints, and dedicated speech-API vendors such as AssemblyAI, Deepgram, Speechmatics, AWS Transcribe, and Google Cloud Speech-to-Text.[^31][^32]
Independent of the open-weight checkpoints, OpenAI offers Whisper as a paid hosted API. Originally launched on March 1, 2023 alongside the ChatGPT API, the speech-to-text endpoint was first served by whisper-1, a hosted variant of large-v2, at $0.006 per minute of input audio.[^33] As of early 2026, the OpenAI API Audio surface offers:
| Model | Listed price | Notes |
|---|---|---|
| whisper-1 | $0.006 per minute | Original hosted Whisper checkpoint; legacy |
| gpt-4o-transcribe | $0.006 per minute | GPT-4o-based, lower WER, fewer hallucinations |
| gpt-4o-mini-transcribe | $0.003 per minute | Cheaper variant, slightly less accurate |
| gpt-4o-transcribe-diarize | (separate price) | Built-in speaker diarization, released late 2025 |
| gpt-realtime-whisper | $0.017 per minute | Low-latency streaming transcription |
The API bills per minute of input audio, not per minute of speech, accepts files up to 25 MB in common formats (MP3, WAV, FLAC, M4A, WEBM, MPGA, MPEG), and supports json, text, srt, verbose_json, and vtt response formats.[^7][^34] It accepts an optional prompt parameter to bias spelling of proper nouns and style, and a timestamp_granularities parameter that can return word-level, segment-level, or both kinds of timestamps. Word-level timestamps are exposed only when response_format=verbose_json.[^34]
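As a rough illustration of the billing model and of a word-timestamp request, the sketch below estimates cost from the listed per-minute prices and wraps a call through the official `openai` Python package. The function names are hypothetical, the prices are copied from the table above and may change, and the request assumes the package is installed and the `OPENAI_API_KEY` environment variable is set.

```python
# Listed prices in USD per minute of input audio (from the table above).
PRICE_PER_MINUTE_USD = {
    "whisper-1": 0.006,
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
}


def estimate_cost_usd(model, audio_seconds):
    """Estimate cost from total input duration, silence included."""
    return PRICE_PER_MINUTE_USD[model] * audio_seconds / 60


def transcribe_with_word_timestamps(path):
    """Request word-level timestamps, which require response_format='verbose_json'."""
    from openai import OpenAI  # imported lazily so the cost helper works without it

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio_file:
        return client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )
```

Because billing follows input duration, trimming long silences before upload lowers the estimate returned by `estimate_cost_usd` as well as the number of silent stretches the model has to handle.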
The API endpoint and the open-weight checkpoints should be understood as distinct products. Running Whisper from the open weights is free (modulo compute costs) and works fully offline; the API is a managed service with its own usage policies, quota limits, and pricing.
OpenAI released both Whisper's code and its pretrained model weights under the MIT License, a short and permissive license that allows commercial use, modification, redistribution, and incorporation into proprietary products, subject only to retaining the original copyright notice.[^3] This is in sharp contrast to several other OpenAI systems (DALL-E 2, GPT-3, and the broader ChatGPT family) whose weights have never been published. The MIT licensing decision is the foundation of the Whisper ecosystem: every project listed above (whisper.cpp, faster-whisper, WhisperX, Distil-Whisper, WhisperKit, and the countless fine-tunes on Hugging Face) is downstream of OpenAI's choice to make the weights freely usable.
Within the AI community, Whisper is often cited alongside CLIP (also OpenAI, also openly released) as evidence that OpenAI was, at one point, willing to ship significant systems with permissive licenses. Since the GPT-3 era, however, OpenAI has not released open weights for any of its flagship language models.[^16]
OpenAI's audio strategy has progressively moved away from the open Whisper line toward unified multimodal systems. The first step was GPT-4o, announced on May 13, 2024 ("Spring Update"), a single end-to-end model trained across text, vision, and audio that could respond to audio inputs in as little as 232 ms, with an average of 320 ms, comparable to human turn-taking latency.[^35] Although GPT-4o uses its own native audio encoder rather than calling Whisper, several elements of the Whisper recipe (the 16 kHz mono input, the 30-second windowing, the BPE tokenizer extension for cross-lingual support) carry over to GPT-4o audio.[^35]
In March 2025, OpenAI introduced gpt-4o-transcribe and gpt-4o-mini-transcribe as next-generation speech-to-text models. OpenAI reported lower WER across 33 tested languages, semantic voice activity detection for better endpointing, and approximately 90% fewer hallucinations relative to Whisper large-v2 on the internal hallucination-with-noise evaluation.[^8]
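Word error rate (WER), the metric cited here, is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes it with a standard word-level edit distance. It omits text normalization (casing, punctuation, number formatting), which published benchmarks typically apply first, so its numbers are not directly comparable to reported figures.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A hallucinated sentence appended to an otherwise correct transcript shows up as insertions, which is why WER can exceed 1.0 on silent or noisy audio.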
In October 2025, OpenAI added gpt-4o-transcribe-diarize, the first OpenAI ASR model with built-in speaker diarization. The diarize endpoint returns a diarized_json payload with per-speaker segment timestamps and supports up to roughly 1,400 seconds per chunk; speakers default to labels A:, B:, etc., unless reference voice samples are provided up front.[^36][^37]
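Recordings longer than the per-chunk limit have to be split before they are sent to the diarization endpoint. The sketch below plans equal-length chunks under the roughly 1,400-second figure quoted above. `plan_chunks` is a hypothetical helper, the cut points are purely arithmetic rather than aligned to pauses, and it is assumed that speaker labels returned for separate chunks would need to be reconciled by the caller.

```python
import math

MAX_CHUNK_SECONDS = 1400.0  # approximate per-chunk limit quoted in the text


def plan_chunks(total_seconds, max_chunk=MAX_CHUNK_SECONDS):
    """Return (start, end) offsets in seconds, each chunk no longer than max_chunk."""
    if total_seconds <= 0:
        return []
    count = math.ceil(total_seconds / max_chunk)
    length = total_seconds / count  # equal-length chunks avoid a tiny final piece
    return [(i * length, min((i + 1) * length, total_seconds))
            for i in range(count)]


if __name__ == "__main__":
    for start, end in plan_chunks(3600):
        print(f"{start:7.1f}s -> {end:7.1f}s")
```

Equal-length chunks are a design choice here: a one-hour file becomes three 20-minute pieces instead of two full chunks and a short remainder, which gives the model similar context in every request.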
The Realtime API reached general availability in August 2025, providing low-latency streaming audio in and audio out for conversational applications. The Realtime API supports gpt-realtime-whisper (a low-latency streaming variant priced at $0.017 per minute), gpt-4o-mini-transcribe, gpt-4o-transcribe, and whisper-1 as transcription backends.[^29]
These successor models are not open-weight. They are available only through the OpenAI API. As a result, the open-source community continues to rely on the open Whisper checkpoints and their derivatives for self-hosted, on-premise, on-device, and air-gapped deployments. Several non-OpenAI labs have also released competitive open ASR models (notably NVIDIA's Parakeet TDT and Canary families, and Meta's SeamlessM4T speech components), but Whisper retains the lead in raw adoption and ecosystem breadth.[^31]
The following table summarizes how Whisper compares with the most-discussed alternative ASR families as of 2026.
| System | Vendor | Open weights | Multilingual | Streaming | Hallucination rate | Typical use |
|---|---|---|---|---|---|---|
| Whisper large-v3 | OpenAI | Yes (MIT) | 100 languages | No (native) | High on silence/noise | Self-hosted batch ASR |
| Whisper large-v3-turbo | OpenAI | Yes (MIT) | 99 languages (no translate) | No (native) | High (similar to v3) | Fast self-hosted batch |
| Distil-Whisper | Hugging Face | Yes (MIT) | English only (mostly) | Limited | Inherited from teacher | Edge / on-device |
| gpt-4o-transcribe | OpenAI | No (API) | 33+ languages | Yes | ~90% lower than Whisper large-v2 | Hosted production |
| Deepgram Nova-3 | Deepgram | No (API) | Multiple | Yes | Low (proprietary metric) | Hosted streaming |
| AssemblyAI Universal | AssemblyAI | No (API) | Multiple | Yes | Low (proprietary metric) | Hosted production |
| Parakeet TDT | NVIDIA | Yes (CC-BY) | English | Yes | Low | Self-hosted streaming |
| Canary | NVIDIA | Yes (CC-BY-NC) | 4 languages | Limited | Low | Self-hosted multilingual |
| SeamlessM4T | Meta | Yes (CC-BY-NC-SA) | 100 languages | Limited | Moderate | Self-hosted multilingual |
| Speechmatics | Speechmatics | No (commercial) | 45+ languages | Yes | Low (proprietary) | Hosted enterprise |
Whisper's distinguishing characteristics in this comparison are its permissive MIT license, its multilingual coverage, and its enormous community ecosystem. Its primary weaknesses relative to commercial alternatives are the lack of native streaming and the well-documented hallucination behavior on silence and non-speech audio.[^31][^32][^15]
As of May 2026, the open-weight Whisper checkpoints (large-v3 and large-v3-turbo, plus the smaller original sizes) remain the most recent open weights OpenAI has published for ASR. OpenAI has not announced an "open-weight Whisper-4" or similar successor.
Instead, OpenAI's continued investment in speech-to-text has moved into the closed gpt-4o-* audio model family described above. Independent benchmarking shows that gpt-4o-transcribe improves on Whisper across most metrics but at the cost of API-only access and per-minute pricing.[^32] Distil-Whisper v3.5, large-v3-turbo, and the WhisperX pipeline continue to receive community updates, ensuring that the open ASR stack remains a viable choice for users who cannot or will not send audio to OpenAI's hosted endpoints.
For developers and researchers in 2026, the practical picture is: use the open-weight Whisper checkpoints (typically via faster-whisper, whisper.cpp, or WhisperX) when self-hosting, privacy, cost, or offline operation matters; use OpenAI's gpt-4o-transcribe or gpt-4o-mini-transcribe when accuracy and lower hallucination rates matter more than openness; and apply VAD-based silence removal and human review for any high-stakes deployment, regardless of which model is used.[^13][^15]
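The silence-removal step can be illustrated with a deliberately crude energy gate: frames whose root-mean-square amplitude falls below a threshold are dropped before transcription. This is a stand-in for a real voice activity detector, not an implementation of one; production pipelines usually rely on a trained VAD (faster-whisper, for instance, exposes a `vad_filter` option). The function name, the 30 ms frame size, and the threshold are illustrative assumptions.

```python
import math


def drop_silence(samples, sample_rate=16000, frame_ms=30, threshold=0.01):
    """Keep only frames whose RMS amplitude reaches the threshold.

    samples is a sequence of floats in [-1.0, 1.0]. This is an energy gate,
    not a trained VAD: it also discards quiet speech and keeps loud noise.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            kept.extend(frame)
    return kept


if __name__ == "__main__":
    silence = [0.0] * 4800  # 0.3 s of digital silence at 16 kHz
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(9600)]
    trimmed = drop_silence(silence + tone + silence)
    print(len(trimmed) / 16000, "seconds kept")
```

Removing frames shifts every later timestamp, so a real pipeline would also record the offset of each kept region and map the model's timestamps back to the original audio.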
Whisper is widely regarded as the model that opened the modern era of practical, robust speech recognition for developers. By providing a single, MIT-licensed checkpoint that handled 99 languages, translation, and timestamps with strong noise robustness, it collapsed a previously fragmented landscape of language-specific commercial APIs and academic checkpoints into a single fine-tunable starting point. The proliferation of community implementations (whisper.cpp running on phones and laptops, faster-whisper deployed in production at scale, WhisperX adding diarization, Distil-Whisper compressing the model further) is now a frequently cited example of what becomes possible when a frontier lab releases a capable model with a permissive license.[^16]
At the same time, Whisper has become a case study in the limitations of large weakly supervised generative models for high-stakes settings. The 2024 ACM FAccT paper, the AP's October 2024 investigation, the Cornell and UVA hallucination studies, and ongoing reporting on medical-transcription deployments have made Whisper a recurring example in discussions of AI safety, model auditing, and the practical risks of deploying language-model-derived systems where mistakes have real consequences.[^13][^15] The 2025 Calm-Whisper research and OpenAI's own redesign in gpt-4o-transcribe (claiming about 90% fewer hallucinations) suggest that the hallucination problem is partially tractable through architectural and training changes, but also that the open Whisper weights, frozen since November 2023 (large-v3) and October 2024 (large-v3-turbo), still carry the failure mode they shipped with.[^8][^27]