Whisper is an automatic speech recognition (ASR) system developed by OpenAI, first released in September 2022, that generates transcriptions and translations from an audio track. OpenAI stated that the model had been "trained on 680,000 hours of multilingual and multitask supervised data collected from the web," approaching "human level robustness and accuracy on English speech recognition." Unlike DALL-E 2 and GPT-3, Whisper was made publicly available as an open-source project on GitHub under a permissive MIT license. Since its release, Whisper has become one of the most widely adopted open-source speech recognition systems, with its GitHub repository accumulating over 96,000 stars and spawning a large ecosystem of community-built tools and derivative implementations.
In the fields of artificial intelligence (AI) and machine learning, speech recognition remains a challenging problem. However, highly capable speech recognition systems have been developed and are used by major corporations like Google, Amazon, and Meta. According to OpenAI, Whisper differs from those systems because it was trained on hundreds of thousands of hours of multilingual and "multitask" data gathered from the web, increasing its robustness in the recognition of accents, background noise, and technical language. This leads to strong speech transcription across different languages, even with poor-quality audio or excessive background noise. The model supports transcription for 99 languages and translation from those languages into English. While OpenAI's analysis of other languages besides English is not comprehensive, users have reported good results across many language families.
Most of the software's audio dataset consists of English, but about a third comes from other languages. Whisper's capability in speech recognition stems from this large and diverse dataset. The model performs competitively against commercial ASR systems like Alexa, Siri, and Google Assistant in many benchmark scenarios. Whisper's high accuracy and ease of use have enabled the integration of voice interfaces into a broader range of applications. The model was released in multiple sizes, trading accuracy for speed (the smallest model is roughly an order of magnitude faster than the largest), giving users the option to run it on laptops, desktops, mobile devices, or cloud servers. Because it is open source, developers can also deploy it on their own servers, improving data privacy.
Figure 1. Whisper architecture. Source: OpenAI.
Whisper uses an encoder-decoder transformer architecture, sometimes referred to as a sequence-to-sequence model. This design is conceptually similar to the architecture used in language models like GPT-3 and translation models, but adapted for audio input rather than text. The entire pipeline can be broken down into an audio preprocessing stage, an encoder that processes acoustic features, and a decoder that generates text tokens.
Before audio enters the transformer, it goes through several preprocessing steps. The raw input audio is first resampled to 16,000 Hz (16 kHz). The audio is then divided into non-overlapping 30-second segments, which are processed sequentially. Each 30-second chunk is converted into an 80-channel log-magnitude Mel spectrogram representation, computed using 25-millisecond windows with a 10-millisecond stride. For the Large-v3 and Large-v3 Turbo models, the number of Mel frequency bins was increased from 80 to 128, providing finer frequency resolution. The resulting spectrogram is normalized to have approximately zero mean and a range of [-1, 1].
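The arithmetic behind these constants is easy to verify; a minimal sketch in pure Python, using the values described above:

```python
SAMPLE_RATE = 16_000     # Hz, after resampling
CHUNK_SECONDS = 30       # fixed segment length
WINDOW_MS, HOP_MS = 25, 10
N_MELS = 80              # 128 for Large-v3 and Large-v3 Turbo

n_samples = SAMPLE_RATE * CHUNK_SECONDS       # samples per chunk
n_fft = SAMPLE_RATE * WINDOW_MS // 1000       # 400-sample analysis window
hop = SAMPLE_RATE * HOP_MS // 1000            # 160-sample stride
n_frames = n_samples // hop                   # spectrogram frames per chunk

print(n_samples, n_fft, hop, n_frames)        # 480000 400 160 3000
print("spectrogram shape:", (N_MELS, n_frames))
```

Each 30-second chunk thus becomes an 80 x 3,000 (or 128 x 3,000) spectrogram before entering the encoder.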
The spectrogram representation is first passed through a small convolutional stem consisting of two convolutional layers (Conv1D) with GELU activation functions. These convolutional layers downsample the input and extract local acoustic features. Sinusoidal positional encodings are then added to the feature representations before they enter the main transformer encoder blocks. The encoder consists of a stack of standard transformer blocks, each containing multi-head self-attention and feed-forward layers with residual connections and layer normalization. The number of encoder layers varies by model size, ranging from 4 layers in the Tiny model to 32 layers in the Large model.
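The sinusoidal positional encodings follow the standard transformer construction. A sketch in NumPy, assuming the Tiny model's 384-channel width and the 1,500 encoder positions that remain after the stride-2 convolution halves the 3,000 spectrogram frames:

```python
import numpy as np

def sinusoidal_positions(length: int, channels: int) -> np.ndarray:
    """Standard sinusoidal positional encodings over `length` positions."""
    assert channels % 2 == 0
    # Log-spaced inverse frequencies across half the channel dimension
    log_timescale = np.log(10_000.0) / (channels // 2 - 1)
    inv_freq = np.exp(-log_timescale * np.arange(channels // 2))
    angles = np.arange(length)[:, None] * inv_freq[None, :]
    # First half of the channels holds sines, second half cosines
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

pe = sinusoidal_positions(1500, 384)  # Tiny's hidden dimension is 384
print(pe.shape)  # (1500, 384)
```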
The decoder follows a standard autoregressive transformer decoder design. It has the same width (hidden dimension) and number of transformer blocks as the encoder. The decoder uses learned position embeddings and tied input-output token representations, meaning the same weight matrix is shared for both the input embedding layer and the output prediction layer. Whisper uses a byte-pair encoding (BPE) tokenizer similar to the one used in GPT-2. The multilingual models use a vocabulary of 51,865 tokens, while the English-only models use the GPT-2 tokenizer with 50,257 tokens. The decoder generates text tokens one at a time, conditioned on the encoder output and previously generated tokens.
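Weight tying can be illustrated in a few lines of NumPy: one matrix serves as both the input embedding lookup and the output projection. The token ids and random weights below are placeholders, not Whisper's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 51_865, 384    # multilingual vocabulary size, Tiny width
# One shared embedding matrix (random stand-in for trained weights)
E = rng.standard_normal((vocab, d_model), dtype=np.float32) * 0.02

token_ids = np.array([7, 42])   # arbitrary placeholder token ids
x = E[token_ids]                # input side: embedding lookup
h = x                           # stand-in for the decoder's hidden states
logits = h @ E.T                # output side: the same matrix projects to the vocabulary
print(logits.shape)             # (2, 51865)
```

Tying halves the parameter count of the embedding/projection pair and couples the input and output representations of each token.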
A distinctive feature of Whisper's architecture is its use of special tokens that allow a single model to handle multiple tasks. The decoder is guided by a sequence of special tokens that specify:
- The language of the audio, indicated by a language token (e.g., <|en|>, <|fr|>, <|zh|>).
- The task: <|transcribe|> for speech recognition in the source language or <|translate|> for translation into English.
- Whether to predict timestamps: the <|notimestamps|> token suppresses timestamp prediction. When this token is absent, the decoder predicts timestamps quantized to 20-millisecond intervals, enabling word-level and segment-level timing information.
- Whether the segment contains speech at all, indicated by the <|nospeech|> token.

This multitask design means Whisper functions as a general-purpose audio model rather than a narrowly scoped transcription tool.
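The resulting decoder prompt can be sketched with a small helper (`build_prompt` is hypothetical; the real tokenizer maps these marker strings to integer token ids):

```python
def build_prompt(language: str, task: str, timestamps: bool) -> list[str]:
    """Assemble the special-token prefix that steers the Whisper decoder."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")  # suppress timestamp prediction
    return tokens

print(build_prompt("fr", "transcribe", timestamps=True))
# ['<|startoftranscript|>', '<|fr|>', '<|transcribe|>']
print(build_prompt("de", "translate", timestamps=False))
# ['<|startoftranscript|>', '<|de|>', '<|translate|>', '<|notimestamps|>']
```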
Figure 2. Different Whisper models. Source: OpenAI.
Whisper ships in several model sizes, each with a different balance of accuracy, speed, and resource requirements. English-only variants (.en) of the smaller models are available and tend to perform better on English tasks at their respective sizes, particularly for Tiny and Base. The larger models (Large and Turbo) are available only in multilingual form.
| Model | Parameters | Encoder Layers | Decoder Layers | Hidden Dimension | Attention Heads | Approx. VRAM | Relative Speed |
|---|---|---|---|---|---|---|---|
| Tiny | 39 M | 4 | 4 | 384 | 6 | ~1 GB | ~10x |
| Base | 74 M | 6 | 6 | 512 | 8 | ~1 GB | ~7x |
| Small | 244 M | 12 | 12 | 768 | 12 | ~2 GB | ~4x |
| Medium | 769 M | 24 | 24 | 1024 | 16 | ~5 GB | ~2x |
| Large | 1,550 M | 32 | 32 | 1280 | 20 | ~10 GB | 1x |
| Large-v3 Turbo | 809 M | 32 | 4 | 1280 | 20 | ~6 GB | ~8x |
The English-only variants (tiny.en, base.en, small.en, medium.en) share the same architectures as their multilingual counterparts but are trained exclusively on English data. No English-only variant exists for the Large or Turbo models.
OpenAI has released several iterations of the Large model:

- Large (v1), included in the original September 2022 release.
- Large-v2 (December 2022), trained for more epochs with additional regularization, improving accuracy without architectural changes.
- Large-v3 (November 2023), trained on roughly 5 million hours of audio and using 128 Mel frequency bins instead of 80.
- Large-v3 Turbo (October 2024), a pruned version of Large-v3 that keeps the full encoder but reduces the decoder from 32 to 4 layers for much faster inference.
The original Whisper models were trained on 680,000 hours of multilingual and multitask supervised data collected from the internet. OpenAI describes this as "weakly supervised" data because the transcript labels were not hand-verified by human annotators. Instead, the labels came from sources such as subtitles, closed captions, and other user-generated text paired with audio found on the web.
The dataset composition breaks down roughly as follows:
| Category | Hours | Share |
|---|---|---|
| English speech recognition | ~438,000 | ~65% |
| Non-English speech recognition (96 languages) | ~117,000 | ~17% |
| X-to-English translation | ~125,000 | ~18% |
For Large-v3, the training data was expanded dramatically to approximately 5 million hours total, consisting of 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio. The pseudo-labeled portion was generated by running Large-v2 over unlabeled audio and using its transcription outputs as training labels, a form of self-training.
The data pipeline involved several filtering and cleaning steps:

- Heuristic removal of machine-generated transcripts (for example, text that is entirely uppercase or lowercase, or lacks punctuation, suggesting it was produced by another ASR system).
- Audio language detection, discarding pairs where the spoken language did not match the transcript language (unless the transcript was English, in which case the pair was reused as translation data).
- De-duplication of transcripts to reduce repeated training text.
- After an initial training run, manual inspection of high-error-rate sources to remove remaining low-quality data.
After filtering, audio files were segmented into 30-second chunks paired with the corresponding portion of the transcript.
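The chunking step can be sketched as follows (a simplified illustration using a hypothetical `chunk_bounds` helper; the real pipeline also pairs each window with the matching slice of the transcript):

```python
def chunk_bounds(duration_s: float, chunk_s: float = 30.0):
    """Yield (start, end) boundaries for fixed-length training segments."""
    t = 0.0
    while t < duration_s:
        yield (t, min(t + chunk_s, duration_s))
        t += chunk_s

# A 75-second recording becomes two full chunks plus a short final one
print(list(chunk_bounds(75.0)))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```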
Whisper is trained as a multitask model, jointly learning to perform speech recognition, speech translation, language identification, and voice activity detection. During training, special tokens prepended to the decoder input specify which task to perform for a given audio segment. This approach is similar in spirit to the text-based multitask training used in models like T5 and GPT-3, where a single model learns multiple capabilities through formatted input sequences.
The multitask training framework enables four core capabilities:

- Multilingual speech transcription in the source language.
- Speech translation from any supported language into English.
- Language identification of the spoken audio.
- Voice activity detection, determining whether a segment contains speech at all.
Because the model was trained on a very large and diverse dataset rather than a smaller, curated dataset, Whisper does not necessarily beat models that specialize in specific benchmarks like LibriSpeech. However, its zero-shot performance is substantially more robust, with roughly 50% fewer errors than comparable models when evaluated on out-of-distribution data.
Figure 3. Comparison of Whisper's Word-error-rate (WER) with other models. Source: OpenAI.
Whisper's accuracy is measured primarily using word error rate (WER), the standard metric for ASR systems. On clean English speech, Whisper Large-v3 achieves WER as low as 2.7%. On mixed real-world recordings with varying audio quality, WER increases to approximately 7 to 8%. The newer gpt-4o-transcribe models released by OpenAI in March 2025 achieve an even lower 2.46% WER on English benchmarks.
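WER is the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the model output, divided by the number of reference words. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution / 6 words ≈ 0.167
```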
Performance degrades on particularly challenging audio. On low-quality call center recordings, for example, WER can rise to around 17 to 18%.
Figure 4. Whisper's WER as a function of language. Source: OpenAI.
Multilingual performance varies significantly depending on how well a language is represented in the training data. Languages with substantial training data, such as Spanish, French, German, Italian, Portuguese, Japanese, and Korean, achieve relatively low word error rates. Romance languages and other well-resourced European languages tend to perform best after English.
However, for languages less represented in the training set, WER can exceed 20% or climb far higher. OpenAI officially lists only languages where Whisper achieves better than 50% WER. Low-resource languages and those with little internet-available transcription data show markedly worse accuracy. Large-v3 reduced errors by 10 to 20% relative to Large-v2 across a wide range of languages.
OpenAI offers Whisper through its commercial API as part of the Audio API suite. The API provides a hosted, managed transcription service that does not require users to run inference infrastructure.
As of early 2026, the OpenAI Audio API offers several transcription models:
| Model | Price per Minute | Notes |
|---|---|---|
| whisper-1 (legacy) | $0.006 | Original hosted Whisper model |
| gpt-4o-transcribe | $0.006 | Improved accuracy over Whisper, lower WER |
| gpt-4o-mini-transcribe | $0.003 | 50% cheaper, strong accuracy |
The API bills based on the duration of the uploaded audio file, not the amount of detected speech. Uploading a 10-minute file with 9 minutes of silence incurs charges for the full 10 minutes. The maximum file size is 25 MB, and the API accepts formats including MP3, WAV, FLAC, M4A, WEBM, MPGA, and MPEG.
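Because billing follows the uploaded duration rather than the amount of detected speech, cost estimation is simple arithmetic; a sketch with a hypothetical helper, using the prices from the table above:

```python
def transcription_cost(duration_seconds: float, price_per_minute: float = 0.006) -> float:
    """Estimate the hosted-API cost for an upload; silence is billed like speech."""
    return round(duration_seconds / 60 * price_per_minute, 6)

# A 10-minute file costs 6 cents at whisper-1 rates, even if it is 9 minutes of silence
print(transcription_cost(600))         # 0.06
print(transcription_cost(600, 0.003))  # 0.03 with gpt-4o-mini-transcribe pricing
```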
The Whisper API supports several features:

- Multiple response formats, including plain text, JSON, and SRT/VTT subtitle files.
- Optional segment-level and word-level timestamps via the verbose JSON format.
- Prompt conditioning, where a short text prompt can steer style or the spelling of names and jargon.
- An optional language parameter that skips automatic language detection.
- A separate translation endpoint that produces English text from non-English audio.
In March 2025, OpenAI introduced gpt-4o-transcribe and gpt-4o-mini-transcribe as successors to the Whisper-based API. These models build on the GPT-4o architecture and deliver improvements over Whisper in several areas: lower word error rates across 33 tested languages, built-in noise cancellation, a semantic voice activity detector for improved endpoint detection, and approximately 90% fewer hallucinations compared to Whisper v2. OpenAI currently recommends gpt-4o-mini-transcribe for most use cases due to its balance of cost, accuracy, and reliability.
Whisper was released as open-source software on September 21, 2022, under the MIT license. OpenAI published both the model weights and the inference code on GitHub, making it one of the few major AI models from OpenAI to be fully open-sourced. This decision spurred rapid and widespread community adoption.
As of early 2026, the main Whisper repository on GitHub has accumulated over 96,000 stars and nearly 12,000 forks, making it one of the most popular open-source machine learning repositories. The model weights are also hosted on Hugging Face, where they have been downloaded millions of times. The open-source release catalyzed a broad ecosystem of third-party tools, optimized implementations, and derivative models.
The open-source nature of Whisper has led to a rich ecosystem of community-built tools that optimize, extend, or adapt the model for specific use cases.
faster-whisper is a reimplementation of Whisper using CTranslate2, an optimized C++ inference engine originally developed for translation models. It focuses on efficiency through quantization (INT8 and FP16) to reduce memory usage and speed up inference. faster-whisper achieves up to 4 times the speed of the original OpenAI implementation while maintaining the same accuracy and using less memory. The project has gained over 21,000 GitHub stars and is a popular choice for server-side deployments. It supports both CPU and GPU execution.
whisper.cpp is a plain C/C++ port of the Whisper model developed by Georgi Gerganov (the creator of llama.cpp). Initially designed for CPU-only inference, it has since added CUDA and Core ML support. whisper.cpp uses the GGML tensor library and supports quantized model formats (GGML), reducing model file sizes from 2.9 GB for the Large model down to smaller quantized representations. The project has over 47,000 GitHub stars and is the foundation for many desktop and mobile applications that run Whisper locally.
WhisperX extends Whisper into a full speech processing pipeline rather than focusing solely on transcription speed. It calls faster-whisper under the hood for transcription, then adds additional processing layers: voice activity detection (VAD) to segment audio properly before transcription, forced alignment using wav2vec2 models to obtain precise word-level timestamps, and optional speaker diarization using pyannote-audio to label which speaker produced each segment. WhisperX has over 20,000 GitHub stars and is commonly used for applications requiring accurate word timing and speaker identification.
insanely-fast-whisper is a CLI tool built on top of Hugging Face Transformers that focuses on maximizing GPU throughput. It restructures attention computations and leverages BetterTransformer (an optimized transformer runtime) and FlashAttention-2 (a memory-efficient attention algorithm) to push inference speed to its limits on high-end NVIDIA GPUs. It is best suited for batch processing large volumes of audio on powerful GPU hardware.
WhisperKit is a Swift package developed by Argmax that deploys Whisper models natively on Apple Silicon using Core ML. It supports real-time streaming, word timestamps, voice activity detection, and speaker diarization, all running entirely on-device without cloud connectivity. WhisperKit is used in iOS and macOS applications where privacy and offline capability are requirements.
| Tool | Language | Primary Optimization | GPU Support | Key Feature | GitHub Stars |
|---|---|---|---|---|---|
| faster-whisper | Python/C++ | CTranslate2 quantization | Yes (CUDA) | 4x faster, less memory | ~21,600 |
| whisper.cpp | C/C++ | GGML quantization | Yes (CUDA, Core ML) | CPU-friendly, small binaries | ~47,600 |
| WhisperX | Python | Pipeline integration | Yes (CUDA) | VAD, alignment, diarization | ~20,700 |
| insanely-fast-whisper | Python | FlashAttention-2 | Yes (CUDA) | Maximum GPU throughput | ~7,000+ |
| WhisperKit | Swift | Core ML (Apple Silicon) | Apple Neural Engine | On-device iOS/macOS | ~5,000+ |
Distil-Whisper is a family of distilled Whisper models created by Hugging Face that applies knowledge distillation to produce smaller, faster versions of Whisper while preserving most of its accuracy. The distillation recipe copies the entire encoder from Whisper and freezes it during training, while reducing the decoder to just two layers (initialized from the first and last decoder layers of the teacher model).
The flagship distil-large-v3 model is 6 times faster than Whisper Large-v3 and 49% smaller in parameter count, while performing within 1% WER on out-of-distribution evaluation sets. On Apple M1 hardware, distil-large-v3 achieves over 5 times the speed of Large-v3 while staying within 0.8% WER on long-form audio.
Key Distil-Whisper releases include:
| Model | Teacher Model | Training Data | Speed vs. Teacher | WER Gap |
|---|---|---|---|---|
| distil-large-v2 | Whisper Large-v2 | 22,000 hours | ~6x faster | < 1% |
| distil-large-v3 | Whisper Large-v3 | 22,000 hours | ~6x faster | < 1% |
| distil-large-v3.5 | Whisper Large-v3 | 98,000 hours | ~6x faster | < 1% |
| distil-small.en | Whisper Small | English subset | ~6x faster | < 1% |
| distil-medium.en | Whisper Medium | English subset | ~6x faster | < 1% |
Distil-Whisper can also be used as an assistant model for speculative decoding with Whisper, providing 2 times faster inference while mathematically guaranteeing identical outputs to the full Whisper model.
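The identical-output guarantee comes from the verification step: every draft token is checked against the token the main model would have produced, so with greedy decoding nothing the draft does can change the final output, only how cheaply it is reached. A toy sketch with stand-in integer-token "models" (not the real Whisper interfaces):

```python
def speculative_greedy(target, draft, prompt, n_tokens, k=4):
    """Toy greedy speculative decoding: the fast draft proposes k tokens at a time,
    the target verifies them, and only target-approved tokens are kept, so the
    output is identical to decoding with the target model alone."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        proposed, ctx = [], list(out)
        for _ in range(k):                 # draft proposes k tokens cheaply
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        for t in proposed:                 # target verifies each proposal in turn
            expected = target(out)
            if t == expected:
                out.append(t)              # accepted draft token
            else:
                out.append(expected)       # rejected: use the target's token, restart drafting
                break
            if len(out) - len(prompt) >= n_tokens:
                break
    return out[len(prompt):]

# Deterministic stand-ins: the target cycles through 3 "tokens"; the draft mostly agrees
target = lambda ctx: (ctx[-1] + 1) % 3
draft = lambda ctx: (ctx[-1] + 1) % 3 if len(ctx) % 5 else 0

# Reference: plain greedy decoding with the target alone
ref, ctx = [], [0]
for _ in range(10):
    ref.append(target(ctx))
    ctx.append(ref[-1])

assert speculative_greedy(target, draft, [0], 10) == ref  # outputs match exactly
print(ref)
```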
Whisper occupies a unique position in the ASR landscape as a large, general-purpose, open-source model trained on weakly supervised web data. This sets it apart from both commercial cloud APIs and other open-source approaches.
| Feature | OpenAI Whisper | Google Cloud Speech-to-Text | AWS Transcribe | Azure Speech |
|---|---|---|---|---|
| Deployment | Open-source / API | Cloud API only | Cloud API only | Cloud API only |
| Price (per minute) | $0.006 (API) / Free (self-hosted) | ~$0.024 | ~$0.024 | ~$0.017 |
| Languages | 99 | 125+ | 100+ | 100+ |
| Real-time streaming | Not native (community tools) | Yes | Yes | Yes |
| Speaker diarization | Via community tools (WhisperX) | Built-in | Built-in | Built-in |
| Custom vocabulary | Via prompt conditioning | Yes | Yes | Yes |
| On-premise / local | Yes | No | No | Limited |
| English WER (clean) | ~2.7% (Large-v3) | ~3-5% | ~4-6% | ~3-5% |
Whisper's primary advantages over commercial APIs are its open-source availability (enabling self-hosting with no per-minute costs), strong multilingual performance, and robustness to background noise. Its main disadvantages are the lack of native real-time streaming, the absence of built-in speaker diarization, and the hallucination issues described below.
Wav2Vec 2.0, developed by Meta, takes a fundamentally different approach. It uses self-supervised pretraining on unlabeled audio, learning speech representations before being fine-tuned on smaller labeled datasets. Whisper, by contrast, is trained end-to-end on a massive weakly supervised dataset. In head-to-head comparisons, Whisper generally achieves lower WER on English and multilingual benchmarks. Whisper Large-v3 averages around 7.4% WER on mixed benchmarks, while Wav2Vec 2.0 can reach 37% WER on clean speech and over 54% on noisy speech in some evaluations. However, Wav2Vec 2.0 has advantages in low-resource language scenarios where fine-tuning on small labeled datasets is the primary option.
Whisper has found adoption across a wide range of applications:
The model's primary use case is batch audio and video transcription. It powers transcription features on platforms such as YouTube (through integrations offered by Hugging Face), podcast transcription services, and meeting note tools. The combination of multilingual support and strong noise robustness makes it well suited for transcribing diverse content.
Whisper is used in accessibility tools that generate real-time captions for deaf and hard-of-hearing users. Community implementations with streaming support (such as WhisperLive and WhisperLiveKit) have enabled near-real-time captioning applications.
OpenAI integrated Whisper into ChatGPT's voice input capabilities when voice mode launched in September 2023. Whisper handles the speech-to-text component of the pipeline, converting spoken input into text that is then processed by the language model. The Realtime API, which reached general availability in August 2025, enables developers to build conversational AI applications with streaming audio input and output.
Whisper's open-source availability has made it a popular baseline for ASR research. Researchers use it to study model robustness, multilingual performance, bias in speech recognition, and the effects of weak supervision. It has also been fine-tuned for specialized domains including medical transcription, legal proceedings, and technical dictation.
While Whisper performs well as a general-purpose model, it can also be fine-tuned on domain-specific data to improve performance for particular use cases. Common fine-tuning targets include improving recognition of medical or technical terminology, boosting accuracy for underrepresented languages, enhancing performance on specific accents or dialects, and adapting to particular audio environments like telephony or broadcast.
One of Whisper's most significant limitations is its tendency to hallucinate, generating text that was never actually spoken in the audio. This is an inherent risk of the sequence-to-sequence architecture combined with weak supervision training. The Associated Press documented Whisper as "plagued by hallucinations." A University of Michigan study found hallucination artifacts in eight out of every ten transcriptions analyzed. Common hallucination patterns include inserting words during silent segments, generating repetitive phrases, and producing plausible-sounding but fabricated content.
Research has identified that more than 75% of non-speech hallucinations in Whisper Large-v3 are attributable to three specific decoder self-attention heads. The "Calm-Whisper" technique, proposed in 2025, addresses this by dampening those attention heads, reducing non-speech hallucinations by over 80% while maintaining transcription accuracy.
OpenAI warns that Whisper should not be used in "high-risk domains" and recommends against using it in "decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes." The newer gpt-4o-transcribe models produce approximately 90% fewer hallucinations compared to Whisper v2.
Whisper's performance varies dramatically across languages due to imbalanced training data. English accounts for approximately 65% of the training data, the 96 non-English languages share the remaining 17% of speech recognition data, and the final 18% is X-to-English translation data. Languages with limited internet-available transcription data, such as many African and Southeast Asian languages, exhibit significantly higher error rates.
The autoregressive decoder can sometimes enter loops, generating the same phrase or sentence repeatedly. This is a known failure mode of sequence-to-sequence architectures and tends to occur more frequently with low-quality or very long audio inputs.
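Applications often guard against this failure mode with simple post-hoc heuristics. A hypothetical check that flags a transcript when the same word n-gram repeats back-to-back:

```python
def has_looping(text: str, n: int = 3, repeats: int = 3) -> bool:
    """Flag transcripts where the same n-gram of words occurs `repeats` times
    in a row -- a cheap heuristic for catching repetition loops."""
    words = text.lower().split()
    for i in range(len(words) - n * repeats + 1):
        gram = words[i:i + n]
        if all(words[i + r * n: i + (r + 1) * n] == gram for r in range(1, repeats)):
            return True
    return False

print(has_looping("thank you thank you thank you thank you", n=2))  # True
print(has_looping("the quick brown fox jumps over the lazy dog"))   # False
```

Flagged segments can then be re-transcribed with a different temperature or discarded.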
Whisper processes audio in 30-second chunks and is designed for batch transcription rather than real-time streaming. While community tools have added streaming capabilities, the model itself does not natively support low-latency real-time transcription.
The larger models require significant computational resources. The Large model needs approximately 10 GB of VRAM, making it impractical for many consumer devices, and even the Turbo variant requires around 6 GB. The smaller models can run on CPUs, but inference on slow processors may be too slow for real-time use.
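Such resource constraints often drive model selection automatically; a minimal sketch using the approximate VRAM figures from the model table (`largest_fitting` is a hypothetical helper):

```python
# Approximate VRAM requirements in GB, accuracy-first preference order
VRAM_GB = {"large": 10, "turbo": 6, "medium": 5, "small": 2, "base": 1, "tiny": 1}
PREFERENCE = ["large", "turbo", "medium", "small", "base", "tiny"]

def largest_fitting(available_gb: float) -> str:
    """Pick the most capable model that fits in the available GPU memory."""
    for name in PREFERENCE:
        if VRAM_GB[name] <= available_gb:
            return name
    raise ValueError("not enough VRAM for any model")

print(largest_fitting(8))    # 'turbo' -- Large needs 10 GB, Turbo fits in 6
print(largest_fitting(1.5))  # 'base'
```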
The official Whisper package can be installed via pip:
```shell
pip install openai-whisper
```
Alternatively, for the latest development version:
```shell
pip install git+https://github.com/openai/whisper.git
```
Command-line transcription:
```shell
whisper audio.mp3 --model large-v3
```
Python API usage:
```python
import whisper

# Load the Large-v3 checkpoint (weights are downloaded on first use)
model = whisper.load_model("large-v3")

# Transcribe a local file; the language is auto-detected unless specified
result = model.transcribe("audio.mp3")
print(result["text"])
```
Several notable developments have occurred since Whisper's initial release:

- Improved checkpoints: Large-v2 (December 2022), Large-v3 (November 2023), and the faster Large-v3 Turbo (October 2024).
- A broad third-party ecosystem, including faster-whisper, whisper.cpp, WhisperX, and WhisperKit.
- Distilled variants (Distil-Whisper) that trade little accuracy for large speedups.
- API-only successors, gpt-4o-transcribe and gpt-4o-mini-transcribe (March 2025).
While OpenAI's commercial API has moved toward GPT-4o-based transcription models, the open-source Whisper models remain widely used for self-hosted deployments, edge computing, and research applications where cost, privacy, and customization are priorities.