Whisper is an automatic speech recognition (ASR) system developed by OpenAI, first released in September 2022, that generates transcriptions and translations from an audio track. OpenAI stated that the model had been "trained on 680,000 hours of multilingual and multitask supervised data collected from the web," approaching "human level robustness and accuracy on English speech recognition." Unlike DALL-E 2 and GPT-3, Whisper was made publicly available as an open-source project on GitHub under a permissive MIT license. Since its release, Whisper has become one of the most widely adopted open-source speech recognition systems, with its GitHub repository accumulating over 96,000 stars and spawning a large ecosystem of community-built tools and derivative implementations.
In the fields of artificial intelligence (AI) and machine learning, speech recognition remains a challenging problem. However, highly capable speech recognition systems have been developed and are used by major corporations like Google, Amazon, and Meta. According to OpenAI, Whisper differs from those systems because it was trained on hundreds of thousands of hours of multilingual and "multitask" data gathered from the web, increasing its robustness in the recognition of accents, background noise, and technical language. This leads to strong speech transcription across different languages, even with poor-quality audio or excessive background noise. The model supports transcription for 99 languages and translation from those languages into English. While OpenAI's analysis of other languages besides English is not comprehensive, users have reported good results across many language families.
Most of the software's audio dataset consists of English, but about a third comes from other languages. Whisper's capability in speech recognition stems from this large and diverse dataset. The model performs competitively against commercial ASR systems like Alexa, Siri, and Google Assistant in many benchmark scenarios. Whisper's high accuracy and ease of use have enabled the integration of voice interfaces into a broader range of applications. The model was released in multiple sizes, trading accuracy for speed (the smallest model is roughly an order of magnitude faster than the largest), giving users the option to run it on laptops, desktops, mobile devices, or cloud servers. Because it is open source, developers can also deploy it on their own servers, improving data privacy.
Figure 1. Whisper architecture. Source: OpenAI.
Whisper uses an encoder-decoder transformer architecture, sometimes referred to as a sequence-to-sequence model. This design is conceptually similar to the architecture used in language models like GPT-3 and translation models, but adapted for audio input rather than text. The entire pipeline can be broken down into an audio preprocessing stage, an encoder that processes acoustic features, and a decoder that generates text tokens.
Before audio enters the transformer, it goes through several preprocessing steps. The raw input audio is first resampled to 16,000 Hz (16 kHz). The audio is then divided into non-overlapping 30-second segments, which are processed sequentially. Each 30-second chunk is converted into an 80-channel log-magnitude Mel spectrogram representation, computed using 25-millisecond windows with a 10-millisecond stride. For the Large-v3 and Large-v3 Turbo models, the number of Mel frequency bins was increased from 80 to 128, providing finer frequency resolution. The resulting spectrogram is normalized to have approximately zero mean and a range of [-1, 1].
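The arithmetic behind these constants is easy to verify; a minimal sketch in pure Python, using the values described above:

```python
SAMPLE_RATE = 16_000     # Hz, after resampling
CHUNK_SECONDS = 30       # fixed segment length
WINDOW_MS, HOP_MS = 25, 10
N_MELS = 80              # 128 for Large-v3 and Large-v3 Turbo

n_samples = SAMPLE_RATE * CHUNK_SECONDS       # samples per chunk
n_fft = SAMPLE_RATE * WINDOW_MS // 1000       # 400-sample analysis window
hop = SAMPLE_RATE * HOP_MS // 1000            # 160-sample stride
n_frames = n_samples // hop                   # spectrogram frames per chunk

print(n_samples, n_fft, hop, n_frames)        # 480000 400 160 3000
print("spectrogram shape:", (N_MELS, n_frames))
```

Each 30-second chunk thus becomes an 80 x 3,000 (or 128 x 3,000) spectrogram before entering the encoder.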
The spectrogram representation is first passed through a small convolutional stem consisting of two convolutional layers (Conv1D) with GELU activation functions. These convolutional layers downsample the input and extract local acoustic features. Sinusoidal positional encodings are then added to the feature representations before they enter the main transformer encoder blocks. The encoder consists of a stack of standard transformer blocks, each containing multi-head self-attention and feed-forward layers with residual connections and layer normalization. The number of encoder layers varies by model size, ranging from 4 layers in the Tiny model to 32 layers in the Large model.
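The sinusoidal positional encodings follow the standard transformer construction. A sketch in NumPy, assuming the Tiny model's 384-channel width and the 1,500 encoder positions that remain after the stride-2 convolution halves the 3,000 spectrogram frames:

```python
import numpy as np

def sinusoidal_positions(length: int, channels: int) -> np.ndarray:
    """Standard sinusoidal positional encodings over `length` positions."""
    assert channels % 2 == 0
    # Log-spaced inverse frequencies across half the channel dimension
    log_timescale = np.log(10_000.0) / (channels // 2 - 1)
    inv_freq = np.exp(-log_timescale * np.arange(channels // 2))
    angles = np.arange(length)[:, None] * inv_freq[None, :]
    # First half of the channels holds sines, second half cosines
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

pe = sinusoidal_positions(1500, 384)  # Tiny's hidden dimension is 384
print(pe.shape)  # (1500, 384)
```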
The decoder follows a standard autoregressive transformer decoder design. It has the same width (hidden dimension) and number of transformer blocks as the encoder. The decoder uses learned position embeddings and tied input-output token representations, meaning the same weight matrix is shared for both the input embedding layer and the output prediction layer. Whisper uses a byte-pair encoding (BPE) tokenizer similar to the one used in GPT-2. The multilingual models use a vocabulary of 51,865 tokens, while the English-only models use the GPT-2 tokenizer with 50,257 tokens. The decoder generates text tokens one at a time, conditioned on the encoder output and previously generated tokens.
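Weight tying can be illustrated in a few lines of NumPy: one matrix serves as both the input embedding lookup and the output projection. The token ids and random weights below are placeholders, not Whisper's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 51_865, 384    # multilingual vocabulary size, Tiny width
# One shared embedding matrix (random stand-in for trained weights)
E = rng.standard_normal((vocab, d_model), dtype=np.float32) * 0.02

token_ids = np.array([7, 42])   # arbitrary placeholder token ids
x = E[token_ids]                # input side: embedding lookup
h = x                           # stand-in for the decoder's hidden states
logits = h @ E.T                # output side: the same matrix projects to the vocabulary
print(logits.shape)             # (2, 51865)
```

Tying halves the parameter count of the embedding/projection pair and couples the input and output representations of each token.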
A distinctive feature of Whisper's architecture is its use of special tokens that allow a single model to handle multiple tasks. The decoder is guided by a sequence of special tokens that specify:
- The language of the audio, indicated by a language token (e.g., <|en|>, <|fr|>, <|zh|>).
- The task: <|transcribe|> for speech recognition in the source language or <|translate|> for translation into English.
- Whether to predict timestamps: the <|notimestamps|> token suppresses timestamp prediction. When this token is absent, the decoder predicts timestamps quantized to 20-millisecond intervals, enabling word-level and segment-level timing information.
- Whether the segment contains speech at all, indicated by the <|nospeech|> token.

This multitask design means Whisper functions as a general-purpose audio model rather than a narrowly scoped transcription tool.
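The resulting decoder prompt can be sketched with a small helper (`build_prompt` is hypothetical; the real tokenizer maps these marker strings to integer token ids):

```python
def build_prompt(language: str, task: str, timestamps: bool) -> list[str]:
    """Assemble the special-token prefix that steers the Whisper decoder."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")  # suppress timestamp prediction
    return tokens

print(build_prompt("fr", "transcribe", timestamps=True))
# ['<|startoftranscript|>', '<|fr|>', '<|transcribe|>']
print(build_prompt("de", "translate", timestamps=False))
# ['<|startoftranscript|>', '<|de|>', '<|translate|>', '<|notimestamps|>']
```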
Figure 2. Different Whisper models. Source: OpenAI.
Whisper ships in several model sizes, each with a different balance of accuracy, speed, and resource requirements. English-only variants (.en) of the smaller models are available and tend to perform better on English tasks at their respective sizes, particularly for Tiny and Base. The larger models (Large and Turbo) are available only in multilingual form.
| Model | Parameters | Encoder Layers | Decoder Layers | Hidden Dimension | Attention Heads | Approx. VRAM | Relative Speed |
|---|---|---|---|---|---|---|---|
| Tiny | 39 M | 4 | 4 | 384 | 6 | ~1 GB | ~10x |
| Base | 74 M | 6 | 6 | 512 | 8 | ~1 GB | ~7x |
| Small | 244 M | 12 | 12 | 768 | 12 | ~2 GB | ~4x |
| Medium | 769 M | 24 | 24 | 1024 | 16 | ~5 GB | ~2x |
| Large | 1,550 M | 32 | 32 | 1280 | 20 | ~10 GB | 1x |
| Large-v3 Turbo | 809 M | 32 | 4 | 1280 | 20 | ~6 GB | ~8x |
The English-only variants (tiny.en, base.en, small.en, medium.en) share the same architectures as their multilingual counterparts but are trained exclusively on English data. No English-only variant exists for the Large or Turbo models.
OpenAI has released several iterations of the Large model:

- Large (v1), included in the original September 2022 release.
- Large-v2 (December 2022), trained for more epochs with additional regularization, improving accuracy without architectural changes.
- Large-v3 (November 2023), trained on roughly 5 million hours of audio and using 128 Mel frequency bins instead of 80.
- Large-v3 Turbo (October 2024), a pruned version of Large-v3 that keeps the full encoder but reduces the decoder from 32 to 4 layers for much faster inference.
The original Whisper models were trained on 680,000 hours of multilingual and multitask supervised data collected from the internet. OpenAI describes this as "weakly supervised" data because the transcript labels were not hand-verified by human annotators. Instead, the labels came from sources such as subtitles, closed captions, and other user-generated text paired with audio found on the web.
The dataset composition breaks down roughly as follows:
| Category | Hours | Share |
|---|---|---|
| English speech recognition | ~438,000 | ~65% |
| Non-English speech recognition (96 languages) | ~117,000 | ~17% |
| X-to-English translation | ~125,000 | ~18% |
For Large-v3, the training data was expanded dramatically to approximately 5 million hours total, consisting of 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio. The pseudo-labeled portion was generated by running Large-v2 over unlabeled audio and using its transcription outputs as training labels, a form of self-training.
The data pipeline involved several filtering and cleaning steps:

- Heuristic removal of machine-generated transcripts (for example, text that is entirely uppercase or lowercase, or lacks punctuation, suggesting it was produced by another ASR system).
- Audio language detection, discarding pairs where the spoken language did not match the transcript language (unless the transcript was English, in which case the pair was reused as translation data).
- De-duplication of transcripts to reduce repeated training text.
- After an initial training run, manual inspection of high-error-rate sources to remove remaining low-quality data.
After filtering, audio files were segmented into 30-second chunks paired with the corresponding portion of the transcript.
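The chunking step can be sketched as follows (a simplified illustration using a hypothetical `chunk_bounds` helper; the real pipeline also pairs each window with the matching slice of the transcript):

```python
def chunk_bounds(duration_s: float, chunk_s: float = 30.0):
    """Yield (start, end) boundaries for fixed-length training segments."""
    t = 0.0
    while t < duration_s:
        yield (t, min(t + chunk_s, duration_s))
        t += chunk_s

# A 75-second recording becomes two full chunks plus a short final one
print(list(chunk_bounds(75.0)))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```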
Whisper is trained as a multitask model, jointly learning to perform speech recognition, speech translation, language identification, and voice activity detection. During training, special tokens prepended to the decoder input specify which task to perform for a given audio segment. This approach is similar in spirit to the text-based multitask training used in models like T5 and GPT-3, where a single model learns multiple capabilities through formatted input sequences.
The multitask training framework enables four core capabilities:

- Multilingual speech transcription in the source language.
- Speech translation from any supported language into English.
- Language identification of the spoken audio.
- Voice activity detection, determining whether a segment contains speech at all.
Because the model was trained on a very large and diverse dataset rather than a smaller, curated dataset, Whisper does not necessarily beat models that specialize in specific benchmarks like LibriSpeech. However, its zero-shot performance is substantially more robust, with roughly 50% fewer errors than comparable models when evaluated on out-of-distribution data.
Figure 3. Comparison of Whisper's Word-error-rate (WER) with other models. Source: OpenAI.
Whisper's accuracy is measured primarily using word error rate (WER), the standard metric for ASR systems. On clean English speech, Whisper Large-v3 achieves WER as low as 2.7%. On mixed real-world recordings with varying audio quality, WER increases to approximately 7 to 8%. The newer gpt-4o-transcribe models released by OpenAI in March 2025 achieve an even lower 2.46% WER on English benchmarks.
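WER is the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the model output, divided by the number of reference words. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution / 6 words ≈ 0.167
```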
Performance degrades on particularly challenging audio. On low-quality call center recordings, for example, WER can rise to around 17 to 18%.
Figure 4. Whisper's WER as a function of language. Source: OpenAI.
Multilingual performance varies significantly depending on how well a language is represented in the training data. Languages with substantial training data, such as Spanish, French, German, Italian, Portuguese, Japanese, and Korean, achieve relatively low word error rates. Romance languages and other well-resourced European languages tend to perform best after English.
However, for languages less represented in the training set, WER can exceed 20% or climb far higher. OpenAI officially lists only languages where Whisper achieves better than 50% WER. Low-resource languages and those with little internet-available transcription data show markedly worse accuracy. Large-v3 reduced errors by 10 to 20% relative to Large-v2 across a wide range of languages.
OpenAI offers Whisper through its commercial API as part of the Audio API suite. The API provides a hosted, managed transcription service that does not require users to run inference infrastructure.
As of early 2026, the OpenAI Audio API offers several transcription models:
| Model | Price per Minute | Notes |
|---|---|---|
| whisper-1 (legacy) | $0.006 | Original hosted Whisper model |
| gpt-4o-transcribe | $0.006 | Improved accuracy over Whisper, lower WER |
| gpt-4o-mini-transcribe | $0.003 | 50% cheaper, strong accuracy |
The API bills based on the duration of the uploaded audio file, not the amount of detected speech. Uploading a 10-minute file with 9 minutes of silence incurs charges for the full 10 minutes. The maximum file size is 25 MB, and the API accepts formats including MP3, WAV, FLAC, M4A, WEBM, MPGA, and MPEG.
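Because billing follows the uploaded duration rather than the amount of detected speech, cost estimation is simple arithmetic; a sketch with a hypothetical helper, using the prices from the table above:

```python
def transcription_cost(duration_seconds: float, price_per_minute: float = 0.006) -> float:
    """Estimate the hosted-API cost for an upload; silence is billed like speech."""
    return round(duration_seconds / 60 * price_per_minute, 6)

# A 10-minute file costs 6 cents at whisper-1 rates, even if it is 9 minutes of silence
print(transcription_cost(600))         # 0.06
print(transcription_cost(600, 0.003))  # 0.03 with gpt-4o-mini-transcribe pricing
```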
The Whisper API supports several features:

- Multiple response formats, including plain text, JSON, and SRT/VTT subtitle files.
- Optional segment-level and word-level timestamps via the verbose JSON format.
- Prompt conditioning, where a short text prompt can steer style or the spelling of names and jargon.
- An optional language parameter that skips automatic language detection.
- A separate translation endpoint that produces English text from non-English audio.
In March 2025, OpenAI introduced gpt-4o-transcribe and gpt-4o-mini-transcribe as successors to the Whisper-based API. These models build on the GPT-4o architecture and deliver improvements over Whisper in several areas: lower word error rates across 33 tested languages, built-in noise cancellation, a semantic voice activity detector for improved endpoint detection, and approximately 90% fewer hallucinations compared to Whisper v2. OpenAI currently recommends gpt-4o-mini-transcribe for most use cases due to its balance of cost, accuracy, and reliability.
Whisper was released as open-source software on September 21, 2022, under the MIT license. OpenAI published both the model weights and the inference code on GitHub, making it one of the few major AI models from OpenAI to be fully open-sourced. This decision spurred rapid and widespread community adoption.
As of early 2026, the main Whisper repository on GitHub has accumulated over 96,000 stars and nearly 12,000 forks, making it one of the most popular open-source machine learning repositories. The model weights are also hosted on Hugging Face, where they have been downloaded millions of times. The open-source release catalyzed a broad ecosystem of third-party tools, optimized implementations, and derivative models.
The open-source nature of Whisper has led to a rich ecosystem of community-built tools that optimize, extend, or adapt the model for specific use cases.
faster-whisper is a reimplementation of Whisper using CTranslate2, an optimized C++ inference engine originally developed for translation models. It focuses on efficiency through quantization (INT8 and FP16) to reduce memory usage and speed up inference. faster-whisper achieves up to 4 times the speed of the original OpenAI implementation while maintaining the same accuracy and using less memory. The project has gained over 21,000 GitHub stars and is a popular choice for server-side deployments. It supports both CPU and GPU execution.
whisper.cpp is a plain C/C++ port of the Whisper model developed by Georgi Gerganov (the creator of llama.cpp). Initially designed for CPU-only inference, it has since added CUDA and Core ML support. whisper.cpp uses the GGML tensor library and supports quantized model formats (GGML), reducing model file sizes from 2.9 GB for the Large model down to smaller quantized representations. The project has over 47,000 GitHub stars and is the foundation for many desktop and mobile applications that run Whisper locally.
WhisperX extends Whisper into a full speech processing pipeline rather than focusing solely on transcription speed. It calls faster-whisper under the hood for transcription, then adds additional processing layers: voice activity detection (VAD) to segment audio properly before transcription, forced alignment using wav2vec2 models to obtain precise word-level timestamps, and optional speaker diarization using pyannote-audio to label which speaker produced each segment. WhisperX has over 20,000 GitHub stars and is commonly used for applications requiring accurate word timing and speaker identification.
insanely-fast-whisper is a CLI tool built on top of Hugging Face Transformers that focuses on maximizing GPU throughput. It restructures attention computations and leverages BetterTransformer (an optimized transformer runtime) and FlashAttention-2 (a memory-efficient attention algorithm) to push inference speed to its limits on high-end NVIDIA GPUs. It is best suited for batch processing large volumes of audio on powerful GPU hardware.
WhisperKit is a Swift package developed by Argmax that deploys Whisper models natively on Apple Silicon using Core ML. It supports real-time streaming, word timestamps, voice activity detection, and speaker diarization, all running entirely on-device without cloud connectivity. WhisperKit is used in iOS and macOS applications where privacy and offline capability are requirements.
| Tool | Language | Primary Optimization | GPU Support | Key Feature | GitHub Stars |
|---|---|---|---|---|---|
| faster-whisper | Python/C++ | CTranslate2 quantization | Yes (CUDA) | 4x faster, less memory | ~21,600 |
| whisper.cpp | C/C++ | GGML quantization | Yes (CUDA, Core ML) | CPU-friendly, small binaries | ~47,600 |
| WhisperX | Python | Pipeline integration | Yes (CUDA) | VAD, alignment, diarization | ~20,700 |
| insanely-fast-whisper | Python | FlashAttention-2 | Yes (CUDA) | Maximum GPU throughput | ~7,000+ |
| WhisperKit | Swift | Core ML (Apple Silicon) | Apple Neural Engine | On-device iOS/macOS | ~5,000+ |
Distil-Whisper is a family of distilled Whisper models created by Hugging Face that applies knowledge distillation to produce smaller, faster versions of Whisper while preserving most of its accuracy. The distillation recipe copies the entire encoder from Whisper and freezes it during training, while reducing the decoder to just two layers (initialized from the first and last decoder layers of the teacher model).
The flagship distil-large-v3 model is 6 times faster than Whisper Large-v3 and 49% smaller in parameter count, while performing within 1% WER on out-of-distribution evaluation sets. On Apple M1 hardware, distil-large-v3 achieves over 5 times the speed of Large-v3 while staying within 0.8% WER on long-form audio.
Key Distil-Whisper releases include:
| Model | Teacher Model | Training Data | Speed vs. Teacher | WER Gap |
|---|---|---|---|---|
| distil-large-v2 | Whisper Large-v2 | 22,000 hours | ~6x faster | < 1% |
| distil-large-v3 | Whisper Large-v3 | 22,000 hours | ~6x faster | < 1% |
| distil-large-v3.5 | Whisper Large-v3 | 98,000 hours | ~6x faster | < 1% |
| distil-small.en | Whisper Small | English subset | ~6x faster | < 1% |
| distil-medium.en | Whisper Medium | English subset | ~6x faster | < 1% |
Distil-Whisper can also be used as an assistant model for speculative decoding with Whisper, providing 2 times faster inference while mathematically guaranteeing identical outputs to the full Whisper model.
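The identical-output guarantee comes from the verification step: every draft token is checked against the token the main model would have produced, so with greedy decoding nothing the draft does can change the final output, only how cheaply it is reached. A toy sketch with stand-in integer-token "models" (not the real Whisper interfaces):

```python
def speculative_greedy(target, draft, prompt, n_tokens, k=4):
    """Toy greedy speculative decoding: the fast draft proposes k tokens at a time,
    the target verifies them, and only target-approved tokens are kept, so the
    output is identical to decoding with the target model alone."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        proposed, ctx = [], list(out)
        for _ in range(k):                 # draft proposes k tokens cheaply
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        for t in proposed:                 # target verifies each proposal in turn
            expected = target(out)
            if t == expected:
                out.append(t)              # accepted draft token
            else:
                out.append(expected)       # rejected: use the target's token, restart drafting
                break
            if len(out) - len(prompt) >= n_tokens:
                break
    return out[len(prompt):]

# Deterministic stand-ins: the target cycles through 3 "tokens"; the draft mostly agrees
target = lambda ctx: (ctx[-1] + 1) % 3
draft = lambda ctx: (ctx[-1] + 1) % 3 if len(ctx) % 5 else 0

# Reference: plain greedy decoding with the target alone
ref, ctx = [], [0]
for _ in range(10):
    ref.append(target(ctx))
    ctx.append(ref[-1])

assert speculative_greedy(target, draft, [0], 10) == ref  # outputs match exactly
print(ref)
```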
Whisper occupies a unique position in the ASR landscape as a large, general-purpose, open-source model trained on weakly supervised web data. This sets it apart from both commercial cloud APIs and other open-source approaches.
| Feature | OpenAI Whisper | Google Cloud Speech-to-Text | AWS Transcribe | Azure Speech |
|---|---|---|---|---|
| Deployment | Open-source / API | Cloud API only | Cloud API only | Cloud API only |
| Price (per minute) | $0.006 (API) / Free (self-hosted) | ~$0.024 | ~$0.024 | ~$0.017 |
| Languages | 99 | 125+ | 100+ | 100+ |
| Real-time streaming | Not native (community tools) | Yes | Yes | Yes |
| Speaker diarization | Via community tools (WhisperX) | Built-in | Built-in | Built-in |
| Custom vocabulary | Via prompt conditioning | Yes | Yes | Yes |
| On-premise / local | Yes | No | No | Limited |
| English WER (clean) | ~2.7% (Large-v3) | ~3-5% | ~4-6% | ~3-5% |
Whisper's primary advantages over commercial APIs are its open-source availability (enabling self-hosting with no per-minute costs), strong multilingual performance, and robustness to background noise. Its main disadvantages are the lack of native real-time streaming, the absence of built-in speaker diarization, and the hallucination issues described below.
Wav2Vec 2.0, developed by Meta, takes a fundamentally different approach. It uses self-supervised pretraining on unlabeled audio, learning speech representations before being fine-tuned on smaller labeled datasets. Whisper, by contrast, is trained end-to-end on a massive weakly supervised dataset. In head-to-head comparisons, Whisper generally achieves lower WER on English and multilingual benchmarks. Whisper Large-v3 averages around 7.4% WER on mixed benchmarks, while Wav2Vec 2.0 can reach 37% WER on clean speech and over 54% on noisy speech in some evaluations. However, Wav2Vec 2.0 has advantages in low-resource language scenarios where fine-tuning on small labeled datasets is the primary option.
Whisper has found adoption across a wide range of applications:
The model's primary use case is batch audio and video transcription. It powers transcription features on platforms such as YouTube (through integrations offered by Hugging Face), podcast transcription services, and meeting note tools. The combination of multilingual support and strong noise robustness makes it well suited for transcribing diverse content.
Whisper is used in accessibility tools that generate real-time captions for deaf and hard-of-hearing users. Community implementations with streaming support (such as WhisperLive and WhisperLiveKit) have enabled near-real-time captioning applications.
OpenAI integrated Whisper into ChatGPT's voice input capabilities when voice mode launched in September 2023. Whisper handles the speech-to-text component of the pipeline, converting spoken input into text that is then processed by the language model. The Realtime API, which reached general availability in August 2025, enables developers to build conversational AI applications with streaming audio input and output.
Whisper's open-source availability has made it a popular baseline for ASR research. Researchers use it to study model robustness, multilingual performance, bias in speech recognition, and the effects of weak supervision. It has also been fine-tuned for specialized domains including medical transcription, legal proceedings, and technical dictation.
While Whisper performs well as a general-purpose model, it can also be fine-tuned on domain-specific data to improve performance for particular use cases. Common fine-tuning targets include improving recognition of medical or technical terminology, boosting accuracy for underrepresented languages, enhancing performance on specific accents or dialects, and adapting to particular audio environments like telephony or broadcast.
One of Whisper's most significant limitations is its tendency to hallucinate, generating text that was never actually spoken in the audio. This is an inherent risk of the sequence-to-sequence architecture combined with weak supervision training. The Associated Press documented Whisper as "plagued by hallucinations." A University of Michigan study found hallucination artifacts in eight out of every ten transcriptions analyzed. Common hallucination patterns include inserting words during silent segments, generating repetitive phrases, and producing plausible-sounding but fabricated content.
Research has identified that more than 75% of non-speech hallucinations in Whisper Large-v3 are attributable to three specific decoder self-attention heads. The "Calm-Whisper" technique, proposed in 2025, addresses this by dampening those attention heads, reducing non-speech hallucinations by over 80% while maintaining transcription accuracy.
OpenAI warns that Whisper should not be used in "high-risk domains" and recommends against using it in "decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes." The newer gpt-4o-transcribe models produce approximately 90% fewer hallucinations compared to Whisper v2.
Whisper's performance varies dramatically across languages due to imbalanced training data. English accounts for approximately 65% of the training data, the 96 non-English languages share the remaining 17% of speech recognition data, and the final 18% is X-to-English translation data. Languages with limited internet-available transcription data, such as many African and Southeast Asian languages, exhibit significantly higher error rates.
The autoregressive decoder can sometimes enter loops, generating the same phrase or sentence repeatedly. This is a known failure mode of sequence-to-sequence architectures and tends to occur more frequently with low-quality or very long audio inputs.
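Applications often guard against this failure mode with simple post-hoc heuristics. A hypothetical check that flags a transcript when the same word n-gram repeats back-to-back:

```python
def has_looping(text: str, n: int = 3, repeats: int = 3) -> bool:
    """Flag transcripts where the same n-gram of words occurs `repeats` times
    in a row -- a cheap heuristic for catching repetition loops."""
    words = text.lower().split()
    for i in range(len(words) - n * repeats + 1):
        gram = words[i:i + n]
        if all(words[i + r * n: i + (r + 1) * n] == gram for r in range(1, repeats)):
            return True
    return False

print(has_looping("thank you thank you thank you thank you", n=2))  # True
print(has_looping("the quick brown fox jumps over the lazy dog"))   # False
```

Flagged segments can then be re-transcribed with a different temperature or discarded.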
Whisper processes audio in 30-second chunks and is designed for batch transcription rather than real-time streaming. While community tools have added streaming capabilities, the model itself does not natively support low-latency real-time transcription.
The larger models require significant computational resources. The Large model needs approximately 10 GB of VRAM, making it impractical for many consumer devices, and even the Turbo variant requires around 6 GB. The smaller models can run on CPUs, but inference on slow processors may be too slow for real-time use.
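Such resource constraints often drive model selection automatically; a minimal sketch using the approximate VRAM figures from the model table (`largest_fitting` is a hypothetical helper):

```python
# Approximate VRAM requirements in GB, accuracy-first preference order
VRAM_GB = {"large": 10, "turbo": 6, "medium": 5, "small": 2, "base": 1, "tiny": 1}
PREFERENCE = ["large", "turbo", "medium", "small", "base", "tiny"]

def largest_fitting(available_gb: float) -> str:
    """Pick the most capable model that fits in the available GPU memory."""
    for name in PREFERENCE:
        if VRAM_GB[name] <= available_gb:
            return name
    raise ValueError("not enough VRAM for any model")

print(largest_fitting(8))    # 'turbo' -- Large needs 10 GB, Turbo fits in 6
print(largest_fitting(1.5))  # 'base'
```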
The official Whisper package can be installed via pip:
```shell
pip install openai-whisper
```
Alternatively, for the latest development version:
```shell
pip install git+https://github.com/openai/whisper.git
```
Command-line transcription:
```shell
whisper audio.mp3 --model large-v3
```
Python API usage:
```python
import whisper

# Load the Large-v3 checkpoint (weights are downloaded on first use)
model = whisper.load_model("large-v3")

# Transcribe a local file; the language is auto-detected unless specified
result = model.transcribe("audio.mp3")
print(result["text"])
```
Several notable developments have occurred since Whisper's initial release:

- Improved checkpoints: Large-v2 (December 2022), Large-v3 (November 2023), and the faster Large-v3 Turbo (October 2024).
- A broad third-party ecosystem, including faster-whisper, whisper.cpp, WhisperX, and WhisperKit.
- Distilled variants (Distil-Whisper) that trade little accuracy for large speedups.
- API-only successors, gpt-4o-transcribe and gpt-4o-mini-transcribe (March 2025).
While OpenAI's commercial API has moved toward GPT-4o-based transcription models, the open-source Whisper models remain widely used for self-hosted deployments, edge computing, and research applications where cost, privacy, and customization are priorities.