NVIDIA Parakeet
Last reviewed
Jun 3, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 1,702 words
Parakeet is a family of open automatic speech recognition (ASR) models developed by NVIDIA as part of the NeMo conversational AI toolkit. The models are designed for transcribing spoken English (and, in later releases, additional European languages) with state-of-the-art accuracy and unusually high throughput. Parakeet is best known for repeatedly topping the Hugging Face Open ASR Leaderboard while running far faster than competing systems, a combination achieved by pairing an efficient FastConformer encoder with transducer-style decoders. The earliest Parakeet checkpoints were developed jointly by the NVIDIA NeMo and Suno.ai teams and released in early 2024 under a permissive Creative Commons license.
The name Parakeet is used for the transducer and CTC branch of NVIDIA's NeMo speech models, distinguishing it from Canary, the companion family of attention-based encoder-decoder models that additionally performs speech translation. Both families share the same FastConformer encoder but differ in their decoders and in the breadth of tasks they support.
Open ASR systems are typically benchmarked on the Hugging Face Open ASR Leaderboard, which ranks models by two complementary measures: the average word error rate (WER) across a suite of English test sets such as LibriSpeech, AMI, Earnings-22, GigaSpeech, SPGISpeech, TED-LIUM and VoxPopuli, and the inverse real-time factor (RTFx), which measures transcription speed as the ratio of audio duration to processing time. A higher RTFx means more hours of audio can be transcribed per hour of compute. Parakeet was engineered to win on both axes simultaneously, rather than trading accuracy for speed.
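The two leaderboard measures can be illustrated with a minimal sketch. The function names below are hypothetical, and the code assumes plain whitespace tokenization; real evaluations typically normalize casing and punctuation before scoring, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def inverse_real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTFx: seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # One hour of audio processed in 1.8 seconds.
    print(inverse_real_time_factor(3600, 1.8))
```

A lower WER and a higher RTFx are both better, which is why the leaderboard reports them side by side.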
Four initial Parakeet models were introduced in January 2024 and quickly reached the top of the Open ASR Leaderboard. NVIDIA reported that, as of early March 2024, its Parakeet and Canary models held the top five positions on the leaderboard. The models were trained to be resilient to non-speech audio such as music and silence, and to generalize across a wide range of accents, dialects, vocal ranges and noise conditions.
All Parakeet models use a FastConformer encoder, an optimized variant of the Conformer architecture introduced by NVIDIA in the paper "Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition." Compared with the standard Conformer, FastConformer adds a further 2x convolutional subsampling stage so that the input is downsampled by a factor of 8 overall, uses depthwise-separable convolutions in the subsampling layers, reduces the number of channels in those layers from 512 to 256, and shrinks the convolution kernel size inside the Conformer blocks from 31 to 9. These changes make the encoder roughly 2.8x faster than the original Conformer while preserving accuracy, and they allow the architecture to scale to roughly a billion parameters without structural changes.
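The effect of the extra subsampling stage on sequence length can be sketched with simple arithmetic. The example assumes input features are computed every 10 ms (a common choice, but an assumption here) and ignores padding effects in the convolutional layers; the function names are illustrative only.

```python
FEATURE_HOP_MS = 10  # assumption: one input feature frame every 10 ms


def encoder_frames(audio_seconds: float, subsampling_factor: int) -> int:
    """Approximate number of frames the encoder blocks see after subsampling."""
    feature_frames = int(audio_seconds * 1000) // FEATURE_HOP_MS
    return feature_frames // subsampling_factor


def full_attention_pairs(frames: int) -> int:
    """Number of (query, key) pairs scored by full self-attention."""
    return frames * frames


if __name__ == "__main__":
    audio_seconds = 60
    conformer = encoder_frames(audio_seconds, 4)      # 4x subsampling
    fastconformer = encoder_frames(audio_seconds, 8)  # 8x subsampling
    print(conformer, fastconformer)
    # Halving the sequence length quarters the number of attention pairs.
    print(full_attention_pairs(conformer) / full_attention_pairs(fastconformer))
```

Under these assumptions, each encoder frame covers about 80 ms of audio instead of 40 ms. This is only one contributor to the reported speedup; the reduced channel count and smaller kernels also cut compute.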
FastConformer also supports limited (local) context attention, which lets a single model transcribe very long recordings without running out of memory. NVIDIA reported that the architecture could process up to 11 hours of speech in one pass on an 80 GB NVIDIA A100 GPU using local attention.
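A rough sketch shows why restricting attention to a local window matters for long recordings. It assumes 80 ms encoder frames and a window of 128 frames on each side; both values are illustrative assumptions, not figures from the source, and the helper names are hypothetical.

```python
def local_attention_mask(num_frames: int, left: int, right: int) -> list[list[bool]]:
    """mask[t][s] is True when frame t may attend to frame s."""
    return [
        [t - left <= s <= t + right for s in range(num_frames)]
        for t in range(num_frames)
    ]


def full_attention_pairs(num_frames: int) -> int:
    """Pairs scored when every frame attends to every other frame."""
    return num_frames * num_frames


def local_attention_pairs(num_frames: int, left: int, right: int) -> int:
    """Count allowed (query, key) pairs without materializing the mask."""
    return sum(
        min(t + right, num_frames - 1) - max(t - left, 0) + 1
        for t in range(num_frames)
    )


if __name__ == "__main__":
    FRAME_MS = 80  # assumption: 10 ms features with 8x subsampling
    WINDOW = 128   # assumption: frames visible on each side
    frames = 11 * 3600 * 1000 // FRAME_MS  # 11 hours of audio
    full = full_attention_pairs(frames)
    local = local_attention_pairs(frames, WINDOW, WINDOW)
    print(frames, full, local, full / local)
```

With full attention the number of scored pairs grows quadratically with recording length, while with a fixed window it grows linearly, which is consistent with the "linearly scalable" wording in the paper title.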
Parakeet's distinguishing feature is that the same FastConformer encoder is offered with several different decoders, each making a different trade-off between accuracy and inference speed.
NeMo also ships hybrid TDT-CTC checkpoints (for example, models of around 110M and 1.1B parameters, plus a Japanese variant) that combine a TDT decoder and a CTC decoder on a shared encoder, letting a single model serve either decoding path.
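As an illustration of the simplest of these decoding paths, the following is a minimal sketch of standard greedy CTC decoding, in which the model emits one label per encoder frame (including a special blank), repeated labels are merged, and blanks are dropped. It is not NeMo's implementation; the blank symbol and function name are hypothetical.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real models use a reserved token id


def ctc_greedy_collapse(frame_labels: list[str]) -> list[str]:
    """Collapse per-frame CTC labels into an output token sequence."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks, which only serve to separate repeated tokens.
    return [label for label in merged if label != BLANK]


if __name__ == "__main__":
    frames = ["_", "c", "c", "_", "a", "a", "_", "t", "_"]
    print("".join(ctc_greedy_collapse(frames)))
    # A blank between two identical labels keeps both of them.
    print("".join(ctc_greedy_collapse(["l", "l", "_", "l"])))
```

Because each frame is labeled independently, this path needs no autoregressive loop, which is one reason CTC decoding tends to be the fastest option.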
The first Parakeet release consisted of English-only models at 0.6 billion and 1.1 billion parameters. The Hugging Face model cards for these checkpoints state explicitly that they were "jointly developed by NVIDIA NeMo and Suno.ai teams." They were trained on a 64,000-hour English speech corpus assembled by the two teams, comprising roughly 40,000 hours of private data and about 24,000 hours from public datasets including LibriSpeech, the Fisher Corpus, Switchboard, the Wall Street Journal corpus, the National Speech Corpus, VCTK, VoxPopuli, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice and People's Speech.
The Token-and-Duration Transducer model, Parakeet-TDT-1.1B, was highlighted as the most accurate of the family. NVIDIA reported that it was the first model to achieve an average WER below 7.0 on the Open ASR Leaderboard, and that it ran about 64% faster than the previously best Parakeet model, Parakeet-RNNT-1.1B. All of these checkpoints were released under the CC-BY-4.0 license, which permits commercial use.
| Model | Decoder | Parameters | Open ASR Leaderboard avg WER (%) | RTFx |
|---|---|---|---|---|
| Parakeet-CTC-0.6B | CTC | 0.6B | not listed | very high |
| Parakeet-CTC-1.1B | CTC | 1.1B | not listed | very high |
| Parakeet-RNNT-0.6B | RNN-T | 0.6B | not listed | high |
| Parakeet-RNNT-1.1B | RNN-T | 1.1B | 7.12 | 2,053 |
| Parakeet-TDT-1.1B | TDT | 1.1B | 7.02 | 2,391 |
Reported single-GPU throughput for the 1.1B models illustrates the speed differences between decoders: NVIDIA cited transcription rates of about 1,336 hours of audio per hour of real time for Parakeet-CTC-1.1B, 1,212 hours for Parakeet-TDT-1.1B, and 1,120 hours for Parakeet-RNNT-1.1B. Leaderboard WER and RTFx figures above are drawn from the respective Hugging Face model cards and vary with evaluation batch size and audio length.
In May 2025 NVIDIA released Parakeet-TDT-0.6B-v2, a 600-million-parameter FastConformer-TDT model for English transcription that topped the Hugging Face Open ASR Leaderboard at launch. The model achieved an average WER of 6.05% across the leaderboard's test sets, and NVIDIA reported an RTFx of about 3,380 at batch size 128, equivalent to transcribing roughly 60 minutes of audio in about one second on GPU-accelerated hardware. Beyond raw transcription, the model produces automatic punctuation and capitalization and accurate word-level timestamps, and it was tuned to handle spoken numbers and song-lyric transcription robustly.
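The "60 minutes in about one second" figure follows directly from the definition of RTFx, as this small worked calculation shows. The function names are illustrative, and the result only holds under the reported conditions (batch size 128 on GPU hardware).

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Time needed to transcribe the audio at a given inverse real-time factor."""
    return audio_seconds / rtfx


def audio_minutes_per_second(rtfx: float) -> float:
    """Minutes of audio transcribed per second of processing."""
    return rtfx / 60


if __name__ == "__main__":
    reported_rtfx = 3380  # figure quoted for Parakeet-TDT-0.6B-v2
    print(round(processing_seconds(60 * 60, reported_rtfx), 2))  # seconds for one hour of audio
    print(round(audio_minutes_per_second(reported_rtfx), 1))     # minutes of audio per second
```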
Parakeet-TDT-0.6B-v2 was trained on the Granary dataset, a large open corpus of about 120,000 hours of English audio made up of roughly 10,000 hours of human-transcribed speech and about 110,000 hours of pseudo-labeled data. Like the rest of the family it was released under the CC-BY-4.0 license, making it fully usable in commercial products, and it runs through the NeMo toolkit.
In August 2025 NVIDIA released Parakeet-TDT-0.6B-v3, extending the same 600M-parameter FastConformer-TDT design to 25 European languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian and Ukrainian. It was trained on a much larger Granary corpus of around 660,000 hours of pseudo-labeled multilingual data plus roughly 10,000 hours of human-transcribed speech. On the English Open ASR Leaderboard the v3 model reported an average WER of about 6.34% with an RTFx near 3,333, while also providing competitive multilingual results on benchmarks such as FLEURS, Multilingual LibriSpeech and CoVoST. It retains punctuation, capitalization, and word- and segment-level timestamps, and continues to support long-form audio through local attention.
| Model | Release | Languages | Parameters | Avg WER (English) | RTFx |
|---|---|---|---|---|---|
| Parakeet-TDT-0.6B-v2 | May 2025 | English | 0.6B | 6.05% | ~3,380 |
| Parakeet-TDT-0.6B-v3 | Aug 2025 | 25 European | 0.6B | 6.34% | ~3,333 |
Parakeet and Canary are sibling model families within NeMo that share the FastConformer encoder but diverge in decoder design and scope. Canary uses an attention-based encoder-decoder (AED) architecture, pairing a FastConformer encoder with a Transformer decoder, and is multilingual and multitask: it performs both speech recognition and speech-to-text translation between English and other supported languages. Parakeet, by contrast, uses transducer or CTC decoders and is focused purely on transcription, which tends to make it faster at inference. The two families are frequently released and benchmarked together, and in 2025 NVIDIA paired Canary-1B-v2 with Parakeet-TDT-0.6B-v3 as complementary multilingual offerings built on the shared Granary training data.
All publicly released Parakeet checkpoints are distributed under the CC-BY-4.0 license, which allows commercial use with attribution. The models are available on Hugging Face and through NVIDIA's NGC catalog, and they are run and fine-tuned with the open-source NeMo toolkit, which provides the FastConformer, transducer and TDT implementations. NVIDIA also offers Parakeet models as packaged inference microservices (NVIDIA NIM) for production deployment.
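Running a checkpoint through NeMo typically looks like the following minimal sketch. It assumes the `nemo_toolkit[asr]` package is installed, that the Hugging Face model identifier `nvidia/parakeet-tdt-0.6b-v2` is the one wanted, and that a local audio file exists at the given path; the wrapper function and file name are hypothetical, and the return type of `transcribe` has varied between NeMo versions, so the code handles both plain strings and hypothesis objects.

```python
def transcribe_files(audio_paths, model_name="nvidia/parakeet-tdt-0.6b-v2"):
    """Load a pretrained NeMo ASR model and return one transcript per file."""
    # Imported lazily so this module can be loaded without NeMo installed.
    import nemo.collections.asr as nemo_asr

    # Downloads the checkpoint on first use, then loads it from the local cache.
    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
    results = asr_model.transcribe(list(audio_paths))

    # Newer NeMo versions return hypothesis objects with a .text attribute;
    # older ones return plain strings.
    return [getattr(result, "text", result) for result in results]


if __name__ == "__main__":
    # "sample.wav" is a placeholder for a local audio file.
    for transcript in transcribe_files(["sample.wav"]):
        print(transcript)
```

The individual model cards remain the authoritative reference for supported audio formats and for options such as timestamps.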