NVIDIA Parakeet

Updated Jul 17, 2026

Parakeet is a family of open automatic speech recognition (ASR) models developed by NVIDIA as part of the NeMo conversational AI toolkit. The models are designed for transcribing spoken English (and, in later releases, additional European languages) with state-of-the-art accuracy and unusually high throughput. Parakeet is best known for repeatedly topping the Hugging Face Open ASR Leaderboard while running far faster than competing systems, a combination achieved by pairing an efficient FastConformer encoder with transducer-style decoders. The earliest Parakeet checkpoints were developed jointly by the NVIDIA NeMo and Suno.ai teams and released in early 2024 under a permissive Creative Commons license.^[1]

The name Parakeet is used for the transducer and CTC branch of NVIDIA's NeMo speech models, distinguishing it from Canary, the companion family of attention-based encoder-decoder models that additionally performs speech translation. Both families share the same FastConformer encoder but differ in their decoders and in the breadth of tasks they support.

Background and positioning

Open ASR systems are typically benchmarked on the Hugging Face Open ASR Leaderboard, which ranks models by two complementary measures: the average word error rate (WER) across a suite of English test sets such as LibriSpeech, AMI, Earnings-22, GigaSpeech, SPGISpeech, TED-LIUM and VoxPopuli, and the inverse real-time factor (RTFx), which measures transcription speed as the ratio of audio duration to processing time.^[10] A higher RTFx means more hours of audio can be transcribed per hour of compute. Parakeet was engineered to win on both axes simultaneously, rather than trading accuracy for speed.
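To make the two leaderboard measures concrete, the sketch below computes WER as word-level edit distance divided by the number of reference words, and RTFx as audio duration divided by processing time. The function names and toy inputs are illustrative assumptions, and the leaderboard's own evaluation pipeline (including any text normalization it applies) is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def inverse_real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTFx: seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Toy inputs: one dropped word out of six, and one hour of audio in 1.8 seconds.
print(f"WER:  {word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.3f}")
print(f"RTFx: {inverse_real_time_factor(3600.0, 1.8):.0f}")
```

In this toy case one of six reference words is missing, giving a WER of about 0.167, and an hour of audio processed in 1.8 seconds corresponds to an RTFx of 2,000.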

Four initial Parakeet models were introduced in January 2024 and quickly reached the top of the Open ASR Leaderboard.^[1] NVIDIA reported that, as of early March 2024, its Parakeet and Canary models occupied all five of the top-ranking positions on the leaderboard.^[9] The models were trained to be resilient to non-speech audio such as music and silence, and to generalize across a wide range of accents, dialects, vocal ranges and noise conditions.^[1]

FastConformer architecture

All Parakeet models use a FastConformer encoder, an optimized variant of the Conformer architecture introduced by NVIDIA in the paper "Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition."^[7] Compared with the standard Conformer, FastConformer makes four changes: it adds a further 2x convolutional subsampling stage, so that the input is downsampled by a factor of 8 overall; it uses depthwise-separable convolutions in the subsampling layers; it reduces the number of channels in those layers from 512 to 256; and it shrinks the convolution kernel size inside the Conformer blocks from 31 to 9. These changes make the encoder roughly 2.8x faster than the original Conformer while preserving accuracy, and they allow the architecture to scale to roughly a billion parameters without structural changes.^[7]
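The effect of the extra subsampling stage can be illustrated with simple arithmetic. The sketch assumes a 10 ms feature frame shift, a common ASR front-end setting that is not stated above, and treats the cost of global self-attention as proportional to the square of the sequence length. The 4x figure for the standard Conformer follows from the text (a further 2x stage yielding 8x overall). All names are illustrative.

```python
FRAME_SHIFT_MS = 10  # assumed front-end hop size, not taken from the article


def encoder_frames(audio_seconds: float, subsampling_factor: int) -> int:
    """Approximate number of frames reaching the encoder blocks after subsampling."""
    feature_frames = int(audio_seconds * 1000) // FRAME_SHIFT_MS
    return feature_frames // subsampling_factor


def attention_pairs(num_frames: int) -> int:
    """Query-key pairs scored by one global self-attention layer."""
    return num_frames * num_frames


conformer = encoder_frames(60.0, subsampling_factor=4)       # 4x subsampling
fast_conformer = encoder_frames(60.0, subsampling_factor=8)  # 8x subsampling

print(conformer, fast_conformer)
print(attention_pairs(conformer) / attention_pairs(fast_conformer))
```

Under these assumptions a one-minute clip yields 1,500 encoder frames at 4x subsampling and 750 at 8x, so each global attention layer scores a quarter as many query-key pairs.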

FastConformer also supports limited (local) context attention, which lets a single model transcribe very long recordings without running out of memory. NVIDIA reported that the architecture could process up to 11 hours of speech in one pass on an 80 GB NVIDIA A100 GPU using local attention.^[7]
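The memory argument for limited context attention can be sketched by counting how many query-key pairs a single attention layer has to score. The window of 128 frames on each side and the 80 ms encoder frame duration are illustrative assumptions, not values taken from the article.

```python
def global_attention_pairs(num_frames: int) -> int:
    """Every frame attends to every frame: quadratic in sequence length."""
    return num_frames * num_frames


def local_attention_pairs(num_frames: int, left: int, right: int) -> int:
    """Each frame attends to itself plus at most `left` past and `right` future frames."""
    total = 0
    for t in range(num_frames):
        first = max(0, t - left)
        last = min(num_frames - 1, t + right)
        total += last - first + 1
    return total


ENCODER_FRAME_MS = 80  # assumed: 10 ms features with 8x subsampling
frames_per_hour = 3600 * 1000 // ENCODER_FRAME_MS

for hours in (1, 11):
    n = hours * frames_per_hour
    print(hours, global_attention_pairs(n), local_attention_pairs(n, left=128, right=128))
```

The global count grows with the square of the recording length, while the local count grows roughly linearly (at most 257 pairs per frame with this window), which is why a fixed memory budget stretches to much longer recordings.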

Decoder variants

Parakeet's distinguishing feature is that the same FastConformer encoder is offered with several different decoders, each making a different trade-off between accuracy and inference speed.

CTC (Connectionist Temporal Classification). A non-autoregressive softmax prediction head. Because each output does not depend on previous outputs, CTC decoding is highly parallel and very fast, at some cost in accuracy relative to transducer decoders.
RNN-T (RNN-Transducer, RNNT). Adds a prediction network and a joint network to the encoder, producing an autoregressive model in which each token depends on the previously emitted tokens. This generally improves accuracy over CTC.
TDT (Token-and-Duration Transducer). A refined transducer objective introduced by NVIDIA in the paper "Efficient Sequence Transduction by Jointly Predicting Tokens and Durations." A TDT model's joint network produces two independently normalized distributions: one over output tokens and one over durations, where the duration is the number of input frames the emitted token spans. By predicting durations, the decoder can skip ahead over blank frames during inference instead of advancing one frame at a time. The paper reported that TDT models achieved better accuracy and up to 2.82x faster inference than conventional transducers.^[8]
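The difference between frame-by-frame transducer decoding and TDT's frame skipping can be sketched with two simplified greedy loops. This is a toy illustration rather than NeMo's implementation: a hypothetical scripted oracle stands in for the encoder, prediction network and joint network, and the maximum duration of 4 frames is an illustrative choice.

```python
from typing import Callable, List, Optional, Tuple

BLANK = None  # stand-in for the blank symbol
Joint = Callable[[int, int], Tuple[Optional[str], int]]


def make_oracle(starts: List[Tuple[int, str]], num_frames: int, max_duration: int) -> Joint:
    """Toy replacement for the neural networks (hypothetical).

    `starts` lists (frame_index, token) pairs in increasing frame order. The
    returned function maps (frame, tokens emitted so far) to (token, duration).
    """
    def joint(t: int, num_emitted: int) -> Tuple[Optional[str], int]:
        if num_emitted < len(starts) and starts[num_emitted][0] == t:
            token = starts[num_emitted][1]
            nxt = num_emitted + 1
        else:
            token = BLANK
            nxt = num_emitted
        next_event = starts[nxt][0] if nxt < len(starts) else num_frames
        return token, min(max_duration, next_event - t)
    return joint


def greedy_transducer(joint: Joint, num_frames: int) -> Tuple[List[str], int]:
    """Conventional transducer: a blank advances exactly one frame."""
    tokens, t, calls = [], 0, 0
    while t < num_frames:
        token, _ = joint(t, len(tokens))  # predicted duration is ignored
        calls += 1
        if token is BLANK:
            t += 1
        else:
            tokens.append(token)  # stay on the same frame after emitting
    return tokens, calls


def greedy_tdt(joint: Joint, num_frames: int) -> Tuple[List[str], int]:
    """TDT-style loop: jump ahead by the predicted duration after every step."""
    tokens, t, calls = [], 0, 0
    while t < num_frames:
        token, duration = joint(t, len(tokens))
        calls += 1
        if token is not BLANK:
            tokens.append(token)
        elif duration == 0:
            duration = 1  # a blank must always move forward
        t += duration
    return tokens, calls


oracle = make_oracle([(2, "he"), (3, "llo"), (10, "world")], num_frames=16, max_duration=4)
print(greedy_transducer(oracle, 16))
print(greedy_tdt(oracle, 16))
```

On this 16-frame toy input both loops emit the same three tokens, but the conventional loop evaluates the joint 19 times while the duration-aware loop needs 6, which is the kind of saving the skipping mechanism is meant to provide.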

NeMo also ships hybrid TDT-CTC checkpoints (for example, models of around 110M and 1.1B parameters, plus a Japanese variant) that combine a TDT decoder and a CTC decoder on a shared encoder, letting a single model serve either decoding path.

The original model family

The first Parakeet release consisted of English-only models at 0.6 billion and 1.1 billion parameters. The Hugging Face model cards for these checkpoints state that they were "jointly developed by NVIDIA NeMo and Suno.ai teams."^[3]^[4] The models were trained on a 64,000-hour English speech corpus assembled by the two teams, comprising roughly 40,000 hours of private data and about 24,000 hours from public datasets including LibriSpeech, the Fisher Corpus, Switchboard, the Wall Street Journal corpus, the National Speech Corpus, VCTK, VoxPopuli, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice and People's Speech.^[3]^[4]

NVIDIA highlighted the Token-and-Duration Transducer model, Parakeet-TDT-1.1B, as the most accurate of the family. It reported that the model was the first to achieve an average WER below 7.0% on the Open ASR Leaderboard, and that it ran about 64% faster than the previously best Parakeet model, Parakeet-RNNT-1.1B.^[2] All of these checkpoints were released under the CC-BY-4.0 license, which permits commercial use.^[4]

Model	Decoder	Parameters	Open ASR Leaderboard avg WER (%)	RTFx
Parakeet-CTC-0.6B	CTC	0.6B	not listed	very high
Parakeet-CTC-1.1B	CTC	1.1B	not listed	very high
Parakeet-RNNT-0.6B	RNN-T	0.6B	not listed	high
Parakeet-RNNT-1.1B	RNN-T	1.1B	7.12^[3]	2,053
Parakeet-TDT-1.1B	TDT	1.1B	7.02^[4]	2,391

Reported single-GPU throughput for the 1.1B models illustrates the speed differences between decoders: NVIDIA cited transcription rates of about 1,336 hours of audio per hour of processing time for Parakeet-CTC-1.1B, 1,212 hours for Parakeet-TDT-1.1B, and 1,120 hours for Parakeet-RNNT-1.1B.^[1] The leaderboard WER and RTFx figures in the table above are drawn from the respective Hugging Face model cards and vary with evaluation batch size and audio length.

Parakeet-TDT-0.6B-v2

In May 2025 NVIDIA released Parakeet-TDT-0.6B-v2, a 600-million-parameter FastConformer-TDT model for English transcription that topped the Hugging Face Open ASR Leaderboard at launch.^[5]^[11] The model achieved an average WER of 6.05% across the leaderboard's test sets, and NVIDIA reported an RTFx of about 3,380 at batch size 128, equivalent to transcribing roughly 60 minutes of audio in about one second on GPU-accelerated hardware.^[5] Beyond raw transcription, the model produces automatic punctuation and capitalization and accurate word-level timestamps, and it was tuned to handle spoken numbers and song-lyric transcription robustly.^[5]

Parakeet-TDT-0.6B-v2 was trained on the Granary dataset, a large open corpus of about 120,000 hours of English audio made up of roughly 10,000 hours of human-transcribed speech and about 110,000 hours of pseudo-labeled data.^[5] Like the rest of the family it was released under the CC-BY-4.0 license, making it fully usable in commercial products, and it runs through the NeMo toolkit.
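As a minimal sketch of running the model through NeMo, the snippet below follows the usage pattern documented for NeMo ASR models: load the checkpoint with ASRModel.from_pretrained, then call transcribe with timestamps enabled. The wrapper functions, the file name and the assumed layout of the word-timestamp entries (start, end and word keys) are illustrative assumptions and should be checked against the installed NeMo version.

```python
from typing import Dict, List


def format_word_timestamps(words: List[Dict]) -> List[str]:
    """Render word-level timestamps as 'start-end: word' lines.

    Assumes each entry has 'start' and 'end' (seconds) and 'word' keys.
    """
    return [f"{w['start']:.2f}s-{w['end']:.2f}s: {w['word']}" for w in words]


def transcribe_with_timestamps(
    audio_path: str,
    model_name: str = "nvidia/parakeet-tdt-0.6b-v2",
) -> List[str]:
    """Load a Parakeet checkpoint via NeMo and return formatted word timestamps.

    Requires the NeMo toolkit with its ASR collection installed. The import is
    deferred so that the formatting helper above works without NeMo.
    """
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
    output = asr_model.transcribe([audio_path], timestamps=True)
    print(output[0].text)  # punctuated, capitalized transcript
    return format_word_timestamps(output[0].timestamp["word"])


# Example call (hypothetical file name; downloads the checkpoint on first use):
# for line in transcribe_with_timestamps("sample.wav"):
#     print(line)
```

Keeping the formatting helper separate from the model call makes it possible to test the output handling without downloading the checkpoint.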

Parakeet-TDT-0.6B-v3 and the multilingual expansion

In August 2025 NVIDIA released Parakeet-TDT-0.6B-v3, extending the same 600M-parameter FastConformer-TDT design to 25 European languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian and Ukrainian.^[6] It was trained on a much larger Granary corpus of around 660,000 hours of pseudo-labeled multilingual data plus roughly 10,000 hours of human-transcribed speech.^[6] On the English Open ASR Leaderboard NVIDIA reported an average WER of about 6.34% and an RTFx near 3,333 for the v3 model, along with competitive multilingual results on benchmarks such as FLEURS, Multilingual LibriSpeech and CoVoST.^[6] The model retains punctuation, capitalization, and word- and segment-level timestamps, and continues to support long-form audio through local attention.

Model	Release	Languages	Parameters	Avg WER (English)	RTFx
Parakeet-TDT-0.6B-v2	May 2025	English	0.6B	6.05%^[5]	~3,380
Parakeet-TDT-0.6B-v3	Aug 2025	25 European	0.6B	6.34%^[6]	~3,333

Relationship to Canary

Parakeet and Canary are sibling model families within NeMo that share the FastConformer encoder but diverge in decoder design and scope. Canary uses an attention-based encoder-decoder (AED) architecture, pairing a FastConformer encoder with a Transformer decoder, and is multilingual and multitask: it performs both speech recognition and speech-to-text translation between English and other supported languages. Parakeet, by contrast, uses transducer or CTC decoders and is focused purely on transcription, which tends to make it faster at inference. The two families are frequently released and benchmarked together, and in 2025 NVIDIA paired Canary-1B-v2 with Parakeet-TDT-0.6B-v3 as complementary multilingual offerings built on the shared Granary training data.

Licensing and availability

All publicly released Parakeet checkpoints are distributed under the CC-BY-4.0 license, which allows commercial use with attribution. The models are available on Hugging Face and through NVIDIA's NGC catalog, and they are run and fine-tuned with the open-source NeMo toolkit, which provides the FastConformer, transducer and TDT implementations. NVIDIA also offers Parakeet models as packaged inference microservices (NVIDIA NIM) for production deployment.

References

1. NVIDIA. "Pushing the Boundaries of Speech Recognition with NVIDIA NeMo Parakeet ASR Models." NVIDIA Technical Blog, April 18, 2024. https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/
2. NVIDIA. "Turbocharge ASR Accuracy and Speed with NVIDIA NeMo Parakeet-TDT." NVIDIA Technical Blog, April 2024. https://developer.nvidia.com/blog/turbocharge-asr-accuracy-and-speed-with-nvidia-nemo-parakeet-tdt/
3. NVIDIA. "nvidia/parakeet-rnnt-1.1b." Hugging Face model card. https://huggingface.co/nvidia/parakeet-rnnt-1.1b
4. NVIDIA. "nvidia/parakeet-tdt-1.1b." Hugging Face model card. https://huggingface.co/nvidia/parakeet-tdt-1.1b
5. NVIDIA. "nvidia/parakeet-tdt-0.6b-v2." Hugging Face model card. https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
6. NVIDIA. "nvidia/parakeet-tdt-0.6b-v3." Hugging Face model card. https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
7. Rekesh, D., et al. "Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition." arXiv:2305.05084, 2023. https://arxiv.org/abs/2305.05084
8. Xu, H., et al. "Efficient Sequence Transduction by Jointly Predicting Tokens and Durations." arXiv:2304.06795, 2023. https://arxiv.org/abs/2304.06795
9. NVIDIA. "NVIDIA Speech and Translation AI Models Set Records for Speed and Accuracy." NVIDIA Technical Blog, March 19, 2024. https://developer.nvidia.com/blog/nvidia-speech-and-translation-ai-models-set-records-for-speed-and-accuracy/
10. Hugging Face. "Open ASR Leaderboard." https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
11. VentureBeat. "Nvidia launches fully open source transcription AI model Parakeet-TDT-0.6B-V2 on Hugging Face." May 2025. https://venturebeat.com/ai/nvidia-launches-fully-open-source-transcription-ai-model-parakeet-tdt-0-6b-v2-on-hugging-face
