NVIDIA Canary
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,873 words
Canary is a family of open speech models developed by Nvidia as part of its NeMo conversational AI toolkit. Canary models perform both automatic speech recognition (ASR), the transcription of spoken audio into text, and automatic speech-to-text translation (AST), the rendering of speech in one language as text in another. Since the debut of the flagship Canary-1B model in April 2024, successive members of the family have repeatedly topped the Hugging Face Open ASR Leaderboard, the most widely cited public ranking of open speech recognition systems. The models are released with downloadable weights, and every release after the original Canary-1B carries a commercially permissive license, making them a prominent reference point for open multilingual speech AI alongside systems such as OpenAI's Whisper and Nvidia's own Parakeet family.
Canary models share a common design philosophy: an encoder-decoder neural architecture that pairs a FastConformer acoustic encoder with a Transformer text decoder, trained jointly on transcription and translation so that a single model can switch tasks and target languages at inference time. The family spans several parameter scales, from compact sub-200-million-parameter "Flash" variants optimized for throughput up to a speech-augmented language model built on top of a large language model.
Canary is a multitask, multilingual speech model. Given a single-channel audio waveform sampled at 16 kHz, a Canary model can either transcribe the speech in its original language or translate it into another supported language, in each case with the option of producing output with or without punctuation and capitalization (abbreviated PnC). The encoder converts the input audio into log-mel spectrogram features and then into a sequence of acoustic representations; the decoder generates output text tokens autoregressively, conditioned on the requested source language, target language, and task. This task-prompting scheme lets one set of weights serve transcription and translation across many language directions, distinguishing Canary from single-task ASR systems.
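The task-prompting scheme described above can be illustrated with a small helper that builds one request record. This is a minimal sketch, not Nvidia's API: the helper name `make_request` is hypothetical, and the field names (`taskname`, `source_lang`, `target_lang`, `pnc`) are modeled on the JSON-lines manifests used in NeMo examples, so they should be checked against the model card of the specific release.

```python
import json

# Assumption: language codes for the four languages of the original Canary-1B.
SUPPORTED = {"en", "de", "fr", "es"}


def make_request(audio_path, source_lang, target_lang, pnc=True):
    """Build one illustrative request record for a Canary-style model."""
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED:
            raise ValueError(f"unsupported language: {lang}")
    # Same source and target language means transcription; otherwise translation.
    task = "asr" if source_lang == target_lang else "ast"
    # Translation in the four-language models always has English on one side.
    if task == "ast" and "en" not in (source_lang, target_lang):
        raise ValueError("translation must have English on one side")
    return {
        "audio_filepath": audio_path,
        "taskname": task,  # assumed label; some releases may use another name
        "source_lang": source_lang,
        "target_lang": target_lang,
        "pnc": "yes" if pnc else "no",
    }


transcribe_de = make_request("sample.wav", "de", "de")
translate_de_en = make_request("sample.wav", "de", "en", pnc=False)
print(json.dumps(translate_de_en))
```

The point of the sketch is that the audio stays the same while the three prompt fields change, which is how one set of weights covers every task and language direction.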
Canary models are distributed through Hugging Face and are also offered as hosted inference endpoints through Nvidia's deployment stack, including Nvidia Riva and the build.nvidia.com catalog. They are designed to run on Nvidia GPUs, with the various model cards citing support for Ampere, Hopper, and Blackwell-class hardware, and require only a few gigabytes of memory to load.
The Canary lineup has grown through several releases. The table below summarizes the principal members, with parameter counts, supported languages, tasks, and release timing drawn from their Hugging Face model cards.
| Model | Parameters | Languages | Tasks | License | Released |
|---|---|---|---|---|---|
| Canary-1B | ~1 billion | English, German, French, Spanish | ASR, AST | CC-BY-NC-4.0 | April 2024 |
| Canary-180M-Flash | 182 million | English, German, French, Spanish | ASR, AST, timestamps | CC-BY-4.0 | March 2025 |
| Canary-1B-Flash | 883 million | English, German, French, Spanish | ASR, AST, timestamps | CC-BY-4.0 | March 2025 |
| Canary-1B-v2 | 978 million | 25 European languages | ASR, AST, timestamps | CC-BY-4.0 | August 2025 |
| Canary-Qwen-2.5B | 2.5 billion | English | ASR, LLM post-processing | CC-BY-4.0 | July 2025 |
Canary-1B was the original release. It supports four languages (English, German, French, and Spanish), with bidirectional translation between English and each of the other three.
Canary-1B-Flash and Canary-180M-Flash are throughput-optimized variants introduced in 2025. They retain the same four languages and add experimental word-level and segment-level timestamp prediction. Despite the "1B" label, Canary-1B-Flash has 883 million parameters; the 180M variant has 182 million. Both are released under the commercially permissive CC-BY-4.0 license.
Canary-1B-v2 greatly expanded language coverage. With 978 million parameters, it supports 25 European languages for transcription and bidirectional translation between English and 24 other languages. It was trained on the open Granary dataset and released in August 2025.
Canary-Qwen-2.5B is a different kind of model: a speech-augmented language model (SALM) that fuses a Canary speech encoder with a Qwen large language model. Rather than a pure ASR system, it combines transcription with general language-model capabilities and is described below.
Every Canary model is an encoder-decoder system built on two main components. The encoder is a FastConformer, an Nvidia-developed variant of the Conformer architecture that increases the downsampling factor in the convolutional front end. According to Nvidia, this design yields roughly 3x compute savings and roughly 4x memory savings relative to a standard Conformer encoder, enabling faster training and inference on long audio. The encoder ingests audio as log-mel spectrogram features and produces a sequence of acoustic embeddings.
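A rough calculation shows why a larger downsampling factor matters for long audio. The sketch below assumes a 10 ms spectrogram hop, 8x subsampling for a FastConformer-style encoder, and 4x for a standard Conformer; none of these values are stated in this article, and the function name `encoder_frames` is hypothetical. The resulting ratio covers only the pairwise self-attention term in this toy model and is not a derivation of Nvidia's reported savings.

```python
def encoder_frames(duration_s, hop_ms=10.0, subsampling=8):
    """Approximate number of encoder output frames for a clip.

    Assumes one log-mel frame every `hop_ms` milliseconds, then a
    convolutional front end that keeps one frame in `subsampling`.
    """
    mel_frames = int(duration_s * 1000 / hop_ms)
    return mel_frames // subsampling


clip_seconds = 30.0
fast = encoder_frames(clip_seconds, subsampling=8)  # FastConformer-style (assumed 8x)
base = encoder_frames(clip_seconds, subsampling=4)  # standard Conformer-style (assumed 4x)

# Self-attention compares every frame with every other frame, so its cost
# grows with the square of the sequence length.
attention_ratio = (base / fast) ** 2

print(f"frames: {base} -> {fast}, attention pairs reduced {attention_ratio:.0f}x")
```

Halving the sequence length cuts the number of frame pairs that self-attention must compare by a factor of four, which is why shorter encoder sequences help most on long recordings.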
The decoder is a standard Transformer decoder that generates output text tokens autoregressively. Tasks and language directions are signaled to the decoder through special prompt tokens, allowing the same weights to transcribe or translate on demand. Tokenization uses SentencePiece: the original Canary-1B concatenated per-language SentencePiece tokenizers, while Canary-1B-v2 adopted a single unified SentencePiece tokenizer with a vocabulary of 16,384 tokens to span its 25 languages.
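One plausible way to concatenate per-language tokenizers is to give each language its own contiguous range of token IDs. The sketch below illustrates that idea only; the vocabulary sizes are hypothetical placeholders, and the names `VOCAB_SIZES`, `to_global_id`, and `to_local_id` are not taken from NeMo.

```python
from itertools import accumulate

# Hypothetical per-language vocabulary sizes (placeholders, not from the model card).
VOCAB_SIZES = {"en": 1024, "de": 1024, "fr": 1024, "es": 1024}

# Each language gets a contiguous ID range that starts at its offset.
_LANGS = list(VOCAB_SIZES)
_ENDS = list(accumulate(VOCAB_SIZES[lang] for lang in _LANGS))
OFFSETS = dict(zip(_LANGS, [0] + _ENDS[:-1]))


def to_global_id(lang, local_id):
    """Map a token ID from one language's tokenizer into the shared ID space."""
    if not 0 <= local_id < VOCAB_SIZES[lang]:
        raise ValueError("local_id out of range")
    return OFFSETS[lang] + local_id


def to_local_id(global_id):
    """Recover (language, local ID) from an ID in the shared space."""
    for lang in _LANGS:
        start = OFFSETS[lang]
        if start <= global_id < start + VOCAB_SIZES[lang]:
            return lang, global_id - start
    raise ValueError("global_id out of range")


print(OFFSETS)
print(to_global_id("fr", 5), to_local_id(to_global_id("fr", 5)))
```

Under this scheme the total vocabulary grows with every added language, which is one reason a single unified tokenizer is attractive once coverage reaches 25 languages.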
The specific layer configurations differ across the family. Canary-1B uses 24 encoder layers and 24 decoder layers. The Flash and v2 models rebalance toward a heavier encoder and a lighter decoder, which improves inference speed: Canary-1B-Flash uses 32 encoder layers and 4 decoder layers, Canary-180M-Flash uses 17 encoder layers and 4 decoder layers, and Canary-1B-v2 uses 32 encoder layers and 8 decoder layers. The shallow-decoder design is a key reason the Flash and v2 models achieve very high inverse real-time factors (RTFx), that is, very high throughput.
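The rebalancing is easier to see when the layer counts are put side by side. The sketch below records the counts given above and computes the share of layers that sit in the encoder; the `LayerConfig` class is a hypothetical container for illustration, not a NeMo type. The share matters because the encoder runs once per utterance, whereas an autoregressive decoder runs once per generated token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerConfig:
    """Illustrative record of encoder and decoder depth for one model."""
    name: str
    encoder_layers: int
    decoder_layers: int

    @property
    def encoder_share(self):
        """Fraction of all layers that belong to the encoder."""
        return self.encoder_layers / (self.encoder_layers + self.decoder_layers)


# Layer counts as listed in this article.
CONFIGS = [
    LayerConfig("Canary-1B", 24, 24),
    LayerConfig("Canary-1B-Flash", 32, 4),
    LayerConfig("Canary-180M-Flash", 17, 4),
    LayerConfig("Canary-1B-v2", 32, 8),
]

for cfg in CONFIGS:
    print(f"{cfg.name}: {cfg.encoder_share:.0%} of layers in the encoder")
```

Layer count is only a proxy for cost, since layer widths also differ between models, but it captures the direction of the design change.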
Canary-Qwen-2.5B departs from the pure encoder-decoder pattern. It is a hybrid that connects the FastConformer speech encoder from Canary-1B-Flash to the Qwen3-1.7B language model through a linear projection, with low-rank adaptation (LoRA) applied to the LLM. The result, which Nvidia calls a speech-augmented language model, operates in two modes. In ASR mode it transcribes speech into text but does not exercise the LLM's reasoning skills. In LLM mode it retains the full capabilities of the underlying language model and can post-process a transcript, for example to summarize it or answer questions about its content. The model is English-only and totals about 2.5 billion parameters.
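The appeal of low-rank adaptation is that it leaves each large weight matrix frozen and trains only a small low-rank update. The arithmetic below sketches the saving for a single square matrix plus the size of a linear projection between encoder and LLM. All dimensions and the rank are hypothetical values chosen for illustration; the article does not state the actual sizes used in Canary-Qwen-2.5B.

```python
def full_params(d_out, d_in):
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters in a LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in); the original
    matrix stays frozen and the product B @ A is added to it.
    """
    return rank * (d_out + d_in)


# Hypothetical sizes for illustration only.
LLM_DIM = 2048      # assumed LLM hidden size
ENCODER_DIM = 1024  # assumed speech-encoder output size
RANK = 16           # assumed LoRA rank

dense = full_params(LLM_DIM, LLM_DIM)
adapter = lora_params(LLM_DIM, LLM_DIM, RANK)
# Linear projection from encoder space to LLM space: weights plus bias.
projection = full_params(LLM_DIM, ENCODER_DIM) + LLM_DIM

print(f"dense matrix: {dense:,} params")
print(f"LoRA update:  {adapter:,} params ({adapter / dense:.2%} of dense)")
print(f"projection:   {projection:,} params")
```

With these assumed sizes, the adapter for one matrix trains under 2% of the parameters that full fine-tuning of that matrix would touch, which is how the underlying language model can keep its original capabilities largely intact.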
The Hugging Face Open ASR Leaderboard ranks open speech recognition models primarily by average word error rate (WER, lower is better) across a standard suite of English test sets, alongside an inverse real-time factor (RTFx, higher is faster) that measures throughput. Canary models have set the pace on this benchmark across multiple releases.
| Model | Mean WER (%) | RTFx | Leaderboard note |
|---|---|---|---|
| Canary-1B (2024) | 6.67 at launch; 6.5 on current card | ~235 | Topped the leaderboard at release |
| Canary-180M-Flash | 7.12 | >1200 | High-throughput compact model |
| Canary-1B-Flash | 6.35 | >1000 | More accurate and faster than Canary-1B |
| Canary-1B-v2 | 7.15 | ~749 | Tops the multilingual open ranking |
| Canary-Qwen-2.5B | 5.63 | ~418 | First place on the leaderboard (2025) |
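The two leaderboard metrics can be sketched in a few lines. The code below computes WER as word-level edit distance divided by the number of reference words, and RTFx as audio duration divided by processing time. It is a minimal illustration rather than the leaderboard's evaluation harness: real evaluations typically normalize text (casing, punctuation, number formats) before scoring, which is omitted here, and the function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER:  {wer:.2%}")                # one substitution + one deletion over six words
print(f"RTFx: {rtfx(3600, 4):.0f}")      # one hour of audio processed in four seconds
```

Because WER counts insertions as well as substitutions and deletions, it can exceed 100% when a hypothesis is much longer than its reference.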
At its April 2024 launch, Canary-1B reached the top of the leaderboard with an average WER of 6.67%, which Nvidia reported as the most accurate result among open models at the time. Nvidia emphasized that Canary-1B achieved this while training on roughly 85,000 hours of speech, an order of magnitude less data than some competing systems, and that on the Mozilla Common Voice 16.1 test sets it outperformed Whisper-large-v3 and SeamlessM4T-Medium.
The 2025 Flash models raised throughput sharply while improving accuracy: Canary-1B-Flash records a mean WER of 6.35% with an RTFx above 1,000 on the leaderboard datasets, and the smaller Canary-180M-Flash exceeds 1,200 RTFx. Canary-1B-v2 leads the multilingual portion of the ranking while, by Nvidia's account, delivering transcription and translation quality comparable to models three times larger at up to ten times the inference speed. Canary-Qwen-2.5B claimed first place on the leaderboard in 2025 with a mean WER of 5.63%, the lowest figure reported among the family. Because the leaderboard is recomputed as models are added and harnesses change, the precise WER values shown on individual model cards can differ slightly from the figures quoted at launch.
The original Canary-1B and the Flash variants support four languages: English, German, French, and Spanish. They perform ASR in all four and AST between English and each of German, French, and Spanish in both directions.
Canary-1B-v2 widened coverage to 25 European languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, and Ukrainian. It transcribes all 25 and translates between English and the other 24 in both directions.
Across the family the common tasks are transcription and translation, each available with or without punctuation and capitalization. The 2025 models added word-level and segment-level timestamp prediction, initially marked experimental, with Canary-1B-v2 extending segment-level timestamps to translated output as well.
Canary-1B and the Flash variants were trained on about 85,000 hours of speech, combining roughly 31,000 hours of public data, 20,000 hours collected by Suno, and 34,000 hours of in-house data. Training used Nvidia A100 GPUs.
Canary-1B-v2 was trained on a far larger corpus of roughly 1.7 million hours, drawing on the open Granary dataset together with Nvidia's NeMo ASR Set 3.0. Announced alongside the model in August 2025, Granary is an open-source corpus of about a million hours of multilingual audio, divided between speech recognition and speech translation data and covering the same 25 European languages, including lower-resource languages such as Croatian, Estonian, and Maltese. Canary-Qwen-2.5B was trained on about 234,500 hours of publicly available English speech.
Licensing differs between the original model and the later releases. Canary-1B weights are distributed under the research-oriented, non-commercial CC-BY-NC-4.0 license. The 2025 models (Canary-180M-Flash, Canary-1B-Flash, Canary-1B-v2, and Canary-Qwen-2.5B) are released under the more permissive CC-BY-4.0 license and are explicitly cleared for commercial use, reflecting Nvidia's shift toward openly usable speech models. The associated training and inference code ships with the NeMo toolkit under the Apache 2.0 license.
Canary is a flagship demonstration of the speech capabilities in Nvidia's NeMo ecosystem and a recurring showcase for the FastConformer encoder that NeMo popularized. By repeatedly topping the Open ASR Leaderboard, the family has helped establish Nvidia as a leader in open speech recognition and translation, a position reinforced by the parallel Parakeet models that share the same FastConformer foundation. The release of the Granary dataset alongside Canary-1B-v2 extended that contribution beyond model weights into open training data for multilingual speech, and the Canary-Qwen SALM illustrates how Nvidia is fusing speech encoders with large language models. Distributed under permissive licenses and integrated with Nvidia's Riva and NIM deployment paths, Canary models are positioned both as research baselines and as production-ready building blocks for transcription, captioning, voice interfaces, and multilingual translation.