NVIDIA Canary

Updated Jul 17, 2026

Canary is a family of open speech models developed by Nvidia as part of its NeMo conversational AI toolkit. Canary models perform both automatic speech recognition (ASR), the transcription of spoken audio into text, and automatic speech-to-text translation (AST), the rendering of speech in one language as text in another. Since the debut of the flagship Canary-1B model in April 2024, successive members of the family have repeatedly topped the Hugging Face Open ASR Leaderboard, the most widely cited public ranking of open speech recognition systems.^[6] All models ship with downloadable weights, and every release after the original Canary-1B carries the permissive CC-BY-4.0 license, making the family a prominent reference point for open multilingual speech AI alongside systems such as OpenAI's Whisper and Nvidia's own Parakeet family.

Canary models share a common design philosophy: an encoder-decoder neural architecture that pairs a FastConformer acoustic encoder with a Transformer text decoder, trained jointly on transcription and translation so that a single model can switch tasks and target languages at inference time. The family spans several parameter scales, from compact sub-200-million-parameter "Flash" variants optimized for throughput up to a speech-augmented language model built on top of a large language model.^[5]

What Canary is

Canary is a multitask, multilingual speech model. Given a single-channel audio waveform sampled at 16 kHz, a Canary model can either transcribe the speech in its original language or translate it into another supported language, in each case with the option of producing output with or without punctuation and capitalization (abbreviated PnC). The encoder converts the input audio into log-mel spectrogram features and then into a sequence of acoustic representations; the decoder generates output text tokens autoregressively, conditioned on the requested source language, target language, and task. This task-prompting scheme lets one set of weights serve transcription and translation across many language directions, distinguishing Canary from single-task ASR systems.^[1]
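
As a rough illustration of the task-prompting scheme, the sketch below builds one request record per audio file and derives the task from the language pair. The field names (`audio_filepath`, `taskname`, `source_lang`, `target_lang`, `pnc`) are modeled on the JSON-lines manifest convention used in NeMo examples, but they are assumptions here and should be checked against the model card of the specific variant. `make_request` and the two-letter language codes are hypothetical.

```python
import json

# The four Canary-1B languages; the two-letter codes are an assumption.
SUPPORTED = {"en", "de", "fr", "es"}


def make_request(audio_path, source_lang, target_lang, pnc=True):
    """Build one manifest-style record (field names are assumed, not official)."""
    if source_lang not in SUPPORTED or target_lang not in SUPPORTED:
        raise ValueError("unsupported language")
    # Same source and target language means transcription (ASR);
    # different languages means speech-to-text translation (AST).
    task = "asr" if source_lang == target_lang else "s2t_translation"
    if task == "s2t_translation" and "en" not in (source_lang, target_lang):
        raise ValueError("translation needs English on one side")
    return {
        "audio_filepath": audio_path,
        "taskname": task,
        "source_lang": source_lang,
        "target_lang": target_lang,
        "pnc": "yes" if pnc else "no",
    }


asr_request = make_request("sample.wav", "en", "en")
ast_request = make_request("sample.wav", "de", "en", pnc=False)
manifest_lines = [json.dumps(r) for r in (asr_request, ast_request)]
```

The same audio file appears in both records; only the prompt fields change, which is the sense in which one set of weights serves several tasks.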

Canary models are distributed through Hugging Face and are also offered as hosted inference endpoints through Nvidia's deployment stack, including Nvidia Riva and the build.nvidia.com catalog. They are designed to run on Nvidia GPUs, with the various model cards citing support for Ampere, Hopper, and Blackwell-class hardware, and require only a few gigabytes of memory to load.^[1]

The model family and variants

The Canary lineup has grown through several releases. The table below summarizes the principal members, with parameter counts, supported languages, tasks, and release timing drawn from their Hugging Face model cards.

Model	Parameters	Languages	Tasks	License	Released
Canary-1B	~1 billion	English, German, French, Spanish	ASR, AST	CC-BY-NC-4.0	April 2024^[1]
Canary-180M-Flash	182 million	English, German, French, Spanish	ASR, AST, timestamps	CC-BY-4.0	March 2025^[3]
Canary-1B-Flash	883 million	English, German, French, Spanish	ASR, AST, timestamps	CC-BY-4.0	March 2025^[2]
Canary-1B-v2	978 million	25 European languages	ASR, AST, timestamps	CC-BY-4.0	August 2025^[4]
Canary-Qwen-2.5B	2.5 billion	English	ASR, LLM post-processing	CC-BY-4.0	July 2025^[5]

Canary-1B was the original release. It supports four languages (English, German, French, and Spanish), with bidirectional translation between English and each of the other three.^[1]

Canary-1B-Flash and Canary-180M-Flash are throughput-optimized variants introduced in 2025. They retain the same four languages and add experimental word-level and segment-level timestamp prediction. Despite the "1B" label, Canary-1B-Flash has 883 million parameters; the 180M variant has 182 million. Both are released under the commercially permissive CC-BY-4.0 license.^[2]

Canary-1B-v2 greatly expanded language coverage. With 978 million parameters, it supports 25 European languages for transcription and bidirectional translation between English and 24 other languages. It was trained on the open Granary dataset and released in August 2025.^[4]

Canary-Qwen-2.5B is a different kind of model: a speech-augmented language model (SALM) that fuses a Canary speech encoder with a Qwen large language model. Rather than a pure ASR system, it combines transcription with general language-model capabilities and is described below.^[5]

Architecture

Every Canary model is an encoder-decoder system built on two main components. The encoder is a FastConformer, an Nvidia-developed variant of the Conformer architecture that increases the downsampling factor in the convolutional front end. According to Nvidia, this design yields roughly 3x compute savings and roughly 4x memory savings relative to a standard Conformer encoder, enabling faster training and inference on long audio.^[6] The encoder ingests audio as log-mel spectrogram features and produces a sequence of acoustic embeddings.
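
To see why a larger downsampling factor matters on long audio, the following sketch counts encoder time steps and pairwise self-attention interactions for a 30-second clip. The 10 ms feature hop and the 4x versus 8x downsampling factors are assumptions chosen for illustration; they are not stated in the text above. The calculation covers only the attention term and is not a derivation of Nvidia's reported compute and memory savings.

```python
def encoder_frames(duration_s, hop_ms=10.0, downsample=8):
    """Encoder time steps left after convolutional downsampling."""
    feature_frames = int(duration_s * 1000 / hop_ms)
    return feature_frames // downsample


def attention_pairs(frames):
    """Self-attention compares every time step with every other one."""
    return frames * frames


# Assumed factors: 4x for a baseline Conformer, 8x for FastConformer.
conformer_frames = encoder_frames(30.0, downsample=4)
fast_frames = encoder_frames(30.0, downsample=8)
pair_ratio = attention_pairs(conformer_frames) / attention_pairs(fast_frames)
```

Under these assumptions, halving the sequence length cuts the attention pair count by a factor of four, which is why the benefit grows with audio length.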

The decoder is a standard Transformer decoder that generates output text tokens autoregressively. Tasks and language directions are signaled to the decoder through special prompt tokens, allowing the same weights to transcribe or translate on demand. Tokenization uses SentencePiece: the original Canary-1B concatenated per-language SentencePiece tokenizers, while Canary-1B-v2 adopted a single unified SentencePiece tokenizer with a vocabulary of 16,384 tokens to span its 25 languages.^[4]
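
A toy version of prompt-conditioned autoregressive decoding might look like the sketch below. The special-token spellings (`<|startoftranscript|>`, `<|pnc|>`, and so on) and their order are hypothetical placeholders rather than the tokenizer's actual vocabulary, and `make_toy_decoder` stands in for the real Transformer decoder with a canned reply.

```python
EOS = "<|endoftext|>"  # hypothetical end-of-sequence token


def build_prompt(source_lang, target_lang, pnc):
    """Prompt prefix that tells the decoder which task and languages to use."""
    task = "transcribe" if source_lang == target_lang else "translate"
    return [
        "<|startoftranscript|>",
        f"<|{source_lang}|>",
        f"<|{task}|>",
        f"<|{target_lang}|>",
        "<|pnc|>" if pnc else "<|nopnc|>",
    ]


def greedy_decode(next_token, prompt, max_len=32):
    """Append one token at a time until EOS or the length limit."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_len:
        token = next_token(tokens)
        if token == EOS:
            break
        tokens.append(token)
    return tokens[len(prompt):]


def make_toy_decoder(prompt_len, canned):
    """Stand-in for a real decoder: replays a fixed list of tokens."""
    def next_token(tokens):
        generated = len(tokens) - prompt_len
        return canned[generated] if generated < len(canned) else EOS
    return next_token


prompt = build_prompt("en", "de", pnc=True)
output = greedy_decode(make_toy_decoder(len(prompt), ["hallo", "welt"]), prompt)
```

In a real model, `next_token` would also attend to the encoder's acoustic embeddings; the stub omits that to keep the control flow visible.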

The specific layer configurations differ across the family. Canary-1B uses 24 encoder layers and 24 decoder layers.^[1] The Flash and v2 models rebalance toward a heavier encoder and a lighter decoder, which improves inference speed: Canary-1B-Flash uses 32 encoder layers and 4 decoder layers,^[2] Canary-180M-Flash uses 17 encoder layers and 4 decoder layers,^[3] and Canary-1B-v2 uses 32 encoder layers and 8 decoder layers.^[4] The shallow-decoder design is a key reason the Flash and v2 models achieve very high inverse real-time factors (RTFx), that is, very high throughput.
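
The effect of a shallow decoder can be approximated with a crude count of layer passes per utterance: the encoder stack runs once, while the decoder stack runs once per generated token. The layer counts below come from the paragraph above; the 100-token output length is an arbitrary assumption, and treating every layer pass as equally expensive is a simplification, since encoder and decoder layers do different amounts of work.

```python
# (encoder layers, decoder layers) as listed in the text.
LAYERS = {
    "Canary-1B": (24, 24),
    "Canary-1B-Flash": (32, 4),
    "Canary-180M-Flash": (17, 4),
    "Canary-1B-v2": (32, 8),
}


def layer_passes(encoder_layers, decoder_layers, output_tokens):
    """Encoder runs once; the decoder stack runs once per output token."""
    return encoder_layers + decoder_layers * output_tokens


OUTPUT_TOKENS = 100  # assumed transcript length, for illustration only
passes = {
    name: layer_passes(enc, dec, OUTPUT_TOKENS)
    for name, (enc, dec) in LAYERS.items()
}
```

With these numbers the sequential decoder work dominates for Canary-1B, while the Flash models spend most of their depth in the encoder, which processes the whole utterance in parallel.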

Canary-Qwen-2.5B and the SALM design

Canary-Qwen-2.5B departs from the pure encoder-decoder pattern. It is a hybrid that connects the FastConformer speech encoder from Canary-1B-Flash to the Qwen3-1.7B language model through a linear projection, with low-rank adaptation (LoRA) applied to the LLM. The result, which Nvidia calls a speech-augmented language model, operates in two modes. In ASR mode it transcribes speech into text but does not exercise the LLM's reasoning skills. In LLM mode it retains the full capabilities of the underlying language model and can post-process a transcript, for example to summarize it or answer questions about its content. The model is English-only and totals about 2.5 billion parameters.^[5]
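
The two mechanisms named above, a linear projection into the LLM's embedding space and a low-rank (LoRA) weight update, can be sketched with toy matrices. All dimensions are hypothetical, the placement of speech embeddings after the text prompt is an assumption, and the usual LoRA scaling factor is omitted; this is not the model's actual code.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
ENC_DIM, LLM_DIM, RANK = 6, 8, 2  # toy sizes, not the real model's

# 1) Linear projection: map 5 acoustic frames into the LLM embedding space.
speech = rand_matrix(5, ENC_DIM, rng)
projection = rand_matrix(ENC_DIM, LLM_DIM, rng)
speech_in_llm_space = matmul(speech, projection)

# 2) Combine with 3 text-prompt embeddings to form the LLM input sequence.
prompt_embeddings = rand_matrix(3, LLM_DIM, rng)
llm_input = prompt_embeddings + speech_in_llm_space

# 3) LoRA: keep the LLM weight frozen and learn a low-rank correction A @ B.
frozen_w = rand_matrix(LLM_DIM, LLM_DIM, rng)
lora_a = rand_matrix(LLM_DIM, RANK, rng)
lora_b = [[0.0] * LLM_DIM for _ in range(RANK)]  # zero init: no change at start
effective_w = add(frozen_w, matmul(lora_a, lora_b))

lora_params = LLM_DIM * RANK + RANK * LLM_DIM
full_params = LLM_DIM * LLM_DIM
```

Because the second LoRA factor starts at zero, the adapted weight initially equals the frozen one, so training begins from the unmodified language model and only the small factors need to be learned.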

Open ASR Leaderboard results

The Hugging Face Open ASR Leaderboard ranks open speech recognition models primarily by average word error rate (WER, lower is better) across a standard suite of English test sets, alongside an inverse real-time factor (RTFx, higher is faster) that measures throughput. Canary models have set the pace on this benchmark across multiple releases.^[9]

Model	Mean WER (%)	RTFx	Leaderboard note
Canary-1B (2024)	6.67 (launch), 6.5 (current card)	~235	Topped the leaderboard at release^[6]
Canary-180M-Flash	7.12	>1200	High-throughput compact model^[3]
Canary-1B-Flash	6.35	>1000	More accurate and faster than Canary-1B^[2]
Canary-1B-v2	7.15	~749	Tops the multilingual open ranking^[4]
Canary-Qwen-2.5B	5.63	~418	First place on the leaderboard (2025)^[5]
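
For readers unfamiliar with the two metrics, the sketch below computes them from first principles: WER as word-level edit distance divided by the reference length, and RTFx as audio duration divided by processing time. It is a minimal illustration and not the leaderboard's harness, which is assumed to apply text normalization and per-dataset averaging that are omitted here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


wer_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
wer_substitution = word_error_rate("a b c d", "a x c d")
speed = rtfx(3600.0, 4.0)
```

An RTFx of 900 in this example means one hour of audio is transcribed in four seconds; lower WER and higher RTFx are both better.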

At its April 2024 launch, Canary-1B reached the top of the leaderboard with an average WER of 6.67%, which Nvidia reported as the most accurate result among open models at the time.^[6] Nvidia emphasized that Canary-1B achieved this while training on roughly 85,000 hours of speech, an order of magnitude less data than some competing systems, and that on the Mozilla Common Voice 16.1 test sets it outperformed Whisper-large-v3 and SeamlessM4T-Medium.^[6]

The 2025 Flash models pushed throughput dramatically while improving accuracy: Canary-1B-Flash records a mean WER of 6.35% with an RTFx above 1,000 on the leaderboard datasets, and the smaller Canary-180M-Flash exceeds 1,200 RTFx.^[8] Canary-1B-v2 leads the multilingual portion of the ranking while, by Nvidia's account, delivering transcription and translation quality comparable to models three times larger at up to ten times the inference speed.^[7] Canary-Qwen-2.5B claimed first place on the leaderboard in 2025 with a mean WER of 5.63%, the lowest figure reported among the family.^[5] Because the leaderboard is recomputed as models are added and harnesses change, the precise WER values shown on individual model cards can differ slightly from the figures quoted at launch.

Supported languages and tasks

The original Canary-1B and the Flash variants support four languages: English, German, French, and Spanish. They perform ASR in all four and AST between English and each of German, French, and Spanish in both directions.^[1]

Canary-1B-v2 widened coverage to 25 European languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, and Ukrainian. It transcribes all 25 and translates between English and the other 24 in both directions.^[4]
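
The coverage rule described above (transcription in every language, translation only when English is on one side) can be written as a small check. `is_supported` is a hypothetical helper for illustration and not part of any Nvidia API; the language names are taken from the list above.

```python
V2_LANGUAGES = [
    "Bulgarian", "Croatian", "Czech", "Danish", "Dutch", "English",
    "Estonian", "Finnish", "French", "German", "Greek", "Hungarian",
    "Italian", "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Russian", "Slovak", "Slovenian", "Spanish", "Swedish",
    "Ukrainian",
]


def is_supported(source, target):
    """True for ASR in any listed language, or AST with English on one side."""
    if source not in V2_LANGUAGES or target not in V2_LANGUAGES:
        return False
    if source == target:
        return True  # transcription
    return "English" in (source, target)  # translation


asr_directions = [(lang, lang) for lang in V2_LANGUAGES]
ast_directions = [
    (src, tgt)
    for src in V2_LANGUAGES
    for tgt in V2_LANGUAGES
    if src != tgt and is_supported(src, tgt)
]
```

Counting the pairs gives 25 transcription directions and 48 translation directions (24 into English and 24 out of it); a pair such as French to German is not covered directly.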

Across the family the common tasks are transcription and translation, each available with or without punctuation and capitalization. The 2025 models added word-level and segment-level timestamp prediction, initially marked experimental, with Canary-1B-v2 extending segment-level timestamps to translated output as well.^[4]

Training data

Canary-1B and the Flash variants were trained on about 85,000 hours of speech, combining roughly 31,000 hours of public data, 20,000 hours collected by Suno, and 34,000 hours of in-house data. Training used Nvidia A100 GPUs.^[1]

Canary-1B-v2 was trained on a far larger corpus of roughly 1.7 million hours, drawing on the open Granary dataset together with Nvidia's NeMo ASR Set 3.0.^[4] Announced alongside the model in August 2025, Granary is an open-source corpus of about a million hours of multilingual audio, split roughly between speech recognition and speech translation and covering the same 25 European languages, including lower-resource languages such as Croatian, Estonian, and Maltese.^[7] Canary-Qwen-2.5B was trained on about 234,500 hours of publicly available English speech.^[5]

Licensing

Licensing differs between the original model and the later releases. Canary-1B weights are distributed under the research-oriented, non-commercial CC-BY-NC-4.0 license.^[1] The 2025 models, Canary-180M-Flash, Canary-1B-Flash, Canary-1B-v2, and Canary-Qwen-2.5B, are released under the more permissive CC-BY-4.0 license and are explicitly cleared for commercial use, reflecting Nvidia's shift toward openly usable speech models. The associated training and inference code ships with the NeMo toolkit under the Apache 2.0 license.

Significance within NVIDIA NeMo

Canary is a flagship demonstration of the speech capabilities in Nvidia's NeMo ecosystem and a recurring showcase for the FastConformer encoder that NeMo popularized. By repeatedly topping the Open ASR Leaderboard, the family has helped establish Nvidia as a leader in open speech recognition and translation, a position reinforced by the parallel Parakeet models that share the same FastConformer foundation. The release of the Granary dataset alongside Canary-1B-v2 extended that contribution beyond model weights into open training data for multilingual speech,^[7] and the Canary-Qwen SALM illustrates how Nvidia is fusing speech encoders with large language models. Distributed under permissive licenses and integrated with Nvidia's Riva and NIM deployment paths, Canary models are positioned both as research baselines and as production-ready building blocks for transcription, captioning, voice interfaces, and multilingual translation.

References

1. NVIDIA, "nvidia/canary-1b," Hugging Face model card. https://huggingface.co/nvidia/canary-1b
2. NVIDIA, "nvidia/canary-1b-flash," Hugging Face model card. https://huggingface.co/nvidia/canary-1b-flash
3. NVIDIA, "nvidia/canary-180m-flash," Hugging Face model card. https://huggingface.co/nvidia/canary-180m-flash
4. NVIDIA, "nvidia/canary-1b-v2," Hugging Face model card. https://huggingface.co/nvidia/canary-1b-v2
5. NVIDIA, "nvidia/canary-qwen-2.5b," Hugging Face model card. https://huggingface.co/nvidia/canary-qwen-2.5b
6. NVIDIA Technical Blog, "New Standard for Speech Recognition and Translation from the NVIDIA NeMo Canary Model." https://developer.nvidia.com/blog/new-standard-for-speech-recognition-and-translation-from-the-nvidia-nemo-canary-model/
7. NVIDIA Blog, "Now We're Talking: NVIDIA Releases Open Dataset, Models for Multilingual Speech AI." https://blogs.nvidia.com/blog/speech-ai-dataset-models/
8. NVIDIA Technical Blog, "NVIDIA Speech and Translation AI Models Set Records for Speed and Accuracy." https://developer.nvidia.com/blog/nvidia-speech-and-translation-ai-models-set-records-for-speed-and-accuracy/
9. Hugging Face, "Open ASR Leaderboard." https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
