Qwen3-Omni
Last reviewed
May 31, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 2,155 words
Qwen3-Omni is a natively end-to-end omni-modal foundation model developed by the Qwen team at Alibaba Cloud, capable of understanding text, images, audio, and video and generating both text and natural speech in real time. [1][2] It was released in September 2025 as part of the Qwen3 family, with a technical report posted to arXiv on 22 September 2025 and open weights distributed under the Apache 2.0 license. [3][4] The headline claim is that a single model holds state-of-the-art performance across all four input modalities at once, without the degradation that usually appears when a system is stretched to cover speech and vision alongside text. [3]
The model is built on a Thinker-Talker design that uses a mixture-of-experts layout, paired with a new audio encoder the team calls AuT. [1][3] Across 36 audio and audio-visual benchmarks, Alibaba reports open-source SOTA on 32 and overall SOTA on 22, with results that the company says match or beat closed systems such as Gemini 2.5 Pro, Seed-ASR, and GPT-4o-Transcribe on the audio tasks. [3][4] Three public variants shipped at launch: an Instruct model that speaks, a Thinking model tuned for reasoning, and a Captioner model aimed at detailed audio description. [1][4]
Many systems that advertise broad modality coverage are really stacks of separate components glued together. A speech recognizer turns audio into text, a large language model reasons over that text, and a separate text-to-speech engine voices the reply. Each handoff loses information, adds latency, and means the language model never actually hears tone, pacing, or background sound. Qwen3-Omni instead trains one network to ingest text, images, audio, and video together and to produce text and speech as outputs from the same model. [1][3]
The practical payoff is that the model can reason over what it perceives directly. It can listen to a clip of music and describe the instrumentation, watch a video and answer questions about both what is shown and what is said, or hold a spoken conversation while keeping track of an image the user shared earlier. [1][2] Alibaba also reports that mixing modalities during training did not cost the model its text and image ability. The single-modality and cross-modality results stay competitive with the comparable text-only and vision-only Qwen3 models, which is the part the team treats as the real result rather than the breadth alone. [3]
The lineage matters here. Qwen3-Omni follows the earlier Qwen2.5-Omni model, and it inherits the broader Qwen3 recipe for text and reasoning. The omni model is the branch of the family that folds audio and video perception and speech generation into that base. [1][3]
The architecture splits the work between two cooperating modules. The Thinker is the reasoning core. It takes in the encoded text, image, audio, and video and produces the high-level semantic representation and the text response. The Talker is the voice. It receives those high-level representations straight from the Thinker and generates streaming speech tokens, so the model can start talking before the full text answer is finished. [1][5]
Both modules use a mixture-of-experts transformer rather than a dense one. In an MoE layer the network routes each token to a small subset of expert sub-networks, so the total parameter count can be large while the count of parameters actually used on any given token stays small. That keeps inference cheap and supports the high concurrency Alibaba wants for serving many users at once. [1][5] The naming reflects this: the model is Qwen3-Omni-30B-A3B, where 30B is the Thinker's total parameter count and A3B marks roughly 3 billion active parameters per token. The Talker is a smaller MoE in the few-billion-parameter range, dedicated to speech. [4][6]
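The routing idea can be illustrated with a minimal sketch in plain Python. It shows generic top-k routing, not Qwen3-Omni's actual router: the number of experts, the value of `top_k`, and the toy scalar "experts" are all assumptions made for illustration, since the source does not describe the router configuration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token.

    Returns [(expert_index, weight)], with weights renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, experts, router_logits, top_k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = 0.0
    for idx, weight in route_token(router_logits, top_k):
        out += weight * experts[idx](token)
    return out

# Toy experts (hypothetical): each scales a scalar "token" by a different factor.
experts = [lambda x, s=s: x * s for s in (1.0, 2.0, 3.0, 4.0)]

# Only 2 of the 4 experts run for this token, although all 4 exist in memory.
mixed = moe_layer(1.0, experts, router_logits=[0.0, 5.0, 1.0, 4.0])
```

The unselected experts contribute no compute for this token, which is the sense in which the active parameter count is smaller than the total.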
Speech generation does not stop at speech tokens. The Talker autoregressively predicts a multi-codebook sequence, producing one codec frame per step. A multi-token prediction module then fills in the residual codebooks for each frame, and a lightweight renderer the team calls Code2Wav incrementally synthesizes the waveform. [4][5] Doing this frame by frame is what lets audio stream out continuously instead of waiting for a complete utterance.
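The three stages described above can be sketched as a chain of Python generators. Every function here is a hypothetical stand-in with toy arithmetic, and the codebook count and samples-per-frame values are assumptions chosen for illustration; the point is only the control flow, in which each frame is rendered as soon as it is complete.

```python
from typing import Iterator, List

NUM_CODEBOOKS = 4       # assumed for illustration; not taken from the source
SAMPLES_PER_FRAME = 3   # assumed toy value

def talker_steps(n_frames: int) -> Iterator[int]:
    """Stand-in for the autoregressive Talker: one first-codebook token per step."""
    for t in range(n_frames):
        yield 100 + t

def fill_residual_codebooks(first_token: int) -> List[int]:
    """Stand-in for the multi-token prediction module: complete one codec frame."""
    return [first_token + k for k in range(NUM_CODEBOOKS)]

def render_frame(frame: List[int]) -> List[float]:
    """Stand-in for the Code2Wav renderer: turn one codec frame into samples."""
    level = sum(frame) / len(frame)
    return [level] * SAMPLES_PER_FRAME

def stream_speech(n_frames: int) -> Iterator[List[float]]:
    """Yield one chunk of audio per frame instead of waiting for the utterance."""
    for first_token in talker_steps(n_frames):
        frame = fill_residual_codebooks(first_token)
        yield render_frame(frame)

# The first chunk is available after a single Talker step,
# no matter how long the full utterance would be.
first_chunk = next(stream_speech(n_frames=1_000_000))
```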
The audio front end is a component named AuT, short for Audio Transformer. Alibaba pretrained it on 20 million hours of audio so that it produces strong general-purpose audio representations, covering speech, sound, and music rather than speech alone. [2][5] AuT uses block-wise window attention, which lets the system cache its prefill computation and keep latency low when audio arrives as a live stream. [5] That streaming-friendly design is part of why the model can react quickly in a spoken conversation.
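One way to picture block-wise window attention is as a boolean mask over audio frames. The sketch below assumes a specific pattern (each frame attends to every frame in its own block plus a fixed number of preceding blocks); the source does not give AuT's exact mask, block size, or window, so `block_size` and `left_blocks` are illustrative parameters.

```python
def blockwise_mask(n_frames, block_size, left_blocks=1):
    """Return mask[q][k] == True where query frame q may attend to key frame k.

    Assumed pattern: a frame sees its own block and `left_blocks` blocks before it.
    """
    mask = []
    for q in range(n_frames):
        q_block = q // block_size
        row = []
        for k in range(n_frames):
            k_block = k // block_size
            row.append(q_block - left_blocks <= k_block <= q_block)
        mask.append(row)
    return mask

short = blockwise_mask(n_frames=4, block_size=2)
longer = blockwise_mask(n_frames=6, block_size=2)

# Frames in completed blocks never attend to later blocks, so their rows
# are unchanged when more audio arrives; that is what makes caching possible.
unchanged = all(longer[q][:4] == short[q] and not any(longer[q][4:]) for q in range(4))
```

Under this assumed pattern, work done for earlier blocks does not need to be recomputed as the stream grows.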
Real-time speech only feels real-time if the first sound comes back fast. Alibaba reports a theoretical end-to-end first-packet latency of 234 milliseconds in cold-start settings, where first-packet latency is the gap between the end of a user's input and the first chunk of generated audio. [3] In the company's own measurements of running systems, it cites latency as low as 211 milliseconds in audio-only scenarios and as low as 507 milliseconds when video is also in the mix. [2] These are best-case figures and depend on hardware and serving setup, so they describe what the design can reach under favorable conditions rather than a guarantee for any deployment.
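Because the published figures depend on the serving setup, a deployment would need to measure first-packet latency itself. The following is a minimal sketch of such a measurement against any iterator of audio chunks. `fake_speech_stream` is a hypothetical stand-in that simulates a delay, not a real Qwen3-Omni client.

```python
import time

def time_to_first_chunk(stream):
    """Return (seconds until the first chunk arrived, the first chunk)."""
    start = time.perf_counter()
    first = next(stream)
    return time.perf_counter() - start, first

def fake_speech_stream(first_delay_s, n_chunks):
    """Hypothetical stand-in for a streaming speech response."""
    time.sleep(first_delay_s)  # simulated model and network delay
    for i in range(n_chunks):
        yield f"chunk-{i}".encode()

# Start the clock when the user's input ends, stop it at the first audio chunk.
latency_s, first = time_to_first_chunk(fake_speech_stream(0.05, n_chunks=3))
```

In practice the clock would start when the user's input ends, and the measurement would be repeated many times to report a distribution rather than a single best case.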
Three models were published, all on the 30B-A3B base. They differ in which components they include and how they are tuned. [1][4]
| Variant | Components | Purpose |
|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct | Thinker plus Talker | General use. Takes audio, video, image, and text in; produces text and speech out |
| Qwen3-Omni-30B-A3B-Thinking | Thinker only | Chain-of-thought reasoning over multimodal input; produces text out, no speech |
| Qwen3-Omni-30B-A3B-Captioner | Fine-tuned from Instruct | Detailed, low-hallucination captioning of arbitrary audio |
The Instruct model is the full system, with both the reasoning core and the voice, so it is the one most people will use for spoken interaction. The Thinking model drops the Talker and adds explicit step-by-step reasoning, which suits harder analytical questions where speech output is not needed. The Captioner is the most specialized. Alibaba describes it as a general audio captioning model with low hallucination, released in part to fill a gap the team saw in the open-source community, where detailed audio description models were scarce. [1][4]
On input, the model handles text, still images, audio, and video, including video with its own soundtrack. [1][2] It supports audio inputs up to 30 minutes long, which is enough for a lecture, a meeting recording, or a long piece of music. [2] Context length for the released checkpoints is 32,768 tokens. [4]
On output, it produces text in any of its supported languages and speaks in a smaller set. The language coverage is asymmetric by design: text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. [3][4] The 19 speech-input languages include English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, and Urdu. [4] The 10 speech-output languages are English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, and Korean. [4]
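The asymmetry can be made concrete with a small lookup built only from the languages named above. This is an illustrative sketch, not part of any Qwen API: the speech-input set holds the 18 languages the text lists by name, and the 119 text languages are not enumerated in the source, so they are left out.

```python
# Speech-input languages named in the text (18 of the reported 19).
SPEECH_IN = {
    "English", "Chinese", "Korean", "Japanese", "German", "Russian",
    "Italian", "French", "Spanish", "Portuguese", "Malay", "Dutch",
    "Indonesian", "Turkish", "Vietnamese", "Cantonese", "Arabic", "Urdu",
}

# Speech-output languages named in the text (all 10).
SPEECH_OUT = {
    "English", "Chinese", "French", "German", "Russian",
    "Italian", "Spanish", "Portuguese", "Japanese", "Korean",
}

def speech_modes(language):
    """Report whether a language is listed for spoken input and spoken output."""
    return {
        "speech_in": language in SPEECH_IN,
        "speech_out": language in SPEECH_OUT,
    }

# Languages the model is listed as understanding in speech but not speaking.
listen_only = sorted(SPEECH_IN - SPEECH_OUT)
```

A caller could use a check like this to fall back to text output when the user's language is not in the speech-output set.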
Put together, the capability set covers speech recognition and translation, audio and music analysis, image and document understanding, video question answering, and live spoken dialogue with natural turn-taking. [1][2] Because the Thinker is a full reasoning model, the same system can also do ordinary text tasks.
The central performance claim is the audio and audio-visual sweep: across 36 such benchmarks, open-source SOTA on 32 and overall SOTA on 22. [3][4] The model cards report specific numbers behind that summary. On automatic speech recognition, the Instruct model posts low word error rates across English and Chinese test sets, where lower is better. The Thinking model's strength shows up on reasoning-heavy audio and vision tasks. The figures below are drawn from the official model cards. [6][7]
| Benchmark | Type | Model | Score |
|---|---|---|---|
| Librispeech test-clean | ASR word error rate (lower better) | Instruct | 1.22 |
| Librispeech test-other | ASR word error rate (lower better) | Instruct | 2.48 |
| Wenetspeech test-net | ASR word error rate (lower better) | Instruct | 4.69 |
| Wenetspeech test-meeting | ASR word error rate (lower better) | Instruct | 5.89 |
| Common Voice 15 English | ASR word error rate (lower better) | Instruct | 6.05 |
| Common Voice 15 Chinese | ASR word error rate (lower better) | Instruct | 4.31 |
| VoiceBench overall | Spoken-dialogue understanding | Instruct | 85.5 |
| MMAU (audio reasoning) | Audio understanding and reasoning | Instruct | 77.5 |
| MMAU (audio reasoning) | Audio understanding and reasoning | Thinking | 75.4 |
| MMSU | Spoken-language understanding | Thinking | 70.2 |
| Video-MME | Video understanding | Thinking | 69.7 |
| MLVU | Long-video understanding | Thinking | 72.9 |
| LVBench | Long-video understanding | Thinking | 49.0 |
| WorldSense | Audio-visual understanding | Thinking | 54.0 |
| MMMU validation | Image and document reasoning | Thinking | 75.6 |
| MathVista mini | Visual mathematics | Thinking | 80.0 |
| MMStar | Multimodal reasoning | Thinking | 74.9 |
| HallusionBench | Visual hallucination | Thinking | 62.8 |
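For readers unfamiliar with the metric in the ASR rows, word error rate is the word-level edit distance between the reference transcript and the model's transcript, divided by the number of reference words. The sketch below is a generic textbook implementation, not the scoring script behind the reported numbers; real evaluations also normalize text before scoring, which is omitted here.

```python
def word_error_rate(reference, hypothesis):
    """Return WER as a percentage: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

On this scale a score of 1.22 corresponds to roughly one word-level error per 82 reference words.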
For reference on the vision side, the Thinking model's published comparison places its MMMU validation score of 75.6 and MathVista mini score of 80.0 above the figures listed for GPT-4o on the same card, which are 69.1 and 63.8, while sitting close to Gemini 2.5 Flash Thinking. [7] These comparisons come from Alibaba's own reporting and use the test versions and prompting the team chose, so they are best read as the developer's results rather than an independent audit.
All three variants are released under the Apache 2.0 license, the same permissive open-source license used across much of the Qwen3 family. [3][4] Apache 2.0 allows commercial use, modification, and redistribution with few conditions, which puts Qwen3-Omni among the more openly licensed omni-modal models available. The weights are distributed through Hugging Face and ModelScope, and the team published inference code, a cookbook, and deployment recipes in the GitHub repository. [1][4]
Qwen3-Omni sits inside the wider Qwen ecosystem built by Alibaba Cloud. It shares the Qwen3 reasoning foundation and follows the earlier Qwen2.5-Omni, extending that line with the larger AuT encoder, the MoE Thinker-Talker design, and broader language coverage. [1][3] Within the family it is the model to reach for when a task needs hearing or speech rather than text and images alone.
The obvious external comparison is to the GPT-4o style of model, the class of systems that accept speech, vision, and text and reply by voice with low latency. [2] Qwen3-Omni targets the same interaction pattern but takes the open-weight route, so developers can run it on their own hardware and fine-tune it, which is not possible with the closed commercial systems it is measured against. On the audio benchmarks Alibaba positions it ahead of several of those closed systems, though the cross-model comparisons come from the developer and cover the tasks the team selected. [3][4]
The latency numbers are theoretical or best-case figures from Alibaba's own setup, so real deployments on different hardware should expect higher response times. [2][3] Speech output is also far narrower than text: the model writes in 119 languages but speaks in only 10, so most of its language coverage is text-only. [4] Voice input is itself limited to 19 languages. [4]
The benchmark claims, while detailed, are self-reported and use the test variants and prompting the team chose, which is the norm for new model releases but still calls for independent replication before the SOTA labels can be treated as settled. [3][4] As a 30-billion-parameter MoE model with a separate speech stack, it also asks for substantial GPU memory to run at full capability, even though its active parameter count keeps the per-token compute closer to a 3-billion-parameter model. [4][6] And like other generative audio systems, the speech and captioning outputs can still contain errors or hallucinations, which is the exact failure the Captioner variant was tuned to reduce rather than eliminate. [4]
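The gap between memory cost and compute cost can be shown with back-of-the-envelope arithmetic. The sketch assumes 16-bit weights (2 bytes per parameter) and counts only the Thinker's nominal 30 billion parameters; the Talker, the encoders, the KV cache, and activations are ignored, so a real deployment would need more than this estimate.

```python
def weight_memory_gib(n_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in GiB (assumes 16-bit weights by default)."""
    return n_params * bytes_per_param / 2**30

# All experts must be resident in memory, even though few run per token.
total_gib = weight_memory_gib(30e9)

# Weights actually touched per token, using the nominal 3B active figure.
active_gib = weight_memory_gib(3e9)

print(f"resident weights: ~{total_gib:.1f} GiB, active per token: ~{active_gib:.1f} GiB")
```

Under these assumptions the weights occupy about ten times more memory than the portion used on any single token, which is why the memory requirement tracks the 30B figure while per-token compute tracks the 3B figure.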