Qwen3-Omni

Updated Jun 9, 2026

Qwen3-Omni is a natively end-to-end omni-modal foundation model developed by the Qwen team at Alibaba Cloud, capable of understanding text, images, audio, and video and generating both text and natural speech in real time. ^[1]^[2] It was released in September 2025 as part of the Qwen3 family, with a technical report posted to arXiv on 22 September 2025 and open weights distributed under the Apache 2.0 license. ^[3]^[4] The headline claim is that a single model holds state-of-the-art performance across all four input modalities at once, without the degradation that usually appears when a system is stretched to cover speech and vision alongside text. ^[3]

The model is built on a Thinker-Talker design that uses a mixture-of-experts layout, paired with a new audio encoder the team calls AuT. ^[1]^[3] Across 36 audio and audio-visual benchmarks, Alibaba reports open-source SOTA on 32 and overall SOTA on 22, with results that the company says match or beat closed systems such as Gemini 2.5 Pro, Seed-ASR, and GPT-4o-Transcribe on the audio tasks. ^[3]^[4] Three public variants shipped at launch: an Instruct model that speaks, a Thinking model tuned for reasoning, and a Captioner model aimed at detailed audio description. ^[1]^[4]

Many systems that advertise broad modality coverage are really stacks of separate components glued together. A speech recognizer turns audio into text, a large language model reasons over that text, and a separate text-to-speech engine voices the reply. Each handoff loses information, adds latency, and means the language model never actually hears tone, pacing, or background sound. Qwen3-Omni instead trains one network to ingest text, images, audio, and video together and to produce text and speech as outputs from the same model. ^[1]^[3]

The practical payoff is that the model can reason over what it perceives directly. It can listen to a clip of music and describe the instrumentation, watch a video and answer questions about both what is shown and what is said, or hold a spoken conversation while keeping track of an image the user shared earlier. ^[1]^[2] Alibaba also reports that mixing modalities during training did not cost the model its text and image ability. The single-modality and cross-modality results stay competitive with the comparable text-only and vision-only Qwen3 models, which is the part the team treats as the real result rather than the breadth alone. ^[3]

The lineage matters here. Qwen3-Omni follows the earlier Qwen2.5-Omni model, and it inherits the broader Qwen3 recipe for text and reasoning. The omni model is the branch of the family that folds audio and video perception and speech generation into that base. ^[1]^[3]

Thinker-Talker architecture

The architecture splits the work between two cooperating modules. The Thinker is the reasoning core. It takes in the encoded text, image, audio, and video and produces the high-level semantic representation and the text response. The Talker is the voice. It receives those high-level representations straight from the Thinker and generates streaming speech tokens, so the model can start talking before the full text answer is finished. ^[1]^[5]

Both modules use a mixture-of-experts (MoE) transformer rather than a dense one. In an MoE layer the network routes each token to a small subset of expert sub-networks, so the total parameter count can be large while the number of parameters actually used on any given token stays small. That keeps inference cheap and supports the high concurrency Alibaba wants for serving many users at once. ^[1]^[5] The naming reflects this: the model is Qwen3-Omni-30B-A3B, where 30B is the Thinker's total parameter count and A3B marks roughly 3 billion active parameters per token. The Talker is a smaller MoE in the few-billion-parameter range, dedicated to speech. ^[4]^[6]
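The routing idea can be illustrated with a toy example. The sketch below is not Qwen3-Omni's router: the expert count, the top-k value, the scalar "experts", and the function names `route_token` and `moe_layer` are all assumptions chosen to keep the example small. It shows only the general pattern of scoring every expert, keeping the best few, and running the token through those alone.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, experts, router_logits, top_k):
    """Weighted sum of the outputs of only the selected experts."""
    out = 0.0
    for idx, weight in route_token(router_logits, top_k):
        out += weight * experts[idx](token)  # unselected experts never run
    return out

# Hypothetical toy setup: 8 scalar "experts", 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, s=s: s * x for s in range(1, NUM_EXPERTS + 1)]
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

picked = route_token(logits, TOP_K)
y = moe_layer(1.0, experts, logits, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS

# The ratio implied by the model name: about 3B active out of 30B total.
name_active_fraction = 3 / 30
print(picked, y, active_fraction, name_active_fraction)
```

In the toy layer a quarter of the experts run per token. Read the same way, the 30B-A3B name implies that roughly a tenth of the Thinker's parameters are touched per token.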

Speech generation does not stop at speech tokens. The Talker autoregressively predicts a multi-codebook sequence, producing one codec frame per step. A multi-token prediction module then fills in the residual codebooks for each frame, and a lightweight renderer the team calls Code2Wav incrementally synthesizes the waveform. ^[4]^[5] Doing this frame by frame is what lets audio stream out continuously instead of waiting for a complete utterance.
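The frame-by-frame flow can be sketched with Python generators. Everything here is a stand-in: the functions `talker_steps`, `fill_residual_codebooks`, and `code2wav_chunk` are hypothetical placeholders for the three stages named above, and the codebook count, token values, and samples per frame are invented for illustration. The point is only the control flow, in which each frame is rendered as soon as it exists.

```python
from typing import Iterator, List

NUM_CODEBOOKS = 4        # assumed for illustration; not taken from the report
SAMPLES_PER_FRAME = 3    # assumed toy value

def talker_steps(num_frames: int) -> Iterator[int]:
    """Stand-in for the Talker: one first-codebook token per autoregressive step."""
    for step in range(num_frames):
        yield step % 7  # placeholder token id

def fill_residual_codebooks(first_code: int) -> List[int]:
    """Stand-in for the multi-token prediction module: complete one codec frame."""
    return [first_code] + [(first_code + k) % 7 for k in range(1, NUM_CODEBOOKS)]

def code2wav_chunk(frame: List[int]) -> List[float]:
    """Stand-in for the Code2Wav renderer: turn one frame into a few samples."""
    level = sum(frame) / (7.0 * len(frame))
    return [level] * SAMPLES_PER_FRAME

def stream_speech(num_frames: int) -> Iterator[List[float]]:
    """Yield audio chunk by chunk; nothing waits for the full utterance."""
    for first_code in talker_steps(num_frames):
        frame = fill_residual_codebooks(first_code)
        yield code2wav_chunk(frame)

stream = stream_speech(num_frames=5)
first_chunk = next(stream)   # available after a single Talker step
remaining = list(stream)     # the rest arrives one frame at a time
```

Because `stream_speech` is a generator, the first chunk is produced after one Talker step, which is the property that makes low first-packet latency possible.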

The AuT audio encoder

The audio front end is a component named AuT, short for Audio Transformer. Alibaba pretrained it on 20 million hours of audio so that it produces strong general-purpose audio representations, covering speech, sound, and music rather than speech alone. ^[2]^[5] AuT uses block-wise window attention, which lets the system cache its prefill computation and keep latency low when audio arrives as a live stream. ^[5] That streaming-friendly design is part of why the model can react quickly in a spoken conversation.
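One plausible reading of block-wise window attention is an attention mask in which each frame sees its own block plus a limited number of preceding blocks, and never a later one. The sketch below builds such a mask. The block size, the `lookback_blocks` parameter, and the function name `blockwise_mask` are assumptions for illustration, not AuT's actual configuration.

```python
def blockwise_mask(num_frames, block_size, lookback_blocks=0):
    """mask[q][k] is True when query frame q may attend to key frame k."""
    mask = []
    for q in range(num_frames):
        q_block = q // block_size
        row = []
        for k in range(num_frames):
            k_block = k // block_size
            # Allowed: the query's own block and up to lookback_blocks before it.
            row.append(q_block - lookback_blocks <= k_block <= q_block)
        mask.append(row)
    return mask

mask = blockwise_mask(num_frames=6, block_size=2, lookback_blocks=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Rows for the first four frames are identical whether or not frames 4-5 exist,
# so work done on earlier blocks can be cached as more audio streams in.
shorter = blockwise_mask(num_frames=4, block_size=2, lookback_blocks=1)
prefix_is_stable = [row[:4] for row in mask[:4]] == shorter
```

Since no frame depends on a later block, appending new audio never changes what earlier frames attended to, which is the property that lets prefill results be reused.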

Latency

Real-time speech only feels real-time if the first sound comes back fast. Alibaba reports a theoretical end-to-end first-packet latency of 234 milliseconds in cold-start settings, where first-packet latency is the gap between the end of a user's input and the first chunk of generated audio. ^[3] In the company's own measurements of running systems, it cites latency as low as 211 milliseconds in audio-only scenarios and as low as 507 milliseconds when video is also in the mix. ^[2] These are best-case figures and depend on hardware and serving setup, so they describe the fastest response the design can reach rather than a guarantee for any deployment.

Model variants

Three models were published, all on the 30B-A3B base. They differ in which components they include and how they are tuned. ^[1]^[4]

Variant	Components	Purpose
Qwen3-Omni-30B-A3B-Instruct	Thinker plus Talker	General use. Takes audio, video, image, and text in; produces text and speech out
Qwen3-Omni-30B-A3B-Thinking	Thinker only	Chain-of-thought reasoning over multimodal input; produces text out, no speech
Qwen3-Omni-30B-A3B-Captioner	Fine-tuned from Instruct	Detailed, low-hallucination captioning of arbitrary audio

The Instruct model is the full system, with both the reasoning core and the voice, so it is the one most people will use for spoken interaction. The Thinking model drops the Talker and adds explicit step-by-step reasoning, which suits harder analytical questions where speech output is not needed. The Captioner is the most specialized. Alibaba describes it as a general audio captioning model with low hallucination, released in part to fill a gap the team saw in the open-source community, where detailed audio description models were scarce. ^[1]^[4]

Capabilities

On input, the model handles text, still images, audio, and video, including video with its own soundtrack. ^[1]^[2] It supports audio inputs up to 30 minutes long, which is enough for a lecture, a meeting recording, or a long piece of music. ^[2] Context length for the released checkpoints is 32,768 tokens. ^[4]

On output, it produces text in any of its supported languages and speaks in a smaller set. The language coverage is asymmetric by design: text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. ^[3]^[4] The 19 speech-input languages include English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, and Urdu. ^[4] The 10 speech-output languages are English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, and Korean. ^[4]
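An application has to handle this asymmetry explicitly. The sketch below is an assumed helper, not part of any Qwen tooling. It encodes the speech languages named above (the list gives 18 of the 19 speech-input languages) and falls back to a text reply when a language can be heard but not spoken. The names `speech_modes` and `reply_channel` are hypothetical.

```python
# Languages as named in the article; the input list names 18 of the 19.
SPEECH_INPUT = {
    "English", "Chinese", "Korean", "Japanese", "German", "Russian",
    "Italian", "French", "Spanish", "Portuguese", "Malay", "Dutch",
    "Indonesian", "Turkish", "Vietnamese", "Cantonese", "Arabic", "Urdu",
}
SPEECH_OUTPUT = {
    "English", "Chinese", "French", "German", "Russian",
    "Italian", "Spanish", "Portuguese", "Japanese", "Korean",
}

def speech_modes(language):
    """Report whether the model can hear and/or speak the given language."""
    return {
        "speech_in": language in SPEECH_INPUT,
        "speech_out": language in SPEECH_OUTPUT,
    }

def reply_channel(language):
    """Speak when speech generation covers the language, otherwise reply in text."""
    return "speech" if language in SPEECH_OUTPUT else "text"

print(speech_modes("Urdu"), reply_channel("Urdu"))
print(speech_modes("Korean"), reply_channel("Korean"))
```

Every speech-output language in the list also appears in the speech-input list, so a spoken reply is always in a language the model can also understand by ear.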

Put together, the capability set covers speech recognition and translation, audio and music analysis, image and document understanding, video question answering, and live spoken dialogue with natural turn-taking. ^[1]^[2] Because the Thinker is a full reasoning model, the same system can also do ordinary text tasks.

Benchmark results

The central performance claim is the audio and audio-visual sweep: across 36 such benchmarks, open-source SOTA on 32 and overall SOTA on 22. ^[3]^[4] The model cards report specific numbers behind that summary. On automatic speech recognition, the Instruct model posts low word error rates across English and Chinese test sets, where lower is better. The Thinking model's strength shows up on reasoning-heavy audio and vision tasks. The figures below are drawn from the official model cards. ^[6]^[7]

Benchmark	Type	Model	Score
Librispeech test-clean	ASR word error rate (lower better)	Instruct	1.22
Librispeech test-other	ASR word error rate (lower better)	Instruct	2.48
Wenetspeech test-net	ASR word error rate (lower better)	Instruct	4.69
Wenetspeech test-meeting	ASR word error rate (lower better)	Instruct	5.89
Common Voice 15 English	ASR word error rate (lower better)	Instruct	6.05
Common Voice 15 Chinese	ASR word error rate (lower better)	Instruct	4.31
VoiceBench overall	Spoken-dialogue understanding	Instruct	85.5
MMAU (audio reasoning)	Audio understanding and reasoning	Instruct	77.5
MMAU (audio reasoning)	Audio understanding and reasoning	Thinking	75.4
MMSU	Spoken-language understanding	Thinking	70.2
Video-MME	Video understanding	Thinking	69.7
MLVU	Long-video understanding	Thinking	72.9
LVBench	Long-video understanding	Thinking	49.0
WorldSense	Audio-visual understanding	Thinking	54.0
MMMU validation	Image and document reasoning	Thinking	75.6
MathVista mini	Visual mathematics	Thinking	80.0
MMStar	Multimodal reasoning	Thinking	74.9
HallusionBench	Visual hallucination	Thinking	62.8
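For readers unfamiliar with the ASR rows, word error rate is the word-level edit distance between the reference transcript and the model's transcript, divided by the number of reference words. The sketch below is a generic textbook implementation, not the scoring script used for the model cards, which may normalize text differently. The function name `word_error_rate` and the example sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
perfect = word_error_rate("the cat sat on the mat", "the cat sat on the mat")
print(round(wer, 2), perfect)
```

On this scale, a score such as 1.22 on Librispeech test-clean corresponds to a little over one word-level error per hundred reference words.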

For reference on the vision side, the Thinking model's published comparison places its MMMU validation score of 75.6 and MathVista mini score of 80.0 above the figures listed for GPT-4o on the same card, which are 69.1 and 63.8, while sitting close to Gemini 2.5 Flash Thinking. ^[7] These comparisons come from Alibaba's own reporting and use the test versions and prompting the team chose, so they are best read as the developer's results rather than an independent audit.

Licensing

All three variants are released under the Apache 2.0 license, the same permissive open-source license used across much of the Qwen3 family. ^[3]^[4] Apache 2.0 allows commercial use, modification, and redistribution with few conditions, which puts Qwen3-Omni among the more openly licensed omni-modal models available. The weights are distributed through Hugging Face and ModelScope, and the team published inference code, a cookbook, and deployment recipes in the GitHub repository. ^[1]^[4]

Relation to Qwen and other omni models

Qwen3-Omni sits inside the wider Qwen ecosystem built by Alibaba Cloud. It shares the Qwen3 reasoning foundation and follows the earlier Qwen2.5-Omni, extending that line with the larger AuT encoder, the MoE Thinker-Talker design, and broader language coverage. ^[1]^[3] Within the family it is the model to reach for when a task needs hearing or speech rather than text and images alone.

The obvious external comparison is to the GPT-4o style of model, the class of systems that accept speech, vision, and text and reply by voice with low latency. ^[2] Qwen3-Omni targets the same interaction pattern but takes the open-weight route, so developers can run it on their own hardware and fine-tune it, which is not possible with the closed commercial systems it is measured against. On the audio benchmarks Alibaba positions it ahead of several of those closed systems, though the cross-model comparisons come from the developer and cover the tasks the team selected. ^[3]^[4]

Limitations

The latency numbers are theoretical or best-case figures from Alibaba's own setup, so real deployments on different hardware should expect higher response times. ^[2]^[3] Speech output is also far narrower than text: the model writes in 119 languages but speaks in only 10, so most of its language coverage is text-only. ^[4] Voice input is itself limited to 19 languages. ^[4]

The benchmark claims, while detailed, are self-reported and use the test variants and prompting the team chose, which is the norm for new model releases but still calls for independent replication before the SOTA labels can be treated as settled. ^[3]^[4] As a 30-billion-parameter MoE model with a separate speech stack, it also asks for substantial GPU memory to run at full capability, even though its active parameter count keeps the per-token compute closer to a 3-billion-parameter model. ^[4]^[6] And like other generative audio systems, the speech and captioning outputs can still contain errors or hallucinations, which is the exact failure the Captioner variant was tuned to reduce rather than eliminate. ^[4]
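The gap between memory and compute can be made concrete with rough arithmetic. The sketch below assumes 16-bit weights at 2 bytes per parameter and counts only the Thinker's weights. It ignores the Talker, the encoders, activations, and the key-value cache, so it is a lower bound on real memory use, not a sizing guide.

```python
BYTES_PER_PARAM_BF16 = 2  # assumption: weights stored in 16-bit precision

def weight_gigabytes(num_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

total_gb = weight_gigabytes(30e9)   # all experts must be resident in memory
active_gb = weight_gigabytes(3e9)   # weights actually read for one token
print(total_gb, active_gb)
```

Under these assumptions all 30 billion parameters, about 60 GB, must be loaded even though only about 6 GB of weights take part in any single token's computation.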

References

[1] "Qwen3-Omni." QwenLM GitHub repository. https://github.com/QwenLM/Qwen3-Omni
[2] "Qwen3-Omni: Natively Omni-Modal Foundation Models." Alibaba Cloud Community. https://www.alibabacloud.com/blog/qwen3-omni-natively-omni-modal-foundation-models_602581
[3] Xu, Jin, et al. "Qwen3-Omni Technical Report." arXiv:2509.17765, 22 September 2025. https://arxiv.org/abs/2509.17765
[4] "Qwen3-Omni-30B-A3B-Instruct." Hugging Face model card. https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
[5] "Qwen3-Omni Technical Report (HTML)." arXiv. https://arxiv.org/html/2509.17765v1
[6] "Qwen3-Omni-30B-A3B-Instruct - Model Info, Parameters, Benchmarks." SiliconFlow. https://www.siliconflow.com/models/qwen3-omni-30b-a3b-instruct
[7] "Qwen3-Omni-30B-A3B-Thinking." Hugging Face model card. https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Thinking
[8] "Qwen3-Omni-30B-A3B-Captioner." Hugging Face model card. https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Captioner
[9] "Alibaba Cloud Releases the Open Qwen3-Omni, Its First Natively End-to-End Omni-Modal AI." Hackster.io. https://www.hackster.io/news/alibaba-cloud-releases-the-open-qwen3-omni-its-first-natively-end-to-end-omni-modal-ai-320a414212cc
[10] "Qwen3-Omni Now on SiliconFlow: Alibaba's Next-Gen Multimodal Foundation Model." SiliconFlow. https://www.siliconflow.com/blog/qwen3-omni-now-on-siliconflow-alibaba-s-next-gen-multimodal-foundation-model
[11] "Qwen3-Omni: Alibaba's Open-source Omni Model." The Unwind AI. https://www.theunwindai.com/p/alibaba-s-open-source-qwen3-omni-model
