Qwen2-Audio
Last reviewed
Jun 3, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,635 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,635 words
Add missing citations, update stale details, or suggest a clearer explanation.
Qwen2-Audio is an audio-language model developed by the Qwen team at Alibaba Cloud, released in August 2024 [1][2]. It accepts audio inputs (human speech, natural sounds, music, and singing) alongside text, and produces text responses. The model supports two interaction modes: a voice chat mode in which a user speaks to the model with no text input, and an audio analysis mode in which the user supplies audio together with text instructions about what to do with it [1]. Qwen2-Audio is the successor to the original Qwen-Audio (November 2023) and sits in the lineage that later led to the fully multimodal Qwen2.5-Omni and Qwen3-Omni systems [3][4]. The open-weight checkpoints are released under the Apache 2.0 license [5].
The model combines an audio encoder initialised from OpenAI's Whisper-large-v3 with a Qwen-7B language-model backbone, for a total of about 8.2 billion parameters [2]. It was trained with a three-stage recipe of multi-task pretraining, supervised fine-tuning, and Direct Preference Optimization (DPO) [1][2]. On the audio-instruction benchmark AIR-Bench, the Qwen team reported that Qwen2-Audio outperformed prior state-of-the-art systems including Gemini-1.5-pro on audio-centric chat tasks [1].
The audio-model line began with Qwen-Audio, described in the paper "Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models" (arXiv:2311.07919), first submitted on 14 November 2023 [6]. Like its successor, Qwen-Audio pairs an audio encoder with the Qwen-7B large language model; the encoder was initialised from Whisper-large-v2 [3]. The open weights were published at the end of November 2023 [3].
The central design goal of Qwen-Audio was a single model able to handle many audio types and tasks at once. The authors scaled audio-language pretraining to cover more than 30 tasks across human speech, natural sounds, music, and songs [6]. Because the textual labels for different datasets vary widely in language, granularity, and structure, naively co-training all of them causes one-to-many interference, where the same audio could map to very different target text depending on the task. Qwen-Audio addressed this with a multi-task framework that conditions the decoder on a sequence of hierarchical tags: shared tags encourage knowledge transfer between related tasks, while task-specific tags isolate them and prevent interference [6]. The result was a model that reached strong results across a range of benchmarks without any task-specific fine-tuning. A second model, Qwen-Audio-Chat, was instruction-tuned on top of the base model to support multi-turn dialogue with mixed audio and text inputs [6].
Qwen-Audio reported competitive or state-of-the-art results on several public benchmarks at release, summarised below [3].
| Task / dataset | Metric | Qwen-Audio |
|---|---|---|
| Aishell1 (dev / test) | WER (%) | 1.2 / 1.3 |
| LibriSpeech test-clean | WER (%) | 2.0 |
| LibriSpeech test-other | WER (%) | 4.2 |
| CochlScene (acoustic scene) | Accuracy | 0.795 |
| ClothoAQA (audio question answering) | Accuracy | 0.579 |
| VocalSound (vocal sound classification) | Accuracy | 0.9289 |
Lower word error rate (WER) is better, while higher accuracy is better. The Aishell1, CochlScene, and VocalSound figures were reported as state of the art at the time [3].
Qwen2-Audio keeps the two-part structure of its predecessor: an audio encoder feeds audio representations into an autoregressive language model that generates text. The audio encoder is initialised from Whisper-large-v3, and the language backbone is Qwen-7B, giving a combined parameter count of roughly 8.2 billion [2].
Audio is preprocessed by resampling to 16 kHz and converting the waveform into a 128-channel mel-spectrogram using a 25 ms window and a 10 ms hop [2]. A pooling layer with a stride of two then shortens the sequence so that each frame of the encoder output corresponds to about a 40 ms span of the original audio [2]. The encoded audio is interleaved with text tokens and passed to the Qwen-7B decoder, which produces the textual answer.
The most visible change from Qwen-Audio is the removal of the hierarchical-tag scheme. Instead of conditioning on structured task tags, Qwen2-Audio uses plain natural-language prompts to distinguish data sources and tasks during pretraining, which the authors describe as a simplification that also let them expand the training data substantially [1].
Qwen2-Audio exposes two modes of use, and notably it does not rely on a system prompt to switch between them; the model infers the intended mode from the input itself [1].
In voice chat mode the user speaks to the model directly, with no separate automatic speech recognition front end and no text input. The model interprets the spoken request and replies in text [1]. In audio analysis mode the user provides an audio clip together with a text instruction, for example asking the model to transcribe, translate, classify, or describe the audio [1].
A capability the team highlighted is that the model can disentangle these uses within a single clip. Given an audio segment that contains background sounds, a multi-speaker conversation, and an embedded voice command at the same time, Qwen2-Audio can recognise the command and respond to it while also interpreting the rest of the audio [1].
Qwen2-Audio is trained in three stages [1][2].
The first stage is multi-task pretraining that aligns the audio and language representations. This replaces the hierarchical-tag conditioning of Qwen-Audio with natural-language prompts and uses a larger and broader audio dataset spanning speech, sound, and music [1].
The second stage is supervised fine-tuning (SFT), which uses instruction-style data to improve the model's ability to follow user instructions across both interaction modes [1][2].
The third stage applies Direct Preference Optimization. DPO optimises the model against pairs of preferred and dispreferred responses, which the authors credit for improvements in factuality and in adherence to desired behaviour [1]. The model released for general use after these three stages is Qwen2-Audio-7B-Instruct.
Two checkpoints were published [5].
| Model | Description |
|---|---|
| Qwen2-Audio-7B | Base pretrained audio-language model |
| Qwen2-Audio-7B-Instruct | Instruction-tuned model with voice chat and audio analysis modes, the recommended checkpoint for interaction |
Both checkpoints are about 8.2 billion parameters and are distributed in BF16 precision [2][5].
The headline evaluation in the technical report is AIR-Bench, a benchmark for audio-instruction-following whose chat split is scored by GPT-4 on a 0 to 10 scale across four audio categories [1]. Qwen2-Audio improved on both its predecessor and on contemporary proprietary models. The reported chat scores are shown below [4].
| Model | Speech | Sound | Music | Mixed-Audio |
|---|---|---|---|---|
| Qwen2-Audio | 7.18 | 6.99 | 6.79 | 6.77 |
| Qwen-Audio | 6.47 | 6.95 | 5.52 | 6.08 |
| Gemini-1.5-pro | 6.97 | 5.49 | 5.06 | 5.27 |
The technical report notes that the Gemini-1.5-pro evaluation covered a smaller sample than the others because some inputs were rejected by that model's safety filters [4].
On core speech tasks, Qwen2-Audio reported the following figures [4].
| Task | Dataset | Result |
|---|---|---|
| Speech recognition | LibriSpeech test-clean / test-other | 1.6 / 3.6 WER (%) |
| Speech recognition | Aishell2 (Mic / iOS / Android) | 3.0 / 3.0 / 2.9 WER (%) |
| Speech recognition | Common Voice 15 (en / zh / yue / fr) | 8.6 / 6.9 / 5.9 / 9.6 WER (%) |
| Speech recognition | Fleurs (zh) | 7.5 WER (%) |
| Speech translation | CoVoST2 (en-de / de-en / en-zh / zh-en) | 29.9 / 35.2 / 45.2 / 24.4 BLEU |
| Speech emotion recognition | MELD | 0.553 (accuracy) |
| Vocal sound classification | VocalSound | 0.9392 (accuracy) |
For speech recognition, lower WER is better; for translation, higher BLEU is better; for classification, higher accuracy is better. The CoVoST2 results also include es-en (40.0), fr-en (38.5), and it-en (36.3) BLEU [4]. The instruction-tuned model's AIR-Bench chat scores by GPT-4 evaluation were reported in the GitHub repository as 7.24 (Speech), 6.83 (Sound), 6.73 (Music), and 6.42 (Mixed-Audio) [5].
The benchmark coverage spans several languages, including English, Mandarin Chinese, Cantonese, French, German, Spanish, and Italian, reflected in the multilingual ASR and speech-translation results above [5]. The underlying Whisper-large-v3 encoder is itself multilingual, and the speech-translation evaluation exercises translation between English and German, Chinese, Spanish, French, and Italian [4].
The Qwen2-Audio weights are released under the Apache 2.0 license, as stated on the Hugging Face model cards for both the base and instruct checkpoints [5]. The repository notes that, unlike some earlier Qwen releases, no separate request is required for commercial use [5]. Code and inference examples are available in the QwenLM/Qwen2-Audio GitHub repository [5].
Qwen2-Audio was followed by a shift toward end-to-end omni-modal models rather than audio-only ones. Qwen2.5-Omni, released on 26 March 2025 (arXiv:2503.20215), perceives text, images, audio, and video and generates both text and natural speech in a streaming manner, using a "Thinker-Talker" architecture and a time-aligned multimodal position embedding called TMRoPE [7]. The Qwen2.5-Omni report states that it outperforms Qwen2-Audio on audio understanding [7]. The line continued with Qwen3-Omni, the Qwen team's later natively end-to-end omni-modal model that understands text, audio, images, and video and generates real-time speech [8]. The Qwen series as a whole is also marketed in China under the name Tongyi Qianwen.