Qwen2-Audio

Chinese AI Multimodal AI Speech & Audio AI

8 min read

Updated Jul 16, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 16, 2026

Fact-checked

In review queue

Sources

8 citations

Revision

v2 · 1,633 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Qwen2-Audio is an audio-language model developed by the Qwen team at Alibaba Cloud, released in August 2024 ^[1]^[2]. It accepts audio inputs (human speech, natural sounds, music, and singing) alongside text, and produces text responses. The model supports two interaction modes: a voice chat mode in which a user speaks to the model with no text input, and an audio analysis mode in which the user supplies audio together with text instructions about what to do with it ^[1]. Qwen2-Audio is the successor to the original Qwen-Audio (November 2023) and sits in the lineage that later led to the fully multimodal Qwen2.5-Omni and Qwen3-Omni systems ^[3]^[4]. The open-weight checkpoints are released under the Apache 2.0 license ^[5].

The model combines an audio encoder initialised from OpenAI's Whisper-large-v3 with a Qwen-7B language-model backbone, for a total of about 8.2 billion parameters ^[2]. It was trained with a three-stage recipe of multi-task pretraining, supervised fine-tuning, and Direct Preference Optimization (DPO) ^[1]^[2]. On the audio-instruction benchmark AIR-Bench, the Qwen team reported that Qwen2-Audio outperformed prior state-of-the-art systems including Gemini-1.5-pro on audio-centric chat tasks ^[1].

Background: Qwen-Audio

The audio-model line began with Qwen-Audio, described in the paper "Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models" (arXiv:2311.07919), first submitted on 14 November 2023 ^[6]. Like its successor, Qwen-Audio pairs an audio encoder with the Qwen-7B large language model; the encoder was initialised from Whisper-large-v2 ^[3]. The open weights were published at the end of November 2023 ^[3].

The central design goal of Qwen-Audio was a single model able to handle many audio types and tasks at once. The authors scaled audio-language pretraining to cover more than 30 tasks across human speech, natural sounds, music, and songs ^[6]. Because the textual labels for different datasets vary widely in language, granularity, and structure, naively co-training all of them causes one-to-many interference, where the same audio could map to very different target text depending on the task. Qwen-Audio addressed this with a multi-task framework that conditions the decoder on a sequence of hierarchical tags: shared tags encourage knowledge transfer between related tasks, while task-specific tags isolate them and prevent interference ^[6]. The result was a model that reached strong results across a range of benchmarks without any task-specific fine-tuning. A second model, Qwen-Audio-Chat, was instruction-tuned on top of the base model to support multi-turn dialogue with mixed audio and text inputs ^[6].

Qwen-Audio reported competitive or state-of-the-art results on several public benchmarks at release, summarised below ^[3].

Task / dataset	Metric	Qwen-Audio
Aishell1 (dev / test)	WER (%)	1.2 / 1.3
LibriSpeech test-clean	WER (%)	2.0
LibriSpeech test-other	WER (%)	4.2
CochlScene (acoustic scene)	Accuracy	0.795
ClothoAQA (audio question answering)	Accuracy	0.579
VocalSound (vocal sound classification)	Accuracy	0.9289

Lower word error rate (WER) is better, while higher accuracy is better. The Aishell1, CochlScene, and VocalSound figures were reported as state of the art at the time ^[3].

Architecture

Qwen2-Audio keeps the two-part structure of its predecessor: an audio encoder feeds audio representations into an autoregressive language model that generates text. The audio encoder is initialised from Whisper-large-v3, and the language backbone is Qwen-7B, giving a combined parameter count of roughly 8.2 billion ^[2].

Audio is preprocessed by resampling to 16 kHz and converting the waveform into a 128-channel mel-spectrogram using a 25 ms window and a 10 ms hop ^[2]. A pooling layer with a stride of two then shortens the sequence so that each frame of the encoder output corresponds to about a 40 ms span of the original audio ^[2]. The encoded audio is interleaved with text tokens and passed to the Qwen-7B decoder, which produces the textual answer.

The most visible change from Qwen-Audio is the removal of the hierarchical-tag scheme. Instead of conditioning on structured task tags, Qwen2-Audio uses plain natural-language prompts to distinguish data sources and tasks during pretraining, which the authors describe as a simplification that also let them expand the training data substantially ^[1].

Interaction modes

Qwen2-Audio exposes two modes of use, and notably it does not rely on a system prompt to switch between them; the model infers the intended mode from the input itself ^[1].

In voice chat mode the user speaks to the model directly, with no separate automatic speech recognition front end and no text input. The model interprets the spoken request and replies in text ^[1]. In audio analysis mode the user provides an audio clip together with a text instruction, for example asking the model to transcribe, translate, classify, or describe the audio ^[1].

A capability the team highlighted is that the model can disentangle these uses within a single clip. Given an audio segment that contains background sounds, a multi-speaker conversation, and an embedded voice command at the same time, Qwen2-Audio can recognise the command and respond to it while also interpreting the rest of the audio ^[1].

Training

Qwen2-Audio is trained in three stages ^[1]^[2].

The first stage is multi-task pretraining that aligns the audio and language representations. This replaces the hierarchical-tag conditioning of Qwen-Audio with natural-language prompts and uses a larger and broader audio dataset spanning speech, sound, and music ^[1].

The second stage is supervised fine-tuning (SFT), which uses instruction-style data to improve the model's ability to follow user instructions across both interaction modes ^[1]^[2].

The third stage applies Direct Preference Optimization. DPO optimises the model against pairs of preferred and dispreferred responses, which the authors credit for improvements in factuality and in adherence to desired behaviour ^[1]. The model released for general use after these three stages is Qwen2-Audio-7B-Instruct.

Variants

Two checkpoints were published ^[5].

Model	Description
Qwen2-Audio-7B	Base pretrained audio-language model
Qwen2-Audio-7B-Instruct	Instruction-tuned model with voice chat and audio analysis modes, the recommended checkpoint for interaction

Both checkpoints are about 8.2 billion parameters and are distributed in BF16 precision ^[2]^[5].

Benchmarks

The headline evaluation in the technical report is AIR-Bench, a benchmark for audio-instruction-following whose chat split is scored by GPT-4 on a 0 to 10 scale across four audio categories ^[1]. Qwen2-Audio improved on both its predecessor and on contemporary proprietary models. The reported chat scores are shown below ^[4].

Model	Speech	Sound	Music	Mixed-Audio
Qwen2-Audio	7.18	6.99	6.79	6.77
Qwen-Audio	6.47	6.95	5.52	6.08
Gemini-1.5-pro	6.97	5.49	5.06	5.27

The technical report notes that the Gemini-1.5-pro evaluation covered a smaller sample than the others because some inputs were rejected by that model's safety filters ^[4].

On core speech tasks, Qwen2-Audio reported the following figures ^[4].

Task	Dataset	Result
Speech recognition	LibriSpeech test-clean / test-other	1.6 / 3.6 WER (%)
Speech recognition	Aishell2 (Mic / iOS / Android)	3.0 / 3.0 / 2.9 WER (%)
Speech recognition	Common Voice 15 (en / zh / yue / fr)	8.6 / 6.9 / 5.9 / 9.6 WER (%)
Speech recognition	Fleurs (zh)	7.5 WER (%)
Speech translation	CoVoST2 (en-de / de-en / en-zh / zh-en)	29.9 / 35.2 / 45.2 / 24.4 BLEU
Speech emotion recognition	MELD	0.553 (accuracy)
Vocal sound classification	VocalSound	0.9392 (accuracy)

For speech recognition, lower WER is better; for translation, higher BLEU is better; for classification, higher accuracy is better. The CoVoST2 results also include es-en (40.0), fr-en (38.5), and it-en (36.3) BLEU ^[4]. The instruction-tuned model's AIR-Bench chat scores by GPT-4 evaluation were reported in the GitHub repository as 7.24 (Speech), 6.83 (Sound), 6.73 (Music), and 6.42 (Mixed-Audio) ^[5].

Languages

The benchmark coverage spans several languages, including English, Mandarin Chinese, Cantonese, French, German, Spanish, and Italian, reflected in the multilingual ASR and speech-translation results above ^[5]. The underlying Whisper-large-v3 encoder is itself multilingual, and the speech-translation evaluation exercises translation between English and German, Chinese, Spanish, French, and Italian ^[4].

Licensing

The Qwen2-Audio weights are released under the Apache 2.0 license, as stated on the Hugging Face model cards for both the base and instruct checkpoints ^[5]. The repository notes that, unlike some earlier Qwen releases, no separate request is required for commercial use ^[5]. Code and inference examples are available in the QwenLM/Qwen2-Audio GitHub repository ^[5].

Successors

Qwen2-Audio was followed by a shift toward end-to-end omni-modal models rather than audio-only ones. Qwen2.5-Omni, released on 26 March 2025 (arXiv:2503.20215), perceives text, images, audio, and video and generates both text and natural speech in a streaming manner, using a "Thinker-Talker" architecture and a time-aligned multimodal position embedding called TMRoPE ^[7]. The Qwen2.5-Omni report states that it outperforms Qwen2-Audio on audio understanding ^[7]. The line continued with Qwen3-Omni, the Qwen team's later natively end-to-end omni-modal model that understands text, audio, images, and video and generates real-time speech ^[8]. The Qwen series as a whole is also marketed in China under the name Tongyi Qianwen.

References

Qwen2-Audio: arXiv abstract, "Qwen2-Audio Technical Report" (arXiv:2407.10759). https://arxiv.org/abs/2407.10759 ↩
"Qwen2-Audio: Large-Scale Audio-Language Model" (architecture summary, Whisper-large-v3 encoder, Qwen-7B backbone, 8.2B parameters, audio preprocessing). https://www.emergentmind.com/papers/2407.10759 ↩
QwenLM/Qwen-Audio GitHub repository (Whisper-large-v2 encoder, Qwen-7B, benchmark results, release November 2023). https://github.com/QwenLM/Qwen-Audio ↩
"Qwen2-Audio Technical Report" full text, HTML version (AIR-Bench scores, ASR/translation/SER/VSC results). https://arxiv.org/html/2407.10759v1 ↩
QwenLM/Qwen2-Audio GitHub repository and Hugging Face model card Qwen/Qwen2-Audio-7B-Instruct (variants, Apache 2.0 license, release 9 August 2024, AIR-Bench chat scores). https://github.com/QwenLM/Qwen2-Audio ↩
"Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models" (arXiv:2311.07919). https://arxiv.org/abs/2311.07919 ↩
"Qwen2.5-Omni Technical Report" (arXiv:2503.20215). https://arxiv.org/abs/2503.20215 ↩
QwenLM/Qwen3-Omni GitHub repository. https://github.com/QwenLM/Qwen3-Omni ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

InternVL