Audio Models

Updated Jun 23, 2026

See also: Audio and Models

Audio models are machine learning systems that take audio as input, produce audio as output, or both, spanning speech recognition, speech synthesis, music generation, sound effect generation, voice conversion, speaker analysis, and neural audio compression. The category covers speech (recognition, synthesis, voice activity detection, voice conversion), music (generation, source separation, transcription), and general audio (classification of sound events and scenes, captioning, enhancement, codecs). Audio is one of the three main modalities handled by modern foundation models, alongside text and vision, and audio models form the audio branch of generative AI. By 2024 and 2025 the leading systems had become fast enough for natural conversation: OpenAI reported that GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, "similar to human response time in a conversation."^[30]

The field began to consolidate around deep learning in the mid 2010s with WaveNet, the Tacotron family, and connectionist temporal classification based recognizers. Self supervised pretraining (Wav2Vec, HuBERT, WavLM) and weakly supervised large scale training (Whisper, trained on 680,000 hours of multilingual web audio) reshaped speech systems between 2019 and 2022.^[10] Around the same period, neural audio codecs (SoundStream, EnCodec, DAC, Mimi) made it practical to treat audio as a sequence of discrete tokens, which unlocked autoregressive audio language models such as AudioLM, VALL-E, MusicLM, MusicGen, AudioPaLM, and the full duplex Moshi. By 2024 and 2025, audio was a first class input and output in multimodal models such as GPT-4o, Gemini 1.5 and 2.0, Qwen2-Audio, Qwen2.5-Omni and Qwen3-Omni, and Phi-4-multimodal, while open audio-language models such as NVIDIA's Audio Flamingo series pushed audio understanding and reasoning.

What is an audio model?

An audio model is any model whose primary input or output is audio. In practice this includes three things:

Models that consume audio and produce something else, usually text (speech recognition, audio captioning, audio question answering, classification).
Models that consume something else (usually text) and produce audio (text to speech, text to music, text to sound effect).
Models that consume and produce audio (denoising, source separation, voice conversion, codecs, full duplex speech to speech systems).

Audio differs from text in three ways that drive architectural choices. It is a continuous one dimensional signal sampled tens of thousands of times per second, so raw waveforms are very long sequences. It carries information at multiple time scales at once, from sub millisecond phonetic detail to multi second prosody and minute long musical structure. And the same linguistic content can be realized by many physically different waveforms, because speakers, microphones, rooms, and background noise vary widely. Most audio models therefore operate on a more compact representation: a mel spectrogram, a learned self supervised feature, or a sequence of discrete tokens emitted by a neural audio codec.

This article is the canonical hub for audio models as a whole; the concept page audio redirects here. Detailed subcategories live in their own articles, which this hub cross-links rather than duplicates: automatic speech recognition models, text to speech models, voice activity detection models, audio classification models, audio to audio models, and music generation.

How are audio models categorized by task?

Task	Description	Representative models
Automatic speech recognition (ASR)	Convert speech to text.	Whisper, Wav2Vec 2.0, Conformer^[11], Universal Speech Model^[12], Canary, Parakeet
Text to speech (TTS)	Convert text to speech audio.	Tacotron 2, FastSpeech 2, VALL-E, NaturalSpeech 3, Voicebox, Bark, XTTS
Voice activity detection (VAD)	Decide which frames of audio contain speech.	WebRTC VAD, Silero VAD, pyannote VAD
Speaker tasks	Identify, verify, or diarize speakers.	ECAPA-TDNN, x-vector, pyannote, NeMo TitaNet
Speech enhancement and separation	Denoise, dereverberate, or split mixed speech.	RNNoise, DeepFilterNet, Conv-TasNet, SepFormer, Demucs voice
Voice conversion and cloning	Re-render speech in another voice while preserving content.	YourTTS, FreeVC, OpenVoice, RVC
Audio classification	Tag sound events, scenes, music genres.	YAMNet, PANNs, AST, BEATs, CLAP
Audio captioning	Generate a natural language description of an audio clip.	AAC baselines, Pengi, Qwen2-Audio
Music source separation	Split a mix into stems (vocals, drums, bass, other).	Demucs, Spleeter, HT Demucs, Open-Unmix
Music transcription	Convert audio to symbolic notation (MIDI or scores).	Onsets and Frames, MT3, Basic Pitch
Music generation	Generate music from text or other conditions.	MusicLM, MusicGen, Stable Audio, Suno, ElevenLabs Music, Udio, Riffusion
Sound effect generation	Generate non musical sounds from text.	AudioGen, AudioLDM, AudioLDM 2, Stable Audio Open
Speech to speech translation	Translate speech in one language to speech in another.	SeamlessM4T, AudioPaLM, Translatotron 2
Spoken language modeling	Model speech directly without going through text.	AudioLM, GSLM, TWIST, Moshi
Neural audio coding	Compress audio with a neural network for transmission or as tokens.	SoundStream, EnCodec, DAC, Mimi, Lyra

Many modern systems span more than one row. Whisper does both recognition and translation. SeamlessM4T does recognition, translation, text to speech, and speech to speech translation in one model. Qwen2-Audio handles classification, captioning, recognition, and audio reasoning from a single checkpoint.

How did audio models evolve?

Before 2012, speech and audio systems were dominated by Gaussian mixture model hidden Markov models for recognition and concatenative or hidden Markov model based synthesis for TTS. Beginning around 2009, hybrid deep neural network hidden Markov model systems from teams at the University of Toronto and Microsoft cut word error rates significantly, and by 2014 to 2015 they were standard in commercial ASR.

The first wave of end to end neural audio models arrived between 2014 and 2017. Connectionist temporal classification (Graves et al. 2006, applied at scale by Hannun et al. 2014 in Baidu's Deep Speech) made it possible to train a single neural network to map waveforms or features directly to characters.^[1]^[2] Listen, Attend and Spell (Chan et al. 2015) brought sequence to sequence attention to ASR. WaveNet (van den Oord et al., DeepMind, September 2016) was the first neural vocoder good enough to replace concatenative TTS back ends, using dilated causal convolutions over raw waveforms at 16 kHz.^[3] Tacotron (Wang et al., Google, 2017) and Tacotron 2 (Shen et al. 2018) produced mel spectrograms from text with an attention based sequence to sequence model, then handed those spectrograms to WaveNet or to faster vocoders such as WaveGlow and HiFi-GAN.^[4]^[5]

Self supervised learning landed in audio between 2018 and 2021. Wav2Vec (Schneider et al., Facebook AI, 2019) pretrained on raw audio using a contrastive predictive coding objective.^[6] Wav2Vec 2.0 (Baevski et al., 2020) added a transformer encoder and quantized targets and reached competitive ASR after fine tuning on as little as ten minutes of labeled speech.^[7] HuBERT (Hsu et al., 2021) replaced contrastive learning with a masked prediction objective over discrete cluster assignments, and WavLM (Chen et al., Microsoft, 2022) extended HuBERT with noisy and overlapping speech augmentation to handle speaker and paralinguistic tasks.^[8]^[9] These features became standard backbones for downstream speech systems.

Large scale weakly supervised training matured in 2022 with Whisper (Radford et al., OpenAI, September 2022), trained on 680,000 hours of multilingual web audio, of which 117,000 hours covered 96 non-English languages and 125,000 hours were translation data from other languages into English, leaving roughly 438,000 hours of English transcription.^[10] Whisper handled multilingual recognition, translation to English, and timestamp generation in one model, and it was open weight from day one. Several labs followed with similar scale models: Meta's MMS for 1,000 plus languages (May 2023), NVIDIA's Canary and Parakeet, and AssemblyAI's Universal-2.^[13]

Neural audio codecs and audio language models converged in 2022 and 2023. SoundStream (Zeghidour et al., Google, 2021) and EnCodec (Défossez et al., Meta, 2022) showed that residual vector quantized neural codecs could compress speech and music to a few kbps and emit a small set of discrete tokens per frame.^[14]^[15] AudioLM (Borsos et al., Google, September 2022) treated those tokens as a language to model, splitting them into semantic tokens (from a w2v-BERT model) and acoustic tokens (from SoundStream), then training transformers to generate continuations that sounded coherent in voice and content.^[17] VALL-E (Wang et al., Microsoft, January 2023) applied the same idea to text to speech, using a three second voice prompt plus a phoneme sequence to produce speech in any voice with an EnCodec back end.^[18] MusicLM (Agostinelli et al., Google, January 2023) extended AudioLM to text to music with MuLan as a joint text audio embedder.^[19] MusicGen (Copet et al., Meta, June 2023) simplified the recipe to a single stage transformer that generates four EnCodec codebooks at 50 Hz in one pass, using a delay pattern and 20,000 hours of licensed music.^[20] AudioPaLM (Rubenstein et al., Google, June 2023) merged a text PaLM-2 model with audio tokens to do speech to speech translation in one decoder.^[21]

Diffusion and flow matching gained ground in parallel. DiffWave (Kong et al., 2020) and WaveGrad (Chen et al., 2020) were early diffusion vocoders.^[44] AudioLDM (Liu et al., 2023) and AudioLDM 2 produced general audio from text via latent diffusion.^[23] Stable Audio (Stability AI, September 2023, version 2 April 2024) used a long context latent diffusion model to generate up to several minutes of stereo music at 44.1 kHz from text and timing prompts.^[24] Voicebox (Le et al., Meta, June 2023) trained a flow matching model over filterbank features for in context TTS, denoising, content editing, and zero shot speech generation.^[22]

Multilingual speech to speech translation was packaged for production with SeamlessM4T (Barrault et al., Meta, August 2023), which supported nearly one hundred input languages and around thirty five output languages for speech.^[25] SeamlessM4T v2 (November 2023) and Seamless Streaming (February 2024) added expressive output and low latency simultaneous translation.^[26]

In 2024 audio became fluently bidirectional and multimodal. Moshi (Défossez et al., Kyutai, September 2024) introduced an end to end full duplex speech to speech model that listens and speaks at the same time, using two parallel streams of Mimi codec tokens and an inner monologue of text tokens for grounding.^[27] GPT-4o (OpenAI, May 2024) added native audio input and output to the GPT-4 family, cutting average voice latency to 320 milliseconds from the 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) of the earlier three model voice pipeline.^[30] Gemini 1.5 (Google, February 2024) accepted hours of audio in context, and Gemini 2.0 and 2.5 extended that to live audio agents.^[31] Qwen2-Audio (Alibaba, August 2024) and Qwen2.5-Omni (March 2025) treated audio as an equal input to text,^[28]^[29] and Phi-4-multimodal (Microsoft, February 2025) added speech and vision to the Phi-4 base model.^[32] By 2025, almost every major frontier laboratory shipped audio capable models alongside their text-only releases.

In 2025 the open audio-language model line advanced quickly. NVIDIA released Audio Flamingo 2 (ICML 2025), a 3 billion parameter model that reached state of the art across more than twenty audio benchmarks and understood audio up to five minutes long, followed by Audio Flamingo 3 (July 2025, NeurIPS 2025 Spotlight), a fully open 7 billion parameter model built on a Whisper large-v3 based encoder and a Qwen2.5-7B decoder that handled speech, sound, and music, supported inputs up to ten minutes, and added chain of thought reasoning, multi-turn multi-audio chat, and voice to voice interaction.[^af] Alibaba released Qwen3-Omni (arXiv, September 22, 2025), a natively end to end omni-modal model with a Thinker-Talker mixture of experts architecture, support for 119 text languages, 19 speech input languages, and 10 speech output languages, and a theoretical first packet latency of about 234 ms for streaming speech.[^qwen3] On the generation side, expressive and low latency text to speech proliferated: OpenAI shipped gpt-4o-mini-tts (March 2025), and the open community released compact models such as Sesame's 1 billion parameter Conversational Speech Model (CSM) and the 82 million parameter Kokoro.[^tts2025] Commercial voice also scaled: ElevenLabs, founded in 2022, offered more than 1,000 synthetic voices across 32 languages and raised a 500 million dollar Series D in February 2026 at an 11 billion dollar valuation.[^eleven] Music generation also advanced with Suno v4.5 (May 2025), which extended single generations to about eight minutes, Suno v4.5+ (July 2025), and Udio v1.5, while the AI music lawsuits reshaped the commercial landscape (see Limitations).[^suno45]

When was each foundational model released?

Model	Year	Organization	Primary task
Deep Speech	2014	Baidu	Speech recognition
Listen, Attend and Spell	2015	Google	Speech recognition
WaveNet	2016	DeepMind	Neural vocoder, TTS back end
Tacotron	2017	Google	End to end TTS
Tacotron 2	2018	Google	TTS, mel spectrogram prediction
Wav2Vec	2019	Facebook AI	Self supervised speech pretraining
HiFi-GAN	2020	Kakao	Fast neural vocoder
Wav2Vec 2.0	2020	Facebook AI	Self supervised ASR pretraining
Conformer	2020	Google	ASR encoder architecture
DiffWave	2020	UC San Diego, NVIDIA, Baidu	Diffusion vocoder
HuBERT	2021	Meta AI	Self supervised speech representation
SoundStream	2021	Google	Neural audio codec
WavLM	2022	Microsoft	Self supervised speech, full stack
AudioLM	2022	Google	Audio language model
EnCodec	2022	Meta AI	Neural audio codec
Whisper	2022	OpenAI	Multilingual ASR and translation
VALL-E	2023	Microsoft	Zero shot TTS via codec language model
MusicLM	2023	Google	Text to music
AudioPaLM	2023	Google	Speech to speech translation, audio LM
MusicGen	2023	Meta AI	Text and melody to music
Voicebox	2023	Meta AI	Flow matching TTS and editing
MMS	2023	Meta AI	ASR and TTS for 1,000 plus languages
AudioLDM	2023	University of Surrey, others	Latent diffusion text to audio
Stable Audio	2023	Stability AI	Long form text to music
Bark	2023	Suno	Generative speech and audio model
DAC	2023	Descript	High fidelity neural codec
SeamlessM4T	2023	Meta AI	Multilingual speech translation
Mimi	2024	Kyutai	Streaming neural codec for Moshi
Moshi	2024	Kyutai	Full duplex speech to speech LM
GPT-4o audio	2024	OpenAI	Multimodal model with audio in and out
Qwen2-Audio	2024	Alibaba	Audio understanding multimodal LLM
Qwen2.5-Omni	2025	Alibaba	Omni modal LLM with audio output
Phi-4-multimodal	2025	Microsoft	Multimodal LLM with speech input
Audio Flamingo 2	2025	Nvidia	Audio-language model, long audio reasoning
Audio Flamingo 3	2025	Nvidia	Open audio-language model, speech, sound, music
Qwen3-Omni	2025	Alibaba	Omni modal LLM, Thinker-Talker MoE

This list is illustrative rather than exhaustive. Many strong models from CMU, Nvidia, Tencent, Naver, and the open community are not shown.

What representations do audio models use?

Audio models choose between four main input and output representations, often combining them.

Raw waveform

A waveform is a sequence of amplitude samples at a fixed sample rate (8 kHz telephony, 16 kHz speech, 22.05 kHz and 24 kHz neural TTS, 44.1 kHz CD and music, 48 kHz studio). A single ten second speech clip at 16 kHz is 160,000 samples, which is too long for naive attention but tractable for convolutional networks. WaveNet, SampleRNN, and the original Wav2Vec operate directly on waveforms.^[3]^[6] Operating on the waveform avoids the loss inherent in spectrogram inversion but raises sequence length and compute cost.
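The sequence length figures above follow from simple arithmetic. The sketch below reproduces the 160,000 sample count and shows why dense attention over raw samples is impractical; the function names and the dictionary of sample rates are illustrative, not taken from any library.

```python
# Sample rates named in the text, in Hz (labels are illustrative).
SAMPLE_RATES = {
    "telephony": 8_000,
    "speech": 16_000,
    "cd_music": 44_100,
    "studio": 48_000,
}


def num_samples(duration_s: float, sample_rate_hz: int) -> int:
    """Number of amplitude samples in a clip of the given duration."""
    return round(duration_s * sample_rate_hz)


def attention_pairs(seq_len: int) -> int:
    """Entries in one dense self-attention score matrix (one head, one layer)."""
    return seq_len * seq_len


if __name__ == "__main__":
    n = num_samples(10.0, SAMPLE_RATES["speech"])
    print(n)                   # 160000
    print(attention_pairs(n))  # 25600000000
```

A single attention matrix over ten seconds of 16 kHz samples would hold about 2.56 × 10^10 scores, which is why waveform models lean on convolutions or on downsampled representations.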

Spectrogram and mel spectrogram

A short time Fourier transform converts a waveform into a complex two dimensional time frequency representation. Taking the magnitude and applying a perceptually motivated mel filterbank, then a log, gives the log mel spectrogram, which is the standard input feature for most speech and audio classifiers and the standard intermediate output for two stage TTS. Mel frequency cepstral coefficients (MFCCs) compress the log mel spectrogram further with a discrete cosine transform; they remain common for low resource classifiers and forced aligners. Spectrograms are dense, real valued, and typically computed at roughly 100 frames per second, which fits well into transformer or convolutional models.

Many ASR systems consume mel features directly. Generative systems such as Voicebox and FastSpeech predict mel frames, then use either a vocoder (HiFi-GAN, BigVGAN, Vocos)^[43]^[42] or a diffusion based decoder to recover the waveform.
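The framing and mel spacing behind a log mel spectrogram can be sketched without any signal processing library. The example below assumes the HTK-style mel formula, a 25 ms window, and a 10 ms hop at 16 kHz; these are common choices rather than values stated in this article, and other toolkits use slightly different conventions.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale (an assumed convention; others exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int, f_min: float, f_max: float) -> list[float]:
    """Center frequencies (Hz) of n_mels triangular filters, evenly spaced in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def num_frames(n_samples: int, win: int, hop: int) -> int:
    """Frame count of a short time Fourier transform without padding."""
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


if __name__ == "__main__":
    # 10 s at 16 kHz, 25 ms window (400 samples), 10 ms hop (160 samples).
    print(num_frames(160_000, win=400, hop=160))  # 998
    centers = mel_center_frequencies(80, 0.0, 8_000.0)
    print(round(centers[0]), round(centers[-1]))
```

The filter centers sit close together at low frequencies and spread out at high frequencies, which is the perceptual motivation for the mel filterbank.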

Self supervised features

Wav2Vec 2.0, HuBERT, WavLM, w2v-BERT, BEST-RQ, and Whisper encoder features all produce learned vectors at hops of roughly 20 ms to 40 ms (50 or 25 frames per second). These features carry phonetic and acoustic information in a form well suited to downstream classifiers, recognizers, and aligners. They are widely used as the encoder for ASR, speaker tasks, and audio language models, including the semantic stream of AudioLM and MusicLM.

Audio tokens from neural codecs

Neural audio codecs compress audio to a small set of discrete tokens per frame using residual vector quantization. A typical setup emits between four and eight codebooks of ten bit tokens at 50 to 75 frames per second, giving a few kbps of bandwidth. The decoder reconstructs the waveform from the tokens. Major codecs include:

SoundStream (Google, 2021): the first end to end neural codec with residual vector quantization, originally targeted at 3 to 18 kbps.^[14]
EnCodec (Meta, 2022): an open source codec at 1.5, 3, 6, 12, and 24 kbps for 24 kHz speech and 48 kHz music, used by MusicGen, AudioGen, and many open audio LMs.^[15]
DAC (Descript, 2023): a higher fidelity universal codec at 8 kbps that handles speech, music, and general audio with one model.^[16]
Mimi (Kyutai, 2024): a low latency streaming codec at 1.1 kbps with a semantic token in the first codebook, designed for Moshi.

By turning audio into a token stream, codecs let audio models reuse the transformer language modeling stack. This is the basis for AudioLM, VALL-E, MusicLM, MusicGen, AudioPaLM, and Moshi.
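The bitrate arithmetic and the residual quantization idea can be illustrated in a few lines. This is a toy sketch: real codecs quantize learned latent vectors with trained codebooks, whereas the example below quantizes a single scalar with hand-picked codebooks, and all names are hypothetical.

```python
import math


def codec_bitrate_bps(n_codebooks: int, codebook_size: int, frame_rate_hz: float) -> float:
    """Bits per second = codebooks per frame * bits per token * frames per second."""
    bits_per_token = math.log2(codebook_size)
    return n_codebooks * bits_per_token * frame_rate_hz


def rvq_encode(x: float, codebooks: list[list[float]]) -> list[int]:
    """Toy residual quantizer: each stage quantizes what the previous stage left over."""
    residual = x
    indices = []
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: abs(residual - cb[i]))
        indices.append(idx)
        residual -= cb[idx]
    return indices


def rvq_decode(indices: list[int], codebooks: list[list[float]]) -> float:
    """Reconstruction is the sum of the selected entries from every stage."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))


if __name__ == "__main__":
    # Eight codebooks of ten bit tokens at 75 frames per second.
    print(codec_bitrate_bps(8, 1024, 75))  # 6000.0
    # Four codebooks of ten bit tokens at 50 frames per second.
    print(codec_bitrate_bps(4, 1024, 50))  # 2000.0

    toy_codebooks = [[-1.0, 0.0, 1.0], [-0.3, 0.0, 0.3], [-0.1, 0.0, 0.1]]
    codes = rvq_encode(0.72, toy_codebooks)
    print(codes)  # [2, 0, 1]
    print(rvq_decode(codes, toy_codebooks))
```

Each additional codebook refines the residual of the previous one, so dropping later codebooks lowers the bitrate at the cost of reconstruction accuracy.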

How are audio models built? Modeling paradigms

Autoregressive token models

After a codec discretizes audio, an autoregressive transformer can model the token sequence with the same next token loss used for language. AudioLM, VALL-E, MusicLM, MusicGen, AudioPaLM, and Moshi all follow this pattern. They differ in how they handle the multi codebook structure: VALL-E uses a separate autoregressive model for the first codebook and a non autoregressive model for the rest,^[18] MusicGen interleaves codebooks with a delay pattern,^[20] and Moshi uses a depth transformer to model codebooks at each step.^[27]
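A delay pattern of the kind MusicGen describes can be sketched as shifting codebook k by k steps, so that one transformer step emits one token per codebook while each finer codebook still conditions on the coarser tokens of its own frame. The sketch below assumes a one step delay per codebook and a hypothetical padding id; it is an illustration, not the reference implementation.

```python
PAD = -1  # hypothetical placeholder id for positions with no token yet


def apply_delay_pattern(codes: list[list[int]], pad: int = PAD) -> list[list[int]]:
    """Shift codebook k right by k steps. Input is [n_codebooks][n_frames]."""
    n_codebooks = len(codes)
    n_frames = len(codes[0])
    n_steps = n_frames + n_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        delayed.append(
            [row[s - k] if 0 <= s - k < n_frames else pad for s in range(n_steps)]
        )
    return delayed


def undo_delay_pattern(delayed: list[list[int]]) -> list[list[int]]:
    """Realign the codebooks so that column t again holds frame t."""
    n_codebooks = len(delayed)
    n_frames = len(delayed[0]) - n_codebooks + 1
    return [row[k:k + n_frames] for k, row in enumerate(delayed)]


if __name__ == "__main__":
    codes = [[10, 11, 12], [20, 21, 22]]
    delayed = apply_delay_pattern(codes)
    print(delayed)                      # [[10, 11, 12, -1], [-1, 20, 21, 22]]
    print(undo_delay_pattern(delayed))  # [[10, 11, 12], [20, 21, 22]]
```

With K codebooks and T frames the model runs T + K - 1 steps instead of the K × T steps that full flattening would need.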

Diffusion and flow matching

Diffusion models learn to reverse a noising process from Gaussian noise back to data. DiffWave and WaveGrad operate on waveforms;^[44] AudioLDM, AudioLDM 2, Stable Audio, and Tango operate on latent or spectrogram features.^[23]^[24] Flow matching is closely related and is the basis for Voicebox and several music models.^[22] These approaches tend to give high fidelity output and flexible inpainting and editing, at the cost of multi step sampling. Conditioning is usually through text encoders such as T5 or CLAP, sometimes with melody, lyrics, or timing conditioning on top.
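The noising process that a diffusion model learns to reverse can be written in closed form. The sketch below uses the standard denoising diffusion formulation, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, with a linear beta schedule. The schedule endpoints are assumptions borrowed from common image diffusion defaults, not values used by any specific audio system named here.

```python
import math
import random


def linear_betas(n_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> list[float]:
    """Per-step noise variances on a linear schedule (assumed endpoints)."""
    if n_steps == 1:
        return [beta_start]
    return [beta_start + (beta_end - beta_start) * t / (n_steps - 1) for t in range(n_steps)]


def cumulative_alphas(betas: list[float]) -> list[float]:
    """alpha_bar_t = product of (1 - beta_s) for s <= t: the signal fraction left at step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def add_noise(x0: list[float], alpha_bar_t: float, rng: random.Random):
    """Sample x_t directly from x_0. Returns (x_t, noise); the noise is the training target."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    return [a * x + s * e for x, e in zip(x0, noise)], noise


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_betas(1000))
    # A short sine burst stands in for a waveform or latent frame.
    x0 = [math.sin(2 * math.pi * 440 * n / 16_000) for n in range(8)]
    x_mid, _ = add_noise(x0, alpha_bars[500], random.Random(0))
    print(alpha_bars[0], alpha_bars[-1])  # near 1 at the start, near 0 at the end
```

A denoising network is trained to predict the returned noise from x_t and t; sampling then runs the steps in reverse, which is the source of the multi step cost mentioned above.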

Non autoregressive and masked models

FastSpeech, Glow-TTS, and Parakeet TDT use parallel decoding. SoundStorm (Google, 2023) used a MaskGIT style masked token model over SoundStream tokens to generate audio in parallel, cutting sampling cost dramatically compared to fully autoregressive baselines.^[45] NaturalSpeech 3 used factorized non autoregressive diffusion over disentangled content, prosody, and timbre tokens.

Hybrid architectures

Many production systems combine paradigms. A typical TTS stack is non autoregressive text to mel followed by a GAN or diffusion vocoder. A typical music generator is text to codec tokens via autoregressive transformer, then waveform decoding by the codec. Moshi has a text language model decoder with parallel audio token streams. AudioPaLM has a unified decoder over text and audio tokens.

Which multimodal models handle audio?

The broader category of multimodal models increasingly includes audio as a first class modality. Notable examples:

GPT-4o (OpenAI, May 2024) is a single model that handles text, image, and audio input and output.^[30] The audio mode supports interruption and emotional prosody and is the back end of the ChatGPT Advanced Voice Mode. OpenAI reported audio response times averaging 320 milliseconds, down from a multi second pipeline of three separate models.^[30]
Gemini 1.5, 2.0, 2.5 (Google) accept audio in long context (hours of input for 1.5 Pro).^[31] Gemini Live and the multimodal Live API target real time spoken interaction.
Qwen2-Audio (Alibaba, August 2024) is an open weight audio understanding model that performs ASR, audio captioning, sound classification, and spoken question answering from a single checkpoint.^[28] Qwen2.5-Omni (March 2025) added speech output via a Talker module,^[29] and Qwen3-Omni (September 2025) reorganized the family around a Thinker-Talker mixture of experts decoder with real time streaming speech in ten languages.[^qwen3]
Moshi (Kyutai, September 2024) is a fully open full duplex speech to speech model with a 7B Helium backbone, parallel input and output Mimi streams, and an inner monologue of text tokens.^[27]
Audio Flamingo 2 and 3 (Nvidia, 2025) are open audio-language models focused on understanding and reasoning over speech, environmental sound, and music rather than on speech synthesis. Audio Flamingo 3 is a fully open 7 billion parameter model with a Whisper large-v3 based encoder, support for audio up to ten minutes, on-demand chain of thought thinking, and voice to voice interaction.[^af] These models are typically evaluated on audio-language benchmarks such as those described under audio classification models.
Phi-4-multimodal (Microsoft, February 2025) is an open weight 5.6B parameter model with vision and speech adapters on the Phi-4 base.^[32]
NExT-GPT, AnyGPT, SALMONN, LTU, Pengi, BLSP are research multimodal audio LLMs that paired self supervised audio encoders with frozen or fine tuned text LLMs.
LLaMA 3 and other open base models have community speech extensions, including Llama-Omni, AudioGPT, and SpeechT5 adapters.

These systems blur the line between an audio model and a general purpose assistant. They retain audio specific subcomponents, in particular codecs, ASR front ends, and TTS modules, but the central reasoning happens in a transformer trained on a mix of modalities.

What libraries and frameworks support audio models?

Library	Maintainer	Focus
Hugging Face Transformers and Datasets	Hugging Face	Pretrained audio models, ASR, TTS, classification, audio LMs
Hugging Face Audio Course	Hugging Face	Open course covering speech and audio tasks^[40]
PyTorch torchaudio	PyTorch team	Audio I/O, transforms, basic models
SpeechBrain	Mila, Concordia University, and community	All purpose speech toolkit, recipes for ASR, TTS, speaker, enhancement^[38]
NVIDIA NeMo	Nvidia	Production speech and audio (Conformer, Canary, Parakeet, FastConformer)
ESPnet	Carnegie Mellon, JAIST, others	Research toolkit for ASR, TTS, speech translation, enhancement^[39]
Kaldi	Daniel Povey and community	Classical and hybrid ASR pipelines
K2 and Icefall	Next-gen Kaldi team	Modern lattice based ASR with PyTorch
WeNet	Mobvoi and Northwestern Polytechnical University	Streaming production ASR
fairseq	Meta AI	Self supervised audio research code
Coqui TTS	Coqui (now community fork)	Open TTS toolkit with XTTS, VITS, Tacotron 2
OpenVoice and Coqui XTTS	MyShell and community	Voice cloning
Bark	Suno	Generative speech and sound effect model^[41]
AudioCraft	Meta	MusicGen, AudioGen, EnCodec reference code
Stable Audio Tools	Stability AI	Reference inference and training for Stable Audio
Demucs	Meta AI	Music source separation
Spleeter	Deezer	Music source separation, legacy
pyannote.audio	LIMSI and Inria	Speaker diarization, VAD^[37]
librosa	Brian McFee and contributors	Audio feature extraction in Python
TorchCREPE	community	Monophonic pitch tracking
openSMILE	audEERING	Paralinguistic and emotion features
WhisperX, faster-whisper, whisper.cpp	community	Optimized Whisper inference

Most current research code ships on PyTorch with Hugging Face checkpoints. JAX is common at Google. Production inference often goes through ONNX, TensorRT, or framework specific runtimes such as faster-whisper.

How is progress on audio models measured?

Progress on audio models is tracked through a mix of task specific and general purpose benchmarks. The largest classification corpus, AudioSet, contains 2,084,320 human-labeled 10-second clips drawn from YouTube across 632 sound event classes.^[34]

Benchmark	Year	Task
LibriSpeech	2015	English read speech ASR^[33]
LibriLight	2019	Low resource and self supervised ASR
Common Voice	2017 onward	Crowdsourced multilingual ASR
CHiME	2011 onward	Noisy and far field ASR
TED-LIUM	2012 onward	Lecture style ASR
FLEURS	2022	Multilingual ASR for 102 languages^[36]
MLS	2020	Multilingual LibriSpeech for 8 languages
VoxPopuli	2021	Multilingual European Parliament speech
SUPERB	2021	Self supervised speech evaluation across many tasks^[35]
SLUE	2022	Spoken language understanding
AudioSet	2017	2,084,320 YouTube clips of 632 sound classes^[34]
ESC-50	2015	50 class environmental sounds
UrbanSound8K	2014	Urban sound classification
FSD50K	2020	Freesound based sound event tagging
DCASE challenges	2013 onward	Acoustic scene and event detection
MUSDB18	2017	Music source separation reference set
MAESTRO	2018	Piano transcription and synthesis
Slakh2100	2019	Multitrack synthetic music separation
LJSpeech	2017	English single speaker TTS
VCTK	2017	Multispeaker TTS in English
LibriTTS and LibriTTS-R	2019 and 2023	Large scale multispeaker TTS corpora
MOS, P.808, UTMOS, NISQA	various	Subjective and predicted speech quality scores
AIR-Bench	2024	Audio LLM benchmark for chat and reasoning
Dynamic-SUPERB	2024	Instruction following speech evaluation
MMAU	2024	Multitask audio understanding for LLMs
SALMon	2024	Spoken language model evaluation for sentiment, sarcasm, prosody
Voicebench and VoxBench	2024	Voice assistant style evaluation
MELD, IEMOCAP	2018, 2008	Speech emotion recognition

Leaderboards on Hugging Face track Open ASR (English), Multilingual ASR, TTS Arena, and audio LLM evaluations.^[40] Word error rate (WER) and character error rate (CER) remain the default metrics for recognition; Fréchet Audio Distance (FAD), KL divergence to a reference classifier, and CLAP score are common for music and sound generation; UTMOS and other predicted MOS scores are widely cited for TTS.
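Word error rate is the word level edit distance between hypothesis and reference, divided by the number of reference words; character error rate applies the same idea to characters. The sketch below is a minimal version that splits on whitespace and skips the text normalization (casing, punctuation, number formatting) that published leaderboards apply, so its numbers would not match theirs exactly.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word edits / reference word count. Can exceed 1.0."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character edits / reference character count."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(character_error_rate("kitten", "sitting"))  # 0.5
```

Because insertions count as errors, a hypothesis that hallucinates extra words can score above 100 percent WER.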

What are the limitations of audio models?

Despite rapid progress, audio models share several recurring weaknesses.

They degrade on out of distribution audio, particularly far field, noisy, accented, or code switched speech. Whisper performs much better on conversational web audio than on call center recordings, and almost all models lose accuracy below roughly 5 dB signal to noise ratio without targeted training. Long form audio handling is uneven; many systems do not gracefully handle silences, hesitations, or overlapping speakers, and segment based inference can hallucinate words at boundaries.
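Signal to noise ratio in decibels is ten times the base ten logarithm of the ratio of signal power to noise power. The sketch below shows how a test set at a target SNR, such as the 5 dB level mentioned above, could be built by scaling a noise clip before mixing; the helper names are hypothetical and the inputs are plain lists of samples.

```python
import math
import random


def power(x: list[float]) -> float:
    """Mean squared amplitude."""
    return sum(v * v for v in x) / len(x)


def snr_db(signal: list[float], noise: list[float]) -> float:
    """SNR in dB = 10 * log10(signal power / noise power)."""
    return 10.0 * math.log10(power(signal) / power(noise))


def mix_at_snr(speech: list[float], noise: list[float], target_snr_db: float):
    """Scale the noise so that speech over scaled noise hits the target SNR, then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    gain = math.sqrt(power(speech) / (noise_power * 10.0 ** (target_snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    return [s + n for s, n in zip(speech, scaled)], scaled


if __name__ == "__main__":
    rng = random.Random(0)
    # A sine tone stands in for speech; Gaussian samples stand in for noise.
    speech = [math.sin(2 * math.pi * 440 * n / 16_000) for n in range(1_600)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(1_600)]
    mixture, scaled = mix_at_snr(speech, noise, target_snr_db=5.0)
    print(round(snr_db(speech, scaled), 6))  # 5.0
```

At 5 dB the signal carries only about three times the power of the noise, which gives a sense of how harsh that operating point is.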

Generative speech and music models raise voice cloning, deepfake, and copyright concerns. A three second prompt is enough for several open TTS systems to copy a voice well enough to fool casual listeners, which has prompted research on audio watermarking (AudioSeal, WavMark) and on detection (ASVspoof challenges). Music generators have been the subject of training data disputes, beginning with the RIAA lawsuits filed against Suno and Udio in June 2024.[^riaa] Those cases moved toward commercial settlements in late 2025: Universal Music Group settled with Udio in October 2025 and announced plans for a jointly developed licensed AI music platform, and Warner Music Group settled with Suno in November 2025 with a licensing partnership and the sale of the Songkick concert discovery service to Suno.[^settle] Sony Music had not settled with either company as of early 2026, and its fair use cases were expected to produce influential rulings during 2026.[^sony]

Low resource languages and dialects remain underserved. MMS expanded ASR coverage to about 1,100 languages, but quality varies widely; for many languages the only available training data is religious recordings such as Bible translations, which biases vocabulary and domain.^[13] TTS coverage is even thinner.

Full duplex and real time evaluation is hard. Standard ASR and TTS benchmarks do not capture turn taking, interruption, latency, or persona consistency, all of which matter for voice assistants. Newer benchmarks (AIR-Bench, SALMon, Dynamic-SUPERB) try to address this but are not yet stable.

Finally, compute cost is high for high fidelity, long form generation. Stable Audio 2 generates a few minutes of 44.1 kHz stereo with hundreds of diffusion steps,^[24] and Moshi style full duplex systems need a carefully engineered streaming implementation to stay near real time on a single GPU.^[27] As with text LLMs, distillation, quantization, and on device runtimes (whisper.cpp, faster-whisper, MLC, Apple Speech) are active areas of work.

References

Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML, 2006.
Hannun, A., et al. "Deep Speech: Scaling up end-to-end speech recognition." arXiv:1412.5567, 2014. https://arxiv.org/abs/1412.5567
van den Oord, A., et al. "WaveNet: A Generative Model for Raw Audio." arXiv:1609.03499, 2016. https://arxiv.org/abs/1609.03499
Wang, Y., et al. "Tacotron: Towards End-to-End Speech Synthesis." Interspeech, 2017. https://arxiv.org/abs/1703.10135
Shen, J., et al. "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions." ICASSP, 2018. https://arxiv.org/abs/1712.05884
Schneider, S., et al. "wav2vec: Unsupervised Pre-training for Speech Recognition." Interspeech, 2019. https://arxiv.org/abs/1904.05862
Baevski, A., et al. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." NeurIPS, 2020. https://arxiv.org/abs/2006.11477
Hsu, W.-N., et al. "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units." IEEE TASLP, 2021. https://arxiv.org/abs/2106.07447
Chen, S., et al. "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing." IEEE JSTSP, 2022. https://arxiv.org/abs/2110.13900
Radford, A., et al. "Robust Speech Recognition via Large-Scale Weak Supervision." arXiv:2212.04356, 2022. (Whisper) https://arxiv.org/abs/2212.04356
Gulati, A., et al. "Conformer: Convolution-augmented Transformer for Speech Recognition." Interspeech, 2020. https://arxiv.org/abs/2005.08100
Zhang, Y., et al. "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages." arXiv:2303.01037, 2023. https://arxiv.org/abs/2303.01037
Pratap, V., et al. "Scaling Speech Technology to 1,000+ Languages." arXiv:2305.13516, 2023. (MMS) https://arxiv.org/abs/2305.13516
Zeghidour, N., et al. "SoundStream: An End-to-End Neural Audio Codec." IEEE TASLP, 2021. https://arxiv.org/abs/2107.03312
Défossez, A., et al. "High Fidelity Neural Audio Compression." arXiv:2210.13438, 2022. (EnCodec) https://arxiv.org/abs/2210.13438
Kumar, R., et al. "High-Fidelity Audio Compression with Improved RVQGAN." NeurIPS, 2023. (Descript Audio Codec) https://arxiv.org/abs/2306.06546
Borsos, Z., et al. "AudioLM: A Language Modeling Approach to Audio Generation." IEEE TASLP, 2023. https://arxiv.org/abs/2209.03143
Wang, C., et al. "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers." arXiv:2301.02111, 2023. (VALL-E) https://arxiv.org/abs/2301.02111
Agostinelli, A., et al. "MusicLM: Generating Music From Text." arXiv:2301.11325, 2023. https://arxiv.org/abs/2301.11325
Copet, J., et al. "Simple and Controllable Music Generation." NeurIPS, 2023. (MusicGen) https://arxiv.org/abs/2306.05284
Rubenstein, P. K., et al. "AudioPaLM: A Large Language Model That Can Speak and Listen." arXiv:2306.12925, 2023. https://arxiv.org/abs/2306.12925
Le, M., et al. "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale." NeurIPS, 2023. https://arxiv.org/abs/2306.15687
Liu, H., et al. "AudioLDM: Text-to-Audio Generation with Latent Diffusion Models." ICML, 2023. https://arxiv.org/abs/2301.12503
Evans, Z., et al. "Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion." ICML, 2024. https://arxiv.org/abs/2402.04825
Barrault, L., et al. "SeamlessM4T: Massively Multilingual & Multimodal Machine Translation." arXiv:2308.11596, 2023. https://arxiv.org/abs/2308.11596
Seamless Communication team, Meta. "Seamless: Multilingual Expressive and Streaming Speech Translation." arXiv:2312.05187, 2023. https://arxiv.org/abs/2312.05187
Défossez, A., et al. "Moshi: a speech-text foundation model for real-time dialogue." Kyutai technical report, 2024. https://arxiv.org/abs/2410.00037
Chu, Y., et al. "Qwen2-Audio Technical Report." arXiv:2407.10759, 2024. https://arxiv.org/abs/2407.10759
Xu, J., et al. "Qwen2.5-Omni Technical Report." arXiv:2503.20215, 2025. https://arxiv.org/abs/2503.20215
OpenAI. "Hello GPT-4o." OpenAI blog, May 13, 2024. https://openai.com/index/hello-gpt-4o/
Google DeepMind. "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." Technical report, 2024. https://arxiv.org/abs/2403.05530
Microsoft. "Phi-4-multimodal and Phi-4-mini." Microsoft blog, February 2025. https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/
Panayotov, V., et al. "LibriSpeech: An ASR corpus based on public domain audio books." ICASSP, 2015. https://www.openslr.org/12
Gemmeke, J., et al. "Audio Set: An ontology and human-labeled dataset for audio events." ICASSP, 2017. https://research.google.com/audioset/
Yang, S.-w., et al. "SUPERB: Speech Processing Universal PERformance Benchmark." Interspeech, 2021. https://arxiv.org/abs/2105.01051
Conneau, A., et al. "FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech." SLT, 2022. https://arxiv.org/abs/2205.12446
Bredin, H., et al. "pyannote.audio: neural building blocks for speaker diarization." ICASSP, 2020. https://github.com/pyannote/pyannote-audio
Ravanelli, M., et al. "SpeechBrain: A General-Purpose Speech Toolkit." arXiv:2106.04624, 2021. https://arxiv.org/abs/2106.04624
Watanabe, S., et al. "ESPnet: End-to-End Speech Processing Toolkit." Interspeech, 2018. https://arxiv.org/abs/1804.00015
Hugging Face. "Open ASR Leaderboard," "TTS Arena," and the Audio Course. https://huggingface.co/learn/audio-course
Suno. "Bark: text-prompted generative audio model." GitHub repository, 2023. https://github.com/suno-ai/bark
Lee, S.-g., et al. "BigVGAN: A Universal Neural Vocoder with Large-Scale Training." ICLR, 2023. https://arxiv.org/abs/2206.04658
Kong, J., et al. "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis." NeurIPS, 2020. https://arxiv.org/abs/2010.05646
Kong, Z., et al. "DiffWave: A Versatile Diffusion Model for Audio Synthesis." ICLR, 2021. https://arxiv.org/abs/2009.09761
Borsos, Z., et al. "SoundStorm: Efficient Parallel Audio Generation." arXiv:2305.09636, 2023. https://arxiv.org/abs/2305.09636
