Audio Models
Last reviewed
May 13, 2026
Sources
45 citations
Review status
Source-backed
Revision
v2 · 4,949 words
Audio models are machine learning systems that take audio as input, produce audio as output, or both. The category covers speech (recognition, synthesis, voice activity detection, voice conversion), music (generation, source separation, transcription), and general audio (classification of sound events and scenes, captioning, enhancement, codecs). It is one of the three main modalities handled by modern foundation models, alongside text and vision.
The field began to consolidate around deep learning in the mid 2010s with WaveNet, the Tacotron family, and connectionist temporal classification based recognizers. Self supervised pretraining (Wav2Vec, HuBERT, WavLM) and weakly supervised large scale training (Whisper) reshaped speech systems between 2019 and 2022. Around the same period, neural audio codecs (SoundStream, EnCodec, DAC, Mimi) made it practical to treat audio as a sequence of discrete tokens, which unlocked autoregressive audio language models such as AudioLM, VALL-E, MusicLM, MusicGen, AudioPaLM, and the full duplex Moshi. By 2024 and 2025, audio was a first class input and output in multimodal models such as GPT-4o, Gemini 1.5 and 2.0, Qwen2-Audio and Qwen2.5-Omni, and Phi-4-multimodal.
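An audio model is any model whose primary input or output is audio. In practice this includes three kinds of system: models that take audio as input (recognition, classification, captioning), models that produce audio as output (speech synthesis, music and sound generation), and models that do both (enhancement, voice conversion, speech to speech translation).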
Audio differs from text in three ways that drive architectural choices. It is a continuous one dimensional signal sampled tens of thousands of times per second, so raw waveforms are very long sequences. It carries information at multiple time scales at once, from sub millisecond phonetic detail to multi second prosody and minute long musical structure. And the same linguistic content can be realized by many physically different waveforms, because speakers, microphones, rooms, and background noise vary widely. Most audio models therefore operate on a more compact representation: a mel spectrogram, a learned self supervised feature, or a sequence of discrete tokens emitted by a neural audio codec.
This article covers audio models as a whole. Detailed subcategories live in their own articles: automatic speech recognition models, text to speech models, voice activity detection models, audio classification models, audio to audio models, and music generation models.
| Task | Description | Representative models |
|---|---|---|
| Automatic speech recognition (ASR) | Convert speech to text. | Whisper, Wav2Vec 2.0, Conformer, Universal Speech Model, Canary, Parakeet |
| Text to speech (TTS) | Convert text to speech audio. | Tacotron 2, FastSpeech 2, VALL-E, NaturalSpeech 3, Voicebox, Bark, XTTS |
| Voice activity detection (VAD) | Decide which frames of audio contain speech. | WebRTC VAD, Silero VAD, pyannote VAD |
| Speaker tasks | Identify, verify, or diarize speakers. | ECAPA-TDNN, x-vector, pyannote, NeMo TitaNet |
| Speech enhancement and separation | Denoise, dereverberate, or split mixed speech. | RNNoise, DeepFilterNet, Conv-TasNet, SepFormer, Demucs voice |
| Voice conversion and cloning | Re-render speech in another voice while preserving content. | YourTTS, FreeVC, OpenVoice, RVC |
| Audio classification | Tag sound events, scenes, music genres. | YAMNet, PANNs, AST, BEATs, CLAP |
| Audio captioning | Generate a natural language description of an audio clip. | AAC baselines, Pengi, Qwen2-Audio |
| Music source separation | Split a mix into stems (vocals, drums, bass, other). | Demucs, Spleeter, HT Demucs, Open-Unmix |
| Music transcription | Convert audio to symbolic notation (MIDI or scores). | Onsets and Frames, MT3, Basic Pitch |
| Music generation | Generate music from text or other conditions. | MusicLM, MusicGen, Stable Audio, Suno, ElevenLabs Music, Udio, Riffusion |
| Sound effect generation | Generate non musical sounds from text. | AudioGen, AudioLDM, AudioLDM 2, Stable Audio Open |
| Speech to speech translation | Translate speech in one language to speech in another. | SeamlessM4T, AudioPaLM, Translatotron 2 |
| Spoken language modeling | Model speech directly without going through text. | AudioLM, GSLM, TWIST, Moshi |
| Neural audio coding | Compress audio with a neural network for transmission or as tokens. | SoundStream, EnCodec, DAC, Mimi, Lyra |
Many modern systems span more than one row. Whisper does both recognition and translation. SeamlessM4T does recognition, translation, text to speech, and speech to speech translation in one model. Qwen2-Audio handles classification, captioning, recognition, and audio reasoning from a single checkpoint.
Before 2012, speech and audio systems were dominated by Gaussian mixture model hidden Markov models for recognition and concatenative or hidden Markov model based synthesis for TTS. Beginning around 2009, hybrid deep neural network hidden Markov model systems from teams at the University of Toronto and Microsoft cut word error rates significantly, and by 2014 to 2015 they were standard in commercial ASR.
The first wave of end to end neural audio models arrived between 2014 and 2017. Connectionist temporal classification (Graves et al. 2006, applied at scale by Hannun et al. 2014 in Baidu's Deep Speech) made it possible to train a single neural network to map waveforms or features directly to characters. Listen, Attend and Spell (Chan et al. 2015) brought sequence to sequence attention to ASR. WaveNet (van den Oord et al., DeepMind, September 2016) was the first neural vocoder good enough to replace concatenative TTS back ends, using dilated causal convolutions over raw waveforms at 16 kHz. Tacotron (Wang et al., Google, 2017) and Tacotron 2 (Shen et al. 2018) produced mel spectrograms from text with an attention based sequence to sequence model, then handed those spectrograms to WaveNet or to faster vocoders such as WaveGlow and HiFi-GAN.
Self supervised learning landed in audio between 2018 and 2021. Wav2Vec (Schneider et al., Facebook AI, 2019) pretrained on raw audio using a contrastive predictive coding objective. Wav2Vec 2.0 (Baevski et al., 2020) added a transformer encoder and quantized targets and reached competitive ASR after fine tuning on as little as ten minutes of labeled speech. HuBERT (Hsu et al., 2021) replaced contrastive learning with a masked prediction objective over discrete cluster assignments, and WavLM (Chen et al., Microsoft, 2022) extended HuBERT with noisy and overlapping speech augmentation to handle speaker and paralinguistic tasks. These features became standard backbones for downstream speech systems.
Large scale weakly supervised training matured in 2022 with Whisper (Radford et al., OpenAI, September 2022), trained on 680,000 hours of multilingual web audio. Whisper handled multilingual recognition, translation to English, and timestamp generation in one model, and it was open weight from day one. Several labs followed with similar scale models: Meta's MMS for 1,000 plus languages (May 2023), NVIDIA's Canary and Parakeet, and AssemblyAI's Universal-2.
Neural audio codecs and audio language models converged in 2022 and 2023. SoundStream (Zeghidour et al., Google, 2021) and EnCodec (Défossez et al., Meta, 2022) showed that residual vector quantized neural codecs could compress speech and music to a few kbps and emit a small set of discrete tokens per frame. AudioLM (Borsos et al., Google, September 2022) treated those tokens as a language to model, splitting them into semantic tokens (from a w2v-BERT model) and acoustic tokens (from SoundStream), then training transformers to generate continuations that sounded coherent in voice and content. VALL-E (Wang et al., Microsoft, January 2023) applied the same idea to text to speech, using a three second voice prompt plus a phoneme sequence to produce speech in any voice with an EnCodec back end. MusicLM (Agostinelli et al., Google, January 2023) extended AudioLM to text to music with MuLan as a joint text audio embedder. MusicGen (Copet et al., Meta, June 2023) simplified the recipe to a single stage transformer over delayed EnCodec tokens. AudioPaLM (Rubenstein et al., Google, June 2023) merged a text PaLM-2 model with audio tokens to do speech to speech translation in one decoder.
Diffusion and flow matching gained ground in parallel. DiffWave (Kong et al., 2020) and WaveGrad (Chen et al., 2020) were early diffusion vocoders. AudioLDM (Liu et al., 2023) and AudioLDM 2 produced general audio from text via latent diffusion. Stable Audio (Stability AI, September 2023, version 2 April 2024) used a long context latent diffusion model to generate up to several minutes of stereo music at 44.1 kHz from text and timing prompts. Voicebox (Le et al., Meta, June 2023) trained a flow matching model over filterbank features for in context TTS, denoising, content editing, and zero shot speech generation.
Multilingual speech to speech translation was packaged for production with SeamlessM4T (Barrault et al., Meta, August 2023), which supported nearly one hundred input languages and around thirty five output languages for speech. SeamlessM4T v2 and Seamless Streaming, both released in late 2023, added expressive output and low latency simultaneous translation.
In 2024 audio became fluently bidirectional and multimodal. Moshi (Défossez et al., Kyutai, September 2024) introduced an end to end full duplex speech to speech model that listens and speaks at the same time, using two parallel streams of Mimi codec tokens and an inner monologue of text tokens for grounding. GPT-4o (OpenAI, May 2024) added native audio input and output to the GPT-4 family. Gemini 1.5 (Google, February 2024) accepted hours of audio in context, and Gemini 2.0 and 2.5 extended that to live audio agents. Qwen2-Audio (Alibaba, August 2024) and Qwen2.5-Omni (March 2025) treated audio as an equal input to text, and Phi-4-multimodal (Microsoft, February 2025) added speech and vision to the Phi-4 base model. By 2025, almost every major frontier laboratory shipped audio capable models alongside their text-only releases.
| Model | Year | Organization | Primary task |
|---|---|---|---|
| Deep Speech | 2014 | Baidu | Speech recognition |
| Listen, Attend and Spell | 2015 | Google | Speech recognition |
| WaveNet | 2016 | DeepMind | Neural vocoder, TTS back end |
| Tacotron | 2017 | Google | End to end TTS |
| Tacotron 2 | 2018 | Google | TTS, mel spectrogram prediction |
| HiFi-GAN | 2020 | Kakao | Fast neural vocoder |
| Wav2Vec | 2019 | Facebook AI | Self supervised speech pretraining |
| Wav2Vec 2.0 | 2020 | Facebook AI | Self supervised ASR pretraining |
| Conformer | 2020 | Google | ASR encoder architecture |
| DiffWave | 2020 | UC San Diego, Nvidia, Baidu | Diffusion vocoder |
| HuBERT | 2021 | Meta AI | Self supervised speech representation |
| SoundStream | 2021 | Google | Neural audio codec |
| WavLM | 2022 | Microsoft | Self supervised speech, full stack |
| AudioLM | 2022 | Google | Audio language model |
| EnCodec | 2022 | Meta AI | Neural audio codec |
| Whisper | 2022 | OpenAI | Multilingual ASR and translation |
| VALL-E | 2023 | Microsoft | Zero shot TTS via codec language model |
| MusicLM | 2023 | Google | Text to music |
| AudioPaLM | 2023 | Google | Speech to speech translation, audio LM |
| MusicGen | 2023 | Meta AI | Text and melody to music |
| Voicebox | 2023 | Meta AI | Flow matching TTS and editing |
| MMS | 2023 | Meta AI | ASR and TTS for 1,000 plus languages |
| AudioLDM | 2023 | University of Surrey, others | Latent diffusion text to audio |
| Stable Audio | 2023 | Stability AI | Long form text to music |
| Bark | 2023 | Suno | Generative speech and audio model |
| DAC | 2023 | Descript | High fidelity neural codec |
| SeamlessM4T | 2023 | Meta AI | Multilingual speech translation |
| Mimi | 2024 | Kyutai | Streaming neural codec for Moshi |
| Moshi | 2024 | Kyutai | Full duplex speech to speech LM |
| GPT-4o audio | 2024 | OpenAI | Multimodal model with audio in and out |
| Qwen2-Audio | 2024 | Alibaba | Audio understanding multimodal LLM |
| Qwen2.5-Omni | 2025 | Alibaba | Omni modal LLM with audio output |
| Phi-4-multimodal | 2025 | Microsoft | Multimodal LLM with speech input |
This list is illustrative rather than exhaustive. Many strong models from CMU, Nvidia, Tencent, Naver, and the open community are not shown.
Audio models choose between four main input and output representations, often combining them.
A waveform is a sequence of amplitude samples at a fixed sample rate (8 kHz telephony, 16 kHz speech, 22.05 kHz and 24 kHz neural TTS, 44.1 kHz CD and music, 48 kHz studio). A single ten second speech clip at 16 kHz is 160,000 samples, which is too long for naive attention but tractable for convolutional networks. WaveNet, SampleRNN, and the original Wav2Vec operate directly on waveforms. Operating on the waveform avoids the loss inherent in spectrogram inversion but raises sequence length and compute cost.
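The sequence length arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the attention count is the size of one full self attention score matrix, not a measurement of any particular model.

```python
# Sample rates named in the text, in Hz.
SAMPLE_RATES_HZ = {
    "telephony": 8_000,
    "speech": 16_000,
    "neural TTS": 24_000,
    "CD and music": 44_100,
    "studio": 48_000,
}


def num_samples(duration_s: float, sample_rate_hz: int) -> int:
    """Number of amplitude samples in a mono clip of the given duration."""
    return round(duration_s * sample_rate_hz)


def attention_pairs(seq_len: int) -> int:
    """Entries in one full self attention score matrix over seq_len positions."""
    return seq_len * seq_len


if __name__ == "__main__":
    for name, rate in SAMPLE_RATES_HZ.items():
        n = num_samples(10.0, rate)
        print(f"{name:>12}: {n:>7} samples, {attention_pairs(n):.2e} attention pairs")
```

For the ten second, 16 kHz clip in the text this gives 160,000 samples and 2.56e10 pairwise attention scores per head and layer, which is why raw waveform models lean on convolutions or downsampling.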
A short time Fourier transform converts a waveform into a complex two dimensional time frequency representation. Taking the magnitude and applying a perceptually motivated mel filterbank, then a log, gives the log mel spectrogram, which is the standard input feature for most speech and audio classifiers, and the standard intermediate output for two stage TTS. Mel frequency cepstral coefficients (MFCCs) compress the log mel spectrogram further with a discrete cosine transform; they remain common for low resource classifiers and forced aligners. Spectrograms are dense, real valued, and typically about 100 frames per second (a 10 ms hop), which fits well into transformer or convolutional models.
Voicebox, FastSpeech, and many other TTS systems generate speech in mel space, then use either a vocoder (HiFi-GAN, BigVGAN, Vocos) or a diffusion based decoder to recover the waveform. Many ASR systems also take mel features as input but never need to invert them.
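The log mel pipeline described above (windowed frame, magnitude spectrum, mel filterbank, log) can be sketched in plain Python. This is an illustrative toy, not a reference implementation: it uses a naive DFT instead of an FFT, the HTK style mel formula (other variants exist), and small assumed defaults for `n_fft`, `hop`, and `n_mels`.

```python
import cmath
import math


def hz_to_mel(f_hz: float) -> float:
    """HTK style mel scale; other conventions differ slightly."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels: int, n_fft: int, sample_rate: int) -> list:
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge points: left edge, the centers, right edge.
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def magnitude_spectrum(frame: list) -> list:
    """Naive O(N^2) DFT magnitude of a Hann windowed frame."""
    n = len(frame)
    windowed = [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def log_mel_spectrogram(wave, sample_rate, n_fft=64, hop=32, n_mels=8, eps=1e-10):
    """Return one list of n_mels log energies per frame."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        mag = magnitude_spectrum(wave[start:start + n_fft])
        frames.append([math.log(sum(w * a for w, a in zip(row, mag)) + eps)
                       for row in bank])
    return frames
```

Feeding it a 1 kHz sine sampled at 8 kHz concentrates the energy in the mel band whose triangle covers 1 kHz. Production code would normally use an FFT based library routine and often applies the filterbank to the power spectrum rather than the magnitude.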
Wav2Vec 2.0, HuBERT, WavLM, w2v-BERT, BEST-RQ, and Whisper encoder features all produce learned vectors at hops of roughly 20 ms to 40 ms (about 50 to 25 frames per second). These features carry phonetic and acoustic information in a form well suited to downstream classifiers, recognizers, and aligners. They are widely used as the encoder for ASR, speaker tasks, and audio language models, including the semantic stream of AudioLM and MusicLM.
Neural audio codecs compress audio to a small set of discrete tokens per frame using residual vector quantization. A typical setup emits between four and eight codebooks of ten bit tokens at 50 to 75 frames per second, giving a few kbps of bandwidth; for example, eight codebooks at 75 frames per second come to 8 × 10 × 75 = 6,000 bits per second. The decoder reconstructs the waveform from the tokens. Major codecs include SoundStream, EnCodec, DAC, and Mimi.
By turning audio into a token stream, codecs let audio models reuse the transformer language modeling stack. This is the basis for AudioLM, VALL-E, MusicLM, MusicGen, AudioPaLM, and Moshi.
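Residual vector quantization and the resulting token bitrate can be illustrated with a small sketch. The codebooks here are hand written two dimensional toys, an assumption made for readability; real codecs learn much larger codebooks over encoder latents, and none of the names below come from a codec library.

```python
import math


def bitrate_bps(n_codebooks: int, codebook_size: int, frames_per_s: float) -> float:
    """Bits per second: codebooks x bits per token x frame rate."""
    return n_codebooks * math.log2(codebook_size) * frames_per_s


def nearest(codebook, vec):
    """Index of the entry closest to vec in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))


def rvq_encode(codebooks, vec):
    """Each stage quantizes the residual left over by the previous stages."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(codebooks, tokens):
    """Sum the selected entry of every stage to approximate the input."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (illustrative only): a coarse stage, then finer corrections.
CODEBOOKS = [
    [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]],
    [[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]],
    [[-0.25, -0.25], [-0.25, 0.25], [0.25, -0.25], [0.25, 0.25]],
]

if __name__ == "__main__":
    tokens = rvq_encode(CODEBOOKS, [0.8, -0.3])
    print(tokens)                          # [2, 1, 3]
    print(rvq_decode(CODEBOOKS, tokens))   # [0.75, -0.25]
    print(bitrate_bps(8, 1024, 75))        # 6000.0
```

Each extra codebook refines the reconstruction and adds tokens per frame, which is the trade-off between fidelity and sequence length that audio language models have to manage.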
After a codec discretizes audio, an autoregressive transformer can model the token sequence with the same next token loss used for language. AudioLM, VALL-E, MusicLM, MusicGen, AudioPaLM, and Moshi all follow this pattern. They differ in how they handle the multi codebook structure: VALL-E uses a separate autoregressive model for the first codebook and a non autoregressive model for the rest, MusicGen interleaves codebooks with a delay pattern, and Moshi uses a depth transformer to model codebooks at each step.
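A delay pattern of the kind attributed to MusicGen above can be sketched as a simple shift of each codebook stream. This is a simplified illustration under stated assumptions: codebook k is delayed by k steps, and `PAD` is a hypothetical placeholder value rather than a token id from any real model.

```python
PAD = -1  # hypothetical placeholder for positions that hold no token


def apply_delay_pattern(tokens, pad=PAD):
    """tokens[k][t] is codebook k at frame t.

    Returns steps[s][k], where codebook k is shifted right by k steps, so one
    transformer step predicts frame s of codebook 0, frame s - 1 of codebook 1,
    and so on.
    """
    n_codebooks = len(tokens)
    n_frames = len(tokens[0])
    steps = []
    for s in range(n_frames + n_codebooks - 1):
        steps.append([tokens[k][s - k] if 0 <= s - k < n_frames else pad
                      for k in range(n_codebooks)])
    return steps


def undo_delay_pattern(steps, n_frames):
    """Invert apply_delay_pattern and recover tokens[k][t]."""
    n_codebooks = len(steps[0])
    return [[steps[t + k][k] for t in range(n_frames)]
            for k in range(n_codebooks)]


if __name__ == "__main__":
    grid = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
    for step in apply_delay_pattern(grid):
        print(step)
    # [10, -1, -1]
    # [11, 20, -1]
    # [12, 21, 30]
    # [-1, 22, 31]
    # [-1, -1, 32]
```

With K codebooks and T frames the delayed sequence has T + K - 1 steps, instead of the K × T steps a fully flattened sequence would need.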
Diffusion models learn to reverse a noising process from Gaussian noise back to data. DiffWave and WaveGrad operate on waveforms; AudioLDM, AudioLDM 2, Stable Audio, and Tango operate on latent or spectrogram features. Flow matching is closely related and is the basis for Voicebox and several music models. These approaches tend to give high fidelity output and flexible inpainting and editing, at the cost of multi step sampling. Conditioning is usually through text encoders such as T5 or CLAP, sometimes with melody, lyrics, or timing conditioning on top.
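The noising process that a diffusion model learns to reverse can be written in a few lines. This is a minimal sketch of the standard DDPM style forward process, assuming a linear beta schedule with commonly used default endpoints; the audio models named above each use their own schedules and feature spaces.

```python
import math
import random


def linear_beta_schedule(n_steps, beta_start=1e-4, beta_end=0.02):
    """Per step noise variances, rising linearly (assumed default endpoints)."""
    step = (beta_end - beta_start) / (n_steps - 1)
    return [beta_start + i * step for i in range(n_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, alpha_bar_t, rng):
    """Sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
           for x, e in zip(x0, noise)]
    return x_t, noise


if __name__ == "__main__":
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    clean = [0.5, -0.25, 0.0, 0.1]          # stand-in for a waveform or latent
    noisy, eps = add_noise(clean, alpha_bars[500], random.Random(0))
    print(alpha_bars[0], alpha_bars[-1])    # close to 1, then close to 0
    print(noisy)
```

A denoising network is typically trained to predict `eps` from `noisy`, the step index, and the conditioning. Sampling then runs the learned reverse process step by step, which is where the multi step cost comes from.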
FastSpeech, Glow-TTS, and CTC based recognizers such as Parakeet CTC use parallel decoding. SoundStorm (Google, 2023) used a MaskGIT style masked token model over SoundStream tokens to generate audio in parallel; its authors reported generation about two orders of magnitude faster than AudioLM's autoregressive acoustic stage. NaturalSpeech 3 used factorized non autoregressive diffusion over disentangled content, prosody, and timbre tokens.
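MaskGIT style parallel decoding can be sketched as an iterative unmasking loop. This is a simplified single level illustration, assuming a cosine schedule and a hypothetical `predict` callback that returns a token and a confidence for every position; it is not the exact SoundStorm procedure, which works across residual codebook levels.

```python
import math

MASK = None  # marker for positions that have not been decided yet


def cosine_mask_schedule(n_tokens: int, n_iters: int) -> list:
    """Number of tokens still masked after each iteration (cosine curve)."""
    return [math.floor(math.cos(0.5 * math.pi * it / n_iters) * n_tokens)
            for it in range(1, n_iters + 1)]


def parallel_decode(n_tokens, n_iters, predict):
    """Fill all positions in n_iters model calls.

    predict(tokens) is a hypothetical model interface returning one
    (token, confidence) pair per position.
    """
    tokens = [MASK] * n_tokens
    for remaining in cosine_mask_schedule(n_tokens, n_iters):
        proposals = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Commit the most confident proposals; keep `remaining` positions masked.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:len(masked) - remaining]:
            tokens[i] = proposals[i][0]
    return tokens


if __name__ == "__main__":
    print(cosine_mask_schedule(100, 8))  # [98, 92, 83, 70, 55, 38, 19, 0]
```

The schedule commits few tokens early, when the model has little context, and many late. One hundred tokens are produced in eight model calls here, compared with one hundred calls for token by token autoregressive decoding.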
Many production systems combine paradigms. A typical TTS stack is non autoregressive text to mel followed by a GAN or diffusion vocoder. A typical music generator is text to codec tokens via autoregressive transformer, then waveform decoding by the codec. Moshi has a text language model decoder with parallel audio token streams. AudioPaLM has a unified decoder over text and audio tokens.
The broader category of multimodal models increasingly includes audio as a first class modality. Notable examples include GPT-4o, Gemini 1.5 and 2.0, Qwen2-Audio, Qwen2.5-Omni, and Phi-4-multimodal.
These systems blur the line between an audio model and a general purpose assistant. They retain audio specific subcomponents, in particular codecs, ASR front ends, and TTS modules, but the central reasoning happens in a transformer trained on a mix of modalities.
| Library | Maintainer | Focus |
|---|---|---|
| Hugging Face Transformers and Datasets | Hugging Face | Pretrained audio models, ASR, TTS, classification, audio LMs |
| Hugging Face Audio Course | Hugging Face | Open course covering speech and audio tasks |
| PyTorch torchaudio | PyTorch team | Audio I/O, transforms, basic models |
| SpeechBrain | Mila, Concordia University, and community | All purpose speech toolkit, recipes for ASR, TTS, speaker, enhancement |
| NVIDIA NeMo | Nvidia | Production speech and audio (Conformer, Canary, Parakeet, FastConformer) |
| ESPnet | Carnegie Mellon, JAIST, others | Research toolkit for ASR, TTS, speech translation, enhancement |
| Kaldi | Daniel Povey and community | Classical and hybrid ASR pipelines |
| K2 and Icefall | Next-gen Kaldi team | Modern lattice based ASR with PyTorch |
| WeNet | Mobvoi and Northwestern Polytechnical University | Streaming production ASR |
| fairseq | Meta AI | Self supervised audio research code |
| Coqui TTS | Coqui (now community fork) | Open TTS toolkit with XTTS, VITS, Tacotron 2 |
| OpenVoice and Coqui XTTS | MyShell and community | Voice cloning |
| Bark | Suno | Generative speech and sound effect model |
| AudioCraft | Meta | MusicGen, AudioGen, EnCodec reference code |
| Stable Audio Tools | Stability AI | Reference inference and training for Stable Audio |
| Demucs | Meta AI | Music source separation |
| Spleeter | Deezer | Music source separation, legacy |
| pyannote.audio | LIMSI and Inria | Speaker diarization, VAD |
| librosa | Brian McFee and contributors | Audio feature extraction in Python |
| TorchCREPE | community | Monophonic pitch tracking |
| openSMILE | audEERING | Paralinguistic and emotion features |
| WhisperX, faster-whisper, whisper.cpp | community | Optimized Whisper inference |
Most current research code ships on PyTorch with Hugging Face checkpoints. JAX is common at Google. Production inference often goes through ONNX, TensorRT, or specialized runtimes such as CTranslate2, which faster-whisper builds on.
Progress on audio models is tracked through a mix of task specific and general purpose benchmarks.
| Benchmark | Year | Task |
|---|---|---|
| LibriSpeech | 2015 | English read speech ASR |
| LibriLight | 2019 | Low resource and self supervised ASR |
| Common Voice | 2017 onward | Crowdsourced multilingual ASR |
| CHiME | 2011 onward | Noisy and far field ASR |
| TED-LIUM | 2012 onward | Lecture style ASR |
| FLEURS | 2022 | Multilingual ASR for 102 languages |
| MLS | 2020 | Multilingual LibriSpeech for 8 languages |
| VoxPopuli | 2021 | Multilingual European Parliament speech |
| SUPERB | 2021 | Self supervised speech evaluation across many tasks |
| SLUE | 2022 | Spoken language understanding |
| AudioSet | 2017 | About 2 million 10 second YouTube clips labeled with an ontology of 632 sound classes |
| ESC-50 | 2015 | 50 class environmental sounds |
| UrbanSound8K | 2014 | Urban sound classification |
| FSD50K | 2020 | Freesound based sound event tagging |
| DCASE challenges | 2013 onward | Acoustic scene and event detection |
| MUSDB18 | 2017 | Music source separation reference set |
| MAESTRO | 2018 | Piano transcription and synthesis |
| Slakh2100 | 2019 | Multitrack synthetic music separation |
| LJSpeech | 2017 | English single speaker TTS |
| VCTK | 2017 | Multispeaker TTS in English |
| LibriTTS and LibriTTS-R | 2019 and 2023 | Large scale multispeaker TTS corpora |
| MOS, P.808, UTMOS, NISQA | various | Subjective and predicted speech quality scores |
| AIR-Bench | 2024 | Audio LLM benchmark for chat and reasoning |
| Dynamic-SUPERB | 2024 | Instruction following speech evaluation |
| MMAU | 2024 | Multitask audio understanding for LLMs |
| SALMon | 2024 | Spoken language model evaluation for sentiment, sarcasm, prosody |
| Voicebench and VoxBench | 2024 | Voice assistant style evaluation |
| MELD, IEMOCAP | 2018, 2008 | Speech emotion recognition |
Leaderboards on Hugging Face track Open ASR (English), Multilingual ASR, TTS Arena, and audio LLM evaluations. Word error rate (WER) and character error rate (CER) remain the default metrics for recognition; Fréchet Audio Distance (FAD), KL divergence between the label distributions a pretrained classifier assigns to generated and reference audio, and CLAP score are common for music and sound generation; UTMOS and other predicted MOS scores are widely cited for TTS.
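Word error rate and character error rate are both edit distances normalized by reference length. The sketch below is a minimal version that assumes whitespace tokenization and no text normalization; published leaderboards usually normalize casing and punctuation first, which changes the numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: substitutions, insertions, deletions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when a system hallucinates extra words, which is one reason boundary hallucinations in long form audio are so visible in this metric.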
Despite rapid progress, audio models share several recurring weaknesses.
They degrade on out of distribution audio, particularly far field, noisy, accented, or code switched speech. Whisper performs much better on conversational web audio than on call center recordings, and almost all models lose accuracy below roughly 5 dB signal to noise ratio without targeted training. Long form audio handling is uneven; many systems do not gracefully handle silences, hesitations, or overlapping speakers, and segment based inference can hallucinate words at boundaries.
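Signal to noise ratio, and the common robustness test of mixing noise into clean speech at a chosen SNR, can be sketched as follows. The function names are illustrative, and the sketch assumes equal length mono signals with non zero power.

```python
import math


def power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def snr_db(signal, noise):
    """Signal to noise ratio in decibels."""
    return 10.0 * math.log10(power(signal) / power(noise))


def mix_at_snr(signal, noise, target_snr_db):
    """Scale the noise so that signal / scaled noise has the target SNR, then add.

    Returns the noisy mixture and the scaled noise.
    """
    gain = math.sqrt(power(signal) / (power(noise) * 10.0 ** (target_snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    return [s + n for s, n in zip(signal, scaled)], scaled


if __name__ == "__main__":
    tone = [math.sin(2.0 * math.pi * i / 16.0) for i in range(160)]
    hiss = [((i * 37) % 11 - 5) / 5.0 for i in range(160)]  # deterministic stand-in for noise
    mixture, scaled = mix_at_snr(tone, hiss, 5.0)
    print(round(snr_db(tone, scaled), 6))  # 5.0
```

At 5 dB the signal carries only about 3.2 times the power of the noise, which helps explain why accuracy falls off without noise augmented training.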
Generative speech and music models raise voice cloning, deepfake, and copyright concerns. A three second prompt is enough for several open TTS systems to copy a voice well enough to fool casual listeners, which has prompted research on audio watermarking (AudioSeal, WavMark) and on detection (ASVspoof challenges). Music generators have been the subject of training data disputes, including the 2024 RIAA lawsuits against Suno and Udio.
Low resource languages and dialects remain underserved. MMS expanded ASR coverage to about 1,100 languages, but quality varies widely; for many languages the only available training data is religious recordings such as Bible translations, which biases vocabulary and domain. TTS coverage is even thinner.
Full duplex and real time evaluation is hard. Standard ASR and TTS benchmarks do not capture turn taking, interruption, latency, or persona consistency, all of which matter for voice assistants. Newer benchmarks (AIR-Bench, SALMon, Dynamic-SUPERB) try to address this but are not yet stable.
Finally, compute cost is high for high fidelity, long form generation. Stable Audio 2 generates a few minutes of 44.1 kHz stereo with hundreds of diffusion steps, and Moshi style full duplex models need a carefully engineered streaming pipeline to stay near real time on a single GPU. As with text LLMs, distillation, quantization, and on device runtimes (whisper.cpp, faster-whisper, MLC, Apple Speech) are active areas of work.