Audio Classification Models

Updated May 31, 2026

See also: Audio Models and Audio

Audio classification models are machine learning systems for audio classification, the task of assigning one or more labels to an audio recording or to short fragments of it. The labels can describe an event in a clip (a dog bark, a siren, a guitar chord), an environment (a restaurant, a forest, a subway), a speaker, a language, or a music tag such as a genre, an instrument, or a mood. The same family of models is also used for time localised tasks where the system has to report when in a recording a given event occurs.

Modern audio classifiers are almost all deep neural networks trained on log-mel spectrograms or raw waveforms. They draw heavily on architectures developed for vision and language modelling, including VGG style convolutional networks, ResNets, the transformer, and vision transformers, and they are usually pretrained on AudioSet before being adapted to a target task. Most of the strongest backbones are trained with self-supervised learning on large pools of unlabelled audio. The dominant pretrained backbones in 2024 to 2026 include PANNs, the Audio Spectrogram Transformer (AST) and its descendants, BEATs, AudioMAE, EAT, SSLAM, Dasheng, and the CLAP family of contrastive audio language models that support zero shot classification.

Overview

Audio classification is a broad family of tasks that share one structural property: a model maps an audio signal, or a short window inside it, to one or more discrete labels. The signal is usually short, on the order of a few hundred milliseconds to thirty seconds, and the label space ranges from two classes to several thousand.

The community usually distinguishes the following sub tasks:

Sub task	What is predicted	Typical clip length	Representative datasets
Sound event classification	One or more event labels per clip	1 to 10 seconds	AudioSet, ESC-50, UrbanSound8K, FSD50K
Sound event detection (SED)	Labels with onset and offset times	10 to 60 seconds	DCASE SED tasks, URBAN-SED, DESED
Acoustic scene classification	One label describing the environment	10 to 30 seconds	DCASE ASC, TUT Acoustic Scenes
Music tagging	Multi label tags for genre, mood, instrument	30 seconds	MagnaTagATune (MTT), Million Song Dataset, MTG-Jamendo
Speaker identification	Speaker label from a closed set	1 to 8 seconds	VoxCeleb1, VoxCeleb2
Speaker verification	Same speaker yes or no	Pairs of 1 to 8 second clips	VoxCeleb, NIST SRE
Spoken language identification	Language label	3 to 30 seconds	VoxLingua107, CommonLanguage
Keyword spotting	One of a fixed vocabulary	1 second	Google Speech Commands
Bioacoustic classification	Species or call type	a few seconds	BirdCLEF, BEANS, BirdSet

Sound event classification and sound event detection are related but they are not the same problem. A classifier is satisfied by a clip level decision, for example "this recording contains a baby crying somewhere inside it". A detector also has to provide the onset and offset times of each event in continuous time, which is harder and is usually evaluated by an event based F score rather than mean average precision.

The key design choices for an audio classifier are the input representation (raw waveform, short time Fourier transform, log mel spectrogram, MFCC, learnable filterbank), the backbone (CNN, transformer, conformer, state space model), the temporal pooling that produces a clip level decision from a sequence of frame level features, and the training objective. Most state of the art systems use log mel spectrograms because they have proved hard to beat as input, and the dominant pooling is either global average pooling for CNNs or a CLS token for transformers.
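
The temporal pooling step mentioned above can be illustrated with a minimal sketch. It assumes frame level outputs are plain Python lists; the function names and the attention variant are illustrative choices, not taken from any particular model.

```python
import math

def global_average_pool(frames):
    """Average frame-level feature vectors into one clip-level vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def attention_pool(frames, frame_scores):
    """Weighted average of frames, with weights from a softmax over scores.

    frame_scores is assumed to come from a small learned scoring layer
    (hypothetical here); equal scores reduce to global average pooling.
    """
    m = max(frame_scores)
    weights = [math.exp(s - m) for s in frame_scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]

# Three frames with two features each.
frames = [[0.0, 2.0], [1.0, 4.0], [2.0, 6.0]]
clip_vector = global_average_pool(frames)
focused = attention_pool(frames, [0.0, 0.0, 5.0])  # leans towards the last frame
```

Average pooling treats every frame equally, which suits events that fill the clip; attention style pooling lets a short event dominate the clip level decision.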

History

MFCC and GMM era

Before deep learning, audio classification systems were built on a long pipeline that started with hand engineered features and ended with a probabilistic or kernel classifier. The most common feature was the mel frequency cepstral coefficient (MFCC), first proposed for speech recognition by Davis and Mermelstein in 1980. MFCCs are produced by computing a short time Fourier transform, mapping the magnitude spectrum to a mel filterbank that imitates the frequency selectivity of the human cochlea, taking the logarithm, and applying a discrete cosine transform that decorrelates the resulting bands.
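
The final steps of the MFCC pipeline described above can be sketched as follows. This is a minimal illustration that assumes the mel filterbank energies for one frame are already available; the 2595 and 700 constants are one common variant of the mel formula, and the unnormalised DCT-II is an assumption for simplicity.

```python
import math

def hz_to_mel(freq_hz):
    """Convert Hz to mels (a common 2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min, f_max):
    """Filter centre frequencies in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

def cepstral_coefficients(filterbank_energies, n_coeffs):
    """Log-compress the band energies, then apply an unnormalised DCT-II."""
    log_e = [math.log(e + 1e-10) for e in filterbank_energies]
    n = len(log_e)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(log_e))
        for k in range(n_coeffs)
    ]

# Centres are dense at low frequencies and sparse at high frequencies.
centres = mel_center_frequencies(n_filters=5, f_min=0.0, f_max=8000.0)
# A flat spectrum carries no cepstral structure beyond the first coefficient.
flat = cepstral_coefficients([1.0, 1.0, 1.0, 1.0], n_coeffs=3)
```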

On top of MFCCs the standard classifier for decades was the Gaussian mixture model (GMM), in particular the GMM-UBM (universal background model) framework introduced by Reynolds and others in the early 2000s for speaker recognition. Closely related were hidden Markov models for sequence labelling, support vector machines with kernels over bags of frames, and the i-vector approach of Dehak and colleagues from 2010, which gave a fixed dimensional embedding for a recording and dominated speaker verification until the deep learning takeover around 2016.

The MFCC plus GMM stack works well when there is little data and when the target classes are well separated, but it does not scale to the long tailed multi label problems that the field now cares about, and it requires careful hand tuning.

CNN era

The shift to deep learning came in two waves. The first wave applied small convolutional networks to mel spectrograms and showed that they beat the MFCC stack on clean academic benchmarks such as ESC-50 and UrbanSound8K. Piczak's 2015 baseline CNN on ESC-50 is the canonical reference for this wave.¹²

The second wave, starting in 2016 and 2017, scaled the approach by training on YouTube. Aytar, Vondrick, and Torralba presented SoundNet at NeurIPS 2016, a 1D CNN trained on roughly two million unlabelled videos using a teacher student loss that distilled image classifier predictions on the video frames into the audio stream.³ SoundNet pushed ESC-50 accuracy by more than ten points and is one of the earliest demonstrations of cross modal self supervision for audio.

In 2017 Hershey and colleagues at Google published "CNN Architectures for Large-Scale Audio Classification" at ICASSP, training AlexNet, VGG, Inception, and ResNet variants on 70 million video soundtracks from YouTube with 30,871 video level labels.⁴ The VGG variant became known as VGGish and Google released it as a 128 dimensional embedding model. VGGish went on to become one of the most widely used audio embeddings for transfer learning.

The same group released AudioSet in 2017 (Gemmeke et al., ICASSP 2017), a curated set of 1,789,621 ten second YouTube clips annotated with labels from a hierarchical ontology of 632 sound classes.⁵ AudioSet is by far the largest publicly available sound event dataset and almost every modern audio classifier is pretrained on it. Google later released YAMNet (Plakal and Ellis), a MobileNetV1 trained on AudioSet that predicts 521 of the 527 leaf classes, with 3.7 million weights and 69.2 million multiplies per 960 millisecond frame, which makes it about twenty times smaller than VGGish.⁶

The most influential pure CNN system on AudioSet is the family of Pretrained Audio Neural Networks (PANNs) by Kong, Cao, Iqbal, Wang, Wang, and Plumbley, published in IEEE/ACM Transactions on Audio, Speech and Language Processing in 2020. PANNs comprises fifteen architectures, including CNN10, CNN14, ResNet38, MobileNetV1 and V2, and the Wavegram-Logmel-CNN that combines a learnable 1D front end with a 2D log mel branch. CNN14 reaches a mean average precision of 0.431 on AudioSet tagging and Wavegram-Logmel-CNN reaches 0.439, both of which improved the previous state of the art (0.392).⁷ The PANN checkpoints are still a common transfer learning baseline.

Transformer era

The transformer wave arrived in 2021 with the Audio Spectrogram Transformer (AST) by Yuan Gong, Yu-An Chung, and James Glass at MIT (Interspeech 2021). AST is the first convolution free model to set state of the art on audio classification benchmarks: 0.485 mAP on AudioSet (an ensemble figure; the best single model reaches 0.459), 95.6 percent accuracy on ESC-50, and 98.1 percent on Speech Commands V2.⁸ AST tokenises a log mel spectrogram into overlapping 16 by 16 patches and feeds them through a ViT initialised from ImageNet. The choice to initialise from ImageNet was crucial; without it the model overfits.

AST kicked off a long line of variants. PaSST (Koutini et al., Interspeech 2022) keeps the AST architecture but trains it with patchout, a structured dropout over time and frequency patches, which shortens the input sequence and gives about a 4x training speedup and a small accuracy gain over a single AST model.⁹ HTS-AT (Chen et al., ICASSP 2022) introduces a hierarchical Swin Transformer style backbone with a token semantic module that allows the same network to perform clip level classification and time localised detection.¹⁰ HTS-AT reaches state of the art on AudioSet, ESC-50, and Speech Commands V2 while using only 35 percent of the parameters and 15 percent of the training time of AST.

The self supervised variants followed. SSAST (Gong et al., AAAI 2022) pretrains the AST backbone on unlabelled AudioSet and LibriSpeech with a joint discriminative and generative masked spectrogram patch modelling objective, and improves over an AST trained from scratch by 60.9 percent on average across audio and speech tasks, reaching results similar to or better than a supervised pretrained AST.¹¹ AudioMAE (Huang et al., NeurIPS 2022) is the audio version of the masked autoencoder of He et al.; it masks 80 percent of patches, uses a ViT encoder on the unmasked patches, and a ViT decoder with local window attention to reconstruct the spectrogram. AudioMAE set new state of the art on six classification benchmarks at the time of publication.¹²

BEATs (Chen et al., ICML 2023, Microsoft) moves from masked reconstruction to discrete token prediction. Instead of reconstructing raw spectrogram pixels, BEATs trains an acoustic tokenizer in alternation with the SSL model, and the model predicts the tokenizer indices on the masked patches. This iterative co training reached 50.6 percent mAP on AudioSet-2M without external data and 98.1 percent accuracy on ESC-50.¹³ EAT (Chen et al., IJCAI 2024) introduces an utterance frame objective and large inverse block masks, and reduces pretraining time by 15x compared to BEATs iter 3 and 10x compared to AudioMAE while matching or exceeding their downstream scores.¹⁴

Two other self supervised families sit alongside BEATs and EAT. ATST (Audio Teacher-Student Transformer) by Li, Shao, and Li, published in IEEE/ACM TASLP in 2024, trains a transformer with a teacher student bootstrap and releases two variants: ATST-Clip for clip level tasks and ATST-Frame for frame level tasks; the frame variant is especially strong on sound event detection.¹⁵ Masked Modeling Duo (M2D) from NTT, also in IEEE/ACM TASLP 2024, encodes only the masked part of the input to produce the training target and reports state of the art numbers on UrbanSound8K, VoxCeleb1, AudioSet-20K, GTZAN, and Speech Commands V2; it later spawned the M2D-CLAP audio language variant.¹⁶ Among the strongest recent self supervised results on AudioSet-2M is SSLAM (Self-Supervised Learning from Audio Mixtures) by Alex, Ahmed, Mustafa, Awais, and Jackson (ICLR 2025), which adds an audio mixture objective and a source retention loss to a data2vec style backbone and reaches 50.2 percent mAP, with gains of up to 9.1 percent mAP on polyphonic evaluation sets where overlapping sounds are common.¹⁷

In parallel, large speech self supervised models such as wav2vec 2.0,¹⁸ HuBERT, and WavLM became standard backbones for tasks that involve voice rather than general sound. WavLM in particular is the dominant backbone for speaker recognition, speaker diarisation, and emotion recognition on SUPERB and related benchmarks, because its pretraining objective explicitly preserves speaker identity.¹⁹

The latest direction is contrastive language audio pretraining. The two original CLAP papers, by Elizalde and colleagues at Microsoft and by Wu and colleagues at LAION, both appeared at ICASSP 2023 and extended the CLIP recipe from images to audio. CLAP models are not classifiers in the strict sense; they project audio and text into a shared embedding space, and a class label is just a text prompt at inference time. This enables zero shot audio classification on arbitrary label sets.

Foundational models

The table below summarises the most widely used audio classification backbones in 2024 to 2026. AudioSet mAP refers to the balanced evaluation set unless noted otherwise.

Model	Year	Authors	Architecture	Pretraining	AudioSet mAP	Parameters
VGGish	2017	Hershey et al. (Google)	VGG-A 11 layer CNN	YouTube-100M	not reported on AudioSet	~62M, 128-d embedding
YAMNet	2019	Plakal and Ellis (Google)	MobileNetV1	AudioSet	balanced mAP 0.306	3.7M
PANNs CNN14	2020	Kong et al.	14 layer CNN	AudioSet	0.431	80.8M
PANNs Wavegram-Logmel-CNN	2020	Kong et al.	1D + 2D CNN	AudioSet	0.439	81.1M
AST	2021	Gong, Chung, Glass	ViT-B from ImageNet	AudioSet	0.485	88M
PaSST	2022	Koutini et al.	AST with patchout	AudioSet	0.471	86M
HTS-AT	2022	Chen et al.	Swin-style hierarchical	AudioSet	0.471	31M
SSAST	2022	Gong et al.	AST, self-supervised MSPM	AudioSet + LibriSpeech	0.310 (linear)	89M
AudioMAE	2022	Huang et al. (Meta)	ViT MAE	AudioSet	0.473	86M
BEATs iter 3	2023	Chen et al. (Microsoft)	ViT, acoustic tokenizer	AudioSet	0.486 to 0.506	90M
ATST-Frame	2024	Li, Shao, Li	Teacher-student transformer	AudioSet	SOTA on frame-level SED	86M
M2D	2024	Niizumi et al. (NTT)	ViT, masked modeling duo	AudioSet	SOTA on AudioSet-20K, UrbanSound8K	86M
CED	2024	Dinkel et al. (Xiaomi)	Transformer, ensemble distillation	AudioSet	0.490 for a 10M student	10M to 86M
EAT	2024	Chen et al.	ViT, utterance-frame objective	AudioSet	0.488	88M
Dasheng	2024	Dinkel et al. (Xiaomi)	ViT MAE encoder, 272k hours	general audio SSL	general encoder, not the headline metric	up to 1.2B
SSLAM	2025	Alex et al.	data2vec-style, audio mixtures	AudioSet	0.502 (AudioSet-2M)	86M

Most of these models are released as Hugging Face checkpoints and can be fine tuned in a few hours on a single GPU. The standard procedure is to load the AudioSet pretrained weights, replace the classification head with a layer sized to the new label set, and train with binary cross entropy if the labels are multi label or softmax cross entropy if a single label per clip is expected.
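
The choice between the two training losses mentioned above can be made concrete with a minimal sketch. It assumes the new classification head emits one raw logit per class; the function names are illustrative and no specific framework is implied.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Single-label loss: negative log softmax probability of the true class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def binary_cross_entropy(logits, targets):
    """Multi-label loss: independent sigmoid per class, averaged over classes.

    Uses the numerically stable form max(z, 0) - z * y + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# One clip, three classes.
logits = [2.0, -1.0, 0.5]
single_label_loss = softmax_cross_entropy(logits, target_index=0)
multi_label_loss = binary_cross_entropy(logits, targets=[1.0, 0.0, 1.0])
```

Softmax forces the classes to compete for probability mass, which is appropriate only when exactly one label applies; the sigmoid form lets several classes be active in the same clip.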

VGGish

VGGish is a VGG-A style 11 layer CNN that takes a 96 by 64 log mel patch as input and produces a 128 dimensional embedding. It was trained on a private corpus called YouTube-100M to predict video level topic labels, not audio specific ones, so its embedding is a general purpose feature rather than a classifier per se. VGGish remains the default "give me a feature vector for a clip" model in many production systems, partly because of inertia and partly because its small embedding size makes downstream learning cheap.

YAMNet

YAMNet is a MobileNetV1 trained directly on AudioSet to predict 521 of its 527 leaf classes. It runs on a 0.96 second window with a 0.48 second hop, produces frame level posteriors, and is available through TensorFlow Hub and as a TFLite model, with PyTorch ports also in circulation. On the AudioSet evaluation set YAMNet reaches a balanced average d-prime of 2.318, a balanced mAP of 0.306, and an lwlrap of 0.393. It is small enough to run on a phone and is a common starting point for applications that need on device inference.
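
The 0.96 second window and 0.48 second hop determine how many frame level predictions a clip yields. The sketch below counts only full windows and assumes no padding at the end of the clip, which may differ from what a given YAMNet port does; the function name is illustrative.

```python
def window_starts_ms(clip_ms, window_ms=960, hop_ms=480):
    """Start times (in milliseconds) of every full window inside the clip."""
    return list(range(0, clip_ms - window_ms + 1, hop_ms))

starts = window_starts_ms(10_000)   # a ten second clip
num_frames = len(starts)
last_start = starts[-1]
```

Working in integer milliseconds avoids floating point drift when the hop is accumulated over a long recording.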

PANNs

PANNs is the most comprehensive open release of audio classifiers trained on AudioSet. The repository includes CNN10, CNN14, ResNet22, ResNet38, ResNet54, MobileNetV1 and V2, DaiNet, LeeNet, Res1dNet, and Wavegram-Logmel-CNN. CNN14 has become the de facto baseline for transfer learning to ESC-50, UrbanSound8K, FSD50K, GTZAN, RAVDESS, and other small datasets because its roughly 80 million parameters and 0.431 AudioSet mAP set a sensible accuracy/size tradeoff. The PANN paper also introduced the Wavegram front end, a learnable 1D convolution stack that produces a spectrogram like representation directly from the waveform and that, when concatenated with a log mel branch, gives the best PANN at 0.439 mAP.

AST and family

The Audio Spectrogram Transformer treats audio classification as a vision problem. It cuts the 128 bin log mel spectrogram of a 10 second clip into 16 by 16 patches with an overlap of 6 in both time and frequency (a stride of 10), flattens them into a sequence of patch tokens, prepends a class token, adds learnable positional embeddings, and runs the sequence through a ViT-B with 12 transformer blocks. The output of the class token is fed into a linear classifier. The trick that made AST work was to initialise the ViT from an ImageNet pretrained checkpoint and to interpolate the positional embeddings to fit the audio patch grid. Without this initialisation the model overfits AudioSet.
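
The length of the patch sequence follows from simple arithmetic. The sketch below assumes a stride of 10 along both axes (patch size 16 minus an overlap of 6, as in the AST paper) and a 10 millisecond frame hop, so that a 10 second clip spans 1,000 frames; released checkpoints may pad the time axis to a different length.

```python
def patches_along(axis_len, patch=16, stride=10):
    """Number of patch positions along one axis, without padding."""
    if axis_len < patch:
        return 0
    return (axis_len - patch) // stride + 1

freq_patches = patches_along(128)    # 128 mel bins
time_patches = patches_along(1000)   # 10 s at an assumed 10 ms hop
sequence_length = freq_patches * time_patches  # excludes the class token
```

Because self attention cost grows quadratically with sequence length, the overlap that helps accuracy also makes AST expensive, which is the motivation for methods that shorten the sequence during training.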

PaSST modifies AST by structured patchout, dropping out either random patches or entire time or frequency rows during training. This shortens the input sequence length and acts as a regulariser. The reported speedup is about 4x compared to AST on the same hardware. PaSST reaches 0.471 mAP with a single model, which is above the 0.459 of a single AST but below the 0.485 of the AST ensemble. HTS-AT replaces the flat ViT with a hierarchical Swin Transformer style backbone that pools tokens across stages and ends with a token semantic module that maps the final tokens back to per class spectrotemporal maps, which makes the same network useful for sound event detection.

Self supervised audio transformers

SSAST, AudioMAE, BEATs, EAT, ATST, M2D, and SSLAM are all self supervised models that share the AST or ViT backbone but differ in the pretraining objective. SSAST uses joint discriminative and generative masked spectrogram patch modelling. AudioMAE uses pure masked reconstruction in pixel space with a high masking ratio (80 percent) and a local window attention decoder. BEATs uses an iterative tokenizer co training procedure where the model and the acoustic tokenizer take turns improving each other; the third iteration produces the strongest released checkpoints, reaching 50.6 percent mAP on AudioSet-2M. EAT applies the data2vec 2.0 bootstrap recipe to audio with an utterance frame objective and inverse block masking, and is the most training efficient model in this group. SSLAM extends the same line by training on synthetic mixtures of clips so that the representation stays robust when several sounds overlap, which is the common case in real recordings.
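
The high masking ratio used by masked autoencoder style pretraining can be sketched as a random split of patch indices. This is a simplified illustration of unstructured random masking only; the function name and seed handling are assumptions, and the structured and inverse block masking variants used by some of these models are not shown.

```python
import random

def split_visible_masked(num_patches, mask_ratio=0.8, seed=0):
    """Randomly assign patch indices to a visible set and a masked set."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = round(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked

visible, masked = split_visible_masked(num_patches=100)
# The encoder would see only the visible patches; the training target
# is defined on the masked ones.
```

With 80 percent of patches hidden the encoder processes a sequence one fifth as long, which is where much of the pretraining efficiency comes from.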

Two recent threads push on efficiency and scale rather than on the masking objective. CED (Consistent Ensemble Distillation) by Dinkel and colleagues at Xiaomi (ICASSP 2024) distils an ensemble of large AudioSet teachers into a single student by replaying stored teacher logits under matched augmentation. The recipe is label free and adds only about 0.3 percent disk overhead for AudioSet, and a 10 million parameter CED student reaches 0.490 mAP, which is competitive with models nearly ten times its size.²⁰ Dasheng (Deep Audio-Signal Holistic Embeddings), also from Xiaomi (Interspeech 2024), goes the other way and scales the encoder of a masked autoencoder to 1.2 billion parameters trained on 272,356 hours of public audio from VGGSound, AudioSet, MTG-Jamendo, and ACAV100M. Dasheng is a general purpose encoder rather than an AudioSet classifier, and it reports strong transfer across speech, music, and environmental sound tasks such as CREMA-D, LibriCount, Speech Commands, and VoxLingua107.²¹

CLAP family

Contrastive Language Audio Pretraining (CLAP) is the audio analogue of CLIP. A CLAP model has an audio encoder and a text encoder, and it is trained on pairs of audio clips and natural language captions or tags to maximise the cosine similarity of matching pairs and minimise it for mismatched pairs in a batch. Once trained, the model can classify a clip in a zero shot manner by encoding a list of candidate label prompts ("the sound of a dog barking", "the sound of a car horn") and picking the prompt closest to the audio embedding in the shared space.
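
Both the training objective and the zero shot decision rule can be sketched in a few lines. The example assumes embeddings have already been produced by some audio and text encoder; the toy vectors, the temperature value, and the function names are all illustrative and do not correspond to any released CLAP checkpoint.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric cross entropy over a batch; pair i matches pair i."""
    n = len(audio_embs)
    logits = [[cosine(a, t) / temperature for t in text_embs] for a in audio_embs]

    def cross_entropy(row, target):
        m = max(row)
        return m + math.log(sum(math.exp(z - m) for z in row)) - row[target]

    audio_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_audio = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (audio_to_text + text_to_audio)

def zero_shot_classify(audio_emb, prompt_embs):
    """Return the prompt whose embedding is closest to the audio embedding."""
    scores = {prompt: cosine(audio_emb, emb) for prompt, emb in prompt_embs.items()}
    return max(scores, key=scores.get), scores

# Toy two-dimensional embeddings standing in for encoder outputs.
prompts = {
    "the sound of a dog barking": [1.0, 0.0],
    "the sound of a car horn": [0.0, 1.0],
}
label, scores = zero_shot_classify([0.9, 0.1], prompts)
```

Because the label set enters only through the prompt embeddings, changing the classes requires no retraining, only a new list of prompts.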

The two original CLAP papers both appeared at ICASSP 2023:²² ²³

Model	Authors	Audio encoder	Text encoder	Training data	Distinctive feature
Microsoft CLAP	Elizalde, Deshmukh, Al Ismail, Wang	CNN14	BERT	128k audio text pairs	Zero shot audio classification evaluated on 16 downstream tasks
LAION-CLAP	Wu, Chen, Zhang, Hui, Berg-Kirkpatrick, Dubnov	HTS-AT	RoBERTa	LAION-Audio-630K (633,526 pairs)	Feature fusion for variable length audio and keyword to caption augmentation

LAION-CLAP also released the LAION-Audio-630K dataset, which pools clips from Freesound, AudioCaps, Clotho, BBC Sound Effects, and several smaller sources. The LAION checkpoints are the most widely used CLAP variants and are integrated into Hugging Face transformers under the model identifier laion/clap-htsat-unfused.

Microsoft followed up with CLAP 2023 and CLAP 2024 checkpoints that scale the dataset and the model size and that are released through the msclap package on PyPI. Other CLAP variants include CompA-CLAP, CoLLAP for long form audio, CALM, and music focused MuLan from Google.

CLAP is a flexible tool. It is used for zero shot tagging, for retrieval (given a text query, find the audio clip), for captioning by combining the audio encoder with a language model decoder, and as a frozen embedding for downstream classifiers. The catch is that on a closed label set with enough training data, a supervised fine tuned BEATs or AST still beats CLAP. CLAP wins when the label set is large, open ended, or unknown in advance.

Audio language models

A newer route to open ended audio classification is the large audio language model, which connects an audio encoder to a text generating large language model and answers natural language questions about a clip, including "what sound is this". Qwen2-Audio (Alibaba, 2024) is a 7 billion parameter instruction tuned model released as Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct; it reports state of the art instruction following across speech, sound, and music and can classify events directly from a prompt without a fixed label head.²⁴ NVIDIA's Audio Flamingo 3 (2025) couples an AF-Whisper unified encoder with a Qwen2.5-7B decoder and reports state of the art on several zero shot and few shot audio classification, captioning, and question answering benchmarks, with support for clips up to ten minutes long.²⁵ These models are heavier than a CLAP encoder and are usually overkill for a single fixed label set, but they are the most flexible option when the task is phrased in free text or mixes classification with reasoning.

Speaker recognition

Speaker recognition is the family of tasks that ask who is talking. It splits into speaker identification (closed set, one of N), speaker verification (open set, same or different), and speaker diarisation (who spoke when), and it is one of the longest studied audio classification problems.

The field was dominated for a decade by the i-vector front end of Dehak, Kenny, and others (2010), which used a factor analysis model on GMM supervectors to produce a fixed dimensional embedding for a recording, followed by probabilistic linear discriminant analysis. The first deep learning system to clearly beat i-vectors was x-vectors (Snyder, Garcia-Romero, Sell, Povey, Khudanpur, ICASSP 2018), a time delay neural network that produces a fixed dimensional speaker embedding by statistics pooling over frame level features.²⁶ X-vectors became the production standard quickly because they handle large training corpora and data augmentation much better than i-vectors.

The next jump was ECAPA-TDNN (Desplanques, Thienpondt, Demuynck, Interspeech 2020).²⁷ ECAPA-TDNN stands for Emphasised Channel Attention, Propagation and Aggregation in Time Delay Neural Networks. It introduces three changes on top of x-vectors: Res2Net blocks with multi scale temporal convolutions, squeeze and excitation channel attention, and channel dependent attentive statistics pooling. On VoxCeleb1-O it reaches equal error rate (EER) below one percent and it remains a strong baseline through 2026. The SpeechBrain implementation speechbrain/spkrec-ecapa-voxceleb on Hugging Face is the most downloaded speaker embedding checkpoint.

From 2022 onward the trend has been to use general purpose speech self supervised models as backbones rather than to train speaker specific networks from scratch. WavLM Large fine tuned on VoxCeleb2 sets new state of the art on the SUPERB speaker tasks and on VoxCeleb evaluation sets, partly because its pretraining objective explicitly preserves speaker information. NVIDIA's NeMo TitaNet is another strong production model in the same lineage.

The metrics of record for verification are the equal error rate (EER) and the minimum detection cost function (minDCF). Standard test sets include VoxCeleb1-O, VoxCeleb1-E, VoxCeleb1-H, and the NIST SRE series.
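
The EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is a simple threshold sweep over the observed scores, assuming higher scores mean "same speaker"; it returns the midpoint of the two rates at the closest crossing rather than interpolating, and the function name is illustrative.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER by sweeping thresholds over all observed scores."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap = float("inf")
    eer = 1.0
    for t in thresholds:
        # Impostor trials accepted at this threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # Genuine trials rejected at this threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer

genuine = [0.9, 0.8, 0.7, 0.4]    # same-speaker trial scores
impostor = [0.6, 0.3, 0.2, 0.1]   # different-speaker trial scores
eer = equal_error_rate(genuine, impostor)
```

In this toy set one genuine trial scores below one impostor trial, so one in four trials of each kind is misclassified at the crossing point.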

Music tagging

Music tagging is the task of assigning multi label tags such as genre, instrument, mood, and tempo to a music clip. The classical benchmarks are MagnaTagATune (MTT) with 188 tags on ~25,000 clips of 29 seconds, the Million Song Dataset (MSD) with 50 tags on a million clips, and MTG-Jamendo with 195 tags on 55,000 full tracks.

The progression of music tagging models mirrors the wider field. The deep learning baseline is musicnn (Pons and Serra, 2019), a musically motivated CNN that uses vertical filters to capture timbre and horizontal filters to capture rhythm, with an attention output layer.²⁸ musicnn reaches 90.77 ROC-AUC and 38.61 PR-AUC on MTT and 88.81 ROC-AUC and 31.51 PR-AUC on MSD. The library musicnn ships pretrained MTT and MSD checkpoints and is still a useful baseline. PANNs CNN14 and AST adapted to music tagging match or exceed musicnn on the same benchmarks.

The current generation of music tagging models is self supervised. JukeMIR (Castellon, Donahue, Liang, 2021) used internal representations from OpenAI's Jukebox music generation model as music features and matched or exceeded supervised baselines on tagging, genre classification, key detection, and emotion recognition. MERT (Li, Yuan, Zhang et al., ICLR 2024) is a dedicated music self supervised model.²⁹ It is a HuBERT style BERT encoder trained with two teachers: an RVQ-VAE acoustic teacher and a Constant-Q Transform musical teacher. MERT scales from 95 million to 330 million parameters and reaches state of the art on a 14 task suite covering tagging, key detection, beat tracking, source separation, and singer identification.

MuLan (Huang et al., Google, 2022) and the LAION-CLAP music branch extend the CLAP idea to music with text queries such as "jazz piano with brushes" or "upbeat indie pop with female vocals". MuLan provides the joint music and text embedding that conditions MusicLM and is one of the most influential music representations in the recent generation of music generators.

Sound event detection (DCASE)

Sound event detection (SED) requires the model to output not just which events are present but also when each one starts and ends, frame by frame or in continuous time. SED is harder than clip level classification because the model has to deal with overlapping events, with labels of very different duration, and with weak or missing time annotations.
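
The step from frame level probabilities to onset and offset times can be sketched for a single class as follows. The fixed threshold and the hop size are assumptions for illustration; practical systems usually add median filtering or class specific thresholds, which are not shown.

```python
def decode_events(frame_probs, threshold=0.5, hop_seconds=0.02):
    """Turn per-frame probabilities for one class into (onset, offset) pairs."""
    events = []
    onset = None
    for i, p in enumerate(frame_probs):
        active = p >= threshold
        if active and onset is None:
            onset = i                      # event starts at this frame
        elif not active and onset is not None:
            events.append((onset * hop_seconds, i * hop_seconds))
            onset = None                   # event ended before this frame
    if onset is not None:
        # Event still active at the end of the clip.
        events.append((onset * hop_seconds, len(frame_probs) * hop_seconds))
    return events

probs = [0.1, 0.9, 0.8, 0.2, 0.7]
events = decode_events(probs, threshold=0.5, hop_seconds=1.0)
```

Running the decoder once per class yields overlapping events naturally, since each class is thresholded independently.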

The SED community is organised around the DCASE (Detection and Classification of Acoustic Scenes and Events) challenge, which was first held in 2013, has run annually since 2016, and has produced most of the canonical SED datasets. DCASE 2016 was a turning point because it included separate tasks for acoustic scene classification, sound event detection in synthetic audio, sound event detection in real life audio, and domestic audio tagging, and it documented the shift from GMM and SVM systems to deep learning. The corresponding paper by Mesaros, Heittola, and Virtanen, "Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge" (IEEE/ACM TASLP 2018), is the standard citation for SED methodology.³⁰

Key datasets and tasks that come out of DCASE include:

Dataset / task	First year	Domain	Annotation
TUT Acoustic Scenes	2016	Indoor and outdoor scenes	Clip level scene label
TUT Sound Events	2016	Home and residential area	Strong (onset/offset)
DESED (Domestic Environment Sound Event Detection)	2018	Home	Mix of strong and weak labels
URBAN-SED	2017	Urban	Strong, synthetic
FSD50K	2020	Freesound clips	Clip level only, AudioSet style
MAESTRO Real (DCASE 2023 Task 4)	2023	Real and synthetic mixtures	Strong with soft labels

Modern SED systems are usually built on top of a clip level pretrained backbone such as PANNs or BEATs, with an attention pooling head that learns to localise the relevant frames. HTS-AT, with its token semantic module, was specifically designed for this dual use case. The state of the art on DESED hovers around 0.5 to 0.6 in the polyphonic sound detection score (PSDS), which is the DCASE primary metric.

Bioacoustics and Google Perch

Bioacoustics applies audio classification to animal sounds, with the practical aim of biodiversity monitoring and conservation. The community has benefited enormously from transfer learning because labelled bird and bat recordings are scarce.

The most influential model is Google's Perch, released in 2023 by the Google DeepMind bioacoustics team. Perch is an EfficientNet B1 trained on the Xeno-canto bird sound archive with a sequence of contrastive and supervised losses, and it produces an embedding that transfers strongly to dozens of bioacoustic classification benchmarks. The Scientific Reports paper "Global birdsong embeddings enable superior transfer learning for bioacoustic classification" (Ghani, Denton, Kahl, Klinck, 2023) shows that Perch embeddings beat task specific models on many BEANS benchmarks, including frog, bat, and marine mammal calls, despite being trained only on birds.³¹ Perch has been downloaded more than 250,000 times and is integrated into Cornell's BirdNET Analyzer and into the Conservation Metrics field tools.

Perch 2.0, released in 2025, scales the model to support roughly 15,000 species and uses an updated training pipeline that combines species classification with self distillation. It sets new state of the art on the BirdSet and BEANS benchmarks and is recommended for any new bioacoustic project.³²

Other bioacoustic models include BirdNET (Kahl et al., Cornell, 2021), which is widely used in citizen science through the BirdNET app and analysis tools, and AnimalNet for a broader range of taxa. The BEANS (BEnchmark of ANimal Sounds) suite by Hagiwara et al. (2022) is the standard evaluation for bioacoustic embeddings.³³

Benchmarks

The community uses several standard benchmarks to compare audio classification models. They differ in label space, clip length, domain, and licence. Two that anchor the keyword spotting and language identification tasks are Google Speech Commands³⁴ and VoxLingua107.³⁵

Benchmark	Year	Size	Classes	Task	Metric
AudioSet	2017	2.1M clips, 10 s	527 leaf (632 ontology)	Multi label tagging	mAP
ESC-50	2015	2,000 clips, 5 s	50	Single label	Accuracy, 5 fold CV
ESC-10	2015	400 clips, 5 s	10	Single label	Accuracy
UrbanSound8K	2014	8,732 clips, up to 4 s	10	Single label	Accuracy, 10 fold CV
FSD50K	2020	51,197 clips	200	Multi label	mAP
Speech Commands V2	2018	105,829 clips, 1 s	35	Keyword	Accuracy
VoxLingua107	2021	6,628 hours	107 languages	Single label	Accuracy
VoxCeleb1	2017	153,516 utterances	1,251 speakers	Identification, verification	EER
MagnaTagATune	2009	~25,000 clips, 29 s	188 tags (top 50 used)	Multi label	ROC-AUC, PR-AUC
MTG-Jamendo	2019	55,000 tracks	195 tags	Multi label	ROC-AUC, PR-AUC
BirdCLEF	2014+	~700k recordings	10,000+ species	Multi label	macro F1
BEANS	2022	12 datasets	many	Bioacoustic	Mean of per task metrics
HEAR 2021	2022	19 tasks, 16 datasets	mixed	General	Mean of per task metrics
SUPERB	2021	10 speech tasks	mixed	Speech	Mean of per task metrics
X-ARES	2024	22 tasks	mixed	General	Mean of per task metrics
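
Several benchmarks in the table above report mean average precision (mAP). The sketch below computes it in the usual ranking based way, assuming scores and binary labels are given as lists of per clip rows; ties in the scores and classes without any positive clip are handled naively here, whereas evaluation toolkits typically treat them more carefully.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive clip."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(score_rows, label_rows):
    """Average the per-class AP over all classes (columns)."""
    n_classes = len(score_rows[0])
    per_class = []
    for c in range(n_classes):
        class_scores = [row[c] for row in score_rows]
        class_labels = [row[c] for row in label_rows]
        per_class.append(average_precision(class_scores, class_labels))
    return sum(per_class) / n_classes

# Four clips, one class: positives ranked first and third.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

Because mAP averages over classes rather than clips, rare classes weigh as much as common ones, which is why it is preferred over accuracy on long tailed label sets such as AudioSet.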

HEAR (Holistic Evaluation of Audio Representations) by Turian and colleagues at NeurIPS 2021 was the first attempt at a unified audio representation benchmark across speech, environmental sound, and music. Twenty nine submitted models from thirteen teams were evaluated on nineteen tasks from sixteen datasets, with a common API that takes an audio file and returns an embedding.³⁶ HEAR remains the most cited general audio representation benchmark, though its tasks are now considered relatively easy and recent benchmarks such as X-ARES extend it.

Applications

Audio classification is used in a wide range of practical systems.

On device keyword spotting and wake word detection. Consumer smart speakers and most smartphones run a small audio classifier continuously on device, listening for a wake word ("Alexa", "Hey Siri", "OK Google"). These detectors are typically tiny CNNs or RNNs with a few hundred thousand parameters and a memory footprint under a megabyte. NVIDIA's MatchboxNet is a representative compact neural keyword spotter, and the open source PocketSphinx keyword spotter is an older, non-neural alternative.

Hearing aids and accessibility. Modern hearing aids use real time sound scene classifiers to decide whether the wearer is in a quiet room, in a restaurant, in traffic, or listening to music, and they adjust filtering and compression accordingly. Cochlear implant processors use similar classifiers for environment detection.

Industrial machine monitoring. Sound based predictive maintenance for industrial equipment uses anomaly detection on top of audio classification. DCASE has run a dedicated task on this since 2020 with the MIMII and ToyADMOS datasets, where the model has to flag malfunctioning pumps, valves, fans, and slide rails from their sound.

Surveillance and public safety. Gunshot detection systems such as ShotSpotter use distributed microphone arrays and a classifier trained on impulsive sounds to localise gunfire in urban areas. Glass break detectors in alarm systems are simpler classifiers of the same family.

Wildlife monitoring. Passive acoustic monitoring of forests, wetlands, and oceans relies on bioacoustic classifiers such as BirdNET and Perch. The 2024 BirdCLEF challenge had recordings from passive monitoring sites in the Western Ghats of India, and the winning system used a Perch embedding plus a small classifier head.

Music streaming and recommendation. Spotify, Apple Music, and YouTube Music all use audio tagging models internally for genre, mood, and instrument tagging, for cold start recommendations on tracks without listening history, and for automatic playlist generation. Approximate nearest neighbour retrieval over learned audio embeddings, with libraries such as Spotify's Annoy or Meta's Faiss, is one production pattern.

Content moderation. Platforms that host user uploaded video use audio classification to flag certain content categories, for example violent sounds, certain types of music, and language identification on the speech track before it is run through automatic speech recognition systems such as Whisper.³⁷

Voice biometrics. Banks and call centres use speaker verification systems based on ECAPA-TDNN or WavLM to confirm a caller's identity. The accuracy is high enough for low risk operations but is usually combined with other factors for high value transactions.
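The equal error rate (EER) that the benchmark table lists for VoxCeleb1 can be illustrated with a minimal Python sketch. It assumes each verification trial has already been reduced to a single similarity score, for example the cosine similarity between two speaker embeddings. The scores are invented toy values, and `equal_error_rate` is a hypothetical helper written for this example rather than a function from any toolkit.

```python
from typing import Sequence


def equal_error_rate(target_scores: Sequence[float],
                     impostor_scores: Sequence[float]) -> float:
    """Approximate the EER by trying every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    """
    best_gap = float("inf")
    eer = 1.0
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # False rejection: a genuine speaker scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: an impostor scored at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy similarity scores, invented for illustration only.
target_scores = [0.9, 0.8, 0.7, 0.4]    # same-speaker trials
impostor_scores = [0.6, 0.3, 0.2, 0.1]  # different-speaker trials

eer = equal_error_rate(target_scores, impostor_scores)
print(f"EER = {eer:.2f}")  # prints "EER = 0.25"
```

The EER is the operating point where false acceptances and false rejections are equally frequent. A deployed system would normally pick a stricter or looser threshold depending on the risk of the operation, which is one reason voice biometrics is combined with other factors for high value transactions.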

Limitations

Audio classifiers inherit most of the limitations of any deep learning system trained on internet data and add a few that are specific to audio.

Label noise on AudioSet is significant. The original paper reports rater unanimity of 76.2 percent, which means roughly a quarter of segments have at least one rater disagreeing on at least one label. Single label datasets such as ESC-50 are cleaner but small.

Domain shift hurts. A model trained on YouTube audio sees mostly studio recorded or amateur near field speech and music. It often fails on far field recordings, on heavy reverberation, on unusual microphones, and on languages or accents that are underrepresented in the training data. Bioacoustic transfer to new geographies is a constant problem; a model trained on North American birds will miss many South American species.

The AudioSet ontology is biased toward sounds that are common on YouTube. Many real world sound categories, for example specific medical sounds, industrial sounds, and rare wildlife calls, are absent or only weakly represented. The ontology also conflates many sounds that are perceptually distinct, for example all dog sounds are in one node despite a bark and a whimper being acoustically very different.

Privacy is a real concern. Always on audio classifiers running on phones and smart speakers raise legitimate questions about what data leaves the device and what is retained. Some manufacturers store wake word triggers for service improvement, which has led to several public controversies. Speaker recognition systems are also subject to spoofing attacks (replay, voice conversion, deepfakes), and the ASVspoof challenge series has documented how easy it is to fool naive verification systems.

Evaluation metrics under-report failure modes. mAP on AudioSet gives a single number that hides large per class variation; rare classes are often near random. F score on SED is sensitive to the time tolerance used in the evaluation, and small changes in the metric can swap the ranking of models.
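The way a single mAP figure can hide a weak class is easy to see on a toy example. The sketch below computes non-interpolated average precision per class and then the mean. The class names, labels, and scores are invented for illustration, and `average_precision` is a hypothetical helper, not the exact AudioSet evaluation code.

```python
from typing import Dict, List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Non-interpolated AP: mean of the precision at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    # AP is undefined for a class with no positive clips.
    return precision_sum / hits if hits else float("nan")


# Toy multi label data for six clips (invented values).
labels: Dict[str, List[int]] = {
    "speech": [1, 1, 1, 0, 1, 0],      # common class
    "rare_event": [0, 0, 0, 0, 0, 1],  # a single positive clip
}
scores: Dict[str, List[float]] = {
    "speech": [0.9, 0.8, 0.7, 0.6, 0.5, 0.1],
    "rare_event": [0.2, 0.6, 0.1, 0.5, 0.4, 0.3],
}

per_class = {name: average_precision(labels[name], scores[name]) for name in labels}
mean_ap = sum(per_class.values()) / len(per_class)

for name, ap in per_class.items():
    print(f"{name}: AP = {ap:.2f}")
print(f"mAP = {mean_ap:.2f}")
```

Here the common class scores an AP of about 0.95 and the rare class about 0.25, yet the headline mAP of about 0.60 reveals neither. Reporting per class AP alongside the mean makes this kind of failure visible.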

Compute and energy costs are not trivial. Training BEATs iter 3 on AudioSet takes roughly a thousand A100 GPU hours, which is expensive in dollars and in carbon. Most academic labs cannot reproduce these runs and depend on the released checkpoints.

Finally, zero shot CLAP style models are tempting but they have a hidden cost. Prompt engineering matters a lot. The prompt "the sound of a dog barking" gives different results from "a dog barking" or "dog", and small differences can change the rank order on a benchmark. Calibration of CLAP scores across classes is also not well behaved out of the box.
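The mechanism behind this prompt sensitivity can be sketched in a few lines. A CLAP style model scores each class by the cosine similarity between the audio embedding and the embedding of a text prompt, so changing the prompt wording moves the text embedding and can change the ranking. The three dimensional vectors below are hand picked toy values chosen to make the effect visible, not outputs of any real model, and `zero_shot_rank` is a hypothetical helper.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_rank(audio_emb: Sequence[float],
                   prompt_embs: Dict[str, Sequence[float]]) -> List[str]:
    """Return class labels sorted from most to least similar to the audio."""
    sims = {label: cosine(audio_emb, emb) for label, emb in prompt_embs.items()}
    return sorted(sims, key=sims.get, reverse=True)


# Toy audio embedding (assumption: stands in for an audio encoder output).
audio_emb = [0.9, 0.4, 0.1]

# Toy text embeddings for the same two classes under two prompt templates.
prompt_sets = {
    "label only": {
        "dog": [1.0, 0.2, 0.0],
        "cat": [0.3, 1.0, 0.0],
    },
    "the sound of a ...": {
        "dog": [0.5, 0.5, 0.7],
        "cat": [0.7, 0.6, 0.1],
    },
}

for template, prompt_embs in prompt_sets.items():
    print(template, "->", zero_shot_rank(audio_emb, prompt_embs))
```

With these toy vectors the top ranked class is "dog" under the first template and "cat" under the second, even though the audio embedding never changes. The raw cosine values are also not calibrated probabilities, so a fixed decision threshold does not carry over between classes or templates.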

References

Piczak, K. J. (2015). "ESC: Dataset for Environmental Sound Classification." ACM Multimedia 2015. https://www.karolpiczak.com/papers/Piczak2015-ESC-Dataset.pdf Accessed 2026-05-31. ↩
Salamon, J., Jacoby, C., Bello, J. P. (2014). "A Dataset and Taxonomy for Urban Sound Research." ACM Multimedia 2014. https://dl.acm.org/doi/10.1145/2647868.2655045 Accessed 2026-05-31. ↩
Aytar, Y., Vondrick, C., Torralba, A. (2016). "SoundNet: Learning Sound Representations from Unlabeled Video." NeurIPS 2016. https://arxiv.org/abs/1610.09001 Accessed 2026-05-31. ↩
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R. J., Wilson, K. (2017). "CNN Architectures for Large-Scale Audio Classification." ICASSP 2017. https://arxiv.org/abs/1609.09430 Accessed 2026-05-31. ↩
Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., Ritter, M. (2017). "Audio Set: An ontology and human-labeled dataset for audio events." ICASSP 2017. https://research.google/pubs/audio-set-an-ontology-and-human-labeled-dataset-for-audio-events/ Accessed 2026-05-31. ↩
Plakal, M., Ellis, D. P. W. (2019, updated 2020). "YAMNet." TensorFlow Models. https://github.com/tensorflow/models/tree/master/research/audioset/yamnet Accessed 2026-05-31. ↩
Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M. D. (2020). "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition." IEEE/ACM TASLP. https://arxiv.org/abs/1912.10211 Accessed 2026-05-31. ↩
Gong, Y., Chung, Y.-A., Glass, J. (2021). "AST: Audio Spectrogram Transformer." Interspeech 2021. https://arxiv.org/abs/2104.01778 Accessed 2026-05-31. ↩
Koutini, K., Schluter, J., Eghbal-zadeh, H., Widmer, G. (2022). "Efficient Training of Audio Transformers with Patchout." Interspeech 2022. https://arxiv.org/abs/2110.05069 Accessed 2026-05-31. ↩
Chen, K., Du, X., Zhu, B., Ma, Z., Berg-Kirkpatrick, T., Dubnov, S. (2022). "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection." ICASSP 2022. https://arxiv.org/abs/2202.00874 Accessed 2026-05-31. ↩
Gong, Y., Lai, C.-I. J., Chung, Y.-A., Glass, J. (2022). "SSAST: Self-Supervised Audio Spectrogram Transformer." AAAI 2022. https://arxiv.org/abs/2110.09784 Accessed 2026-05-31. ↩
Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., Feichtenhofer, C. (2022). "Masked Autoencoders that Listen." NeurIPS 2022. https://arxiv.org/abs/2207.06405 Accessed 2026-05-31. ↩
Chen, S., Wu, Y., Wang, C., Liu, S., Tompkins, D., Chen, Z., Wei, F. (2023). "BEATs: Audio Pre-Training with Acoustic Tokenizers." ICML 2023. https://arxiv.org/abs/2212.09058 Accessed 2026-05-31. ↩
Chen, W., Liang, Y., Ma, Z., Zheng, Z., Chen, X. (2024). "EAT: Self-Supervised Pre-Training with Efficient Audio Transformer." IJCAI 2024. https://arxiv.org/abs/2401.03497 Accessed 2026-05-31. ↩
Li, X., Shao, N., Li, X. (2024). "Self-Supervised Audio Teacher-Student Transformer for Both Clip-Level and Frame-Level Tasks." IEEE/ACM Transactions on Audio, Speech, and Language Processing. https://arxiv.org/abs/2306.04186 Accessed 2026-05-31. ↩
Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N., Kashino, K. (2024). "Masked Modeling Duo: Towards a Universal Audio Pre-Training Framework." IEEE/ACM Transactions on Audio, Speech, and Language Processing. https://arxiv.org/abs/2404.06095 Accessed 2026-05-31. ↩
Alex, T., Ahmed, S., Mustafa, A., Awais, M., Jackson, P. J. B. (2025). "SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes." ICLR 2025. https://arxiv.org/abs/2506.12222 Accessed 2026-05-31. ↩
Baevski, A., Zhou, H., Mohamed, A., Auli, M. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." NeurIPS 2020. https://arxiv.org/abs/2006.11477 Accessed 2026-05-31. ↩
Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., Kanda, N., Yoshioka, T., Xiao, X., Wu, J., Zhou, L., Ren, S., Qian, Y., Qian, Y., Wu, J., Zeng, M., Yu, X., Wei, F. (2022). "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing." IEEE JSTSP. https://arxiv.org/abs/2110.13900 Accessed 2026-05-31. ↩
Dinkel, H., Wang, Y., Yan, Z., Zhang, J., Wang, Y. (2024). "CED: Consistent Ensemble Distillation for Audio Tagging." ICASSP 2024. https://arxiv.org/abs/2308.11957 Accessed 2026-05-31. ↩
Dinkel, H., Yan, Z., Wang, Y., Zhang, J., Wang, Y., Wang, B. (2024). "Scaling up masked audio encoder learning for general audio classification" (Dasheng). Interspeech 2024. https://arxiv.org/abs/2406.06992 Accessed 2026-05-31. ↩
Elizalde, B., Deshmukh, S., Al Ismail, M., Wang, H. (2023). "CLAP: Learning Audio Concepts From Natural Language Supervision." ICASSP 2023. https://arxiv.org/abs/2206.04769 Accessed 2026-05-31. ↩
Wu, Y., Chen, K., Zhang, T., Hui, Y., Berg-Kirkpatrick, T., Dubnov, S. (2023). "Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation." ICASSP 2023. https://arxiv.org/abs/2211.06687 Accessed 2026-05-31. ↩
Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., Zhou, C., Zhou, J. (2024). "Qwen2-Audio Technical Report." Alibaba. https://arxiv.org/abs/2407.10759 Accessed 2026-05-31. ↩
Goel, A., Ghosh, S., Kim, J., Kumar, S., Kong, Z., Lee, S.-g., Yang, C.-H. H., Duraiswami, R., Manocha, D., Valle, R., Catanzaro, B. (2025). "Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models." NVIDIA (accepted to NeurIPS 2025). https://research.nvidia.com/labs/adlr/AF3/ Accessed 2026-05-31. ↩
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., Khudanpur, S. (2018). "X-Vectors: Robust DNN Embeddings for Speaker Recognition." ICASSP 2018. https://www.danielpovey.com/files/2018_icassp_xvectors.pdf Accessed 2026-05-31. ↩
Desplanques, B., Thienpondt, J., Demuynck, K. (2020). "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification." Interspeech 2020. https://arxiv.org/abs/2005.07143 Accessed 2026-05-31. ↩
Pons, J., Serra, X. (2019). "musicnn: Pre-trained convolutional neural networks for music audio tagging." ISMIR 2019 LBD. https://arxiv.org/abs/1909.06654 Accessed 2026-05-31. ↩
Li, Y., Yuan, R., Zhang, G., Ma, Y., Chen, X., Yin, H., Lin, C., Ragni, A., Benetos, E., Gyenge, N., Dannenberg, R., Liu, R., Chen, W., Xia, G., Shi, Y., Huang, W., Wang, Y., Guo, Y., Fu, J. (2024). "MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training." ICLR 2024. https://arxiv.org/abs/2306.00107 Accessed 2026-05-31. ↩
Mesaros, A., Heittola, T., Virtanen, T. (2018). "Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge." IEEE/ACM TASLP. https://ieeexplore.ieee.org/document/8123864/ Accessed 2026-05-31. ↩
Ghani, B., Denton, T., Kahl, S., Klinck, H. (2023). "Global birdsong embeddings enable superior transfer learning for bioacoustic classification." Scientific Reports 13. https://www.nature.com/articles/s41598-023-49989-z Accessed 2026-05-31. ↩
van Merriënboer, B., Dumoulin, V., Hamer, J., Harrell, L., Burns, A., Denton, T. (2025). "Perch 2.0: The Bittern Lesson for Bioacoustics." arXiv:2508.04665. https://arxiv.org/abs/2508.04665 Accessed 2026-05-31. ↩
Hagiwara, M., Hoffman, B., Liu, J.-Y., Cusimano, M., Effenberger, F., Zacarian, K. (2022). "BEANS: The Benchmark of Animal Sounds." arXiv:2210.12300. https://arxiv.org/abs/2210.12300 Accessed 2026-05-31. ↩
Warden, P. (2018). "Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition." arXiv:1804.03209. https://arxiv.org/abs/1804.03209 Accessed 2026-05-31. ↩
Valk, J., Alumae, T. (2021). "VoxLingua107: a Dataset for Spoken Language Recognition." IEEE SLT 2021. https://arxiv.org/abs/2011.12998 Accessed 2026-05-31. ↩
Turian, J., Shier, J., Khan, H. R., Raj, B., Schuller, B. W., Steinmetz, C. J., Malloy, C., Tzanetakis, G., Velarde, G., McNally, K., Henry, M., Pinto, N., Noufi, C., Clough, C., Herremans, D., Fonseca, E., Engel, J., Salamon, J., Esling, P., Manocha, P., Watanabe, S., Jin, Z., Bisk, Y. (2022). "HEAR: Holistic Evaluation of Audio Representations." NeurIPS 2021 Competition Track, PMLR 176. https://arxiv.org/abs/2203.03022 Accessed 2026-05-31. ↩
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision (Whisper)." OpenAI. https://github.com/openai/whisper Accessed 2026-05-31. ↩
