Audio Classification Models
Last reviewed
May 13, 2026
Sources
30 citations
Review status
Source-backed
Revision
v2 · 6,378 words
See also: Audio Models and Tasks
Audio classification models are machine learning systems that assign one or more labels to a recording or to short fragments of it. The labels can describe an event in a clip (a dog bark, a siren, a guitar chord), an environment (a restaurant, a forest, a subway), a speaker, a language, or a music tag such as a genre, an instrument, or a mood. The same family of models is also used for time localised tasks where the system has to report when in a recording a given event occurs.
Modern audio classifiers are almost all deep neural networks trained on log-mel spectrograms or raw waveforms. They draw heavily on architectures invented for vision, including VGG style convolutional networks, ResNets, and vision transformers, and they are usually pretrained on AudioSet before being adapted to a target task. The dominant pretrained backbones in 2024 to 2026 include PANNs, the Audio Spectrogram Transformer (AST) and its descendants, BEATs, AudioMAE, EAT, and the CLAP family of contrastive audio language models that support zero shot classification.
Audio classification is a broad family of tasks that share one structural property: a model maps an audio signal, or a short window inside it, to one or more discrete labels. The signal is usually short, on the order of a few hundred milliseconds to thirty seconds, and the label space ranges from two classes to several thousand.
The community usually distinguishes the following sub tasks:
| Sub task | What is predicted | Typical clip length | Representative datasets |
|---|---|---|---|
| Sound event classification | One or more event labels per clip | 1 to 10 seconds | AudioSet, ESC-50, UrbanSound8K, FSD50K |
| Sound event detection (SED) | Labels with onset and offset times | 10 to 60 seconds | DCASE SED tasks, URBAN-SED, DESED |
| Acoustic scene classification | One label describing the environment | 10 to 30 seconds | DCASE ASC, TUT Acoustic Scenes |
| Music tagging | Multi label tags for genre, mood, instrument | About 30 seconds | MagnaTagATune (MTT), Million Song Dataset, MTG-Jamendo |
| Speaker identification | Speaker label from a closed set | 1 to 8 seconds | VoxCeleb1, VoxCeleb2 |
| Speaker verification | Same speaker yes or no | Pairs of 1 to 8 second utterances | VoxCeleb, NIST SRE |
| Spoken language identification | Language label | 3 to 30 seconds | VoxLingua107, CommonLanguage |
| Keyword spotting | One of a fixed vocabulary | 1 second | Google Speech Commands |
| Bioacoustic classification | Species or call type | a few seconds | BirdCLEF, BEANS, BirdSet |
Sound event classification and sound event detection are related but they are not the same problem. A classifier is satisfied by a clip level decision, for example "this recording contains a baby crying somewhere inside it". A detector also has to provide the onset and offset times of each event in continuous time, which is harder and is usually evaluated by an event based F score rather than mean average precision.
The key design choices for an audio classifier are the input representation (raw waveform, short time Fourier transform, log mel spectrogram, MFCC, learnable filterbank), the backbone (CNN, transformer, conformer, state space model), the temporal pooling that produces a clip level decision from a sequence of frame level features, and the training objective. Most state of the art systems use log mel spectrograms because they have proved hard to beat as input, and the dominant pooling is either global average pooling for CNNs or a CLS token for transformers.
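The temporal pooling step can be illustrated with a minimal sketch. It assumes frame level features are already available as plain Python lists; the function names and toy values are illustrative and do not come from any particular library.

```python
import math

def global_average_pool(frames):
    """Average frame level feature vectors into one clip level vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def attention_pool(frames, scores):
    """Weight each frame by a softmax over per frame scores, then sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]

# Three frames of a hypothetical two dimensional feature.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
clip_avg = global_average_pool(frames)
# With equal scores, attention pooling reduces to the plain average.
clip_att = attention_pool(frames, [0.0, 0.0, 0.0])
```

In a trained model the per frame scores would come from a learned layer, so the network can emphasise the frames that actually contain the event.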
Before deep learning, audio classification systems were built on a long pipeline that started with hand engineered features and ended with a probabilistic or kernel classifier. The most common feature was the mel frequency cepstral coefficient (MFCC), first proposed for speech recognition by Davis and Mermelstein in 1980. MFCCs are produced by computing a short time Fourier transform, mapping the magnitude spectrum to a mel filterbank that imitates the frequency selectivity of the human cochlea, taking the logarithm, and applying a discrete cosine transform that decorrelates the resulting bands.
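The MFCC steps described above can be sketched in pure Python for a single frame. This is an illustrative sketch, not a reference implementation: the mel formula is the common HTK style variant, and the Hann window, the filter count, and the toy 440 Hz test signal are all assumptions chosen for the example.

```python
import cmath
import math

def hz_to_mel(hz):
    # HTK style mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def magnitude_spectrum(frame):
    """Naive DFT magnitude for bins 0..N/2 of a Hann windowed frame."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    hz_points = [mel_to_hz(i * top / (n_filters + 1))
                 for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = hz_points[j - 1], hz_points[j], hz_points[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct2(values, n_coeffs):
    """Unnormalised DCT-II, which decorrelates the log mel energies."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    spectrum = magnitude_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    energies = [sum(w * s for w, s in zip(row, spectrum)) for row in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]  # avoid log(0)
    return dct2(log_energies, n_coeffs)

sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(256)]
coeffs = mfcc(frame, sample_rate)  # 13 coefficients for one frame
```

A production system would use an FFT and process overlapping frames in a batch; the naive DFT here only keeps the example free of dependencies.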
On top of MFCCs the standard classifier for decades was the Gaussian mixture model (GMM), in particular the GMM-UBM (universal background model) framework introduced by Reynolds and others in the early 2000s for speaker recognition. Closely related were hidden Markov models for sequence labelling, support vector machines with kernels over bags of frames, and the i-vector approach of Dehak and colleagues from 2010, which gave a fixed dimensional embedding for a recording and dominated speaker verification until the deep learning takeover around 2016.
The MFCC plus GMM stack works well when there is little data and when the target classes are well separated, but it does not scale to the long tailed multi label problems that the field now cares about, and it requires careful hand tuning.
The shift to deep learning came in two waves. The first wave applied small convolutional networks to mel spectrograms and showed that they beat the MFCC stack on clean academic benchmarks such as ESC-50 and UrbanSound8K. Piczak's 2015 baseline CNN on ESC-50 is the canonical reference for this wave.
The second wave, starting in 2016 and 2017, scaled the approach by training on YouTube. Aytar, Vondrick, and Torralba presented SoundNet at NeurIPS 2016, a 1D CNN trained on roughly two million unlabelled videos using a teacher student loss that distilled image classifier predictions on the video frames into the audio stream. SoundNet pushed ESC-50 accuracy by more than ten points and is one of the earliest demonstrations of cross modal self supervision for audio.
In 2017 Hershey and colleagues at Google published "CNN Architectures for Large-Scale Audio Classification" at ICASSP, training AlexNet, VGG, Inception, and ResNet variants on 70 million video soundtracks from YouTube with 30,871 video level labels. The VGG variant became known as VGGish and Google released it as a 128 dimensional embedding model. VGGish became a default audio embedding for transfer learning and is still widely used.
The same group released AudioSet in 2017 (Gemmeke et al., ICASSP 2017). At release it was a curated set of 1,789,621 ten second YouTube clips annotated with labels from a hierarchical ontology of 632 sound classes; later releases list about 2.1 million clips. AudioSet is by far the largest publicly available sound event dataset and almost every modern audio classifier is pretrained on it. Google later released YAMNet (Plakal and Ellis), a MobileNetV1 trained on AudioSet that predicts 521 of the 527 leaf classes, with 3.7 million weights and 69.2 million multiplies per 960 millisecond frame, which makes it about twenty times smaller than VGGish.
The most influential pure CNN system on AudioSet is the family of Pretrained Audio Neural Networks (PANNs) by Kong, Cao, Iqbal, Wang, Wang, and Plumbley, published in IEEE/ACM Transactions on Audio, Speech and Language Processing in 2020. PANNs comprises fifteen architectures, including CNN10, CNN14, ResNet38, MobileNetV1 and V2, and the Wavegram-Logmel-CNN that combines a learnable 1D front end with a 2D log mel branch. CNN14 reaches a mean average precision of 0.431 on AudioSet tagging and Wavegram-Logmel-CNN reaches 0.439, both of which improved the previous state of the art (0.392). The PANN checkpoints are still a common transfer learning baseline.
The transformer wave arrived in 2021 with the Audio Spectrogram Transformer (AST) by Yuan Gong, Yu-An Chung, and James Glass at MIT (Interspeech 2021). AST is the first convolution free model to set state of the art on audio classification benchmarks: 0.485 mAP on AudioSet, 95.6 percent accuracy on ESC-50, and 98.1 percent on Speech Commands V2. AST tokenises a log mel spectrogram into 16 by 16 patches with a small stride and feeds them through a ViT initialised from ImageNet. The choice to initialise from ImageNet was crucial; without it the model overfits.
AST kicked off a long line of variants. PaSST (Koutini et al., Interspeech 2022) keeps the AST architecture but adds patchout, a structured dropout over time and frequency patches that shortens the input sequence during training, which gives a 4x training speedup and a small accuracy gain over a single AST model. HTS-AT (Chen et al., ICASSP 2022) introduces a hierarchical Swin Transformer style backbone with a token semantic module that allows the same network to perform clip level classification and time localised detection. HTS-AT reaches state of the art on AudioSet, ESC-50, and Speech Commands V2 while using only 35 percent of the parameters and 15 percent of the training time of AST.
The self supervised variants followed. SSAST (Gong et al., AAAI 2022) pretrains the AST backbone on unlabelled AudioSet and LibriSpeech with a joint discriminative and generative masked spectrogram patch modelling objective. It improves over an AST trained from scratch by 60.9 percent on average across audio and speech tasks, which brings it to results comparable to or better than a supervised pretrained AST. AudioMAE (Huang et al., NeurIPS 2022) is the audio version of the masked autoencoder of He et al.; it masks 80 percent of patches, uses a ViT encoder on the unmasked patches, and a ViT decoder with local window attention to reconstruct the spectrogram. AudioMAE set new state of the art on six classification benchmarks at the time of publication.
BEATs (Chen et al., ICML 2023, Microsoft) closes the gap between masked reconstruction and discrete token prediction. Instead of reconstructing raw spectrogram pixels, BEATs trains an acoustic tokenizer in alternation with the SSL model, and the model predicts the tokenizer indices on the masked patches. This iterative co training reached 50.6 percent mAP on AudioSet-2M without external data and 98.1 percent accuracy on ESC-50. EAT (Chen et al., IJCAI 2024) introduces an utterance frame objective and large inverse block masks, and reduces pretraining time by 15x compared to BEATs iter 3 and 10x compared to AudioMAE while matching or exceeding their downstream scores.
In parallel, large speech self supervised models such as wav2vec 2.0, HuBERT, and WavLM became standard backbones for tasks that involve voice rather than general sound. WavLM in particular is the dominant backbone for speaker recognition, speaker diarisation, and emotion recognition on SUPERB and related benchmarks, because its pretraining objective explicitly preserves speaker identity.
The latest direction is contrastive language audio pretraining. The two original CLAP papers, by Elizalde and colleagues at Microsoft and by Wu and colleagues at LAION, both appeared at ICASSP 2023 and extended the CLIP recipe from images to audio. CLAP models are not classifiers in the strict sense; they project audio and text into a shared embedding space, and a class label is just a text prompt at inference time. This enables zero shot audio classification on arbitrary label sets.
The table below summarises the most widely used audio classification backbones in 2024 to 2026. AudioSet mAP refers to the balanced evaluation set unless noted otherwise.
| Model | Year | Authors | Architecture | Pretraining | AudioSet mAP | Parameters |
|---|---|---|---|---|---|---|
| VGGish | 2017 | Hershey et al. (Google) | VGG-A 11 layer CNN | YouTube-100M | not reported on AudioSet | ~62M, 128-d embedding |
| YAMNet | 2019 | Plakal and Ellis (Google) | MobileNetV1 | AudioSet | balanced mAP 0.306 | 3.7M |
| PANNs CNN14 | 2020 | Kong et al. | 14 layer CNN | AudioSet | 0.431 | 80.8M |
| PANNs Wavegram-Logmel-CNN | 2020 | Kong et al. | 1D + 2D CNN | AudioSet | 0.439 | 81.1M |
| AST | 2021 | Gong, Chung, Glass | ViT-B from ImageNet | AudioSet | 0.485 | 88M |
| PaSST | 2022 | Koutini et al. | AST with patchout | AudioSet | 0.471 | 86M |
| HTS-AT | 2022 | Chen et al. | Swin-style hierarchical | AudioSet | 0.471 | 31M |
| SSAST | 2022 | Gong et al. | AST, self-supervised MSPM | AudioSet + LibriSpeech | 0.310 (linear) | 89M |
| AudioMAE | 2022 | Huang et al. (Meta) | ViT MAE | AudioSet | 0.473 | 86M |
| BEATs iter 3 | 2023 | Chen et al. (Microsoft) | ViT, acoustic tokenizer | AudioSet | 0.486 to 0.506 | 90M |
| EAT | 2024 | Chen et al. | ViT, utterance-frame objective | AudioSet | 0.488 | 88M |
Most of these models are released as Hugging Face checkpoints and can be fine tuned in a few hours on a single GPU. The standard procedure is to load the AudioSet pretrained weights, replace the classification head with a layer sized to the new label set, and train with binary cross entropy if the labels are multi label or softmax cross entropy if a single label per clip is expected.
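The choice between the two training objectives can be made concrete with a small sketch. It assumes the classification head has already produced one logit per class; the function names and example values are illustrative.

```python
import math

def binary_cross_entropy(logits, targets):
    """Multi label loss: an independent sigmoid per class, averaged over classes."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)

def softmax_cross_entropy(logits, target_index):
    """Single label loss: classes compete through one softmax."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]

logits = [2.0, -1.0, 0.5]  # hypothetical head outputs for three classes

# Multi label clip: classes 0 and 2 are both present.
multi_label_loss = binary_cross_entropy(logits, [1, 0, 1])

# Single label clip: only class 0 is correct.
single_label_loss = softmax_cross_entropy(logits, 0)
```

With the sigmoid formulation, raising the score of one class does not lower the others, which is what a clip containing several simultaneous events needs.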
VGGish is a VGG-A style 11 layer CNN that takes a 96 by 64 log mel patch as input and produces a 128 dimensional embedding. It was trained on a private corpus called YouTube-100M to predict video level topic labels, not audio specific ones, so its embedding is a general purpose feature rather than a classifier per se. VGGish remains the default "give me a feature vector for a clip" model in many production systems, partly because of inertia and partly because its small embedding size makes downstream learning cheap.
YAMNet is a MobileNetV1 trained directly on AudioSet to predict 521 of its 527 leaf classes. It runs on a 0.96 second window with a 0.48 second hop, produces frame level posteriors, and is shipped in TensorFlow Hub, TFLite, and PyTorch ports. On the AudioSet evaluation set YAMNet reaches a balanced average d-prime of 2.318, a balanced mAP of 0.306, and an lwlrap of 0.393. It is small enough to run on a phone and is the recommended starting point for keyword spotting style applications that need on device inference.
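The framing arithmetic implied by a 0.96 second window and a 0.48 second hop can be sketched as follows. The helper name is hypothetical, and the sketch assumes that a trailing partial window is simply dropped rather than padded, which may differ from what a given YAMNet port does.

```python
def frame_starts_ms(clip_ms, window_ms=960, hop_ms=480):
    """Start times of every full analysis window that fits in the clip."""
    if clip_ms < window_ms:
        return []
    n_frames = (clip_ms - window_ms) // hop_ms + 1
    return [i * hop_ms for i in range(n_frames)]

starts = frame_starts_ms(10_000)
# 19 full windows fit in a 10 second clip; the last one starts at 8640 ms.
```

Working in integer milliseconds avoids floating point rounding at the window boundaries.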
PANNs is the most comprehensive open release of audio classifiers trained on AudioSet. The repository includes CNN10, CNN14, ResNet22, ResNet38, ResNet54, MobileNetV1 and V2, DaiNet, LeeNet, Res1dNet, and Wavegram-Logmel-CNN. CNN14 has become the de facto baseline for transfer learning to ESC-50, UrbanSound8K, FSD50K, GTZAN, RAVDESS, and other small datasets because its roughly 80 million parameters and 0.431 AudioSet mAP set a sensible accuracy/size tradeoff. The PANN paper also introduced the Wavegram front end, a learnable 1D convolution stack that produces a spectrogram like representation directly from the waveform and that, when concatenated with a log mel branch, gives the best PANN at 0.439 mAP.
The Audio Spectrogram Transformer treats audio classification as a vision problem. It cuts the 128 bin log mel spectrogram of a 10 second clip into 16 by 16 patches with an overlap of 6 in both time and frequency (a stride of 10), flattens them into a sequence of patch tokens, prepends a class token, adds learnable positional embeddings, and runs the sequence through a ViT-B with 12 transformer blocks. The output of the class token is fed into a linear classifier. The trick that made AST work was to initialise the ViT from an ImageNet pretrained checkpoint and to interpolate the positional embeddings to fit the audio patch grid. Without this initialisation the model overfits AudioSet.
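The size of the resulting token sequence follows from simple arithmetic. The sketch below assumes a stride of 10 in both dimensions (a 16 by 16 patch with an overlap of 6, as in the AST paper), a 10 millisecond frame hop so that a 10 second clip gives about 1000 frames, and no padding; the helper name is hypothetical.

```python
def patch_grid(n_mels, n_frames, patch=16, stride=10):
    """Number of overlapping patches along frequency and time."""
    n_freq = (n_mels - patch) // stride + 1
    n_time = (n_frames - patch) // stride + 1
    return n_freq, n_time

n_freq, n_time = patch_grid(n_mels=128, n_frames=1000)
n_patches = n_freq * n_time        # 12 * 99 = 1188 patch tokens
sequence_length = n_patches + 1    # plus the class token
```

A sequence of more than a thousand tokens is several times longer than the one a ViT sees for a 224 by 224 image with non overlapping 16 by 16 patches, which is why the positional embeddings have to be interpolated and why later variants try to shorten the sequence.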
PaSST modifies AST by structured patchout, dropping out either random patches or entire time or frequency rows during training. This shortens the input sequence length and acts as a regulariser. The reported speedup is about 4x compared to AST on the same hardware. Its single model score of 0.471 mAP is below the 0.485 that AST reports for an ensemble, but above AST's single model result of 0.459. HTS-AT replaces the flat ViT with a hierarchical Swin Transformer style backbone that pools tokens across stages and ends with a token semantic module that maps the final tokens back to per class spectrotemporal maps, which makes the same network useful for sound event detection.
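Structured patchout can be sketched as an operation on the grid of patch positions. This is a simplified illustration rather than the PaSST implementation: the function name, the 12 by 99 grid, and the number of dropped rows and columns are assumptions made for the example.

```python
import random

def structured_patchout(n_freq, n_time, drop_freq, drop_time, seed=0):
    """Drop whole frequency rows and time columns; return kept (freq, time) positions."""
    rng = random.Random(seed)
    dropped_rows = set(rng.sample(range(n_freq), drop_freq))
    dropped_cols = set(rng.sample(range(n_time), drop_time))
    return [(f, t)
            for f in range(n_freq)
            for t in range(n_time)
            if f not in dropped_rows and t not in dropped_cols]

kept = structured_patchout(n_freq=12, n_time=99, drop_freq=4, drop_time=40)
# (12 - 4) * (99 - 40) = 472 of the 1188 patches remain for this training step.
```

Because self attention cost grows quadratically with sequence length, removing more than half of the tokens during training accounts for most of the speedup; at inference time the full grid is used.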
SSAST, AudioMAE, BEATs, and EAT are all self supervised models that share the AST or ViT backbone but differ in the pretraining objective. SSAST uses joint discriminative and generative masked spectrogram patch modelling. AudioMAE uses pure masked reconstruction in pixel space with a high masking ratio (80 percent) and a local window attention decoder. BEATs uses an iterative tokenizer co training procedure where the model and the acoustic tokenizer take turns improving each other; the third iteration produces the strongest released checkpoints, reaching 50.6 percent mAP on AudioSet-2M. EAT applies the data2vec 2.0 bootstrap recipe to audio with an utterance frame objective and inverse block masking, and was among the most efficient SSL audio models at the time of its publication.
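The high masking ratio used in masked reconstruction can be sketched as a random split of patch indices. The function name and the 512 patch count are hypothetical; the sketch shows only unstructured random masking and does not cover the encoder, decoder, or the block masks used by other models.

```python
import random

def random_mask(n_patches, mask_percent=80, seed=0):
    """Split patch indices into a visible set (fed to the encoder) and a masked set."""
    rng = random.Random(seed)
    indices = list(range(n_patches))
    rng.shuffle(indices)
    n_masked = n_patches * mask_percent // 100
    masked = sorted(indices[:n_masked])
    visible = sorted(indices[n_masked:])
    return visible, masked

visible, masked = random_mask(512)
# 409 patches are masked; the encoder only processes the other 103.
```

Since the encoder runs on roughly a fifth of the tokens, pretraining is much cheaper per step than supervised training on the full spectrogram.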
Contrastive Language Audio Pretraining (CLAP) is the audio analogue of CLIP. A CLAP model has an audio encoder and a text encoder, and it is trained on pairs of audio clips and natural language captions or tags to maximise the cosine similarity of matching pairs and minimise it for mismatched pairs in a batch. Once trained, the model can classify a clip in a zero shot manner by encoding a list of candidate label prompts ("the sound of a dog barking", "the sound of a car horn") and picking the prompt closest to the audio embedding in the shared space.
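The zero shot decision rule can be sketched with toy vectors. The three dimensional embeddings below are made up for illustration; in a real system they would be produced by the trained audio and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(audio_embedding, prompt_embeddings):
    """Return the prompt closest to the audio embedding, plus all scores."""
    scores = {prompt: cosine(audio_embedding, emb)
              for prompt, emb in prompt_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical text embeddings for two candidate label prompts.
prompt_embeddings = {
    "the sound of a dog barking": [0.9, 0.1, 0.0],
    "the sound of a car horn": [0.1, 0.9, 0.2],
}
audio_embedding = [0.8, 0.2, 0.1]  # hypothetical audio encoder output

label, scores = zero_shot_classify(audio_embedding, prompt_embeddings)
```

Adding a new class only requires encoding one more prompt, which is what makes the approach attractive for open ended label sets.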
The two original CLAP papers both appeared at ICASSP 2023:
| Model | Authors | Audio encoder | Text encoder | Training data | Distinctive feature |
|---|---|---|---|---|---|
| Microsoft CLAP | Elizalde, Deshmukh, Al Ismail, Wang | CNN14 | BERT | 128k audio text pairs | First demonstration of zero shot audio classification across 16 tasks |
| LAION-CLAP | Wu, Chen, Zhang, Hui, Berg-Kirkpatrick, Dubnov | HTS-AT | RoBERTa | LAION-Audio-630K (633,526 pairs) | Feature fusion for variable length audio and keyword to caption augmentation |
LAION-CLAP also released the LAION-Audio-630K dataset, which pools clips from Freesound, AudioCaps, Clotho, BBC Sound Effects, and several smaller sources. The LAION checkpoints are the most widely used CLAP variants and are integrated into Hugging Face transformers under the model identifier laion/clap-htsat-unfused.
Microsoft followed up with CLAP 2023 and CLAP 2024 checkpoints that scale the dataset and the model size and that are released through the msclap package on PyPI. Other CLAP variants include CompA-CLAP, CoLLAP for long form audio, CALM, and music focused MuLan from Google.
CLAP is a flexible tool. It is used for zero shot tagging, for retrieval (given a text query, find the audio clip), for captioning by combining the audio encoder with a language model decoder, and as a frozen embedding for downstream classifiers. The catch is that on a closed label set with enough training data, a supervised fine tuned BEATs or AST still beats CLAP. CLAP wins when the label set is large, open ended, or unknown in advance.
Speaker recognition is the family of tasks that ask who is talking. It splits into speaker identification (closed set, one of N), speaker verification (open set, same or different), and speaker diarisation (who spoke when), and it is one of the longest studied audio classification problems, predating the deep learning era.
The field was dominated for a decade by the i-vector front end of Dehak, Kenny, and others (2010), which used a factor analysis model on GMM supervectors to produce a fixed dimensional embedding for a recording, followed by probabilistic linear discriminant analysis. The first deep learning system to clearly beat i-vectors was x-vectors (Snyder, Garcia-Romero, Sell, Povey, Khudanpur, ICASSP 2018), a time delay neural network that produces a fixed dimensional speaker embedding by statistics pooling over frame level features. X-vectors became the production standard quickly because they handle large training corpora and data augmentation much better than i-vectors.
The next jump was ECAPA-TDNN (Desplanques, Thienpondt, Demuynck, Interspeech 2020). ECAPA-TDNN stands for Emphasised Channel Attention, Propagation and Aggregation in Time Delay Neural Networks. It introduces three changes on top of x-vectors: Res2Net blocks with multi scale temporal convolutions, squeeze and excitation channel attention, and channel dependent attentive statistics pooling. On VoxCeleb1-O it reaches an equal error rate (EER) below one percent, and it remains a strong baseline through 2026. The SpeechBrain implementation speechbrain/spkrec-ecapa-voxceleb on Hugging Face is the most downloaded speaker embedding checkpoint.
From 2022 onward the trend has been to use general purpose speech self supervised models as backbones rather than to train speaker specific networks from scratch. WavLM Large fine tuned on VoxCeleb2 set a new state of the art on the SUPERB speaker tasks and on VoxCeleb evaluation sets, partly because its pretraining objective explicitly preserves speaker information. NVIDIA's NeMo TitaNet is another strong production model, although it is a supervised convolutional network closer to the ECAPA-TDNN lineage than to the self supervised one.
The metrics of record for verification are the equal error rate (EER) and the minimum detection cost function (minDCF). Standard test sets include VoxCeleb1-O, VoxCeleb1-E, VoxCeleb1-H, and the NIST SRE series.
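The equal error rate is the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch, assuming higher scores mean "same speaker" and using a simple threshold sweep without interpolation (the function name and toy scores are illustrative):

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Sweep thresholds and return the error rate where FAR and FRR are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for th in thresholds:
        # False acceptance: a different speaker trial scored at or above the threshold.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: a same speaker trial scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

target_scores = [0.9, 0.8, 0.7, 0.3]      # same speaker trials
nontarget_scores = [0.6, 0.4, 0.2, 0.1]   # different speaker trials
eer = equal_error_rate(target_scores, nontarget_scores)
# At threshold 0.6 one of four trials is wrong on each side, so the EER is 0.25.
```

Evaluation toolkits usually interpolate between thresholds on large trial lists; the sweep above is enough to show what the number means.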
Music tagging is the task of assigning multi label tags such as genre, instrument, mood, and tempo to a music clip. The classical benchmarks are MagnaTagATune (MTT) with 188 tags on ~25,000 clips of 29 seconds, the Million Song Dataset (MSD) with 50 tags on a million clips, and MTG-Jamendo with 195 tags on 55,000 full tracks.
The progression of music tagging models mirrors the wider field. The deep learning baseline is musicnn (Pons and Serra, 2019), a musically motivated CNN that uses vertical filters to capture timbre and horizontal filters to capture rhythm, with an attention output layer. musicnn reaches 90.77 ROC-AUC and 38.61 PR-AUC on MTT and 88.81 ROC-AUC and 31.51 PR-AUC on MSD. The library musicnn ships pretrained MTT and MSD checkpoints and is still a useful baseline. PANNs CNN14 and AST adapted to music tagging match or exceed musicnn on the same benchmarks.
The current generation of music tagging models is self supervised. JukeMIR (Castellon, Donahue, Liang, 2021) used internal representations from OpenAI's Jukebox music generation model as music features and matched supervised baselines on tag and chord recognition. MERT (Li, Yuan, Zhang et al., ICLR 2024) is a dedicated music self supervised model. It is a HuBERT style BERT encoder trained with two teachers: an RVQ-VAE acoustic teacher and a Constant-Q Transform musical teacher. MERT scales from 95 million to 330 million parameters and reaches state of the art on a 14 task suite covering tagging, key detection, beat tracking, source separation, and singer identification.
MuLan (Huang et al., Google, 2022) and the LAION-CLAP music branch apply the same contrastive audio text idea to music, with text queries such as "jazz piano with brushes" or "upbeat indie pop with female vocals". MuLan supplies the joint music text embedding that conditions MusicLM and is one of the most influential music representations in the recent generation of music generators.
Sound event detection (SED) requires the model to output not just which events are present but also when each one starts and ends, frame by frame or in continuous time. SED is harder than clip level classification because the model has to deal with overlapping events, with labels of very different duration, and with weak or missing time annotations.
The SED community is organised around the DCASE (Detection and Classification of Acoustic Scenes and Events) challenge, which was first held in 2013, has run annually since 2016, and has produced most of the canonical SED datasets. DCASE 2016 was a turning point because it included separate tasks for acoustic scene classification, sound event detection in synthetic audio, sound event detection in real life audio, and domestic audio tagging, and it documented the shift from GMM and SVM systems to deep learning. The corresponding paper by Mesaros and colleagues, "Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge" (IEEE/ACM TASLP 2018), is the standard citation for SED methodology.
Key datasets and tasks that come out of DCASE include:
| Dataset / task | First year | Domain | Annotation |
|---|---|---|---|
| TUT Acoustic Scenes | 2016 | Indoor and outdoor scenes | Clip level scene label |
| TUT Sound Events | 2016 | Home and residential area | Strong (onset/offset) |
| DESED (Domestic Environment Sound Event Detection) | 2018 | Home | Mix of strong and weak labels |
| URBAN-SED | 2017 | Urban | Strong, synthetic |
| FSD50K | 2020 | Freesound clips | Clip level only, AudioSet style |
| MAESTRO Real (DCASE 2023 Task 4) | 2023 | Real and synthetic mixtures | Strong with soft labels |
Modern SED systems are usually built on top of a clip level pretrained backbone such as PANNs or BEATs, with an attention pooling head that learns to localise the relevant frames. HTS-AT, with its token semantic module, was specifically designed for this dual use case. The state of the art on DESED hovers around 0.5 to 0.6 in the polyphonic sound detection score (PSDS), which is the DCASE primary metric.
Bioacoustics applies audio classification to animal sounds, with the practical aim of biodiversity monitoring and conservation. The community has benefited enormously from transfer learning because labelled bird and bat recordings are scarce.
The most influential model is Google's Perch, released in 2023 by the Google DeepMind bioacoustics team. Perch is an EfficientNet B1 trained on the Xeno-canto bird sound archive with a sequence of contrastive and supervised losses, and it produces an embedding that transfers strongly to dozens of bioacoustic classification benchmarks. The Nature Scientific Reports paper "Global birdsong embeddings enable superior transfer learning for bioacoustic classification" (Ghani, Denton, Kahl, Klinck, 2023) shows that Perch embeddings beat task specific models on many BEANS benchmarks, including frog, bat, and marine mammal calls, despite being trained only on birds. Perch has been downloaded more than 250,000 times and is integrated into Cornell's BirdNET Analyzer and into the Conservation Metrics field tools.
Perch 2.0, released in 2025, scales the model to support roughly 15,000 species and uses an updated training pipeline that combines species classification with self distillation. It sets new state of the art on the BirdSet and BEANS benchmarks and is recommended for any new bioacoustic project.
Other bioacoustic models include BirdNET (Kahl et al., Cornell, 2021), which is widely deployed in citizen science through its own BirdNET app, and AnimalNet for a broader range of taxa. The BEANS (BEnchmark of ANimal Sounds) suite by Hagiwara et al. (2022) is the standard evaluation for bioacoustic embeddings.
The community uses several standard benchmarks to compare audio classification models. They differ in label space, clip length, domain, and licence.
| Benchmark | Year | Size | Classes | Task | Metric |
|---|---|---|---|---|---|
| AudioSet | 2017 | 2.1M clips, 10 s | 527 leaf (632 ontology) | Multi label tagging | mAP |
| ESC-50 | 2015 | 2,000 clips, 5 s | 50 | Single label | Accuracy, 5 fold CV |
| ESC-10 | 2015 | 400 clips, 5 s | 10 | Single label | Accuracy |
| UrbanSound8K | 2014 | 8,732 clips, up to 4 s | 10 | Single label | Accuracy, 10 fold CV |
| FSD50K | 2020 | 51,197 clips | 200 | Multi label | mAP |
| Speech Commands V2 | 2018 | 105,829 clips, 1 s | 35 | Keyword | Accuracy |
| VoxLingua107 | 2021 | 6,628 hours | 107 languages | Single label | Accuracy |
| VoxCeleb1 | 2017 | 153,516 utterances | 1,251 speakers | Identification, verification | EER |
| MagnaTagATune | 2009 | ~25,000 clips, 29 s | 188 tags (top 50 used) | Multi label | ROC-AUC, PR-AUC |
| MTG-Jamendo | 2019 | 55,000 tracks | 195 tags | Multi label | ROC-AUC, PR-AUC |
| BirdCLEF | 2014+ | ~700k recordings | 10,000+ species | Multi label | macro F1 |
| BEANS | 2022 | 12 datasets | many | Bioacoustic | Mean of per task metrics |
| HEAR 2021 | 2022 | 19 tasks, 16 datasets | mixed | General | Mean of per task metrics |
| SUPERB | 2021 | 10 speech tasks | mixed | Speech | Mean of per task metrics |
| X-ARES | 2024 | 22 tasks | mixed | General | Mean of per task metrics |
HEAR (Holistic Evaluation of Audio Representations) by Turian and colleagues at NeurIPS 2021 was the first attempt at a unified audio representation benchmark across speech, environmental sound, and music. Twenty nine submitted models from thirteen teams were evaluated on nineteen tasks from sixteen datasets, with a common API that takes an audio file and returns an embedding. HEAR remains the most cited general audio representation benchmark, though its tasks are now considered relatively easy and recent benchmarks such as X-ARES extend it.
Audio classification is used in a wide range of practical systems.
On device keyword spotting and wake word detection. Every consumer smart speaker and most smartphones run a small audio classifier continuously on device, looking for the wake word ("Alexa", "Hey Siri", "OK Google"). These detectors are typically tiny CNNs or RNNs with a few hundred thousand parameters and a memory footprint under a megabyte. NVIDIA's MatchboxNet and the open source PocketSphinx keyword spotter are representative.
Hearing aids and accessibility. Modern hearing aids use real time sound scene classifiers to decide whether the wearer is in a quiet room, in a restaurant, in traffic, or listening to music, and they adjust filtering and compression accordingly. Cochlear implant processors use similar classifiers for environment detection.
Industrial machine monitoring. Sound based predictive maintenance for industrial equipment uses anomaly detection on top of audio classification. DCASE has run a dedicated task on this since 2020 with the MIMII and ToyADMOS datasets, where the model has to flag malfunctioning pumps, valves, fans, and slide rails from their sound.
Surveillance and public safety. Gunshot detection systems such as ShotSpotter use distributed microphone arrays and a classifier trained on impulsive sounds to localise gunfire in urban areas. Glass break detectors in alarm systems are simpler classifiers of the same family.
Wildlife monitoring. Passive acoustic monitoring of forests, wetlands, and oceans relies on bioacoustic classifiers such as BirdNET and Perch. The 2024 BirdCLEF challenge had recordings from passive monitoring sites in the Western Ghats of India, and the winning system used a Perch embedding plus a small classifier head.
Music streaming and recommendation. Spotify, Apple Music, and YouTube Music all use audio tagging models internally for genre, mood, and instrument tagging, for cold start recommendations on tracks without listening history, and for automatic playlist generation. Approximate nearest neighbour retrieval over learned audio embeddings, using libraries such as Spotify's Annoy or Meta's Faiss, is one production pattern.
Content moderation. Platforms that host user uploaded video use audio classification to flag content categories such as violent sounds or certain types of music, and to identify the language of the speech track before it is run through ASR.
Voice biometrics. Banks and call centres use speaker verification systems based on ECAPA-TDNN or WavLM to confirm a caller's identity. The accuracy is high enough for low risk operations but is usually combined with other factors for high value transactions.
Audio classifiers inherit most of the limitations of any deep learning system trained on internet data and add a few that are specific to audio.
Label noise on AudioSet is significant. The original paper reports rater unanimity of 76.2 percent, which means roughly a quarter of segments have at least one rater disagreeing on at least one label. Single label datasets such as ESC-50 are cleaner but small.
Domain shift hurts. A model trained on YouTube audio sees mostly studio recorded or amateur near field speech and music. It often fails on far field recordings, on heavy reverberation, on unusual microphones, and on languages or accents that are underrepresented in the training data. Bioacoustic transfer to new geographies is a constant problem; a model trained on North American birds will miss many South American species.
The AudioSet ontology is biased toward sounds that are common on YouTube. Many real world sound categories, for example specific medical sounds, industrial sounds, and rare wildlife calls, are absent or only weakly represented. Its granularity is also uneven: some categories are split into fine grained children (dog sounds, for instance, are divided into bark, howl, growl, whimper, and others), while other perceptually diverse sounds share a single node.
Privacy is a real concern. Always on audio classifiers running on phones and smart speakers raise legitimate questions about what data leaves the device and what is retained. Some manufacturers store wake word triggers for service improvement, which has led to several public controversies. Speaker recognition systems are also subject to spoofing attacks (replay, voice conversion, deepfakes), and the ASVspoof challenge series has documented how easy it is to fool naive verification systems.
Evaluation metrics underreport failure modes. mAP on AudioSet gives a single number that hides large per class variation; rare classes are often near random. F score on SED is sensitive to the time tolerance used in the evaluation, and small changes in the metric can swap the ranking of models.
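How a single mAP figure can hide a weak class is easy to see in a toy calculation. The sketch below uses made up scores for two hypothetical classes over four clips and a plain ranking based average precision without tie handling.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive clip."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical classifier outputs over the same four clips.
common_class_ap = average_precision([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])  # 1.0
rare_class_ap = average_precision([0.2, 0.9, 0.4, 0.1], [0, 0, 0, 1])    # 0.25

mean_ap = (common_class_ap + rare_class_ap) / 2  # 0.625
```

The mean of 0.625 looks respectable even though the rare class is ranked last among all clips, which is why per class breakdowns are worth reporting alongside mAP.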
Compute and energy costs are not trivial. Training BEATs iter 3 on AudioSet takes roughly a thousand A100 GPU hours, which is expensive in dollars and in carbon. Most academic labs cannot reproduce these runs and depend on the released checkpoints.
Finally, zero shot CLAP style models are tempting but they have a hidden cost. Prompt engineering matters a lot. The prompt "the sound of a dog barking" gives different results from "a dog barking" or "dog", and small differences can change the rank order on a benchmark. Calibration of CLAP scores across classes is also not well behaved out of the box.