SUPERB, which stands for Speech processing Universal PERformance Benchmark, is a comprehensive evaluation framework designed to measure how well self-supervised learning (SSL) models generalize across a diverse set of speech processing tasks. Introduced by Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, and colleagues in 2021, SUPERB provides a standardized leaderboard where researchers can compare pretrained speech models under uniform conditions. The benchmark was presented at Interspeech 2021 and has since become one of the most widely adopted evaluation standards in the speech processing community.
SUPERB addresses a gap that previously existed in speech research: while natural language processing had well-established benchmarks like GLUE and SuperGLUE for evaluating pretrained language models, no comparable unified benchmark existed for evaluating pretrained speech representations across multiple downstream tasks. By collecting ten distinct tasks spanning content recognition, speaker characterization, semantic understanding, and paralinguistic analysis, SUPERB enables systematic comparison of speech SSL models in a way that reveals both their strengths and limitations.
The rapid development of self-supervised learning techniques for speech processing created an urgent need for standardized evaluation. Before SUPERB, individual research papers typically reported results on only one or two downstream tasks, making it difficult to compare the overall quality of different pretrained representations. A model that excelled at automatic speech recognition might perform poorly on speaker verification, but without a unified evaluation protocol, these tradeoffs were not visible.
SUPERB draws direct inspiration from NLP benchmarks. Just as GLUE unified evaluation of sentence understanding capabilities for models like BERT, SUPERB aims to serve the same function for speech SSL models such as Wav2Vec 2.0 and HuBERT. The core idea is simple: take a frozen pretrained model, attach lightweight task-specific prediction heads, and measure performance across all ten tasks. This setup isolates the quality of the learned representations from the complexity of downstream architectures.
The benchmark is tightly integrated with the S3PRL (Self-Supervised Speech Pre-training and Representation Learning) toolkit, an open-source framework that provides reproducible training scripts and standardized evaluation pipelines for all SUPERB tasks. S3PRL supports dozens of pretrained speech models, making it straightforward for researchers to submit new models to the leaderboard.
SUPERB is built around several key design principles that ensure fair and meaningful comparisons between models:
Frozen representations. During evaluation, the parameters of the pretrained model are frozen. Only the lightweight downstream prediction head is trained. This ensures that performance differences reflect the quality of the pretrained representations rather than the capacity of fine-tuned models.
Learnable weighted sum. Rather than using only the final layer of the pretrained model, SUPERB employs a trainable weighted-sum mechanism across all hidden layers. This acknowledges that different layers capture different types of information, and the optimal combination may vary by task. The weights are learned alongside the downstream head.
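A minimal sketch of this featurizer mechanism (class and attribute names here are illustrative, not taken from the SUPERB codebase):

```python
import torch

# Sketch of a learnable weighted-sum featurizer: hidden states from all
# L layers of a frozen upstream model are combined with softmax-normalized
# trainable weights, which are learned jointly with the downstream head.
class WeightedSum(torch.nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # One scalar weight per layer; zeros give a uniform average at init.
        self.raw_weights = torch.nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of L tensors, each (batch, time, dim)
        stacked = torch.stack(hidden_states, dim=0)        # (L, B, T, D)
        w = torch.softmax(self.raw_weights, dim=0)         # (L,)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)
```

Because the weights are softmax-normalized, inspecting them after training reveals which layers each task relies on most.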
Minimal downstream models. The prediction heads are intentionally simple. For classification tasks, a single linear layer is used. For sequence labeling tasks, a small BLSTM (bidirectional LSTM) network is employed. For speaker verification, an x-vector architecture is used. This minimalism ensures that the benchmark measures representation quality, not downstream model engineering.
Standardized hyperparameters. Training schedules, learning rates, and other hyperparameters are fixed across all pretrained models for a given task. This eliminates a major source of unfair comparison in prior work.
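Taken together, the protocol can be sketched as below, with a toy GRU standing in for the frozen upstream model (in practice a pretrained SSL model loaded via S3PRL) and a mean-pooled linear head of the kind used for utterance-level classification:

```python
import torch

# Hedged sketch of the SUPERB evaluation protocol with a toy upstream.
upstream = torch.nn.GRU(input_size=40, hidden_size=64, batch_first=True)
for p in upstream.parameters():
    p.requires_grad = False          # frozen: representations stay fixed

head = torch.nn.Linear(64, 10)       # minimal downstream head
optim = torch.optim.Adam(head.parameters(), lr=1e-3)  # head params only

features = torch.randn(8, 100, 40)   # (batch, time, feature dim)
with torch.no_grad():
    reps, _ = upstream(features)     # (8, 100, 64), no gradients tracked
logits = head(reps.mean(dim=1))      # mean pooling -> utterance logits
loss = torch.nn.functional.cross_entropy(
    logits, torch.randint(0, 10, (8,)))
loss.backward()                      # gradients flow into the head only
optim.step()
```

Only the head (and, in the real setup, the layer weights of the weighted sum) receives gradient updates, so leaderboard differences reflect the frozen representations themselves.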
Each SUPERB task uses a specific lightweight prediction head:
| Task | Downstream Model | Output Type |
|---|---|---|
| Phoneme Recognition (PR) | Linear layer | Frame-level phoneme labels |
| Keyword Spotting (KS) | Linear layer (mean pooling) | Utterance-level class |
| Speaker Identification (SID) | Linear layer (mean pooling) | Speaker class |
| Emotion Recognition (ER) | Linear layer (mean pooling) | Emotion class |
| Intent Classification (IC) | Linear layer (mean pooling) | Intent class |
| Automatic Speaker Verification (ASV) | X-vector network | Same/different speaker |
| Speaker Diarization (SD) | Linear layer | Frame-level speaker labels |
| Automatic Speech Recognition (ASR) | 2-layer BLSTM | Character/word sequence |
| Slot Filling (SF) | 2-layer BLSTM | Slot-type sequence |
| Query by Example (QbE) | Dynamic Time Warping (DTW) | Term detection score |
For the Query by Example task, no trainable parameters are used at all. Instead, DTW (Dynamic Time Warping) is applied directly to the SSL representations to measure similarity between the query and candidate audio segments. This makes QbE a pure test of representation quality.
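A minimal DTW sketch under stated assumptions (frame-wise cosine distance and length normalization are illustrative choices, not the benchmark's exact configuration):

```python
import numpy as np

def dtw_distance(query, doc):
    """Align two feature sequences (frames x dims) with dynamic time
    warping and return a length-normalized alignment cost. Lower cost
    means the candidate segment more likely contains the query term."""
    def cos_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    Q, D = len(query), len(doc)
    cost = np.full((Q + 1, D + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, D + 1):
            d = cos_dist(query[i - 1], doc[j - 1])
            # Extend the cheapest of the three predecessor paths.
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match/substitution
    return cost[Q, D] / (Q + D)
```

Since nothing here is trained, the score depends entirely on how well the SSL features place acoustically matching frames close together.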
SUPERB organizes its ten tasks into four categories based on the type of speech information they primarily target: content, speaker, semantics, and paralinguistics.
Phoneme Recognition requires transcribing an utterance into its constituent phonemes, the smallest units of sound that distinguish one word from another in a language. This task evaluates how well pretrained representations capture fine-grained acoustic and phonetic information.
Dataset: LibriSpeech (train-clean-100 / dev-clean / test-clean splits). Phoneme transcriptions are obtained using the LibriSpeech official grapheme-to-phoneme model (g2p-model-5) and the conversion script from the Kaldi LibriSpeech s5 recipe.
Metric: Phone Error Rate (PER), measured as the edit distance between predicted and reference phoneme sequences. Lower is better.
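PER reduces to a Levenshtein (edit) distance over phoneme sequences, normalized by the reference length; WER is the same computation over words. An illustrative sketch (not the benchmark's official scoring tool):

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn the reference sequence into the hypothesis."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all remaining ref labels
    for j in range(n + 1):
        d[0][j] = j                      # insert all hypothesis labels
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n]

ref = ["HH", "AH", "L", "OW"]           # reference phones (illustrative)
hyp = ["HH", "L", "OW", "N"]            # predicted phones
per = edit_distance(ref, hyp) / len(ref)  # 2 edits / 4 phones = 0.5
```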
ASR transcribes spoken utterances into written words. It is one of the most important applications of speech processing and serves as a key indicator of how well a model captures linguistic content.
Dataset: LibriSpeech (train-clean-100 for training; test-clean and test-other for evaluation). Results are reported both without and with a language model (LM) for decoding.
Metric: Word Error Rate (WER). Lower is better.
Keyword Spotting classifies utterances as one of a predefined set of keywords or as unknown/silence. This is a practical task for voice-activated devices that need to detect wake words or simple voice commands.
Dataset: Speech Commands v1.0, which contains one-second audio clips of spoken words recorded by thousands of speakers. The task uses ten keyword classes (such as "yes," "no," "up," "down," "go," "stop") plus two additional classes for unknown words and silence, for twelve classes in total.
Metric: Accuracy (ACC). Higher is better.
QbE detects occurrences of a spoken query term within a database of audio recordings, without any text transcription. Given a short audio clip of a spoken word or phrase, the system must find all matching segments in a larger audio collection.
Dataset: QUESST 2014, specifically the English subset. The evaluation uses the official scoring tools from the MediaEval 2014 benchmark.
Metric: Maximum Term Weighted Value (MTWV), which balances miss rate and false alarm rate. Higher is better.
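For reference, TWV at a single detection threshold is 1 - (P_miss + beta * P_fa), and MTWV is the maximum TWV over all thresholds; beta weights false alarms against misses and is fixed by the evaluation. A hedged sketch with made-up operating points:

```python
def term_weighted_value(p_miss, p_fa, beta):
    """TWV at one detection threshold. A system that detects nothing
    scores 0; a perfect system scores 1; heavy false-alarming can push
    the score below 0. beta is the evaluation's miss/false-alarm
    tradeoff weight (treated here as a free parameter)."""
    return 1.0 - (p_miss + beta * p_fa)

# Sweeping the detection threshold traces a TWV curve; MTWV is its peak.
# The (p_miss, p_fa) pairs and beta below are illustrative only.
curve = [term_weighted_value(pm, pf, beta=20.0)
         for pm, pf in [(0.9, 0.0005), (0.6, 0.002), (0.3, 0.02)]]
mtwv = max(curve)
```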
Speaker Identification classifies each utterance by its speaker identity, selecting from a predefined closed set of speakers. This task tests how well representations capture speaker-specific vocal characteristics such as pitch, timbre, and speaking style.
Dataset: VoxCeleb1, which contains over 100,000 utterances from 1,251 celebrities extracted from YouTube videos. The standard train/test split is used.
Metric: Accuracy (ACC). Higher is better.
Speaker Verification determines whether two utterances were spoken by the same person. Unlike Speaker Identification, ASV is an open-set problem: the model must handle speakers not seen during training.
Dataset: VoxCeleb1, without VoxCeleb2 training data or noise augmentation. The official trial pairs from the VoxCeleb1 test set are used.
Metric: Equal Error Rate (EER), the point at which the false acceptance rate equals the false rejection rate. Lower is better.
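A threshold-sweep sketch of EER (assuming higher scores indicate "same speaker"; production toolkits interpolate between thresholds rather than picking the closest crossing):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sweep decision thresholds over verification scores and return
    the error rate where false acceptances (impostors accepted) and
    false rejections (targets rejected) are closest to equal.
    labels: 1 = same-speaker trial, 0 = different-speaker trial."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[labels == 0])   # false acceptance rate
        frr = np.mean(~accept[labels == 1])  # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```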
Speaker Diarization predicts "who spoke when" by assigning a speaker label to each time frame of a multi-speaker recording. This is particularly challenging because the recordings contain overlapping speech, so the model must attribute frames to multiple simultaneous speakers.
Dataset: LibriMix, generated from LibriSpeech (train-clean-100 / dev-clean / test-clean). LibriMix creates synthetic two-speaker mixtures by combining utterances from different speakers, optionally with ambient noise from the WHAM! dataset.
Metric: Diarization Error Rate (DER). Lower is better.
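A toy frame-level illustration of how DER aggregates misses, false alarms, and speaker confusions (real scoring tools also optimally permute speaker labels and apply forgiveness collars around boundaries, both omitted here):

```python
def frame_der(ref, hyp):
    """ref and hyp are per-frame sets of active speakers (empty set =
    silence). DER = (missed speech + false alarms + speaker confusions)
    divided by total reference speech, counted in frames."""
    miss = fa = conf = speech = 0
    for r, h in zip(ref, hyp):
        speech += len(r)                     # reference speaker-frames
        miss += max(len(r) - len(h), 0)      # speech labeled as silence
        fa += max(len(h) - len(r), 0)        # silence labeled as speech
        conf += min(len(r), len(h)) - len(r & h)  # wrong speaker chosen
    return (miss + fa + conf) / max(speech, 1)
```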
Intent Classification determines the intent behind a spoken utterance. This task is central to voice assistants and spoken dialogue systems, where the system must understand what action the user wants to perform.
Dataset: Fluent Speech Commands, containing 30,043 English utterances from 97 speakers. Each utterance is annotated with three intent labels (action, object, and location), and the model must predict all three correctly for a sample to count as correct.
Metric: Accuracy (ACC). Higher is better.
Slot Filling assigns a semantic slot-type label (such as "destination," "time," or "object") to each word in a spoken utterance. This is the sequence labeling counterpart of Intent Classification and is essential for extracting structured information from speech.
Dataset: Audio SNIPS, a spoken version of the SNIPS text dataset created by synthesizing utterances with multiple speakers. US-accent speakers are used for training, with other accents reserved for validation and testing.
Metrics: F1 score (for slot-type accuracy) and Character Error Rate (CER, for slot-value transcription quality). Higher F1 and lower CER are better.
Emotion Recognition predicts the emotional state of the speaker from a spoken utterance. This task primarily relies on paralinguistic cues such as prosody, pitch variation, speech rate, and voice quality rather than on the lexical content of what is said.
Dataset: IEMOCAP (Interactive Emotional Dyadic Motion Capture), containing approximately 12 hours of audiovisual recordings of dyadic conversations between actors. The standard setup drops the unbalanced emotion classes and keeps four categories: neutral, happy, sad, and angry. Evaluation uses leave-one-session-out five-fold cross-validation over the five recording sessions.
Metric: Accuracy (ACC). Higher is better.
The original SUPERB paper evaluated 14 self-supervised pretrained models plus a baseline, covering a range of pretraining strategies, architectures, and data scales. The models are grouped by their pretraining approach.
FBANK (Log Mel Filterbank) serves as a non-learned baseline. It extracts 80-dimensional log mel filterbank features from raw audio at a 10ms frame rate, representing the audio signal without any self-supervised pretraining. FBANK provides a lower bound on performance, showing what can be achieved with purely handcrafted acoustic features.
Generative SSL models learn by predicting masked or future portions of the input signal:
| Model | Architecture | Parameters | Training Data | Pretraining Method |
|---|---|---|---|---|
| APC | 3-layer GRU | 4.11M | LibriSpeech 360hr | Future prediction (generative) |
| VQ-APC | 3-layer GRU | 4.63M | LibriSpeech 360hr | Future prediction + vector quantization |
| NPC | 4 Conv + 4 Masked Conv | 19.38M | LibriSpeech 360hr | Masked prediction + vector quantization |
| Mockingjay | 12-layer Transformer | 85.12M | LibriSpeech 360hr | Time-masked prediction (generative) |
| TERA | 3-layer Transformer | 21.33M | LibriSpeech 960hr | Time/frequency masked prediction (generative) |
| DeCoAR 2.0 | 12-layer Transformer | 89.84M | LibriSpeech 960hr | Time-masked prediction + vector quantization |
Discriminative SSL models learn by distinguishing between positive and negative samples through contrastive learning objectives:
| Model | Architecture | Parameters | Training Data | Pretraining Method |
|---|---|---|---|---|
| Modified CPC | 5 Conv + 1 LSTM | 1.84M | Libri-Light 60k hr | Future contrastive |
| Wav2Vec | 19 Conv layers | 32.54M | LibriSpeech 960hr | Future contrastive |
| vq-wav2vec | 20 Conv layers | 34.15M | LibriSpeech 960hr | Future contrastive + vector quantization |
| wav2vec 2.0 Base | 7 Conv + 12 Transformer | 95.04M | LibriSpeech 960hr | Masked contrastive + vector quantization |
| wav2vec 2.0 Large | 7 Conv + 24 Transformer | 317.38M | Libri-Light 60k hr | Masked contrastive + vector quantization |
HuBERT uses an offline clustering step to create pseudo-labels, then trains with a masked prediction objective similar to BERT:
| Model | Architecture | Parameters | Training Data | Pretraining Method |
|---|---|---|---|---|
| HuBERT Base | 7 Conv + 12 Transformer | 94.68M | LibriSpeech 960hr | Masked prediction + vector quantization |
| HuBERT Large | 7 Conv + 24 Transformer | 316.61M | Libri-Light 60k hr | Masked prediction + vector quantization |
PASE+ combines multiple pretraining objectives (including waveform reconstruction, contrastive loss, and speaker classification) using a SincNet frontend followed by convolutional and QRNN layers. It has 7.83M parameters and was trained on 50 hours of LibriSpeech data.
The table below presents the complete benchmark results from the original SUPERB paper. For each task, the best result among the evaluated models is highlighted.
| Model | PR (PER) | KS (ACC) | SID (ACC) | ASV (EER) | SD (DER) | ER (ACC) | IC (ACC) | SF (F1) | SF (CER) | ASR (WER) | ASR+LM (WER) | QbE (MTWV) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FBANK | 82.01 | 8.63 | 0.09 | 9.56 | 10.05 | 35.39 | 9.10 | 69.64 | 52.94 | 23.18 | 15.21 | 0.0058 |
| PASE+ | 58.87 | 82.54 | 37.99 | 11.61 | 8.68 | 57.86 | 29.82 | 62.14 | 60.17 | 25.11 | 16.62 | 0.0072 |
| APC | 41.98 | 91.01 | 60.42 | 8.56 | 10.53 | 59.33 | 74.69 | 70.46 | 50.89 | 21.28 | 14.74 | 0.0310 |
| VQ-APC | 41.08 | 91.11 | 60.15 | 8.72 | 10.45 | 59.66 | 74.48 | 68.53 | 52.91 | 21.20 | 15.21 | 0.0251 |
| NPC | 43.81 | 88.96 | 55.92 | 9.40 | 9.34 | 59.08 | 69.44 | 72.79 | 48.44 | 20.20 | 13.91 | 0.0246 |
| Mockingjay | 70.19 | 83.67 | 32.29 | 11.66 | 10.54 | 50.28 | 34.33 | 61.59 | 58.89 | 22.82 | 15.48 | 0.0007 |
| TERA | 49.17 | 89.48 | 57.57 | 15.89 | 9.96 | 56.27 | 58.42 | 67.50 | 54.17 | 18.17 | 12.16 | 0.0013 |
| DeCoAR 2.0 | 14.93 | 94.48 | 74.42 | 7.16 | 6.59 | 62.47 | 90.80 | 83.28 | 34.73 | 13.02 | 9.07 | 0.0406 |
| Modified CPC | 42.54 | 91.88 | 39.63 | 12.86 | 10.38 | 60.96 | 64.09 | 71.19 | 49.91 | 20.18 | 13.53 | 0.0326 |
| Wav2Vec | 31.58 | 95.59 | 56.56 | 7.99 | 9.90 | 59.79 | 84.92 | 76.37 | 43.71 | 15.86 | 11.00 | 0.0485 |
| vq-wav2vec | 33.48 | 93.38 | 38.80 | 10.38 | 9.93 | 58.24 | 85.68 | 77.68 | 41.54 | 17.71 | 12.80 | 0.0410 |
| wav2vec 2.0 Base | 5.74 | 96.23 | 75.18 | 6.02 | 6.08 | 63.43 | 92.35 | 88.30 | 24.77 | 6.43 | 4.79 | 0.0233 |
| wav2vec 2.0 Large | 4.75 | 96.66 | 86.14 | 5.65 | 5.62 | 65.64 | 95.28 | 87.11 | 27.31 | 3.75 | 3.10 | 0.0489 |
| HuBERT Base | 5.41 | 96.30 | 81.42 | 5.11 | 5.88 | 64.92 | 98.34 | 88.53 | 25.20 | 6.42 | 4.79 | 0.0736 |
| HuBERT Large | 3.53 | 95.29 | 90.33 | 5.98 | 5.75 | 67.62 | 98.76 | 89.81 | 21.76 | 3.62 | 2.94 | 0.0353 |
Note: For PR, ASV, SD, SF (CER), ASR, and ASR+LM, lower values are better. For KS, SID, ER, IC, SF (F1), and QbE, higher values are better.
HuBERT and wav2vec 2.0 dominate most tasks. These two model families, both using Transformer-based architectures with large-scale pretraining, achieved the best or near-best results on nearly every SUPERB task. HuBERT Large achieved the lowest Phone Error Rate (3.53%), the highest Intent Classification accuracy (98.76%), the highest Emotion Recognition accuracy (67.62%), and the lowest ASR Word Error Rate with a language model (2.94%).
Scale matters significantly. The Large variants of both wav2vec 2.0 and HuBERT, trained on 60,000 hours of Libri-Light data with over 300 million parameters, substantially outperformed their Base counterparts trained on 960 hours of LibriSpeech with roughly 95 million parameters. For example, HuBERT Large reduced PR error from 5.41% to 3.53% and improved SID accuracy from 81.42% to 90.33% compared to HuBERT Base.
No single model wins everything. Despite HuBERT Large's overall strength, it did not achieve the best score on every task. HuBERT Base outperformed HuBERT Large on Keyword Spotting (96.30% vs. 95.29%), QbE (0.0736 vs. 0.0353), and ASV (5.11% vs. 5.98% EER), and its 5.11% Equal Error Rate was the best among all evaluated models. This pattern suggests that different pretraining objectives and model scales capture different aspects of speech.
Discriminative models generally outperform generative ones. Models trained with contrastive or masked prediction objectives (wav2vec 2.0, HuBERT) consistently outperformed those trained with purely generative reconstruction objectives (APC, Mockingjay, TERA). The exception is DeCoAR 2.0, which combined generative pretraining with vector quantization and achieved competitive results.
FBANK baseline performs poorly across the board. The non-learned FBANK features performed far below all SSL models on most tasks, confirming the value of self-supervised pretraining. The gap was especially dramatic on Phoneme Recognition (82.01% PER for FBANK vs. 3.53% for HuBERT Large) and Intent Classification (9.10% for FBANK vs. 98.76% for HuBERT Large).
Speaker tasks reveal interesting patterns. TERA performed reasonably on Speaker Identification (57.57%) yet recorded the worst Speaker Verification EER (15.89%) of any SSL model, far behind even the much smaller APC (8.56%). This suggests that different pretraining approaches capture speaker information in ways that do not transfer uniformly across speaker tasks.
Smaller models can be competitive on specific tasks. Modified CPC, with only 1.84 million parameters, achieved reasonable Emotion Recognition accuracy (60.96%) despite being over 170 times smaller than HuBERT Large.
After the initial SUPERB paper, several models achieved notable results on the benchmark:
WavLM, developed by Microsoft Research in 2021, achieved state-of-the-art results on the SUPERB benchmark. WavLM introduced a joint masked speech prediction and denoising pretraining objective, along with gated relative position bias for the Transformer architecture. WavLM Base outperformed HuBERT Base by a relative 22.6% on speaker diarization, and WavLM Large set new records across multiple SUPERB tasks. The improvements were especially pronounced on speaker-related tasks, which benefited from the denoising pretraining that exposed the model to multi-speaker signals.
data2vec, proposed by Meta AI, applied a unified self-supervised learning framework across speech, vision, and text modalities. On SUPERB, data2vec achieved competitive results with wav2vec 2.0 and HuBERT while using a simpler pretraining objective that predicts contextualized latent representations rather than discrete tokens.
SUPERB-SG (Semantic and Generative) was introduced in 2022 at ACL by Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, and colleagues. It extends the original SUPERB benchmark by adding tasks that specifically test semantic understanding and generative capabilities of speech SSL models. While the original SUPERB focused primarily on discriminative tasks with simple classification or labeling outputs, SUPERB-SG includes more challenging tasks that require deeper understanding of speech content and the ability to generate speech.
SUPERB-SG maintains the same evaluation philosophy as the original: frozen pretrained models with lightweight downstream heads. The increased task diversity, combined with limited task supervision, provides a more thorough assessment of model generalizability. The benchmark was published in the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022).
ML-SUPERB (Multilingual Speech Universal PERformance Benchmark) was introduced at Interspeech 2023 by Jiatong Shi and colleagues, including Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee. The original SUPERB benchmark focuses almost entirely on English speech, which limits its ability to evaluate how well SSL representations transfer across languages. ML-SUPERB addresses this by covering 143 languages, ranging from high-resource languages like English and Mandarin to endangered languages with very limited data.
ML-SUPERB focuses on two primary tasks: automatic speech recognition and language identification. It uses frozen SSL features with a shallow downstream model, consistent with the SUPERB evaluation philosophy. A key finding from ML-SUPERB is that multilingual pretrained models do not always outperform monolingual ones, challenging the assumption that broader language coverage in pretraining automatically improves cross-lingual transfer.
A second version, ML-SUPERB 2.0, was presented at Interspeech 2024, expanding the benchmark to evaluate multilingual speech models across additional modeling constraints, languages, and datasets.
Dynamic-SUPERB, introduced in 2023 and presented at ICASSP 2024, takes a fundamentally different approach to speech evaluation. Instead of fixed tasks with task-specific prediction heads, Dynamic-SUPERB evaluates models in a zero-shot instruction-following setting. Models receive natural language instructions describing a task and must produce correct outputs without any task-specific training.
The initial release included 55 evaluation instances combining 33 tasks across 22 datasets, spanning six dimensions: content, speaker, semantics, degradation, paralinguistics, and audio (non-speech). Dynamic-SUPERB Phase-2, presented at ICLR 2025, expanded the benchmark to 180 tasks contributed collaboratively by the global research community, making it one of the largest benchmarks for speech and audio evaluation. Phase-2 covers speech, music, and general sound domains, and supports classification, regression, and sequence-generation output formats.
SUPERB has had a substantial impact on the speech processing research community since its introduction. Several factors contribute to its significance:
Standardized evaluation. Before SUPERB, comparing speech SSL models was difficult because different papers used different tasks, datasets, preprocessing, and evaluation protocols. SUPERB established a common ground that allows direct, fair comparison.
Driving model development. The existence of a public leaderboard has motivated the development of stronger SSL models. WavLM, data2vec, and other post-SUPERB models were explicitly designed and evaluated with the SUPERB benchmark in mind.
Open-source ecosystem. The tight integration with the S3PRL toolkit ensures that all results are reproducible. Researchers can easily add new pretrained models to the benchmark using standardized scripts and training configurations. The toolkit is hosted on GitHub and actively maintained.
Revealing representation properties. SUPERB's multi-task design reveals what types of information different SSL approaches capture. For example, the finding that HuBERT excels on content and semantic tasks while showing different patterns on speaker tasks provides insights into how different pretraining objectives shape learned representations.
Inspiring benchmarks in other domains. The SUPERB framework has inspired similar benchmark efforts in other areas of audio processing, including Codec-SUPERB for audio codec evaluation.
The S3PRL (Self-Supervised Speech Pre-training and Representation Learning) toolkit serves as the official implementation platform for SUPERB. S3PRL is an open-source PyTorch-based framework that provides standardized interfaces to dozens of pretrained upstream models, reproducible downstream training recipes, and unified evaluation pipelines for all SUPERB tasks.
In S3PRL's architecture, pretrained models are referred to as "upstream" models, and the task-specific components are called "downstream" models. This naming convention reflects the information flow from general pretrained representations to task-specific predictions.
While SUPERB has been highly influential, the benchmark has several known limitations:
English-only scope. The original SUPERB benchmark evaluates only English speech data, limiting its ability to assess cross-lingual or multilingual capabilities. ML-SUPERB was created specifically to address this gap.
Limited task diversity. With ten tasks, SUPERB covers a reasonable but not exhaustive range of speech processing capabilities. Tasks like speech translation, voice conversion, speech enhancement, and source separation are not included. SUPERB-SG and Dynamic-SUPERB partially address this limitation.
Frozen representation constraint. While the frozen-model protocol ensures fair comparison of representations, it does not reflect real-world usage where models are typically fine-tuned for specific tasks. Models that produce representations well-suited for fine-tuning but less effective in the frozen setting may be undervalued.
Dataset scale and domain. Most SUPERB datasets are based on read speech (LibriSpeech) or acted scenarios (IEMOCAP), which may not represent the full diversity of real-world speech. Conversational, accented, noisy, and code-switched speech are underrepresented.
Computational cost transparency. SUPERB does not explicitly account for the computational cost of pretraining or inference. A model with billions of parameters and months of GPU training time is compared on the same footing as a model with a few million parameters trained in hours.