SUPERB, which stands for Speech processing Universal PERformance Benchmark, is a comprehensive evaluation framework designed to measure how well self-supervised learning (SSL) models generalize across a diverse set of speech processing tasks. Introduced by Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, and colleagues in 2021, SUPERB provides a standardized leaderboard where researchers can compare pretrained speech models under uniform conditions. The benchmark was presented at Interspeech 2021 and has since become one of the most widely adopted evaluation standards in the speech processing community.
SUPERB addresses a gap that previously existed in speech research: while natural language processing had well-established benchmarks like GLUE and SuperGLUE for evaluating pretrained language models, no comparable unified benchmark existed for evaluating pretrained speech representations across multiple downstream tasks. By collecting ten distinct tasks spanning content recognition, speaker characterization, semantic understanding, and paralinguistic analysis, SUPERB enables systematic comparison of speech SSL models in a way that reveals both their strengths and limitations.
The rapid development of self-supervised learning techniques for speech processing created an urgent need for standardized evaluation. Before SUPERB, individual research papers typically reported results on only one or two downstream tasks, making it difficult to compare the overall quality of different pretrained representations. A model that excelled at automatic speech recognition might perform poorly on speaker verification, but without a unified evaluation protocol, these tradeoffs were not visible.
SUPERB draws direct inspiration from NLP benchmarks. Just as GLUE unified evaluation of sentence understanding capabilities for models like BERT, SUPERB aims to serve the same function for speech SSL models such as Wav2Vec 2.0 and HuBERT. The core idea is simple: take a frozen pretrained model, attach lightweight task-specific prediction heads, and measure performance across all ten tasks. This setup isolates the quality of the learned representations from the complexity of downstream architectures.
The benchmark is tightly integrated with the S3PRL (Self-Supervised Speech Pre-training and Representation Learning) toolkit, an open-source framework that provides reproducible training scripts and standardized evaluation pipelines for all SUPERB tasks. S3PRL supports dozens of pretrained speech models, making it straightforward for researchers to submit new models to the leaderboard.
SUPERB is built around several key design principles that ensure fair and meaningful comparisons between models:
Frozen representations. During evaluation, the parameters of the pretrained model are frozen. Only the lightweight downstream prediction head is trained. This ensures that performance differences reflect the quality of the pretrained representations rather than the capacity of fine-tuned models.
Learnable weighted sum. Rather than using only the final layer of the pretrained model, SUPERB employs a trainable weighted-sum mechanism across all hidden layers. This acknowledges that different layers capture different types of information, and the optimal combination may vary by task. The weights are learned alongside the downstream head.
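A minimal sketch of this featurizer mechanism (class and attribute names here are illustrative, not taken from the SUPERB codebase):

```python
import torch

# Sketch of a learnable weighted-sum featurizer: hidden states from all
# L layers of a frozen upstream model are combined with softmax-normalized
# trainable weights, which are learned jointly with the downstream head.
class WeightedSum(torch.nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # One scalar weight per layer; zeros give a uniform average at init.
        self.raw_weights = torch.nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of L tensors, each (batch, time, dim)
        stacked = torch.stack(hidden_states, dim=0)        # (L, B, T, D)
        w = torch.softmax(self.raw_weights, dim=0)         # (L,)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)
```

Because the weights are softmax-normalized, inspecting them after training reveals which layers each task relies on most.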
Minimal downstream models. The prediction heads are intentionally simple. For classification tasks, a single linear layer is used. For sequence labeling tasks, a small BLSTM (bidirectional LSTM) network is employed. For speaker verification, an x-vector architecture is used. This minimalism ensures that the benchmark measures representation quality, not downstream model engineering.
Standardized hyperparameters. Training schedules, learning rates, and other hyperparameters are fixed across all pretrained models for a given task. This eliminates a major source of unfair comparison in prior work.
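Taken together, the protocol can be sketched as below, with a toy GRU standing in for the frozen upstream model (in practice a pretrained SSL model loaded via S3PRL) and a mean-pooled linear head of the kind used for utterance-level classification:

```python
import torch

# Hedged sketch of the SUPERB evaluation protocol with a toy upstream.
upstream = torch.nn.GRU(input_size=40, hidden_size=64, batch_first=True)
for p in upstream.parameters():
    p.requires_grad = False          # frozen: representations stay fixed

head = torch.nn.Linear(64, 10)       # minimal downstream head
optim = torch.optim.Adam(head.parameters(), lr=1e-3)  # head params only

features = torch.randn(8, 100, 40)   # (batch, time, feature dim)
with torch.no_grad():
    reps, _ = upstream(features)     # (8, 100, 64), no gradients tracked
logits = head(reps.mean(dim=1))      # mean pooling -> utterance logits
loss = torch.nn.functional.cross_entropy(
    logits, torch.randint(0, 10, (8,)))
loss.backward()                      # gradients flow into the head only
optim.step()
```

Only the head (and, in the real setup, the layer weights of the weighted sum) receives gradient updates, so leaderboard differences reflect the frozen representations themselves.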
Each SUPERB task uses a specific lightweight prediction head:
| Task | Downstream Model | Output Type |
|---|---|---|
| Phoneme Recognition (PR) | Linear layer | Frame-level phoneme labels |
| Keyword Spotting (KS) | Linear layer (mean pooling) | Utterance-level class |
| Speaker Identification (SID) | Linear layer (mean pooling) | Speaker class |
| Emotion Recognition (ER) | Linear layer (mean pooling) | Emotion class |
| Intent Classification (IC) | Linear layer (mean pooling) | Intent class |
| Automatic Speaker Verification (ASV) | X-vector network | Same/different speaker |
| Speaker Diarization (SD) | Linear layer | Frame-level speaker labels |
| Automatic Speech Recognition (ASR) | 2-layer BLSTM | Character/word sequence |
| Slot Filling (SF) | 2-layer BLSTM | Slot-type sequence |
| Query by Example (QbE) | Dynamic Time Warping (DTW) | Term detection score |
For the Query by Example task, no trainable parameters are used at all. Instead, DTW (Dynamic Time Warping) is applied directly to the SSL representations to measure similarity between the query and candidate audio segments. This makes QbE a pure test of representation quality.
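A minimal DTW sketch under stated assumptions (frame-wise cosine distance and length normalization are illustrative choices, not the benchmark's exact configuration):

```python
import numpy as np

def dtw_distance(query, doc):
    """Align two feature sequences (frames x dims) with dynamic time
    warping and return a length-normalized alignment cost. Lower cost
    means the candidate segment more likely contains the query term."""
    def cos_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    Q, D = len(query), len(doc)
    cost = np.full((Q + 1, D + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, D + 1):
            d = cos_dist(query[i - 1], doc[j - 1])
            # Extend the cheapest of the three predecessor paths.
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match/substitution
    return cost[Q, D] / (Q + D)
```

Since nothing here is trained, the score depends entirely on how well the SSL features place acoustically matching frames close together.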
SUPERB organizes its ten tasks into four categories based on the type of speech information they primarily target: content, speaker, semantics, and paralinguistics.
Phoneme Recognition requires transcribing an utterance into its constituent phonemes, the smallest units of sound that distinguish one word from another in a language. This task evaluates how well pretrained representations capture fine-grained acoustic and phonetic information.
Dataset: LibriSpeech (train-clean-100 / dev-clean / test-clean splits). Phoneme transcriptions are obtained using the LibriSpeech official grapheme-to-phoneme model (g2p-model-5) and the conversion script from the Kaldi LibriSpeech s5 recipe.
Metric: Phone Error Rate (PER), measured as the edit distance between predicted and reference phoneme sequences. Lower is better.
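PER reduces to a Levenshtein (edit) distance over phoneme sequences, normalized by the reference length; WER is the same computation over words. An illustrative sketch (not the benchmark's official scoring tool):

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn the reference sequence into the hypothesis."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all remaining ref labels
    for j in range(n + 1):
        d[0][j] = j                      # insert all hypothesis labels
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n]

ref = ["HH", "AH", "L", "OW"]           # reference phones (illustrative)
hyp = ["HH", "L", "OW", "N"]            # predicted phones
per = edit_distance(ref, hyp) / len(ref)  # 2 edits / 4 phones = 0.5
```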
ASR transcribes spoken utterances into written words. It is one of the most important applications of speech processing and serves as a key indicator of how well a model captures linguistic content.
Dataset: LibriSpeech (train-clean-100 for training; test-clean and test-other for evaluation). Results are reported both without and with a language model (LM) for decoding.
Metric: Word Error Rate (WER). Lower is better.
Keyword Spotting classifies utterances as one of a predefined set of keywords or as unknown/silence. This is a practical task for voice-activated devices that need to detect wake words or simple voice commands.
Dataset: Speech Commands v1.0, which contains one-second audio clips of spoken words recorded by thousands of speakers. The task uses ten keyword classes (such as "yes," "no," "up," "down," "go," "stop") plus two additional classes for unknown words and silence, for twelve classes in total.
Metric: Accuracy (ACC). Higher is better.
QbE detects occurrences of a spoken query term within a database of audio recordings, without any text transcription. Given a short audio clip of a spoken word or phrase, the system must find all matching segments in a larger audio collection.
Dataset: QUESST 2014, specifically the English subset. The evaluation uses the official scoring tools from the MediaEval 2014 benchmark.
Metric: Maximum Term Weighted Value (MTWV), which balances miss rate and false alarm rate. Higher is better.
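For reference, TWV at a single detection threshold is 1 - (P_miss + beta * P_fa), and MTWV is the maximum TWV over all thresholds; beta weights false alarms against misses and is fixed by the evaluation. A hedged sketch with made-up operating points:

```python
def term_weighted_value(p_miss, p_fa, beta):
    """TWV at one detection threshold. A system that detects nothing
    scores 0; a perfect system scores 1; heavy false-alarming can push
    the score below 0. beta is the evaluation's miss/false-alarm
    tradeoff weight (treated here as a free parameter)."""
    return 1.0 - (p_miss + beta * p_fa)

# Sweeping the detection threshold traces a TWV curve; MTWV is its peak.
# The (p_miss, p_fa) pairs and beta below are illustrative only.
curve = [term_weighted_value(pm, pf, beta=20.0)
         for pm, pf in [(0.9, 0.0005), (0.6, 0.002), (0.3, 0.02)]]
mtwv = max(curve)
```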
Speaker Identification classifies each utterance by its speaker identity, selecting from a predefined closed set of speakers. This task tests how well representations capture speaker-specific vocal characteristics such as pitch, timbre, and speaking style.
Dataset: VoxCeleb1, which contains over 100,000 utterances from 1,251 celebrities extracted from YouTube videos. The standard train/test split is used.
Metric: Accuracy (ACC). Higher is better.
Speaker Verification determines whether two utterances were spoken by the same person. Unlike Speaker Identification, ASV is an open-set problem: the model must handle speakers not seen during training.
Dataset: VoxCeleb1, without VoxCeleb2 training data or noise augmentation. The official trial pairs from the VoxCeleb1 test set are used.
Metric: Equal Error Rate (EER), the point at which the false acceptance rate equals the false rejection rate. Lower is better.
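A threshold-sweep sketch of EER (assuming higher scores indicate "same speaker"; production toolkits interpolate between thresholds rather than picking the closest crossing):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sweep decision thresholds over verification scores and return
    the error rate where false acceptances (impostors accepted) and
    false rejections (targets rejected) are closest to equal.
    labels: 1 = same-speaker trial, 0 = different-speaker trial."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[labels == 0])   # false acceptance rate
        frr = np.mean(~accept[labels == 1])  # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```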
Speaker Diarization predicts "who spoke when" by assigning a speaker label to each time frame of a multi-speaker recording. This is particularly challenging because the recordings contain overlapping speech, so the model must attribute frames to multiple simultaneous speakers.
Dataset: LibriMix, generated from LibriSpeech (train-clean-100 / dev-clean / test-clean). LibriMix creates synthetic two-speaker mixtures by combining utterances from different speakers, optionally with ambient noise from the WHAM! dataset.
Metric: Diarization Error Rate (DER). Lower is better.
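A toy frame-level illustration of how DER aggregates misses, false alarms, and speaker confusions (real scoring tools also optimally permute speaker labels and apply forgiveness collars around boundaries, both omitted here):

```python
def frame_der(ref, hyp):
    """ref and hyp are per-frame sets of active speakers (empty set =
    silence). DER = (missed speech + false alarms + speaker confusions)
    divided by total reference speech, counted in frames."""
    miss = fa = conf = speech = 0
    for r, h in zip(ref, hyp):
        speech += len(r)                     # reference speaker-frames
        miss += max(len(r) - len(h), 0)      # speech labeled as silence
        fa += max(len(h) - len(r), 0)        # silence labeled as speech
        conf += min(len(r), len(h)) - len(r & h)  # wrong speaker chosen
    return (miss + fa + conf) / max(speech, 1)
```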
Intent Classification determines the intent behind a spoken utterance. This task is central to voice assistants and spoken dialogue systems, where the system must understand what action the user wants to perform.
Dataset: Fluent Speech Commands, containing 30,043 English utterances from 97 speakers. Each utterance is annotated with three intent labels (action, object, and location), and the model must predict all three correctly for a sample to count as correct.
Metric: Accuracy (ACC). Higher is better.
Slot Filling assigns a semantic slot-type label (such as "destination," "time," or "object") to each word in a spoken utterance. This is the sequence labeling counterpart of Intent Classification and is essential for extracting structured information from speech.
Dataset: Audio SNIPS, a spoken version of the SNIPS text dataset created by synthesizing utterances with multiple speakers. US-accent speakers are used for training, with other accents reserved for validation and testing.
Metrics: F1 score (for slot-type accuracy) and Character Error Rate (CER, for slot-value transcription quality). Higher F1 and lower CER are better.
Emotion Recognition predicts the emotional state of the speaker from a spoken utterance. This task primarily relies on paralinguistic cues such as prosody, pitch variation, speech rate, and voice quality rather than on the lexical content of what is said.
Dataset: IEMOCAP (Interactive Emotional Dyadic Motion Capture), containing approximately 12 hours of audiovisual recordings of dyadic conversations between actors. The standard setup drops the unbalanced emotion classes and keeps four categories: neutral, happy, sad, and angry. Evaluation uses leave-one-session-out five-fold cross-validation over the five recording sessions.
Metric: Accuracy (ACC). Higher is better.
The original SUPERB paper evaluated 14 self-supervised pretrained models plus a baseline, covering a range of pretraining strategies, architectures, and data scales. The models are grouped by their pretraining approach.
FBANK (Log Mel Filterbank) serves as a non-learned baseline. It extracts 80-dimensional log mel filterbank features from raw audio at a 10ms frame rate, representing the audio signal without any self-supervised pretraining. FBANK provides a lower bound on performance, showing what can be achieved with purely handcrafted acoustic features.
Generative SSL models learn by predicting masked or future portions of the input signal:
| Model | Architecture | Parameters | Training Data | Pretraining Method |
|---|---|---|---|---|
| APC | 3-layer GRU | 4.11M | LibriSpeech 360hr | Future prediction (generative) |
| VQ-APC | 3-layer GRU | 4.63M | LibriSpeech 360hr | Future prediction + vector quantization |
| NPC | 4 Conv + 4 Masked Conv | 19.38M | LibriSpeech 360hr | Masked prediction + vector quantization |
| Mockingjay | 12-layer Transformer | 85.12M | LibriSpeech 360hr | Time-masked prediction (generative) |
| TERA | 3-layer Transformer | 21.33M | LibriSpeech 960hr | Time/frequency masked prediction (generative) |
| DeCoAR 2.0 | 12-layer Transformer | 89.84M | LibriSpeech 960hr | Time-masked prediction + vector quantization |
Discriminative SSL models learn by distinguishing between positive and negative samples through contrastive learning objectives:
| Model | Architecture | Parameters | Training Data | Pretraining Method |
|---|---|---|---|---|
| Modified CPC | 5 Conv + 1 LSTM | 1.84M | Libri-Light 60k hr | Future contrastive |
| Wav2Vec | 19 Conv layers | 32.54M | LibriSpeech 960hr | Future contrastive |
| vq-wav2vec | 20 Conv layers | 34.15M | LibriSpeech 960hr | Future contrastive + vector quantization |
| wav2vec 2.0 Base | 7 Conv + 12 Transformer | 95.04M | LibriSpeech 960hr | Masked contrastive + vector quantization |
| wav2vec 2.0 Large | 7 Conv + 24 Transformer | 317.38M | Libri-Light 60k hr | Masked contrastive + vector quantization |
HuBERT uses an offline clustering step to create pseudo-labels, then trains with a masked prediction objective similar to BERT:
| Model | Architecture | Parameters | Training Data | Pretraining Method |
|---|---|---|---|---|
| HuBERT Base | 7 Conv + 12 Transformer | 94.68M | LibriSpeech 960hr | Masked prediction + vector quantization |
| HuBERT Large | 7 Conv + 24 Transformer | 316.61M | Libri-Light 60k hr | Masked prediction + vector quantization |
PASE+ combines multiple pretraining objectives (including waveform reconstruction, contrastive loss, and speaker classification) using a SincNet frontend followed by convolutional and QRNN layers. It has 7.83M parameters and was trained on 50 hours of LibriSpeech data.
The table below presents the complete benchmark results from the original SUPERB paper. For each task, the best result among the evaluated models is highlighted.
| Model | PR (PER) | KS (ACC) | SID (ACC) | ASV (EER) | SD (DER) | ER (ACC) | IC (ACC) | SF (F1) | SF (CER) | ASR (WER) | ASR+LM (WER) | QbE (MTWV) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FBANK | 82.01 | 8.63 | 0.09 | 9.56 | 10.05 | 35.39 | 9.10 | 69.64 | 52.94 | 23.18 | 15.21 | 0.0058 |
| PASE+ | 58.87 | 82.54 | 37.99 | 11.61 | 8.68 | 57.86 | 29.82 | 62.14 | 60.17 | 25.11 | 16.62 | 0.0072 |
| APC | 41.98 | 91.01 | 60.42 | 8.56 | 10.53 | 59.33 | 74.69 | 70.46 | 50.89 | 21.28 | 14.74 | 0.0310 |
| VQ-APC | 41.08 | 91.11 | 60.15 | 8.72 | 10.45 | 59.66 | 74.48 | 68.53 | 52.91 | 21.20 | 15.21 | 0.0251 |
| NPC | 43.81 | 88.96 | 55.92 | 9.40 | 9.34 | 59.08 | 69.44 | 72.79 | 48.44 | 20.20 | 13.91 | 0.0246 |
| Mockingjay | 70.19 | 83.67 | 32.29 | 11.66 | 10.54 | 50.28 | 34.33 | 61.59 | 58.89 | 22.82 | 15.48 | 0.0007 |
| TERA | 49.17 | 89.48 | 57.57 | 15.89 | 9.96 | 56.27 | 58.42 | 67.50 | 54.17 | 18.17 | 12.16 | 0.0013 |
| DeCoAR 2.0 | 14.93 | 94.48 | 74.42 | 7.16 | 6.59 | 62.47 | 90.80 | 83.28 | 34.73 | 13.02 | 9.07 | 0.0406 |
| Modified CPC | 42.54 | 91.88 | 39.63 | 12.86 | 10.38 | 60.96 | 64.09 | 71.19 | 49.91 | 20.18 | 13.53 | 0.0326 |
| Wav2Vec | 31.58 | 95.59 | 56.56 | 7.99 | 9.90 | 59.79 | 84.92 | 76.37 | 43.71 | 15.86 | 11.00 | 0.0485 |
| vq-wav2vec | 33.48 | 93.38 | 38.80 | 10.38 | 9.93 | 58.24 | 85.68 | 77.68 | 41.54 | 17.71 | 12.80 | 0.0410 |
| wav2vec 2.0 Base | 5.74 | 96.23 | 75.18 | 6.02 | 6.08 | 63.43 | 92.35 | 88.30 | 24.77 | 6.43 | 4.79 | 0.0233 |
| wav2vec 2.0 Large | 4.75 | 96.66 | 86.14 | 5.65 | 5.62 | 65.64 | 95.28 | 87.11 | 27.31 | 3.75 | 3.10 | 0.0489 |
| HuBERT Base | 5.41 | 96.30 | 81.42 | 5.11 | 5.88 | 64.92 | 98.34 | 88.53 | 25.20 | 6.42 | 4.79 | 0.0736 |
| HuBERT Large | 3.53 | 95.29 | 90.33 | 5.98 | 5.75 | 67.62 | 98.76 | 89.81 | 21.76 | 3.62 | 2.94 | 0.0353 |
Note: For PR, ASV, SD, SF (CER), ASR, and ASR+LM, lower values are better. For KS, SID, ER, IC, SF (F1), and QbE, higher values are better.
HuBERT and wav2vec 2.0 dominate most tasks. These two model families, both using Transformer-based architectures with large-scale pretraining, achieved the best or near-best results on nearly every SUPERB task. HuBERT Large achieved the lowest Phone Error Rate (3.53%), the highest Intent Classification accuracy (98.76%), the highest Emotion Recognition accuracy (67.62%), and the lowest ASR Word Error Rate with a language model (2.94%).
Scale matters significantly. The Large variants of both wav2vec 2.0 and HuBERT, trained on 60,000 hours of Libri-Light data with over 300 million parameters, substantially outperformed their Base counterparts trained on 960 hours of LibriSpeech with roughly 95 million parameters. For example, HuBERT Large reduced PR error from 5.41% to 3.53% and improved SID accuracy from 81.42% to 90.33% compared to HuBERT Base.
No single model wins everything. Despite HuBERT Large's overall strength, it did not achieve the best score on every task. HuBERT Base outperformed HuBERT Large on Keyword Spotting (96.30% vs. 95.29%), QbE (0.0736 vs. 0.0353), and ASV (5.11% vs. 5.98% EER), and its 5.11% Equal Error Rate was the best among all evaluated models. This pattern suggests that different pretraining objectives and model scales capture different aspects of speech.
Discriminative models generally outperform generative ones. Models trained with contrastive or masked prediction objectives (wav2vec 2.0, HuBERT) consistently outperformed those trained with purely generative reconstruction objectives (APC, Mockingjay, TERA). The exception is DeCoAR 2.0, which combined generative pretraining with vector quantization and achieved competitive results.
FBANK baseline performs poorly across the board. The non-learned FBANK features performed far below all SSL models on most tasks, confirming the value of self-supervised pretraining. The gap was especially dramatic on Phoneme Recognition (82.01% PER for FBANK vs. 3.53% for HuBERT Large) and Intent Classification (9.10% for FBANK vs. 98.76% for HuBERT Large).
Speaker tasks reveal interesting patterns. TERA performed reasonably on Speaker Identification (57.57%) yet recorded the worst Speaker Verification EER (15.89%) of any SSL model, far behind even the much smaller APC (8.56%). This suggests that different pretraining approaches capture speaker information in ways that do not transfer uniformly across speaker tasks.
Smaller models can be competitive on specific tasks. Modified CPC, with only 1.84 million parameters, achieved reasonable Emotion Recognition accuracy (60.96%) despite being over 170 times smaller than HuBERT Large.
After the initial SUPERB paper, several models achieved notable results on the benchmark:
WavLM, developed by Microsoft Research in 2021, achieved state-of-the-art results on the SUPERB benchmark. WavLM introduced a joint masked speech prediction and denoising pretraining objective, along with gated relative position bias for the Transformer architecture. WavLM Base outperformed HuBERT Base by a relative 22.6% on speaker diarization, and WavLM Large set new records across multiple SUPERB tasks. The improvements were especially pronounced on speaker-related tasks, which benefited from the denoising pretraining that exposed the model to multi-speaker signals.
data2vec, proposed by Meta AI, applied a unified self-supervised learning framework across speech, vision, and text modalities. On SUPERB, data2vec achieved competitive results with wav2vec 2.0 and HuBERT while using a simpler pretraining objective that predicts contextualized latent representations rather than discrete tokens.
SUPERB-SG (Semantic and Generative) was introduced in 2022 at ACL by Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, and colleagues. It extends the original SUPERB benchmark by adding tasks that specifically test semantic understanding and generative capabilities of speech SSL models. While the original SUPERB focused primarily on discriminative tasks with simple classification or labeling outputs, SUPERB-SG includes more challenging tasks that require deeper understanding of speech content and the ability to generate speech.
SUPERB-SG maintains the same evaluation philosophy as the original: frozen pretrained models with lightweight downstream heads. The increased task diversity, combined with limited task supervision, provides a more thorough assessment of model generalizability. The benchmark was published in the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022).
ML-SUPERB (Multilingual Speech Universal PERformance Benchmark) was introduced at Interspeech 2023 by Jiatong Shi and colleagues, including Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee. The original SUPERB benchmark focuses almost entirely on English speech, which limits its ability to evaluate how well SSL representations transfer across languages. ML-SUPERB addresses this by covering 143 languages, ranging from high-resource languages like English and Mandarin to endangered languages with very limited data.
ML-SUPERB focuses on two primary tasks: automatic speech recognition and language identification. It uses frozen SSL features with a shallow downstream model, consistent with the SUPERB evaluation philosophy. A key finding from ML-SUPERB is that multilingual pretrained models do not always outperform monolingual ones, challenging the assumption that broader language coverage in pretraining automatically improves cross-lingual transfer.
A second version, ML-SUPERB 2.0, was presented at Interspeech 2024, expanding the benchmark to evaluate multilingual speech models across additional modeling constraints, languages, and datasets.
Dynamic-SUPERB, introduced in 2023 and presented at ICASSP 2024, takes a fundamentally different approach to speech evaluation. Instead of fixed tasks with task-specific prediction heads, Dynamic-SUPERB evaluates models in a zero-shot instruction-following setting. Models receive natural language instructions describing a task and must produce correct outputs without any task-specific training.
The initial release included 55 evaluation instances combining 33 tasks across 22 datasets, spanning six dimensions: content, speaker, semantics, degradation, paralinguistics, and audio (non-speech). Dynamic-SUPERB Phase-2, presented at ICLR 2025, expanded the benchmark to 180 tasks contributed collaboratively by the global research community, making it one of the largest benchmarks for speech and audio evaluation. Phase-2 covers speech, music, and general sound domains, and supports classification, regression, and sequence-generation output formats.
SUPERB has had a substantial impact on the speech processing research community since its introduction. Several factors contribute to its significance:
Standardized evaluation. Before SUPERB, comparing speech SSL models was difficult because different papers used different tasks, datasets, preprocessing, and evaluation protocols. SUPERB established a common ground that allows direct, fair comparison.
Driving model development. The existence of a public leaderboard has motivated the development of stronger SSL models. WavLM, data2vec, and other post-SUPERB models were explicitly designed and evaluated with the SUPERB benchmark in mind.
Open-source ecosystem. The tight integration with the S3PRL toolkit ensures that all results are reproducible. Researchers can easily add new pretrained models to the benchmark using standardized scripts and training configurations. The toolkit is hosted on GitHub and actively maintained.
Revealing representation properties. SUPERB's multi-task design reveals what types of information different SSL approaches capture. For example, the finding that HuBERT excels on content and semantic tasks while showing different patterns on speaker tasks provides insights into how different pretraining objectives shape learned representations.
Inspiring benchmarks in other domains. The SUPERB framework has inspired similar benchmark efforts in other areas of audio processing, including Codec-SUPERB for audio codec evaluation.
The S3PRL (Self-Supervised Speech Pre-training and Representation Learning) toolkit serves as the official implementation platform for SUPERB. S3PRL is an open-source PyTorch-based framework that provides standardized interfaces to dozens of pretrained upstream models, reproducible downstream training recipes, and unified evaluation pipelines for all SUPERB tasks.
In S3PRL's architecture, pretrained models are referred to as "upstream" models, and the task-specific components are called "downstream" models. This naming convention reflects the information flow from general pretrained representations to task-specific predictions.
While SUPERB has been highly influential, the benchmark has several known limitations:
English-only scope. The original SUPERB benchmark evaluates only English speech data, limiting its ability to assess cross-lingual or multilingual capabilities. ML-SUPERB was created specifically to address this gap.
Limited task diversity. With ten tasks, SUPERB covers a reasonable but not exhaustive range of speech processing capabilities. Tasks like speech translation, voice conversion, speech enhancement, and source separation are not included. SUPERB-SG and Dynamic-SUPERB partially address this limitation.
Frozen representation constraint. While the frozen-model protocol ensures fair comparison of representations, it does not reflect real-world usage where models are typically fine-tuned for specific tasks. Models that produce representations well-suited for fine-tuning but less effective in the frozen setting may be undervalued.
Dataset scale and domain. Most SUPERB datasets are based on read speech (LibriSpeech) or acted scenarios (IEMOCAP), which may not represent the full diversity of real-world speech. Conversational, accented, noisy, and code-switched speech are underrepresented.
Computational cost transparency. SUPERB does not explicitly account for the computational cost of pretraining or inference. A model with billions of parameters and months of GPU training time is compared on the same footing as a model with a few million parameters trained in hours.