LibriSpeech

Updated Jun 24, 2026

LibriSpeech is a freely available corpus of approximately 1,000 hours of 16 kHz read English speech that serves as the standard benchmark for training and evaluating automatic speech recognition (ASR) systems. Its official description reads: "LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech... The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned" ^[14]. The corpus was created by Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur at Johns Hopkins University and was first presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) in April 2015 ^[1]. It pairs LibriVox volunteer audiobook recordings with public-domain texts from Project Gutenberg ^[1]. LibriSpeech has become the single most widely used benchmark for English speech recognition, accumulating over 8,000 citations ^[24] and serving as the headline evaluation target for nearly every major ASR model developed since its release, from Kaldi DNN-HMM baselines to self-supervised and weakly supervised neural systems such as wav2vec 2.0, Conformer, and Whisper.

Why was LibriSpeech created?

Before LibriSpeech, the speech recognition research community relied heavily on a handful of corpora that were either small in scale or restricted by licensing. The Wall Street Journal (WSJ) corpus, for example, contained only about 80 hours of read speech ^[1] and required an expensive Linguistic Data Consortium (LDC) license. The Switchboard corpus provided conversational telephone speech but was similarly gated behind licensing fees. The Fisher corpus offered more data but remained proprietary. These restrictions made it difficult for researchers at smaller institutions, open-source projects, and international labs to participate fully in ASR research.

The creators of LibriSpeech set out to address these problems by building a freely available, large-scale speech corpus from public domain sources. Their key insight was that the LibriVox project, a volunteer-driven initiative to create free audio recordings of public domain books, had accumulated thousands of hours of English speech with corresponding text from Project Gutenberg. By carefully aligning this audio with its textual transcriptions, the team could produce a high-quality ASR training corpus without any licensing restrictions.

The resulting dataset was released under a Creative Commons Attribution 4.0 (CC BY 4.0) license, allowing unrestricted use for both academic and commercial purposes ^[1]^[14]. This open access model played a significant role in LibriSpeech's rapid adoption across the research community.

What is LibriSpeech made from?

LibriVox

LibriVox is a volunteer-driven project founded in 2005 that aims to make all public domain books available as free audiobooks. Volunteers record themselves reading chapters from books whose copyright has expired, and the recordings are released into the public domain. By the time LibriSpeech was created, LibriVox had accumulated a vast library of English-language audiobook recordings spanning a wide range of speakers, accents, and recording quality levels. The original paper put the count at approximately 8,000 public domain audiobooks at the time, the majority of them in English ^[1].

The diversity of LibriVox volunteers is both a strength and a challenge for ASR corpus construction. Speakers range from professional-sounding readers with studio-quality microphones to casual volunteers recording on consumer equipment in noisy environments. This variation provided the basis for LibriSpeech's division into "clean" and "other" subsets, reflecting different levels of recording quality and speaker characteristics.

Project Gutenberg

Project Gutenberg is a digital library of over 70,000 free eBooks, focusing on older works for which U.S. copyright has expired. The plain text versions of these books served as the reference transcriptions for LibriSpeech. Since LibriVox recordings are readings of Project Gutenberg texts, the alignment between audio and text could be established by matching audio segments to their corresponding passages in the source books.

How was LibriSpeech built?

Overview of the Pipeline

The construction of LibriSpeech involved several stages: selecting suitable LibriVox recordings, downloading the corresponding Project Gutenberg texts, performing text normalization, training an initial acoustic model for alignment, running forced alignment to synchronize audio with text, segmenting the aligned audio into utterances, and finally selecting and organizing the data into training, development, and test splits ^[1].

Text Normalization

The text from Project Gutenberg required extensive normalization before it could be used as reference transcriptions. All letters were converted to uppercase, and all punctuation was removed. Numbers, abbreviations, and other non-standard words were expanded into their spoken forms ^[1]. This normalization step was critical because ASR systems at the time typically operated on sequences of words without punctuation or case distinctions.
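As a rough illustration of the case and punctuation handling described above, the sketch below uppercases a passage and strips punctuation. The function name `normalize_transcript` is hypothetical, keeping apostrophes is an assumption made so that contractions survive, and the expansion of numbers and abbreviations is deliberately left out. It is not the authors' normalization script.

```python
import re


def normalize_transcript(text: str) -> str:
    """Uppercase text and drop punctuation (hypothetical helper, ASCII only)."""
    text = text.upper()
    # Keep letters, digits, apostrophes, and whitespace; replace anything else
    # with a space. Digits are kept as-is because number expansion is not shown.
    text = re.sub(r"[^A-Z0-9'\s]", " ", text)
    # Collapse the runs of whitespace left behind by removed punctuation.
    return " ".join(text.split())


print(normalize_transcript('"Well," she said, "don\'t go -- not yet!"'))
```

A full pipeline would also need a step that rewrites digits and abbreviations as spoken words, which depends on context and is beyond this sketch.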

Alignment Process

The audio-text alignment was performed using a two-pass approach. In the first pass, the team trained a triphone acoustic model using discriminative training with Boosted Maximum Mutual Information (BMMI) on Mel-Frequency Cepstral Coefficient (MFCC) features. The features were processed with frame-splicing over seven frames, followed by Linear Discriminant Analysis (LDA) and a global Semi-Tied Covariance (STC) transform ^[1]. The acoustic model for this first decoding pass was trained on the VoxForge dataset ^[1]. This initial model was used to perform a first-pass alignment of the LibriVox audio against the normalized Project Gutenberg text. A Smith-Waterman algorithm then located the best single matching region between the recognized audio and the chapter text, and the audio was split into pieces of 35 seconds or less at silences falling inside "islands of confidence", defined as exact matches with the reference at least 12 phones long ^[1].
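The Smith-Waterman step can be illustrated with a minimal word-level local alignment. This is a sketch under assumptions: the function name `smith_waterman` and the scoring values (match 2, mismatch -1, gap -1) are illustrative choices, not values taken from the paper, and the real pipeline works on recognizer output for whole chapters rather than toy word lists.

```python
def smith_waterman(hyp, ref, match=2, mismatch=-1, gap=-1):
    """Return (score, hyp_span, ref_span) for the best local alignment.

    Spans are (start, end) index pairs usable as Python slices.
    """
    rows, cols = len(hyp) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if hyp[i - 1] == ref[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if hyp[i - 1] == ref[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0]), (j, best_pos[1])


recognized = "uh the quick brown fox um".split()
chapter = "once the quick brown fox jumped".split()
score, (hs, he), (rs, re_) = smith_waterman(recognized, chapter)
print(score, recognized[hs:he], chapter[rs:re_])
```

Because scores are clamped at zero, unrelated material before and after the matching region does not penalize the alignment, which is what makes the method suitable for finding one matching region inside a much longer text.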

In the second pass, the alignment was refined using the output of the first pass to train a better acoustic model, which then re-aligned the data for improved accuracy. Specifically, the second stage decoded each segment with a custom graph that combined the linear word sequence of the transcript with a generic phone-level bigram, using a speaker-adapted model with fMLLR transforms, and rejected any utterance whose decoding deviated from the transcript ^[1]. The entire alignment process took approximately 65 hours running on two Amazon EC2 cc2.8xlarge instances and produced roughly 1,200 hours of aligned audio, from which the final 1,000-hour corpus was selected ^[1].

Segmentation

After alignment, the continuous audio streams were segmented into individual utterances. For the training sets, the audio was split at silence intervals exceeding 0.3 seconds, with a maximum segment length of 35 seconds ^[1]. For the development and test sets, segmentation was performed only at sentence boundaries in the reference text ^[1], so that each evaluation utterance consists of complete sentences.
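The training-set rule can be sketched as follows, assuming silence intervals have already been detected. The function name `split_at_silences`, the choice to cut at the midpoint of each silence, and the decision to merely report over-long segments are all assumptions made for illustration; the source gives only the 0.3-second and 35-second thresholds.

```python
def split_at_silences(duration, silences, min_silence=0.3, max_len=35.0):
    """Cut a recording at the midpoint of every sufficiently long silence.

    duration: total length in seconds.
    silences: list of (start, end) silence intervals in seconds.
    Returns (segments, too_long), where too_long lists segments over max_len.
    """
    cuts = [(start + end) / 2 for start, end in sorted(silences)
            if end - start > min_silence]
    bounds = [0.0] + cuts + [duration]
    segments = list(zip(bounds[:-1], bounds[1:]))
    too_long = [seg for seg in segments if seg[1] - seg[0] > max_len]
    return segments, too_long


segments, too_long = split_at_silences(
    60.0, [(10.0, 10.5), (20.0, 20.25), (50.0, 50.5)]
)
print(segments)  # the 0.25 s silence is too short to produce a cut
print(too_long)  # segments that would still need further handling
```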

Speaker Selection and the Clean/Other Split

A distinctive feature of LibriSpeech is its division of data into "clean" and "other" subsets. To create this split, the corpus authors ranked all speakers according to the word error rate (WER) achieved by a baseline acoustic model (trained on the WSJ si-84 data) when transcribing their speech ^[1]. Speakers whose speech was easier to recognize (lower WER) were designated as "clean," while speakers with higher WER were designated as "other." The division was made roughly at the midpoint, so approximately half the speakers fell into each category ^[1]. For the "other" pool, the development and test speakers were not picked at random: they were drawn from the third quartile of the WER-based difficulty ranking, deliberately selecting more challenging data ^[1]. Multi-speaker recordings such as LibriVox "Dramatic Readings" were excluded, and the remaining audio was screened with the LIUM speaker diarization toolkit plus a custom inspection tool to remove multi-speaker chapters and record speaker gender ^[1].
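The ranking step can be sketched in a few lines. The speaker IDs and WER values below are invented for illustration, the function names are hypothetical, and the exact quartile boundaries are an assumption: here the third quartile is taken to be the slice of the ranking between the 50th and 75th percentiles.

```python
def split_speakers_by_wer(speaker_wer):
    """Rank speakers by baseline WER and split at the midpoint.

    Returns (ranked, clean, other), with the easiest speakers first.
    """
    ranked = sorted(speaker_wer, key=speaker_wer.get)
    mid = len(ranked) // 2
    return ranked, ranked[:mid], ranked[mid:]


def third_quartile(ranked):
    """Speakers between the 50th and 75th percentile of the ranking."""
    n = len(ranked)
    return ranked[n // 2 : (3 * n) // 4]


# Invented per-speaker WERs (percent) from a hypothetical baseline model.
wer = {"s1": 4.0, "s2": 9.5, "s3": 6.1, "s4": 22.0,
       "s5": 12.3, "s6": 3.2, "s7": 30.4, "s8": 15.0}
ranked, clean, other = split_speakers_by_wer(wer)
print(clean, other, third_quartile(ranked))
```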

The "clean" subset generally contains speakers with clearer pronunciation, less background noise, better microphone quality, and accents closer to standard American English. The "other" subset includes speakers with more diverse accents, noisier recording conditions, and other factors that make recognition more challenging. This split allows researchers to evaluate their systems on both relatively easy and more difficult speech, providing a more nuanced picture of ASR performance.

How many hours and speakers does LibriSpeech contain?

The complete LibriSpeech corpus totals approximately 982 hours of speech from 2,484 unique speakers ^[1] reading 5,466 chapters from LibriVox audiobooks. The data is organized into seven subsets across training, development, and test partitions ^[1].

Subset Breakdown

Subset	Hours	Speakers	Utterances	Description
train-clean-100	~100	251	28,539	Clean training data (smaller subset)
train-clean-360	~360	921	104,014	Clean training data (larger subset)
train-other-500	~500	1,166	148,688	More challenging training data
dev-clean	~5.4	40	2,703	Clean development/validation set
dev-other	~5.3	33	2,864	Challenging development/validation set
test-clean	~5.4	40	2,620	Clean test set
test-other	~5.1	33	2,939	Challenging test set
Total	~982	2,484	292,367

The hours and speaker counts follow Table 1 of the original paper, which gives exact durations of 100.6, 363.6, and 496.7 hours for the three training sets ^[1], and the per-subset utterance counts match the official Hugging Face distribution of the corpus ^[15]. The development and test sets each contain approximately 5 hours of audio. For these evaluation sets, 40 speakers (20 male and 20 female) were selected for the clean partition, and 33 speakers were selected for the other partition, with roughly 8 minutes of speech from each speaker ^[1]. The speakers in the development and test sets are entirely disjoint from the training set speakers, ensuring unbiased evaluation ^[1].
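The totals in the table can be checked with a short script. The dictionary below simply copies the table's figures, using the exact training durations quoted above; the variable and function names are illustrative.

```python
# (hours, speakers, utterances) per subset, copied from the subset table.
SUBSETS = {
    "train-clean-100": (100.6, 251, 28539),
    "train-clean-360": (363.6, 921, 104014),
    "train-other-500": (496.7, 1166, 148688),
    "dev-clean": (5.4, 40, 2703),
    "dev-other": (5.3, 33, 2864),
    "test-clean": (5.4, 40, 2620),
    "test-other": (5.1, 33, 2939),
}


def totals(subsets):
    """Sum hours, speakers, and utterances over the given subsets."""
    hours = round(sum(h for h, _, _ in subsets.values()), 1)
    speakers = sum(s for _, s, _ in subsets.values())
    utterances = sum(u for _, _, u in subsets.values())
    return hours, speakers, utterances


train_only = {k: v for k, v in SUBSETS.items() if k.startswith("train-")}
print(totals(SUBSETS))     # whole corpus
print(totals(train_only))  # the three training sets combined
```

The three training sets sum to about 960.9 hours, which is why the full training configuration is commonly referred to as "960h".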

Audio Format

All audio in LibriSpeech is stored in FLAC (Free Lossless Audio Codec) format at a 16 kHz sampling rate with 16-bit resolution. Each audio file corresponds to a single utterance and is named using the convention {speaker_id}-{chapter_id}-{utterance_id}.flac. The transcriptions are stored in plain text files alongside the audio.
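A small parser for this naming convention might look like the sketch below. The helper name `parse_utterance_filename` and the example path are illustrative; the utterance ID is kept as a string on the assumption that any zero padding should be preserved.

```python
from pathlib import Path
from typing import NamedTuple


class UtteranceId(NamedTuple):
    speaker_id: int
    chapter_id: int
    utterance_id: str  # kept as text so leading zeros survive


def parse_utterance_filename(path: str) -> UtteranceId:
    """Split '{speaker_id}-{chapter_id}-{utterance_id}.flac' into its parts."""
    speaker, chapter, utterance = Path(path).stem.split("-")
    return UtteranceId(int(speaker), int(chapter), utterance)


print(parse_utterance_filename("some/dir/1089-134686-0000.flac"))
```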

Speaker Demographics

LibriSpeech includes a diverse set of English speakers, though the corpus does not provide detailed demographic annotations beyond speaker identity and gender. The speakers are predominantly native English speakers from various regions, reflecting the volunteer base of the LibriVox project. The amount of audio per speaker is deliberately limited: the training data is capped at approximately 25 minutes per speaker in the train-clean-100 subset, and each development and test speaker contributes roughly 8 minutes ^[1].

Language Model Resources

In addition to the acoustic data, the LibriSpeech authors prepared extensive language model training resources. They collected approximately 803 million tokens of text from 14,500 Project Gutenberg books, which were normalized and used to train several n-gram language models ^[1]. To prevent contamination, every book underlying the development and test sets was excluded from this text, along with any candidate book flagged by a title-similarity check or by an inverted index of shared 5-grams ^[1]. These pre-built language models were distributed alongside the corpus to facilitate reproducible research.
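The idea behind the shared 5-gram check can be sketched with a simple inverted index. This is an illustration under assumptions: the function names are hypothetical, the toy texts are invented, and the source does not state how many shared 5-grams were needed to flag a book, so the sketch only counts overlaps and leaves the threshold to the caller.

```python
from collections import defaultdict


def ngrams(words, n=5):
    """Return the set of word n-grams in a token list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(books, n=5):
    """Map each n-gram to the IDs of the held-out books containing it."""
    index = defaultdict(set)
    for book_id, text in books.items():
        for gram in ngrams(text.split(), n):
            index[gram].add(book_id)
    return index


def shared_ngram_counts(index, candidate_text, n=5):
    """Count n-grams a candidate text shares with each indexed book."""
    counts = defaultdict(int)
    for gram in ngrams(candidate_text.split(), n):
        for book_id in index.get(gram, ()):
            counts[book_id] += 1
    return dict(counts)


held_out = {"eval_book": "IT WAS THE BEST OF TIMES IT WAS THE WORST OF TIMES"}
index = build_index(held_out)
print(shared_ngram_counts(index, "HE SAID IT WAS THE BEST OF TIMES INDEED"))
```

Indexing the held-out books once and looking up each candidate's 5-grams avoids comparing every candidate against every evaluation book directly.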

The language model resources included:

A 200,000-word vocabulary covering the most frequent words in the corpus ^[1]
3-gram and 4-gram language models with modified Kneser-Ney smoothing ^[1]
ARPA-format language model files ready for use with common ASR decoders
The raw normalized text for researchers who preferred to train their own language models

The released 3-gram model has a perplexity of 170 on the evaluation sets and the 4-gram model a perplexity of around 150, with an out-of-vocabulary rate of approximately 0.4 percent ^[1].
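For readers unfamiliar with the term, the out-of-vocabulary rate is the share of running words that fall outside the vocabulary. A minimal sketch, with a hypothetical function name and a toy vocabulary:

```python
def oov_rate(tokens, vocabulary):
    """Percentage of tokens that are not in the vocabulary."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return 100.0 * unknown / len(tokens)


vocab = {"THE", "CAT", "SAT"}
print(oov_rate("THE CAT SAT ON THE MAT".split(), vocab))  # ON and MAT are unknown
```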

Kaldi-ASR recipes were also released alongside the corpus, providing complete scripts for building competitive baseline ASR systems using the LibriSpeech data ^[1]^[11]. These recipes significantly lowered the barrier to entry for researchers new to speech recognition.

How is LibriSpeech evaluated?

LibriSpeech uses Word Error Rate (WER) as its primary evaluation metric. WER is computed as the edit distance between the hypothesized transcription and the reference transcription, normalized by the number of words in the reference. Specifically:

WER = (Substitutions + Insertions + Deletions) / (Total Reference Words) × 100%

The standard practice is to report WER separately on four evaluation sets: dev-clean, dev-other, test-clean, and test-other. Most published results focus on test-clean and test-other, with the gap between the two scores serving as an indicator of a model's robustness to speaker and recording variability.
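The WER computation can be sketched as a word-level edit distance. This is a minimal illustration with a hypothetical function name; it assumes both strings are already normalized and whitespace-tokenized, and it scores a single utterance, whereas reported results aggregate errors and reference words over a whole test set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length, in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution (SAT -> SIT) and one deletion (THE) against six reference words.
print(word_error_rate("THE CAT SAT ON THE MAT", "THE CAT SIT ON MAT"))
```

Because insertions count as errors, WER can exceed 100 percent when the hypothesis is much longer than the reference.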

Baseline Results from the Original Paper

The original 2015 paper by Panayotov et al. reported baseline results using traditional Gaussian Mixture Model (GMM) and Deep Neural Network (DNN) acoustic models built with the Kaldi toolkit ^[1]^[11]. These baselines established initial performance targets for the corpus.

Model	Training Data	test-clean WER (%)	test-other WER (%)
SAT (GMM)	460h (clean)	8.34	28.11
DNN (p-norm)	460h (clean)	5.78	19.12
SAT (GMM)	960h (all)	8.04	22.65
DNN (p-norm)	960h (all)	5.51	13.97

All values are obtained with rescoring by the full 4-gram language model ^[1]. The SAT (Speaker-Adapted Training) models used GMM-HMM systems with speaker-level feature transforms (fMLLR). The DNN models used networks with p-norm nonlinearities trained on fMLLR features ^[1]. Notably, the paper demonstrated that acoustic models trained on LibriSpeech generalized well to other domains: as the authors put it, "acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself" ^[1].

What is the state of the art on LibriSpeech?

Since 2015, the word error rates on LibriSpeech have dropped dramatically, driven by advances in end-to-end modeling, self-supervised pre-training, data augmentation, and large-scale weak supervision. The following table summarizes notable results across the history of the benchmark.

Historical Progression of WER on LibriSpeech

Year	Model / System	test-clean WER (%)	test-other WER (%)	Key Innovation
2015	Deep Speech 2 (Baidu) ^[2]	5.33	13.25	End-to-end RNN with CTC
2015	LibriSpeech Baseline DNN ^[1]	5.51	13.97	Kaldi DNN with p-norm
2018	TDNN-F (Kaldi)	3.80	8.76	Factorized TDNN with lattice-free MMI
2019	SpecAugment ^[7]	2.5	5.8	Simple data augmentation for spectrograms
2019	End-to-end (Semi-supervised)	2.0	4.1	Pre-training with unlabeled data
2020	Conformer (Google) ^[3]	1.9	3.9	Convolution-augmented Transformer
2020	ContextNet ^[16]	1.9	4.1	CNN-RNN-Transducer
2020	wav2vec 2.0 (960h fine-tuned) ^[4]	1.8	3.3	Self-supervised learning at scale
2021	HuBERT X-Large ^[5]	1.8	2.9	Hidden-unit BERT for speech
2022	Whisper Large (zero-shot) ^[6]	2.7	5.2	680,000h weakly supervised training
2024	Whisper Large v3 Turbo	~2.5	~4.5	Distilled large-scale model
2024	NVIDIA Parakeet RNNT 1.1B ^[17]	1.46	2.5	Conformer encoder with RNN-T decoder
2025	NVIDIA Canary Qwen 2.5B ^[18]	1.6	3.1	Conformer encoder with LLM decoder
2025	NVIDIA Parakeet TDT 0.6B v2 ^[19]	1.69	3.19	600M-parameter FastConformer with token-and-duration transducer

Several observations stand out from this progression. First, the WER on test-clean dropped from 5.51% for the 2015 Kaldi DNN baseline to between 1.8% and 1.9% for the best 2020 systems, a relative reduction of roughly 65% in five years. Second, the gap between test-clean and test-other performance has narrowed considerably, indicating that modern systems are more robust to challenging acoustic conditions. Third, self-supervised learning methods like wav2vec 2.0 ^[4] and HuBERT ^[5] achieved state-of-the-art results while requiring far less labeled data than their predecessors, fundamentally changing the economics of ASR model development.
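Relative improvements between rows of the table can be computed directly. The helper below is a hypothetical illustration; the inputs are the table's figures for the 2015 Kaldi DNN baseline and the 2020 Conformer.

```python
def relative_wer_reduction(old_wer: float, new_wer: float) -> float:
    """Relative reduction, in percent, when WER falls from old_wer to new_wer."""
    if old_wer <= 0:
        raise ValueError("old_wer must be positive")
    return 100.0 * (old_wer - new_wer) / old_wer


print(round(relative_wer_reduction(5.51, 1.9), 1))   # test-clean, 2015 vs 2020
print(round(relative_wer_reduction(13.97, 3.9), 1))  # test-other, 2015 vs 2020
```

For these two systems the test-clean reduction comes out at about 65.5 percent.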

Human-Level Performance

Establishing a human baseline for LibriSpeech has been an important reference point. Amodei et al. (2015) reported human WER of 5.83% on test-clean and 12.69% on test-other in the Deep Speech 2 paper ^[2], though these numbers have been debated. More careful human transcription experiments suggest that expert transcribers achieve approximately 2-4% WER on test-clean. By this measure, the best machine systems have reached or surpassed human-level accuracy on clean read speech, while performance on the more challenging "other" subset continues to improve.

The Open ASR Leaderboard era (2023-2026)

Since 2023, the Hugging Face Open ASR Leaderboard has become the main venue for comparing English speech recognizers, and LibriSpeech test-clean and test-other are two of the evaluation sets in its English short-form track. A 2025 paper by the leaderboard's maintainers describes the platform as comparing 86 open-source and proprietary systems across 12 datasets, with standardized text normalization and joint reporting of WER and inverse real-time factor (RTFx) so that accuracy and speed can be weighed together ^[20].
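The speed metric can be illustrated with a one-line calculation, assuming the usual definition of RTFx as total audio duration divided by total processing time. The function name and the example timings are hypothetical.

```python
def inverse_real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTFx: how many seconds of audio are transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Hypothetical run: one hour of audio transcribed in 1.5 seconds.
print(inverse_real_time_factor(3600.0, 1.5))
```

Higher values are better, so a system with slightly worse WER but a much larger RTFx may still be the better choice for high-volume transcription.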

NVIDIA's NeMo speech models repeatedly led the leaderboard's English track in 2024 and 2025. The Parakeet family, developed with Suno.ai, first topped the leaderboard in early 2024 ^[25]. Parakeet TDT 0.6B v2, a 600-million-parameter FastConformer model with a token-and-duration transducer decoder released under a CC BY 4.0 license in May 2025, reported 1.69 percent WER on test-clean and 3.19 percent on test-other while transcribing audio roughly 3,386 times faster than real time at batch size 128; it was trained on about 120,000 hours of English speech from NVIDIA's Granary dataset, of which around 10,000 hours are human-transcribed and 110,000 hours pseudo-labeled ^[19]. Canary-Qwen-2.5B, released in July 2025, couples a FastConformer encoder to the Qwen3-1.7B language model through a linear projection with LoRA adaptation; trained on approximately 234,500 hours of English speech, it reported 1.60 percent WER on test-clean, 3.10 percent on test-other, and a 5.63 percent average WER across the leaderboard's English test sets ^[18].

The clean subset now shows clear saturation at the top: leading systems differ on test-clean by tenths of a percentage point, and multi-domain leaderboard averages have largely replaced LibriSpeech-only comparisons as the headline measure of English ASR progress ^[20]. LibriSpeech nonetheless remains the field's most established single evaluation target: the Semantic Scholar record for the original paper passed 8,000 citations by June 2026 ^[24], and the corpus's Hugging Face mirror records roughly 100,000 downloads per month, with more than 390 hosted models listing it as training or fine-tuning data ^[15].

Key Models Evaluated on LibriSpeech

Deep Speech 2

Baidu's Deep Speech 2, published in late 2015, was one of the first end-to-end systems evaluated on LibriSpeech. It used a deep recurrent neural network with Connectionist Temporal Classification (CTC) loss and batch normalization. Deep Speech 2 achieved 5.33% WER on test-clean and 13.25% on test-other, establishing an early neural baseline ^[2].

SpecAugment

Published by Google Brain in 2019, SpecAugment introduced a remarkably simple data augmentation technique for speech recognition. By applying random time warping, frequency masking, and time masking to log-mel spectrograms during training, SpecAugment achieved 2.5% WER on test-clean and 5.8% on test-other using a Listen, Attend and Spell (LAS) model ^[7]. The simplicity and effectiveness of SpecAugment made it a standard component in subsequent ASR systems.
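The two masking operations can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: it applies a single frequency mask and a single time mask to a nested-list spectrogram, omits time warping, fills masked cells with zero, and uses hypothetical names and parameter values throughout.

```python
import random


def mask_spectrogram(spec, max_freq_width, max_time_width, rng=None, fill=0.0):
    """Apply one frequency mask and one time mask to a [freq][time] matrix.

    Returns a masked copy; the input is left unchanged.
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]

    # Frequency mask: blank out a random band of consecutive channels.
    width = rng.randint(0, min(max_freq_width, n_freq))
    start = rng.randint(0, n_freq - width)
    for channel in range(start, start + width):
        out[channel] = [fill] * n_time

    # Time mask: blank out a random run of consecutive frames in every channel.
    width = rng.randint(0, min(max_time_width, n_time))
    start = rng.randint(0, n_time - width)
    for row in out:
        row[start:start + width] = [fill] * width
    return out


toy = [[1.0] * 20 for _ in range(8)]  # 8 channels x 20 frames of dummy values
masked = mask_spectrogram(toy, max_freq_width=3, max_time_width=5, rng=random.Random(0))
print(sum(value == 0.0 for row in masked for value in row), "cells masked")
```

Because the masks are drawn afresh on every call, the model sees a differently corrupted version of each utterance at every epoch.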

Conformer

The Conformer architecture, introduced by Gulati et al. (2020) at Google, combined the global modeling capability of self-attention mechanisms with the local feature extraction strengths of convolutions. This hybrid approach proved highly effective for speech recognition, achieving 1.9% WER on test-clean and 3.9% on test-other ^[3]. The Conformer architecture became the foundation for most subsequent state-of-the-art ASR systems, including NVIDIA's Parakeet family of models.

wav2vec 2.0

Developed by Meta AI (formerly Facebook AI Research), wav2vec 2.0 introduced a self-supervised pre-training framework for speech. The model learned speech representations by solving a contrastive task over quantized latent speech representations, then fine-tuned on labeled data. When pre-trained on 53,000 hours of unlabeled audio from Libri-Light ^[9] and fine-tuned on LibriSpeech's 960 hours of labeled data, wav2vec 2.0 achieved 1.8% WER on test-clean and 3.3% on test-other ^[4]. As the paper states, "experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets" ^[4]. Remarkably, the authors also report that "using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER" ^[4], demonstrating that the bulk of speech recognition skill can be learned from unlabeled audio alone.

HuBERT

HuBERT (Hidden-Unit BERT), also from Meta AI, extended the self-supervised learning approach by using an offline clustering step to provide pseudo-labels for a BERT-like pre-training objective. The HuBERT X-Large model, with roughly one billion parameters, achieved 1.8% WER on test-clean and 2.9% on test-other ^[5], setting a new record on the more challenging test-other subset at the time of publication in 2021.

Whisper

OpenAI's Whisper, released in September 2022, took a fundamentally different approach to achieving robust speech recognition. Instead of self-supervised pre-training followed by fine-tuning, Whisper was trained in a weakly supervised manner on approximately 680,000 hours of audio paired with transcriptions collected from the internet ^[6]. The OpenAI authors report that "when scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning" ^[6]. The largest Whisper model (Large) achieved 2.7% WER on test-clean and 5.2% on test-other in a zero-shot setting (without any fine-tuning on LibriSpeech data) ^[6]. While these numbers were not state-of-the-art on LibriSpeech specifically, Whisper's strength lay in its exceptional robustness across diverse domains, languages, and acoustic conditions.

NVIDIA Parakeet

NVIDIA's Parakeet family of models, built on Conformer-style encoders (FastConformer in recent releases) paired with various decoders (CTC, RNN-Transducer, Token-and-Duration Transducer), achieved leading results on LibriSpeech. The Parakeet RNNT 1.1B model reached 1.46% WER on test-clean ^[17]. NVIDIA's more recent Canary Qwen 2.5B model, which pairs a FastConformer encoder with a large language model decoder, achieved 1.6% WER on test-clean and 3.1% on test-other ^[18]. Both are among the best reported results on the benchmark as of early 2026.

Impact on ASR Research

Standardization of Evaluation

LibriSpeech's most significant contribution has been the standardization of ASR evaluation. Before its release, the field lacked a universally accepted, freely available benchmark at sufficient scale. Researchers reported results on different datasets with different evaluation protocols, making direct comparisons between systems difficult. LibriSpeech provided a common ground, and its dual clean/other evaluation paradigm offered a more nuanced assessment than a single test set could provide.

Enabling Reproducible Research

The combination of free audio data, prepared language models, and complete Kaldi recipes meant that any researcher could reproduce the baseline results and build upon them. This reproducibility was a major factor in LibriSpeech's adoption and helped accelerate progress in the field.

Training Self-Supervised Models

LibriSpeech played a central role in the development of self-supervised speech representation learning. The 960-hour training set became the standard pre-training and fine-tuning dataset for models like wav2vec 2.0, HuBERT, WavLM, and data2vec ^[4]^[5]. These models demonstrated that large amounts of unlabeled speech, combined with small amounts of labeled LibriSpeech data, could match or exceed the performance of fully supervised systems trained on the complete labeled corpus.

Influence on Subsequent Datasets

LibriSpeech's design principles and construction methodology influenced the creation of numerous follow-on datasets:

LibriTTS: A version of LibriSpeech optimized for text-to-speech research, with 585 hours at 24 kHz sampling rate, sentence-level segmentation, and preserved punctuation and capitalization ^[8].
Libri-Light: An extension from Meta AI containing 60,000 hours of unlabeled speech from LibriVox, designed for self-supervised and semi-supervised learning research. Libri-Light includes three limited-supervision subsets (10 hours, 1 hour, and 10 minutes) for benchmarking low-resource ASR ^[9].
LibriCSS: A dataset of LibriSpeech utterances replayed through loudspeakers in an office room and recorded with distant microphones, designed for evaluating meeting transcription and speech separation systems ^[13].
LibriMix: A collection of multi-speaker mixture datasets derived from LibriSpeech for speech separation research, including two-speaker (Libri2Mix) and three-speaker (Libri3Mix) mixtures with and without background noise ^[12].
LibriSpeech-PC: An extension that restores punctuation and capitalization to the LibriSpeech transcriptions, enabling evaluation of end-to-end ASR models that produce formatted text ^[23].
Multilingual LibriSpeech (MLS): A large-scale multilingual corpus from Meta AI containing roughly 50,000 hours of speech in eight languages, constructed using a methodology similar to LibriSpeech from LibriVox recordings ^[10].
LibriTTS-R: A 2023 successor to LibriTTS containing the same 585 hours of speech from 2,456 speakers at 24 kHz, processed with speech restoration so that text-to-speech models can be trained on audio with studio-like quality ^[22].
Libriheavy: A 50,000-hour labeled ASR corpus introduced at ICASSP 2024 by Wei Kang and colleagues, including LibriSpeech co-creator Daniel Povey. It is derived from LibriVox audio and is distinguished by transcripts that retain punctuation, casing, and surrounding text context ^[21].

What are LibriSpeech's limitations?

Despite its widespread use, LibriSpeech has several well-known limitations that researchers should consider when interpreting results.

Read Speech Only

LibriSpeech consists entirely of read speech from audiobooks. It does not include spontaneous conversation, speech with disfluencies and fillers, or speech in noisy real-world environments, and it contains little accented speech from non-native speakers. ASR systems that perform well on LibriSpeech may therefore not generalize to more challenging real-world conditions. For this reason, results are often reported alongside complementary benchmarks such as the CHiME challenges for noisy speech and the Switchboard and Fisher corpora for conversational speech.

Limited Acoustic Diversity

While LibriSpeech includes a range of recording conditions through its clean/other split, the acoustic diversity is still limited compared to real-world deployment scenarios. The audio was recorded by individuals reading in relatively quiet environments, which does not capture the full range of background noise, reverberation, and channel effects encountered in practice.

English Only

LibriSpeech covers only English speech, limiting its utility for multilingual or cross-lingual speech recognition research. This limitation was partially addressed by the Multilingual LibriSpeech (MLS) dataset, which extended the LibriSpeech methodology to eight languages ^[10].

Saturation

As state-of-the-art models have pushed WER on test-clean below 2%, approaching or surpassing estimated human-level performance, the benchmark has become increasingly saturated for the clean subset. Differences between top-performing systems on test-clean are often too small to be statistically significant, making it harder to distinguish meaningful improvements. The test-other subset remains more discriminative, but even there, the gap between systems has narrowed considerably.

Text Normalization Artifacts

Because all punctuation is removed and text is uppercased, LibriSpeech does not evaluate a system's ability to produce naturally formatted output with capitalization, punctuation, and number formatting. This has become increasingly relevant as end-to-end ASR systems are deployed in applications where users expect properly formatted transcriptions. The LibriSpeech-PC benchmark and the Libriheavy corpus were both created in part to address this gap ^[21]^[23].

How do you download LibriSpeech, and is it free?

LibriSpeech is freely available for download from the Open Speech and Language Resources (OpenSLR) website at openslr.org/12 ^[14]. It is distributed under a Creative Commons Attribution 4.0 (CC BY 4.0) license, so it can be used without charge for both academic and commercial purposes provided the source is credited ^[1]^[14]. The corpus is distributed as a set of compressed tar archives, one for each subset.

Subset	File Size
train-clean-100	6.3 GB
train-clean-360	23 GB
train-other-500	30 GB
dev-clean	337 MB
dev-other	314 MB
test-clean	346 MB
test-other	328 MB

The dataset is also available through popular machine learning data platforms including Hugging Face Datasets ^[15], TensorFlow Datasets, and PyTorch (torchaudio). These integrations allow researchers to load and use LibriSpeech data with a few lines of code, further lowering the barrier to experimentation.

Technical Usage

Loading with Hugging Face

LibriSpeech can be loaded directly using the Hugging Face datasets library ^[15]:

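from datasets import load_dataset

# Load the test-clean split: the "clean" configuration's "test" split.
# The data is downloaded and cached on first use.
dataset = load_dataset("openslr/librispeech_asr", "clean", split="test")

# Print a summary of the columns and the number of utterances
print(dataset)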

Loading with torchaudio

The PyTorch audio library provides native support for LibriSpeech:

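import os
import torchaudio

# Create the target directory up front; the download may fail if it is missing
os.makedirs("./data", exist_ok=True)
dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data",
    url="test-clean",  # subset name, e.g. "dev-other" or "train-clean-100"
    download=True,
)

# Each item is a tuple of waveform, sample rate, transcript, and the three IDs
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]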

Data Format

Each example in the Hugging Face distribution contains:

Field	Type	Description
file	string	Path to the FLAC audio file
audio	dict	Decoded audio waveform array and sampling rate (16 kHz)
text	string	Uppercase transcription without punctuation
id	string	Unique utterance identifier
speaker_id	integer	Unique speaker identifier
chapter_id	integer	Audiobook chapter identifier

LibriSpeech is part of a broader ecosystem of ASR benchmarks. Other commonly used speech recognition benchmarks include:

Wall Street Journal (WSJ): An older, smaller read-speech corpus from the Linguistic Data Consortium, containing roughly 80 hours of training data.
Switchboard: A corpus of spontaneous conversational telephone speech, widely used alongside LibriSpeech for evaluating conversational ASR.
Common Voice: A crowd-sourced multilingual speech dataset from Mozilla, covering over 100 languages.
GigaSpeech: A 10,000-hour English speech corpus from various sources including audiobooks, podcasts, and YouTube.
SPGISpeech: A 5,000-hour corpus of financial earnings calls with professionally produced transcripts.
VoxPopuli: A large-scale multilingual speech corpus from European Parliament recordings.

References

Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). "Librispeech: An ASR corpus based on public domain audio books." *2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, pp. 5206-5210. doi:10.1109/ICASSP.2015.7178964
Amodei, D., et al. (2015). "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin." *Proceedings of the 33rd International Conference on Machine Learning (ICML)*.
Gulati, A., et al. (2020). "Conformer: Convolution-augmented Transformer for Speech Recognition." *Interspeech 2020*.
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." *Advances in Neural Information Processing Systems (NeurIPS)*.
Hsu, W.-N., et al. (2021). "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units." *IEEE/ACM Transactions on Audio, Speech, and Language Processing*.
Radford, A., et al. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision." *arXiv preprint arXiv:2212.04356*.
Park, D. S., et al. (2019). "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition." *Interspeech 2019*.
Zen, H., et al. (2019). "LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech." *Interspeech 2019*.
Kahn, J., et al. (2020). "Libri-Light: A Benchmark for ASR with Limited or No Supervision." *2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*.
Pratap, V., et al. (2020). "MLS: A Large-Scale Multilingual Dataset for Speech Research." *Interspeech 2020*.
Povey, D., et al. (2011). "The Kaldi Speech Recognition Toolkit." *IEEE Workshop on Automatic Speech Recognition and Understanding*.
Cosentino, J., et al. (2020). "LibriMix: An Open-Source Dataset for Generalizable Speech Separation." *arXiv preprint arXiv:2005.11262*.
Chen, Z., et al. (2020). "Continuous Speech Separation: Dataset and Analysis." *2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*.
Open Speech and Language Resources. "LibriSpeech ASR corpus (SLR12)." https://www.openslr.org/12
Hugging Face. "openslr/librispeech_asr dataset card." https://huggingface.co/datasets/openslr/librispeech_asr
Han, W., et al. (2020). "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context." *Interspeech 2020*. arXiv:2005.03191
NVIDIA. "parakeet-rnnt-1.1b model card." Hugging Face. https://huggingface.co/nvidia/parakeet-rnnt-1.1b
NVIDIA. "canary-qwen-2.5b model card." Hugging Face. https://huggingface.co/nvidia/canary-qwen-2.5b
NVIDIA. "parakeet-tdt-0.6b-v2 model card." Hugging Face. https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
Srivastav, V., et al. (2025). "Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation." *arXiv preprint arXiv:2510.06961*.
Kang, W., et al. (2024). "Libriheavy: A 50,000 Hours ASR Corpus with Punctuation Casing and Context." *2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. arXiv:2309.08105
Koizumi, Y., et al. (2023). "LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus." *Interspeech 2023*. arXiv:2305.18802
Meister, A., et al. (2023). "LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of End-to-End ASR Models." *arXiv preprint arXiv:2310.02943*.
Semantic Scholar. "Librispeech: An ASR corpus based on public domain audio books (paper record, citation count)." https://www.semanticscholar.org/paper/34038d9424ce602d7ac917a4e582d977725d4393
NVIDIA Technical Blog (April 18, 2024). "Pushing the Boundaries of Speech Recognition with NVIDIA NeMo Parakeet ASR Models." https://developer.nvidia.com/blog/pushing-the-boundaries-of-speech-recognition-with-nemo-parakeet-asr-models/
