LibriSpeech is a large-scale corpus of read English speech designed for training and evaluating automatic speech recognition systems. Created by Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur at Johns Hopkins University, the corpus contains approximately 1,000 hours of speech sampled at 16 kHz. It was derived from audiobooks recorded by volunteers as part of the LibriVox project, with corresponding texts sourced from Project Gutenberg. First introduced at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) in April 2015, LibriSpeech has become the single most widely used benchmark for English speech recognition research, accumulating over 7,000 citations and serving as the standard evaluation target for nearly every major ASR system developed since its release.
Before LibriSpeech, the speech recognition research community relied heavily on a handful of corpora that were either small in scale or restricted by licensing. The Wall Street Journal (WSJ) corpus, for example, contained only about 80 hours of read speech and required an expensive Linguistic Data Consortium (LDC) license. The Switchboard corpus provided conversational telephone speech but was similarly gated behind licensing fees. The Fisher corpus offered more data but remained proprietary. These restrictions made it difficult for researchers at smaller institutions, open-source projects, and international labs to participate fully in ASR research.
The creators of LibriSpeech set out to address these problems by building a freely available, large-scale speech corpus from public domain sources. Their key insight was that the LibriVox project, a volunteer-driven initiative to create free audio recordings of public domain books, had accumulated thousands of hours of English speech with corresponding text from Project Gutenberg. By carefully aligning this audio with its textual transcriptions, the team could produce a high-quality ASR training corpus without any licensing restrictions.
The resulting dataset was released under a Creative Commons Attribution 4.0 (CC BY 4.0) license, allowing unrestricted use for both academic and commercial purposes. This open access model played a significant role in LibriSpeech's rapid adoption across the research community.
LibriVox is a volunteer-driven project founded in 2005 that aims to make all public domain books available as free audiobooks. Volunteers record themselves reading chapters from books whose copyright has expired, and the recordings are released into the public domain. By the time LibriSpeech was created, LibriVox had accumulated a vast library of English-language audiobook recordings spanning a wide range of speakers, accents, and recording quality levels.
The diversity of LibriVox volunteers is both a strength and a challenge for ASR corpus construction. Speakers range from professional-sounding readers with studio-quality microphones to casual volunteers recording on consumer equipment in noisy environments. This variation provided the basis for LibriSpeech's division into "clean" and "other" subsets, reflecting different levels of recording quality and speaker characteristics.
Project Gutenberg is a digital library of over 70,000 free eBooks, focusing on older works for which U.S. copyright has expired. The plain text versions of these books served as the reference transcriptions for LibriSpeech. Since LibriVox recordings are readings of Project Gutenberg texts, the alignment between audio and text could be established by matching audio segments to their corresponding passages in the source books.
The construction of LibriSpeech involved several stages: selecting suitable LibriVox recordings, downloading the corresponding Project Gutenberg texts, performing text normalization, training an initial acoustic model for alignment, running forced alignment to synchronize audio with text, segmenting the aligned audio into utterances, and finally selecting and organizing the data into training, development, and test splits.
The text from Project Gutenberg required extensive normalization before it could be used as reference transcriptions. All letters were converted to uppercase, and all punctuation was removed. Numbers, abbreviations, and other non-standard words were expanded into their spoken forms. This normalization step was critical because ASR systems at the time typically operated on sequences of words without punctuation or case distinctions.
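The spirit of this normalization can be sketched in a few lines of Python. This is an illustrative approximation, not the actual LibriSpeech tooling: real number and abbreviation expansion (e.g. "1842" to "EIGHTEEN FORTY-TWO", "Mr." to "MISTER") required a dedicated text-normalization component that is not reproduced here.

```python
import string

def normalize_text(raw):
    """Illustrative normalization: uppercase and strip punctuation,
    keeping apostrophes (which appear in LibriSpeech transcripts).
    Digits are kept as-is here; the real pipeline expanded them into
    their spoken word forms."""
    text = raw.upper()
    keep = set(string.ascii_uppercase + string.digits + " '")
    text = "".join(ch if ch in keep else " " for ch in text)
    # Collapse runs of whitespace left behind by removed punctuation
    return " ".join(text.split())

print(normalize_text('"Well," said Mr. Holmes in 1895.'))
# WELL SAID MR HOLMES IN 1895
```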
The audio-text alignment was performed using a two-pass approach. In the first pass, the team trained a triphone acoustic model using discriminative training with Boosted Maximum Mutual Information (BMMI) on Mel-Frequency Cepstral Coefficient (MFCC) features. The features were processed with frame-splicing over seven frames, followed by Linear Discriminant Analysis (LDA) and a global Semi-Tied Covariance (STC) transform. This initial model was used to perform a first-pass alignment of the LibriVox audio against the normalized Project Gutenberg text.
In the second pass, the alignment was refined using the output of the first pass to train a better acoustic model, which then re-aligned the data for improved accuracy. The entire alignment process took approximately 65 hours running on two Amazon EC2 cc2.8xlarge instances and produced roughly 1,200 hours of aligned audio, from which the final 1,000-hour corpus was selected.
After alignment, the continuous audio streams were segmented into individual utterances. For the training sets, the audio was split at silence intervals exceeding 0.3 seconds, with a maximum segment length of 35 seconds. For the development and test sets, segmentation was performed only at sentence boundaries in the reference text, so that each evaluation utterance corresponds to a complete sentence.
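A greedy version of the training-set segmentation rule might look like the following sketch. The `segment_utterances` helper and its word-alignment input format are illustrative assumptions, not the actual LibriSpeech tooling:

```python
def segment_utterances(words, min_silence=0.3, max_len=35.0):
    """Greedy segmentation sketch: given word-level alignments as
    (word, start_sec, end_sec) tuples, start a new segment when the
    pause before a word exceeds min_silence seconds, or when extending
    the current segment would exceed max_len seconds."""
    segments, current = [], []
    for word, start, end in words:
        if current:
            pause = start - current[-1][2]          # gap since last word ended
            too_long = end - current[0][1] > max_len  # total segment duration
            if pause > min_silence or too_long:
                segments.append(current)
                current = []
        current.append((word, start, end))
    if current:
        segments.append(current)
    return segments

words = [("HELLO", 0.0, 0.4), ("WORLD", 0.45, 0.9), ("NEXT", 1.5, 1.9)]
print([[w for w, _, _ in seg] for seg in segment_utterances(words)])
# [['HELLO', 'WORLD'], ['NEXT']]
```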
A distinctive feature of LibriSpeech is its division of data into "clean" and "other" subsets. To create this split, the corpus authors ranked all speakers according to the word error rate (WER) achieved by a baseline acoustic model (trained on the WSJ si-84 data) when transcribing their speech. Speakers whose speech was easier to recognize (lower WER) were designated as "clean," while speakers with higher WER were designated as "other." The division was made roughly at the midpoint, so approximately half the speakers fell into each category.
The "clean" subset generally contains speakers with clearer pronunciation, less background noise, better microphone quality, and accents closer to standard American English. The "other" subset includes speakers with more diverse accents, noisier recording conditions, and other factors that make recognition more challenging. This split allows researchers to evaluate their systems on both relatively easy and more difficult speech, providing a more nuanced picture of ASR performance.
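The ranking-and-split procedure described above can be sketched as follows. The speaker IDs and WER values are invented for illustration:

```python
def split_clean_other(speaker_wers):
    """Sketch of the clean/other partition: rank speakers by the WER a
    baseline model achieves on their speech, then split at the midpoint.
    speaker_wers maps speaker_id -> WER (%)."""
    ranked = sorted(speaker_wers, key=speaker_wers.get)  # easiest first
    midpoint = len(ranked) // 2
    return ranked[:midpoint], ranked[midpoint:]  # (clean, other)

wers = {"spk1": 6.2, "spk2": 21.5, "spk3": 9.8, "spk4": 15.1}
clean, other = split_clean_other(wers)
print(clean, other)  # ['spk1', 'spk3'] ['spk4', 'spk2']
```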
The complete LibriSpeech corpus totals approximately 982 hours of speech from 2,484 unique speakers reading 5,466 chapters from LibriVox audiobooks. The data is organized into seven subsets across training, development, and test partitions.
| Subset | Hours | Speakers | Utterances | Description |
|---|---|---|---|---|
| train-clean-100 | ~100 | 251 | 28,539 | Clean training data (smaller subset) |
| train-clean-360 | ~360 | 921 | 104,014 | Clean training data (larger subset) |
| train-other-500 | ~500 | 1,166 | 148,688 | More challenging training data |
| dev-clean | ~5.4 | 40 | 2,703 | Clean development/validation set |
| dev-other | ~5.3 | 33 | 2,864 | Challenging development/validation set |
| test-clean | ~5.4 | 40 | 2,620 | Clean test set |
| test-other | ~5.1 | 33 | 2,939 | Challenging test set |
| Total | ~982 | 2,484 | 292,367 | Entire corpus |
The development and test sets each contain approximately 5 hours of audio. For these evaluation sets, 40 speakers (20 male and 20 female) were selected for the clean partition, and 33 speakers were selected for the other partition, with roughly 8 minutes of speech from each speaker. The speakers in the development and test sets are entirely disjoint from the training set speakers, ensuring unbiased evaluation.
All audio in LibriSpeech is stored in FLAC (Free Lossless Audio Codec) format at a 16 kHz sampling rate with 16-bit resolution. Each audio file corresponds to a single utterance and is named using the convention {speaker_id}-{chapter_id}-{utterance_id}.flac. The transcriptions are stored in plain text files alongside the audio.
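The naming convention makes the speaker, chapter, and utterance identifiers easy to recover from a file path. A minimal parser (the example file name is illustrative):

```python
from pathlib import Path

def parse_utterance_id(path):
    """Split a LibriSpeech file name of the form
    {speaker_id}-{chapter_id}-{utterance_id}.flac into its parts."""
    speaker, chapter, utterance = Path(path).stem.split("-")
    return int(speaker), int(chapter), int(utterance)

print(parse_utterance_id("1089-134686-0001.flac"))  # (1089, 134686, 1)
```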
LibriSpeech includes a diverse set of English speakers, though the corpus does not provide detailed demographic annotations beyond speaker identity. The speakers are predominantly native English speakers from various regions, reflecting the volunteer base of the LibriVox project. Each speaker contributed between a few minutes and several hours of recorded speech, with the training data capped at approximately 25 minutes per speaker in the train-clean-100 subset.
In addition to the acoustic data, the LibriSpeech authors prepared extensive language model training resources. They collected approximately 803 million tokens of text from 14,500 Project Gutenberg books, which were normalized and used to train several n-gram language models. These pre-built language models were distributed alongside the corpus to facilitate reproducible research.
The language model resources included unpruned 3-gram and 4-gram ARPA models trained on this text, pruned variants of the 3-gram model suitable for first-pass decoding, a vocabulary of the 200,000 most frequent words, and a pronunciation lexicon covering that vocabulary.
Kaldi-ASR recipes were also released alongside the corpus, providing complete scripts for building competitive baseline ASR systems using the LibriSpeech data. These recipes significantly lowered the barrier to entry for researchers new to speech recognition.
LibriSpeech uses Word Error Rate (WER) as its primary evaluation metric. WER is computed as the edit distance between the hypothesized transcription and the reference transcription, normalized by the number of words in the reference. Specifically:
WER = (Substitutions + Insertions + Deletions) / Total Reference Words × 100%
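The metric can be computed with a standard dynamic-programming edit distance over words. A minimal reference implementation:

```python
def word_error_rate(reference, hypothesis):
    """Compute WER = (S + I + D) / N via Levenshtein distance over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[-1][-1] / len(ref)

# One substitution in a four-word reference -> 25% WER
print(word_error_rate("THE CAT SAT DOWN", "THE CAT SAT UP"))  # 0.25
```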
The standard practice is to report WER separately on four evaluation sets: dev-clean, dev-other, test-clean, and test-other. Most published results focus on test-clean and test-other, with the gap between the two scores serving as an indicator of a model's robustness to speaker and recording variability.
The original 2015 paper by Panayotov et al. reported baseline results using traditional Gaussian Mixture Model (GMM) and Deep Neural Network (DNN) acoustic models built with the Kaldi toolkit. These baselines established initial performance targets for the corpus.
| Model | Training Data | test-clean WER (%) | test-other WER (%) |
|---|---|---|---|
| SAT (GMM) | 460h (clean) | 5.4 | 14.5 |
| DNN (p-norm) | 460h (clean) | 4.3 | 12.5 |
| SAT (GMM) | 960h (all) | 5.1 | 12.7 |
| DNN (p-norm) | 960h (all) | 4.8 | 14.5 |
The SAT (Speaker-Adapted Training) models used GMM-HMM systems with speaker-level feature transforms (fMLLR). The DNN models used networks with p-norm nonlinearities trained on fMLLR features. Notably, the paper demonstrated that acoustic models trained on LibriSpeech generalized well to other domains, achieving lower error rates on the WSJ test sets than models trained on WSJ itself.
Since 2015, the word error rates on LibriSpeech have dropped dramatically, driven by advances in end-to-end modeling, self-supervised pre-training, data augmentation, and large-scale weak supervision. The following table summarizes notable results across the history of the benchmark.
| Year | Model / System | test-clean WER (%) | test-other WER (%) | Key Innovation |
|---|---|---|---|---|
| 2015 | Deep Speech 2 (Baidu) | 5.15 | 12.73 | End-to-end RNN with CTC |
| 2015 | LibriSpeech Baseline DNN | 4.3 | 12.5 | Kaldi DNN with p-norm |
| 2018 | TDNN-F (Kaldi) | 3.80 | 8.76 | Factorized TDNN with lattice-free MMI |
| 2019 | SpecAugment | 2.5 | 5.8 | Simple data augmentation for spectrograms |
| 2019 | End-to-end (Semi-supervised) | 2.0 | 4.1 | Pre-training with unlabeled data |
| 2020 | Conformer | 1.9 | 3.9 | Convolution-augmented Transformer |
| 2020 | ContextNet | 1.9 | 4.1 | CNN-RNN-Transducer |
| 2020 | wav2vec 2.0 (960h fine-tuned) | 1.8 | 3.3 | Self-supervised learning at scale |
| 2021 | HuBERT Large | 1.8 | 2.9 | Hidden-unit BERT for speech |
| 2022 | Whisper Large (zero-shot) | 2.7 | 5.2 | 680,000h weakly supervised training |
| 2024 | Whisper Large v3 Turbo | ~2.5 | ~4.5 | Distilled large-scale model |
| 2025 | NVIDIA Parakeet RNNT 1.1B | 1.8 | ~3.3 | Conformer encoder with RNN-T decoder |
| 2025 | NVIDIA Canary Qwen 2.5B | 1.6 | 3.1 | Conformer encoder with LLM decoder |
Several observations stand out from this progression. First, the WER on test-clean dropped from approximately 5% in 2015 to below 2% by 2020, representing a roughly 60% relative improvement in just five years. Second, the gap between test-clean and test-other performance has narrowed considerably, indicating that modern systems are more robust to challenging acoustic conditions. Third, self-supervised learning methods like wav2vec 2.0 and HuBERT achieved state-of-the-art results while requiring far less labeled data than their predecessors, fundamentally changing the economics of ASR model development.
Establishing a human baseline for LibriSpeech has been an important reference point. Amodei et al. (2015) reported human WER of 5.83% on test-clean and 12.69% on test-other in the Deep Speech 2 paper, though these numbers have been debated. More careful human transcription experiments suggest that expert transcribers achieve approximately 2-4% WER on test-clean. By this measure, the best machine systems have reached or surpassed human-level accuracy on clean read speech, while performance on the more challenging "other" subset continues to improve.
Baidu's Deep Speech 2, published in late 2015, was one of the first end-to-end systems evaluated on LibriSpeech. It used a deep recurrent neural network with Connectionist Temporal Classification (CTC) loss and batch normalization. Deep Speech 2 achieved 5.15% WER on test-clean and 12.73% on test-other, establishing an early neural baseline.
Published by Google Brain in 2019, SpecAugment introduced a remarkably simple data augmentation technique for speech recognition. By applying random time warping, frequency masking, and time masking to log-mel spectrograms during training, SpecAugment achieved 2.5% WER on test-clean and 5.8% on test-other using a Listen, Attend and Spell (LAS) model. The simplicity and effectiveness of SpecAugment made it a standard component in subsequent ASR systems.
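The masking operations at the heart of SpecAugment can be sketched in a few lines of NumPy. This is a simplified illustration: time warping is omitted, and the mask counts and widths are arbitrary defaults, not the exact augmentation policy from the paper.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=27,
                 num_time_masks=2, time_mask_width=100, rng=None):
    """Apply frequency and time masking to a log-mel spectrogram of
    shape (mel_bins, frames). Masked regions are set to zero."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    num_bins, num_frames = out.shape
    for _ in range(num_freq_masks):  # zero out a band of mel channels
        width = int(rng.integers(0, freq_mask_width + 1))
        start = int(rng.integers(0, max(1, num_bins - width + 1)))
        out[start:start + width, :] = 0.0
    for _ in range(num_time_masks):  # zero out a span of time frames
        width = int(rng.integers(0, min(time_mask_width, num_frames) + 1))
        start = int(rng.integers(0, max(1, num_frames - width + 1)))
        out[:, start:start + width] = 0.0
    return out

augmented = spec_augment(np.ones((80, 300)), rng=np.random.default_rng(0))
```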
The Conformer architecture, introduced by Gulati et al. (2020) at Google, combined the global modeling capability of self-attention mechanisms with the local feature extraction strengths of convolutions. This hybrid approach proved highly effective for speech recognition, achieving 1.9% WER on test-clean and 3.9% on test-other. The Conformer architecture became the foundation for most subsequent state-of-the-art ASR systems, including NVIDIA's Parakeet family of models.
Developed by Meta AI (formerly Facebook AI Research), wav2vec 2.0 introduced a self-supervised pre-training framework for speech. The model learned speech representations by solving a contrastive task over quantized latent speech representations, then fine-tuned on labeled data. When pre-trained on 53,000 hours of unlabeled audio from Libri-Light and fine-tuned on LibriSpeech's 960 hours of labeled data, wav2vec 2.0 achieved 1.8% WER on test-clean and 3.3% on test-other. Remarkably, when fine-tuned on only 10 minutes of labeled data, it still achieved 4.8% WER on test-clean and 8.2% on test-other.
HuBERT (Hidden-Unit BERT), also from Meta AI, extended the self-supervised learning approach by using an offline clustering step to provide pseudo-labels for a BERT-like pre-training objective. The HuBERT Large model achieved 1.8% WER on test-clean and 2.9% on test-other, setting a new record on the more challenging test-other subset at the time of publication in 2021.
OpenAI's Whisper, released in September 2022, took a fundamentally different approach to achieving robust speech recognition. Instead of self-supervised pre-training followed by fine-tuning, Whisper was trained in a weakly supervised manner on approximately 680,000 hours of audio paired with transcriptions collected from the internet. The largest Whisper model (Large) achieved 2.7% WER on test-clean and 5.2% on test-other in a zero-shot setting (without any fine-tuning on LibriSpeech data). While these numbers were not state-of-the-art on LibriSpeech specifically, Whisper's strength lay in its exceptional robustness across diverse domains, languages, and acoustic conditions.
NVIDIA's Parakeet family of models, built on Conformer encoders paired with various decoders (CTC, RNN-Transducer, Token-and-Duration Transducer), achieved leading results on LibriSpeech. The Parakeet RNNT 1.1B model reached 1.8% WER on test-clean, while the more recent Canary Qwen 2.5B model, which pairs a Conformer encoder with a large language model decoder, achieved 1.6% WER on test-clean and 3.1% on test-other, representing some of the best reported results on the benchmark as of early 2026.
LibriSpeech's most significant contribution has been the standardization of ASR evaluation. Before its release, the field lacked a universally accepted, freely available benchmark at sufficient scale. Researchers reported results on different datasets with different evaluation protocols, making direct comparisons between systems difficult. LibriSpeech provided a common ground, and its dual clean/other evaluation paradigm offered a more nuanced assessment than a single test set could provide.
The combination of free audio data, prepared language models, and complete Kaldi recipes meant that any researcher could reproduce the baseline results and build upon them. This reproducibility was a major factor in LibriSpeech's adoption and helped accelerate progress in the field.
LibriSpeech played a central role in the development of self-supervised speech representation learning. The 960-hour training set became the standard pre-training and fine-tuning dataset for models like wav2vec 2.0, HuBERT, WavLM, and data2vec. These models demonstrated that large amounts of unlabeled speech, combined with small amounts of labeled LibriSpeech data, could match or exceed the performance of fully supervised systems trained on the complete labeled corpus.
LibriSpeech's design principles and construction methodology influenced the creation of numerous follow-on datasets, including Libri-Light (roughly 60,000 hours of unlabeled LibriVox audio for self-supervised learning), Multilingual LibriSpeech (MLS, extending the approach to eight languages), LibriTTS (a version re-processed for text-to-speech research), and LibriMix (a speech-separation benchmark built from LibriSpeech utterances).
Despite its widespread use, LibriSpeech has several well-known limitations that researchers should consider when interpreting results.
LibriSpeech consists entirely of read speech from audiobooks. It does not include spontaneous conversation, accented speech from non-native speakers, speech with disfluencies and fillers, or speech in noisy real-world environments. ASR systems that perform well on LibriSpeech may not generalize to more challenging real-world conditions. This limitation has motivated the development of complementary benchmarks such as the CHiME challenges and the Switchboard/Fisher corpora for conversational speech.
While LibriSpeech includes a range of recording conditions through its clean/other split, the acoustic diversity is still limited compared to real-world deployment scenarios. The audio was recorded by individuals reading in relatively quiet environments, which does not capture the full range of background noise, reverberation, and channel effects encountered in practice.
LibriSpeech covers only English speech, limiting its utility for multilingual or cross-lingual speech recognition research. This limitation was partially addressed by the Multilingual LibriSpeech (MLS) dataset, which extended the LibriSpeech methodology to eight languages.
As state-of-the-art models have pushed WER on test-clean below 2%, approaching or surpassing estimated human-level performance, the benchmark has become increasingly saturated for the clean subset. Differences between top-performing systems on test-clean are often within the margin of statistical significance, making it harder to distinguish meaningful improvements. The test-other subset remains more discriminative, but even there, the gap between systems has narrowed considerably.
Because all punctuation is removed and text is uppercased, LibriSpeech does not evaluate a system's ability to produce naturally formatted output with capitalization, punctuation, and number formatting. This has become increasingly relevant as end-to-end ASR systems are deployed in applications where users expect properly formatted transcriptions.
LibriSpeech is freely available for download from the Open Speech and Language Resources (OpenSLR) website at openslr.org/12. The corpus is distributed as a set of compressed tar archives, one for each subset.
| Subset | File Size |
|---|---|
| train-clean-100 | 6.3 GB |
| train-clean-360 | 23 GB |
| train-other-500 | 30 GB |
| dev-clean | 337 MB |
| dev-other | 314 MB |
| test-clean | 346 MB |
| test-other | 328 MB |
The dataset is also available through popular machine learning data platforms including Hugging Face Datasets, TensorFlow Datasets, and PyTorch (torchaudio). These integrations allow researchers to load and use LibriSpeech data with a few lines of code, further lowering the barrier to experimentation.
LibriSpeech can be loaded directly using the Hugging Face datasets library:
```python
from datasets import load_dataset

# Load the test split of the "clean" configuration (test-clean)
dataset = load_dataset("openslr/librispeech_asr", "clean", split="test")
```
The PyTorch audio library provides native support for LibriSpeech:
```python
import torchaudio

# Download test-clean to ./data and expose it as a map-style dataset
dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data",
    url="test-clean",
    download=True,
)
```
Each example in the dataset contains:
| Field | Type | Description |
|---|---|---|
| file | string | Path to the FLAC audio file |
| audio | dict | Decoded audio waveform array and sampling rate (16 kHz) |
| text | string | Uppercase transcription without punctuation |
| id | string | Unique utterance identifier |
| speaker_id | integer | Unique speaker identifier |
| chapter_id | integer | Audiobook chapter identifier |
LibriSpeech is part of a broader ecosystem of ASR benchmarks. Other commonly used speech recognition benchmarks include Switchboard and Fisher for conversational telephone speech, TED-LIUM for recorded talks, Mozilla's Common Voice for crowdsourced multilingual read speech, GigaSpeech for multi-domain English audio, and the CHiME challenges for distant-microphone speech in noisy environments.