LibriSpeech
Last reviewed
Sources
25 citations
Review status
Source-backed
Revision
v4 · 5,299 words
LibriSpeech is a freely available corpus of approximately 1,000 hours of 16 kHz read English speech that serves as the standard benchmark for training and evaluating automatic speech recognition (ASR) systems. Its official description reads: "LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech... The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned" [14]. The corpus was created by Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur at Johns Hopkins University and was first presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) in April 2015 [1]. It pairs LibriVox volunteer audiobook recordings with public-domain texts from Project Gutenberg [1]. LibriSpeech has become the single most widely used benchmark for English speech recognition, accumulating over 8,000 citations [24] and serving as the headline evaluation target for nearly every major ASR model developed since its release, from Kaldi DNN-HMM baselines to self-supervised and weakly supervised neural systems such as wav2vec 2.0, Conformer, and Whisper.
Before LibriSpeech, the speech recognition research community relied heavily on a handful of corpora that were either small in scale or restricted by licensing. The Wall Street Journal (WSJ) corpus, for example, contained only about 80 hours of read speech [1] and required an expensive Linguistic Data Consortium (LDC) license. The Switchboard corpus provided conversational telephone speech but was similarly gated behind licensing fees. The Fisher corpus offered more data but remained proprietary. These restrictions made it difficult for researchers at smaller institutions, open-source projects, and international labs to participate fully in ASR research.
The creators of LibriSpeech set out to address these problems by building a freely available, large-scale speech corpus from public domain sources. Their key insight was that the LibriVox project, a volunteer-driven initiative to create free audio recordings of public domain books, had accumulated thousands of hours of English speech with corresponding text from Project Gutenberg. By carefully aligning this audio with its textual transcriptions, the team could produce a high-quality ASR training corpus without any licensing restrictions.
The resulting dataset was released under a Creative Commons Attribution 4.0 (CC BY 4.0) license, allowing unrestricted use for both academic and commercial purposes [1][14]. This open access model played a significant role in LibriSpeech's rapid adoption across the research community.
LibriVox is a volunteer-driven project founded in 2005 that aims to make all public domain books available as free audiobooks. Volunteers record themselves reading chapters from books whose copyright has expired, and the recordings are released into the public domain. By the time LibriSpeech was created, LibriVox had accumulated a vast library of English-language audiobook recordings spanning a wide range of speakers, accents, and recording quality levels. The original paper counted approximately 8,000 public domain audiobooks on LibriVox, the majority of them in English [1].
The diversity of LibriVox volunteers is both a strength and a challenge for ASR corpus construction. Speakers range from professional-sounding readers with studio-quality microphones to casual volunteers recording on consumer equipment in noisy environments. This variation provided the basis for LibriSpeech's division into "clean" and "other" subsets, reflecting different levels of recording quality and speaker characteristics.
Project Gutenberg is a digital library of over 70,000 free eBooks, focusing on older works for which U.S. copyright has expired. The plain text versions of these books served as the reference transcriptions for LibriSpeech. Since LibriVox recordings are readings of Project Gutenberg texts, the alignment between audio and text could be established by matching audio segments to their corresponding passages in the source books.
The construction of LibriSpeech involved several stages: selecting suitable LibriVox recordings, downloading the corresponding Project Gutenberg texts, performing text normalization, training an initial acoustic model for alignment, running forced alignment to synchronize audio with text, segmenting the aligned audio into utterances, and finally selecting and organizing the data into training, development, and test splits [1].
The text from Project Gutenberg required extensive normalization before it could be used as reference transcriptions. All letters were converted to uppercase, and all punctuation was removed. Numbers, abbreviations, and other non-standard words were expanded into their spoken forms [1]. This normalization step was critical because ASR systems at the time typically operated on sequences of words without punctuation or case distinctions.
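The uppercasing and punctuation removal described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `normalize_transcript` is hypothetical, keeping apostrophes inside words is an assumption, and the expansion of numbers and abbreviations into spoken forms is deliberately left out.

```python
import re

def normalize_transcript(text: str) -> str:
    """Uppercase text and strip punctuation (hypothetical helper).

    Assumption: apostrophes are kept so contractions stay intact.
    Digits are left untouched; expanding them into words is out of scope here.
    """
    text = text.upper()
    # Replace every character that is not a letter, digit, apostrophe,
    # or whitespace with a space.
    text = re.sub(r"[^A-Z0-9'\s]", " ", text)
    # Collapse the whitespace runs left behind by removed punctuation.
    return " ".join(text.split())

print(normalize_transcript('"Well, it\'s late," she said.'))
```

A production normalizer would also need rules for numerals, abbreviations, and other non-standard words, which is where most of the effort in this step typically goes.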
The audio-text alignment was performed using a two-pass approach. In the first pass, the team trained a triphone acoustic model using discriminative training with Boosted Maximum Mutual Information (BMMI) on Mel-Frequency Cepstral Coefficient (MFCC) features. The features were processed with frame-splicing over seven frames, followed by Linear Discriminant Analysis (LDA) and a global Semi-Tied Covariance (STC) transform [1]. The acoustic model for this first decoding pass was trained on the VoxForge dataset [1]. This initial model was used to perform a first-pass alignment of the LibriVox audio against the normalized Project Gutenberg text. A Smith-Waterman algorithm then located the best single matching region between the recognized audio and the chapter text, and the audio was split into pieces of 35 seconds or less at silences falling inside "islands of confidence", defined as exact matches with the reference at least 12 phones long [1].
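To make the local-alignment step more concrete, the following sketch applies the Smith-Waterman recurrence to two word sequences and returns the best matching region. It is only an illustration under stated assumptions: the function name, the word-level tokens, and the scoring values (`match`, `mismatch`, `gap`) are not taken from the paper.

```python
def smith_waterman(hyp, ref, match=2, mismatch=-1, gap=-1):
    """Best local alignment between two token lists.

    Returns (score, (hyp_start, hyp_end), (ref_start, ref_end)),
    with end indices exclusive. Scoring values are illustrative assumptions.
    """
    rows, cols = len(hyp) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if hyp[i - 1] == ref[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if hyp[i - 1] == ref[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0]), (j, best_pos[1])

ref = "PREFACE BY THE EDITOR CALL ME ISHMAEL SOME YEARS AGO NEVER MIND".split()
hyp = "NOISE CALL ME ISHMAEL SOME YEARS AGO".split()
score, hyp_span, ref_span = smith_waterman(hyp, ref)
print(score, hyp[slice(*hyp_span)], ref[slice(*ref_span)])
```

In this toy case the recognized words match a region in the middle of the reference, which is the situation local alignment is designed for: the audio may skip front matter or include material that is absent from the chapter text.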
In the second pass, the alignment was refined using the output of the first pass to train a better acoustic model, which then re-aligned the data for improved accuracy. Specifically, the second stage decoded each segment with a custom graph that combined the linear word sequence of the transcript with a generic phone-level bigram, using a speaker-adapted model with fMLLR transforms, and rejected any utterance whose decoding deviated from the transcript [1]. The entire alignment process took approximately 65 hours running on two Amazon EC2 cc2.8xlarge instances and produced roughly 1,200 hours of aligned audio, from which the final 1,000-hour corpus was selected [1].
After alignment, the continuous audio streams were segmented into individual utterances. For the training sets, the audio was split at silence intervals exceeding 0.3 seconds, with a maximum segment length of 35 seconds [1]. For the development and test sets, segmentation was performed only at sentence boundaries in the reference text [1], so that every evaluation utterance corresponds to complete sentences.
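One way to picture the training-set rule is a greedy pass over detected silences. The sketch below is an assumption about how such a rule could be applied, not the procedure from the paper: the function name, the choice to cut at the silence midpoint, and the greedy "latest reachable silence" strategy are all illustrative.

```python
def split_at_silences(silences, total_dur, min_sil=0.3, max_len=35.0):
    """Greedy segmentation sketch (illustrative, not the paper's algorithm).

    silences: list of (start, end) times in seconds.
    Cuts are placed at the midpoint of the latest silence longer than
    min_sil that keeps the current segment at or below max_len seconds.
    """
    cuts = sorted((s + e) / 2 for s, e in silences if e - s > min_sil)
    segments, start = [], 0.0
    while total_dur - start > max_len:
        reachable = [c for c in cuts if start < c <= start + max_len]
        if not reachable:
            raise ValueError(f"no usable silence within {max_len}s of {start}s")
        end = reachable[-1]
        segments.append((start, end))
        start = end
    segments.append((start, total_dur))
    return segments

silences = [(10.0, 10.5), (20.0, 20.25), (30.0, 30.5), (50.0, 51.0), (64.0, 64.5)]
print(split_at_silences(silences, 80.0))
```

The silence at 20.0 to 20.25 seconds is ignored because it is shorter than 0.3 seconds, and every resulting segment stays within the 35-second limit.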
A distinctive feature of LibriSpeech is its division of data into "clean" and "other" subsets. To create this split, the corpus authors ranked all speakers according to the word error rate (WER) achieved by a baseline acoustic model (trained on the WSJ si-84 data) when transcribing their speech [1]. Speakers whose speech was easier to recognize (lower WER) were designated as "clean," while speakers with higher WER were designated as "other." The division was made roughly at the midpoint, so approximately half the speakers fell into each category [1]. For the "other" pool, the development and test speakers were not picked at random: they were drawn from the third quartile of the WER-based difficulty ranking, deliberately selecting more challenging data [1]. Multi-speaker recordings such as LibriVox "Dramatic Readings" were excluded, and the remaining audio was screened with the LIUM speaker diarization toolkit plus a custom inspection tool to remove multi-speaker chapters and record speaker gender [1].
The "clean" subset generally contains speakers with clearer pronunciation, less background noise, better microphone quality, and accents closer to standard American English. The "other" subset includes speakers with more diverse accents, noisier recording conditions, and other factors that make recognition more challenging. This split allows researchers to evaluate their systems on both relatively easy and more difficult speech, providing a more nuanced picture of ASR performance.
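The midpoint split over a WER ranking can be sketched in a few lines. The speaker IDs and WER values below are invented purely for illustration, and the function name is hypothetical; the actual ranking used a baseline model's WER per speaker as described above.

```python
def split_clean_other(speaker_wer):
    """Split speakers into (clean, other) at the midpoint of a WER ranking.

    speaker_wer maps a speaker id to that speaker's baseline WER.
    """
    ranked = sorted(speaker_wer, key=speaker_wer.get)  # lowest WER first
    mid = len(ranked) // 2
    return ranked[:mid], ranked[mid:]

# Invented example values, not taken from the corpus.
example = {"spk_a": 4.2, "spk_b": 19.5, "spk_c": 7.8, "spk_d": 31.0}
clean, other = split_clean_other(example)
print(clean, other)
```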
The complete LibriSpeech corpus totals approximately 982 hours of speech from 2,484 unique speakers [1] reading 5,466 chapters from LibriVox audiobooks. The data is organized into seven subsets across training, development, and test partitions [1].
| Subset | Hours | Speakers | Utterances | Description |
|---|---|---|---|---|
| train-clean-100 | ~100 | 251 | 28,539 | Clean training data (smaller subset) |
| train-clean-360 | ~360 | 921 | 104,014 | Clean training data (larger subset) |
| train-other-500 | ~500 | 1,166 | 148,688 | More challenging training data |
| dev-clean | ~5.4 | 40 | 2,703 | Clean development/validation set |
| dev-other | ~5.3 | 33 | 2,864 | Challenging development/validation set |
| test-clean | ~5.4 | 40 | 2,620 | Clean test set |
| test-other | ~5.1 | 33 | 2,939 | Challenging test set |
| Total | ~982 | 2,484 | 292,367 | |
The hours and speaker counts follow Table 1 of the original paper, which gives exact durations of 100.6, 363.6, and 496.7 hours for the three training sets [1], and the per-subset utterance counts match the official Hugging Face distribution of the corpus [15]. The development and test sets each contain approximately 5 hours of audio. For these evaluation sets, 40 speakers (20 male and 20 female) were selected for the clean partition, and 33 speakers were selected for the other partition, with roughly 8 minutes of speech from each speaker [1]. The speakers in the development and test sets are entirely disjoint from the training set speakers, ensuring unbiased evaluation [1].
All audio in LibriSpeech is stored in FLAC (Free Lossless Audio Codec) format at a 16 kHz sampling rate with 16-bit resolution. Each audio file corresponds to a single utterance and is named using the convention {speaker_id}-{chapter_id}-{utterance_id}.flac. The transcriptions are stored in plain text files alongside the audio.
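Because the file name encodes the speaker, chapter, and utterance, these identifiers can be recovered directly from a path. The sketch below assumes the naming convention stated above; the `UtteranceId` type, the function name, and the example path are illustrative.

```python
from pathlib import Path
from typing import NamedTuple

class UtteranceId(NamedTuple):
    speaker_id: int
    chapter_id: int
    utterance_id: str  # kept as a string to preserve zero padding

def parse_utterance_path(path: str) -> UtteranceId:
    """Split '{speaker_id}-{chapter_id}-{utterance_id}.flac' into its parts."""
    stem = Path(path).stem  # file name without directory or extension
    speaker, chapter, utterance = stem.split("-")
    return UtteranceId(int(speaker), int(chapter), utterance)

# Illustrative path; the directory layout shown here is an assumption.
print(parse_utterance_path("1089/134686/1089-134686-0000.flac"))
```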
LibriSpeech includes a diverse set of English speakers, though the corpus does not provide detailed demographic annotations beyond speaker identity. The speakers are predominantly native English speakers from various regions, reflecting the volunteer base of the LibriVox project. Each speaker contributed between a few minutes and several hours of recorded speech, with the training data capped at approximately 25 minutes per speaker in the train-clean-100 subset [1].
In addition to the acoustic data, the LibriSpeech authors prepared extensive language model training resources. They collected approximately 803 million tokens of text from 14,500 Project Gutenberg books, which were normalized and used to train several n-gram language models [1]. To prevent contamination, every book underlying the development and test sets was excluded from this text, along with any candidate book flagged by a title-similarity check or by an inverted index of shared 5-grams [1]. These pre-built language models were distributed alongside the corpus to facilitate reproducible research.
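The shared 5-gram check can be pictured as an inverted index from 5-grams to the held-out books that contain them. The following is a minimal sketch under assumptions: the function names are hypothetical, the toy "books" are single-letter words, and no decision threshold is applied because the source does not state one.

```python
from collections import defaultdict

def five_grams(words, n=5):
    """Return the set of n-word tuples in a word list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(held_out_books):
    """Map each 5-gram to the ids of the held-out books that contain it."""
    index = defaultdict(set)
    for book_id, text in held_out_books.items():
        for gram in five_grams(text.split()):
            index[gram].add(book_id)
    return index

def overlapping_books(candidate_text, index):
    """Count the 5-grams a candidate book shares with each held-out book."""
    counts = defaultdict(int)
    for gram in five_grams(candidate_text.split()):
        for book_id in index.get(gram, ()):
            counts[book_id] += 1
    return dict(counts)

# Toy data: each letter stands in for a word.
index = build_index({"book_a": "A B C D E F G", "book_b": "H I J K L"})
print(overlapping_books("X A B C D E F Y", index))
```

A candidate with a high shared count would then be reviewed or excluded from the language model text; how many shared 5-grams should trigger exclusion is a design choice left open here.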
Among the released language model resources, the 3-gram model has a perplexity of 170 on the evaluation sets and the 4-gram model a perplexity of around 150, with an out-of-vocabulary rate of approximately 0.4 percent [1].
Kaldi-ASR recipes were also released alongside the corpus, providing complete scripts for building competitive baseline ASR systems using the LibriSpeech data [1][11]. These recipes significantly lowered the barrier to entry for researchers new to speech recognition.
LibriSpeech uses Word Error Rate (WER) as its primary evaluation metric. WER is computed as the edit distance between the hypothesized transcription and the reference transcription, normalized by the number of words in the reference. Specifically:
WER = (Substitutions + Insertions + Deletions) / Total Reference Words × 100%
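The formula can be turned into a short word-level edit-distance routine. This is a minimal sketch for a single utterance with a hypothetical function name; it assumes both strings are already normalized and whitespace-tokenized. Published LibriSpeech scores are normally computed over a whole test set by summing errors and reference words across utterances, not by averaging per-utterance values.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate in percent for one reference/hypothesis pair."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One substitution (SAT -> SIT) and one deletion (THE) over six reference words.
print(round(word_error_rate("THE CAT SAT ON THE MAT", "THE CAT SIT ON MAT"), 2))
```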
The standard practice is to report WER separately on four evaluation sets: dev-clean, dev-other, test-clean, and test-other. Most published results focus on test-clean and test-other, with the gap between the two scores serving as an indicator of a model's robustness to speaker and recording variability.
The original 2015 paper by Panayotov et al. reported baseline results using traditional Gaussian Mixture Model (GMM) and Deep Neural Network (DNN) acoustic models built with the Kaldi toolkit [1][11]. These baselines established initial performance targets for the corpus.
| Model | Training Data | test-clean WER (%) | test-other WER (%) |
|---|---|---|---|
| SAT (GMM) | 460h (clean) | 8.34 | 28.11 |
| DNN (p-norm) | 460h (clean) | 5.78 | 19.12 |
| SAT (GMM) | 960h (all) | 8.04 | 22.65 |
| DNN (p-norm) | 960h (all) | 5.51 | 13.97 |
All values are obtained with rescoring by the full 4-gram language model [1]. The SAT (Speaker-Adapted Training) models used GMM-HMM systems with speaker-level feature transforms (fMLLR). The DNN models used networks with p-norm nonlinearities trained on fMLLR features [1]. Notably, the paper demonstrated that acoustic models trained on LibriSpeech generalized well to other domains: as the authors put it, "acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself" [1].
Since 2015, the word error rates on LibriSpeech have dropped dramatically, driven by advances in end-to-end modeling, self-supervised pre-training, data augmentation, and large-scale weak supervision. The following table summarizes notable results across the history of the benchmark.
| Year | Model / System | test-clean WER (%) | test-other WER (%) | Key Innovation |
|---|---|---|---|---|
| 2015 | Deep Speech 2 (Baidu) [2] | 5.33 | 13.25 | End-to-end RNN with CTC |
| 2015 | LibriSpeech Baseline DNN [1] | 5.51 | 13.97 | Kaldi DNN with p-norm |
| 2018 | TDNN-F (Kaldi) | 3.80 | 8.76 | Factorized TDNN with lattice-free MMI |
| 2019 | SpecAugment [7] | 2.5 | 5.8 | Simple data augmentation for spectrograms |
| 2019 | End-to-end (Semi-supervised) | 2.0 | 4.1 | Pre-training with unlabeled data |
| 2020 | Conformer (Google) [3] | 1.9 | 3.9 | Convolution-augmented Transformer |
| 2020 | ContextNet [16] | 1.9 | 4.1 | CNN-RNN-Transducer |
| 2020 | wav2vec 2.0 (960h fine-tuned) [4] | 1.8 | 3.3 | Self-supervised learning at scale |
| 2021 | HuBERT X-Large [5] | 1.8 | 2.9 | Hidden-unit BERT for speech |
| 2022 | Whisper Large (zero-shot) [6] | 2.7 | 5.2 | 680,000h weakly supervised training |
| 2024 | Whisper Large v3 Turbo | ~2.5 | ~4.5 | Distilled large-scale model |
| 2024 | NVIDIA Parakeet RNNT 1.1B [17] | 1.46 | 2.5 | Conformer encoder with RNN-T decoder |
| 2025 | NVIDIA Canary Qwen 2.5B [18] | 1.6 | 3.1 | Conformer encoder with LLM decoder |
| 2025 | NVIDIA Parakeet TDT 0.6B v2 [19] | 1.69 | 3.19 | 600M-parameter FastConformer with token-and-duration transducer |
Several observations stand out from this progression. First, the WER on test-clean dropped from approximately 5.5% in 2015 to below 2% by 2020, a relative reduction of roughly 65% in five years. Second, the gap between test-clean and test-other performance has narrowed considerably, indicating that modern systems are more robust to challenging acoustic conditions. Third, self-supervised learning methods like wav2vec 2.0 [4] and HuBERT [5] achieved state-of-the-art results while requiring far less labeled data than their predecessors, fundamentally changing the economics of ASR model development.
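The size of these improvements can be checked against the table with simple arithmetic. The sketch below uses the table's figures for the 2015 Kaldi DNN baseline (5.51 / 13.97) and the 2020 Conformer result (1.9 / 3.9); the function name is illustrative.

```python
def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent."""
    return 100.0 * (old_wer - new_wer) / old_wer

baseline_clean, baseline_other = 5.51, 13.97   # Kaldi DNN baseline, 2015
conformer_clean, conformer_other = 1.9, 3.9    # Conformer, 2020

print(round(relative_reduction(baseline_clean, conformer_clean), 1))  # test-clean
print(round(relative_reduction(baseline_other, conformer_other), 1))  # test-other
# Absolute clean/other gap, in WER points, before and after.
print(round(baseline_other - baseline_clean, 2), round(conformer_other - conformer_clean, 2))
```

With these inputs the test-clean reduction comes out at about 65.5%, and the absolute gap between the two test sets shrinks from 8.46 to 2.0 WER points.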
A human baseline provides an important reference point for interpreting LibriSpeech results. Amodei et al. (2015) reported human WER of 5.83% on test-clean and 12.69% on test-other in the Deep Speech 2 paper [2], though these numbers have been debated. More careful human transcription experiments suggest that expert transcribers achieve approximately 2-4% WER on test-clean. By this measure, the best machine systems have reached or surpassed human-level accuracy on clean read speech, while performance on the more challenging "other" subset continues to improve.
Since 2023, the Hugging Face Open ASR Leaderboard has become the main venue for comparing English speech recognizers, and LibriSpeech test-clean and test-other are two of the evaluation sets in its English short-form track. A 2025 paper by the leaderboard's maintainers describes the platform as comparing 86 open-source and proprietary systems across 12 datasets, with standardized text normalization and joint reporting of WER and inverse real-time factor (RTFx) so that accuracy and speed can be weighed together [20].
NVIDIA's NeMo speech models repeatedly led the leaderboard's English track in 2024 and 2025. The Parakeet family, developed with Suno.ai, first topped the leaderboard in early 2024 [25]. Parakeet TDT 0.6B v2, a 600-million-parameter FastConformer model with a token-and-duration transducer decoder released under a CC BY 4.0 license in May 2025, reported 1.69 percent WER on test-clean and 3.19 percent on test-other while transcribing audio roughly 3,386 times faster than real time at batch size 128; it was trained on about 120,000 hours of English speech from NVIDIA's Granary dataset, of which around 10,000 hours are human-transcribed and 110,000 hours pseudo-labeled [19]. Canary-Qwen-2.5B, released in July 2025, couples a FastConformer encoder to the Qwen3-1.7B language model through a linear projection with LoRA adaptation; trained on approximately 234,500 hours of English speech, it reported 1.60 percent WER on test-clean, 3.10 percent on test-other, and a 5.63 percent average WER across the leaderboard's English test sets [18].
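To give a sense of scale for the speed figure, the inverse real-time factor (RTFx) is the ratio of audio duration to processing time. The sketch below uses hypothetical function names and treats the reported 3,386 value as a throughput figure measured with batching, so it should not be read as single-utterance latency.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds

def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Compute time implied by a given RTFx for a given amount of audio."""
    return audio_seconds / rtfx_value

one_hour = 3600.0
print(f"{processing_time(one_hour, 3386):.2f} s per hour of audio")
```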
The clean subset now shows clear saturation at the top: leading systems differ on test-clean by tenths of a percentage point, and multi-domain leaderboard averages have largely replaced LibriSpeech-only comparisons as the headline measure of English ASR progress [20]. LibriSpeech nonetheless remains the field's most established single evaluation target: the Semantic Scholar record for the original paper passed 8,000 citations by June 2026 [24], and the corpus's Hugging Face mirror records roughly 100,000 downloads per month, with more than 390 hosted models listing it as training or fine-tuning data [15].
Baidu's Deep Speech 2, published in late 2015, was one of the first end-to-end systems evaluated on LibriSpeech. It used a deep recurrent neural network with Connectionist Temporal Classification (CTC) loss and batch normalization. Deep Speech 2 achieved 5.33% WER on test-clean and 13.25% on test-other, establishing an early neural baseline [2].
Published by Google Brain in 2019, SpecAugment introduced a remarkably simple data augmentation technique for speech recognition. By applying random time warping, frequency masking, and time masking to log-mel spectrograms during training, SpecAugment achieved 2.5% WER on test-clean and 5.8% on test-other using a Listen, Attend and Spell (LAS) model [7]. The simplicity and effectiveness of SpecAugment made it a standard component in subsequent ASR systems.
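The two masking operations are simple enough to sketch with the standard library. The example below is an illustration, not the published implementation: the function name and default mask sizes are assumptions, the spectrogram is a plain list of rows (one per frequency bin), masked values are set to zero, and time warping is omitted.

```python
import random

def mask_spectrogram(spec, num_freq_masks=2, max_freq_width=27,
                     num_time_masks=2, max_time_width=100, seed=None):
    """Return a masked copy of a spectrogram given as rows of frequency bins.

    Illustrative sketch: mask counts and widths are assumed defaults,
    and time warping is not implemented.
    """
    rng = random.Random(seed)
    out = [row[:] for row in spec]  # copy so the input is left unchanged
    n_freq, n_time = len(out), len(out[0])
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(max_freq_width, n_freq))
        start = rng.randint(0, n_freq - width)
        for f in range(start, start + width):
            out[f] = [0.0] * n_time          # zero out a band of frequency bins
    for _ in range(num_time_masks):
        width = rng.randint(0, min(max_time_width, n_time))
        start = rng.randint(0, n_time - width)
        for row in out:
            row[start:start + width] = [0.0] * width  # zero out a span of frames
    return out

spec = [[1.0] * 300 for _ in range(80)]  # dummy 80-bin, 300-frame input
masked = mask_spectrogram(spec, seed=0)
print(sum(v == 0.0 for row in masked for v in row), "masked cells")
```

In a training loop the masks would be redrawn for every example, so the model rarely sees the same corrupted input twice.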
The Conformer architecture, introduced by Gulati et al. (2020) at Google, combined the global modeling capability of self-attention mechanisms with the local feature extraction strengths of convolutions. This hybrid approach proved highly effective for speech recognition, achieving 1.9% WER on test-clean and 3.9% on test-other [3]. The Conformer architecture became the foundation for most subsequent state-of-the-art ASR systems, including NVIDIA's Parakeet family of models.
Developed by Meta AI (formerly Facebook AI Research), wav2vec 2.0 introduced a self-supervised pre-training framework for speech. The model learned speech representations by solving a contrastive task over quantized latent speech representations, then fine-tuned on labeled data. When pre-trained on 53,000 hours of unlabeled audio from Libri-Light [9] and fine-tuned on LibriSpeech's 960 hours of labeled data, wav2vec 2.0 achieved 1.8% WER on test-clean and 3.3% on test-other [4]. As the paper states, "experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets" [4]. Remarkably, the authors also report that "using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER" [4], demonstrating that the bulk of speech recognition skill can be learned from unlabeled audio alone.
HuBERT (Hidden-Unit BERT), also from Meta AI, extended the self-supervised learning approach by using an offline clustering step to provide pseudo-labels for a BERT-like pre-training objective. The HuBERT X-Large model, with roughly one billion parameters, achieved 1.8% WER on test-clean and 2.9% on test-other [5], setting a new record on the more challenging test-other subset at the time of publication in 2021.
OpenAI's Whisper, released in September 2022, took a fundamentally different approach to achieving robust speech recognition. Instead of self-supervised pre-training followed by fine-tuning, Whisper was trained in a weakly supervised manner on approximately 680,000 hours of audio paired with transcriptions collected from the internet [6]. The OpenAI authors report that "when scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning" [6]. The largest Whisper model (Large) achieved 2.7% WER on test-clean and 5.2% on test-other in a zero-shot setting (without any fine-tuning on LibriSpeech data) [6]. While these numbers were not state-of-the-art on LibriSpeech specifically, Whisper's strength lay in its exceptional robustness across diverse domains, languages, and acoustic conditions.
NVIDIA's Parakeet family of models, built on Conformer encoders paired with various decoders (CTC, RNN-Transducer, Token-and-Duration Transducer), achieved leading results on LibriSpeech. The Parakeet RNNT 1.1B model reached 1.46% WER on test-clean [17], while the more recent Canary Qwen 2.5B model, which pairs a Conformer encoder with a large language model decoder, achieved 1.6% WER on test-clean and 3.1% on test-other [18], representing some of the best reported results on the benchmark as of early 2026.
LibriSpeech's most significant contribution has been the standardization of ASR evaluation. Before its release, the field lacked a universally accepted, freely available benchmark at sufficient scale. Researchers reported results on different datasets with different evaluation protocols, making direct comparisons between systems difficult. LibriSpeech provided a common ground, and its dual clean/other evaluation paradigm offered a more nuanced assessment than a single test set could provide.
The combination of free audio data, prepared language models, and complete Kaldi recipes meant that any researcher could reproduce the baseline results and build upon them. This reproducibility was a major factor in LibriSpeech's adoption and helped accelerate progress in the field.
LibriSpeech played a central role in the development of self-supervised speech representation learning. The 960-hour training set became the standard pre-training and fine-tuning dataset for models like wav2vec 2.0, HuBERT, WavLM, and data2vec [4][5]. These models demonstrated that large amounts of unlabeled speech, combined with small amounts of labeled LibriSpeech data, could match or exceed the performance of fully supervised systems trained on the complete labeled corpus.
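LibriSpeech's design principles and construction methodology influenced the creation of numerous follow-on datasets, including Libri-Light [9], Multilingual LibriSpeech (MLS) [10], LibriSpeech-PC, and Libriheavy [21][23].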
Despite its widespread use, LibriSpeech has several well-known limitations that researchers should consider when interpreting results.
LibriSpeech consists entirely of read speech from audiobooks. It does not include spontaneous conversation, accented speech from non-native speakers, speech with disfluencies and fillers, or speech in noisy real-world environments. ASR systems that perform well on LibriSpeech may not generalize to more challenging real-world conditions. This limitation has motivated the development of complementary benchmarks such as the CHiME challenges and the Switchboard/Fisher corpora for conversational speech.
While LibriSpeech includes a range of recording conditions through its clean/other split, the acoustic diversity is still limited compared to real-world deployment scenarios. The audio was recorded by individuals reading in relatively quiet environments, which does not capture the full range of background noise, reverberation, and channel effects encountered in practice.
LibriSpeech covers only English speech, limiting its utility for multilingual or cross-lingual speech recognition research. This limitation was partially addressed by the Multilingual LibriSpeech (MLS) dataset, which extended the LibriSpeech methodology to eight languages [10].
As state-of-the-art models have pushed WER on test-clean below 2%, approaching or surpassing estimated human-level performance, the benchmark has become increasingly saturated for the clean subset. Differences between top-performing systems on test-clean are often too small to be statistically significant, making it harder to distinguish meaningful improvements. The test-other subset remains more discriminative, but even there, the gap between systems has narrowed considerably.
Because all punctuation is removed and text is uppercased, LibriSpeech does not evaluate a system's ability to produce naturally formatted output with capitalization, punctuation, and number formatting. This has become increasingly relevant as end-to-end ASR systems are deployed in applications where users expect properly formatted transcriptions. The LibriSpeech-PC benchmark and the Libriheavy corpus were both created in part to address this gap [21][23].
LibriSpeech is freely available for download from the Open Speech and Language Resources (OpenSLR) website at openslr.org/12 [14]. It is distributed under a Creative Commons Attribution 4.0 (CC BY 4.0) license, so it can be used without charge for both academic and commercial purposes provided the source is credited [1][14]. The corpus is distributed as a set of compressed tar archives, one for each subset.
| Subset | File Size |
|---|---|
| train-clean-100 | 6.3 GB |
| train-clean-360 | 23 GB |
| train-other-500 | 30 GB |
| dev-clean | 337 MB |
| dev-other | 314 MB |
| test-clean | 346 MB |
| test-other | 328 MB |
The dataset is also available through popular machine learning data platforms including Hugging Face Datasets [15], TensorFlow Datasets, and PyTorch (torchaudio). These integrations allow researchers to load and use LibriSpeech data with a few lines of code, further lowering the barrier to experimentation.
LibriSpeech can be loaded directly using the Hugging Face datasets library [15]:
```python
from datasets import load_dataset

# Load the test-clean split
dataset = load_dataset("openslr/librispeech_asr", "clean", split="test")
```
The PyTorch audio library provides native support for LibriSpeech:
```python
import torchaudio

# Download (if needed) and load the test-clean subset under ./data
dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data",
    url="test-clean",
    download=True,
)
```
Each example in the dataset contains:
| Field | Type | Description |
|---|---|---|
| file | string | Path to the FLAC audio file |
| audio | dict | Decoded audio waveform array and sampling rate (16 kHz) |
| text | string | Uppercase transcription without punctuation |
| id | string | Unique utterance identifier |
| speaker_id | integer | Unique speaker identifier |
| chapter_id | integer | Audiobook chapter identifier |
LibriSpeech is part of a broader ecosystem of ASR benchmarks. Other commonly used speech recognition benchmarks include: