HuBERT
Last reviewed
May 31, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 3,018 words
HuBERT (Hidden-Unit BERT) is a self-supervised learning model for speech representation, introduced by researchers at Meta AI (then Facebook AI Research) in 2021. It learns useful representations of spoken audio directly from raw, unlabeled waveforms by predicting cluster assignments, called hidden units, for masked portions of the signal. The approach borrows the masked-prediction idea from BERT in natural language processing and adapts it to continuous audio, where there is no ready-made vocabulary of discrete tokens. After self-supervised pre-training, a HuBERT model can be fine-tuned with a small amount of transcribed audio to perform automatic speech recognition, or used as a frozen feature extractor for many other speech tasks. The most widely downloaded checkpoint, facebook/hubert-large-ls960-ft, is a large model fine-tuned on 960 hours of LibriSpeech and is distributed through Hugging Face.
HuBERT was presented in the paper "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units" by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. It matched or surpassed the contemporaneous state of the art set by Wav2Vec 2.0 on standard recognition benchmarks, and it became one of the strongest entries on the SUPERB benchmark for general-purpose speech representations. It should not be confused with the Hungarian language model also informally called huBERT; the two share a naming pun on BERT but are unrelated.
The central problem HuBERT was designed to solve is that speech is hard to model in a self-supervised way for reasons that text does not share. The paper frames this as three difficulties. First, each spoken utterance contains many sound units rather than a single label. Second, no lexicon (a dictionary of the units) is available during pre-training, so the model cannot simply be told what the target symbols are. Third, sound units have variable lengths and there are no clear boundaries marking where one ends and the next begins. A model that wants to learn from raw audio without transcripts has to cope with all three at once.
HuBERT's answer is to manufacture the missing labels with a simple, separate clustering step, then learn from them. Before training the main network, the method runs k-means clustering over acoustic features extracted from the training audio. Each short frame of audio is assigned to the nearest cluster, and the cluster index becomes a discrete pseudo-label for that frame. These cluster indices are the hidden units that give the model its name. With a sequence of discrete targets now available for every utterance, the network is trained to predict those targets for frames that have been hidden behind a mask, exactly as a masked language model predicts hidden words from their surrounding context.
A point the authors stress is that the quality of the clusters matters less than their consistency. The k-means teacher does not need to recover linguistically meaningful units; it only needs to assign similar sounds to the same cluster reliably. The representation that the transformer learns while trying to predict these noisy labels turns out to be far better than the labels themselves, and that learned representation can then be used to produce cleaner labels for a second round of training. This bootstrap, in which weak labels yield a better model that in turn yields better labels, is the engine behind the method.
The target labels are produced offline, meaning they are computed once and stored before the masked-prediction training begins, rather than being generated on the fly. In the first iteration the features fed to k-means are 39-dimensional Mel-frequency cepstral coefficients (MFCCs): 13 base coefficients together with their first and second-order time derivatives. These are classic, hand-engineered acoustic features that capture the coarse spectral shape of each frame. The paper runs k-means with 100 clusters over these MFCC vectors, so the first set of hidden units is a quantization of relatively shallow acoustic information.
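The offline labeling step can be illustrated with a minimal k-means sketch in plain Python. It is not the paper's implementation: the toy frames are 2-dimensional stand-ins for 39-dimensional MFCC vectors, two clusters stand in for 100, and the evenly spaced initialization is an assumption chosen only to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centroids):
    """Index of the centroid closest to this frame: the frame's pseudo-label."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))


def kmeans(frames, num_clusters, iterations=10):
    """Plain Lloyd's algorithm with a deterministic, evenly spaced initialization."""
    step = len(frames) // num_clusters
    centroids = [list(frames[k * step]) for k in range(num_clusters)]
    for _ in range(iterations):
        labels = [nearest(f, centroids) for f in frames]
        for k in range(num_clusters):
            members = [f for f, lab in zip(frames, labels) if lab == k]
            if members:  # leave an empty cluster's centroid where it was
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Toy "acoustic frames": two tight groups standing in for two kinds of sound.
frames = [[0.00 + 0.01 * i, 0.00 - 0.01 * i] for i in range(5)] + \
         [[5.00 + 0.01 * i, 5.00 - 0.01 * i] for i in range(5)]

centroids = kmeans(frames, num_clusters=2)
# Labels are computed once and stored, before any masked-prediction training.
pseudo_labels = [nearest(f, centroids) for f in frames]
print(pseudo_labels)
```

The stored `pseudo_labels` list plays the role of the hidden-unit sequence for an utterance: one discrete index per frame, fixed before the network is trained.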
Because the network learns better representations than MFCCs, later iterations switch the clustering input from MFCCs to the model's own internal activations. In the second iteration, k-means with 500 clusters is run on the latent features taken from an intermediate transformer layer of the HuBERT model trained in the first iteration; the authors use the output of the sixth transformer layer of the Base model. The larger number of clusters reflects the richer, more discriminative features now available. The targets from this second round are used to train the final models. The paper also explores cluster ensembles, combining multiple k-means models with different numbers of clusters, which lets a single network predict targets at several granularities at once and can improve the learned representation.
With discrete targets in hand, training mirrors BERT. A subset of the audio frames is masked, and the model must predict the cluster labels of the masked frames using the unmasked context around them. Spans are masked rather than isolated frames: in the paper, roughly 8 percent of timesteps are chosen as starting points and a span of 10 consecutive frames is hidden from each one, so a sizable contiguous region of audio disappears at a time. The convolutional features for masked frames are replaced with a shared learned mask embedding before they enter the transformer.
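A short sketch of span masking follows. It assumes each frame is chosen independently as a span start with probability 0.08, which is a simplification; the reference implementation may select start positions differently. The function names are hypothetical.

```python
import random


def sample_span_mask(num_frames, start_prob=0.08, span_length=10, seed=0):
    """Return a list of booleans, True where a frame is hidden from the model."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            # Hide span_length consecutive frames, clipped at the end of the utterance.
            for t in range(start, min(start + span_length, num_frames)):
                mask[t] = True
    return mask


def expected_masked_fraction(start_prob=0.08, span_length=10):
    """Under independent starts, a frame away from the edges stays visible only
    if none of the span_length starts that could cover it was chosen."""
    return 1.0 - (1.0 - start_prob) ** span_length


mask = sample_span_mask(num_frames=500)
print(f"masked {sum(mask)} of {len(mask)} frames")
print(f"expected fraction under this assumption: {expected_masked_fraction():.3f}")
```

Because spans overlap, the share of hidden frames is much larger than the 8 percent start rate: under the independence assumption used here it is a little over half of the utterance.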
The loss is computed only over the masked positions. This detail is deliberate and important. If the model were scored on unmasked frames too, it could succeed by simply copying the input through, learning little. By restricting the loss to masked regions, the objective forces the network to infer the identity of hidden sounds from surrounding acoustic and sequential cues, which the authors describe as learning a combined acoustic and language model over continuous inputs. In practice the prediction is framed as a classification over the cluster vocabulary: the model embeds each candidate cluster, measures the similarity between its output at a masked frame and those cluster embeddings, and is trained with a cross-entropy objective to pick the correct one.
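The masked-only objective can be sketched as follows. The use of cosine similarity and a temperature of 0.1 are assumptions for illustration, as are the toy vectors; the article only states that similarities to cluster embeddings feed a cross-entropy loss.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def masked_prediction_loss(outputs, targets, mask, cluster_embeddings, temperature=0.1):
    """Mean cross-entropy over masked frames only; unmasked frames are skipped."""
    total, count = 0.0, 0
    for out, target, is_masked in zip(outputs, targets, mask):
        if not is_masked:
            continue
        logits = [cosine(out, e) / temperature for e in cluster_embeddings]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[target]
        count += 1
    return total / count if count else 0.0


# Three cluster embeddings and three frames of (hypothetical) model output.
cluster_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
targets = [0, 1, 0]           # the third frame's output points at the wrong cluster
mask = [True, True, False]    # ...but it is unmasked, so it does not count

loss = masked_prediction_loss(outputs, targets, mask, cluster_embeddings)
loss_all = masked_prediction_loss(outputs, targets, [True] * 3, cluster_embeddings)
print(loss, loss_all)
```

In the toy data the only wrong prediction sits at an unmasked frame, so it contributes nothing until the mask is widened to cover it.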
The full recipe is iterative. Iteration one clusters MFCCs into 100 units and trains a HuBERT model to predict them. Iteration two re-clusters using features from that first model, producing 500 cleaner units, and trains a fresh model on the improved targets. Each pass yields representations that more closely track the phonetic content of speech, and the paper reports that two iterations are enough to reach strong results, though additional iterations can give further small gains. Larger models in particular benefit from being trained on targets generated by an earlier, already-capable network.
HuBERT's network is built from a convolutional waveform encoder followed by a transformer encoder, the same backbone design popularized by Wav2Vec 2.0. The convolutional front end takes the raw 16 kHz waveform and is identical across all model sizes. It consists of seven temporal convolution layers, each with 512 channels, using strides of [5, 2, 2, 2, 2, 2, 2] and kernel widths of [10, 3, 3, 3, 3, 2, 2]. The cumulative downsampling factor is 320 (5 × 2^6), which turns a 16 kHz waveform into a feature sequence with one frame every 20 milliseconds, or 50 frames per second. Because the kernels overlap, each output frame draws on a receptive field of 400 samples, about 25 ms of audio.
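The frame-rate arithmetic can be checked directly from the kernel and stride lists. The sketch below assumes unpadded convolutions, which is an assumption about the implementation rather than something stated here.

```python
KERNELS = [10, 3, 3, 3, 3, 2, 2]
STRIDES = [5, 2, 2, 2, 2, 2, 2]
SAMPLE_RATE = 16_000


def total_stride(strides):
    """Overall downsampling factor: the product of the per-layer strides."""
    factor = 1
    for s in strides:
        factor *= s
    return factor


def receptive_field(kernels, strides):
    """Number of input samples that influence one output frame."""
    field = 1
    for k, s in zip(reversed(kernels), reversed(strides)):
        field = (field - 1) * s + k
    return field


def num_frames(num_samples, kernels, strides):
    """Output length of the stack, assuming no padding in any layer."""
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length


hop = total_stride(STRIDES)
field = receptive_field(KERNELS, STRIDES)
print(f"hop: {hop} samples = {1000 * hop / SAMPLE_RATE:.0f} ms")
print(f"receptive field: {field} samples = {1000 * field / SAMPLE_RATE:.0f} ms")
print(f"frames for one second of audio: {num_frames(SAMPLE_RATE, KERNELS, STRIDES)}")
```

Under the no-padding assumption, one second of audio yields slightly fewer than 50 frames, because the first and last frames need a full receptive field of input.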
The resulting feature frames are projected and passed to a stack of transformer encoder layers with self-attention, which model long-range temporal context. Positional information is supplied by a convolutional positional embedding rather than fixed sinusoids. The paper defines three sizes, summarized below.
| Configuration | Transformer layers | Hidden dimension | Parameters |
|---|---|---|---|
| HuBERT Base | 12 | 768 | about 95 million |
| HuBERT Large | 24 | 1024 | about 317 million |
| HuBERT X-Large | 48 | 1280 | about 964 million |
The Base model is trained on the 960-hour LibriSpeech corpus, while the Large and X-Large models are pre-trained on the much larger 60,000-hour Libri-Light dataset of unlabeled read speech. During self-supervised pre-training a projection layer maps transformer outputs into the space of cluster embeddings for the prediction loss. When the model is fine-tuned for recognition, that projection is discarded and replaced with a task head.
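The parameter counts in the table can be roughly reproduced from the layer counts and hidden dimensions. The estimate below is a back-of-the-envelope sketch: it assumes a feed-forward width of four times the hidden dimension and a grouped convolutional positional embedding with kernel 128 and 16 groups (neither is stated in this article), and it ignores biases, layer norms, and the pre-training projection.

```python
CONV_KERNELS = [10, 3, 3, 3, 3, 2, 2]
CONV_CHANNELS = 512


def conv_encoder_params():
    """Weights of the seven-layer waveform encoder (biases ignored)."""
    total, in_channels = 0, 1  # the raw waveform has a single channel
    for kernel in CONV_KERNELS:
        total += in_channels * CONV_CHANNELS * kernel
        in_channels = CONV_CHANNELS
    return total


def estimate_params(num_layers, hidden, ffn_multiple=4, pos_kernel=128, pos_groups=16):
    """Approximate weight count; ffn_multiple, pos_kernel, pos_groups are assumptions."""
    attention = 4 * hidden * hidden                      # Q, K, V, and output projections
    feed_forward = 2 * hidden * (ffn_multiple * hidden)  # two linear layers
    transformer = num_layers * (attention + feed_forward)
    projection = CONV_CHANNELS * hidden                  # conv features -> hidden size
    positional = hidden * (hidden // pos_groups) * pos_kernel
    return conv_encoder_params() + projection + positional + transformer


# (transformer layers, hidden dimension, parameter count reported in the table)
CONFIGS = {
    "Base": (12, 768, 95e6),
    "Large": (24, 1024, 317e6),
    "X-Large": (48, 1280, 964e6),
}

for name, (layers, hidden, reported) in CONFIGS.items():
    estimate = estimate_params(layers, hidden)
    print(f"{name}: ~{estimate / 1e6:.1f}M estimated, about {reported / 1e6:.0f}M reported")
```

The transformer stack dominates in every configuration, which is why the total grows roughly with the number of layers times the square of the hidden dimension.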
In the Hugging Face Transformers library the architecture is exposed through classes such as HubertModel, which returns raw hidden states, HubertForCTC, which adds a Connectionist Temporal Classification head for transcription, and HubertForSequenceClassification, used for utterance-level tasks like keyword spotting. Because HuBERT and Wav2Vec 2.0 share the same encoder layout, the Transformers implementation reuses the Wav2Vec 2.0 processor and much of its modeling code, and HuBERT models expect a one-dimensional float array of raw audio sampled at 16 kHz as input. The original research code was released as part of Meta's fairseq toolkit.
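A minimal transcription sketch with these classes might look like the following. It assumes the `transformers` and `torch` packages are installed and that the checkpoint can be downloaded; `validate_audio` and `transcribe` are hypothetical helper names, and the heavy imports are deferred so the input check can be used on its own.

```python
MODEL_ID = "facebook/hubert-large-ls960-ft"
EXPECTED_SAMPLE_RATE = 16_000


def validate_audio(waveform, sample_rate):
    """Check the input contract: a 1-D sequence of float samples at 16 kHz."""
    if sample_rate != EXPECTED_SAMPLE_RATE:
        raise ValueError(f"expected {EXPECTED_SAMPLE_RATE} Hz audio, got {sample_rate} Hz")
    samples = list(waveform)
    if not all(isinstance(x, float) for x in samples):
        raise ValueError("expected a one-dimensional sequence of floats")
    return samples


def transcribe(waveform, sample_rate=EXPECTED_SAMPLE_RATE):
    """Greedy CTC decoding of a single utterance."""
    # Imported lazily so validate_audio works without these packages installed.
    import torch
    from transformers import HubertForCTC, Wav2Vec2Processor

    samples = validate_audio(waveform, sample_rate)
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = HubertForCTC.from_pretrained(MODEL_ID)

    inputs = processor(samples, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)  # most likely token per frame
    return processor.batch_decode(predicted_ids)[0]
```

Audio recorded at another rate would need to be resampled to 16 kHz before calling `transcribe`; the sketch rejects it rather than resampling silently.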
HuBERT and Wav2Vec 2.0 are close cousins that solve the same problem with different objectives, and they share an almost identical encoder. The decisive difference lies in how each defines its self-supervised target. Wav2Vec 2.0 uses a contrastive objective: it quantizes the audio into discrete codes with a learned, jointly trained quantizer, masks part of the sequence, and trains the model to identify the true quantized code for each masked step against a set of distractor codes drawn from elsewhere in the utterance. The quantizer and the context network are learned together, end to end, which can make training sensitive and requires care to keep the codebook from collapsing.
HuBERT instead separates target creation from representation learning. The clustering that produces its labels happens offline, decoupled from the network being trained, and the objective is a straightforward classification (predict the right cluster) rather than a contrastive comparison against negatives. This sidesteps the need to balance a contrastive loss and to maintain a jointly trained codebook. It also makes the targets explicit and inspectable, and it lets the method improve its own labels through the iterative re-clustering described above, an option the contrastive setup does not naturally provide. The authors report that with two clustering iterations HuBERT matches or improves on Wav2Vec 2.0 across the LibriSpeech and Libri-Light fine-tuning settings, with the largest gains coming from scaling up to the billion-parameter X-Large model. On the harder dev-other and test-other evaluation subsets, the 1-billion-parameter HuBERT showed up to 19 percent and 13 percent relative reductions in word error rate, respectively. In short, the two methods reach comparable peak accuracy, but HuBERT does so with a conceptually simpler and arguably more stable training objective.
HuBERT is used in two broad modes. The first is fine-tuning for recognition. Starting from a pre-trained checkpoint, a small CTC head is added and the model is trained on labeled audio. Because pre-training has already taught the network to represent speech well, only a modest amount of transcribed data is needed; the paper demonstrates fine-tuning on splits as small as ten minutes and as large as the full 960 hours of LibriSpeech. The published facebook/hubert-large-ls960-ft model is exactly this: a Large model pre-trained on Libri-Light and fine-tuned on 960 hours of LibriSpeech, distributed under the Apache 2.0 license.
The second mode treats HuBERT as a frozen feature extractor. Its hidden states, or the discrete units obtained by clustering them, feed a downstream model while HuBERT's own weights stay fixed. This is the setting evaluated by the SUPERB benchmark, described below, and it is the basis for a family of follow-on systems. Meta's textless NLP line of work, including the Generative Spoken Language Model (GSLM), uses HuBERT-derived discrete units as a form of pseudo-text: speech is encoded into units, a language model is trained over those units, and the units are converted back to audio, allowing generation and resynthesis of spoken language with no transcripts anywhere in the pipeline. HuBERT units have likewise been used in speech resynthesis and in speech-to-speech translation research.
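Turning frozen features into a unit sequence can be sketched as nearest-centroid quantization. Collapsing consecutive repeats is an assumption about how unit sequences are commonly prepared for a unit language model, not something stated in this article, and the centroids and features are toy values.

```python
from itertools import groupby


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to a feature vector."""
    def squared_distance(c):
        return sum((x - y) ** 2 for x, y in zip(vector, c))
    return min(range(len(centroids)), key=lambda k: squared_distance(centroids[k]))


def encode_units(features, centroids):
    """Map each frame of frozen features to a discrete unit index."""
    return [nearest_centroid(f, centroids) for f in features]


def collapse_repeats(units):
    """Merge runs of identical units so each sound is written once."""
    return [unit for unit, _ in groupby(units)]


# Toy stand-ins for k-means centroids and per-frame hidden states.
centroids = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
features = [[0.1, 0.0], [0.0, 0.1], [0.9, 1.1], [1.0, 0.9], [2.1, 0.1], [0.0, 0.0]]

units = encode_units(features, centroids)
pseudo_text = collapse_repeats(units)
print(units, pseudo_text)
```

The collapsed sequence is what a downstream model would treat as pseudo-text; the encoder weights that produced `features` are never updated.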
A notable derivative is ContentVec, introduced in 2022, which adapts the HuBERT training paradigm to disentangle the linguistic content of speech from the identity of the speaker. ContentVec converts training utterances toward a canonical voice before generating teacher labels and adds a contrastive term that penalizes differences between representations of the same content spoken by different people, yielding features that capture what was said while suppressing who said it. The Hugging Face checkpoint lengyue233/content-vec-best, which redirects to this article, ports ContentVec into the Transformers library by extending HubertModel with an extra projection layer. It has become a standard content encoder for singing and speech voice conversion systems, including Retrieval-based Voice Conversion (RVC) and So-VITS-SVC, where a clean, speaker-independent content representation is exactly what is required to map one person's delivery onto another's voice. Multilingual variants such as mHuBERT have also extended the method beyond English.
On LibriSpeech, HuBERT is competitive with or ahead of Wav2Vec 2.0 across the full range of fine-tuning budgets. The table below reports word error rate (lower is better) in the format dev-clean / dev-other, drawn from the paper's results. HuBERT Base is pre-trained on 960 hours of LibriSpeech; the other models are pre-trained on Libri-Light.
| Fine-tuning data | HuBERT Base | HuBERT Large | HuBERT X-Large | Wav2Vec 2.0 Large |
|---|---|---|---|---|
| 10 minutes | 9.7 / 15.3 | 6.6 / 10.1 | 4.6 / 6.8 | 6.6 / 10.3 |
| 1 hour | 6.1 / 11.3 | 2.9 / 5.4 | 2.8 / 4.8 | 2.9 / 5.8 |
| 10 hours | 4.3 / 9.4 | 2.4 / 4.6 | 2.3 / 4.0 | 2.6 / 4.9 |
| 100 hours | 3.4 / 8.1 | 2.1 / 3.9 | 1.9 / 3.5 | 2.0 / 4.0 |
The pattern is clear: at every data level the Large and X-Large HuBERT models are even with or better than Wav2Vec 2.0 Large, and the X-Large model is strongest of all, especially on the noisier "other" subsets and in the very-low-resource ten-minute and one-hour settings where good pre-training matters most.
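The size of these gaps is easier to compare as relative reductions. The sketch below applies the usual formula, (baseline − new) / baseline, to the dev-other figures for the two Large models copied from the table above.

```python
def relative_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# dev-other WER from the table: (Wav2Vec 2.0 Large, HuBERT Large)
DEV_OTHER = {
    "10 minutes": (10.3, 10.1),
    "1 hour": (5.8, 5.4),
    "10 hours": (4.9, 4.6),
    "100 hours": (4.0, 3.9),
}

reductions = {
    budget: relative_reduction(wav2vec, hubert)
    for budget, (wav2vec, hubert) in DEV_OTHER.items()
}

for budget, value in reductions.items():
    print(f"{budget}: {100 * value:.1f}% relative WER reduction")
```

Comparing models of the same size isolates the effect of the training objective from the effect of scale, which is why the X-Large column is left out here.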
HuBERT's broader influence is most visible on SUPERB, the Speech processing Universal PERformance Benchmark introduced in 2021. SUPERB evaluates a single frozen self-supervised model across a suite of ten tasks spanning content, speaker, semantic, and prosodic aspects of speech: phoneme recognition, automatic speech recognition, keyword spotting, query-by-example spoken term detection, speaker identification, automatic speaker verification, speaker diarization, intent classification, slot filling, and emotion recognition. For each task a small task-specific head is trained on top of the frozen representation, which tests how generally useful the representation is rather than how well it can be fine-tuned end to end. HuBERT was among the top-performing models on SUPERB, leading content-oriented tasks such as phoneme recognition and intent classification by large margins with only lightweight linear prediction heads, and substantially improving speaker-related metrics over a log-mel filterbank baseline. These results established HuBERT, alongside Wav2Vec 2.0, as a reference-quality general speech encoder.
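The frozen-encoder setup can be sketched as a small trainable head over fixed features. The learned weighted sum over layers shown here is an assumption about how such heads are commonly built, and the feature values, layer weights, and head weights are toy numbers rather than anything taken from the benchmark.

```python
import math


def softmax(values):
    """Normalize a list of scores into weights that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_features, layer_logits):
    """Combine frozen features; layer_features[l][t] is layer l's vector at frame t."""
    weights = softmax(layer_logits)  # the only encoder-side values that are trained
    num_frames, dim = len(layer_features[0]), len(layer_features[0][0])
    return [
        [sum(w * layer_features[l][t][d] for l, w in enumerate(weights)) for d in range(dim)]
        for t in range(num_frames)
    ]


def mean_pool(frames):
    """Average over time to get one vector per utterance."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def linear_head(vector, weight_rows, bias):
    """A lightweight linear classifier producing one score per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight_rows, bias)]


# Two layers, two frames, two feature dimensions of (hypothetical) frozen output.
layer_features = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
combined = weighted_layer_sum(layer_features, layer_logits=[0.0, 0.0])
scores = linear_head(mean_pool(combined), weight_rows=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
print(scores)
```

Only the layer weights and the head would receive gradients in this arrangement, so any difference between encoders reflects the quality of their fixed representations.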
HuBERT inherits several constraints from its design. The multi-stage recipe is more involved than a single end-to-end objective: each iteration requires extracting features, running k-means over a large corpus, storing the resulting labels, and then training a network, which adds engineering steps and compute. Pre-training the Large and X-Large models on tens of thousands of hours of audio is expensive and out of reach for most groups without substantial hardware, although the released checkpoints let practitioners skip that cost.
The published models were trained on English read speech from audiobooks (LibriSpeech and Libri-Light), so their representations are biased toward clean, well-articulated English. Performance degrades on conversational or spontaneous speech, on noisy or far-field recordings, and on other languages, which is part of why multilingual successors such as mHuBERT were developed. Inputs must be 16 kHz audio, and the model has a fixed temporal resolution of one frame every 20 ms. As with other self-supervised speech systems, the discrete units capture phonetic content but also entangle speaker, channel, and prosodic information unless an additional method such as ContentVec is applied to disentangle them, which limits the usefulness of raw HuBERT units for applications that need speaker-invariant content. Finally, like any model trained on a specific data distribution, HuBERT can reflect the demographic and acoustic biases of its training corpus.
HuBERT emerged from Facebook AI Research in mid-2021, part of a concentrated effort at the lab on self-supervised speech that also produced Wav2Vec 2.0 in 2020. The paper was posted to arXiv on June 14, 2021, as arXiv:2106.07447, and was subsequently published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing. The reference implementation and pre-trained checkpoints were released in Meta's open-source fairseq sequence-modeling toolkit, and the models were later integrated into the Hugging Face Transformers library, where the implementation was contributed by Patrick von Platen and reused the existing Wav2Vec 2.0 code paths.
The model arrived during a period when masked prediction, already dominant in text through BERT and taking hold in images through masked autoencoders, was being extended to every modality, and HuBERT became the canonical example of the idea for speech. Its discrete-unit framing fed directly into Meta's textless NLP research later in 2021, which treated HuBERT units as a substitute for text and opened a line of work on generative spoken language modeling. In the years since, HuBERT and its variants have remained widely used baselines and building blocks: as starting points for fine-tuned recognizers, as frozen encoders in the SUPERB evaluation, as content extractors in voice conversion pipelines through ContentVec, and as the basis for multilingual extensions. The facebook/hubert-large-ls960-ft checkpoint continues to see hundreds of thousands of downloads per month, a sign of how thoroughly the model settled into the standard speech toolkit.