HuBERT

Updated Jun 27, 2026

HuBERT (Hidden-Unit BERT) is a self-supervised learning model for speech representation, introduced by researchers at Meta AI (then Facebook AI Research) in 2021 ^[1]. It learns useful representations of spoken audio directly from raw, unlabeled waveforms by predicting cluster assignments, called hidden units, for masked portions of the signal. This approach matched or improved on the state-of-the-art Wav2Vec 2.0 results on the LibriSpeech (960 hours) and Libri-Light (60,000 hours) benchmarks. With a 1-billion-parameter model, it achieved relative word error rate reductions of up to 19 percent on the harder dev-other subset and 13 percent on test-other ^[1]. The method borrows the masked-prediction idea from BERT in natural language processing and adapts it to continuous audio, where there is no ready-made vocabulary of discrete tokens.

After self-supervised pre-training, a HuBERT model can be fine-tuned with a small amount of transcribed audio to perform automatic speech recognition, or used as a frozen feature extractor for many other speech tasks. The most widely downloaded checkpoint, facebook/hubert-large-ls960-ft, is a large model fine-tuned on 960 hours of LibriSpeech and is distributed through Hugging Face ^[2].

HuBERT was presented in the paper "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units" by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed ^[1]. It became one of the strongest entries on the SUPERB benchmark for general-purpose speech representations ^[4]. It should not be confused with the Hungarian language model also informally called huBERT; the two share a naming pun on BERT but are unrelated.

What is HuBERT?

The central problem HuBERT was designed to solve is that speech is hard to model in a self-supervised way for reasons that text does not share. The paper frames this as three difficulties ^[1]. First, each spoken utterance contains many sound units rather than a single label. Second, there is no lexicon, no dictionary of the units, available during pre-training, so the model cannot simply be told what the target symbols are. Third, sound units have variable lengths and there are no clear boundaries marking where one ends and the next begins. As the authors put it, "Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation" ^[1]. A model that wants to learn from raw audio without transcripts has to cope with all three at once.

HuBERT's answer is to manufacture the missing labels with a simple, separate clustering step, then learn from them. Before training the main network, the method runs k-means clustering over acoustic features extracted from the training audio. Each short frame of audio is assigned to the nearest cluster, and the cluster index becomes a discrete pseudo-label for that frame. These cluster indices are the hidden units that give the model its name. With a sequence of discrete targets now available for every utterance, the network is trained to predict those targets for frames that have been hidden behind a mask, exactly as a masked language model predicts hidden words from their surrounding context.
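The clustering step can be illustrated with a minimal sketch. It is a toy k-means in plain Python over made-up two-dimensional "features", not the implementation used by the authors. The function names, the evenly spaced initialization, and the data are assumptions chosen for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_units(frames, centroids):
    """Return the index of the nearest centroid for every frame."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in frames
    ]

def kmeans(frames, num_units, iterations=10):
    # Assumption: evenly spaced frames as initial centroids, for determinism.
    centroids = [list(frames[i * len(frames) // num_units]) for i in range(num_units)]
    for _ in range(iterations):
        units = assign_units(frames, centroids)
        for k in range(num_units):
            members = [f for f, u in zip(frames, units) if u == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Toy stand-in for acoustic features: two well-separated groups of frames.
frames = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids = kmeans(frames, num_units=2)
hidden_units = assign_units(frames, centroids)
print(hidden_units)  # [0, 0, 0, 1, 1, 1]
```

The printed list is the sequence of pseudo-labels: one cluster index per frame, which is what the masked-prediction stage later tries to recover.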

A point the authors stress is that the quality of the clusters matters less than their consistency. The paper states that "HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels" ^[1]. The k-means teacher does not need to recover linguistically meaningful units; it only needs to assign similar sounds to the same cluster reliably. The representation that the transformer learns while trying to predict these noisy labels turns out to be far better than the labels themselves, and that learned representation can then be used to produce cleaner labels for a second round of training. This bootstrap, weak labels yielding a better model yielding better labels, is the engine behind the method.

How does HuBERT work?

Creating clustering targets

The target labels are produced offline, meaning they are computed once and stored before the masked-prediction training begins, rather than being generated on the fly. In the first iteration the features fed to k-means are 39-dimensional Mel-frequency cepstral coefficients (MFCCs): 13 base coefficients together with their first and second-order time derivatives. These are classic, hand-engineered acoustic features that capture the coarse spectral shape of each frame. The paper runs k-means with 100 clusters over these MFCC vectors, so the first set of hidden units is a quantization of relatively shallow acoustic information ^[1].
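The 39-dimensional layout (13 coefficients plus two orders of time derivatives) can be sketched as follows. The derivative here is a simple central difference with repeated edge frames, which is an assumption made for brevity. Speech toolkits usually compute deltas with a regression over a wider window, and the input values below are placeholders rather than real MFCCs.

```python
def deltas(sequence):
    """Central-difference time derivative, repeating the edge frames."""
    padded = [sequence[0]] + sequence + [sequence[-1]]
    return [
        [(nxt - prev) / 2.0 for prev, nxt in zip(padded[t], padded[t + 2])]
        for t in range(len(sequence))
    ]

def stack_with_derivatives(base):
    """Concatenate base coefficients with first and second-order deltas."""
    first = deltas(base)
    second = deltas(first)
    return [b + d1 + d2 for b, d1, d2 in zip(base, first, second)]

# Hypothetical stand-in for real MFCCs: 5 frames x 13 coefficients.
base_mfcc = [[float(t * c) for c in range(13)] for t in range(5)]
features = stack_with_derivatives(base_mfcc)
print(len(features), len(features[0]))  # 5 39
```

Each 39-dimensional row is what would be handed to k-means in the first iteration.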

Because the network learns better representations than MFCCs, later iterations switch the clustering input from MFCCs to the model's own internal activations. In the second iteration, k-means with 500 clusters is run on the latent features taken from an intermediate transformer layer of the HuBERT model trained in the first iteration; the authors use the output of the sixth transformer layer of the Base model ^[1]. The larger number of clusters reflects the richer, more discriminative features now available. The targets from this second round are used to train the final models. The paper also explores cluster ensembles, combining multiple k-means models with different numbers of clusters, which lets a single network predict targets at several granularities at once and can improve the learned representation ^[1].

Masked prediction

With discrete targets in hand, training mirrors BERT. A subset of the audio frames is masked, and the model must predict the cluster labels of the masked frames using the unmasked context around them. Spans are masked rather than isolated frames: in the paper, roughly 8 percent of timesteps are chosen as starting points and a span of 10 consecutive frames is hidden from each one, so a sizable contiguous region of audio disappears at a time ^[1]. The convolutional features for masked frames are replaced with a shared learned mask embedding before they enter the transformer.
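The span-masking rule can be sketched in a few lines. This is an illustration that samples each start position independently with the probability and span length quoted above. The function name and seed are assumptions, and the fairseq implementation differs in its details.

```python
import random

def sample_span_mask(num_frames, start_prob=0.08, span_length=10, seed=0):
    """Return a boolean list marking which frames are hidden."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            # Hide a contiguous span, clipped at the end of the utterance.
            for t in range(start, min(start + span_length, num_frames)):
                mask[t] = True
    return mask

mask = sample_span_mask(num_frames=500)
print(f"masked fraction: {sum(mask) / len(mask):.2f}")
```

Because spans may overlap, the masked share is lower than 8 percent times 10. With independent starts, an interior frame is hidden with probability 1 - 0.92^10, roughly 57 percent.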

The loss is computed only over the masked positions. This detail is deliberate and important. The authors describe it as "applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs" ^[1]. If the model were scored on unmasked frames too, it could succeed by simply copying the input through, learning little. By restricting the loss to masked regions, the objective forces the network to infer the identity of hidden sounds from surrounding acoustic and sequential cues. In practice the prediction is framed as a classification over the cluster vocabulary: the model embeds each candidate cluster, measures the similarity between its output at a masked frame and those cluster embeddings, and is trained with a cross-entropy objective to pick the correct one.
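A minimal sketch of this objective, in plain Python, is shown below. It scores each masked frame against every cluster embedding with cosine similarity, applies a softmax, and averages the cross-entropy over masked frames only. The temperature value, the function names, and the toy vectors are assumptions for illustration, and the projection layer is omitted.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def masked_prediction_loss(outputs, targets, mask, unit_embeddings, temperature=0.1):
    """Average cross-entropy over masked frames; unmasked frames are ignored."""
    total, count = 0.0, 0
    for output, target, is_masked in zip(outputs, targets, mask):
        if not is_masked:
            continue
        logits = [cosine(output, e) / temperature for e in unit_embeddings]
        log_normalizer = math.log(sum(math.exp(v) for v in logits))
        total += log_normalizer - logits[target]
        count += 1
    return total / count if count else 0.0

unit_embeddings = [[1.0, 0.0], [0.0, 1.0]]        # one embedding per cluster
outputs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]    # model outputs per frame
targets = [0, 1, 0]                               # cluster index per frame
mask = [True, True, False]                        # third frame is not masked

loss = masked_prediction_loss(outputs, targets, mask, unit_embeddings)
print(f"{loss:.5f}")
```

The third frame points at the wrong cluster, but it contributes nothing because it is unmasked, which is the behavior the paragraph above describes.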

Iterative refinement

The full recipe is iterative. Iteration one clusters MFCCs into 100 units and trains a HuBERT model to predict them. Iteration two re-clusters using features from that first model, producing 500 cleaner units, and trains a fresh model on the improved targets. Each pass yields representations that more closely track the phonetic content of speech, and the paper reports that two iterations of clustering are enough to reach strong results, though additional iterations can give further small gains ^[1]. Larger models in particular benefit from being trained on targets generated by an earlier, already-capable network.

What is the HuBERT architecture?

HuBERT's network is built from a convolutional waveform encoder followed by a transformer encoder, the same backbone design popularized by Wav2Vec 2.0. The convolutional front end takes the raw 16 kHz waveform and is identical across all model sizes. It consists of seven temporal convolution layers, each with 512 channels, using strides of [5, 2, 2, 2, 2, 2, 2] and kernel widths of [10, 3, 3, 3, 3, 2, 2]. The cumulative downsampling factor is 320 (the product of the strides), which turns a 16 kHz waveform into one feature frame every 20 milliseconds, or 50 frames per second. Each frame is computed from a window of 400 samples (25 ms), so neighboring frames overlap slightly.
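The numbers in this paragraph follow from the strides and kernel widths alone. The sketch below reproduces the arithmetic, assuming unpadded convolutions. The helper names are illustrative and are not taken from any library.

```python
SAMPLE_RATE = 16_000
STRIDES = [5, 2, 2, 2, 2, 2, 2]
KERNELS = [10, 3, 3, 3, 3, 2, 2]

def total_stride(strides):
    product = 1
    for s in strides:
        product *= s
    return product

def receptive_field(kernels, strides):
    """Number of input samples that influence one output frame."""
    field, jump = 1, 1
    for k, s in zip(kernels, strides):
        field += (k - 1) * jump
        jump *= s
    return field

def num_frames(num_samples, kernels, strides):
    """Output length of the stack, assuming no padding."""
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length

hop = total_stride(STRIDES)
window = receptive_field(KERNELS, STRIDES)
print(hop, 1000 * hop / SAMPLE_RATE)         # 320 20.0
print(window, 1000 * window / SAMPLE_RATE)   # 400 25.0
print(num_frames(SAMPLE_RATE, KERNELS, STRIDES))  # 49
```

One second of audio therefore yields 49 frames rather than exactly 50, because the unpadded convolutions drop a partial window at the end.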

The resulting feature frames are projected and passed to a stack of transformer encoder layers with self-attention, which model long-range temporal context. Positional information is supplied by a convolutional positional embedding rather than fixed sinusoids. The paper defines three sizes, summarized below ^[1].

Configuration	Transformer layers	Hidden dimension	Parameters	Pre-training data
HuBERT Base	12	768	about 95 million	LibriSpeech 960 hours
HuBERT Large	24	1024	about 317 million	Libri-Light 60,000 hours
HuBERT X-Large	48	1280	about 964 million	Libri-Light 60,000 hours

The Base model is trained on the 960-hour LibriSpeech corpus, while the Large and X-Large models are pre-trained on the much larger 60,000-hour Libri-Light dataset of unlabeled read speech ^[1]. During self-supervised pre-training a projection layer maps transformer outputs into the space of cluster embeddings for the prediction loss. When the model is fine-tuned for recognition, that projection is discarded and replaced with a task head.
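The parameter counts in the table can be roughly cross-checked from the layer count and hidden dimension. The estimate below counts only the transformer weight matrices and assumes a feed-forward width of four times the hidden dimension. It ignores biases, layer norms, the convolutional encoder, and the positional embedding, so it is expected to land somewhat under the reported totals.

```python
CONFIGS = {
    "Base": {"layers": 12, "hidden": 768, "reported_millions": 95},
    "Large": {"layers": 24, "hidden": 1024, "reported_millions": 317},
    "X-Large": {"layers": 48, "hidden": 1280, "reported_millions": 964},
}

def transformer_weights(layers, hidden, ffn_multiplier=4):
    """Approximate weight count: attention plus feed-forward matrices per layer."""
    attention = 4 * hidden * hidden  # query, key, value, and output projections
    feed_forward = 2 * hidden * (ffn_multiplier * hidden)  # assumed 4x expansion
    return layers * (attention + feed_forward)

estimates = {}
for name, cfg in CONFIGS.items():
    estimates[name] = transformer_weights(cfg["layers"], cfg["hidden"]) / 1e6
    print(f"{name}: ~{estimates[name]:.0f}M estimated vs about {cfg['reported_millions']}M reported")
```

The estimates come out near 85, 302, and 944 million, which is consistent with the table once the omitted components are added back.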

In the Hugging Face Transformers library the architecture is exposed through classes such as HubertModel, which returns raw hidden states, HubertForCTC, which adds a Connectionist Temporal Classification head for transcription, and HubertForSequenceClassification, used for utterance-level tasks like keyword spotting ^[3]. Because HuBERT and Wav2Vec 2.0 share the same encoder layout, the Transformers implementation reuses the Wav2Vec 2.0 processor and much of its modeling code, and HuBERT models expect a one-dimensional float array of raw audio sampled at 16 kHz as input ^[3]. The original research code was released as part of Meta's fairseq toolkit ^[7].
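A transcription call with these classes might look like the sketch below, which follows the pattern shown on the model card. It assumes the `transformers` and `torch` packages are installed and that the checkpoint can be downloaded. The `transcribe` wrapper is a hypothetical helper, not part of the library.

```python
MODEL_ID = "facebook/hubert-large-ls960-ft"

def transcribe(waveform, sampling_rate=16_000, model_id=MODEL_ID):
    """Transcribe a 1-D float waveform sampled at 16 kHz with a CTC-fine-tuned HuBERT."""
    # Imports are deferred so that defining the function needs neither package.
    import torch
    from transformers import HubertForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = HubertForCTC.from_pretrained(model_id)
    model.eval()

    # The processor normalizes the raw samples and builds the input tensor.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Greedy CTC decoding: take the most likely token per frame, then collapse.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe(samples)` with a list or array of 16 kHz samples returns the decoded text. Loading the model inside the function keeps the sketch short, but a real application would load it once and reuse it.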

How does HuBERT compare to Wav2Vec 2.0?

HuBERT and Wav2Vec 2.0 are close cousins that solve the same problem with different objectives, and they share an almost identical encoder. The decisive difference lies in how each defines its self-supervised target. Wav2Vec 2.0 uses a contrastive objective: it quantizes the audio into discrete codes with a learned, jointly trained quantizer, masks part of the sequence, and trains the model to identify the true quantized code for each masked step against a set of distractor codes drawn from elsewhere in the utterance. The quantizer and the context network are learned together, end to end, which can make training sensitive and requires care to keep the codebook from collapsing.
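For contrast with HuBERT's classification loss, the core of a contrastive objective of this kind can be sketched as follows. This toy version only shows the "pick the true code among distractors" term for a single masked step. The learned quantizer and any codebook regularization are omitted, and the temperature and vectors are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def contrastive_loss(context, true_code, distractors, temperature=0.1):
    """Negative log-probability of the true code among all candidates."""
    candidates = [true_code] + distractors
    logits = [cosine(context, c) / temperature for c in candidates]
    return math.log(sum(math.exp(v) for v in logits)) - logits[0]

context = [1.0, 0.0]                      # model output at a masked step
true_code = [0.9, 0.1]                    # quantized target for that step
distractors = [[0.0, 1.0], [-1.0, 0.0]]   # codes drawn from other steps

good = contrastive_loss(context, true_code, distractors)
bad = contrastive_loss(context, [0.0, 1.0], [true_code, [-1.0, 0.0]])
print(good < bad)  # True
```

The loss is low when the output is closer to the true code than to the distractors and high otherwise. Unlike a classification over a fixed cluster vocabulary, the candidate set changes for every masked step.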

HuBERT instead separates target creation from representation learning. The clustering that produces its labels happens offline, decoupled from the network being trained, and the objective is a straightforward classification (predict the right cluster) rather than a contrastive comparison against negatives. This sidesteps the need to balance a contrastive loss and to maintain a jointly trained codebook. It also makes the targets explicit and inspectable, and it lets the method improve its own labels through iterative re-clustering, an option the contrastive setup does not naturally provide. The authors report that, starting with a simple k-means teacher of 100 clusters and using two iterations of clustering, HuBERT matches or improves on Wav2Vec 2.0 across the LibriSpeech and Libri-Light fine-tuning settings, with the largest gains coming from scaling up to the billion-parameter X-Large model ^[1]. On the harder dev-other and test-other evaluation subsets, the 1-billion-parameter HuBERT showed up to 19 percent and 13 percent relative reductions in word error rate ^[1]. In short, the two methods reach comparable peak accuracy, but HuBERT does so with a conceptually simpler and arguably more stable training objective.

What is HuBERT used for?

HuBERT is used in two broad modes. The first is fine-tuning for recognition. Starting from a pre-trained checkpoint, a small CTC head is added and the model is trained on labeled audio. Because pre-training has already taught the network to represent speech well, only a modest amount of transcribed data is needed; the paper demonstrates fine-tuning on splits as small as ten minutes and as large as the full 960 hours of LibriSpeech ^[1]. The published facebook/hubert-large-ls960-ft model is exactly this: a Large model pre-trained on Libri-Light and fine-tuned on 960 hours of LibriSpeech, distributed under the Apache 2.0 license ^[2].

The second mode treats HuBERT as a frozen feature extractor. Its hidden states, or the discrete units obtained by clustering them, feed a downstream model while HuBERT's own weights stay fixed. This is the setting evaluated by the SUPERB benchmark, described below, and it is the basis for a family of follow-on systems. Meta's textless NLP line of work, including the Generative Spoken Language Model (GSLM), uses HuBERT-derived discrete units as a form of pseudo-text: speech is encoded into units, a language model is trained over those units, and the units are converted back to audio, allowing generation and resynthesis of spoken language with no transcripts anywhere in the pipeline ^[8]. HuBERT units have likewise been used in speech resynthesis and in speech-to-speech translation research.

A notable derivative is ContentVec, introduced in 2022, which adapts the HuBERT training paradigm to disentangle the linguistic content of speech from the identity of the speaker ^[5]. ContentVec converts training utterances toward a canonical voice before generating teacher labels and adds a contrastive term that penalizes differences between representations of the same content spoken by different people, yielding features that capture what was said while suppressing who said it ^[5]. The Hugging Face checkpoint lengyue233/content-vec-best, which redirects to this article, ports ContentVec into the Transformers library by extending HubertModel with an extra projection layer ^[6]. It has become a standard content encoder for singing and speech voice conversion systems, including Retrieval-based Voice Conversion (RVC) and So-VITS-SVC, where a clean, speaker-independent content representation is exactly what is required to map one person's delivery onto another's voice. Multilingual variants such as mHuBERT have also extended the method beyond English.

How well does HuBERT perform?

On LibriSpeech, HuBERT is competitive with or ahead of Wav2Vec 2.0 across the full range of fine-tuning budgets. The table below reports word error rate in percent (lower is better), in the format test-clean / test-other, drawn from the paper's low-resource results ^[1]. HuBERT Base is pre-trained on LibriSpeech, while the Large and X-Large models are pre-trained on Libri-Light.

Fine-tuning data	HuBERT Base	HuBERT Large	HuBERT X-Large	Wav2Vec 2.0 Large
10 minutes	9.7 / 15.3	6.6 / 10.1	4.6 / 6.8	6.6 / 10.3
1 hour	6.1 / 11.3	2.9 / 5.4	2.8 / 4.8	2.9 / 5.8
10 hours	4.3 / 9.4	2.4 / 4.6	2.3 / 4.0	2.6 / 4.9
100 hours	3.4 / 8.1	2.1 / 3.9	1.9 / 3.5	2.0 / 4.0

The pattern is clear: at every data level the Large and X-Large HuBERT models are even with or better than Wav2Vec 2.0 Large, and the X-Large model is strongest of all, especially on the noisier "other" subsets and in the very-low-resource ten-minute and one-hour settings where good pre-training matters most.
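Gaps like these are usually quoted as relative reductions. The sketch below applies that formula to three rows of the table, comparing the second number in each cell (the "other" subset) for Wav2Vec 2.0 Large and HuBERT X-Large. The helper name is illustrative.

```python
def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

# (Wav2Vec 2.0 Large, HuBERT X-Large) WER on the "other" subset, from the table.
rows = {"1 hour": (5.8, 4.8), "10 hours": (4.9, 4.0), "100 hours": (4.0, 3.5)}

reductions = {}
for budget, (baseline, hubert) in rows.items():
    reductions[budget] = 100 * relative_reduction(baseline, hubert)
    print(f"{budget}: {reductions[budget]:.1f}% relative reduction")
# 1 hour: 17.2% relative reduction
# 10 hours: 18.4% relative reduction
# 100 hours: 12.5% relative reduction
```

A relative reduction is measured against the baseline error, so a drop from 4.0 to 3.5 counts as 12.5 percent even though the absolute difference is only half a point.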

HuBERT's broader influence is most visible on SUPERB, the Speech processing Universal PERformance Benchmark introduced in 2021 ^[4]. SUPERB evaluates a single frozen self-supervised model across a suite of ten tasks spanning content, speaker, semantic, and paralinguistic aspects of speech: phoneme recognition, automatic speech recognition, keyword spotting, query-by-example spoken term detection, speaker identification, automatic speaker verification, speaker diarization, intent classification, slot filling, and emotion recognition ^[4]. For each task a small task-specific head is trained on top of the frozen representation, which tests how generally useful the representation is rather than how well it can be fine-tuned end to end. HuBERT was among the top-performing models on SUPERB, performing strongly on content-oriented tasks such as phoneme recognition and intent classification with only lightweight linear prediction heads, and substantially improving speaker-related metrics over a log-mel filterbank baseline ^[4]. These results established HuBERT, alongside Wav2Vec 2.0, as a reference-quality general speech encoder.

What are HuBERT's limitations?

HuBERT inherits several constraints from its design. The multi-stage recipe is more involved than a single end-to-end objective: each iteration requires extracting features, running k-means over a large corpus, storing the resulting labels, and then training a network, which adds engineering steps and compute. Pre-training the Large and X-Large models on tens of thousands of hours of audio is expensive and out of reach for most groups without substantial hardware, although the released checkpoints let practitioners skip that cost.

The published models were trained on English read speech from audiobooks, LibriSpeech and Libri-Light, so their representations are biased toward clean, well-articulated English. Performance degrades on conversational or spontaneous speech, on noisy or far-field recordings, and on other languages, which is part of why multilingual successors such as mHuBERT were developed. Inputs must be 16 kHz audio, and the model has a fixed temporal resolution of about 20 ms per frame. As with other self-supervised speech systems, the discrete units capture phonetic content but also entangle speaker, channel, and prosodic information unless an additional method such as ContentVec is applied to disentangle them, which limits the usefulness of raw HuBERT units for applications that need speaker-invariant content ^[5]. Finally, like any model trained on a specific data distribution, HuBERT can reflect the demographic and acoustic biases of its training corpus.

When was HuBERT released, and what is its history?

HuBERT emerged from Facebook AI Research in mid-2021, part of a concentrated effort at the lab on self-supervised speech that also produced Wav2Vec 2.0 in 2020. The paper was posted to arXiv on June 14, 2021, as arXiv:2106.07447, and was subsequently published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing ^[1]. The reference implementation and pre-trained checkpoints were released in Meta's open-source fairseq sequence-modeling toolkit, and the models were later integrated into the Hugging Face Transformers library, where the implementation was contributed by Patrick von Platen and reused the existing Wav2Vec 2.0 code paths ^[3]^[7].

The model arrived during a period when masked prediction, already dominant in text through BERT and in images through masked autoencoders, was being extended to every modality, and HuBERT became the canonical example of the idea for speech. Its discrete-unit framing fed directly into Meta's textless NLP research later in 2021, which treated HuBERT units as a substitute for text and opened a line of work on generative spoken language modeling ^[8]. In the years since, HuBERT and its variants have remained widely used baselines and building blocks: as starting points for fine-tuned recognizers, as frozen encoders in the SUPERB evaluation, as content extractors in voice conversion pipelines through ContentVec, and as the basis for multilingual extensions. The facebook/hubert-large-ls960-ft checkpoint continues to see hundreds of thousands of downloads per month, a sign of how thoroughly the model settled into the standard speech toolkit ^[2].

ELI5: HuBERT explained simply

Imagine you want a computer to understand spoken words, but you only have lots of audio and no written transcripts to tell it what was said. HuBERT first plays a simple sorting game: it chops the audio into tiny slices and sorts similar-sounding slices into the same bin, the way you might sort a pile of LEGO bricks by color without knowing what each color is called. Those bin numbers become temporary "labels" for the sounds. Then HuBERT plays a fill-in-the-blank game: it hides some slices of the audio and tries to guess which bin the hidden slices belong to, using the sounds around them as clues. By getting good at this guessing game, it learns a lot about how speech works. After the first round, it uses what it learned to re-sort the slices into better bins and plays the game again, getting even smarter. Once trained, it can be taught to write down speech (transcription) using only a little bit of labeled audio, or it can hand its learned "understanding" to other speech programs.

References

1. Hsu, Wei-Ning; Bolte, Benjamin; Tsai, Yao-Hung Hubert; Lakhotia, Kushal; Salakhutdinov, Ruslan; Mohamed, Abdelrahman (2021). "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units." arXiv:2106.07447. https://arxiv.org/abs/2106.07447
2. Facebook AI. "facebook/hubert-large-ls960-ft." Hugging Face model card. https://huggingface.co/facebook/hubert-large-ls960-ft
3. Hugging Face. "HuBERT." Transformers documentation. https://huggingface.co/docs/transformers/model_doc/hubert
4. Yang, Shu-wen; Chi, Po-Han; Chuang, Yung-Sung; et al. (2021). "SUPERB: Speech processing Universal PERformance Benchmark." arXiv:2105.01051. https://arxiv.org/abs/2105.01051
5. Qian, Kaizhi; Zhang, Yang; Chang, Shiyu; et al. (2022). "ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers." arXiv:2204.09224. https://arxiv.org/abs/2204.09224
6. lengyue233. "content-vec-best." Hugging Face model card. https://huggingface.co/lengyue233/content-vec-best
7. Facebook AI Research. "HuBERT." fairseq examples. https://github.com/facebookresearch/fairseq/tree/main/examples/hubert
8. Lakhotia, Kushal; Kharitonov, Eugene; Hsu, Wei-Ning; et al. (2021). "On Generative Spoken Language Modeling from Raw Audio" (textless NLP / GSLM). arXiv:2102.01192. https://arxiv.org/abs/2102.01192
