Overview
VALL-E is a zero-shot learning text-to-speech (TTS) system introduced by Microsoft Research in the January 2023 paper Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (arXiv:2301.02111).[1] The system reframes speech synthesis as a language modeling problem over discrete neural audio codec tokens rather than as a regression task on continuous mel-spectrograms, which had been the dominant paradigm in earlier neural TTS systems such as Tacotron and FastSpeech.[1]
VALL-E was trained on roughly 60,000 hours of English speech drawn from the LibriLight corpus, two to three orders of magnitude more audio than previous TTS systems used. Given a 3-second recording of an unseen speaker as an acoustic prompt and a target text, the model can synthesize speech in that speaker's voice while preserving the prompt's emotion, pacing, and acoustic environment.[1] At publication, VALL-E was the first TTS system to demonstrate strong in-context learning for voice cloning without per-speaker fine-tuning, mirroring how GPT-3 performs few-shot text generation.
Microsoft did not release model weights, training code, or a public API for VALL-E or any of its successors, citing concerns that the system could be misused for fraud, deepfake audio, scams, and bypassing voice authentication. Subsequent papers in the same line of work, including VALL-E X (March 2023), VALL-E R (June 2024), and VALL-E 2 (June 2024), extended the original architecture but were also withheld from public release.[2][3]
Authors and affiliation
The paper was written by Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei, all of Microsoft Research Asia and Microsoft Azure Speech.[1] Furu Wei led the natural language computing group at Microsoft Research Asia, which had previously worked on speech models including UniSpeech and WavLM. The team published a project page at microsoft.com/en-us/research/project/vall-e-x/ with audio samples but never released code or weights.
Architecture
VALL-E factorizes speech generation into three stages: a phoneme front end, a discrete audio tokenizer (Encodec), and two stacked Transformer language models that operate over the codec tokens.
Audio tokenization with Encodec
At the input layer, raw 24 kHz waveforms are converted into discrete tokens by Meta's Encodec neural audio codec, introduced by Defossez and colleagues in October 2022 (arXiv:2210.13438).[4] Encodec uses a streaming convolutional encoder, a residual vector quantization (RVQ) bottleneck, and a convolutional decoder that reconstructs the waveform. The 24 kHz Encodec configuration that VALL-E uses emits 75 frames per second, and each frame is represented by 8 integer tokens, one from each of 8 codebooks of size 1024 (a 6 kbps stream).[1][4]
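The token budget implied by these figures can be worked out with a few lines of arithmetic. The sketch below only restates the numbers given in the text (24 kHz input, 75 frames per second, 8 codebooks of 1024 entries); the constant and function names are illustrative and are not part of Encodec's API.

```python
import math

SAMPLE_RATE_HZ = 24_000  # input waveform rate
FRAME_RATE_HZ = 75       # codec frames per second
NUM_CODEBOOKS = 8        # RVQ levels used by VALL-E
CODEBOOK_SIZE = 1024     # entries per codebook


def samples_per_frame() -> int:
    """Waveform samples summarized by one codec frame."""
    return SAMPLE_RATE_HZ // FRAME_RATE_HZ


def bits_per_second() -> int:
    """Bitrate of the token stream: frames x codebooks x bits per token."""
    bits_per_token = int(math.log2(CODEBOOK_SIZE))
    return FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_token


def tokens_for(seconds: float) -> int:
    """Total tokens (all codebooks) needed to represent `seconds` of audio."""
    frames = round(seconds * FRAME_RATE_HZ)
    return frames * NUM_CODEBOOKS


print(samples_per_frame())  # 320
print(bits_per_second())    # 6000
print(tokens_for(3.0))      # 1800
```

A 3-second prompt is therefore 225 frames, or 1,800 tokens across all eight codebooks.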
Residual vector quantization is the key choice: the first codebook captures the most prominent acoustic content (roughly the coarse spectral envelope and prosody), while subsequent codebooks encode finer residual detail and high-frequency information.[4] This decomposition lets VALL-E separate the slow autoregressive content prediction from the fast parallel detail prediction.
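The coarse-to-fine behavior of residual vector quantization can be seen in a toy example. The sketch below is not Encodec's implementation: it uses hand-built 2-D grid codebooks (a hypothetical `make_grid` helper) instead of learned ones, and three stages instead of eight, purely to show each stage quantizing what the previous stage left over.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]


def make_grid(step: float) -> List[Vector]:
    """Toy 9-entry codebook: a 3x3 grid of 2-D points spaced `step` apart."""
    return [(i * step, j * step) for i in (-1, 0, 1) for j in (-1, 0, 1)]


def nearest(codebook: Sequence[Vector], x: Sequence[float]) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(x, codebook[k])),
    )


def rvq_encode(x: Sequence[float], codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    tokens = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        tokens.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Sum the selected entries; fewer tokens give a coarser reconstruction."""
    out = [0.0, 0.0]
    for index, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


def error(x: Sequence[float], y: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


codebooks = [make_grid(1.0), make_grid(0.3), make_grid(0.1)]
x = (0.87, -0.42)
tokens = rvq_encode(x, codebooks)
print(tokens)  # [7, 3, 0]
for stages in range(1, len(codebooks) + 1):
    print(stages, error(x, rvq_decode(tokens[:stages], codebooks)))
```

Decoding with only the first token already lands near the input, and each extra stage shrinks the error, which is the property that lets the first codebook be modeled separately from the rest.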
Autoregressive (AR) decoder
The autoregressive Transformer generates only the tokens of the first codebook for each acoustic frame. It is a decoder-only Transformer conditioned on the phoneme sequence (from the target text) and the first-codebook tokens of the 3-second acoustic prompt.[1] Because the first codebook carries most of the prosodic and content information, the AR step determines the rhythm, intonation, and broad timbre of the output. The AR decoder generates one token at a time using sampling-based decoding (top-p / nucleus sampling), which encourages diversity in synthesized prosody and lets the same input text produce different valid outputs.
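Nucleus sampling itself is simple to state in code. The following is a generic sketch of top-p sampling over a probability vector, not code from the VALL-E paper; the function names and the `top_p` value are illustrative assumptions.

```python
import random
from typing import List, Sequence


def nucleus_filter(probs: Sequence[float], top_p: float) -> List[float]:
    """Keep the smallest set of most-likely tokens whose mass reaches top_p,
    zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Draw one token id from the nucleus of the distribution."""
    filtered = nucleus_filter(probs, top_p)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
filtered = nucleus_filter(probs, top_p=0.75)
samples = [sample_token(probs, 0.75, rng) for _ in range(20)]
print(filtered)
print(samples)
```

With `top_p=0.75` only the two most likely tokens survive, so repeated calls vary between them while never picking the low-probability tail. This is the source of the run-to-run prosody variation described above.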
Non-autoregressive (NAR) decoder
The non-autoregressive Transformer predicts the tokens of codebooks 2 through 8. It runs one codebook level at a time, and within each level it predicts the tokens for all frames in parallel. Each level is conditioned on the phoneme sequence, the full 8-layer acoustic prompt, and the predicted tokens of all preceding codebooks. The NAR model uses greedy decoding, since the AR step has already fixed the prosody and only fine acoustic detail remains.[1] Splitting generation this way trades a small loss in fidelity for a large speedup: codebooks 2 through 8 need only seven forward passes in total rather than one pass per token.
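The division of labor between the two decoders can be sketched as a control-flow skeleton. Everything model-related here is a stand-in: `ar_step` and `nar_level` are hypothetical callables, the `EOS` id is an assumption, and the toy versions return arbitrary tokens. Only the loop structure (a sequential pass over the first codebook, then one pass per remaining codebook) reflects the description above.

```python
import random
from typing import Callable, List

NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1024
EOS = CODEBOOK_SIZE  # hypothetical end-of-sequence id, outside the codebook range

Frame = List[int]  # one token per codebook


def synthesize(
    phonemes: List[str],
    prompt: List[Frame],
    ar_step: Callable,
    nar_level: Callable,
    rng: random.Random,
    max_frames: int = 1500,
) -> List[Frame]:
    # Stage 1 (AR): first codebook only, one token per forward pass.
    prompt_first = [frame[0] for frame in prompt]
    first: List[int] = []
    for _ in range(max_frames):
        token = ar_step(phonemes, prompt_first + first, rng)
        if token == EOS:
            break
        first.append(token)

    # Stage 2 (NAR): one pass per remaining codebook, all frames at once,
    # each pass seeing every level predicted so far.
    levels = [first]
    for level in range(1, NUM_CODEBOOKS):
        levels.append(nar_level(phonemes, prompt, levels, level))

    # Transpose from per-codebook lists to per-frame token lists.
    return [list(frame) for frame in zip(*levels)]


def make_toy_ar_step(stop_after: int) -> Callable:
    """Toy stand-in: random tokens, then EOS once the history is long enough."""
    def step(phonemes, history, rng):
        return EOS if len(history) >= stop_after else rng.randrange(CODEBOOK_SIZE)
    return step


def toy_nar_level(phonemes, prompt, levels, level):
    """Toy stand-in: derive each level from the first-codebook tokens."""
    return [(token + level) % CODEBOOK_SIZE for token in levels[0]]


prompt = [[i] * NUM_CODEBOOKS for i in range(3)]
frames = synthesize(
    ["HH", "AH", "L", "OW"], prompt, make_toy_ar_step(3 + 5), toy_nar_level,
    random.Random(0),
)
print(len(frames), len(frames[0]))  # 5 8
```

In this toy run the AR loop executes five sampling steps plus one stop step, while the NAR stage always runs exactly seven passes regardless of output length.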
Inference modes
The paper proposes two prompting modes:
| Mode | Inputs | Use case |
|---|---|---|
| VALL-E | Phoneme transcription of prompt + first-codebook tokens of prompt + target phonemes | Zero-shot voice cloning: new text spoken in the voice of an unseen speaker |
| VALL-E continual | First 3 seconds of an utterance as acoustic prompt + full target phonemes | Speech continuation: completing the rest of the same utterance |
In both modes the model never sees the target speaker during training and uses only the short prompt to condition synthesis.[1]
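As a rough illustration of the Inputs column, the sketch below assembles the conditioning for each mode. The `Conditioning` container and both builder functions are hypothetical names introduced here; the paper does not publish code, so this is only one plausible way to lay out the data.

```python
from dataclasses import dataclass
from typing import List

FRAME_RATE_HZ = 75  # codec frames per second, as stated earlier in the article


@dataclass
class Conditioning:
    phonemes: List[str]               # text side of the prompt
    acoustic_prefix: List[List[int]]  # codec frames the model continues from


def build_valle_prompt(
    prompt_phonemes: List[str],
    prompt_frames: List[List[int]],
    target_phonemes: List[str],
) -> Conditioning:
    """VALL-E mode: prompt transcript is prepended to the target text,
    and the prompt's codec frames act as the acoustic prefix."""
    return Conditioning(prompt_phonemes + target_phonemes, prompt_frames)


def build_continual_prompt(
    utterance_phonemes: List[str],
    utterance_frames: List[List[int]],
    prefix_seconds: float = 3.0,
) -> Conditioning:
    """VALL-E continual mode: the full transcript plus the first few seconds
    of the same utterance as the acoustic prefix."""
    prefix_len = int(prefix_seconds * FRAME_RATE_HZ)
    return Conditioning(list(utterance_phonemes), utterance_frames[:prefix_len])


utterance_frames = [[t] * 8 for t in range(300)]  # 4 seconds of dummy frames
cloning = build_valle_prompt(["P1", "P2"], utterance_frames[:225], ["T1", "T2", "T3"])
continual = build_continual_prompt(["U1", "U2", "U3"], utterance_frames)
print(len(cloning.phonemes), len(cloning.acoustic_prefix))      # 5 225
print(len(continual.phonemes), len(continual.acoustic_prefix))  # 3 225
```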
Training data
VALL-E was trained on the LibriLight corpus, a large unlabeled English audiobook dataset released by Facebook AI in 2019 that contains roughly 60,000 hours of speech from over 7,000 speakers.[1][5] Earlier TTS systems such as Tacotron 2 and FastSpeech were typically trained on tens to a few hundred hours of clean recordings (for example, LJSpeech at about 24 hours and LibriTTS at about 585 hours). LibriLight is hundreds of times larger and substantially noisier, with varied recording conditions, accents, and speaker emotions.
Microsoft used a hybrid pseudo-labeling pipeline to obtain transcriptions: an automatic speech recognition (ASR) model labeled the audio, and the resulting noisy text plus codec tokens served as training pairs. The team argued that this scale and diversity were critical for the in-context learning behavior, since smaller, cleaner corpora do not contain enough variation in voice, prosody, and acoustic environment for the model to generalize across them at inference time.[1]
Capabilities
Zero-shot voice cloning
Given a 3-second recording of any English speaker the model has never heard, VALL-E synthesizes new utterances in that speaker's voice without any fine-tuning. The published demos include male and female voices across a range of ages, accents, and recording qualities.[6] On the LibriSpeech test-clean benchmark, the original VALL-E paper reported a +0.93 improvement in speaker mean opinion score (SMOS) and +0.12 in comparative mean opinion score (CMOS) over the YourTTS baseline. On VCTK, VALL-E also scored higher on speaker similarity than YourTTS, even though YourTTS had seen 97 of the VCTK speakers during training and VALL-E had seen none of them.[1]
Emotion and acoustic environment preservation
Unlike earlier zero-shot systems that copied only timbre, VALL-E preserves paralinguistic features of the prompt. If the 3-second prompt is angry or whispered, the synthesized output stays angry or whispered. If the prompt is recorded in a reverberant room, the output sounds reverberant.[1] This emerges from training on noisy in-the-wild data rather than studio recordings.
Because the AR decoder uses sampling, VALL-E produces different valid prosodies for the same text and prompt across runs. The paper highlights this as a benefit for downstream applications such as audiobook generation, where mechanical repetition of the same prosody on similar sentences can sound unnatural.[1]
Continued speech generation
Given a phoneme transcript and an audio prompt that already contains the start of an utterance, VALL-E can continue the utterance in the same voice. This is useful for editing speech, for example replacing a misspoken word in the middle of a sentence.[1]
Benchmarks
The VALL-E paper evaluates on LibriSpeech test-clean (clean read English) and on the VCTK corpus (44 unseen speakers across British and Commonwealth accents). The two main objective metrics are word error rate (WER) on the synthesized audio, measured by running an ASR model over the output, and speaker encoder cosine similarity (SECS), measured by comparing speaker embeddings of the prompt and the synthesized speech. Subjective evaluations use mean opinion scores from human raters.
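Both objective metrics have standard definitions that fit in a few lines. The sketch below computes WER as word-level edit distance divided by reference length, and speaker similarity as the cosine between two embedding vectors. It omits the ASR model and speaker encoder that would produce the transcript and embeddings in a real evaluation, and the example sentences and vectors are made up.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two speaker embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(word_error_rate("the quick brown fox", "the quick fox jumps"))  # 0.5
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))                      # 0.0
```

Lower WER means the synthesized speech is more intelligible to the ASR model, and a similarity closer to 1 means the synthesized voice is closer to the prompt speaker.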
LibriSpeech test-clean (objective)
| System | WER (%) | SECS |
|---|---|---|
| Ground truth | 1.9 | 0.754 |
| YourTTS (baseline) | 7.7 | 0.337 |
| VALL-E | 5.9 | 0.580 |
| VALL-E 2 | 1.6 | 0.643 |
Values are taken from the VALL-E paper (Table 2) and the VALL-E 2 paper.[1][3] The VALL-E 2 entry uses repetition-aware sampling and grouped code modeling, which together push WER below ground truth (since ASR errors on synthesized speech can be lower than on the original noisy recordings).
LibriSpeech test-clean (subjective)
| System | CMOS vs ground truth | SMOS |
|---|---|---|
| Ground truth | 0.00 | 4.21 |
| YourTTS | -1.09 | 3.10 |
| VALL-E | -0.97 | 4.03 |
| VALL-E 2 | +0.04 | 4.27 |
CMOS measures comparative naturalness against ground truth on a [-3, +3] scale. SMOS measures speaker similarity on a [1, 5] scale. VALL-E 2 was the first reported zero-shot system to score above ground truth on both metrics, which Microsoft framed as crossing a human parity threshold.[3][7]
VCTK (objective)
| System | WER (%) | SECS |
|---|---|---|
| Ground truth | 2.2 | 0.736 |
| YourTTS | 11.9 | 0.357 |
| VALL-E | 7.9 | 0.382 |
| VALL-E 2 | 2.4 | 0.508 |
VCTK is harder for VALL-E because the model was trained almost entirely on American-accented LibriLight data, while VCTK speakers use British and Commonwealth accents.[1][3]
Public reaction to the demos
The project page hosted curated audio examples that were widely shared on social media in the days after the paper appeared. Listeners noted that the model captured idiosyncratic features like a slight lisp, a soft creaky voice register, or background room tone from the prompt. Researchers in the speech community pointed out that the same property had been visible in earlier prototype work (for example, Tortoise TTS had also reproduced acoustic environment), but VALL-E was the first system where the effect was robust enough to feature in nearly every demo, not just hand-picked ones.[6][10]
Successor systems
VALL-E X
VALL-E X is a cross-lingual extension introduced in March 2023 in the paper Speak Foreign Languages with Your Own Voice (arXiv:2303.03926).[8] It supports English, Chinese, and Japanese with a single model, and adds a language ID token at the start of the phoneme sequence so the decoder knows which target language to speak. The model can take an English prompt and synthesize Mandarin in the same voice (or vice versa), preserving the speaker's timbre across languages.
VALL-E X also introduced accent control: by varying the language ID token, the model can speak Mandarin with a slight English accent or speak English with a slight Mandarin accent, depending on the prompt and the target language. The team demonstrated zero-shot speech-to-speech translation by chaining a translation model with VALL-E X.[8] As with VALL-E, Microsoft did not release VALL-E X, although a community implementation by GitHub user Plachtaa appeared in August 2023 and was widely used.
VALL-E R
VALL-E R, published in June 2024 (arXiv:2406.07855), addresses two persistent failure modes of the original autoregressive design: word skipping and word repetition.[2] These artifacts arise because the AR decoder learns implicit attention alignments between phonemes and codec frames, and these alignments occasionally collapse on out-of-distribution inputs.
The VALL-E R fix is a monotonic alignment strategy. During training, the model jointly predicts the next acoustic token and the next phoneme position, with a loss that constrains the phoneme position to advance monotonically. At inference time, this constraint is enforced explicitly so the decoder cannot skip over phonemes or revisit earlier ones. The paper reports a WER close to ground truth and a 60% reduction in inference-time autoregressive steps compared to VALL-E.[2]
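The effect of an inference-time monotonic constraint can be illustrated with a generic pointer-decoding sketch. This is not VALL-E R's actual algorithm: the score matrix, the function name, and the rule that the pointer may only stay or advance by one phoneme per frame are assumptions chosen to show how such a constraint rules out skips and revisits.

```python
from typing import List, Sequence


def monotonic_path(position_scores: Sequence[Sequence[float]], start: int = 0) -> List[int]:
    """Pick a phoneme index per frame, allowing only 'stay' or 'advance by one'.

    position_scores[t][p] is a hypothetical model score for frame t
    being aligned to phoneme p.
    """
    path = []
    current = start
    for scores in position_scores:
        candidates = [current]
        if current + 1 < len(scores):
            candidates.append(current + 1)
        current = max(candidates, key=lambda p: scores[p])
        path.append(current)
    return path


scores = [
    [0.9, 0.1, 0.0],  # clearly phoneme 0
    [0.1, 0.2, 0.7],  # unconstrained argmax would skip to phoneme 2
    [0.6, 0.1, 0.3],  # unconstrained argmax would jump back to phoneme 0
    [0.2, 0.3, 0.5],  # already at the last phoneme, so stay
]
unconstrained = [max(range(3), key=lambda p: row[p]) for row in scores]
print(unconstrained)           # [0, 2, 0, 2]
print(monotonic_path(scores))  # [0, 1, 2, 2]
```

The unconstrained path both skips phoneme 1 and revisits phoneme 0, which correspond to the word-skipping and word-repetition failures; the constrained path visits every phoneme once, in order.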
VALL-E 2
VALL-E 2, published in June 2024 (arXiv:2406.05370), claims to be the first zero-shot TTS system to reach human parity on robustness, naturalness, and speaker similarity, with WER, SMOS, and CMOS scores at or above ground truth on both LibriSpeech and VCTK.[3] The paper introduces two main techniques:
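- Repetition-aware sampling. The original VALL-E used standard nucleus sampling, which occasionally chose tokens that produced word-level repetitions or left decoding stuck in a loop. Repetition-aware sampling first draws a token by nucleus sampling, then checks how often that token appears in a window of recent decoding history. If its repetition ratio exceeds a threshold, the token is replaced by one drawn by random sampling from the full predicted distribution. The goal is similar to repetition penalties in text language models, but the mechanism switches sampling strategy rather than rescaling probabilities.[3]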
- Grouped code modeling. Instead of generating one Encodec token per AR step, the VALL-E 2 AR decoder groups several adjacent frames and predicts them jointly. This shortens the effective sequence length, which both speeds up inference and reduces the long-context error accumulation that hurts the original VALL-E on longer utterances.[3]
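A minimal sketch of repetition-aware sampling follows, based on a reading of the VALL-E 2 paper: draw a token by nucleus sampling, measure how often it occurs in a recent window, and fall back to sampling from the full distribution when that ratio is too high. The function names and the default `top_p`, `window`, and `threshold` values are illustrative assumptions, not the paper's settings.

```python
import random
from typing import List, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of top tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_ratio(token: int, history: Sequence[int], window: int) -> float:
    """Fraction of the last `window` tokens equal to `token` (window must be > 0)."""
    recent = list(history[-window:])
    if not recent:
        return 0.0
    return recent.count(token) / len(recent)


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    rng: random.Random,
    top_p: float = 0.8,
    window: int = 10,
    threshold: float = 0.1,
) -> int:
    token = nucleus_sample(probs, top_p, rng)
    if repetition_ratio(token, history, window) > threshold:
        # Token is overused recently: resample from the full distribution.
        token = rng.choices(range(len(probs)), weights=list(probs), k=1)[0]
    return token


probs = [0.6, 0.2, 0.2]
rng = random.Random(0)
fresh = [repetition_aware_sample(probs, [], rng, top_p=0.5) for _ in range(50)]
stuck = [repetition_aware_sample(probs, [0] * 10, rng, top_p=0.5) for _ in range(200)]
print(set(fresh))  # {0}
print(sorted(set(stuck)))
```

With `top_p=0.5` the nucleus contains only token 0, so an empty history always yields token 0. Once the history is saturated with that token, the fallback lets the other tokens appear again.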
Microsoft's VALL-E 2 announcement explicitly stated that the model would not be productized or released because of the risks of voice imitation without consent.[7][9] Multiple outlets (Live Science, Decrypt, MarkTechPost, Synced) covered the paper as the first credible demonstration of human-parity zero-shot voice cloning.[7][9]
Influence on later systems
VALL-E established the template that most later neural-codec TTS systems followed: tokenize audio with a residual VQ codec, condition a Transformer on the codec sequence and a phoneme transcript, and use a 3-second prompt for speaker conditioning. Suno's Bark, released in April 2023, was the most direct adopter, using Encodec tokens and a similar coarse-to-fine generation strategy in fully open-source form.[12] Coqui's XTTS v2 (September 2023) and the OpenVoice line of models also drew on the codec-language-model formulation. ElevenLabs has not published architecture details for its commercial models, but its Multilingual v2 release in August 2023 produced output with the same hallmarks (3-second prompt, emotion preservation, cross-lingual capabilities) that VALL-E and VALL-E X had demonstrated in research the same year.
Reception and ethical concerns
VALL-E drew immediate attention from the press and the speech research community when it was posted in January 2023. Coverage focused both on the technical leap and on the obvious abuse potential of a model that could clone a voice from three seconds of audio. TechNewsWorld, MIT Technology Review, and others ran stories within days of the arXiv release.[10] Microsoft's published ethics statement on the project page acknowledged the risks of impersonation, fraud, and bypassing voice authentication, and committed to keeping the model in a research-only state.
Industry observers noted that VALL-E was part of a broader pattern in 2023 and 2024 in which large labs published advanced voice-cloning research but withheld the models. Meta's Voicebox (June 2023), OpenAI's Voice Engine (March 2024), and Google DeepMind's audio research were all held back from public release for similar reasons.[7] By contrast, smaller commercial labs such as ElevenLabs and Resemble AI shipped voice cloning products that achieved comparable quality, accepting the safety risk in exchange for product distribution.
Research on detection of synthesized speech accelerated in parallel: the ASVspoof challenge added neural codec language model attacks to its 2024 evaluation set, and several detection papers used VALL-E samples (or community VALL-E X reproductions) as adversarial examples for spoofing detection.[11]
Comparison to other TTS systems
VALL-E sits in a wave of late 2022 and 2023 zero-shot TTS systems that all moved from continuous mel-spectrogram regression to either discrete codec tokens or latent diffusion. The most-discussed contemporaries are summarized below.
| System | Lab | Released | Approach | Public weights |
|---|---|---|---|---|
| VALL-E | Microsoft | Jan 2023 | AR + NAR Transformer over Encodec tokens | No |
| Tortoise TTS | Independent (James Betker) | Apr 2022 | AR Transformer + diffusion decoder | Yes |
| Bark | Suno | Apr 2023 | GPT-style Transformer over Encodec tokens | Yes |
| NaturalSpeech 2 | Microsoft | Apr 2023 | Latent diffusion over codec latents | No |
| Voicebox | Meta | Jun 2023 | Flow matching over mel-spectrograms | No |
| ElevenLabs Multilingual v2 | ElevenLabs | Aug 2023 | Proprietary AR Transformer | API only |
| XTTS v2 | Coqui | Sep 2023 | GPT-style AR over discrete tokens | Yes |
| OpenVoice | MyShell | Dec 2023 | Tone color converter + base TTS | Yes |
| NaturalSpeech 3 | Microsoft | Mar 2024 | Factorized vector quantization + diffusion | No |
| OpenAI Voice Engine | OpenAI | Mar 2024 | Undisclosed | No |
| VALL-E 2 | Microsoft | Jun 2024 | AR + NAR with grouped codes, repetition-aware sampling | No |
VALL-E most directly inspired Bark, which uses the same Encodec backbone and a similar coarse-to-fine token decomposition.[12] Tortoise TTS, released several months before VALL-E, also used a discrete token bottleneck (a custom autoencoder) and an AR Transformer, but routed its output through a diffusion decoder rather than a NAR codec head, which made it slower at inference.
Microsoft's NaturalSpeech series
Microsoft Research Asia ran a parallel TTS line under the NaturalSpeech name, starting with NaturalSpeech (a non-zero-shot, end-to-end text-to-waveform model from 2022). After VALL-E, the NaturalSpeech line pivoted to zero-shot synthesis but used diffusion rather than autoregressive language modeling.
NaturalSpeech 2
NaturalSpeech 2, published in April 2023 (arXiv:2304.09116), uses a latent diffusion model over neural audio codec latents. The system was trained on roughly 44,000 hours of speech and singing data. It generates the codec latents directly with a diffusion model conditioned on phonemes and a speaker prompt, which avoids the long autoregressive sequence length that limits VALL-E on extended utterances. NaturalSpeech 2 was the first TTS paper to demonstrate strong zero-shot singing synthesis from a speech-only prompt.[13]
NaturalSpeech 3
NaturalSpeech 3, published in March 2024 (arXiv:2403.03100), factorizes speech into separate subspaces for content, prosody, timbre, and acoustic detail using a factorized vector quantization (FVQ) codec, then uses a separate diffusion model to generate each subspace. The factorization is meant to disentangle attributes so that, for example, prosody from one prompt can be combined with timbre from another. At the time of publication, the paper reported state-of-the-art quality on LibriSpeech, outperforming VALL-E and NaturalSpeech 2 on speaker similarity and naturalness.[14]
The NaturalSpeech series and the VALL-E series represent two design philosophies inside Microsoft Research: language modeling over discrete tokens (VALL-E) versus diffusion over continuous or quantized latents (NaturalSpeech). Both lines remained research-only.
Open-source reproductions
Because Microsoft did not release weights, several community projects re-implemented the architecture and trained on public corpora. The most-used reproductions are:
| Project | Maintainer | First release | Notes |
|---|---|---|---|
| lifeiteng/vall-e | Li Feiteng | February 2023 | PyTorch implementation, trained on LibriTTS (smaller than LibriLight). Demo page at lifeiteng.github.io/valle. |
| Plachtaa/VALL-E-X | Plachtaa | August 2023 | Open-source VALL-E X reproduction supporting English, Chinese, Japanese. Widely used in voice-cloning applications. |
| enhuiz/vall-e | Zhou Enhui | January 2023 | Early academic reproduction. |
None of these reproductions matched Microsoft's published WER and SECS numbers, which the authors attributed mainly to the difference in training data scale (a few thousand hours of LibriTTS or LibriSpeech versus 60,000 hours of LibriLight) and the lack of access to Microsoft's internal preprocessing pipeline.[15]
Limitations
The original VALL-E paper itself enumerates several limitations that successor systems then tried to address.[1]
- Synthesis errors. The AR decoder occasionally skipped, repeated, or hallucinated words, especially on long sentences. VALL-E R targeted this with monotonic alignment, and VALL-E 2 with repetition-aware sampling and grouped code modeling.
- Data coverage. Despite 60,000 hours of training data, VALL-E performed worse on accents underrepresented in LibriLight (such as the Commonwealth accents in VCTK) and on voices that differ strongly in age or vocal style from the training distribution.
- Single language. VALL-E supports only English. VALL-E X added Chinese and Japanese, but coverage of low-resource languages remains an open problem.
- Inference cost. The AR step is sequential and produces one Encodec frame (about 13 ms of audio) per Transformer forward pass, which is slow on CPUs. NAR steps are parallel but still consume meaningful GPU time. Grouped code modeling in VALL-E 2 partially addressed this.
- No public release. Reproducing VALL-E required either re-implementing the architecture from scratch and assembling a comparable training corpus or using a community fork. Several open-source implementations (lifeiteng/vall-e, Plachtaa/VALL-E-X) reached partial parity but did not match Microsoft's reported metrics.[15]
- Misuse risk. The same property that makes the model useful (high-fidelity zero-shot cloning from a 3-second sample) makes it dangerous, and the published research provides limited mitigations beyond not releasing the model.
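The inference-cost point in the list above can be made concrete by counting forward passes. The sketch assumes the 75 frames-per-second rate and 8 codebooks stated earlier; the group size of 2 is an illustrative value for grouped code modeling, not a figure from this article.

```python
import math

FRAME_RATE_HZ = 75  # codec frames per second
NUM_CODEBOOKS = 8   # RVQ levels


def ar_steps(seconds: float, group_size: int = 1) -> int:
    """Sequential AR forward passes needed for `seconds` of audio.

    group_size > 1 models grouped code modeling, where each AR step
    emits several adjacent frames at once.
    """
    frames = math.ceil(seconds * FRAME_RATE_HZ)
    return math.ceil(frames / group_size)


def nar_passes() -> int:
    """One NAR pass per codebook after the first, independent of length."""
    return NUM_CODEBOOKS - 1


print(round(1000 / FRAME_RATE_HZ, 1))  # 13.3 (ms of audio per frame)
print(ar_steps(10))                    # 750
print(ar_steps(10, group_size=2))      # 375
print(nar_passes())                    # 7
```

For a 10-second utterance the sequential AR stage dominates: 750 passes against 7 for the NAR stage, which is why shortening the AR sequence is where the successor systems focused.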
References
1. Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., Chen, Z., Liu, Y., Wang, H., Li, J., He, L., Zhao, S., Wei, F. (2023). "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers." arXiv:2301.02111. https://arxiv.org/abs/2301.02111
2. Han, B., Zhou, L., Liu, S., Chen, S., Meng, L., Qian, Y., Liu, Y., Zhao, S., Li, J., Wei, F. (2024). "VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment." arXiv:2406.07855. https://arxiv.org/abs/2406.07855
3. Chen, S., Liu, S., Zhou, L., Liu, Y., Tan, X., Li, J., Zhao, S., Qian, Y., Wei, F. (2024). "VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers." arXiv:2406.05370. https://arxiv.org/abs/2406.05370
4. Defossez, A., Copet, J., Synnaeve, G., Adi, Y. (2022). "High Fidelity Neural Audio Compression." arXiv:2210.13438. https://arxiv.org/abs/2210.13438
5. Kahn, J., Riviere, M., Zheng, W., et al. (2020). "Libri-Light: A Benchmark for ASR with Limited or No Supervision." ICASSP 2020. https://arxiv.org/abs/1912.07875
6. Microsoft Research project page, "VALL-E." https://www.microsoft.com/en-us/research/project/vall-e-x/
7. Live Science (2024). "Microsoft's AI speech generator VALL-E 2 'reaches human parity' but it's too dangerous to release." https://www.livescience.com/technology/artificial-intelligence/ai-speech-generator-reaches-human-parity-but-its-too-dangerous-to-release-scientists-say
8. Zhang, Z., Zhou, L., Wang, C., Chen, S., Wu, Y., Liu, S., Chen, Z., Liu, Y., Wang, H., Li, J., He, L., Zhao, S., Wei, F. (2023). "Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling." arXiv:2303.03926. https://arxiv.org/abs/2303.03926
9. Decrypt (2024). "Microsoft's AI Voice Cloning Tech Is So Good, You Can't Use It." https://decrypt.co/238419/microsoft-ai-voice-clone-human-parity
10. TechNewsWorld (2023). "Microsoft VALL-E Clones Anyone's Voice From a 3-Second Sample." https://www.technewsworld.com/story/microsofts-new-ai-can-simulate-anyones-voice-from-a-3-second-sample-177646.html
11. ASVspoof 2024 challenge overview. https://www.asvspoof.org/
12. Suno AI Bark repository. https://github.com/suno-ai/bark
13. Shen, K., Ju, Z., Tan, X., Liu, Y., Leng, Y., He, L., Qin, T., Zhao, S., Bian, J. (2023). "NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers." arXiv:2304.09116. https://arxiv.org/abs/2304.09116
14. Ju, Z., Wang, Y., Shen, K., et al. (2024). "NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models." arXiv:2403.03100. https://arxiv.org/abs/2403.03100
15. lifeiteng/vall-e community implementation. https://github.com/lifeiteng/vall-e