Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)

Updated Jun 23, 2026

What is VALL-E?

VALL-E is a zero-shot text-to-speech (TTS) system from Microsoft Research that clones a target voice from a 3-second recording and synthesizes new speech in that voice without any per-speaker training. Introduced in the January 2023 paper Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (arXiv:2301.02111, submitted January 5, 2023), it was the first TTS model to treat speech synthesis as a language modeling problem over discrete neural audio codec tokens, and to demonstrate that scaling such a model to 60,000 hours of speech produces in-context voice cloning from a single short prompt.^[1] The authors state the core idea directly: "we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work."^[1]

The shift away from continuous mel-spectrogram regression, the dominant paradigm in earlier neural TTS systems such as Tacotron and FastSpeech, is what made the approach novel.^[1] Given a 3-second recording of an unseen speaker as an acoustic prompt and a target text, VALL-E synthesizes speech in that speaker's voice while preserving the prompt's emotion, pacing, and acoustic environment.^[1] At publication it was the first TTS system to show strong in-context learning for voice cloning without per-speaker fine-tuning, mirroring how GPT-3 performs few-shot text generation.

Microsoft did not release model weights, training code, or a public API for VALL-E or any of its successors, citing concerns that the system could be misused for fraud, deepfake audio, scams, and bypassing voice authentication. Subsequent papers in the same line of work, including VALL-E X (March 2023), VALL-E R (June 2024), and VALL-E 2 (June 2024), extended the original architecture but were also withheld from public release.^[2]^[3]

Who built VALL-E?

The paper was written by Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei, all of Microsoft Research Asia and Microsoft Azure Speech.^[1] Furu Wei led the natural language computing group at Microsoft Research Asia, which had previously worked on speech models including UniSpeech and WavLM. The team published a project page at microsoft.com/en-us/research/project/vall-e-x/ with audio samples but never released code or weights.

How does VALL-E work?

VALL-E factorizes speech generation into three stages: a phoneme front end, a discrete audio tokenizer (EnCodec), and two stacked Transformer language models that operate over the codec tokens.

Audio tokenization with EnCodec

Raw 24 kHz waveforms are first converted into discrete tokens by Meta's EnCodec neural audio codec, introduced by Défossez and colleagues in October 2022 (arXiv:2210.13438).^[4] EnCodec uses a streaming convolutional encoder, a residual vector quantization (RVQ) bottleneck, and a convolutional decoder that reconstructs the waveform. In the configuration that VALL-E uses, the 24 kHz EnCodec model emits 75 frames per second, and each frame is represented by 8 integer tokens, one from each of 8 codebooks of size 1024.^[1]^[4]
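The token budget implied by these figures can be worked out directly. The snippet below is a small arithmetic sketch, not part of any released code; the constant and function names are illustrative, and the numbers come from the paragraph above.

```python
import math

FRAME_RATE_HZ = 75     # EnCodec frames per second at 24 kHz (from the text)
NUM_CODEBOOKS = 8      # RVQ levels used by VALL-E (from the text)
CODEBOOK_SIZE = 1024   # entries per codebook (from the text)


def tokens_for(seconds: float) -> tuple[int, int]:
    """Return (frames, total tokens) for a clip of the given duration."""
    frames = round(seconds * FRAME_RATE_HZ)
    return frames, frames * NUM_CODEBOOKS


def bitrate_bps() -> float:
    """Bits per second implied by the frame rate and codebook layout."""
    bits_per_token = math.log2(CODEBOOK_SIZE)  # 10 bits for 1024 entries
    return FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_token


prompt_frames, prompt_tokens = tokens_for(3.0)
print(prompt_frames, prompt_tokens)  # 225 1800
print(bitrate_bps())                 # 6000.0
```

A 3-second prompt therefore corresponds to 225 frames, or 1,800 tokens across all 8 codebooks, at an effective rate of 6 kbps.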

Residual vector quantization is the key choice: the first codebook captures the most prominent acoustic content (roughly the coarse spectral envelope and prosody), while subsequent codebooks encode finer residual detail and high-frequency information.^[4] This decomposition lets VALL-E separate the slow autoregressive content prediction from the fast parallel detail prediction.
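The coarse-to-fine behavior of RVQ can be illustrated with a toy example. The sketch below quantizes a single scalar with three tiny hand-written codebooks; real EnCodec quantizes learned vectors with 1024-entry codebooks, so the values, sizes, and function names here are purely illustrative assumptions.

```python
def nearest(codebook: list[float], value: float) -> int:
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def rvq_encode(value: float, codebooks: list[list[float]]) -> list[int]:
    """Quantize value level by level; each level encodes the previous residual."""
    residual = value
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual -= codebook[idx]  # what is left for the next, finer level
    return indices


def rvq_decode(indices: list[int], codebooks: list[list[float]], levels: int) -> float:
    """Reconstruct using only the first `levels` codebooks."""
    return sum(cb[i] for cb, i in zip(codebooks[:levels], indices[:levels]))


# Toy scalar codebooks, ordered from coarse to fine (illustrative values).
CODEBOOKS = [
    [-1.0, 0.0, 1.0],
    [-0.3, 0.0, 0.3],
    [-0.1, 0.0, 0.1],
]

target = 0.62
codes = rvq_encode(target, CODEBOOKS)
print(codes)  # [2, 0, 0]
for levels in range(1, len(CODEBOOKS) + 1):
    error = abs(target - rvq_decode(codes, CODEBOOKS, levels))
    print(levels, error)  # the error shrinks as more levels are used
```

The first index alone gives a rough reconstruction, and each further level only refines it. This is the property that lets the first codebook be modeled separately from the rest.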

Autoregressive (AR) decoder

The autoregressive Transformer generates only the tokens of the first codebook for each acoustic frame. It is a decoder-only Transformer conditioned on the phoneme sequence (from the target text) and the first-codebook tokens of the 3-second acoustic prompt.^[1] Because the first codebook carries most of the prosodic and content information, the AR step determines the rhythm, intonation, and broad timbre of the output. The AR decoder generates one token at a time using sampling-based decoding (top-p / nucleus sampling), which encourages diversity in synthesized prosody and lets the same input text produce different valid outputs.
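Nucleus sampling itself is simple to sketch. The following is a minimal pure-Python version that operates on a list of logits; it is not VALL-E's decoding code, and the function names and the top_p values used in the example are assumptions made for illustration.

```python
import math
import random


def softmax(logits: list[float]) -> list[float]:
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Keep the smallest set of top tokens whose mass reaches top_p, renormalized."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def sample_top_p(logits: list[float], top_p: float, rng: random.Random) -> int:
    """Draw one token id from the nucleus of the distribution."""
    nucleus = nucleus_filter(softmax(logits), top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
draws = [sample_top_p(logits, top_p=0.7, rng=rng) for _ in range(10)]
print(draws)  # only tokens 0 and 1 can appear; the order varies with the seed
```

Because low-probability tokens are cut off but the remaining ones are still sampled, repeated runs give different but plausible first-codebook sequences.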

Non-autoregressive (NAR) decoder

The non-autoregressive Transformer predicts the tokens of codebooks 2 through 8. It handles one codebook at a time, predicting that codebook's tokens for all frames in parallel, and is conditioned on the phoneme sequence, the full 8-layer acoustic prompt, and the predicted tokens of all preceding codebooks. The NAR model uses greedy decoding, since the AR step has already fixed the prosody and only fine acoustic detail remains.^[1] Splitting generation this way trades a small loss in fidelity for a large speedup, since codebooks 2 through 8 do not need to be generated token-by-token.
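The control flow of the two decoders can be sketched as follows. Both model functions are hypothetical stand-ins that return placeholder tokens rather than real predictions, and the fixed frame count replaces the end-of-sequence handling a real system would need; only the loop structure is meant to mirror the description above.

```python
import random

NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1024


def ar_next_token(phonemes, first_layer_so_far, rng):
    """Hypothetical stand-in for the AR Transformer: one sampled first-codebook token."""
    return rng.randrange(CODEBOOK_SIZE)


def nar_predict_layer(phonemes, prompt_codes, layers_so_far, layer_index):
    """Hypothetical stand-in for the NAR Transformer.

    Returns one token per frame for the requested codebook in a single call.
    A real model would take the argmax over its logits (greedy decoding).
    """
    previous = layers_so_far[-1]
    return [(tok + layer_index) % CODEBOOK_SIZE for tok in previous]


def generate(phonemes, prompt_codes, num_frames, seed=0):
    """Return NUM_CODEBOOKS lists of num_frames tokens each."""
    rng = random.Random(seed)

    # Stage 1: sequential. One first-codebook token per step, continuing
    # from the prompt's first-codebook tokens as an acoustic prefix.
    first_layer = list(prompt_codes[0])
    for _ in range(num_frames):
        first_layer.append(ar_next_token(phonemes, first_layer, rng))
    generated = first_layer[len(prompt_codes[0]):]

    # Stage 2: seven NAR calls, one per remaining codebook, each covering
    # every frame at once and conditioned on the layers already produced.
    layers = [generated]
    for layer_index in range(1, NUM_CODEBOOKS):
        layers.append(nar_predict_layer(phonemes, prompt_codes, layers, layer_index))
    return layers


prompt = [[0] * 5 for _ in range(NUM_CODEBOOKS)]  # toy 5-frame prompt
codes = generate(["HH", "AH", "L", "OW"], prompt, num_frames=10)
print(len(codes), len(codes[0]))  # 8 10
```

For a clip of T frames this costs T sequential AR steps plus 7 NAR passes, instead of 8 x T sequential steps if every token were generated autoregressively.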

Inference modes

The paper proposes two prompting modes:

Mode	Inputs	Use case
VALL-E	Phoneme transcription of prompt + first-codebook tokens of prompt + target phonemes	Zero-shot voice cloning of an unseen speaker from a separate enrolled recording
VALL-E continual	First 3 seconds of an utterance as acoustic prompt + full target phonemes	Continuation of an existing utterance in the same voice

In both modes the model never sees the target speaker during training and uses only the short prompt to condition synthesis.^[1]
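The difference between the two modes lies only in how the model inputs are assembled. The sketch below builds those inputs from plain lists; the `ModelInputs` container and both function names are hypothetical, and the 75 frames-per-second rate is taken from the EnCodec section.

```python
from dataclasses import dataclass

FRAME_RATE_HZ = 75  # EnCodec frame rate used by VALL-E


@dataclass
class ModelInputs:
    phonemes: list[str]          # phoneme prompt fed to the model
    acoustic_prefix: list[int]   # first-codebook tokens used as a prefix


def vall_e_inputs(enrolled_phonemes, enrolled_first_layer, target_phonemes):
    """VALL-E mode: the prompt transcript is prepended to the target text."""
    return ModelInputs(
        phonemes=list(enrolled_phonemes) + list(target_phonemes),
        acoustic_prefix=list(enrolled_first_layer),
    )


def vall_e_continual_inputs(utterance_phonemes, utterance_first_layer,
                            prefix_seconds=3.0):
    """Continual mode: full transcript plus the first seconds of the same audio."""
    prefix_frames = round(prefix_seconds * FRAME_RATE_HZ)
    return ModelInputs(
        phonemes=list(utterance_phonemes),
        acoustic_prefix=list(utterance_first_layer[:prefix_frames]),
    )


enrolled = vall_e_inputs(["HH", "AY"], [11, 12, 13], ["B", "AY"])
print(enrolled.phonemes)  # ['HH', 'AY', 'B', 'AY']

continual = vall_e_continual_inputs(["HH", "AY", "B", "AY"], list(range(600)))
print(len(continual.acoustic_prefix))  # 225
```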

How was VALL-E trained?

VALL-E was trained on the Libri-Light corpus, a large unlabeled English audiobook dataset released by Facebook AI in 2019 that contains roughly 60,000 hours of speech from over 7,000 speakers (7,439 unique speakers in the largest split).^[1]^[5] The paper notes that the data was scaled "to 60K hours of English speech which is hundreds of times larger than existing systems."^[1] Earlier TTS systems such as Tacotron 2 and FastSpeech were typically trained on at most a few hundred hours of comparatively clean recordings (for example, the LJSpeech and LibriTTS corpora). Libri-Light is also substantially noisier, with varied recording conditions, accents, and speaker emotions.

Microsoft used a hybrid pseudo-labeling pipeline to obtain transcriptions: an automatic speech recognition (ASR) model labeled the audio, and the resulting noisy text plus codec tokens served as training pairs. The team argued that this scale and diversity were critical for the in-context learning behavior, since smaller, cleaner corpora do not contain enough variation in voice, prosody, and acoustic environment for the model to generalize across them at inference time.^[1]

What can VALL-E do?

Zero-shot voice cloning

Given a 3-second recording of any English speaker the model has never heard, VALL-E synthesizes new utterances in that speaker's voice without any fine-tuning. The published demos include male and female voices across a range of ages, accents, and recording qualities.^[6] On the LibriSpeech test-clean benchmark, the original VALL-E paper reported a +0.93 improvement in speaker mean opinion score (SMOS) and +0.12 in comparative mean opinion score (CMOS) over the YourTTS baseline.^[1] On VCTK it also outperformed YourTTS, even though YourTTS had seen 97 of the VCTK speakers during training and VALL-E had seen none of them.^[1] The paper summarizes the result plainly: VALL-E "significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity."^[1]

Emotion and acoustic environment preservation

Unlike earlier zero-shot systems that copied only timbre, VALL-E preserves paralinguistic features of the prompt. If the 3-second prompt is angry or whispered, the synthesized output stays angry or whispered. If the prompt is recorded in a reverberant room, the output sounds reverberant.^[1] This emerges from training on noisy in-the-wild data rather than studio recordings.

Diverse outputs from the same input

Because the AR decoder uses sampling, VALL-E produces different valid prosodies for the same text and prompt across runs. The paper highlights this as a benefit for downstream applications such as audiobook generation, where mechanical repetition of the same prosody on similar sentences can sound unnatural.^[1]

Continued speech generation

Given a phoneme transcript and an audio prompt that already contains the start of an utterance, VALL-E can continue the utterance in the same voice. This is useful for editing speech, for example replacing a misspoken word in the middle of a sentence.^[1]

How is VALL-E benchmarked?

The VALL-E paper evaluates on LibriSpeech test-clean (clean read English) and on the VCTK corpus (44 unseen speakers across British and Commonwealth accents). The two main objective metrics are word error rate (WER) on the synthesized audio, measured by running an ASR model over the output, and speaker encoder cosine similarity (SECS), measured by comparing speaker embeddings of the prompt and the synthesized speech. Subjective evaluations use mean opinion scores from human raters.
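Both objective metrics are easy to state in code. The sketch below computes WER as word-level edit distance divided by reference length, and cosine similarity between two embedding vectors. It assumes the ASR transcript and the speaker embeddings are already available, since the models that produce them are outside its scope; the function names are illustrative.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))         # 0.5
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
```

WER is usually reported as a percentage, so the 0.5 above would appear as 50.0 in a table like the ones that follow.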

LibriSpeech test-clean (objective)

System	WER (%)	SECS
Ground truth	1.9	0.754
YourTTS (baseline)	7.7	0.337
VALL-E	5.9	0.580
VALL-E 2	1.6	0.643

Values are taken from the VALL-E paper (Table 2) and the VALL-E 2 paper.^[1]^[3] The VALL-E 2 entry uses repetition-aware sampling and grouped code modeling, which together push WER below ground truth (since ASR errors on synthesized speech can be lower than on the original noisy recordings).

LibriSpeech test-clean (subjective)

System	CMOS vs ground truth	SMOS
Ground truth	0.00	4.21
YourTTS	-1.09	3.10
VALL-E	-0.97	4.03
VALL-E 2	+0.04	4.27

CMOS measures comparative naturalness against ground truth on a [-3, +3] scale. SMOS measures speaker similarity on a [1, 5] scale. VALL-E 2 was the first reported zero-shot system to score above ground truth on both metrics, which Microsoft framed as crossing a human parity threshold.^[3]^[7]
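As a quick consistency check, the gaps between VALL-E and YourTTS in the subjective table reproduce the +0.93 SMOS and +0.12 CMOS improvements quoted in the zero-shot voice cloning section. The snippet only restates the table values; the dictionary layout and function name are illustrative.

```python
# Values copied from the LibriSpeech test-clean (subjective) table above.
SCORES = {
    "YourTTS": {"cmos": -1.09, "smos": 3.10},
    "VALL-E": {"cmos": -0.97, "smos": 4.03},
}


def gap(metric: str, system: str, baseline: str) -> float:
    """Difference between two systems on one metric, rounded to 2 decimals."""
    return round(SCORES[system][metric] - SCORES[baseline][metric], 2)


print(gap("smos", "VALL-E", "YourTTS"))  # 0.93
print(gap("cmos", "VALL-E", "YourTTS"))  # 0.12
```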

VCTK (objective)

System	WER (%)	SECS
Ground truth	2.2	0.736
YourTTS	11.9	0.357
VALL-E	7.9	0.382
VALL-E 2	2.4	0.508

VCTK is harder for VALL-E because the model was trained almost entirely on American-accented Libri-Light data, while VCTK speakers use British and Commonwealth accents.^[1]^[3]

Public reaction to the demos

The project page hosted curated audio examples that were widely shared on social media in the days after the paper appeared. Listeners noted that the model captured idiosyncratic features like a slight lisp, a soft creaky voice register, or background room tone from the prompt. Researchers in the speech community pointed out that the same property had been visible in earlier prototype work (for example, Tortoise TTS had also reproduced acoustic environment), but VALL-E was the first system where the effect was robust enough to feature in nearly every demo, not just hand-picked ones.^[6]^[10]

What are the VALL-E successor systems?

VALL-E X

VALL-E X is a cross-lingual extension introduced in March 2023 in the paper Speak Foreign Languages with Your Own Voice (arXiv:2303.03926).^[8] It supports English, Chinese, and Japanese with a single model, and adds a language ID token at the start of the phoneme sequence so the decoder knows which target language to speak. The model can take an English prompt and synthesize Mandarin in the same voice (or vice versa), preserving the speaker's timbre across languages.

VALL-E X also introduced accent control: by varying the language ID token, the model can speak Mandarin with a slight English accent or speak English with a slight Mandarin accent, depending on the prompt and the target language. The team demonstrated zero-shot speech-to-speech translation by chaining a translation model with VALL-E X.^[8] As with VALL-E, Microsoft did not release VALL-E X, although a community implementation by GitHub user Plachtaa appeared in August 2023 and was widely used.

VALL-E R

VALL-E R, published in June 2024 (arXiv:2406.07855), addresses two persistent failure modes of the original autoregressive design: word skipping and word repetition.^[2] These artifacts arise because the AR decoder learns implicit attention alignments between phonemes and codec frames, and these alignments occasionally collapse on out-of-distribution inputs.

The VALL-E R fix is a monotonic alignment strategy. During training, the model jointly predicts the next acoustic token and the next phoneme position, with a loss that constrains the phoneme position to advance monotonically. At inference time, this constraint is enforced explicitly so the decoder cannot skip over phonemes or revisit earlier ones. The paper reports a WER close to ground truth and a 60% reduction in inference-time autoregressive steps compared to VALL-E.^[2]
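The inference-time constraint can be pictured as a pointer into the phoneme sequence that may only stay in place or move forward. The sketch below is a simplified illustration of that idea, not the VALL-E R algorithm: the stay-or-advance-by-one rule and the random "advance" decision (standing in for the model's position prediction) are assumptions made for the example.

```python
import random


def decode_with_monotonic_pointer(phonemes, advance_prob, rng, max_steps=1000):
    """Return the phoneme index attended to at each acoustic step.

    The pointer starts at 0 and, after each step, either stays or moves
    forward by exactly one, so phonemes can be neither skipped nor revisited.
    """
    if not phonemes:
        return []
    pointer = 0
    path = []
    for _ in range(max_steps):
        path.append(pointer)
        # Hypothetical stand-in for the model's predicted position update.
        if rng.random() < advance_prob:
            pointer += 1
            if pointer == len(phonemes):
                break  # every phoneme has been consumed; stop generating
    return path


rng = random.Random(0)
path = decode_with_monotonic_pointer(list("abcde"), advance_prob=0.5, rng=rng)
steps = [b - a for a, b in zip(path, path[1:])]
print(set(steps) <= {0, 1})  # True
```

Because the pointer never jumps or moves backward, the skipping and repetition failures described above are ruled out by construction in this toy setting.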

Did VALL-E 2 reach human parity?

VALL-E 2, published in June 2024 (arXiv:2406.05370, submitted June 8, 2024), claims to be the first zero-shot TTS system to reach human parity on robustness, naturalness, and speaker similarity, with WER, SMOS, and CMOS scores at or above ground truth on both LibriSpeech and VCTK.^[3] In the authors' words it is "the first of its kind to reach human parity on these benchmarks."^[3] The paper introduces two main techniques:

Repetition-aware sampling. The original VALL-E used standard nucleus sampling, which occasionally chose tokens that produced word-level repetitions or got stuck in loops. Repetition-aware sampling first draws a token with nucleus sampling, then checks how often that token has appeared in a recent window of the decoding history; if its share of the window exceeds a threshold, the token is replaced by one drawn at random from the full predicted distribution. The paper describes it as accounting "for token repetition in the decoding history," which "not only stabilizes the decoding but also circumvents the infinite loop issue."^[3]
Grouped code modeling. Instead of generating one EnCodec token per AR step, the VALL-E 2 AR decoder groups several adjacent frames and predicts them jointly. This "organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling."^[3]
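Both techniques can be sketched in a few lines. The version below assumes that the repetition check takes the form of a redraw from the unfiltered distribution whenever the nucleus-sampled token dominates a recent window; the window size, threshold, top_p value, group size, and all function names are illustrative choices rather than the paper's settings.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample from the smallest set of top tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng,
                            top_p=0.8, window=10, threshold=0.5):
    """Nucleus-sample a token, redrawing if it dominates the recent history."""
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) > threshold:
        # Fall back to plain random sampling over the full distribution,
        # which gives low-probability tokens a chance to break the loop.
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size):
    """Split a code sequence into consecutive groups (the last may be shorter)."""
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]


rng = random.Random(0)
probs = [0.9, 0.05, 0.05]
stuck_history = [0] * 10
draws = [repetition_aware_sample(probs, stuck_history, rng) for _ in range(500)]
print(any(t != 0 for t in draws))        # True
print(group_codes(list(range(6)), 2))    # [[0, 1], [2, 3], [4, 5]]
```

With an empty history the sampler behaves exactly like nucleus sampling; the fallback only triggers once one token has taken over the window. Grouping with size 2 halves the number of AR steps for the same audio length.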

Microsoft's VALL-E 2 release notes explicitly stated that the model would not be productized or released because of the risks of voice imitation without consent. The project's ethics statement reads: "VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public," and warns that it "may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker."^[16] Multiple outlets (Live Science, Decrypt, MarkTechPost, Synced) covered the paper as the first credible demonstration of human-parity zero-shot voice cloning.^[7]^[9]^[17]

Influence on later systems

VALL-E established the template that most later neural-codec TTS systems followed: tokenize audio with a residual VQ codec, condition a Transformer on the codec sequence and a phoneme transcript, and use a 3-second prompt for speaker conditioning. Suno's Bark, released in April 2023, was the most direct adopter, using EnCodec tokens and a similar coarse-to-fine generation strategy in fully open-source form.^[12] Coqui's XTTS v2 (September 2023) and the OpenVoice line of models also drew on the codec-language-model formulation. ElevenLabs has not published architecture details for its commercial models, but its Multilingual v2 release in August 2023 produced output with the same hallmarks (3-second prompt, emotion preservation, cross-lingual capabilities) that VALL-E and VALL-E X had demonstrated in research the same year.

Why was VALL-E controversial?

VALL-E drew immediate attention from the press and the speech research community when it was posted in January 2023. Coverage focused both on the technical leap and on the obvious abuse potential of a model that could clone a voice from three seconds of audio. TechNewsWorld, MIT Technology Review, and others ran stories within days of the arXiv release.^[10] Microsoft's published ethics statement on the project page acknowledged the risks of impersonation, fraud, and bypassing voice authentication, and committed to keeping the model in a research-only state.

Industry observers noted that VALL-E was part of a broader pattern in 2023 and 2024 in which large labs published advanced voice-cloning research but withheld the models. Meta's Voicebox (June 2023), OpenAI's Voice Engine (March 2024), and Google DeepMind's audio research were all held back from public release for similar reasons.^[7] By contrast, smaller commercial labs such as ElevenLabs and Resemble AI shipped voice cloning products that achieved comparable quality, accepting the safety risk in exchange for product distribution.

Research on detection of synthesized speech accelerated in parallel: the ASVspoof challenge added neural codec language model attacks to its 2024 evaluation set, and several detection papers used VALL-E samples (or community VALL-E X reproductions) as adversarial examples for spoofing detection.^[11]

How does VALL-E compare to other TTS systems?

VALL-E sits in a wave of late 2022 and 2023 zero-shot TTS systems that all moved from continuous mel-spectrogram regression to either discrete codec tokens or latent diffusion. The most-discussed contemporaries are summarized below.

System	Lab	Released	Approach	Public weights
VALL-E	Microsoft	Jan 2023	AR + NAR Transformer over EnCodec tokens	No
Tortoise TTS	Independent (James Betker)	Apr 2022	AR Transformer + diffusion decoder	Yes
Bark	Suno	Apr 2023	GPT-style Transformer over EnCodec tokens	Yes
NaturalSpeech 2	Microsoft	Apr 2023	Latent diffusion over codec latents	No
Voicebox	Meta	Jun 2023	Flow matching over mel-spectrograms	No
ElevenLabs Multilingual v2	ElevenLabs	Aug 2023	Undisclosed (proprietary)	API only
XTTS v2	Coqui	Sep 2023	GPT-style AR over discrete tokens	Yes
OpenVoice	MyShell	Dec 2023	Tone color converter + base TTS	Yes
NaturalSpeech 3	Microsoft	Mar 2024	Factorized vector quantization + diffusion	No
VALL-E 2	Microsoft	Jun 2024	AR + NAR with grouped codes, repetition-aware sampling	No
OpenAI Voice Engine	OpenAI	Mar 2024	Undisclosed	No

VALL-E most directly inspired Bark, which uses the same EnCodec backbone and a similar coarse-to-fine token decomposition.^[12] Tortoise TTS, released several months before VALL-E, also used a discrete token bottleneck (a custom autoencoder) and an AR Transformer, but routed its output through a diffusion decoder rather than a NAR codec head, which made it slower at inference.

How does VALL-E differ from Microsoft's NaturalSpeech series?

Microsoft Research Asia ran a parallel TTS line under the NaturalSpeech name, starting with NaturalSpeech (a non-zero-shot, single-speaker end-to-end model from 2022). After VALL-E, the NaturalSpeech line pivoted to zero-shot synthesis but used diffusion rather than autoregressive language modeling.

NaturalSpeech 2

NaturalSpeech 2, published in April 2023 (arXiv:2304.09116), uses a latent diffusion model over neural audio codec latents. The system was trained on roughly 44,000 hours of speech and singing data. It generates the codec latents directly with a diffusion model conditioned on phonemes and a speaker prompt, which avoids the long autoregressive sequence length that limits VALL-E on extended utterances. NaturalSpeech 2 was the first TTS paper to demonstrate strong zero-shot singing synthesis from a speech-only prompt.^[13]

NaturalSpeech 3

NaturalSpeech 3, published in March 2024, factorizes speech into separate subspaces for content, prosody, timbre, and acoustic detail using a factorized vector quantization (FVQ) codec, then uses a separate diffusion model to generate each subspace. The factorization is meant to disentangle attributes so that, for example, prosody from one prompt can be combined with timbre from another. The paper reported state-of-the-art quality on LibriSpeech and outperformed VALL-E and NaturalSpeech 2 on speaker similarity and naturalness at the time of publication.^[14]

The NaturalSpeech series and the VALL-E series represent two design philosophies inside Microsoft Research: language modeling over discrete tokens (VALL-E) versus diffusion over continuous or quantized latents (NaturalSpeech). Both lines remained research-only.

Are there open-source reproductions of VALL-E?

Because Microsoft did not release weights, several community projects re-implemented the architecture and trained on public corpora. The most-used reproductions are:

Project	Maintainer	First release	Notes
lifeiteng/vall-e	Li Feiteng	February 2023	PyTorch implementation, trained on LibriTTS (smaller than Libri-Light). Demo page at lifeiteng.github.io/valle.
Plachtaa/VALL-E-X	Plachtaa	August 2023	Open-source VALL-E X reproduction supporting English, Chinese, Japanese. Widely used in voice-cloning applications.
enhuiz/vall-e	Zhou Enhui	January 2023	Early academic reproduction.

None of these reproductions matched Microsoft's published WER and SECS numbers, a gap their maintainers attributed mainly to the difference in training data scale (a few thousand hours of LibriTTS or LibriSpeech versus 60,000 hours of Libri-Light) and the lack of access to Microsoft's internal preprocessing pipeline.^[15]

What are VALL-E's limitations?

The original VALL-E paper itself enumerates several limitations that successor systems then tried to address.^[1]

Synthesis errors. The AR decoder occasionally skipped, repeated, or hallucinated words, especially on long sentences. VALL-E R and VALL-E 2 specifically targeted this with monotonic alignment and grouped codes.
Data coverage. Despite 60,000 hours of training data, VALL-E performed worse on accents underrepresented in Libri-Light (such as the Commonwealth accents in VCTK) and on voices that differ strongly in age or vocal style from the training distribution.
Single language. VALL-E supports only English. VALL-E X added Chinese and Japanese, but coverage of low-resource languages remains an open problem.
Inference cost. The AR step is sequential and produces only one EnCodec frame (around 13 ms of audio) per Transformer forward pass, which is slow on CPUs. NAR steps are parallel but still consume meaningful GPU time. Grouped code modeling in VALL-E 2 partially addressed this.
No public release. Reproducing VALL-E required either re-implementing the architecture from scratch and assembling a comparable training corpus or using a community fork. Several open-source implementations (lifeiteng/vall-e, Plachtaa/VALL-E-X) reached partial parity but did not match Microsoft's reported metrics.^[15]
Misuse risk. The same property that makes the model useful (high-fidelity zero-shot cloning from a 3-second sample) makes it dangerous, and the published research provides limited mitigations beyond not releasing the model.
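The inference-cost item above can be made concrete with a little arithmetic. The sketch below counts sequential AR forward passes for a clip of a given length, using the 75 frames-per-second rate from the EnCodec section; the group size of 2 is a hypothetical value chosen only to show the effect of grouped code modeling.

```python
import math

FRAME_RATE_HZ = 75  # EnCodec frames per second (from the EnCodec section)


def frame_duration_ms() -> float:
    """Audio duration covered by one EnCodec frame."""
    return 1000 / FRAME_RATE_HZ


def ar_steps(seconds: float, group_size: int = 1) -> int:
    """Sequential AR forward passes needed for a clip of the given length."""
    frames = round(seconds * FRAME_RATE_HZ)
    return math.ceil(frames / group_size)


print(round(frame_duration_ms(), 1))  # 13.3
print(ar_steps(10))                   # 750
print(ar_steps(10, group_size=2))     # 375
```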

References

1. Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., Chen, Z., Liu, Y., Wang, H., Li, J., He, L., Zhao, S., Wei, F. (2023). "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers." arXiv:2301.02111. https://arxiv.org/abs/2301.02111
2. Han, B., Zhou, L., Liu, S., Chen, S., Meng, L., Qian, Y., Liu, Y., Zhao, S., Li, J., Wei, F. (2024). "VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment." arXiv:2406.07855. https://arxiv.org/abs/2406.07855
3. Chen, S., Liu, S., Zhou, L., Liu, Y., Tan, X., Li, J., Zhao, S., Qian, Y., Wei, F. (2024). "VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers." arXiv:2406.05370. https://arxiv.org/abs/2406.05370
4. Défossez, A., Copet, J., Synnaeve, G., Adi, Y. (2022). "High Fidelity Neural Audio Compression." arXiv:2210.13438. https://arxiv.org/abs/2210.13438
5. Kahn, J., Riviere, M., Zheng, W., et al. (2020). "Libri-Light: A Benchmark for ASR with Limited or No Supervision." ICASSP 2020. https://arxiv.org/abs/1912.07875
6. Microsoft Research project page, "VALL-E." https://www.microsoft.com/en-us/research/project/vall-e-x/
7. Live Science (2024). "Microsoft's AI speech generator VALL-E 2 'reaches human parity' but it's too dangerous to release." https://www.livescience.com/technology/artificial-intelligence/ai-speech-generator-reaches-human-parity-but-its-too-dangerous-to-release-scientists-say
8. Zhang, Z., Zhou, L., Wang, C., Chen, S., Wu, Y., Liu, S., Chen, Z., Liu, Y., Wang, H., Li, J., He, L., Zhao, S., Wei, F. (2023). "Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling." arXiv:2303.03926. https://arxiv.org/abs/2303.03926
9. Decrypt (2024). "Microsoft's AI Voice Cloning Tech Is So Good, You Can't Use It." https://decrypt.co/238419/microsoft-ai-voice-clone-human-parity
10. TechNewsWorld (2023). "Microsoft VALL-E Clones Anyone's Voice From a 3-Second Sample." https://www.technewsworld.com/story/microsofts-new-ai-can-simulate-anyones-voice-from-a-3-second-sample-177646.html
11. ASVspoof 2024 challenge overview. https://www.asvspoof.org/
12. Suno AI Bark repository. https://github.com/suno-ai/bark
13. Shen, K., Ju, Z., Tan, X., Liu, Y., Leng, Y., He, L., Qin, T., Zhao, S., Bian, J. (2023). "NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers." arXiv:2304.09116. https://arxiv.org/abs/2304.09116
14. Ju, Z., Wang, Y., Shen, K., et al. (2024). "NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models." arXiv:2403.03100. https://arxiv.org/abs/2403.03100
15. lifeiteng/vall-e community implementation. https://github.com/lifeiteng/vall-e
16. Microsoft Research project page, "VALL-E 2" (Ethics Statement). https://www.microsoft.com/en-us/research/project/vall-e-x/vall-e-2/
17. Synced (2024). "Microsoft's VALL-E 2: First Time Human Parity in Zero-Shot Text-to-Speech Achieved." https://syncedreview.com/2024/06/11/microsofts-vall-e-2-first-time-human-parity-in-zero-shot-text-to-speech-achieved/
