Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)

{{see also|Papers}}

==Introduction==
In the last decade, there have been significant advances in speech synthesis through neural networks and end-to-end modeling. Current text-to-speech (TTS) systems require high-quality data from recording studios and generalize poorly to unseen speakers in zero-shot settings. A new TTS framework, VALL-E, has been developed to address these issues. It uses discrete audio codec codes as an intermediate representation.
 


==VALL-E==
VALL-E is a zero-shot TTS model that operates on discrete audio representations. It consists of an autoregressive (AR) decoder-only language model and a non-autoregressive (NAR) language model.
The AR model generates tokens from the first quantizer's codebook and is conditioned on the phoneme sequence <math>x</math> and the acoustic prompt <math>\tilde{C}_{:,1}</math>.
The NAR model generates tokens from the second through the last quantizer's codebooks; it is conditioned on the phoneme sequence <math>x</math>, the acoustic prompt <math>\tilde{C}</math>, and the predicted acoustic tokens of the previous codebooks <math>C_{:,<j}</math>.
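The two-stage conditioning described above corresponds to the following factorization (a sketch in the paper's notation; the number of codebooks, 8, is assumed from the EnCodec setup used there):

```latex
% AR stage: first-codebook tokens are generated one frame at a time,
% conditioned on the phoneme sequence x and the acoustic prompt.
p\bigl(c_{:,1} \mid x, \tilde{C}_{:,1}\bigr)
  = \prod_{t} p\bigl(c_{t,1} \mid c_{<t,1},\, x,\, \tilde{C}_{:,1}\bigr)

% NAR stage: each remaining codebook is predicted in parallel over time,
% conditioned on x, the full prompt, and all previously generated codebooks.
p\bigl(c_{:,j} \mid x, \tilde{C}, C_{:,<j}\bigr), \qquad j = 2, \dots, 8
```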
At inference, sampling-based decoding is used for the AR model and greedy decoding for the NAR model.
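The two decoding strategies can be illustrated with a minimal, self-contained sketch. The <code>dummy_*</code> scoring functions, the codebook size, and the codebook count are placeholders standing in for the real AR/NAR models, not the actual VALL-E implementation:

```python
import math
import random

random.seed(0)

VOCAB = 1024      # codec codebook size (assumed; EnCodec codebooks have 1024 entries)
N_CODEBOOKS = 8   # number of residual quantizers (assumed, as in the paper)

def dummy_ar_logits(prefix):
    # Stand-in for the AR model: fake per-token scores for the next frame.
    return [random.random() for _ in range(VOCAB)]

def dummy_nar_logits(tokens_so_far, j, t):
    # Stand-in for the NAR model at codebook j, frame t.
    return [((t * 31 + j * 17 + v) % VOCAB) / VOCAB for v in range(VOCAB)]

def sample_from(logits):
    # Sampling-based decoding: draw a token with probability
    # proportional to its softmax score.
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    r, acc = random.random() * total, 0.0
    for v, e in enumerate(exps):
        acc += e
        if acc >= r:
            return v
    return VOCAB - 1

def greedy_from(logits):
    # Greedy decoding: take the arg-max token.
    return max(range(len(logits)), key=lambda v: logits[v])

def generate(n_frames):
    # Stage 1 (AR): sample first-codebook tokens one frame at a time.
    first = []
    for _ in range(n_frames):
        first.append(sample_from(dummy_ar_logits(first)))
    codes = [first]
    # Stage 2 (NAR): greedily fill codebooks 2..N, each in parallel over frames,
    # conditioned on all previously generated codebooks.
    for j in range(1, N_CODEBOOKS):
        codes.append([greedy_from(dummy_nar_logits(codes, j, t))
                      for t in range(n_frames)])
    return codes

codes = generate(n_frames=5)
print(len(codes), len(codes[0]))  # → 8 5
```

Sampling in the AR stage keeps prosody varied across runs, while greedy decoding in the NAR stage is cheap because all frames of a codebook are predicted in one pass.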
Two settings of VALL-E are proposed: one uses the phoneme transcription and the first-layer acoustic tokens of the enrolled speech as prompts, while the other uses the first 3 seconds of the utterance as the prompt.