{{see also|Papers}}

==Introduction==
In the last decade, there have been significant advances in speech synthesis driven by neural networks and end-to-end modeling. Current text-to-speech (TTS) systems require high-quality data from recording studios, and they generalize poorly to unseen speakers in zero-shot settings. A new TTS framework, VALL-E, has been developed to address these issues. It uses audio codec codes as an intermediate representation, together with a language-modeling approach to TTS.
==Vall-E==
VALL-E is a zero-shot TTS model that operates on discrete audio representations. It consists of an autoregressive (AR) decoder-only language model and a non-autoregressive (NAR) language model.
The AR model generates tokens from the first quantizer codebook, conditioned on the phoneme sequence <math>x</math> and the acoustic prompt <math>\tilde{C}_{:,1}</math>.
The NAR model generates tokens from the second through the last quantizer codebooks, conditioned on the phoneme sequence <math>x</math>, the acoustic prompt <math>\tilde{C}</math>, and the predicted acoustic tokens of the previous codebooks <math>C_{:,<j}</math>.
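The division of labor between the two models can be written out explicitly. The following is a reconstruction following the paper's formulation, where <math>C</math> is the <math>T \times 8</math> acoustic code matrix (the codec uses 8 quantizer codebooks), <math>\tilde{C}</math> the codes of the enrolled prompt, and <math>x</math> the phoneme sequence:

<math display="block">p(c_{:,1} \mid x, \tilde{C}_{:,1}; \theta_{AR}) = \prod_{t=0}^{T} p(c_{t,1} \mid c_{<t,1}, \tilde{C}_{:,1}, x; \theta_{AR})</math>

<math display="block">p(C_{:,2:8} \mid x, \tilde{C}; \theta_{NAR}) = \prod_{j=2}^{8} p(c_{:,j} \mid C_{:,<j}, x, \tilde{C}; \theta_{NAR})</math>

The AR factorization is sequential over time <math>t</math>, which lets the output length vary freely, while the NAR model predicts all frames of a codebook in parallel, trading some output diversity for inference speed.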
For inference, sampling-based decoding is used for the AR model and greedy decoding for the NAR model.
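A minimal sketch of the two decoding strategies follows. The <code>ar_model</code> and <code>nar_model</code> callables and their signatures are illustrative assumptions, not the released API:

<syntaxhighlight lang="python">
import torch

def ar_decode_sampling(ar_model, phonemes, prompt_codes, max_len, eos_id, temperature=1.0):
    """Sampling-based decoding for the AR model (first-codebook tokens).

    `ar_model` is a hypothetical callable returning next-token logits of
    shape (vocab,) given the phoneme sequence and the tokens so far.
    """
    tokens = prompt_codes.tolist()
    for _ in range(max_len):
        logits = ar_model(phonemes, torch.tensor(tokens))
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1).item()  # sample, not argmax
        if next_tok == eos_id:
            break
        tokens.append(next_tok)
    return torch.tensor(tokens)

def nar_decode_greedy(nar_model, phonemes, prompt_codes, first_codebook, num_codebooks=8):
    """Greedy decoding for the NAR model, one codebook per forward pass.

    `nar_model` is a hypothetical callable returning logits of shape
    (T, vocab) for codebook j given all previously decoded codebooks.
    """
    codes = [first_codebook]                      # (T,) tokens from the AR stage
    for _ in range(2, num_codebooks + 1):
        logits = nar_model(phonemes, prompt_codes, torch.stack(codes))
        codes.append(logits.argmax(dim=-1))       # greedy: most likely token per frame
    return torch.stack(codes)                     # (num_codebooks, T) code matrix
</syntaxhighlight>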
Two inference settings of VALL-E are proposed: one uses the phoneme transcription and the first-layer acoustic tokens of an enrolled recording as prompts, while the other uses the whole transcription and the first 3 seconds of the utterance as prompts and generates the continuation; a sketch of both settings follows.
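The difference between the two settings amounts to how the phoneme and acoustic prompts are assembled. A minimal sketch, with hypothetical helper names and assuming a 75 codes/second frame rate (EnCodec's rate; any codec frame rate works the same way):

<syntaxhighlight lang="python">
def build_prompts_enrolled(enrolled_phonemes, enrolled_codes, target_phonemes):
    # Setting 1: the transcription and acoustic tokens of an enrolled
    # recording prompt the synthesis of unrelated target text.
    phoneme_prompt = enrolled_phonemes + target_phonemes
    acoustic_prompt = enrolled_codes
    return phoneme_prompt, acoustic_prompt

def build_prompts_continual(utterance_phonemes, utterance_codes, frame_rate=75):
    # Setting 2: the whole transcription plus the first 3 seconds of the
    # same utterance prompt the model, which then generates the continuation.
    # frame_rate=75 is an assumption based on EnCodec's 75 codes/second.
    acoustic_prompt = utterance_codes[: 3 * frame_rate]
    return utterance_phonemes, acoustic_prompt
</syntaxhighlight>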