{{see also|Papers}}

==Introduction==
In the last decade, there have been significant advances in speech synthesis driven by neural networks and end-to-end modeling. Current text-to-speech (TTS) systems require high-quality data from recording studios, and they generalize poorly to unseen speakers in zero-shot settings. A new TTS framework, VALL-E, has been developed to address these issues. It uses audio codec codes as an intermediate representation, together with a language-modeling approach to TTS.
==Vall-E==
VALL-E is a zero-shot TTS model that operates on discrete audio representations. It consists of an autoregressive (AR) decoder-only language model and a non-autoregressive (NAR) language model.
The AR model generates tokens from the first quantizer codebook, conditioned on the phoneme sequence <math>x</math> and the acoustic prompt <math>\tilde{C}_{:,1}</math>.
The NAR model generates tokens from the second through the last quantizer codebooks, conditioned on the phoneme sequence <math>x</math>, the acoustic prompt <math>\tilde{C}</math>, and the predicted acoustic tokens of the previous codebooks <math>C_{:,<j}</math>.
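The division of labor between the two models can be written out explicitly. The following is a reconstruction following the paper's formulation, where <math>C</math> is the <math>T \times 8</math> acoustic code matrix (the codec uses 8 quantizer codebooks), <math>\tilde{C}</math> the codes of the enrolled prompt, and <math>x</math> the phoneme sequence:

<math display="block">p(c_{:,1} \mid x, \tilde{C}_{:,1}; \theta_{AR}) = \prod_{t=0}^{T} p(c_{t,1} \mid c_{<t,1}, \tilde{C}_{:,1}, x; \theta_{AR})</math>

<math display="block">p(C_{:,2:8} \mid x, \tilde{C}; \theta_{NAR}) = \prod_{j=2}^{8} p(c_{:,j} \mid C_{:,<j}, x, \tilde{C}; \theta_{NAR})</math>

The AR factorization is sequential over time <math>t</math>, which lets the output length vary freely, while the NAR model predicts all frames of a codebook in parallel, trading some output diversity for inference speed.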
For inference, sampling-based decoding is used for the AR model and greedy decoding for the NAR model.
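A minimal sketch of the two decoding strategies follows. The <code>ar_model</code> and <code>nar_model</code> callables and their signatures are illustrative assumptions, not the released API:

<syntaxhighlight lang="python">
import torch

def ar_decode_sampling(ar_model, phonemes, prompt_codes, max_len, eos_id, temperature=1.0):
    """Sampling-based decoding for the AR model (first-codebook tokens).

    `ar_model` is a hypothetical callable returning next-token logits of
    shape (vocab,) given the phoneme sequence and the tokens so far.
    """
    tokens = prompt_codes.tolist()
    for _ in range(max_len):
        logits = ar_model(phonemes, torch.tensor(tokens))
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1).item()  # sample, not argmax
        if next_tok == eos_id:
            break
        tokens.append(next_tok)
    return torch.tensor(tokens)

def nar_decode_greedy(nar_model, phonemes, prompt_codes, first_codebook, num_codebooks=8):
    """Greedy decoding for the NAR model, one codebook per forward pass.

    `nar_model` is a hypothetical callable returning logits of shape
    (T, vocab) for codebook j given all previously decoded codebooks.
    """
    codes = [first_codebook]                      # (T,) tokens from the AR stage
    for _ in range(2, num_codebooks + 1):
        logits = nar_model(phonemes, prompt_codes, torch.stack(codes))
        codes.append(logits.argmax(dim=-1))       # greedy: most likely token per frame
    return torch.stack(codes)                     # (num_codebooks, T) code matrix
</syntaxhighlight>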
Two inference settings of VALL-E are proposed: one uses the phoneme transcription and the first-layer acoustic tokens of an enrolled recording as prompts, while the other uses the whole transcription and the first 3 seconds of the utterance as prompts and generates the continuation; a sketch of both settings follows.
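The difference between the two settings amounts to how the phoneme and acoustic prompts are assembled. A minimal sketch, with hypothetical helper names and assuming a 75 codes/second frame rate (EnCodec's rate; any codec frame rate works the same way):

<syntaxhighlight lang="python">
def build_prompts_enrolled(enrolled_phonemes, enrolled_codes, target_phonemes):
    # Setting 1: the transcription and acoustic tokens of an enrolled
    # recording prompt the synthesis of unrelated target text.
    phoneme_prompt = enrolled_phonemes + target_phonemes
    acoustic_prompt = enrolled_codes
    return phoneme_prompt, acoustic_prompt

def build_prompts_continual(utterance_phonemes, utterance_codes, frame_rate=75):
    # Setting 2: the whole transcription plus the first 3 seconds of the
    # same utterance prompt the model, which then generates the continuation.
    # frame_rate=75 is an assumption based on EnCodec's 75 codes/second.
    acoustic_prompt = utterance_codes[: 3 * frame_rate]
    return utterance_phonemes, acoustic_prompt
</syntaxhighlight>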