Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)


==Introduction==
Over the last decade, speech synthesis has advanced significantly through neural networks and end-to-end modeling. However, current text-to-speech (TTS) systems require high-quality data recorded in studios and generalize poorly to unseen speakers in zero-shot settings. VALL-E, a new TTS framework, addresses these issues by using discrete audio codec codes as an intermediate representation and by training on large, diverse, multi-speaker speech data. VALL-E is the first TTS framework with [[in-context learning]] capabilities, which enables prompt-based zero-shot TTS. It significantly outperforms the state-of-the-art zero-shot TTS system in both speech naturalness and speaker similarity, and it can synthesize diverse outputs from the same input text while preserving the acoustic environment and the speaker's emotion from the prompt. VALL-E is trained on LibriLight, a corpus containing 60K hours of English speech from over 7,000 speakers.
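Rather than mel spectrograms, VALL-E operates on the discrete codes produced by a neural audio codec (the paper uses EnCodec). Below is a minimal sketch of extracting such codec codes from a waveform, assuming the open-source <code>encodec</code> Python package; the file name <code>prompt.wav</code> is a placeholder.

<syntaxhighlight lang="python">
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the 24 kHz EnCodec model; 6 kbps corresponds to 8 codebooks,
# matching the codec configuration described in the VALL-E paper.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# "prompt.wav" is a placeholder path for an enrolled speaker recording.
wav, sr = torchaudio.load("prompt.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)  # add a batch dimension

# Encode to discrete codec codes: a [batch, n_codebooks, n_frames]
# tensor of integer indices, which VALL-E models as a token sequence.
with torch.no_grad():
    frames = model.encode(wav)
codes = torch.cat([codebook for codebook, _ in frames], dim=-1)
print(codes.shape)  # e.g. torch.Size([1, 8, n_frames])
</syntaxhighlight>

In the paper, the codes from the first codebook are predicted autoregressively, conditioned on the phoneme sequence and the acoustic prompt, while the remaining codebooks are predicted non-autoregressively.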


==Related Work==