Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E): Revision history

7 April 2023

2 March 2023

  • 17:07, 2 March 2023, Nicoboomer (talk | contribs), 3,809 bytes (+871): → Vall-E
  • 16:56, 2 March 2023, Nicoboomer (talk | contribs), 2,938 bytes (+2,938): Created page with "{{see also|Papers}} ==Introduction== In the last decade, there have been significant advances in speech synthesis via neural networks and end to end modeling. Current text-to-speech (TTS), systems require high-quality data from recording studios. They also suffer from poor generalization for unseen speaker in zero-shot situations. A new TTS framework, VALL-E, has been developed to address this issue. It uses audio codec codes for an intermediate representation as well a..."