WaveNet

Updated Jun 23, 2026

WaveNet is a deep generative model for raw audio waveforms developed by DeepMind that synthesizes speech by predicting one waveform sample at a time, each conditioned on all the samples before it. It was introduced in September 2016 in the paper "WaveNet: A Generative Model for Raw Audio," with an accompanying blog post published on 8 September 2016 and the paper posted to arXiv on 12 September 2016 ^[1]^[2]. Rather than working with intermediate audio features, WaveNet generates sound directly at the level of individual waveform samples. Applied to text-to-speech, it produced speech that listeners judged substantially more natural than the leading systems of the day, narrowing the gap to recorded human speech by 51 percent for US English and 69 percent for Mandarin Chinese ^[1]^[2]. It went on to power the synthetic voices in Google Assistant and Google Cloud Text-to-Speech ^[3]^[4]. The DeepMind blog summarized the advance plainly: "WaveNet changes this paradigm by directly modelling the raw waveform of the audio signal, one sample at a time." ^[2]

The original model was authored by Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu ^[1]. The same core team later produced the faster successor, Parallel WaveNet, which made the approach practical to run in real products ^[5].

What problem was text-to-speech facing before WaveNet?

Before WaveNet, production text-to-speech relied on two dominant families of methods. Concatenative synthesis stitched together short fragments of recorded human speech drawn from a large database. It could sound natural within a fragment, but it was difficult to modify, demanded a large recorded corpus per voice, and tended to produce audible glitches at the joins between units. Parametric synthesis instead generated speech from a statistical model, often a hidden Markov model or, later, a recurrent neural network, that drove a signal-processing component called a vocoder. Parametric systems were more flexible and compact, but the resulting audio usually sounded more muffled and less natural than concatenative output ^[2].

Both approaches shared a basic limitation. They assembled or shaped speech out of pre-built pieces or hand-designed signal representations rather than modelling the audio signal itself. WaveNet broke from that tradition by treating the raw waveform as the thing to be learned and generated directly.

How does WaveNet work? Dilated causal convolutions

WaveNet is a fully probabilistic, autoregressive model. As the paper states, it is "fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones." ^[1] It defines the joint probability of an audio waveform as a product of conditional probabilities, where the distribution for each sample is conditioned on all of the previous samples ^[1]. Generation therefore proceeds sample by sample: the network predicts a probability distribution over the next sample, a value is drawn from it, that value is fed back as input, and the process repeats. Speech is typically modelled at 16,000 samples per second, so a single second of audio requires sixteen thousand of these sequential prediction steps ^[2].
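The predict, draw, feed-back loop described above can be sketched in a few lines. This is a minimal illustration, not DeepMind's implementation: `toy_next_sample_distribution` is a hypothetical stand-in for the trained network, and the 256 classes anticipate the quantization covered in the next paragraph.

```python
import random

NUM_CLASSES = 256      # quantized amplitude levels per sample
SAMPLE_RATE = 16_000   # samples per second of speech


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for the trained network.

    Returns one probability per class given the samples generated so far.
    This toy simply favours values close to the previous sample.
    """
    previous = history[-1] if history else NUM_CLASSES // 2
    weights = [1.0 / (1 + abs(value - previous)) for value in range(NUM_CLASSES)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, rng):
    """Sample x_1..x_T one at a time, so that
    p(x) = product over t of p(x_t | x_1, ..., x_{t-1})."""
    samples = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(samples)              # predict
        value = rng.choices(range(NUM_CLASSES), weights=probs)[0]  # draw
        samples.append(value)                                      # feed back
    return samples


rng = random.Random(0)
audio = generate(SAMPLE_RATE // 100, rng)  # 10 ms of audio
print(len(audio))  # 160 sequential steps
```

Each pass through the loop needs the output of the previous pass, which is why one second of 16 kHz audio costs sixteen thousand sequential network evaluations.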

To make each sample's distribution tractable, WaveNet does not predict a continuous value. It applies a mu-law companding transformation to the 16-bit audio and quantizes the result to 256 possible values, turning sample prediction into a 256-way classification problem at each step ^[1].
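A rough sketch of mu-law companding with mu = 255 follows. It assumes amplitudes have already been scaled from 16-bit integers to the range [-1, 1]; the function names and the exact rounding scheme are illustrative choices rather than details taken from the paper.

```python
import math

MU = 255  # mu = 255 yields 256 quantization levels


def mu_law_encode(x, mu=MU):
    """Map an amplitude in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    # Compress: sign(x) * ln(1 + mu*|x|) / ln(1 + mu), still in [-1, 1].
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift [-1, 1] to [0, mu] and round to the nearest level (assumed scheme).
    return int((compressed + 1) / 2 * mu + 0.5)


def mu_law_decode(index, mu=MU):
    """Approximate inverse: integer class back to an amplitude in [-1, 1]."""
    compressed = 2 * index / mu - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


print(mu_law_encode(-1.0), mu_law_encode(0.0), mu_law_encode(1.0))  # 0 128 255
print(mu_law_encode(0.5))                                            # 239
```

The logarithmic curve spends more of the 256 levels on quiet amplitudes, where small differences are more audible, than a uniform 8-bit quantizer would.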

The central architectural idea is the dilated causal convolution, a variant of the convolutional neural network. The convolutions are causal in that the prediction for a given sample can only depend on earlier samples, never later ones, which is what makes the model a valid autoregressive generator. They are dilated in that each convolutional layer skips input positions by a fixed step, so the effective span of the filter grows without adding parameters or losing resolution. In WaveNet the dilation factor doubles with depth in repeating blocks (1, 2, 4, and so on up to 512), and each such block of layers covers a receptive field of 1,024 samples. Within a block the receptive field grows exponentially with the number of layers, and stacking several blocks extends it further while keeping computation manageable ^[1]. This is what gives the network access to a long stretch of audio history without resorting to the recurrence used by recurrent neural networks.
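The receptive-field arithmetic can be checked with a small sketch. It assumes a filter width of 2, which is the width that reproduces the 1,024-sample figure for a 1 to 512 block; the helper names are illustrative.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


def dilated_causal_conv(signal, weights, dilation):
    """1-D causal convolution: output[t] uses signal[t], signal[t - d], ...

    Positions before the start of the signal count as zero (left padding),
    so output[t] never depends on any sample later than t.
    """
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            index = t - k * dilation
            if index >= 0:
                total += w * signal[index]
        output.append(total)
    return output


block = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
print(receptive_field(block))         # 1024
print(receptive_field(block * 3))     # 3070

# An impulse at t=0 spreads over 8 positions after layers with dilation 1, 2, 4.
x = [1.0] + [0.0] * 15
for d in [1, 2, 4]:
    x = dilated_causal_conv(x, [1.0, 1.0], d)
print(x[:9])  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
```

Three layers already cover 8 samples, and ten cover 1,024. Repeating the block adds roughly another 1,023 samples per repetition under this formula.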

Several supporting components round out the design ^[1]:

Component	Role
Dilated causal convolutions	Provide a large receptive field over past samples without recurrence
Gated activation units	Combine a tanh "content" path with a sigmoid "gate" path, multiplied elementwise
Residual connections	Help train a much deeper stack of layers
Skip connections	Aggregate outputs from many layers before the final prediction
Conditioning inputs	Steer generation, for example on linguistic features for TTS or on speaker identity
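As a rough illustration of how the gate, residual, and skip paths in the table fit together, the sketch below operates on plain Python lists. It is deliberately simplified: `filter_out` and `gate_out` stand for the outputs of two dilated convolutions that are not computed here, and the 1x1 convolutions WaveNet applies after the gate are omitted.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gated_activation(filter_out, gate_out):
    """Elementwise tanh(filter) * sigmoid(gate)."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]


def residual_layer(layer_input, filter_out, gate_out):
    """One simplified layer: gate, then emit a residual and a skip output."""
    gated = gated_activation(filter_out, gate_out)
    residual = [x + z for x, z in zip(layer_input, gated)]  # input to the next layer
    skip = gated                                            # summed across all layers
    return residual, skip


layer_input = [0.5, -0.25]
residual, skip = residual_layer(layer_input, filter_out=[0.0, 2.0], gate_out=[3.0, -3.0])
print(residual, skip)
```

In the second position the content path is strong (tanh(2) is about 0.96) but the gate is nearly closed (sigmoid(-3) is about 0.05), so very little of it passes through.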

For text-to-speech, WaveNet is conditioned on linguistic features derived from the input text, which guide the otherwise free-running waveform generator toward the intended words and prosody. Conditioning on a speaker identity instead lets a single network reproduce many different voices ^[1].
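In the paper's notation, conditioning adds an extra input h to every conditional distribution, and for a global signal such as speaker identity it enters each gated unit as a learned linear projection. The symbols below follow the paper rather than this article: k indexes the layer, * is a convolution, ⊙ is elementwise multiplication, σ is the sigmoid, and W and V are learned parameters.

```latex
p(\mathbf{x} \mid \mathbf{h}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}, \mathbf{h})

\mathbf{z} = \tanh\left(W_{f,k} * \mathbf{x} + V_{f,k}^{\top} \mathbf{h}\right) \odot \sigma\left(W_{g,k} * \mathbf{x} + V_{g,k}^{\top} \mathbf{h}\right)
```

Time-varying inputs such as linguistic features are handled analogously, with the projection replaced by a convolution over a conditioning series brought to the audio sample rate.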

How much more natural was WaveNet? The results

The headline result came from text-to-speech. In subjective listening tests, scored as a mean opinion score (MOS) on a 1 to 5 scale, WaveNet outperformed both the best parametric and the best concatenative baselines in US English and Mandarin Chinese ^[1]. The paper reported that WaveNet could "generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech." ^[1]

System	US English MOS	Mandarin Chinese MOS
LSTM-RNN parametric	3.67	3.79
HMM-driven concatenative	3.86	3.47
WaveNet	4.21	4.08
Natural speech (16-bit PCM)	4.55	4.21

By DeepMind's own framing, WaveNet narrowed the gap between the previous state of the art and natural human speech by over 50 percent for both languages; the paper reports reductions of 51 percent for US English and 69 percent for Mandarin ^[1]^[2]. The model was not limited to speech. As a generative model trained on a corpus of music, it generated novel and often realistic musical fragments, and when generating speech without text conditioning it produced human-like babbling along with non-speech sounds such as breaths and mouth movements, which underscored that it was modelling raw audio rather than just phonemes ^[1]^[2].
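The 51 and 69 percent figures can be reproduced from the MOS table above. The sketch assumes the gap is measured from the stronger of the two baselines in each language to the 16-bit natural-speech score; the function and variable names are illustrative.

```python
def gap_closed(best_baseline, wavenet, natural):
    """Fraction of the baseline-to-natural MOS gap that WaveNet closes."""
    return (wavenet - best_baseline) / (natural - best_baseline)


mos = {
    "US English": {"parametric": 3.67, "concatenative": 3.86, "wavenet": 4.21, "natural": 4.55},
    "Mandarin Chinese": {"parametric": 3.79, "concatenative": 3.47, "wavenet": 4.08, "natural": 4.21},
}

reductions = {}
for language, scores in mos.items():
    best = max(scores["parametric"], scores["concatenative"])
    reductions[language] = round(100 * gap_closed(best, scores["wavenet"], scores["natural"]))

print(reductions)  # {'US English': 51, 'Mandarin Chinese': 69}
```

Note that the strongest baseline differs by language: concatenative synthesis for US English, the LSTM-RNN parametric system for Mandarin.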

Why was WaveNet too slow, and how did Parallel WaveNet fix it?

WaveNet's quality came at a steep computational cost. Because generation is autoregressive, each of the thousands of samples per second has to be produced in sequence, with every new sample depending on the previous output. That sequential dependency does not map well onto the parallel hardware used in production, which made the original model far too slow to serve in a real-time, consumer-facing setting ^[3]^[4]. The Parallel WaveNet paper put the obstacle bluntly: sequential, one-sample-at-a-time generation is "poorly suited to today's massively parallel computers, and also to deployment in a real-time production setting." ^[5]

DeepMind addressed this in November 2017 with "Parallel WaveNet: Fast High-Fidelity Speech Synthesis," posted to arXiv on 28 November 2017 ^[5]. The method, called probability density distillation, is a form of knowledge distillation: it trains a parallel feed-forward "student" network using an already-trained autoregressive WaveNet as a "teacher." The student can generate every sample of an utterance in parallel rather than one after another, while the distillation objective keeps its output distribution close to the teacher's, so quality is preserved. DeepMind reported the student matched the teacher "with no significant difference in quality." ^[3]^[5]
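The distillation objective is a Kullback-Leibler divergence from the student to the teacher, which the Parallel WaveNet paper writes as the cross-entropy between the two minus the student's own entropy. The toy below only illustrates that identity on small discrete distributions with made-up numbers; the real student and teacher define continuous densities over whole waveforms.

```python
import math


def entropy(p):
    """H(p) for a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q): expected negative log-probability under q of samples from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(student, teacher):
    """D_KL(student || teacher) = H(student, teacher) - H(student)."""
    return cross_entropy(student, teacher) - entropy(student)


teacher = [0.7, 0.2, 0.1]            # illustrative values only
far_student = [0.1, 0.2, 0.7]
near_student = [0.65, 0.25, 0.10]

print(kl_divergence(far_student, teacher))   # large: student disagrees with teacher
print(kl_divergence(near_student, teacher))  # small: student nearly matches
```

Minimising this quantity pulls the student toward the teacher, while the entropy term discourages the student from collapsing onto a single most likely output.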

The payoff was large. The distilled system generated high-fidelity speech more than 20 times faster than real time, and DeepMind reported that the production model was more than 1,000 times faster than the original WaveNet, requiring roughly 50 milliseconds to synthesize one second of speech ^[3]^[4]^[6]. The parallel model also moved to higher-quality output, generating audio at 24,000 samples per second with 16 bits per sample, up from the original prototype's 8-bit quantization ^[3]^[4].
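The reported figures are consistent with one another, as a quick back-of-the-envelope check shows. The implied cost of the original model is an inference from the two reported numbers, not a figure stated in the sources.

```python
SYNTHESIS_MS_PER_AUDIO_SECOND = 50   # reported: roughly 50 ms per second of speech
SPEEDUP_OVER_ORIGINAL = 1_000        # reported: more than 1,000x (lower bound)

# 1,000 ms of audio produced in 50 ms of compute.
real_time_factor = 1_000 / SYNTHESIS_MS_PER_AUDIO_SECOND

# Inferred, not reported: what the original would need per second of audio.
implied_original_seconds = SYNTHESIS_MS_PER_AUDIO_SECOND * SPEEDUP_OVER_ORIGINAL / 1_000

# Raw output rate of the production model at 24 kHz and 16 bits per sample.
output_bits_per_second = 24_000 * 16

print(real_time_factor)          # 20.0
print(implied_original_seconds)  # 50.0
print(output_bits_per_second)    # 384000
```

Fifty milliseconds per second of audio is exactly the 20-times-real-time figure, and a thousandfold speedup implies the original needed on the order of 50 seconds of compute per second of speech.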

When did WaveNet ship in Google Assistant and Cloud Text-to-Speech?

The faster model reached production before the Parallel WaveNet paper itself was posted. On 4 October 2017, DeepMind announced that an updated WaveNet was generating the Google Assistant voices for US English and Japanese across all platforms, describing it as "capable of producing better and more realistic-sounding speech than existing techniques." ^[3] DeepMind reported that the new US English voice scored a MOS of 4.347, against 4.667 for recordings of natural human speech ^[3].

WaveNet then reached external developers through Google Cloud. On 28 March 2018, Google introduced Cloud Text-to-Speech, offering 32 voices across 12 languages and variants, with the WaveNet-generated voices available as a premium option built on the distilled model running on Cloud TPU hardware ^[4]. Google reported that listeners rated the new US English WaveNet voices at a MOS of 4.1, which it described as over 20 percent better than its standard voices and as reducing the gap with human speech by over 70 percent ^[4].

Why does WaveNet matter?

WaveNet marked a turning point for audio generation. It showed that a neural network could model and synthesize raw waveforms directly at high quality, displacing the long-standing reliance on concatenative units and hand-built parametric vocoders for the most natural-sounding output. Its core mechanism, the dilated causal convolution, became a standard tool for sequence modelling well beyond speech, and the autoregressive, sample-by-sample formulation influenced a wave of later neural audio systems.

The Parallel WaveNet work was equally consequential in practice. It demonstrated that the quality of a slow autoregressive teacher could be transferred into a fast parallel student through distillation, a pattern that recurs across modern generative modelling. Together, the two papers turned a research demonstration into the engine behind everyday synthetic voices and set the direction for DeepMind's later audio research, including systems such as AudioLM and the Lyria family of music models developed at Google DeepMind.

References

[1] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu. "WaveNet: A Generative Model for Raw Audio." arXiv:1609.03499, 12 September 2016. https://arxiv.org/abs/1609.03499
[2] DeepMind. "WaveNet: A generative model for raw audio." DeepMind Blog, 8 September 2016. https://deepmind.google/blog/wavenet-a-generative-model-for-raw-audio/
[3] DeepMind. "WaveNet launches in the Google Assistant." DeepMind Blog, 4 October 2017. https://deepmind.google/discover/blog/wavenet-launches-in-the-google-assistant/
[4] Google Cloud. "Introducing Cloud Text-to-Speech powered by DeepMind WaveNet technology." Google Cloud Blog, 28 March 2018. https://cloud.google.com/blog/products/ai-machine-learning/introducing-cloud-text-to-speech-powered-by-deepmind-wavenet-technology
[5] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, et al. "Parallel WaveNet: Fast High-Fidelity Speech Synthesis." arXiv:1711.10433, 28 November 2017. https://arxiv.org/abs/1711.10433
[6] DeepMind. "High-fidelity speech synthesis with WaveNet." DeepMind Blog, 22 November 2017. https://deepmind.google/blog/high-fidelity-speech-synthesis-with-wavenet/
