WaveNet
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,556 words
WaveNet is a deep generative model for raw audio waveforms developed by DeepMind. It was introduced in September 2016 in the paper "WaveNet: A Generative Model for Raw Audio," with an accompanying blog post published on 8 September 2016 and the paper posted to arXiv on 12 September 2016 [1][2]. Rather than working with intermediate audio features, WaveNet generates sound directly at the level of individual waveform samples, predicting one sample at a time from all of the samples that came before it. Applied to text-to-speech, it produced speech that listeners judged substantially more natural than the leading systems of the day, and it went on to power the synthetic voices in Google Assistant and Google Cloud Text-to-Speech [3][4].
The original model was authored by Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu [1]. The same core team later produced the faster successor, Parallel WaveNet, that made the approach practical to run in real products.
Before WaveNet, production text-to-speech relied on two dominant families of methods. Concatenative synthesis stitched together short fragments of recorded human speech drawn from a large database. It could sound natural within a fragment, but it was difficult to modify, demanded a large recorded corpus per voice, and tended to produce audible glitches at the joins between units. Parametric synthesis instead generated speech from a statistical model, often a hidden Markov model or, later, a recurrent neural network, that drove a signal-processing component called a vocoder. Parametric systems were more flexible and compact, but the resulting audio usually sounded more muffled and less natural than concatenative output [2].
Both approaches shared a basic limitation. They assembled or shaped speech out of pre-built pieces or hand-designed signal representations rather than modelling the audio signal itself. WaveNet broke from that tradition by treating the raw waveform as the thing to be learned and generated directly.
WaveNet is a fully probabilistic, autoregressive model. It defines the joint probability of an audio waveform as a product of conditional probabilities, where the distribution for each sample is conditioned on all of the previous samples [1]. Generation therefore proceeds sample by sample: the network predicts a probability distribution over the next sample, a value is drawn from it, that value is fed back as input, and the process repeats. Speech is typically modelled at 16,000 samples per second, so a single second of audio requires sixteen thousand of these sequential prediction steps [2].
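The sample-by-sample loop described above can be sketched in a few lines. This is a toy illustration, not DeepMind's implementation: `toy_next_sample_distribution` is a hypothetical stand-in for the trained network, and the choice of 256 discrete output values is an assumption of the sketch.

```python
import random

NUM_CLASSES = 256  # assumption: one class per discrete amplitude level


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for the network: returns 256 probabilities.

    A real model would compute these from the past samples with a deep
    network; this toy simply favours values near the most recent sample.
    """
    center = history[-1] if history else NUM_CLASSES // 2
    weights = [1.0 / (1.0 + abs(v - center)) for v in range(NUM_CLASSES)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    """Draw samples one at a time, feeding each one back in as input."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(samples)
        # Draw a class index from the predicted categorical distribution.
        value = rng.choices(range(NUM_CLASSES), weights=probs, k=1)[0]
        samples.append(value)  # the drawn value joins the history
    return samples


audio = generate(160)  # 160 steps = 10 ms at 16,000 samples per second
```

Each pass through the loop depends on the result of the previous pass, which is why the steps cannot simply be run side by side.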
To make each sample's distribution tractable, WaveNet does not predict a continuous value. It applies a mu-law companding transformation to the 16-bit audio and quantizes the result to 256 possible values, turning sample prediction into a 256-way classification problem at each step [1].
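A minimal sketch of mu-law companding with 256 levels follows. It assumes the 16-bit samples have already been scaled to the range [-1, 1]; the function names and the rounding convention are illustrative choices, not taken from the paper.

```python
import math

MU = 255  # 256 quantization levels, indexed 0..255


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range input
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift [-1, 1] to [0, mu] and round to the nearest level.
    return int((compressed + 1.0) / 2.0 * mu + 0.5)


def mu_law_decode(index, mu=MU):
    """Approximate inverse: integer class back to a sample in [-1, 1]."""
    compressed = 2.0 * index / mu - 1.0
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)
```

Because the transform is logarithmic, quantization levels are packed more densely around zero, so quiet passages keep more resolution than a uniform 256-level quantizer would give them.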
The central architectural idea is the dilated causal convolution. The convolutions are causal in that the prediction for a given sample can depend only on earlier samples, never later ones, which is what makes the model a valid autoregressive generator. They are dilated in that each convolutional layer skips input positions by a fixed step, so the effective span of the filter grows without adding parameters or losing resolution. In WaveNet the dilation factor doubles with depth in repeating blocks (1, 2, 4, and so on up to 512), and each such block of layers covers a receptive field of 1,024 samples. Within a block the receptive field therefore grows exponentially with the number of layers, and stacking several blocks extends it further and adds capacity while keeping computation manageable [1]. This is what gives the network access to a long stretch of audio history without the recurrence used by recurrent neural networks.
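The 1,024-sample figure can be checked with a short calculation, and the causal constraint can be shown with a toy filter. The sketch below assumes a filter width of two taps per layer and zero padding at the start of the signal; `causal_dilated_conv` is an illustrative single-channel helper, not the paper's code.

```python
def receptive_field(dilations, kernel_size=2):
    """Input samples visible to one output of a stack of dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """Single-channel causal convolution with a two-tap filter.

    The output at position t uses only x[t] and x[t - dilation]; positions
    before the start of the signal are treated as zeros.
    """
    w_past, w_now = weights
    out = []
    for t in range(len(x)):
        past = x[t - dilation] if t - dilation >= 0 else 0.0
        out.append(w_past * past + w_now * x[t])
    return out


block = [2 ** i for i in range(10)]  # dilations 1, 2, 4, ..., 512
print(receptive_field(block))        # 1024
print(receptive_field(block * 3))    # 3070
```

Under these assumptions, ten layers are enough to see 1,024 samples, whereas ten undilated two-tap layers would see only 11.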
Several supporting components round out the design [1]:
| Component | Role |
|---|---|
| Dilated causal convolutions | Provide a large receptive field over past samples without recurrence |
| Gated activation units | Combine a tanh "content" path with a sigmoid "gate" path, multiplied elementwise |
| Residual connections | Help train a much deeper stack of layers |
| Skip connections | Aggregate outputs from many layers before the final prediction |
| Conditioning inputs | Steer generation, for example on linguistic features for TTS or on speaker identity |
For text-to-speech, WaveNet is conditioned on linguistic features derived from the input text, which guide the otherwise free-running waveform generator toward the intended words and prosody. Conditioning on a speaker identity instead lets a single network reproduce many different voices [1].
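The gated activation, the residual and skip paths, and a simple form of conditioning can be sketched together. This is a scalar, single-channel simplification under stated assumptions: `filter_pre` and `gate_pre` stand for the outputs of two dilated convolutions, `cond_filter` and `cond_gate` stand for a projected conditioning signal such as a speaker embedding, and `w_res` and `w_skip` replace the learned projections. None of these names come from the paper.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gated_unit(filter_pre, gate_pre, cond_filter=0.0, cond_gate=0.0):
    """tanh "content" path times sigmoid "gate" path, elementwise.

    The conditioning terms are added to both paths before the
    nonlinearities, so the same input can be steered differently.
    """
    return [
        math.tanh(f + cond_filter) * sigmoid(g + cond_gate)
        for f, g in zip(filter_pre, gate_pre)
    ]


def residual_and_skip(layer_input, z, w_res=1.0, w_skip=1.0):
    """Return (input to the next layer, contribution to the skip sum)."""
    residual_out = [x + w_res * v for x, v in zip(layer_input, z)]
    skip_out = [w_skip * v for v in z]
    return residual_out, skip_out


z = gated_unit([0.5, -1.0], [2.0, -50.0])
next_input, skip = residual_and_skip([0.1, 0.2], z)
```

In this example the second gate value is strongly negative, so the second element of `z` is almost zero and the layer passes its input through nearly unchanged.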
The headline result came from text-to-speech. In subjective listening tests, scored as a mean opinion score (MOS) on a 1 to 5 scale, WaveNet outperformed both the best parametric and the best concatenative baselines in US English and Mandarin Chinese [1].
| System | US English MOS | Mandarin Chinese MOS |
|---|---|---|
| LSTM-RNN parametric | 3.67 | 3.79 |
| HMM-driven concatenative | 3.86 | 3.47 |
| WaveNet | 4.21 | 4.08 |
| Natural speech (16-bit PCM) | 4.55 | 4.21 |
By DeepMind's own framing, WaveNet narrowed the gap between the previous state of the art and natural human speech by over 50 percent for both languages; the paper reports reductions of 51 percent for US English and 69 percent for Mandarin [1][2]. The model was not limited to speech. Trained on a corpus of music, it generated novel and often realistic musical fragments, and when generating speech without text conditioning it produced human-like babbling along with non-speech sounds such as breaths and mouth movements, which underscored that it was modelling raw audio rather than just phonemes [1][2].
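The 51 and 69 percent figures can be reproduced from the MOS table above by taking the stronger of the two baselines in each language as the previous state of the art. The helper name `gap_reduction` is illustrative.

```python
def gap_reduction(best_baseline, wavenet, natural):
    """Fraction of the baseline-to-natural MOS gap closed by WaveNet."""
    return (wavenet - best_baseline) / (natural - best_baseline)


# MOS values from the table; the best baseline differs by language
# (concatenative for US English, parametric for Mandarin Chinese).
english = gap_reduction(best_baseline=3.86, wavenet=4.21, natural=4.55)
mandarin = gap_reduction(best_baseline=3.79, wavenet=4.08, natural=4.21)

print(f"US English: {english:.0%}, Mandarin: {mandarin:.0%}")
# prints "US English: 51%, Mandarin: 69%"
```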
WaveNet's quality came at a steep computational cost. Because generation is autoregressive, each of the thousands of samples per second has to be produced in sequence, with every new sample depending on the previous output. That sequential dependency does not map well onto the parallel hardware used in production, which made the original model far too slow to serve in a real-time, consumer-facing setting [3][4].
DeepMind described its solution in "Parallel WaveNet: Fast High-Fidelity Speech Synthesis," posted to arXiv on 28 November 2017 [5]. The method, called probability density distillation, trains a parallel feed-forward "student" network using an already-trained autoregressive WaveNet as a "teacher." The student can generate every sample of an utterance in parallel rather than one after another, while the distillation objective keeps its output distribution close to the teacher's, so quality is preserved [3][5].
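The idea of keeping the student's distribution close to the teacher's can be illustrated with a Kullback-Leibler divergence, written as the cross-entropy between student and teacher minus the student's own entropy. The sketch below uses small discrete distributions purely for illustration; the real system works with continuous output distributions and estimates these quantities from samples, and the function names here are hypothetical.

```python
import math


def entropy(p):
    """H(p) for a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q); assumes q is positive wherever p is positive."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student, teacher):
    """KL(student || teacher) = H(student, teacher) - H(student)."""
    return cross_entropy(student, teacher) - entropy(student)


teacher = [0.7, 0.2, 0.1]
far_student = [0.1, 0.2, 0.7]
near_student = [0.6, 0.3, 0.1]

loss_far = distillation_loss(far_student, teacher)
loss_near = distillation_loss(near_student, teacher)
```

The loss is zero only when the two distributions match, and the entropy term discourages the student from collapsing onto a single high-probability output.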
The payoff was large. The distilled system generated high-fidelity speech more than 20 times faster than real time, and DeepMind reported that the production model was more than 1,000 times faster than the original WaveNet, requiring roughly 50 milliseconds to synthesize one second of speech [3][4][5]. The parallel model also moved to higher-quality output, generating audio at 24,000 samples per second with 16 bits per sample, up from the original prototype's 16,000 samples per second and 8-bit quantization [3][4].
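These figures are mutually consistent, as a quick back-of-the-envelope check shows. The calculation treats the reported "roughly 50 milliseconds" and "1,000 times" as exact, which is an assumption; the implied cost of the original model is derived from them rather than quoted from a source.

```python
SYNTHESIS_SECONDS_PER_AUDIO_SECOND = 0.050  # roughly 50 ms, treated as exact
SPEEDUP_OVER_ORIGINAL = 1_000               # "more than 1,000 times", as a floor
PRODUCTION_RATE_HZ = 24_000

# How many seconds of audio are produced per second of compute.
real_time_factor = 1.0 / SYNTHESIS_SECONDS_PER_AUDIO_SECOND

# Samples emitted per wall-clock second at the production sample rate.
samples_per_wall_second = PRODUCTION_RATE_HZ * real_time_factor

# Implied compute time for the original model to make one second of audio.
implied_original_seconds = (
    SYNTHESIS_SECONDS_PER_AUDIO_SECOND * SPEEDUP_OVER_ORIGINAL
)

# Amplitude levels available at each bit depth.
levels_8_bit = 2 ** 8
levels_16_bit = 2 ** 16
```

Under these assumptions the real-time factor comes out at 20, and the original model would have needed on the order of 50 seconds of compute for each second of speech.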
The faster model reached production before the Parallel WaveNet paper itself appeared. On 4 October 2017, DeepMind announced that an updated WaveNet was generating the Google Assistant voices for US English and Japanese across all platforms [3]. DeepMind reported that the new US English voice scored a MOS of 4.347, against 4.667 for recordings of natural human speech [3].
WaveNet then reached external developers through Google Cloud. On 28 March 2018, Google introduced Cloud Text-to-Speech, offering 32 voices across 12 languages and variants, with the WaveNet-generated voices available as a premium option built on the distilled model running on Cloud TPU hardware [4]. Google reported that listeners rated the new US English WaveNet voices at a MOS of 4.1, which it described as over 20 percent better than its standard voices and as reducing the gap with human speech by over 70 percent [4].
WaveNet marked a turning point for audio generation. It showed that a neural network could model and synthesize raw waveforms directly at high quality, displacing the long-standing reliance on concatenative units and hand-built parametric vocoders for the most natural-sounding output. Its core mechanism, the dilated causal convolution, became a standard tool for sequence modelling well beyond speech, and the autoregressive, sample-by-sample formulation influenced a wave of later neural audio systems.
The Parallel WaveNet work was equally consequential in practice. It demonstrated that the quality of a slow autoregressive teacher could be transferred into a fast parallel student through distillation, a pattern that recurs across modern generative modelling. Together, the two papers turned a research demonstration into the engine behind everyday synthetic voices and set the direction for DeepMind's later audio research, including systems such as AudioLM and the Lyria family of music models developed at Google DeepMind.