WaveNet launches in the Google Assistant

Authors

Aäron van den Oord, Tom Walters

Just over a year ago we presented WaveNet, a new deep neural network for generating raw audio waveforms that is capable of producing better and more realistic-sounding speech than existing techniques. At that time, the model was a research prototype and was too computationally intensive to work in consumer products.

But over the last 12 months we have worked hard to significantly improve both the speed and quality of our model, and today we are proud to announce that an updated version of WaveNet is being used to generate the Google Assistant voices for US English and Japanese across all platforms.

Using the new WaveNet model results in a range of more natural-sounding voices for the Assistant.

[Audio samples: US English voice I, US English voice II, US English third party voice, Japanese voice]

To understand why WaveNet improves on the current state of the art, it is useful to understand how text-to-speech (TTS) - or speech synthesis - systems work today.

The majority of these are based on so-called concatenative TTS, which uses a large database of high-quality recordings, collected from a single voice actor over many hours. These recordings are split into tiny chunks that can then be combined - or concatenated - to form complete utterances as needed. However, these systems can result in unnatural-sounding voices and are also difficult to modify, because a whole new database needs to be recorded each time a set of changes, such as new emotions or intonations, is needed.
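As a rough illustration of the concatenative idea, the Python sketch below joins pre-recorded waveform chunks for each phoneme. The `unit_database` and its contents are hypothetical placeholders; real systems use far more elaborate unit selection and smoothing at the joins.

```python
import numpy as np

# Hypothetical unit database: each phoneme maps to a short recorded waveform
# chunk (real systems store many candidate units per phoneme, selected to
# minimise audible joins).
unit_database = {
    "HH": np.random.randn(800),
    "EH": np.random.randn(1200),
    "L":  np.random.randn(900),
    "OW": np.random.randn(1500),
}

def concatenative_tts(phonemes):
    """Join pre-recorded units end to end to form an utterance."""
    return np.concatenate([unit_database[p] for p in phonemes])

waveform = concatenative_tts(["HH", "EH", "L", "OW"])  # roughly "hello"
```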

To overcome some of these problems, an alternative model known as parametric TTS is sometimes used. This does away with the need for concatenating sounds by using a series of rules and parameters about grammar and mouth movements to guide a computer-generated voice. Although cheaper and quicker, this method creates less natural-sounding voices.

WaveNet takes a totally different approach. In the original paper we described a deep generative model that can create individual waveforms from scratch, one sample at a time, at 16,000 samples per second and with seamless transitions between individual sounds.
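Conceptually, generation is an autoregressive loop: each new audio sample is drawn from a distribution predicted from all the samples generated so far. A minimal sketch of that loop, where `predict_next_sample_distribution` is a hypothetical stand-in for the trained network:

```python
import numpy as np

SAMPLE_RATE = 16000  # the original WaveNet generated 16,000 samples per second

def predict_next_sample_distribution(history):
    """Stand-in for the trained network: returns a probability distribution
    over the 256 possible 8-bit sample values, given the samples so far."""
    logits = np.random.randn(256)                # placeholder, not a real model
    return np.exp(logits) / np.exp(logits).sum()

def generate(duration_seconds=0.01, rng=np.random.default_rng(0)):
    samples = []
    for _ in range(int(duration_seconds * SAMPLE_RATE)):
        probs = predict_next_sample_distribution(samples)
        samples.append(rng.choice(256, p=probs))  # draw the next sample value
    return np.array(samples)

audio = generate()  # 160 samples, each conditioned on everything before it
```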

The structure of the convolutional neural network that underpins the original WaveNet model

It was built using a convolutional neural network, which was trained on a large dataset of speech samples. During this training phase, the network determined the underlying structure of the speech, such as which tones followed each other and which waveforms were realistic (and which were not). The trained network then synthesised a voice one sample at a time, with each generated sample taking into account the properties of previously generated samples. The resulting voice contained natural intonation and other features such as lip smacks. Its “accent” depended on the voices it was trained on, opening up the possibility of creating any number of unique voices from blended datasets. As with all text-to-speech systems, WaveNet used a text input to tell it which words it should generate in response to a query.
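The figure above refers to the stack of dilated causal convolutions at the heart of the original model: each layer looks only at the past, and doubling dilations let the receptive field grow quickly with depth. A small NumPy sketch of that idea (the two-tap filters and layer count are illustrative, not the production configuration):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Two-tap 1-D causal convolution: output[t] = w[0]*x[t - dilation] + w[1]*x[t],
    so no output ever depends on future samples."""
    padded = np.concatenate([np.zeros(dilation), x])
    return w[0] * padded[:-dilation] + w[1] * padded[dilation:]

x = np.random.randn(32)
h = x
for dilation in (1, 2, 4, 8):                       # doubling dilations
    h = np.tanh(causal_dilated_conv(h, np.array([0.5, 0.5]), dilation))
# After these four layers, each output depends on the previous 16 input samples;
# stacking more such layers is how the real model covers a long audio context.
```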

Building up sound waves at such high fidelity using the original model was computationally expensive, meaning WaveNet showed promise but was not something we could deploy in the real world. But over the last 12 months our teams have worked hard to develop a new model that is capable of generating waveforms much more quickly. It is also now capable of running at scale and is the first product to launch on Google’s latest TPU cloud infrastructure.

The WaveNet team will now turn their focus to preparing a publication detailing the research behind the new model, but the results speak for themselves. The new, improved WaveNet model still generates a raw waveform but at speeds 1,000 times faster than the original model, meaning it requires just 50 milliseconds to create one second of speech. In fact, the model is not just quicker, but also higher-fidelity, capable of creating waveforms with 24,000 samples a second. We have also increased the resolution of each sample from 8 bits to 16 bits, the same resolution used in compact discs.
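These figures fit together as simple arithmetic: 50 milliseconds of compute per second of speech means the new model runs 20 times faster than real time, whereas the original, 1,000 times slower, would have needed around 50 seconds of compute per second of audio. A quick check, with the values taken from the paragraph above:

```python
new_compute_per_audio_second = 0.050      # seconds of compute per second of speech
speedup = 1000

old_compute_per_audio_second = new_compute_per_audio_second * speedup   # ~50 s
real_time_factor_new = 1.0 / new_compute_per_audio_second               # 20x real time
real_time_factor_old = 1.0 / old_compute_per_audio_second               # 0.02x real time

sample_rate_old, sample_rate_new = 16_000, 24_000    # samples generated per second
bit_depth_old, bit_depth_new = 8, 16                 # 16 bits matches CD resolution

print(old_compute_per_audio_second, real_time_factor_new, real_time_factor_old)
```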

This makes the new model sound more natural, according to tests with human listeners. For example, the new US English voice I gets a mean opinion score (MOS) of 4.347 on a scale of 1-5, where even human speech is rated at just 4.667.

The new model also retains the flexibility of the original WaveNet, allowing us to make better use of large amounts of data during the training phase. Specifically, we can train the network using data from multiple voices. This can then be used to generate high-quality, nuanced voices even where there is little training data available for the desired output voice.
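One way to picture training on data from multiple voices is to give the network a speaker identity alongside its linguistic input, so a single model learns shared structure across voices and can be asked to speak as any one of them. A hedged sketch of that kind of conditioning (the embedding-based scheme and all names here are illustrative, not the production setup):

```python
import numpy as np

NUM_SPEAKERS, EMBED_DIM = 4, 8
speaker_embeddings = np.random.randn(NUM_SPEAKERS, EMBED_DIM)  # one vector per voice

def conditioned_input(text_features, speaker_id):
    """Concatenate linguistic features with the chosen speaker's embedding,
    so the same network can produce different voices from pooled training data."""
    return np.concatenate([text_features, speaker_embeddings[speaker_id]])

features = conditioned_input(np.random.randn(32), speaker_id=2)  # ask for voice 2
```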

We believe this is just the start for WaveNet and we are excited by the possibilities that the power of a voice interface could now unlock for all the world's languages.