Given the rise of smart assistants and smart home devices, text-to-speech (TTS) is increasingly a primary method of interaction. Google is today introducing its Cloud Text-to-Speech technology and making it available for developers to use.
Part of the Google Cloud Platform, the service lets anyone have IoT-enabled devices “talk back to you,” convert text-based media like articles and books into spoken audio, and, more interestingly, power real-time natural language conversations, which are ideal for voice response systems in call centers.
Google notes that Cloud Text-to-Speech authentically and “correctly pronounces complex text such as names, dates, times and addresses.” Meanwhile, developers can customize pitch, speaking rate, and volume gain, with the service featuring 32 different voices from 12 languages and variants.
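To make those knobs concrete, here is a minimal sketch of how a request body for the service's `text:synthesize` REST method might be assembled. The field names (`pitch`, `speakingRate`, `volumeGainDb`) follow the public API reference, but the numeric limits and the helper `build_synthesize_request` are assumptions for illustration and should be checked against the current documentation. The sketch only builds and validates the JSON; it makes no network call.

```python
import json

# Assumed limits recalled from the public API reference; verify before relying on them.
LIMITS = {
    "pitch": (-20.0, 20.0),         # semitones relative to the default pitch
    "speakingRate": (0.25, 4.0),    # 1.0 is the voice's normal speed
    "volumeGainDb": (-96.0, 16.0),  # decibels relative to the normal volume
}


def build_synthesize_request(text, language_code="en-US", pitch=0.0,
                             speaking_rate=1.0, volume_gain_db=0.0):
    """Return a JSON-serializable body for a text:synthesize call (hypothetical helper)."""
    audio_config = {
        "audioEncoding": "MP3",
        "pitch": pitch,
        "speakingRate": speaking_rate,
        "volumeGainDb": volume_gain_db,
    }
    # Reject values outside the assumed ranges before anything is sent.
    for field, (low, high) in LIMITS.items():
        value = audio_config[field]
        if not low <= value <= high:
            raise ValueError(f"{field}={value} is outside [{low}, {high}]")
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code},
        "audioConfig": audio_config,
    }


body = build_synthesize_request(
    "Your table is booked for 7:30 PM on March 28.",
    pitch=2.0,
    speaking_rate=0.9,
)
print(json.dumps(body, indent=2))
```

Validating on the client side is a design choice rather than a requirement: it surfaces an out-of-range pitch or rate before a billable request is made.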
Some of those voices are created by Alphabet AI subsidiary DeepMind through WaveNet, its generative model for raw audio. WaveNet is responsible for a “selection” of high-fidelity voices that sound more natural, with Google touting a 50% reduction in the gap with human performance.
Google describes the updated model in its announcement:
“With these adjustments, the new WaveNet model produces more natural sounding speech. In tests, people gave the new US English WaveNet voices an average mean-opinion-score (MOS) of 4.1 on a scale of 1-5, over 20% better than for standard voices and reducing the gap with human speech by over 70%. As WaveNet voices also require less recorded audio input to produce high quality models, we expect to continue to improve both the variety as well as quality of the WaveNet voices available to Cloud customers in the coming months.”
WaveNet has come a long way since its inception in late 2016. The current version runs on Google Cloud TPUs and generates raw waveforms 1,000 times faster than the original model, creating one second of speech in just 50 milliseconds.
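Those two figures can be put side by side with a little arithmetic. The sketch below assumes that the 1,000x speedup is measured against the original model on the same one second of output, which the article does not state explicitly; the function name `real_time_factor` is illustrative.

```python
def real_time_factor(audio_ms, synthesis_ms):
    """How many times faster than playback speed the audio is generated."""
    return audio_ms / synthesis_ms


AUDIO_MS = 1000      # one second of speech
SYNTHESIS_MS = 50    # time reported to generate it
SPEEDUP = 1000       # reported speedup (assumed to be versus the original model)

rtf = real_time_factor(AUDIO_MS, SYNTHESIS_MS)
implied_original_s = SYNTHESIS_MS * SPEEDUP / 1000  # milliseconds to seconds

print(f"Current model: {rtf:.0f}x faster than real time")
print(f"Implied original cost: {implied_original_s:.0f} s per second of speech")
```

Under that assumption, the current model generates audio 20 times faster than it plays back, while the original would have needed roughly 50 seconds per second of speech, far too slow for live conversation.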