Given the rise of smart speakers and other devices that talk back to you, text-to-speech (TTS) is an important technology. Google’s solution for third-party developers is now generally available with new languages and features. Meanwhile, Cloud Speech-to-Text is also gaining new beta features.
Developed by DeepMind, WaveNet allows for high-fidelity voices in Cloud Text-to-Speech that sound more natural. Google opened up Cloud TTS to developers as a public beta in March, and the service is now generally available.
WaveNet voices are no longer limited to US English: there are 17 new ones, allowing developers to build apps for more languages. In total, Cloud Text-to-Speech now supports 30 standard voices and 26 WaveNet variants in 14 languages.
Because synthesized speech is played back on a wide range of hardware, Cloud TTS is also launching Audio Profiles in beta. Developers can specify the device the speech is intended for (headphones, smart speakers, or traditional telephony) so that Google can optimize the output.
From headphones to speakers to phone lines, audio files can sound quite different on different playback media and mechanisms. The physical properties of each device, as well as the environment they are placed in, influence the range of frequencies and level of detail they produce (e.g., bass, treble and volume). With the release of Audio Profiles, you can optimize Cloud Text-to-Speech for playback on different types of hardware.
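As a rough illustration of how an Audio Profile might be selected, the sketch below builds the JSON body for a Cloud Text-to-Speech synthesize request. It assumes the profile is passed in the `effectsProfileId` field of `audioConfig` and that the three device-class identifiers shown are valid; both should be checked against the current API reference. The helper name `build_synthesis_request` is hypothetical, and no network call is made.

```python
import json

# Device-class identifiers assumed from the Cloud Text-to-Speech documentation;
# this is a partial list, so verify it against the current reference.
KNOWN_PROFILES = {
    "headphone-class-device",
    "small-bluetooth-speaker-class-device",
    "telephony-class-application",
}


def build_synthesis_request(text, profile, language_code="en-US"):
    """Return a request body asking for speech tuned to one playback device.

    Hypothetical helper: it only assembles the dictionary and sends nothing.
    """
    if profile not in KNOWN_PROFILES:
        raise ValueError(f"unknown audio profile: {profile}")
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code},
        "audioConfig": {
            "audioEncoding": "LINEAR16",
            # Assumed field name; the API takes a list of profile IDs.
            "effectsProfileId": [profile],
        },
    }


if __name__ == "__main__":
    body = build_synthesis_request(
        "Your call is important to us.", "telephony-class-application"
    )
    print(json.dumps(body, indent=2))
```

The same text could be requested again with "headphone-class-device" to get output tuned for headphones instead of a phone line.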
There are also updates to the counterpart Cloud Speech-to-Text API. Multi-channel recognition allows for better transcription quality by separating each speaker into their own audio channel. For example, one channel could be for a customer and the other for the agent helping them.
If separation is not possible, speaker diarization lets developers specify the number of speakers in a conversation, and machine learning then assigns each word to a speaker.
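The choice between multi-channel recognition and speaker diarization can be sketched as a small configuration builder. The field names below (`audioChannelCount`, `enableSeparateRecognitionPerChannel`, `diarizationConfig`) are assumed from the Speech-to-Text REST `RecognitionConfig` and may differ between API versions; the helper `build_recognition_config` is hypothetical and only assembles a dictionary.

```python
def build_recognition_config(channel_count=1, speaker_count=None,
                             language_code="en-US"):
    """Pick multi-channel recognition when the audio has separate channels,
    otherwise fall back to diarization if a speaker count is known.

    Hypothetical helper; field names are assumptions to verify.
    """
    config = {"languageCode": language_code}
    if channel_count > 1:
        # One speaker per channel, e.g. customer on one and agent on the other.
        config["audioChannelCount"] = channel_count
        config["enableSeparateRecognitionPerChannel"] = True
    elif speaker_count is not None:
        # Single mixed channel: let the service tag each word with a speaker.
        config["diarizationConfig"] = {
            "enableSpeakerDiarization": True,
            "minSpeakerCount": speaker_count,
            "maxSpeakerCount": speaker_count,
        }
    return config


if __name__ == "__main__":
    print(build_recognition_config(channel_count=2))
    print(build_recognition_config(speaker_count=2))
```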
With language auto-detect, Cloud STT can determine what language a user spoke and return the transcript in that language. Developers can give users an option to set their preferred languages. Another feature aimed at improving accuracy is word-level confidence scores, with which the API reports how sure it is about each transcribed word.
For example, if a user inputs “please setup a meeting with John for tomorrow at 2PM” into your app, you can decide to prompt the user to repeat “John” or “2PM” if either has low confidence, but not to reprompt for “please” even if it has low confidence, since it’s not critical to that particular sentence.
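That decision can be sketched in a few lines. The example assumes each recognized word arrives as a dictionary with `word` and `confidence` keys; the confidence values, the 0.8 threshold, and the helper name `words_to_reprompt` are all made up for illustration.

```python
def words_to_reprompt(words, critical, threshold=0.8):
    """Return the critical words whose confidence falls below the threshold.

    Non-critical words are ignored no matter how low their confidence is.
    """
    return [
        w["word"]
        for w in words
        if w["word"] in critical and w["confidence"] < threshold
    ]


if __name__ == "__main__":
    # Illustrative scores only; real values come from the API response.
    recognized = [
        {"word": "please", "confidence": 0.41},
        {"word": "setup", "confidence": 0.93},
        {"word": "John", "confidence": 0.55},
        {"word": "2PM", "confidence": 0.92},
    ]
    print(words_to_reprompt(recognized, critical={"John", "2PM"}))
```

With these sample scores only "John" is flagged: "please" scores lower but is not in the critical set, and "2PM" is above the threshold.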