Google claims that the latest version of its AI-powered speech synthesis system, Tacotron 2, is almost indistinguishable from human speech – and has put some comparative examples online to demonstrate.
Tacotron 2 works directly from written text, and Google says it can use context to correctly pronounce identically spelled words like ‘read’ (present tense) and ‘read’ (past tense), respond to punctuation, and learn to stress words …
Dave Gershgorn explained in a Quartz piece how it works.
The system is Google’s second official generation of the technology, which consists of two deep neural networks. The first network translates the text into a spectrogram, a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet’s AI research lab DeepMind, which reads the chart and generates the corresponding audio.
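To make the two-stage idea concrete, here is a toy sketch of that pipeline shape in Python. This is not Tacotron 2 or WaveNet: the `text_to_spectrogram` stage here just emits random frames, and the `vocoder` stage sums sinusoids weighted by each frame's bin magnitudes. Only the data flow (text → spectrogram frames → audio samples) mirrors the architecture described above; every name and number is a placeholder.

```python
import numpy as np

SAMPLE_RATE = 16000   # audio samples per second (assumed)
N_BINS = 80           # frequency bins per spectrogram frame (assumed)
FRAME_HOP = 200       # audio samples generated per frame (assumed)

def text_to_spectrogram(text):
    # Stand-in for the first network: map each character to one
    # frame of N_BINS frequency magnitudes. A real model predicts
    # these from learned linguistic features, not random numbers.
    rng = np.random.default_rng(0)
    return rng.random((len(text), N_BINS))

def vocoder(spectrogram):
    # Stand-in for WaveNet: turn each frame into FRAME_HOP audio
    # samples by summing sinusoids weighted by the bin magnitudes.
    freqs = np.linspace(80, 7600, N_BINS)  # rough speech band in Hz
    t = np.arange(FRAME_HOP) / SAMPLE_RATE
    chunks = []
    for frame in spectrogram:
        chunk = (frame[:, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(axis=0)
        chunks.append(chunk / N_BINS)
    return np.concatenate(chunks)

spec = text_to_spectrogram("he read the book")   # 16 characters -> 16 frames
audio = vocoder(spec)                            # 16 frames -> 3200 samples
print(spec.shape, audio.shape)
```

The point of the split is that the spectrogram is a compact intermediate target that is easier for the first network to predict than raw audio, while the second network specialises in turning it into a waveform.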
Google doesn’t say which of its comparative samples is the original, and which was generated by Tacotron 2, but Gershgorn noticed that if you view the page source, the file names give the game away – spoiler below, so listen to the examples before reading on.
The context-driven pronunciation is particularly impressive, coping with phrases like “he thought it was time to present the present.” It can also interpret commas correctly, and use question marks to adjust the sentence’s intonation appropriately.
I have to say that the human voice they’ve imitated is a rather easier target than a typical voice, but it’s impressive all the same. I couldn’t tell the two apart until I peeked. To save you the trouble of doing the same, the artificial versions in the four comparison phrases are 2nd, 1st, 1st and 2nd.