Google voice search just got a whole lot smarter thanks to new acoustic model

Cam Bunton | Sep 25 2015 - 1:51 pm PT

Google’s voice recognition technology, used in software and services like Google Now and search, has been among the best for the past couple of years. Still, the company hasn’t been resting: it just announced that the methods by which it detects and predicts words have been improved to give faster, more efficient results with better reliability…

The big news is that Google has switched its acoustic model from what’s known as DNN (Deep Neural Networks) to RNN (Recurrent Neural Networks). Its research blog post contains a whole lot of nerd talk, but put simply, Google now uses models that can better understand entire words and phrases being spoken. More specifically, the new models can recognize where each different sound in a word begins and ends, even when vowel sounds sort of blend into each other.

The recognizer then reconciles all this information to determine the sentence the user is speaking. If the user speaks the word “museum” for example – /m j u z i @ m/ in phonetic notation – it may be hard to tell where the /j/ sound ends and where the /u/ starts, but in truth the recognizer doesn’t care where exactly that transition happens: All it cares about is that these sounds were spoken.

Our improved acoustic models rely on Recurrent Neural Networks (RNN). RNNs have feedback loops in their topology, allowing them to model temporal dependencies: when the user speaks /u/ in the previous example, their articulatory apparatus is coming from a /j/ sound and from an /m/ sound before. Try saying it out loud – “museum” – it flows very naturally in one breath, and RNNs can capture that. The type of RNN used here is a Long Short-Term Memory (LSTM) RNN which, through memory cells and a sophisticated gating mechanism, memorizes information better than other RNNs. Adopting such models already improved the quality of our recognizer significantly.
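To make the "memory cells and gating mechanism" idea concrete, here is a minimal sketch of a single textbook LSTM step in plain Python. It is not Google's acoustic model: the function names (`lstm_step`, `init_params`, `run_sequence`), the tiny layer sizes, and the random weights are all assumptions chosen for illustration. The point is only to show how the hidden state `h` and cell state `c` are fed back in at every audio frame, so the output for one sound depends on the sounds that came before it.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vector):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def init_params(input_size, hidden_size, seed=0):
    # Hypothetical random weights; a real model would learn these from data.
    rng = random.Random(seed)
    params = {}
    for gate in ("i", "f", "o", "g"):
        w = [[rng.uniform(-0.5, 0.5) for _ in range(input_size + hidden_size)]
             for _ in range(hidden_size)]
        b = [0.0] * hidden_size
        params[gate] = (w, b)
    return params


def lstm_step(x, h_prev, c_prev, params):
    # The feedback loop: the previous hidden state is part of the input.
    z = list(x) + list(h_prev)
    acts = {}
    for gate in ("i", "f", "o", "g"):
        w, b = params[gate]
        pre = [p + bias for p, bias in zip(matvec(w, z), b)]
        squash = math.tanh if gate == "g" else sigmoid
        acts[gate] = [squash(p) for p in pre]
    # Memory cell: the forget gate keeps old content, the input gate admits new.
    c = [f * cp + i * g
         for f, cp, i, g in zip(acts["f"], c_prev, acts["i"], acts["g"])]
    # The output gate decides how much of the cell is exposed as hidden state.
    h = [o * math.tanh(ci) for o, ci in zip(acts["o"], c)]
    return h, c


def run_sequence(frames, params, hidden_size):
    # Process frames in order, carrying h and c from one frame to the next.
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for x in frames:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


if __name__ == "__main__":
    params = init_params(input_size=3, hidden_size=4)
    frames = [[0.1, 0.2, 0.3], [0.5, -0.4, 0.0], [0.9, 0.1, -0.7]]
    for t, h in enumerate(run_sequence(frames, params, hidden_size=4)):
        print(t, [round(v, 3) for v in h])
```

Feeding the same final frame after a different first frame produces a different final output, which is the "temporal dependency" the quote describes.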

Perhaps more importantly, Google also trained its system to recognize ambient noise so that it can better filter it out, ensuring users can have their speech and commands recognized even when they’re in a noisy environment.
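The article doesn't spell out how that training works. One common way to make a recognizer robust to noise is to mix noise into clean training audio at a controlled signal-to-noise ratio (SNR), and the sketch below illustrates that general idea. The function names (`rms`, `mix_at_snr`), the sine-wave "speech", and the Gaussian "noise" are illustrative assumptions, not details from Google.

```python
import math
import random


def rms(signal):
    # Root-mean-square level of a list of audio samples.
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that speech RMS / noise RMS matches the target SNR,
    # then add it to the speech sample by sample.
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-ins for real recordings: a sine tone and Gaussian noise.
    speech = [math.sin(0.05 * t) for t in range(2000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    for snr_db in (20.0, 10.0, 0.0):
        noisy = mix_at_snr(speech, noise, snr_db)
        print(snr_db, round(rms(noisy), 3))
```

Training on several noisy copies of each utterance, at different SNR levels, exposes the model to the same words under progressively harder conditions.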

Having developed these new models, Google then had to translate them into a process that would work as close to real time as possible. The enhanced capabilities meant the networks took a little longer to predict sentences and words: initially, the recognizer was delaying its predictions by around 300 milliseconds. So Google had to train the engine to output its predictions more quickly. The result: more accurate and faster predictions that work reliably even in noisy environments.