Since 2017, Google Cloud has offered a Speech-to-Text (STT) API that third parties can use in their own services. The newest Google speech recognition models improve accuracy thanks to a “major” technology improvement and are particularly suited to building voice UIs.

The new neural sequence-to-sequence model for Google’s Speech-to-Text API improves accuracy in 23 languages and 61 of the supported locales. In addition to “out-of-box quality improvements,” there’s expanded support for different kinds of voices, noise environments, and acoustic conditions.

Google describes the change as follows:

“For the past several years, automated speech recognition (ASR) techniques have been based on separate acoustic, pronunciation, and language models. Historically, each of these three individual components was trained separately, then assembled afterwards to do speech recognition.

“The conformer models that we’re announcing today are based on a single neural network. As opposed to training three separate models that need to be subsequently brought together, this approach offers more efficient use of model parameters.”

These improvements allow for “more accurate outputs in more contexts,” with Google specifically touting how speech recognition can now be brought to more use cases. In the case of voice control UIs, “users [can] speak to these interfaces more naturally and in longer sentences.”

Two of the new models are called out by name:

  • “Latest long” is designed specifically for long-form spontaneous speech, similar to the existing “video” model.
  • “Latest short,” on the other hand, offers high quality and low latency on short utterances like commands or phrases.
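For developers wondering how that choice might look in practice, here is a minimal Python sketch that picks between the two models and builds a request body shaped like the one the Speech-to-Text REST API accepts. The identifiers `latest_long` and `latest_short` are assumed to be the API spellings of the names above. The 10-second cutoff, the function names, and the bucket URI are all hypothetical and chosen only for illustration. Nothing here calls Google’s servers.

```python
import json

# Assumed API spellings of the "Latest short" / "Latest long" model names.
LATEST_SHORT = "latest_short"
LATEST_LONG = "latest_long"

# Hypothetical cutoff between a "short utterance" and "long-form speech".
# This number is an illustration only, not a documented threshold.
SHORT_UTTERANCE_MAX_SECONDS = 10.0


def pick_model(expected_seconds: float) -> str:
    """Return the model name suited to the expected audio length."""
    if expected_seconds <= SHORT_UTTERANCE_MAX_SECONDS:
        return LATEST_SHORT
    return LATEST_LONG


def build_recognize_request(audio_uri: str,
                            expected_seconds: float,
                            language_code: str = "en-US") -> dict:
    """Build a request body with a `config` and an `audio` section.

    The model is selected through the `model` field of the config.
    """
    return {
        "config": {
            "languageCode": language_code,
            "model": pick_model(expected_seconds),
        },
        "audio": {"uri": audio_uri},
    }


if __name__ == "__main__":
    # A three-second voice command, so the short model is selected.
    request = build_recognize_request("gs://example-bucket/command.wav", 3.0)
    print(json.dumps(request, indent=2))
```

A voice-control app along the lines of “Hey Spotify” would presumably stay on the short model, while something like podcast transcription would fall through to the long one.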

Spotify has been an early adopter of these new models, and worked “closely with Google” on the “Hey Spotify” voice interface found on the mobile apps and Car Thing, which we noted in our review was good at the underlying task of voice recognition and transcription:

“The basics work fine, but having a voice assistant that can’t do anything additional beyond what, say, an always-listening Google Assistant on your phone could do is a bit frustrating. It is nice, though, that Car Thing moves the mics away from your phone for better accuracy. I was never disappointed with Car Thing’s ability to hear my commands.”

About the Author

Abner Li

Editor-in-chief. Interested in the minutiae of Google and Alphabet. Tips/talk: abner@9to5g.com