Following last month’s Cloud Text-to-Speech update that added more natural voices through DeepMind WaveNet models, Google is now revamping the inverse of that API. Cloud Speech-to-Text is today gaining its “largest overhaul” for business clients since it launched in 2016.
Going after more business users, Speech-to-Text is adding new video and phone call transcription models that are specifically tuned for uses like call centers. In the latter case, which Google has previously touted, the API could support 2-4 speakers and account for background noise like phone line static and hold music.
Meanwhile, another possible use for the Google Cloud service is transcribing the TV broadcast of a basketball game where there are more than four speakers, such as hosts, player interviews, and ads, while accounting for crowd cheers, sound effects, and other game noises. With audio running from four minutes in the phone call example to over two hours in the broadcast, Google is demonstrating how adaptable Speech-to-Text is.
Developers can choose one of these tuned models instead of relying on automatic model selection. The tailoring came after customers asked Google to use real data to train the models. The enhanced phone_call model lets customers volunteer training data in exchange for access to these improvements. Trained on that real data, the new model has 54% fewer errors than the basic phone_call model. As Google puts it:
Most major cloud providers use speech data from incoming requests to improve their products. Here at Google Cloud, we’ve avoided this practice, but customers routinely request that we use real data that’s representative of theirs, to improve our models. We want to meet this need, while being thoughtful about privacy and adhering to our data protection policies. That’s why today, we’re putting forth one of the industry’s first opt-in programs for data logging, and introducing a first model based on this data.
Additionally, a new video model uses machine learning similar to that behind YouTube captioning, with a 64% reduction in errors compared to the standard model.
Meanwhile, Google is adding a beta feature that automatically punctuates long-form speech transcriptions, suggesting commas, question marks, and periods. Lastly, the company will allow users to tag transcribed audio or video in order to tell Google which models Speech-to-Text should prioritize next.
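For developers curious how these options might look in practice, here is a minimal sketch of a request body that picks a model explicitly and turns on automatic punctuation. It only builds the JSON and does not call the service. The field names (`model`, `useEnhanced`, `enableAutomaticPunctuation`, `languageCode`) reflect the REST `RecognitionConfig` as we understand it, and may differ by API version. The helper name `build_recognize_request` and the bucket paths are hypothetical and exist only for illustration.

```python
import json


def build_recognize_request(audio_uri, model="phone_call", use_enhanced=False,
                            punctuate=False, language_code="en-US"):
    """Build a JSON-serializable body for a Speech-to-Text recognize call.

    Hypothetical helper: field names are assumed from the REST
    RecognitionConfig and should be checked against the current docs.
    """
    config = {
        "languageCode": language_code,
        "model": model,                           # e.g. "phone_call" or "video"
        "useEnhanced": use_enhanced,              # request the enhanced variant
        "enableAutomaticPunctuation": punctuate,  # beta punctuation feature
    }
    return {"config": config, "audio": {"uri": audio_uri}}


# Call-center audio: enhanced phone_call model (bucket path is made up).
call_center = build_recognize_request(
    "gs://example-bucket/support-call.wav", model="phone_call", use_enhanced=True
)

# Long broadcast: video model with automatic punctuation (path is made up).
broadcast = build_recognize_request(
    "gs://example-bucket/game.flac", model="video", punctuate=True
)

print(json.dumps(call_center, indent=2))
print(json.dumps(broadcast, indent=2))
```

Naming the model in the request is what replaces automatic model selection. A two-hour recording like the broadcast example would likely need the asynchronous long-running endpoint rather than a synchronous call, which this sketch does not cover.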