Date: 2022-12-14

As part of December’s Pixel Feature Drop, Google’s excellent Recorder app gained Speaker Labels that can identify multiple people. As with previous editions, the team behind it is out with an explanation of how the feature came to be.

Speaker Labels are powered by Turn-to-Diarize, Google’s new speaker diarization system. There are three main components to it that “run fully on the device”:

Speaker turn detection model that detects a change of speaker in the input speech

Speaker encoder model that extracts voice characteristics from each speaker turn

Multi-stage clustering algorithm that annotates speaker labels to each speaker turn in a highly efficient way
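To make the three stages concrete, here is a minimal Python sketch of how such a pipeline could fit together. It is not Google’s implementation: the function names, the thresholds, and the use of plain feature vectors in place of real audio are all assumptions for illustration. Frame averaging stands in for the speaker encoder, and a simple greedy clustering stands in for the multi-stage clustering algorithm.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_turns(frames, change_threshold=0.5):
    """Stage 1 stand-in: open a new turn when consecutive frames differ sharply.

    The threshold is a hypothetical value, not one taken from the article.
    """
    turns = [[frames[0]]]
    for prev, cur in zip(frames, frames[1:]):
        if cosine(prev, cur) < change_threshold:
            turns.append([])
        turns[-1].append(cur)
    return turns


def encode_turn(turn):
    """Stage 2 stand-in: average the frames of a turn into one embedding."""
    return [sum(column) / len(turn) for column in zip(*turn)]


def cluster_turns(embeddings, same_speaker_threshold=0.8):
    """Stage 3 stand-in: greedy clustering against running centroids.

    A simplification swapped in for the multi-stage clustering the article
    mentions; each centroid is stored as (sum_vector, count).
    """
    centroids = []
    labels = []
    for emb in embeddings:
        best, best_sim = None, same_speaker_threshold
        for idx, (total, count) in enumerate(centroids):
            sim = cosine(emb, [t / count for t in total])
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            # No existing speaker is similar enough: start a new one.
            centroids.append((list(emb), 1))
            best = len(centroids) - 1
        else:
            total, count = centroids[best]
            centroids[best] = ([t + e for t, e in zip(total, emb)], count + 1)
        labels.append(best)
    return labels


def diarize(frames):
    """Run all three stages; return (speaker label, frames in turn) per turn."""
    if not frames:
        return []
    turns = detect_turns(frames)
    embeddings = [encode_turn(turn) for turn in turns]
    labels = cluster_turns(embeddings)
    return [(f"Speaker {label + 1}", len(turn)) for label, turn in zip(labels, turns)]
```

Feeding it five toy frames where the “voice” flips from one direction to another and back yields three turns, with the first and last attributed to the same speaker.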

“Our speaker diarization system leverages several highly optimized machine learning models and algorithms to allow diarizing hours of audio in a real-time streaming fashion with limited computational resources on mobile devices.”

Google notes that audio recordings from the Recorder app can be “as long as up to 18 hours,” and that more audio means greater “confidence on predicted speaker labels.” As such, Recorder will “occasionally make corrections to previously predicted low-confidence speaker labels,” while users can manually make edits and split the transcript.
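The correction behavior can be pictured with a small Python sketch. It assumes, hypothetically, that each turn stores an embedding, a label, and a confidence score, and that speaker centroids improve as more audio arrives; none of these names or thresholds come from Google’s description.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relabel_low_confidence(turns, centroids, min_confidence=0.9):
    """Re-check turns whose stored confidence is below min_confidence.

    turns: list of dicts with 'embedding', 'label', 'confidence' (hypothetical layout).
    centroids: dict mapping speaker label -> current centroid vector.
    Returns the indices of turns whose label changed.
    """
    if not centroids:
        return []
    corrected = []
    for index, turn in enumerate(turns):
        if turn["confidence"] >= min_confidence:
            continue  # confident labels are left untouched
        best_label, best_sim = max(
            ((label, cosine(turn["embedding"], centroid))
             for label, centroid in centroids.items()),
            key=lambda pair: pair[1],
        )
        if best_label != turn["label"]:
            corrected.append(index)
        turn["label"] = best_label
        turn["confidence"] = best_sim
    return corrected
```

In this sketch only low-confidence turns are revisited, which mirrors the idea that earlier guesses get revised once the system has heard enough of each voice.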

The current system mostly runs on Tensor’s CPU, with both the first generation and G2 supported across the Pixel 6, 6 Pro, 6a, 7, and 7 Pro. For the future, Google is “working on delegating more computations to the TPU block, which will further reduce the overall power consumption of the diarization system.” At the moment, Recorder 4.2 contains warning text about how Speaker Labels will not work if your “Device is too hot.”

“Another future work direction is to leverage multilingual capabilities of speaker encoder and speech recognition models to expand this feature to more languages.”

