Google calls “Look and Talk” — which is very amusingly codenamed “Blue Steel” after Zoolander — the “first multimodal, on-device Assistant feature that simultaneously analyzes audio, video, and text to determine when you are speaking to your Nest Hub Max.”
There are three processing phases to every interaction, with Assistant looking for signals like Proximity, Face Match, Head Orientation, Gaze Direction, Lip Movement, Voice Match, Contextual Awareness, and Intent Classification. There are over 100 signals from the camera and microphone in total, with all processing occurring locally on-device.
Often, a strong interaction signal does not arrive until well after the user has started speaking, which can add hundreds of milliseconds of latency, and existing models for intent understanding make this worse because they require complete, not partial, queries. To bridge this gap, Look and Talk forgoes streaming audio to the server entirely, running transcription and intent understanding on-device.
It starts with the Nest Hub Max determining whether a “user is demonstrating an intent to engage with [Assistant].” They have to be within five feet and be recognized by Face Match, with Google taking care to ignore brief glances at the device.
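The phase-one gating described above can be sketched as a simple dwell-time check: only count a user as engaged once the proximity and Face Match conditions hold for a sustained run of frames, so brief glances never trigger the Assistant. This is a minimal illustration, not Google's implementation; the class name, frame counts, and thresholds are all assumptions.

```python
class EngagementGate:
    """Hypothetical sketch of phase-one gating: a user must be within
    range and recognized by face matching for a sustained number of
    frames before being treated as engaged (brief glances are ignored)."""

    MAX_DISTANCE_FT = 5.0  # per the article: users must be within five feet

    def __init__(self, dwell_frames=15):
        # dwell_frames is an invented parameter: how many consecutive
        # frames of engagement are required before triggering.
        self.dwell_frames = dwell_frames
        self.streak = 0

    def update(self, distance_ft, face_matched):
        """Feed one camera frame's signals; returns True once engaged."""
        if distance_ft <= self.MAX_DISTANCE_FT and face_matched:
            self.streak += 1
        else:
            # Any break (user too far, or not recognized) resets the streak.
            self.streak = 0
        return self.streak >= self.dwell_frames
```

A single frame of a recognized nearby face is not enough; the gate opens only after the condition persists, which is one cheap way to filter out momentary glances at the device.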
For an enrolled user within range, a custom eye gaze model determines whether they are looking at the device. This model estimates both the gaze angle and a binary gaze-on-camera confidence from image frames using a multi-tower convolutional neural network architecture, with one tower processing the whole face and another processing patches around the eyes. Since the device screen covers a region underneath the camera that would be natural for a user to look at, the gaze angle and binary gaze-on-camera prediction are mapped to the device screen area. To make the final prediction resilient to involuntary eye blinks and saccades, a smoothing function is applied to the individual frame-based predictions to remove spurious results.
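The smoothing step above can be illustrated with a sliding-window vote over per-frame gaze predictions: a single-frame dropout (a blink or saccade) is outvoted by its neighbors. The window size and threshold here are invented for illustration; Google does not publish the exact smoothing function.

```python
from collections import deque

def smooth_gaze(frame_preds, window=9, threshold=0.6):
    """Illustrative smoothing of per-frame gaze-on-screen predictions.

    frame_preds: iterable of 0/1 values (1 = gaze on screen this frame).
    Emits 1 only when at least `threshold` of the frames in the trailing
    window agree, so isolated glitches are suppressed.
    """
    buf = deque(maxlen=window)
    smoothed = []
    for p in frame_preds:
        buf.append(p)
        # Majority-style vote over the trailing window of frames.
        smoothed.append(1 if sum(buf) / len(buf) >= threshold else 0)
    return smoothed
```

With this kind of filter, a one-frame blink inside a steady gaze run never flips the output, while a genuine look-away (many consecutive off-screen frames) still registers.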
Phase two has the Hub Max start listening, verify Voice Match, and predict “whether the user’s utterance was intended to be an Assistant query.”
This has two parts: 1) a model that analyzes non-lexical information in the audio (e.g., pitch, speed, and hesitation sounds) to determine whether the utterance sounds like an Assistant query, and 2) a text analysis model that determines whether the transcript is an Assistant request. Together, these filter out queries not intended for Assistant. Contextual visual signals also factor into the likelihood that the interaction was intended for Assistant.
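One simple way to picture how those signals could be combined is a weighted late fusion of the three scores: the prosody-based acoustic score, the transcript-intent score, and the contextual visual score. The weights and threshold below are invented for illustration; the actual fusion Google uses is not public.

```python
def is_assistant_query(acoustic_score, text_score, visual_score,
                       weights=(0.4, 0.4, 0.2), threshold=0.5):
    """Hypothetical late-fusion sketch: combine a prosody-based score,
    a transcript-intent score, and a contextual visual score (each in
    [0, 1]) into a single accept/reject decision for the utterance."""
    scores = (acoustic_score, text_score, visual_score)
    fused = sum(w * s for w, s in zip(weights, scores))
    return fused >= threshold
```

An utterance that sounds like a query, reads like a query, and happens while the user faces the device scores high on all three; speech aimed at another person in the room should score low and be dropped before fulfillment.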
The third phase, after the first two are met, is fulfillment “where it communicates with the Assistant server to obtain a response to the user’s intent and query text.” This feature involved a wide range of testing:
We developed a diverse dataset with over 3,000 participants to test the feature across demographic subgroups. Modeling improvements driven by diversity in our training data improved performance for all subgroups.
For Google, “Look and Talk represents a significant step toward making user engagement with Google Assistant as natural as possible.” Quick Phrases — where predefined commands (e.g. set an alarm or turn lights on/off) don’t require the hotword — are coming next to the Nest Hub Max.
More on Nest Hub Max:
- Nest Hub Max adds Bluetooth settings menu with Fuchsia update
- At a Glance on Pixel prepares to show timers from Nest Hub and Assistant speakers
- Spotify on Nest Hub can now show real-time lyrics
- How to manage Google Nest Hub alarms through the Home app