Google’s AI can translate languages it’s never learned, lip-read better than people

Ben Lovejoy | Nov 23 2016 - 5:33 am PT

A couple of Google announcements today highlight the astonishing progress being made in artificial intelligence. A Google Research blog post explains how the company’s switch to neural machine translation for Google Translate means that the machine can translate between language pairs it has never explicitly learned, while a DeepMind project shows that AI can lip-read better than people.

The company said that Google Translate no longer has individual systems for each language pair, but instead uses a single system, with a token added to the start of each input sentence to indicate the target language. The AI learns from millions of examples, and it was this that made the team wonder whether it could translate between two languages without specifically being taught how to do so …


This inspired us to ask the following question: Can we translate between a language pair which the system has never seen before? An example of this would be translations between Korean and Japanese where Korean⇄Japanese examples were not shown to the system. Impressively, the answer is yes — it can generate reasonable Korean⇄Japanese translations, even though it has never been taught to do so. We call this “zero-shot” translation.
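
To make the idea concrete, here is a minimal sketch of how a single shared system might be told which language to produce, and what makes a request "zero-shot". It is an illustration only, not Google's code: the function names, the `<2ja>`-style token format and the list of trained pairs are all assumptions made for the example.

```python
# Hypothetical sketch: one shared model, steered by a token naming the output language.
# The token format and the trained pairs below are illustrative assumptions.

def tag_source(sentence: str, target_lang: str) -> str:
    """Prepend an artificial token (e.g. <2ja>) asking for output in target_lang."""
    return f"<2{target_lang}> {sentence}"

# Language pairs for which direct example translations were shown during training.
TRAINED_PAIRS = {("en", "ja"), ("ja", "en"), ("en", "ko"), ("ko", "en")}

def is_zero_shot(source_lang: str, target_lang: str) -> bool:
    """A request is zero-shot if the model never saw direct examples of that pair."""
    return (source_lang, target_lang) not in TRAINED_PAIRS

print(tag_source("Hello, how are you?", "ja"))
print(is_zero_shot("ko", "ja"))  # Korean to Japanese was never shown directly
```

In this toy setup, Korean to Japanese counts as zero-shot because only the pairs involving English appear in the training list, yet the same tagged-input mechanism is used to request it.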

One of the more mind-boggling realities of AI is that it effectively operates on a ‘black box’ basis, where the programmers don’t actually know how the system does what it does. Google said that it had to figure out a way to determine exactly how the AI was pulling off this trick.

The success of the zero-shot translation raises another important question: Is the system learning a common representation in which sentences with the same meaning are represented in similar ways regardless of language — i.e. an “interlingua”? Using a 3-dimensional representation of internal network data, we were able to take a peek into the system as it translates a set of sentences between all possible pairs of the Japanese, Korean, and English languages […]

[Analysis showed that] the network must be encoding something about the semantics of the sentence rather than simply memorizing phrase-to-phrase translations. We interpret this as a sign of existence of an interlingua in the network.

You can read more about it in the blog post.

The lip-reading project was reported by New Scientist. A professional lip-reader attempting to decipher 200 randomly-selected TV clips achieved a success rate of just 12.4%, while Google’s DeepMind system managed 46.8%.

The AI system was trained using some 5000 hours from six different TV programmes, including Newsnight, BBC Breakfast and Question Time. In total, the videos contained 118,000 sentences […]

By only looking at each speaker’s lips, the system accurately deciphered entire phrases, with examples including “We know there will be hundreds of journalists here as well” and “According to the latest figures from the Office of National Statistics”.

The system was even able to cope when the audio and video were out of sync.

A computer system was taught the correct links between sounds and mouth shapes. Using this information, the system figured out how much the feeds were out of sync when they didn’t match up, and realigned them.
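
The realignment step can be pictured with a toy example. The sketch below is not DeepMind's method: it swaps in a simple brute-force search that tries a range of frame shifts and keeps the one where the mouth shapes expected from the audio agree most often with the mouth shapes seen in the video. The function name and the mouth-shape labels are hypothetical.

```python
# Toy illustration only: brute-force search for the frame shift that best lines up
# two label sequences. The labels and function name are hypothetical assumptions.

def best_offset(audio_shapes, video_shapes, max_shift):
    """Return the shift (in frames) of video relative to audio with the most matches."""
    def matches(shift):
        # Compare audio frame i with video frame i + shift wherever both exist.
        return sum(
            1
            for i, shape in enumerate(audio_shapes)
            if 0 <= i + shift < len(video_shapes) and video_shapes[i + shift] == shape
        )
    return max(range(-max_shift, max_shift + 1), key=matches)

# Mouth shapes implied by the audio, one label per frame.
audio = ["closed", "open", "round", "open", "closed", "wide"]
# The same shapes in the video, arriving two frames late.
video = ["idle", "idle"] + audio

shift = best_offset(audio, video, max_shift=3)
realigned = video[shift:]  # drop the leading frames so the feeds line up again
print(shift, realigned == audio)
```

A real system would compare learned audio and visual features rather than exact labels, but the principle of scoring candidate offsets and picking the best one is the same.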

One of the potential uses described by the team was silent dictation to Siri in noisy environments, with the iPhone reading your lips via its camera rather than listening to your voice.