For millions who can’t hear, lip reading offers a window into conversations that would be lost without it. But the practice is hard—and the results are often inaccurate (as you can see in these Bad Lip Reading videos). Now, researchers are reporting a new artificial intelligence (AI) program that outperformed professional lip readers and the best AI to date, with just half the error rate of the previous best algorithm.
If perfected and integrated into smart devices, the approach could put lip reading in the palm of everyone’s hand. “It’s a fantastic piece of work,” says Helen Bear, a computer scientist at Queen Mary University of London who was not involved with the project.

Writing computer code that can read lips is maddeningly difficult. So in the new study, scientists turned to a form of AI called machine learning, in which computers learn from data. They fed their system thousands of hours of video along with transcripts and let the computer work out the task for itself.
The researchers started with 140,000 hours of YouTube videos of people talking in diverse situations. They designed a program that created clips a few seconds long, annotating the mouth movement for each phoneme, or word sound. The program filtered out non-English speech, nonspeaking faces, low-quality video, and video that wasn’t shot head-on. Finally, they cropped the videos around the mouth.
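To make the filtering step concrete, here is a minimal sketch of how such a pipeline might look. The clip fields, thresholds, and function names are all hypothetical illustrations, not the researchers' actual code or criteria.

```python
# Hypothetical sketch of the filtering stage described above.
# Field names and thresholds are illustrative assumptions only.
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    language: str           # detected language of the audio track
    face_is_speaking: bool  # does the visible face match the audio?
    quality: float          # assumed quality score in [0, 1]
    yaw_degrees: float      # head rotation; 0 means facing the camera


def keep(clip: Clip, min_quality: float = 0.5, max_yaw: float = 30.0) -> bool:
    """Apply the four filters from the article: English speech only,
    speaking faces only, adequate video quality, near-frontal view."""
    return (clip.language == "en"
            and clip.face_is_speaking
            and clip.quality >= min_quality
            and abs(clip.yaw_degrees) <= max_yaw)


def filter_clips(clips: List[Clip]) -> List[Clip]:
    """Discard clips that fail any of the filters."""
    return [c for c in clips if keep(c)]
```

For example, `filter_clips([Clip("en", True, 0.9, 5.0), Clip("fr", True, 0.9, 5.0)])` would keep only the first clip. The real system would, of course, run face detection and language identification on raw video rather than on precomputed metadata.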
That yielded nearly 4000 hours of footage, including more than 127,000 English words. The process and the resulting data set—seven times larger than anything of its kind—are “important and valuable” for anyone else who wants to train similar systems to read lips, says Hassan Akbari, a computer scientist at Columbia University who was not involved in the research.