Google reveals AI tricks behind new augmented reality animations

The animated masks, glasses, and hats that apps like YouTube Stories overlay atop faces are pretty nifty, but how on earth do they look so realistic? Well, thanks to a deep dive published this morning by Google’s AI research division, it’s less of a mystery than before. In it, engineers at the Mountain View company describe the AI tech at the core of Stories and ARCore’s Augmented Faces API, which they say can simulate light reflections, model face occlusions, model specular reflection, and more in real time with a single camera.

“One of the key challenges in making these AR features possible is proper anchoring of the virtual content to the real world,” Google AI’s Artsiom Ablavatski and Ivan Grishchenko explain, “a process that requires a unique set of perceptive technologies able to track the highly dynamic surface geometry across every smile, frown or smirk.”

Google’s augmented reality (AR) pipeline, which taps TensorFlow Lite — a lightweight, mobile and embedded implementation of Google’s TensorFlow machine learning framework — for hardware-accelerated processing where available, comprises two neural networks (i.e., layers of math functions modeled after biological neurons). The first — a detector — operates on camera data and computes face locations, while the second — a 3D mesh model — uses that location data to predict surface geometry.
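To make the two-stage design concrete, here is a minimal sketch of how a detector-plus-mesh pipeline could be wired up with the TensorFlow Lite interpreter in Python. The model filenames, input resolutions, and output tensor layouts below are assumptions for illustration; Google’s actual models and formats aren’t spelled out in the post.

```python
# Minimal sketch of the two-stage pipeline: a face detector that locates the
# face in the full camera frame, then a mesh model that predicts 3D surface
# geometry from the cropped face region, both run with the TFLite interpreter.
# Model filenames, input sizes, and output layouts are assumptions.
import numpy as np
import tensorflow as tf


def run_tflite(interpreter, image):
    """Feed a float32 image batch to a TFLite interpreter and return all outputs."""
    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(inp["index"], image)
    interpreter.invoke()
    return [interpreter.get_tensor(o["index"])
            for o in interpreter.get_output_details()]


# Stage 1: the detector operates on the whole (downscaled) camera frame.
detector = tf.lite.Interpreter(model_path="face_detector.tflite")  # hypothetical file
detector.allocate_tensors()

# Stage 2: the mesh model sees only the cropped face and predicts geometry.
mesh_net = tf.lite.Interpreter(model_path="face_mesh.tflite")      # hypothetical file
mesh_net.allocate_tensors()


def predict_face_mesh(frame):
    """frame: HxWx3 uint8 camera image -> (Nx3 mesh points, face-present score)."""
    h, w, _ = frame.shape

    # Detector input: normalized, resized copy of the full frame (size assumed).
    small = tf.image.resize(frame[None].astype(np.float32) / 255.0, (128, 128))
    box = run_tflite(detector, small.numpy())[0]
    # Assumed output layout: normalized [ymin, xmin, ymax, xmax].
    y0, x0, y1, x1 = (box.flatten()[:4] * [h, w, h, w]).astype(int)

    # Mesh input: the cropped face region, resized to the model's resolution (assumed).
    crop = tf.image.resize(
        frame[None, y0:y1, x0:x1].astype(np.float32) / 255.0, (256, 256))
    mesh, score = run_tflite(mesh_net, crop.numpy())  # assumed: two outputs
    return mesh.reshape(-1, 3), float(np.ravel(score)[0])
```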

Why the two-model approach? Two reasons, Ablavatski and Grishchenko say: it “drastically reduces” the need to augment the dataset with synthetic data, and it allows the AI system to dedicate most of its capacity toward accurately predicting mesh coordinates. “[Both of these are] critical to achieve proper anchoring of the virtual content,” Ablavatski and Grishchenko say.

The next step entails applying the mesh network to a single frame of camera footage at a time, using a smoothing technique that minimizes lag and noise. The mesh model, trained on labeled real-world data, takes cropped video frames as input and outputs both 3D point positions and the probability that a face is present and “reasonably aligned” in-frame.
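As an illustration of the per-frame idea, the sketch below applies a simple exponential moving average to the predicted 3D points and resets the track when the face-presence probability drops. Google’s post doesn’t specify its smoothing filter, so the EMA and its `alpha` parameter are stand-ins, not the actual technique.

```python
# Illustrative per-frame smoother: an exponential moving average over the
# predicted 3D points damps frame-to-frame jitter, and the tracker resets
# whenever the face-presence probability falls below a threshold so stale
# geometry isn't carried over. The EMA and its alpha value are assumptions.
from typing import Optional

import numpy as np


class MeshSmoother:
    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha   # higher alpha -> follow new predictions more quickly
        self.state = None    # last smoothed mesh, shape (num_points, 3)

    def update(self, mesh: np.ndarray, face_score: float,
               min_score: float = 0.5) -> Optional[np.ndarray]:
        """Blend a new per-frame prediction into the running estimate."""
        if face_score < min_score:
            self.state = None           # face lost: drop the track
            return None
        if self.state is None:
            self.state = mesh.copy()    # first detection: initialize directly
        else:
            self.state = self.alpha * mesh + (1.0 - self.alpha) * self.state
        return self.state
```

In a loop over camera frames, each prediction from the earlier two-stage sketch would be passed through `MeshSmoother.update` before any virtual content is anchored to the mesh.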

Recent performance and accuracy improvements to the AR pipeline come courtesy of the latest TensorFlow Lite, which Ablavatski and Grishchenko say boosts performance while “significantly” lowering power consumption. They’re also the result of a workflow that iteratively bootstraps and refines the mesh model’s predictions, making it easier for the team to tackle challenging cases (such as grimaces and oblique angles) and artifacts (like camera imperfections and extreme lighting conditions).
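On the hardware-acceleration point, TensorFlow Lite can run a model through a GPU delegate where one is available and fall back to multithreaded CPU execution otherwise. A hedged sketch of that pattern follows; the delegate library name is platform-dependent and assumed here, and on Android the delegate is normally configured through the Java/Kotlin API instead.

```python
# Hedged sketch: try to load the TFLite GPU delegate, fall back to CPU threads.
# The delegate library name below is an assumption for illustration.
import tensorflow as tf


def load_interpreter(model_path):
    try:
        gpu = tf.lite.experimental.load_delegate(
            "libtensorflowlite_gpu_delegate.so")          # assumed library name
        interpreter = tf.lite.Interpreter(
            model_path=model_path, experimental_delegates=[gpu])
    except (ValueError, RuntimeError, OSError):
        # No usable GPU delegate on this device: run on CPU with several threads.
        interpreter = tf.lite.Interpreter(model_path=model_path, num_threads=4)
    interpreter.allocate_tensors()
    return interpreter
```

Either constructor in the earlier two-stage sketch could be swapped for `load_interpreter` to pick up acceleration where the device supports it.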

Interestingly, the pipeline doesn’t rely on just one or two models — instead, it comprises a “variety” of architectures designed to support a range of devices. “Lighter” networks — those which require less memory and processing power — necessarily use lower-resolution input data (128 x 128), while the most mathematically complex model bumps the resolution up to 256 x 256.
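A rough sketch of what serving multiple variants might look like in practice: lighter models paired with 128 x 128 input, the heaviest with 256 x 256, chosen by a crude device-capability check. The variant names, filenames, and selection heuristic are assumptions, not Google’s actual deployment logic.

```python
# Sketch of picking a mesh-model variant per device; all names and the
# capability heuristic are hypothetical.
MODEL_VARIANTS = {
    # name: (model file, square input resolution in pixels)
    "lite": ("face_mesh_lite.tflite", 128),   # hypothetical file
    "full": ("face_mesh_full.tflite", 256),   # hypothetical file
}


def pick_variant(has_gpu, ram_gb):
    """Crude capability check: better-equipped devices get the full-resolution model."""
    if has_gpu and ram_gb >= 4:
        return MODEL_VARIANTS["full"]
    return MODEL_VARIANTS["lite"]
```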

According to Ablavatski and Grishchenko, the fastest “full mesh” model achieves an inference time of less than 10 milliseconds on the Google Pixel 3 (using the graphics chip), while the lightest cuts that down to three milliseconds per frame. They’re a bit slower on Apple’s iPhone X, but only by a hair: the lightest model performs inference in about 4 milliseconds (using the GPU), while the full mesh takes 14 milliseconds.
