Unlike other smartphone cameras that feature a Portrait Mode, the Pixel line gets by with only one rear camera. With the Pixel 3, Google turned to machine learning to improve depth estimation and “produce even better Portrait Mode results.”
With the Pixel 2, Google was able to calculate depth within an image with a single camera by using dual-pixel autofocus, or Phase-Detection Autofocus (PDAF) pixels. At a high level, a neural network determines “what pixels correspond to people versus the background.”
PDAF pixels capture two slightly different views of a scene, and the system looks for horizontal parallax, the apparent shift of the background between those views:
Because parallax is a function of the point’s distance from the camera and the distance between the two viewpoints, we can estimate depth by matching each point in one view with its corresponding point in the other view.
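To make the matching idea concrete, here is a minimal sketch in Python. It is not Google's implementation: the function names, the window size, and the sample numbers are all assumptions chosen for illustration. It finds the horizontal shift that best aligns a small patch between two one-dimensional "views", then applies the classic two-view stereo relation (depth = focal length × baseline / disparity).

```python
def best_disparity(left, right, x, half_window=2, max_shift=4):
    """Return the horizontal shift that minimises the sum of absolute
    differences between the patch around left[x] and right[x + shift]."""
    best_shift, best_cost = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        cost = 0.0
        for k in range(-half_window, half_window + 1):
            i, j = x + k, x + k + shift
            if not (0 <= i < len(left) and 0 <= j < len(right)):
                cost = float("inf")  # patch falls outside the row; skip this shift
                break
            cost += abs(left[i] - right[j])
        if cost < best_cost:
            best_shift, best_cost = shift, cost
    return best_shift


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Textbook stereo relation: depth = focal length * baseline / disparity."""
    if disparity_px == 0:
        return float("inf")  # no measurable shift: treat the point as very far away
    return focal_px * baseline_m / abs(disparity_px)


# Hypothetical rows of pixel intensities: the same bump, shifted by two pixels.
left_view = [0, 0, 0, 0, 5, 9, 5, 0, 0, 0, 0, 0]
right_view = [0, 0, 0, 0, 0, 0, 5, 9, 5, 0, 0, 0]

shift = best_disparity(left_view, right_view, x=5)
# Illustrative focal length and baseline, not Pixel specifications.
depth_m = depth_from_disparity(shift, focal_px=1000, baseline_m=0.001)
```

With a baseline this small, a one-pixel matching error changes the computed depth dramatically, which is the weakness the next paragraph describes.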
However, the shift between the two views is so slight that matching is difficult, which results in depth estimation errors and “unpleasant artifacts.”
With the Pixel 3, Google looked for other visual cues in an image, then used machine learning to train an algorithm that combines them into a depth estimate.
For example, points that are far away from the in-focus plane appear less sharp than ones that are closer, giving us a defocus depth cue.
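As a rough illustration of the defocus cue, the sketch below measures how sharp an edge is before and after blurring. This is only an assumption-laden toy: on the Pixel 3 the network learns such cues from data rather than computing them by hand, and the helper names `box_blur` and `sharpness` are hypothetical.

```python
def box_blur(signal, radius):
    """Average each sample with its neighbours to simulate an out-of-focus edge."""
    blurred = []
    for i in range(len(signal)):
        lo, hi = max(0, i - radius), min(len(signal), i + radius + 1)
        window = signal[lo:hi]
        blurred.append(sum(window) / len(window))
    return blurred


def sharpness(signal):
    """Largest jump between neighbouring samples: high for crisp edges."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))


in_focus_edge = [0.0] * 5 + [1.0] * 5          # a crisp step edge
out_of_focus_edge = box_blur(in_focus_edge, 1)  # the same edge, defocused

crisp_score = sharpness(in_focus_edge)
soft_score = sharpness(out_of_focus_edge)
```

The blurred edge scores lower than the crisp one, which is the kind of signal a depth estimator can use to judge distance from the in-focus plane.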
In addition, even when viewing an image on a flat screen, we can accurately tell how far things are because we know the rough size of everyday objects (e.g. one can use the number of pixels in a photograph of a person’s face to estimate how far away it is). This is called a semantic cue.
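The face-size example can be written down with the pinhole camera model. The sketch below assumes a typical face width of about 0.15 m and a focal length of 3000 pixels; both values are made up for illustration, and the function name is hypothetical.

```python
def distance_from_apparent_size(real_size_m, size_px, focal_px):
    """Pinhole camera model: distance = focal length * real size / size in pixels."""
    if size_px <= 0:
        raise ValueError("size_px must be positive")
    return focal_px * real_size_m / size_px


FACE_WIDTH_M = 0.15  # assumed typical face width
FOCAL_PX = 3000      # assumed focal length expressed in pixels

near = distance_from_apparent_size(FACE_WIDTH_M, size_px=300, focal_px=FOCAL_PX)
far = distance_from_apparent_size(FACE_WIDTH_M, size_px=150, focal_px=FOCAL_PX)
```

Halving the face's width in pixels doubles the estimated distance, here from roughly 1.5 m to roughly 3 m.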
Training data was collected with a “Frankenphone” rig made up of five Pixel 3 phones programmed over Wi-Fi to capture an image simultaneously. High-quality depth is then computed by using structure from motion and multi-view stereo.
Specifically, we train a convolutional neural network, written in TensorFlow, that takes as input the PDAF pixels and learns to predict depth. This new and improved ML-based method of depth estimation is what powers Portrait Mode on the Pixel 3.
To ensure fast results, we use TensorFlow Lite, a cross-platform solution for running machine learning models on mobile and embedded devices, and the Pixel 3’s powerful GPU to compute depth quickly despite our abnormally large inputs. We then combine the resulting depth estimates with masks from our person segmentation neural network to produce beautiful Portrait Mode results.
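Google does not spell out here how the depth estimates and the person mask are combined, so the following is only one plausible sketch, with hypothetical names and parameters: pixels marked as part of the person stay sharp, and every other pixel gets a blur radius that grows with its distance from the in-focus depth.

```python
def blur_radius_map(depth, person_mask, focus_depth, strength=2.0, max_radius=8.0):
    """Return a per-pixel blur radius: 0 for person pixels, otherwise
    proportional to the distance from the focus depth, capped at max_radius."""
    radii = []
    for depth_row, mask_row in zip(depth, person_mask):
        row = []
        for d, is_person in zip(depth_row, mask_row):
            if is_person:
                row.append(0.0)  # the segmentation mask keeps the subject sharp
            else:
                row.append(min(max_radius, strength * abs(d - focus_depth)))
        radii.append(row)
    return radii


# Toy 2x3 "image": depth in metres and a mask where 1 marks the person.
depth = [[1.0, 1.0, 3.0],
         [1.0, 2.0, 10.0]]
person_mask = [[1, 1, 0],
               [1, 0, 0]]

radii = blur_radius_map(depth, person_mask, focus_depth=1.0)
```

The mask guards against depth errors on the subject, while the depth map lets the background blur vary with distance instead of being uniform.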