Humans are hardwired to look at each other’s faces. Even three-month-old infants prefer looking at faces when given a chance. We have a dedicated brain region for facial recognition, and a person can lose the ability to recognize faces while the rest of their visual processing works perfectly well (a condition known as prosopagnosia). We are much better at recognizing faces and emotions than virtually anything else; in 1973, Herman Chernoff even suggested using drawings of faces for multivariate data visualization.
For us humans, it makes sense to specialize in faces. We are social animals whose brains probably evolved largely for social reasons and who urgently need not only to distinguish individuals but also to recognize variations in emotion: the difference between fear and anger in a fellow primate might mean life or death. But it turns out that in artificial intelligence, problems related to human faces are also coming to the forefront of computer vision. Below, we consider some of them, discuss the current state of the art, and introduce a common solution that might advance it in the near future.
Common Face-Related Problems in Computer Vision
First, face recognition itself has obvious security-related applications, from unlocking your phone to catching criminals on CCTV cameras. Usually face recognition is an added layer of security, but as the technology progresses, it might rival fingerprints and other biometrics. Formally, it is a classification problem: choose the correct answer out of several alternatives. But there are far too many faces to treat each as a fixed class, and new people must be added on the fly. Therefore, face recognition systems usually operate by learning to extract features, i.e., mapping the picture of a face to a much smaller feature space and then performing information retrieval in that space. Feature learning is almost invariably done with deep neural networks. While modern face recognition systems achieve excellent results and are widely used in practice, this problem continues to give rise to new fundamental ideas in deep learning.
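To make the embed-then-retrieve pipeline concrete, here is a minimal sketch. The embedding function below is a stand-in (a fixed random projection of the pixel values); a real system would use a deep CNN trained with a metric-learning objective such as a triplet or ArcFace-style loss, but the enrollment and retrieval logic would look much the same.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a real face-embedding CNN: a fixed random projection
# mapping a 64x64 grayscale crop to a 128-dimensional feature vector.
PROJECTION = rng.normal(size=(128, 64 * 64))

def embed(face_crop: np.ndarray) -> np.ndarray:
    """Map a face crop to a unit-length feature vector."""
    v = PROJECTION @ face_crop.reshape(-1)
    return v / np.linalg.norm(v)

class FaceGallery:
    """In-memory gallery: enroll new people on the fly, identify by similarity."""
    def __init__(self):
        self.names, self.vectors = [], []

    def enroll(self, name: str, face_crop: np.ndarray) -> None:
        self.names.append(name)
        self.vectors.append(embed(face_crop))

    def identify(self, face_crop: np.ndarray, threshold: float = 0.5):
        sims = np.stack(self.vectors) @ embed(face_crop)  # cosine similarities
        best = int(np.argmax(sims))
        return self.names[best] if sims[best] >= threshold else None

gallery = FaceGallery()
crop = rng.random((64, 64))
gallery.enroll("alice", crop)
print(gallery.identify(crop))  # -> "alice" (cosine similarity 1.0)
```

Note that enrolling a new person only appends a vector to the gallery: no retraining is needed, which is exactly why the retrieval formulation wins over a fixed classifier.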
Emotion recognition (classifying facial expressions) is another human forte, and automating it is valuable. AI assistants can be more helpful if they recognize emotions, and a car might recognize whether the driver is about to fall asleep at the wheel (this technology is close to production). There are also numerous medical applications: emotions (or the lack thereof) are important in diagnosing Parkinson’s disease, strokes and cortical lesions, and much more. Again, emotion recognition is a classification problem, and the best results are achieved by rather standard deep learning architectures, although medical applications usually augment images with other modalities such as respiration or electrocardiograms.
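As a rough illustration of what a “rather standard architecture” means here, the sketch below fine-tunes an off-the-shelf image classifier. The seven-class taxonomy of basic emotions (anger, disgust, fear, happiness, sadness, surprise, neutral), common in FER-style datasets, is an assumption, as is the choice of backbone.

```python
import torch
import torchvision

NUM_EMOTIONS = 7  # assumed FER-style taxonomy of basic emotions

# Standard backbone with its classifier head replaced; in practice one
# would usually start from pretrained weights.
model = torchvision.models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, NUM_EMOTIONS)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One ordinary supervised step on a batch of labeled face crops."""
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```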
Gaze estimation, i.e., predicting where a person is looking, is important for smartphones, AR/VR, and various eye tracking applications such as, again, car safety. This problem does not require large networks because the input images are rather small, but results keep improving, most recently with few-shot adaptation to a specific person. The current state of gaze estimation is already sufficient to create AR/VR software fully controlled by gaze, and we expect this market to grow very rapidly.
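One way such few-shot personalization can work (a sketch under assumptions, not a description of any particular published system): keep a shared feature backbone frozen and fit only a small output head on a handful of calibration samples from the new user.

```python
import torch

# Assumed architecture: a small frozen backbone plus a linear head that
# regresses a 2D gaze direction (e.g., yaw and pitch).
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, 3, stride=2), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
)
head = torch.nn.Linear(16, 2)

def calibrate(eye_crops: torch.Tensor, gaze_targets: torch.Tensor,
              steps: int = 100) -> None:
    for p in backbone.parameters():        # keep shared features fixed
        p.requires_grad_(False)
    opt = torch.optim.SGD(head.parameters(), lr=1e-2)
    for _ in range(steps):                 # fit only the head on few samples
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(head(backbone(eye_crops)),
                                            gaze_targets)
        loss.backward()
        opt.step()

# Nine calibration samples from one user (random stand-in data).
calibrate(torch.randn(9, 1, 36, 60), torch.randn(9, 2))
```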
Segmentation, a classical computer vision problem, is important for human faces as well, mostly for video editing and similar applications. If you want to cut a person out really cleanly, say, to add a cool background in your video conferencing app, segmentation turns into background matting, a much harder problem where the mask is not binary but can take fractional, “semi-transparent” values. This matters for object boundaries, hair, glasses, and the like. Background matting has only very recently started getting satisfactory solutions, and there is a lot to be done yet.
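The underlying compositing model makes the difference clear: every pixel is a mix of foreground and background, and a binary segmentation mask is just the special case where alpha is exactly 0 or 1. The snippet below shows the easy forward direction; matting is the much harder inverse problem of recovering alpha (and the foreground) from the composite alone.

```python
import numpy as np

def composite(foreground: np.ndarray, background: np.ndarray,
              alpha: np.ndarray) -> np.ndarray:
    """Standard compositing I = alpha * F + (1 - alpha) * B.
    foreground, background: (H, W, 3) images; alpha: (H, W, 1) float
    matte in [0, 1], fractional at hair, glass, and object boundaries."""
    return alpha * foreground + (1.0 - alpha) * background
```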
Many specialized face-related problems rely on facial keypoint detection, the problem of finding characteristic points on a human face. A common keypoint scheme includes several dozen points (68 in the popular iBUG scheme) that all need to be labeled on a face. Facial keypoints can serve as the first step for tracking faces in images and video, recognizing faces and facial expressions, and numerous biometric and medical applications. State-of-the-art solutions exist based both on deep neural networks and on ensembles of classical models.
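For instance, dlib ships a classical 68-point landmark predictor (an ensemble of regression trees) that follows this same iBUG scheme. A minimal sketch of using it; the .dat file is dlib’s publicly released shape predictor model, downloaded separately:

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def facial_keypoints(image: np.ndarray) -> list[list[tuple[int, int]]]:
    """Return 68 (x, y) landmarks for every detected face in an RGB image."""
    faces = detector(image, 1)  # upsample once to catch smaller faces
    return [[(p.x, p.y) for p in predictor(image, face).parts()]
            for face in faces]
```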
The Limitations of Manually Labeled Data
Face-related problems represent an important AI frontier. Interestingly, most of them struggle with the same obstacle: lack of labeled training data. There exist datasets with millions of faces, but a face recognition system has to add a new person from just one or two photos. In many other problems, manually labeled data is challenging and costly to obtain. Imagine how much work it is to manually draw a segmentation mask for a human face, and then imagine that you have to make this mask “soft” for background matting. Facial keypoints are also notoriously difficult to label: in engineering practice, researchers even have to explicitly account for human labeling biases that vary across datasets. Lack of representative training data has also led to bias in deployed models, resulting in poor performance for certain ethnicities.
Moreover, significant changes in conditions often render existing datasets virtually useless: you might need to recognize faces from a smartphone’s infrared camera held below the chin, while existing datasets provide only frontal RGB photos. This lack of data can impose a hard limit on what AI researchers can do.
Synthetic Data Presents a Solution
Fortunately, a solution is already presenting itself: many AI models can be trained on synthetic data. If you have a CGI-based 3D human head crafted with sufficient fidelity, this head can be put in a wide variety of conditions, including different lighting, camera angles, camera modalities, backgrounds, occlusions, and much more. Even more importantly, since you control everything going on in your virtual 3D scene, you know where every pixel comes from and can get perfect labels for all of these problems for free, even hard ones like background matting. Every 3D model of a human head can give you an endless stream of perfectly labeled, highly varied data for any face-related problem—what’s not to like?
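Here is a sketch of the “free labels” idea for matting and segmentation. The assumed interface is a renderer that outputs an RGBA image of the head with an exact alpha channel; compositing it over a random background yields a training image together with pixel-perfect labels at no labeling cost.

```python
import numpy as np

rng = np.random.default_rng(42)

def synthetic_sample(rendered_rgba: np.ndarray, backgrounds: list):
    """One synthetic training sample from an (H, W, 4) rendered head.
    Because we control the scene, the matte and mask come for free."""
    rgb, alpha = rendered_rgba[..., :3], rendered_rgba[..., 3:4]
    bg = backgrounds[rng.integers(len(backgrounds))]
    image = alpha * rgb + (1.0 - alpha) * bg           # random background
    labels = {
        "matte": alpha[..., 0],                        # exact soft matte
        "mask": (alpha[..., 0] > 0.5).astype(np.uint8) # binary segmentation
    }
    return image, labels

head = rng.random((128, 128, 4))  # stand-in for an actual render
image, labels = synthetic_sample(head, [rng.random((128, 128, 3))])
```

The same principle extends to other labels: keypoint coordinates, gaze direction, or head pose are all known scene parameters rather than things a human has to annotate.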
Synthetic data appears to be a key solution, but it raises questions. First, synthetic images cannot be perfectly photorealistic, which leads to the domain shift problem: models are trained on the synthetic domain but used on real images. Second, creating a new 3D head from scratch is a lot of manual labor, and variety in synthetic data is essential, so (at least semi-)automatic generation of synthetic data will probably see much more research in the near future. However, in practice, synthetic data is already proving itself for human faces even in its most straightforward form: creating hybrid synthetic+real datasets and training standard models on this data.
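In PyTorch, for example, that straightforward hybrid recipe is literally a few lines; the random tensors below are stand-ins for actual datasets yielding (image, label) pairs in a shared format.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins: a large synthetic dataset and a small real one.
synthetic_ds = TensorDataset(torch.randn(10_000, 3, 64, 64),
                             torch.randint(0, 2, (10_000,)))
real_ds = TensorDataset(torch.randn(500, 3, 64, 64),
                        torch.randint(0, 2, (500,)))

# Concatenate and shuffle; a standard model trains on the mixture
# with no special handling of the synthetic samples.
hybrid = ConcatDataset([synthetic_ds, real_ds])
loader = DataLoader(hybrid, batch_size=64, shuffle=True)
```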
Let us summarize. Several important computer vision problems related to human faces are increasingly finding real-world applications in security, biometrics, AR/VR, video editing, car safety, and more. Most of them are far from solved, and the amount of labeled data for such problems is limited because real data is expensive. Fortunately, it appears that synthetic data is picking up the torch. Human faces may well be the next frontier for modern AI, and it looks like we are well-positioned to get there.
About the Author
Sergey I. Nikolenko is Head of AI at Synthesis AI, a San Francisco-based company specializing in the generation and use of synthetic data for modern machine learning models. Sergey is a computer scientist specializing in machine learning and the analysis of algorithms. He also serves as Head of the Artificial Intelligence Lab at the Steklov Mathematical Institute in St. Petersburg, Russia. His interests include synthetic data in machine learning; deep learning models for natural language processing, image manipulation, and computer vision; and algorithms for networking. Sergey has authored a seminal text in the field, “Synthetic Data for Deep Learning,” published by Springer.