arXiv:2006.10499v1 [cs.CV] 22 May 2020



Real-Time Monocular 4D Face Reconstruction using the LSFM models

Mohammad Rami Koujan 1

Nikolai Dochev 1

Anastasios Roussos 1,2

1 Department of Computer Science, University of Exeter

2 Department of Computer Science, Imperial College London

1 Introduction

4D face reconstruction from a single camera is a challenging task, especially when it is required to be performed in real time. We demonstrate a system of our own implementation that solves this task accurately and runs in real time on a commodity laptop, using a webcam as the only input. Our system is interactive, allowing the user to freely move their head and show various expressions while standing in front of the camera. As a result, the proposed system both reconstructs and visualises the identity of the subject in the correct pose, along with the acted facial expressions, in real time. The 4D reconstruction in our framework is based on the recently released Large-Scale Facial Models (LSFM) [2, 3], which are the largest-scale 3D Morphable Models of facial shapes ever constructed, built from a dataset of more than 10,000 facial identities covering a wide range of gender, age and ethnicity combinations. This is the first real-time demo that gives users the opportunity to test the capabilities of the LSFM models in practice.

2 Proposed Implementation

The designed application relies on three main steps to generate the 4D facial reconstruction results:
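At a high level, these steps form a per-frame loop, which can be sketched as follows. All function names here are hypothetical placeholders for the components described in the paragraphs below, not identifiers from the actual implementation.

```python
# Hypothetical per-frame pipeline sketch; each stage is a placeholder
# for the corresponding component described in this section.

def reconstruct_frame(frame, detect_landmarks, estimate_pose,
                      fit_identity_expression, render):
    """Run one frame through the three-stage 4D reconstruction pipeline."""
    landmarks = detect_landmarks(frame)               # stage 1: facial landmarks
    if landmarks is None:                             # no face found this frame
        return None
    pose = estimate_pose(landmarks)                   # stage 2: head pose
    shape = fit_identity_expression(landmarks, pose)  # stage 3: 3DMM fit
    return render(shape, pose)                        # visualise the 4D result
```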

Facial landmark detection. Once a face is detected, a set of facial landmarks is extracted from each input frame and used in later stages to estimate the 3D shape of that face. As our application runs in real time, the landmark extraction step needs to be fast. Accuracy and robustness of the landmarker are also of paramount importance, since the fitting approach relies solely on landmarks for reconstruction and the application does not require a controlled environment. To meet these requirements, we adopt the method of Kazemi et al. [7], which offers a good trade-off between these desired characteristics. When this landmarker fails, which may happen in cases of occlusion or extreme pose, we switch to the more accurate (yet more computationally expensive) landmarker of Bulat et al. [5].
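The fast/accurate fallback strategy above can be sketched as a small wrapper. The two detector functions are stand-ins for the Kazemi et al. [7] and Bulat et al. [5] landmarkers; neither name comes from the paper, and "failure" is modelled simply as not returning the expected landmark count.

```python
# Sketch of the two-tier landmarking strategy; detect_fast and
# detect_accurate are hypothetical stand-ins for the fast [7] and
# accurate [5] landmarkers mentioned in the text.

def landmarks_with_fallback(frame, detect_fast, detect_accurate, n_landmarks=68):
    """Try the fast landmarker first; fall back to the accurate one on failure."""
    pts = detect_fast(frame)
    if pts is None or len(pts) != n_landmarks:  # treat as a detection failure
        pts = detect_accurate(frame)
    return pts
```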

3D Morphable Model (3DMM). Our application makes use of 3DMM fitting for the reconstruction task [1]. As in [3], our adopted 3DMMs are derived from a linear combination of facial identity and expression variation: the identity part originates from the LSFM models (either global or bespoke) and the expression part originates from the blendshapes model of Facewarehouse [6]. Fig. 1a shows the LSFM-global model mean shape (µ) and first five principal components. The combination of LSFM and Facewarehouse was first proposed and tested in [4] and [8]. Booth et al. [4] combine LSFM and Facewarehouse to create a 3DMM covering both identity and expression, which they use along with their suggested framework for fitting 3DMMs to images and videos captured under challenging (in-the-wild) conditions. We also make use of this combination in our implementation, to further investigate its suitability for real-time 4D face reconstruction. Fig. 1c demonstrates such a combination of identity and expression models.
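The combined linear model above amounts to synthesising a shape as the mean plus identity and expression basis contributions. The sketch below uses small random matrices in place of the LSFM identity components and the Facewarehouse blendshapes; all dimensions are illustrative, not the real model sizes.

```python
import numpy as np

# Minimal sketch of the combined identity + expression 3DMM:
#   s = mean + U_id @ a + U_exp @ b
# Random bases stand in for the LSFM identity components and the
# Facewarehouse expression blendshapes (dimensions are illustrative).
rng = np.random.default_rng(0)
n_vertices = 500                              # a real model has tens of thousands
mean_shape = rng.normal(size=3 * n_vertices)  # flattened xyz per vertex
U_id = rng.normal(size=(3 * n_vertices, 5))   # identity basis (LSFM part)
U_exp = rng.normal(size=(3 * n_vertices, 2))  # expression basis (blendshapes)

def synthesize(a, b):
    """Generate a 3D face from identity coefficients a and expression coefficients b."""
    return mean_shape + U_id @ a + U_exp @ b

face = synthesize(np.array([1.0, 0.0, 0.0, 0.0, 0.0]), np.array([0.5, -0.5]))
```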

Pose, Identity, and Expression Estimation. To start with, we estimate the subject's face pose in each input frame based on an orthographic camera model. The idea is to use the extracted landmarks from the current frame, along with their correspondences on the mean 3DMM face, to estimate the pose. This boils down to a linear system of equations, which can be solved efficiently using the Singular Value Decomposition (SVD). Estimating the identity and expression amounts to minimising a linear least squares problem similar to the one addressed in [9]. To avoid any jitter effects, a temporal smoothing step is employed, using the average pose of three consecutive frames before visualising the result. This ensures that camera

Figure 1: (a) Visualisation of the shape model of LSFM-global: mean shape (µ) and first five principal components as additions and subtractions from the mean shape [3]. (b) Exemplar instances generated from LSFM with distinct demographics. (c) Combined identity and expression model with fused mean faces and variations (−3σ, +3σ) along three principal components for identity and two principal components for expression, with the first row showing only the identity part [4].

inherent noise and inaccuracies associated with the landmarks will have minimal repercussions on the final result.
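The pose step and the three-frame smoothing described above can be sketched with plain numpy: a least-squares solve for a scaled-orthographic projection, an SVD-based projection onto a scaled rotation, and a running average of pose parameters. This is a minimal sketch under a scaled-orthographic camera assumption, not the authors' exact solver, and naively averaging pose parameters is only a reasonable approximation for small inter-frame motion.

```python
import numpy as np
from collections import deque

def estimate_orthographic_pose(X, x):
    """Estimate a scaled-orthographic pose from N model points X (N,3) and
    their 2D landmarks x (N,2): solve x ≈ X @ P.T + t in least squares,
    then project the 2x3 matrix P onto a scaled rotation (orthonormal
    rows) via the SVD."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])  # (N,4): [X | 1]
    sol, *_ = np.linalg.lstsq(A, x, rcond=None)   # rows 0..2 give P.T, row 3 gives t
    P, t = sol[:3].T, sol[3]
    U, S, Vt = np.linalg.svd(P)                   # P = U @ diag(S) @ Vt[:2]
    R2 = U @ Vt[:2]                               # nearest matrix with orthonormal rows
    return S.mean(), R2, t                        # scale, 2x3 rotation part, translation

class PoseSmoother:
    """Average the pose parameters of the last three frames, as in the text."""
    def __init__(self, window=3):
        self.history = deque(maxlen=window)

    def update(self, pose_vec):
        self.history.append(np.asarray(pose_vec, dtype=float))
        return np.mean(self.history, axis=0)
```

On noiseless correspondences this recovers the pose exactly; with noisy landmarks the SVD step is what guarantees the rotation part stays valid.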

2.1 Results

As described earlier, our application produces real-time results. Reaching such instantaneous performance necessitated an efficient implementation, taking into account the intricate details to avoid any costly steps. Our application's Graphical User Interface (GUI) offers the user a number of options to begin with. In addition to being used for normal reconstruction, with or without showing the extracted landmarks and a bounding box around the detected face, it can visualise the reconstructed face with various exaggerated expressions, creating a caricature-like face. We provide the user with the ability to choose either the global or a bespoke LSFM model, among the Black (all ages), Chinese (all ages) and White ethnic groups, the last of which is further clustered into four age groups: under 7 years old (White-under 7), 7-18 years old (White-7 to 18), 18-50 years old (White-18 to 50) and over 50 years old (White-over 50) [3]. Fig. 2 presents the reconstruction results of a subject acting in front of a camera.
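The caricature mode mentioned above can be sketched as amplifying the fitted expression coefficients before synthesising the mesh. The exaggeration factor and toy model dimensions below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical sketch of the caricature mode: scale the fitted
# expression coefficients b by an exaggeration factor before synthesis.

def caricature_shape(mean_shape, U_exp, b, factor=2.5):
    """Exaggerate a reconstructed expression by scaling its blendshape
    coefficients, leaving the identity (here, the mean shape) untouched."""
    return mean_shape + U_exp @ (factor * b)
```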

Figure 2: Exemplar frames of reconstruction results of a subject showing different facial expressions in front of a camera.

References

[1] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 187–194. ACM Press/Addison-Wesley Publishing Co., 1999.

[2] James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David Dunaway. A 3D morphable model learnt from 10,000 faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5543–5552, 2016.

[3] James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou. Large scale 3D morphable models. International Journal of Computer Vision, 126(2-4):233–254, 2018.

[4] James Booth, Anastasios Roussos, Evangelos Ververas, Epameinondas Antonakos, Stylianos Ploumpis, Yannis Panagakis, and Stefanos P. Zafeiriou. 3D reconstruction of "in-the-wild" faces in images and videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[5] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In International Conference on Computer Vision, volume 1, page 4, 2017.

[6] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. FaceWarehouse: A 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, 2014.

[7] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, 2014.

[8] Mohammad Rami Koujan and Anastasios Roussos. Combining dense nonrigid structure from motion and 3D morphable models for monocular 4D face reconstruction. In Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, pages 1–9, 2018.

[9] Stefanos Zafeiriou, Grigorios Chrysos, Anastasios Roussos, Evangelos Ververas, Jiankang Deng, and George Trigeorgis. The 3D Menpo facial landmark tracking challenge. 2018.
