On the Multi-View Fitting and Construction of Dense Deformable Face Models · 2008-12-03 ·...
On the Multi-View Fitting and Construction of Dense
Deformable Face Models
Krishnan Ramnath
CMU-RI-TR-07-10
May 2007
Submitted in partial fulfillment of the requirements
for the degree of Master of Science
The Robotics Institute
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213
© 2007 by Krishnan Ramnath
Abstract
Active Appearance Models (AAMs) are generative, parametric models that have been
successfully used in the past to model deformable objects such as human faces. Fitting
an AAM to an image consists of minimizing the error between the input image and the
closest model instance; i.e. solving a nonlinear optimization problem. In this thesis we study
three important topics related to deformable face models such as AAMs: (1) multi-view
3D face model fitting, (2) multi-view 3D face model construction, and (3) automatic dense
deformable face model construction.
The original AAM formulation was 2D, but AAMs have recently been extended to include
a 3D shape model. A variety of single-view algorithms exist for fitting and constructing 3D
AAMs but one area that has not been studied is multi-view algorithms. In the first part of
this thesis we describe an algorithm for fitting a single AAM to multiple images, captured
simultaneously by cameras with arbitrary locations, rotations, and response functions. This
algorithm uses the scaled orthographic imaging model used by previous authors, and in the
process of fitting computes, or calibrates, the scaled orthographic camera matrices. We also
describe an extension of this algorithm to calibrate weak perspective (or full perspective)
camera models for each of the cameras. In essence, we use the human face as a (non-
rigid) calibration grid. We demonstrate that the performance of this algorithm is roughly
comparable to a standard algorithm using a calibration grid. We then show how camera
calibration improves the performance of AAM fitting.
A variety of non-rigid structure-from-motion algorithms, both single-view and multi-
view, have been proposed that can be used to construct the corresponding 3D non-rigid
shape models of a 2D AAM. In the second part of this thesis we show that constructing a
3D face model using non-rigid structure-from-motion suffers from the Bas-Relief ambiguity
and may result in a “scaled” (stretched/compressed) model. We outline a robust non-rigid
motion-stereo algorithm for calibrated multi-view 3D AAM construction and show how using
calibrated multi-view motion-stereo can eliminate the Bas-Relief ambiguity and yield face
models with higher 3D fidelity.
An important step in computing dense deformable face models such as 3D Morphable
Models (3DMMs) is to register the input texture maps using optical flow. However, optical
flow algorithms perform poorly on images of faces because of the appearance and disappear-
ance of structure such as teeth and wrinkles, and because of the non-Lambertian, textureless
cheek regions. In the final part of this thesis we propose a different approach to building
dense face models. Our algorithm iteratively builds a face model, fits the model to the input
image data, and then refines the model. The refinement consists of three steps: (1) the addi-
tion of more mesh points to increase the density, (2) image consistent re-triangulation of the
mesh, and (3) refinement of the shape modes. Using a carefully collected dataset containing
hidden marker ground-truth, we show that our algorithm generates dense models that are
quantitatively better than those obtained using off-the-shelf optical flow algorithms. We also
show how our algorithm can be used to construct dense deformable models automatically,
starting with a rigid planar model of the face that is subsequently refined to model the
non-planarity and the non-rigid components.
Acknowledgements
I wish to express my sincere gratitude to my advisor Dr. Simon Baker for his able
guidance and motivation throughout the course of my research work. It is his continued
support, enthusiasm and technical advice that saw me through as a student and researcher
at Carnegie Mellon University. My repeated interactions with him helped me develop both
on a technical as well as personal front. I cannot thank him enough for that.
I am very thankful to my co-advisor Dr. Iain Matthews for his insightful suggestions during
various stages of my research. I have learnt a lot in terms of research approach, programming
practices and presentation skills from him. I also cherish all the fun times I have had with
him during work and otherwise. It is indeed my pleasure to have worked with highly talented
people such as my advisors and many others here at the Robotics Institute, CMU.
My sincere thanks to Seth Koterba for providing immense help during the initial stages
of my research. I also enjoyed our frequent interactions and squash matches. I also wish to
thank my thesis committee members: Dr. Alexei (Alyosha) Efros and Ankur Datta for their
suggestions. Talking to Alyosha has always been a pleasure; our conversations were always
enjoyable.
I am grateful to Prof. Martial Hebert, Prof. Michael Erdmann, Prof. Matthew Mason,
Prof. Srinivas Narasimhan, Prof. Tom Mitchell, and Prof. Eric Xing for providing expert
technical guidance during my interactions with them.
I wish to thank Dr. Simon Lucey for patiently sitting through all the data collection
sessions. I am grateful to Dr. Deva Ramanan (TTI Chicago) for his contributions and help
towards my research. I gratefully acknowledge all the help provided by Sanjeev Koppal and
Mohit Gupta, who sat through my practice talks and provided constructive criticism.
Many thanks to Ralph Gross, Goksel Dedeoglu and Fernando De La Torre for their help.
I wish to thank my classmates Francisco Calderon, Peter Barnum, Kevin Yoon, Manuel
Quero, Stefan, Javier, Kristina, and Ling for their help with the face data collection. I also wish
to thank all my friends here at RI who made my day, every day, over the past two years.
A special thanks to Suzanne Lyons Muth for all the help and for making my life easier
on numerous occasions.
Finally, I wish to thank the Almighty for his divine grace.
Contents
Abstract iii
Acknowledgements v
List of Tables x
List of Figures xi
1 Introduction 1
1.1 Multi-View Face Model Fitting . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Multi-View Face Model Construction . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Dense Face Model Construction . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Background 10
2.1 2D Active Appearance Models . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Fitting a 2D AAM to a Single Image . . . . . . . . . . . . . . . . . . . . . . 12
2.3 2D+3D Active Appearance Models . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Fitting a 2D+3D AAM to a Single Image . . . . . . . . . . . . . . . . . . . . 15
3 Multi-View 2D+3D AAM Fitting and Camera Calibration 18
3.1 Multi-View 2D+3D AAM Fitting . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Camera Calibration: Image Formation Model . . . . . . . . . . . . . . . . . 24
3.4 Camera Calibration Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Calibration using Two Time Instants . . . . . . . . . . . . . . . . . . . . . . 25
3.6 Multiple Time Instant Calibration Algorithm . . . . . . . . . . . . . . . . . . 27
3.7 Calibration as a Single Optimization . . . . . . . . . . . . . . . . . . . . . . 27
3.8 Empirical Evaluation of Calibration . . . . . . . . . . . . . . . . . . . . . . . 28
3.8.1 Qualitative Comparison of Epipolar Geometry . . . . . . . . . . . . . 29
3.8.2 Quantitative Comparison of Epipolar Geometry . . . . . . . . . . . . 30
3.9 Calibrated Multi-View Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.10 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.10.1 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.10.2 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Multi-View 3D Model Construction 40
4.1 Non-Rigid Structure-from-Motion . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Multi-view Structure-from-Motion . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Motion-Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.1 Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.2 Qualitative Multi-View Model Construction Comparison . . . . . . . 46
4.5.3 Quantitative Comparison using Camera Calibration . . . . . . . . . . 47
5 Dense Face Model Construction 51
5.1 Model Densification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.1 Quantitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.1.1 Ground-Truth Data Collection . . . . . . . . . . . . . . . . 58
5.2.1.2 Images used for Optical Flow Computation . . . . . . . . . 60
5.2.1.3 2D Ground-Truth Points Prediction Results . . . . . . . . . 60
5.2.1.4 3D Ground-Truth Points Prediction Results . . . . . . . . . 61
5.2.2 Fitting Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.3 Face Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.4 Application to Rigid Tracker Output . . . . . . . . . . . . . . . . . . 68
6 Conclusion 70
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Bibliography 74
List of Tables
3.1 Fitting Algorithms Timing Results . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 Quantitative Comparison of 3D Models . . . . . . . . . . . . . . . . . . . . . 50
List of Figures
1.1 Experimental Setup for Multi-View Fitting . . . . . . . . . . . . . . . . . . . 4
1.2 Example Face Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Dense Face Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 AAM Shape Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 AAM Appearance Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 AAM Model Instance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 2D+3D AAM Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Uncalibrated Multi-View Fitting Algorithm . . . . . . . . . . . . . . . . . . 22
3.2 Multi-View Tracking Example . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Input to Calibration Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Qualitative Comparison of Calibration Algorithms . . . . . . . . . . . . . . . 30
3.5 Quantitative Comparison of Calibration Algorithms . . . . . . . . . . . . . . 31
3.6 Quantitative Comparison of Calibration Algorithms . . . . . . . . . . . . . . 32
3.7 Calibrated Multi-View Fitting Example . . . . . . . . . . . . . . . . . . . . . 35
3.8 Quantitative Comparison of Fitting Algorithms . . . . . . . . . . . . . . . . 37
4.1 Input to 3D Model Construction Algorithms . . . . . . . . . . . . . . . . . . 45
4.2 Qualitative Comparison of 3D Models . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Quantitative Comparison of 3D Models . . . . . . . . . . . . . . . . . . . . . 48
5.1 Densification Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Algorithm to Add Vertices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Adding Vertices Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4 Algorithm for Image Consistent Re-Triangulation . . . . . . . . . . . . . . . 55
5.5 Image Consistent Re-Triangulation Example . . . . . . . . . . . . . . . . . . 55
5.6 Smoothness Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.7 Input to Densification Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 59
5.8 Quantitative Comparison of Densification Algorithm . . . . . . . . . . . . . 62
5.9 Input to Optical Flow Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 63
5.10 3D Ground-Truth Point Prediction Results . . . . . . . . . . . . . . . . . . . 64
5.11 Dense Multi-Person Tracking Results 1 . . . . . . . . . . . . . . . . . . . . . 65
5.12 Dense Multi-Person Tracking Results 2 . . . . . . . . . . . . . . . . . . . . . 66
5.13 Dense Model Fitting Robustness . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.14 Face Tracking using Automatic Dense AAM Example 1 . . . . . . . . . . . . 67
5.15 Face Tracking using Automatic Dense AAM Example 2 . . . . . . . . . . . . 68
Chapter 1
Introduction
Active Appearance Models (AAMs) [11, 12, 14, 13, 17], and the related concepts of Active
Blobs [34, 35] and Morphable Models [5, 25, 41], are generative models of a certain visual
phenomenon. AAMs are examples of statistical models that are used to characterize the
shape and the appearance of the underlying object by a set of model parameters. Though
AAMs are useful for other phenomena [34, 25], they are commonly used to model faces. In
a typical application, once an AAM has been constructed, the first step is to fit it to an
input image, i.e. model parameters are found to maximize the match between the model
instance and the input image. The model parameters can then be passed to a classifier.
Many different classification tasks are possible.
In this thesis we study three important topics related to deformable face models such as
AAMs: (1) we outline various techniques to simultaneously fit a 3D face model to multiple
images captured from multiple viewpoints, (2) we present a multi-view algorithm for 3D face
model construction, and (3) we present an automatic algorithm for dense deformable face
model construction.
1.1 Multi-View Face Model Fitting
Although AAMs were originally formulated as 2D, there are other deformable 3D models
(3D Morphable Models [5]) and AAMs have also been extended to 3D (2D+3D AAMs [44]).
A number of algorithms have been proposed to build deformable 3D face models and to fit
them efficiently [44, 33, 2, 37, 43, 32, 16]. Deformable 3D face models have a wide variety
of applications. Not only can they be used for tasks like pose estimation, which just require
the estimation of the 3D rigid motion, but also for tasks such as expression recognition and
lipreading, which require, explicitly or implicitly, estimation of the 3D non-rigid motion.
Most of the previous algorithms for AAM fitting and construction have been single-view.
One area that has not been studied much in the past (an exception is [15]) is the development
of simultaneous multi-view algorithms. Multi-view algorithms can potentially perform better
than single-view as they can take into account more visual information. In this thesis we
present multi-view algorithms to both fit and build 3D AAMs.
In the first part of this thesis we study multi-view fitting of AAMs. Fitting an AAM to
an image consists of minimizing the error between the input image and the closest model
instance; i.e. solving a nonlinear optimization problem. Face models are usually fit to a single
image of a face. In many application scenarios, however, it is possible to set up two or more
cameras and acquire simultaneous multiple views of the face. If we integrate the information
from multiple views, we can possibly obtain better application performance. For example,
Gross et al. [19] demonstrated improved face recognition performance by combining multiple
images of the same face captured from multiple widely spaced viewpoints. In Chapter 3, we
describe how a single AAM can be fit to multiple images, captured by cameras with arbitrary
locations, rotations, and response functions.
The main technical challenge is relating the AAM shape parameters in one view with
the corresponding parameters in the other views. This relationship is complex for a 2D
shape model but is straightforward for a 3D shape model. We use 2D+3D AAMs [44] in this
thesis. A 2D+3D AAM contains both a 2D shape model and a 3D shape model. Besides the
requirement of having a 3D shape model, the main advantage of using a 2D+3D AAM is that
2D+3D AAMs can be fit very efficiently in real-time [44]. Corresponding multi-view fitting
algorithms could also be derived for other 3D face models such as 3D Morphable Models [5].
We could easily have used a 3D Morphable Model instead to conduct the research in this
thesis, but the fitting algorithms would have been slower.
To generalize the 2D+3D fitting algorithm to multiple images, we use a separate set of
2D shape parameters for each image, but just a single, global set of 3D shape parameters
as represented in Figure 1.1. We impose the constraints that for each view separately,
the 2D shape model for that view must approximately equal the projection of the single
3D shape model. Imposing these constraints indirectly couples the 2D shape parameters
for each view in a physically consistent manner. Our algorithm can use any number of
cameras, positioned arbitrarily. The cameras can be moved and replaced with different
cameras without any retraining. The computational cost of the multi-view 2D+3D algorithm
is only approximately N times that of the single-view algorithm, where N is the number
of cameras. In Section 3.1 we present a qualitative evaluation of our multi-view 2D+3D
fitting algorithm. We defer the quantitative evaluation to Section 3.9 where we also compare
it with a calibrated multi-view algorithm.
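The coupling described above can be pictured as a soft constraint added to the fitting objective. The following is an illustrative numpy sketch, not the thesis's actual implementation; all function and variable names are our own assumptions:

```python
import numpy as np

def project_scaled_orthographic(cam, S3d):
    """Project a 3D shape (3 x n) to 2D (2 x n) under a scaled
    orthographic camera cam = (M, t): M is a 2x3 scaled-rotation
    matrix and t is a 2-vector image offset."""
    M, t = cam
    return M @ S3d + t[:, None]

def multiview_constraint_residual(shapes2d, cams, S3d, K=1.0):
    """Soft constraint: for each view separately, the 2D shape should
    approximately equal the projection of the single global 3D shape.
    Returns the weighted sum of squared deviations over all views."""
    err = 0.0
    for s2d, cam in zip(shapes2d, cams):
        diff = s2d - project_scaled_orthographic(cam, S3d)
        err += K * np.sum(diff ** 2)
    return err
```

In the fitting algorithm a term of this form is added to the image-matching error of each view, so minimizing the total indirectly couples the per-view 2D shape parameters through the shared 3D shape.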
We also study how our multi-view fitting algorithm can be used for camera calibration.
The multi-view fitting algorithm of Section 3.1 uses the scaled orthographic imaging model
used by previous authors, and in the process of fitting computes, or calibrates, the scaled
orthographic camera matrices. In Section 3.3 we describe an extension of this algorithm to
calibrate weak perspective (or full perspective) camera models for each of the cameras. In
essence, both of these algorithms use the human face as a (non-rigid) calibration grid. Such
[Figure 1.1 diagram: three cameras, each with its own camera projection matrix and 2D shape, viewing one object with a single global 3D rotation, translation, and 3D shape, tracked across frames.]
Figure 1.1: A representation of the experimental setup for multi-view 2D+3D AAM fitting. For
each view we have a separate set of 2D shape parameters and camera projection matrices, but just
a single, global set of 3D shape parameters and the associated global 3D rotation and translation.
Our fitting algorithm imposes the constraints that for each view separately, the 2D shape model
for that view must approximately equal the projection of the single 3D shape model.
an algorithm may be useful in a surveillance setting where we wish to install the cameras on
the fly, but avoid walking around the scene with a calibration grid.
The perspective algorithm requires at least two sets of multi-view images of the face at two
different locations. More images can be used to improve the accuracy if they are available.
We evaluate our algorithm by comparing it with an algorithm that uses a calibration grid
and show the performance to be roughly comparable.
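Given the 2D mesh points observed in an image and the corresponding 3D shape recovered by the fitter, estimating an affine camera reduces to a linear least-squares problem. A minimal sketch under that framing, with illustrative names (not the thesis's code):

```python
import numpy as np

def fit_affine_camera(X3d, x2d):
    """Least-squares fit of a 2x4 affine camera M such that
    x2d ~= M @ [X3d; 1] (homogeneous 3D points to 2D image points).
    X3d: 3 x n model points, x2d: 2 x n observed image points."""
    n = X3d.shape[1]
    Xh = np.vstack([X3d, np.ones((1, n))])          # 4 x n homogeneous
    # Solve min_M ||M @ Xh - x2d||_F via the transposed linear system.
    Mt, *_ = np.linalg.lstsq(Xh.T, x2d.T, rcond=None)
    return Mt.T                                      # 2 x 4
```

A weak-perspective camera additionally requires the two rows of the left 2x3 block to be orthogonal with equal norm; an unconstrained affine fit like the one above is a common linear starting point before such a constraint is enforced.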
We then show how camera calibration can improve the performance of multi-view face
model fitting. We present an extension of the multi-view AAM fitting algorithm of Section 3.1 that takes advantage of calibrated cameras. We use the calibration algorithm of
Section 3.3 to explicitly provide calibration information to the multi-view fitting algorithm.
We demonstrate that this algorithm results in far better fitting performance than either the
single-view fitting (Chapter 2) or the uncalibrated¹ multi-view fitting (Section 3.1) algo-
rithms. We consider two performance measures: (1) the robustness of fitting - the likelihood
of convergence for a given magnitude perturbation from the ground-truth, and (2) speed
of fitting - the average number of iterations required to converge from a given magnitude
perturbation from the ground-truth.
1.2 Multi-View Face Model Construction
In the second part of this thesis we study calibrated multi-view construction of AAMs. A
variety of non-rigid structure-from-motion algorithms have been proposed, both non-linear
[8, 40] and linear [9, 45, 46] that can be used for deformable 3D model construction from
both a single view [8, 9, 45, 46] and multiple views [40].
In most cases, it is only practical to apply face model construction algorithms to data
with relatively little pose variation. Tracking facial feature points becomes more difficult the
more pose variation there is. Unfortunately, single-view and multi-view algorithms such as
non-rigid structure-from-motion have a tendency to scale (stretch or compress) the face in
the depth-direction when applied to data with only medium amounts of pose variation. The
problem is not the algorithms themselves, but the Bas-Relief ambiguity between the camera
translation/rotation and the depth [47, 38, 36, 23]. The Bas-Relief ambiguity is normally
formulated in the case of rigid structure-from-motion, but applies equally in the non-rigid
¹Note that for the uncalibrated multi-view algorithm described in Section 3.1, the calibration parameters are unknown and are estimated as a part of the optimization. For the calibrated multi-view fitting algorithm the calibration parameters are known and are obtained from a calibration algorithm (possibly the algorithm of Section 3.3).
case. As empirically validated in Chapter 4, the result is a compressed/stretched face model,
which gives erroneous estimates of the 3D rigid and non-rigid motion.
One way to eliminate the ambiguity is to use a calibrated stereo rig instead of a single
camera. The known, fixed translation between the cameras then sets the scale and breaks the
ambiguity. The straightforward approach is to use stereo to build a static 3D model at each
time instant and then build the deformable model by modeling how the 3D shape changes
across time. Two algorithms that take this approach are [10, 18], one in the uncalibrated
case [10], the other in the calibrated case [18]. An alternative approach is to extend the non-
rigid structure-from-motion paradigm of [9, 8, 40, 45] and pose the face model construction
problem as a single large optimization over the unknown shape model modes, in essence
a large bundle adjustment. In Chapter 4 of this thesis we derive a calibrated multi-view
non-rigid motion-stereo algorithm [42, 48] to do exactly this. Our multi-view algorithm
explicitly incorporates the knowledge of the calibrated relative orientation of the cameras in
the stereo rig. In Section 4.5 we present qualitative results to validate these claims. We also
use the multi-view calibration algorithm described in Chapter 3 to quantitatively compare
the fidelity of 3D models.
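The single large optimization described above can be pictured as a reprojection residual stacked over all frames and all calibrated views; a bundle-adjustment style solver would minimize its squared norm over the shape modes, per-frame mode coefficients, and rigid poses. A hedged sketch, with names and structure of our own choosing:

```python
import numpy as np

def nonrigid_shape(s0, modes, b):
    """3D shape as a base shape (3 x n) plus a linear combination
    of 3D shape modes (each 3 x n) weighted by coefficients b."""
    S = s0.copy()
    for bi, Si in zip(b, modes):
        S = S + bi * Si
    return S

def motion_stereo_residuals(obs, cams, s0, modes, poses, coeffs):
    """Reprojection residuals over all frames t and calibrated views v.
    obs[t][v]: observed 2 x n points; cams[v] = (M, tv) is a fixed,
    calibrated 2x3 camera matrix plus 2-vector offset; poses[t] = (R, T)
    is the global rigid motion; coeffs[t] are the mode weights."""
    res = []
    for t_idx, (R, T) in enumerate(poses):
        S = R @ nonrigid_shape(s0, modes, coeffs[t_idx]) + T[:, None]
        for v, (M, tv) in enumerate(cams):
            res.append((obs[t_idx][v] - (M @ S + tv[:, None])).ravel())
    return np.concatenate(res)
```

Because the camera matrices `cams` are held fixed at their calibrated values, the known baseline between the cameras sets the depth scale, which is exactly what breaks the Bas-Relief ambiguity.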
1.3 Dense Face Model Construction
Deformable face models are generative parametric models that are used to model both rigid
and non-rigid deformations. The two best known examples of deformable models are Active
Appearance Models (AAMs) [12, 14, 13, 17, 26] and 3D Morphable Models (3DMMs) [5,
25, 33, 41, 8]. Although AAMs and 3DMMs are closely related, there are a number of
differences between them. One main difference between AAMs and 3DMMs is that AAMs
are typically sparse whereas 3DMMs are typically dense. This difference is mainly based
Figure 1.2: An illustration of the effects that make optical flow hard for human faces: (1) appear-
ance/disappearance of structures such as teeth and wrinkles (2) non-Lambertian, largely textureless
regions of skin such as the cheeks.
on how these models are constructed. AAMs are normally constructed from a collection
of training images of faces with a mesh of canonical feature points hand-marked on them.
Since the feature points are hand-marked, the correspondence can only be sparse. 3DMMs
are usually computed by running an optical flow algorithm to estimate the dense non-rigid
alignment of the texture maps [5]. AAMs could also potentially be constructed from dense
correspondence estimation using optical flow.
Computing dense alignment of face images using optical flow is difficult for a number of
reasons, as illustrated in Figure 1.2. Observe the appearance/disappearance of structures
such as teeth and wrinkles. Also note the non-Lambertian reflectance of textureless regions
such as the cheeks. Optical flow algorithms are not robust to such variations in the images.
Note, however, that the failure of optical flow in the construction of 3DMMs is often
masked because most applications are in graphics: the artifacts are hidden by the
texture mapping of the 3D mesh.
In the final part of this thesis we propose a different approach to building a dense deformable face model. Rather than assuming that dense correspondence can be computed
in a pre-processing step, our algorithm instead builds a dense model (Figure 1.3) by itera-
tively building a face model, fitting the model to image data and then refining the model.
There are three ways in which the model is refined: (1) by adding more mesh vertices, (2) by
changing the mesh connectivity using image-consistent surface triangulation [30], and (3) by
refining the shape modes using a modification of the algorithm in [4]. Although the goal of
our algorithm is to compute a dense model, note that in the process it implicitly computes
the dense correspondence that an optical flow algorithm would. However, as the model is
refined, it builds a model of visual effects such as the appearance/disappearance of structure
such as the teeth and wrinkles, and also builds an implicit model of the illumination variation
manifested across the face including large textureless regions such as the cheeks. This is the
reason our algorithm is able to outperform standard optical flow algorithms.
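Refinement step (1), adding mesh vertices, can be illustrated with a standard 1-to-4 midpoint subdivision. This is only one simple vertex-addition rule, sketched under our own assumptions; the thesis's actual algorithm (Section 5.1) may add vertices differently:

```python
import numpy as np

def add_midpoint_vertices(verts, tris):
    """Insert the midpoint of every mesh edge as a new vertex and split
    each triangle 1-to-4. verts: n x 2 array of vertex positions,
    tris: list of vertex-index triples."""
    verts = [np.asarray(v, dtype=float) for v in verts]
    edge_mid = {}  # shared edges get a single midpoint vertex

    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in edge_mid:
            edge_mid[key] = len(verts)
            verts.append((verts[i] + verts[j]) / 2.0)
        return edge_mid[key]

    new_tris = []
    for a, b, c in tris:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_tris += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return np.vstack(verts), new_tris
```

After a densification pass like this, steps (2) and (3) re-triangulate the denser mesh for image consistency and refine the shape modes.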
In Section 5.2 we show a number of results to illustrate that our densification algorithm
can be used to accurately build dense models. We first evaluate our algorithm quantitatively
with a set of ground-truth data using a form of hidden markers. We compare with a number
of popular optical flow algorithms [24, 27, 31] for the same task and find our algorithm to
be more robust and more accurate. We then perform comparisons to show improvement in
fitting robustness. We also present a number of tracking results to qualitatively illustrate
other aspects of our algorithm. Finally, we also show how our algorithm can be used to
construct dense deformable models automatically, starting with a rigid planar model of the
face that is subsequently refined to model the non-planarity and the non-rigid components.
Figure 1.3: An example dense mesh achieved using our densification algorithm. On the left we
show the initial sparse mesh as well as the mesh vertices. On the right we show the resulting
triangulated mesh as well as vertices after applying the densification algorithm.
Chapter 2
Background
In this section we review 2D Active Appearance Models (AAMs) [13] and 2D+3D Active
Appearance Models [44]. We also revisit the efficient inverse compositional fitting algo-
rithms [3, 44].
2.1 2D Active Appearance Models
The 2D shape s of a 2D Active Appearance Model is a 2D triangulated mesh. In particular,
s is a column vector containing the vertex locations of the mesh. AAMs allow linear shape
variation. This means that the 2D shape s can be expressed as a base shape s0 plus a linear
combination of m shape vectors si:
s = s_0 + \sum_{i=1}^{m} p_i s_i \qquad (2.1)
where the coefficients pi are the shape parameters. AAMs are normally computed from
training data consisting of a set of images with the shape mesh (hand) marked on them
[13]. The Procrustes alignment algorithm and Principal Component Analysis (PCA) are
then applied to compute the base shape s0 and the shape vectors si [13]. An example
mesh is shown in Figure 2.1.
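Equation (2.1) and its PCA construction can be sketched in a few lines of numpy. This is an illustrative sketch with our own names, assuming the Procrustes alignment mentioned above has already been applied to the training shapes:

```python
import numpy as np

def shape_instance(s0, shape_vectors, p):
    """Equation (2.1): s = s0 + sum_i p_i s_i. s0 is a flattened
    2n-vector of (x, y) vertex locations; shape_vectors is m x 2n,
    one shape vector s_i per row; p holds the m shape parameters."""
    return s0 + shape_vectors.T @ p

def build_shape_model(aligned_shapes, m):
    """PCA on Procrustes-aligned training shapes (rows are flattened
    2n-vectors). Returns the base (mean) shape s0 and the m shape
    vectors corresponding to the largest singular values."""
    s0 = aligned_shapes.mean(axis=0)
    _, _, Vt = np.linalg.svd(aligned_shapes - s0, full_matrices=False)
    return s0, Vt[:m]
```

The appearance model of Equation (2.2) has exactly the same linear form, with shape-normalised images in place of mesh vertex vectors.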
Figure 2.1: The 2D linear shape model of an AAM. The model consists of a triangulated base
mesh s0 plus a linear combination of m shape vectors si. The base mesh is shown on the left,
followed by the first three shape vectors s1, s2, and s3 overlaid over the base mesh.
The appearance of an AAM is defined within the base mesh s0. Let s0 also denote the
set of pixels u = (u, v)T that lie inside the base mesh s0, a convenient notational short-cut.
The appearance of the AAM is then an image A(u) defined over the pixels u ∈ s0. AAMs
allow linear appearance variation. This means that the appearance A(u) can be expressed
as a base appearance A0(u) plus a linear combination of l appearance images Ai(u):
A(u) = A_0(u) + \sum_{i=1}^{l} \lambda_i A_i(u) \qquad (2.2)
where the coefficients λi are the appearance parameters. The base (mean) appearance A0
and appearance images Ai are usually computed by applying Principal Component Analysis
to the shape normalised training images [13]. The appearance variation of an AAM is
illustrated in Figure 2.2.
Although Equations (2.1) and (2.2) describe the AAM shape and appearance variation,
they do not describe how to generate a model instance. The AAM model instance (Figure 2.3)
with shape parameters p and appearance parameters λi is created by warping the appearance
A from the base mesh s0 to the model shape mesh s. In particular, the pair of meshes s0
and s define a piecewise affine warp from s0 to s, denoted¹ W(u; p) [28].
¹Note that for ease of presentation we have omitted any mention of the 2D similarity transformation that
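The piecewise affine warp W(u; p) maps each pixel of a base-mesh triangle to the corresponding triangle of the current shape mesh. For a single triangle pair the affine map has a closed form; a minimal sketch with illustrative names, assuming one pair of corresponding triangles:

```python
import numpy as np

def affine_from_triangles(src_tri, dst_tri):
    """2x3 affine A such that A @ [x, y, 1]^T maps each vertex of
    src_tri (3 x 2 array) to the corresponding vertex of dst_tri."""
    S = np.vstack([src_tri.T, np.ones(3)])   # 3x3: homogeneous vertices
    return dst_tri.T @ np.linalg.inv(S)      # 2x3 affine map

def warp_points(pts, src_tri, dst_tri):
    """Apply the triangle's affine map to an n x 2 array of points."""
    A = affine_from_triangles(src_tri, dst_tri)
    ph = np.vstack([pts.T, np.ones(pts.shape[0])])
    return (A @ ph).T
```

The full warp applies one such affine map per triangle of the mesh, choosing for each pixel the map of the triangle that contains it; this is what makes the overall warp piecewise affine.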
Figure 2.2: The 2D linear appearance model of an AAM. The model consists of a base appearance
A0 defined over all the pixels inside the base shape mesh s0 plus a linear combination of l appearance
vectors Ai.
2.2 Fitting a 2D AAM to a Single Image
The goal of fitting a 2D AAM to a single input image I [28] is to minimize:
\sum_{u \in s_0} \left[ A_0(u) + \sum_{i=1}^{l} \lambda_i A_i(u) - I(W(u;p)) \right]^2 = \left\| A_0(u) + \sum_{i=1}^{l} \lambda_i A_i(u) - I(W(u;p)) \right\|^2 \qquad (2.3)
with respect to the 2D shape p and appearance λi parameters. In [28] it was shown that the
inverse compositional algorithm [3] can be used to optimize the expression in Equation (2.3).
The algorithm uses the “project out” algorithm [21, 28] to break the optimization into two
steps. The first step consists of optimizing:
\left\| A_0(u) - I(W(u;p)) \right\|^2_{\mathrm{span}(A_i)^\perp} \qquad (2.4)
with respect to the shape parameters p where the subscript span(Ai)⊥ means project the
vector into the subspace orthogonal to the subspace spanned by Ai, i = 1, . . . , l. The second
step consists of solving for the appearance parameters:
is used with an AAM to normalise the shape [13]. In this thesis we include the normalising warp in W(u;p) and the similarity normalisation parameters in p. See [28] for a description of how to include the normalising warp in W(u;p).
Figure 2.3: An example of an AAM model instance. The shape parameters p are used to create
the model shape s and the appearance parameters λi are used to create the appearance model A.
The model instance is created by warping the appearance A from the base mesh s0 to the model
shape mesh s. In particular, the pair of meshes s0 and s define a piecewise affine warp from s0
to s, denoted W(u;p).
\lambda_i = -\sum_{u \in s_0} A_i(u) \left[ A_0(u) - I(W(u;p)) \right] \qquad (2.5)
where the appearance vectors Ai are orthonormal. Optimizing Equation (2.4) itself can be
performed by iterating the following two steps. Step 1 consists of computing:
\Delta p = -H_{2D}^{-1} \, \Delta p_{SD} \quad \text{where} \quad \Delta p_{SD} = \sum_{u \in s_0} \left[ SD_{2D}(u) \right]^T \left[ A_0(u) - I(W(u;p)) \right]
where the following two terms can be pre-computed (and combined) to achieve high
efficiency:
SD_{2D}(u) = \left[ \nabla A_0 \frac{\partial W}{\partial p} \right]_{\mathrm{span}(A_i)^\perp} \qquad H_{2D} = \sum_{u \in s_0} \left[ SD_{2D}(u) \right]^T SD_{2D}(u)

where \nabla A_0 = \left( \frac{\partial A_0}{\partial x}, \frac{\partial A_0}{\partial y} \right).
Step 2 consists of updating the warp by composing with the inverse incremental warp:
W(u;p) \leftarrow W(u;p) \circ W(u;\Delta p)^{-1} \qquad (2.6)
The resulting 2D AAM fitting algorithm runs at over 200 frames per second. See [28] for
more details.
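The two steps above can be illustrated with a small numerical sketch (NumPy; the arrays are random stand-ins for the real precomputed quantities, so only the algebra of Equations (2.4)–(2.6) is shown, not an actual image fit):

```python
import numpy as np

rng = np.random.default_rng(1)
n_pix, n_p, l = 2000, 8, 4  # pixels, shape parameters, appearance images

# Precomputed quantities (random stand-ins):
J = rng.standard_normal((n_pix, n_p))                 # grad(A0) * dW/dp, one row per pixel
A = np.linalg.qr(rng.standard_normal((n_pix, l)))[0]  # orthonormal appearance images A_i

# "Project out" the appearance subspace: SD_2D(u) = [grad(A0) dW/dp]_{span(Ai) perp}
SD = J - A @ (A.T @ J)
H = SD.T @ SD                  # H_2D, precomputed once
H_inv = np.linalg.inv(H)

A0 = rng.standard_normal(n_pix)        # base appearance
I_warped = rng.standard_normal(n_pix)  # I(W(u;p)) sampled at the current p

# Step 1: dp = -H_2D^{-1} * sum_u SD_2D(u)^T [A0(u) - I(W(u;p))]
error = A0 - I_warped
dp = -H_inv @ (SD.T @ error)
# Step 2 would update the warp: W(u;p) <- W(u;p) o W(u;dp)^{-1}

# Afterwards, the appearance parameters follow from Equation (2.5):
lam = -A.T @ error
```

Because the steepest-descent images `SD` have the appearance subspace projected out, `SD.T @ A` is zero, which is what allows appearance to be solved for separately at the end.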
2.3 2D+3D Active Appearance Models
Most deformable 3D face models, including 3D Morphable Models [5] and the models in [9, 8,
40, 45], use a 3D linear shape variation model, essentially equivalent to a 3D generalization of
the model in Section 2.1. The 3D shape s̄ is a 3D triangulated mesh which can be expressed
as a base shape s̄_0 plus a linear combination of m̄ shape vectors s̄_j:

\bar{s} = \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \qquad (2.7)

where the coefficients p̄_j are the 3D shape parameters.
A 2D+3D AAM [44] consists of the 2D shape variation s_i of a 2D AAM governed by
Equation (2.1), the appearance variation A_i(u) of a 2D AAM governed by Equation (2.2),
and the 3D shape variation s̄_j of a 3D AAM governed by Equation (2.7). The 2D shape
variation s_i and the appearance variation A_i(u) of the 2D+3D AAM are constructed exactly
as for a 2D AAM. The construction of the 3D shape variation s̄_j is the subject of Chapter 4
of this thesis.
To generate a 2D+3D model instance, an image formation model is needed to convert
the 3D shape s̄ into a 2D mesh, onto which the appearance is warped. In [44] the following
scaled orthographic imaging model was used:
u = P_{so}\,x = \sigma \begin{pmatrix} i_x & i_y & i_z \\ j_x & j_y & j_z \end{pmatrix} x + \begin{pmatrix} o_u \\ o_v \end{pmatrix} \qquad (2.8)

where x = (x, y, z) is a 3D vertex location, (o_u, o_v) is an offset to the origin, σ is the scale,
and the projection axes i = (i_x, i_y, i_z) and j = (j_x, j_y, j_z) are unit length and orthogonal:
i · i = j · j = 1; i · j = 0. The model instance is then computed by projecting every 3D shape
vertex onto a 2D vertex using Equation (2.8). The 2D appearance A(u) is finally warped
onto the 2D mesh (taking into account visibility) to generate the final model instance.
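Equation (2.8) is simple to implement; a minimal NumPy sketch, using a hypothetical frontal camera (axes and values are illustrative only):

```python
import numpy as np

def project_so(x, sigma, i, j, o):
    """Scaled orthographic projection of Equation (2.8).

    x: (n, 3) 3D vertices; sigma: scale; i, j: orthonormal projection axes;
    o: (o_u, o_v) image offset. Returns (n, 2) projected 2D vertices.
    """
    P = np.vstack([i, j])        # the 2x3 projection matrix
    return sigma * x @ P.T + o

# A hypothetical frontal camera: i along x, j along y (so the depth axis k = i x j is +z).
i_ax = np.array([1.0, 0.0, 0.0])
j_ax = np.array([0.0, 1.0, 0.0])
verts = np.array([[0.0, 0.0, 10.0],
                  [1.0, 2.0, 10.0]])
uv = project_so(verts, sigma=2.0, i=i_ax, j=j_ax, o=np.array([100.0, 50.0]))
# verts[1] maps to (2*1 + 100, 2*2 + 50) = (102, 54)
```

Note that depth (the z components here) is discarded entirely; only the scale σ carries any depth information, which is what the weak perspective model of Chapter 3 later makes explicit.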
2.4 Fitting a 2D+3D AAM to a Single Image
The goal of fitting a 2D+3D AAM to an image I [44] is to minimize:

\left\| A_0(u) + \sum_{i=1}^{l} \lambda_i A_i(u) - I(W(u;p)) \right\|^2 + K \cdot \left\| s_0 + \sum_{i=1}^{m} p_i s_i - P_{so}\!\left( \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \right) \right\|^2 \qquad (2.9)
with respect to p, λ_i, P_so, and p̄, where K is a large constant weight. A pictorial representation of the 2D+3D AAM fitting is shown in Figure 2.4.
Equation (2.9) should be interpreted as follows. The first term in Equation (2.9) is the
2D AAM fitting criterion. The second term enforces the (heavily weighted, soft) constraints
that the 2D shape s equals the projection of the 3D shape s̄ with projection matrix P_so.
In [44] it was shown that the 2D AAM fitting algorithm [28] can be extended to a 2D+3D
AAM. The resulting algorithm still runs in real-time [29].
As with the 2D AAM algorithm, the “project out” algorithm [28] is used to break the
optimization into two steps, the first optimizing:
\left\| A_0(u) - I(W(u;p)) \right\|^2_{\mathrm{span}(A_i)^\perp} + K \cdot \sum_i F_i^2(p; P_{so}; \bar{p}) \qquad (2.10)
Figure 2.4: A representation of the 2D+3D AAM fitting algorithm. The fitting goal consists of
two terms: (1) the 2D fitting goal, and (2) the regularization term that enforces the 2D shape s to
equal the projection of the 3D shape s̄ with projection matrix P_so.
with respect to p, P_so, and p̄, where F_i(p; P_so; p̄) is the error inside the L2 norm in the
second term of Equation (2.9) for each of the mesh x and y vertices. The second step
solves for the appearance parameters using Equation (2.5). The 2D+3D algorithm has more
unknowns to solve for than the 2D algorithm. As a notational convenience, concatenate all
the unknown parameters into one vector q = (p; P_so; p̄). Optimizing Equation (2.10) is then
performed by iterating the following two steps. Step 1 consists of computing²:

\Delta q = -H_{3D}^{-1} \, \Delta q_{SD} = -H_{3D}^{-1} \left( \begin{pmatrix} \Delta p_{SD} \\ 0 \end{pmatrix} + K \cdot \sum_i \left( \frac{\partial F_i}{\partial q} \right)^{T} F_i(q) \right) \qquad (2.11)

²To simplify presentation, in this thesis we omit the additional correction that needs to be made to F_i(p; P_so; p̄) to use the inverse compositional algorithm. See [44] for details.
where:

H_{3D} = \begin{pmatrix} H_{2D} & 0 \\ 0 & 0 \end{pmatrix} + K \cdot \sum_i \left( \frac{\partial F_i}{\partial q} \right)^{T} \frac{\partial F_i}{\partial q}. \qquad (2.12)
Step 2 consists of first extracting the parameters p, P_so, and p̄ from q, then updating
the warp using Equation (2.6), and updating the other parameters P_so and p̄ additively [29].
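The assembly of this Gauss-Newton step over the concatenated parameters q can be sketched numerically (NumPy; all dimensions and arrays are illustrative stand-ins, not values from the thesis):

```python
import numpy as np

rng = np.random.default_rng(2)
n_p2d, n_cam, n_p3d = 8, 6, 5   # 2D shape, camera, and 3D shape parameter counts (assumed)
n_q = n_p2d + n_cam + n_p3d     # q = (p; P_so; pbar)
n_F = 2 * 34                    # one residual F_i per mesh x and y coordinate (34 vertices assumed)
K = 1e6                         # large constant weight

H2D = np.eye(n_p2d)                      # 2D AAM Hessian (stand-in)
dp_sd = rng.standard_normal(n_p2d)       # 2D steepest-descent updates
dF_dq = rng.standard_normal((n_F, n_q))  # Jacobian dFi/dq of the constraints
F = rng.standard_normal(n_F)             # constraint residuals F_i(q)

# Equation (2.12): H_3D = diag(H_2D, 0) + K * sum_i (dFi/dq)^T (dFi/dq)
H3D = np.zeros((n_q, n_q))
H3D[:n_p2d, :n_p2d] = H2D
H3D += K * dF_dq.T @ dF_dq

# Equation (2.11): dq = -H_3D^{-1} ( [dp_sd; 0] + K * sum_i (dFi/dq)^T Fi )
rhs = np.concatenate([dp_sd, np.zeros(n_cam + n_p3d)]) + K * dF_dq.T @ F
dq = -np.linalg.solve(H3D, rhs)
```

The zero padding reflects that the 2D appearance term provides no gradient for the camera and 3D shape parameters; only the soft constraints do.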
Chapter 3
Multi-View 2D+3D AAM Fitting and
Camera Calibration
In the previous chapter we reviewed efficient algorithms for fitting an AAM to
a single image. If we have multiple, simultaneous views of the face, the performance of
AAM fitting can be improved by using all of the views. In this chapter we first describe an
algorithm to fit a single 2D+3D AAM simultaneously to multiple images. During fitting
we impose the constraints that for each view separately, the 2D shape model for that view
must approximately equal the projection of the single 3D shape model. Our algorithm can
use any number of cameras, positioned arbitrarily. We then show how our multi-view fitting
algorithm can be used for camera calibration. We describe an algorithm to calibrate weak
perspective (or full perspective) camera models for each of the cameras using the human
face as a (non-rigid) calibration grid. Finally we show how camera calibration can improve
the performance of multi-view face model fitting.
3.1 Multi-View 2D+3D AAM Fitting
Suppose that we have N images In : n = 1, . . . , N of a face that we wish to fit the 2D+3D
AAM to. In this section we assume that the images are captured simultaneously by syn-
chronized, but uncalibrated cameras (see Section 3.9 for a calibrated algorithm.) The naive
algorithm is to fit the 2D+3D AAM independently to each of the images. This algorithm can
be improved upon by using the fact that, since the images In are captured simultaneously,
the 3D shape of the face is the same in all views. We therefore pose fitting a single 2D+3D
AAM to multiple images as minimizing:
\sum_{n=1}^{N} \left( \left\| A_0(u) + \sum_{i=1}^{l} \lambda_i^n A_i(u) - I^n(W(u;p^n)) \right\|^2 + K \cdot \left\| s_0 + \sum_{i=1}^{m} p_i^n s_i - P_{so}^n\!\left( \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \right) \right\|^2 \right) \qquad (3.1)
simultaneously with respect to the N sets of 2D shape parameters p^n, the N sets of appearance parameters λ^n_i (the appearance may be different in different images due to different
camera response functions, etc.), the N sets of camera matrices P^n_so, and the one, global
set of 3D shape parameters p̄. Note that the 2D shape parameters in each image are not
independent, but are coupled in a physically consistent¹ manner through the single set of
3D shape parameters p̄. Optimizing Equation (3.1) therefore cannot be decomposed into
N independent optimizations. The appearance parameters λ^n_i can, however, be dealt with
using the “project out” algorithm [21, 28] in the usual way; i.e. we first optimize:
\sum_{n=1}^{N} \left( \left\| A_0(u) - I^n(W(u;p^n)) \right\|^2_{\mathrm{span}(A_i)^\perp} + K \cdot \left\| s_0 + \sum_{i=1}^{m} p_i^n s_i - P_{so}^n\!\left( \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \right) \right\|^2 \right) \qquad (3.2)
with respect to p^n, P^n_so, and p̄, and then solve for the appearance parameters:
¹Note that directly coupling the 2D shape models would be difficult due to the complex relationship between the 2D shape in one image and another. Multi-view face model fitting is best achieved with a 3D model. A similar algorithm could be derived for other 3D face models such as 3D Morphable Models [5]. The main advantage of using a 2D+3D AAM [44] is the fitting speed.
\lambda_i^n = -\sum_{u \in s_0} A_i(u) \cdot \left[ A_0(u) - I^n(W(u;p^n)) \right].
Organize the unknowns in Equation (3.2) into a single vector r = (p^1; P^1_so; \ldots; p^N; P^N_so; \bar{p}).
Also, split the single-view 2D+3D AAM terms from Equations (2.11) and (2.12)
into the parts that correspond to the 2D image parameters (p^n and P^n_so) and the 3D shape parameters (p̄):

\Delta q^n_{SD} = \begin{pmatrix} \Delta q^n_{SD,2D} \\ \Delta q^n_{SD,\bar{p}} \end{pmatrix} \quad \text{and} \quad H^n_{3D} = \begin{pmatrix} H^n_{3D,2D,2D} & H^n_{3D,2D,\bar{p}} \\ H^n_{3D,\bar{p},2D} & H^n_{3D,\bar{p},\bar{p}} \end{pmatrix}.
Optimising Equation (3.2) can then be performed by iterating the following two steps.
Step 1 consists of computing:
∆r = −H−1MV∆rSD = −H−1
MV
∆q1SD,2D
...
∆qNSD,2D∑N
n=1 ∆qnSD,p
(3.3)
where:

H_{MV} = \begin{pmatrix} H^1_{3D,2D,2D} & 0 & \cdots & 0 & H^1_{3D,2D,\bar{p}} \\ 0 & H^2_{3D,2D,2D} & \cdots & 0 & H^2_{3D,2D,\bar{p}} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & H^N_{3D,2D,2D} & H^N_{3D,2D,\bar{p}} \\ H^1_{3D,\bar{p},2D} & H^2_{3D,\bar{p},2D} & \cdots & H^N_{3D,\bar{p},2D} & \sum_{n=1}^{N} H^n_{3D,\bar{p},\bar{p}} \end{pmatrix}.
Step 2 consists of extracting the parameters p^n, P^n_so, and p̄ from r, updating the
warp parameters p^n using Equation (2.6), and updating the other parameters P^n_so and p̄ additively.
The N image algorithm is very similar to N copies of the single image algorithm. Almost
all of the computation is simply replicated N times, one copy for each image. The only extra
computation is adding the N terms in the components of Δr_SD and H_MV that correspond to
the single set of global 3D shape parameters p̄, inverting the matrix H_MV, and the matrix
multiply in Equation (3.3). Overall, the N image algorithm is therefore approximately N
times slower than the single image 2D+3D fitting algorithm. (It is more than N times slower
due to the larger matrix inversion and matrix multiplication steps, but in practice only slightly
so.)
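The block-sparse structure of H_MV can be made concrete with a small sketch (NumPy; the dimensions, the helper name `assemble_HMV`, and the dictionary keys are all our own, chosen for illustration):

```python
import numpy as np

def assemble_HMV(H_blocks):
    """Assemble the multi-view Hessian of Equation (3.3) from N single-view
    2D+3D Hessians, each split into its 2D-parameter and 3D-shape blocks."""
    N = len(H_blocks)
    n2d = H_blocks[0]["2d2d"].shape[0]
    n3d = H_blocks[0]["pp"].shape[0]
    H = np.zeros((N * n2d + n3d, N * n2d + n3d))
    for n, b in enumerate(H_blocks):
        r = slice(n * n2d, (n + 1) * n2d)
        H[r, r] = b["2d2d"]              # block diagonal in the per-view parameters
        H[r, N * n2d:] = b["2dp"]        # coupling of view n to the shared 3D shape
        H[N * n2d:, r] = b["p2d"]
        H[N * n2d:, N * n2d:] += b["pp"] # 3D-shape block is summed over views
    return H

rng = np.random.default_rng(3)
n2d, n3d, N = 4, 3, 3  # illustrative sizes
blocks = []
for _ in range(N):
    J = rng.standard_normal((10, n2d + n3d))  # per-view Jacobian stand-in
    Hn = J.T @ J
    blocks.append({"2d2d": Hn[:n2d, :n2d], "2dp": Hn[:n2d, n2d:],
                   "p2d": Hn[n2d:, :n2d], "pp": Hn[n2d:, n2d:]})
HMV = assemble_HMV(blocks)
```

Only the last block row and column couple the views, which is why the cost grows essentially linearly in N.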
3.2 Experimental Results
An example of using our algorithm to fit a single 2D+3D AAM to three simultaneously
captured images2 of a face is shown in Figure 3.1. For all the results in this chapter,
the translation and scale of the 2D face model in each view is initialized by hand and the
2D shape set to be the mean shape. However, 2D+3D AAMs can easily be initialized
with a face detector [29]. See the movie iterations.mov for the fitting video sequence.
The initialization is displayed in the top row of the figure, the result after 5 iterations in
the middle row, and the final converged result in the bottom row. In each case, all three
input images are overlaid with the 2D shape pn plotted in dark dots. We also display the
recovered pose angles (roll, pitch and yaw) extracted from the three scaled orthographic
camera matrices Pnso in the top left of each image. Each camera computes a different relative
head pose, illustrating that the estimate of Pnso is view dependent. The single 3D shape p
for all views at the current iteration is displayed in the top-right of the center image. The
view-dependent camera projection of this 3D shape is also plotted as a white mesh overlaid
on the face.
Applying the multi-view fitting algorithm sequentially allows us to track the face simultaneously in N video sequences. Some example frames of the algorithm being used to track a
face in a trinocular sequence are shown in Figure 3.2. We also include the movie tracking.mov
²Note that the input images for all experiments described in this thesis are chosen such that there is no occlusion of the face. For ways to handle occlusion in the input data see [20, 29].
[Figure 3.1 rows: Initialisation, After 5 Iterations, Converged. Columns: Left Camera, Center Camera, Right Camera.]
Figure 3.1: An example of using our uncalibrated multi-view fitting algorithm to fit a single 2D+3D
AAM to three simultaneous images of a face. Each image is overlaid with the corresponding 2D
shape for that image in dark dots. The head pose (extracted from the camera matrix P^n_so) is
displayed in the top left of each image as roll, pitch and yaw. The single 3D shape p̄ for the current
‘3-frame’ is displayed in the top right of the center image. This 3D shape is also overlaid in each
image, using the corresponding P^n_so, as a white mesh. See the movie iterations.mov for a video
of the whole fitting sequence.
for the complete tracking sequence. The tracking remains accurate and stable both over time
and between views. In Section 3.9 we present a quantitative evaluation of this multi-view
algorithm.
[Figure 3.2 rows: Frame 1, Frame 120, Frame 200. Columns: Left Camera, Center Camera, Right Camera.]
Figure 3.2: An example of our multi-view fitting algorithm being used to track a face in a trinocular
sequence. As the face is tracked we compute a single 3D shape and three estimates of the head pose
using three independent camera matrices. See the movie tracking.mov for the complete sequence.
3.3 Camera Calibration: Image Formation Model
The multi-view fitting algorithm in Chapter 3 uses the scaled orthographic image formation
model in Equation (2.8). A more powerful model when working with multiple cameras
(because it models the coupling between the scales across the cameras through the focal
lengths and average depths) is the weak perspective model:
u = P_{wp}(x) = \frac{f}{o_z + \bar{z}} \begin{pmatrix} i_x & i_y & i_z \\ j_x & j_y & j_z \end{pmatrix} x + \begin{pmatrix} o_u \\ o_v \end{pmatrix}. \qquad (3.4)
In Equation (3.4), o_z is the depth of the origin of the world coordinate system and z̄ is
the average depth of the scene points measured relative to the world coordinate origin. The
“z” (depth) direction is k = i × j, where × is the vector cross product, i = (i_x, i_y, i_z), and
j = (j_x, j_y, j_z). The average depth relative to the world origin z̄ equals the average value of
k · x computed over all points x in the scene.
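Equation (3.4) can be sketched directly (NumPy; the function name, camera axes, and values are illustrative assumptions):

```python
import numpy as np

def project_wp(x, f, i, j, o_uv, o_z):
    """Weak perspective projection of Equation (3.4).

    All scene points share a single scale f / (o_z + zbar), where zbar is
    the average depth of the points along the camera z-axis k = i x j.
    """
    k = np.cross(i, j)      # camera z (depth) direction
    zbar = np.mean(x @ k)   # average depth relative to the world origin
    P = np.vstack([i, j])
    return (f / (o_z + zbar)) * x @ P.T + o_uv

# Hypothetical frontal camera; the two points have average depth zbar = 0.
i_ax = np.array([1.0, 0.0, 0.0])
j_ax = np.array([0.0, 1.0, 0.0])
pts = np.array([[0.0, 0.0, 1.0],
                [2.0, 0.0, -1.0]])
uv = project_wp(pts, f=100.0, i=i_ax, j=j_ax, o_uv=np.array([0.0, 0.0]), o_z=50.0)
# scale = 100 / 50 = 2, so pts[1] maps to (4, 0)
```

Unlike the scaled orthographic model, the scale here is tied to f and o_z, which is exactly the coupling across cameras that the calibration algorithms below exploit.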
The weak perspective model is an approximation to the full perspective model:
u = P_{persp}(x) = \begin{pmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} i_x & i_y & i_z & o_u \\ j_x & j_y & j_z & o_v \\ k_x & k_y & k_z & o_z \end{pmatrix} \begin{pmatrix} x \\ 1 \end{pmatrix} \qquad (3.5)
where the depth of the scene k · x is assumed to be roughly constant and equal to z̄. The calibration
parameters of the two perspective models in Equations (3.4) and (3.5) are interchangeable.
When evaluating the calibration results in Section 3.8 below we use the full perspective
model. In the calibrated fitting algorithms in Section 3.9 we use the weak perspective model
because it is reasonable to assume that the depth of the face is roughly constant, a common
assumption in many face modeling papers [33, 44].
3.4 Camera Calibration Goal
Suppose we have N cameras n = 1, . . . , N. The goal of our camera calibration algorithm
is to compute the 2 × 3 camera projection matrix (i, j), the focal length f, the projection
of the world coordinate system origin into the image (o_u, o_v), and the depth of the world
coordinate system origin o_z for each camera. If we superscript the camera parameters with
n, we need to compute P^n_wp = (i^n, j^n, f^n, o^n_u, o^n_v, o^n_z). There are 7 unknowns in P^n_wp (rather
than 10) because there are only 3 degrees of freedom in choosing the 2 × 3 camera projection
matrix (i, j) such that it is orthonormal.
3.5 Calibration using Two Time Instants
For ease of understanding, we first describe an algorithm that uses two sets of multi-view
images captured at two time instants. Deriving this algorithm also allows us to show that
two sets of images are needed and derive the requirements on the motion of the face between
the two time instants. In Section 3.6 we describe an algorithm that uses an arbitrary number
of multi-view image sets and in Section 3.7 another algorithm that poses calibration as a
single large optimization.
The uncalibrated multi-view fitting algorithm of Chapter 3 uses the scaled orthographic
camera matrices P^n_so in Equation (2.8) and optimizes over the N scale parameters σ^n. Using
Equation (3.4) instead of Equation (2.8) and optimizing over the focal lengths f^n and origin
depths o^n_z is ambiguous: multiple values of f^n and o^n_z yield the same value of σ^n = f^n / (o^n_z + z̄^n).
However, the values of f^n and o^n_z can be computed by applying (a slightly modified version
of) the uncalibrated multi-view fitting algorithm a second time with the face at a different
location. With the first set of images we compute i^n, j^n, o^n_u, o^n_v. Suppose that σ^n = σ^n_1 is
the scale at this location. Without loss of generality we also assume that the face model is
at the world coordinate origin at this first time instant. Finally, without loss of generality
we assume that the mean value of x computed across the face model (both mean shape s̄_0
and all shape vectors s̄_i) is zero. It follows that z̄ is zero and so:

\frac{f^n}{o^n_z} = \sigma^n_1. \qquad (3.6)
Suppose that at the second time instant the face has undergone a global 3D rotation R3 and
3D translation T. Both the rotation R and translation T have three degrees of freedom. We
then perform a modified multi-view fit, minimizing:
\sum_{n=1}^{N} \left( \left\| A_0(u) + \sum_{i=1}^{l} \lambda_i^n A_i(u) - I^n(W(u;p^n)) \right\|^2 + K \cdot \left\| s_0 + \sum_{i=1}^{m} p_i^n s_i - P_{so}^n\!\left( R \left( \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \right) + T \right) \right\|^2 \right) \qquad (3.7)
with respect to the N sets of 2D shape parameters p^n, the N sets of appearance parameters
λ^n_i, the one global set of 3D shape parameters p̄, the 3D rotation R, the 3D translation T,
and the N scale values σ^n = σ^n_2. In this optimization all of the camera parameters (i^n, j^n,
o^n_u, and o^n_v) except the scale σ in the scaled orthographic model P^n_so are held fixed to the
values computed at the first time instant. Since the object underwent a global translation
T, we have z̄^n = k^n · T, where k^n = i^n × j^n is the z-axis of camera n. It follows that:

\frac{f^n}{o^n_z + k^n \cdot T} = \sigma^n_2. \qquad (3.8)
Equations (3.6) and (3.8) are two sets of linear simultaneous equations in the 2N unknowns
(f^n and o^n_z). Assuming that k^n · T ≠ 0 (the global translation T is not perpendicular to
any of the camera z-axes), these two equations can be solved for f^n and o^n_z to complete the
camera calibration. Note also that to uniquely compute all three components of T using the
optimization in Equation (3.7), at least one pair of the cameras must be verged (the axes (i^n,
j^n) of the camera matrices P^n_so must not all span the same 2D subspace.)
³Note that in the case of calibrated camera(s) it is convenient to think of the relative motion between the object and the camera(s) as the motion of the object R, T. In the single camera case (Equation 2.9) and the multiple cameras, single time instant case with uncalibrated camera matrix P (Equation 3.1) it is convenient to think of the relative motion as camera motion.
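For a single camera, Equations (3.6) and (3.8) form a 2 × 2 linear system in f^n and o^n_z; a minimal sketch with synthetic values (the function name is ours):

```python
import numpy as np

def calibrate_two_instants(sigma1, sigma2, d):
    """Solve Equations (3.6) and (3.8) for one camera.

    sigma1, sigma2: scaled orthographic scales at the two time instants;
    d = k . T, the component of the face translation along the camera z-axis
    (must be nonzero). Returns (f, o_z).
    """
    # Rearranged: f - sigma1 * o_z = 0  and  f - sigma2 * o_z = sigma2 * d
    A = np.array([[1.0, -sigma1],
                  [1.0, -sigma2]])
    b = np.array([0.0, sigma2 * d])
    f, o_z = np.linalg.solve(A, b)
    return f, o_z

# Synthetic check: f = 1000, o_z = 500 gives sigma1 = 2; after moving the
# face 100 units along the camera axis, sigma2 = 1000 / 600.
f, o_z = calibrate_two_instants(2.0, 1000.0 / 600.0, 100.0)
# recovers f = 1000, o_z = 500
```

If d = 0 the matrix is singular, which is exactly the degeneracy condition stated in the text.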
3.6 Multiple Time Instant Calibration Algorithm
Rarely are two sets of multi-view images sufficient to obtain an accurate calibration. The
approach just described can easily be generalized to T time instants. The first time instant
is treated as above and used to compute i^n, j^n, o^n_u, o^n_v and to impose the constraint on f^n
and o^n_z in Equation (3.6). Equation (3.7) is then applied to the remaining T − 1 frames to
obtain additional constraints:

\frac{f^n}{o^n_z + k^n \cdot T_t} = \sigma^n_t \quad \text{for } t = 2, 3, \ldots, T \qquad (3.9)

where T_t is the translation estimated at the tth time instant and σ^n_t is the scale of the face
in the nth camera at the tth time instant. Equations (3.6) and (3.9) are then re-arranged to
obtain an over-constrained linear system which can then be solved to obtain f^n and o^n_z.
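Rearranged as f^n − σ^n_t o^n_z = σ^n_t (k^n · T_t), each time instant contributes one linear equation per camera, and the stack can be solved in the least-squares sense; a sketch with synthetic data (function name ours):

```python
import numpy as np

def calibrate_lstsq(sigmas, depths):
    """Least-squares solve of Equations (3.6) and (3.9) for one camera.

    sigmas[t]: the scale sigma^n_t at time t; depths[t] = k^n . T_t, the face
    translation along the camera axis (depths[0] = 0 at the first instant).
    Each instant gives one equation f - sigma_t * o_z = sigma_t * d_t.
    """
    sigmas, depths = np.asarray(sigmas), np.asarray(depths)
    A = np.stack([np.ones_like(sigmas), -sigmas], axis=1)
    b = sigmas * depths
    (f, o_z), *_ = np.linalg.lstsq(A, b, rcond=None)
    return f, o_z

# Synthetic data consistent with f = 1000, o_z = 500:
d = np.array([0.0, 50.0, 100.0, 200.0])
s = 1000.0 / (500.0 + d)
f, o_z = calibrate_lstsq(s, d)
# recovers f = 1000, o_z = 500
```

With noisy scales the extra time instants average out the estimation error, which is why the multiple instant algorithm is preferred over the two instant one.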
3.7 Calibration as a Single Optimization
The algorithms in Sections 3.5 and 3.6 have the disadvantage of being two stage algorithms:
first they solve for i^n, j^n, o^n_u, and o^n_v, and then for f^n and o^n_z. It is better to pose calibration
as the single large non-linear optimization of:
\sum_{n=1}^{N} \sum_{t=1}^{T} \left( \left\| A_0(u) + \sum_{i=1}^{l} \lambda_i^{n,t} A_i(u) - I^{n,t}(W(u;p^{n,t})) \right\|^2 + K \cdot \left\| s_0 + \sum_{i=1}^{m} p_i^{n,t} s_i - P_{wp}^n\!\left( R_t \left( \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j^t \bar{s}_j \right) + T_t \right) \right\|^2 \right) \qquad (3.10)
summed over all cameras n and time instants t, with respect to the 2D shape parameters
p^{n,t}, the appearance parameters λ^{n,t}_i, the 3D shape parameters p̄^t, the rotations R_t, the
translations T_t, and the calibration parameters i^n, j^n, f^n, o^n_u, o^n_v, and o^n_z. In Equation (3.10),
I^{n,t} represents the image captured by the nth camera at the tth time instant, and the average
depth z̄ = k^n · T_t in P^n_wp is given by Equation (3.4). Finally, we define the world coordinate
system by enforcing R_1 = I and T_1 = 0.
The expression in Equation (3.10) can be optimized by iterating two steps: (1) The
calibration parameters are optimized given the 2D shape and (rotated and translated) 3D shape;
i.e. the second term in Equation (3.10) is minimized given fixed 2D shape, 3D shape, R_t,
and T_t. This optimization decomposes into a separate 7 dimensional optimization for each
camera. (2) A calibrated multi-view fit (see Section 3.9) is performed on each frame in
the sequence; i.e. the entire expression in Equation (3.10) is minimized, but keeping the
calibration parameters in P^n_wp fixed and optimizing only over the 2D shape, 3D shape, R_t,
and T_t. The entire large optimization can be initialized using the multiple time instant
algorithm of Section 3.6.
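The alternation just described can be written as the following skeleton (pseudocode in Python style; `multiple_time_instant_calibration`, `calibrated_multiview_fit`, and `refit_calibration` are hypothetical placeholders for the steps named in the text, not real functions):

```
# Pseudocode skeleton of the single large optimization of Equation (3.10).
# All three helper functions are hypothetical placeholders.
def optimize_calibration(frames, model, n_iters=10):
    # Initialize with the multiple time instant algorithm of Section 3.6.
    calib = multiple_time_instant_calibration(frames, model)
    for _ in range(n_iters):
        # Step (2): calibrated multi-view fit of every frame, holding the
        # calibration fixed; recovers 2D shape, 3D shape, R_t, and T_t.
        fits = [calibrated_multiview_fit(f, model, calib) for f in frames]
        # Step (1): minimize the second term of Equation (3.10) over the 7
        # calibration parameters of each camera, given the fitted shapes.
        calib = refit_calibration(fits, model)
    return calib, fits
```

Each half-step can only decrease the objective, so the alternation converges to a local minimum of Equation (3.10).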
3.8 Empirical Evaluation of Calibration
We tested our calibration algorithms on a trinocular stereo rig. Two example images of the
1300 input images from each of the three cameras are shown in Figure 3.3. The complete
input sequence is included in the movie calib input.mov. We wish to compare our calibration algorithm with an algorithm that uses a calibration grid. In Sections 3.8.1 and 3.8.2
we present results for the epipolar geometry. We compute a fundamental matrix from the
camera parameters i^n, j^n, f^n, o^n_u, o^n_v, and o^n_z estimated by our algorithm and use the 8-point
algorithm [22] to estimate the fundamental matrix from the calibration grid data. In
Section 4.5.3 we present results for the camera focal length and relative orientation of the
cameras, while also comparing the 3D model building algorithms.
[Figure 3.3 rows: Time 1, Time 2. Columns: Camera 1, Camera 2, Camera 3.]
Figure 3.3: Example inputs to our calibration algorithms: A set of simultaneously captured image
sets of a face at a variety of different positions and expressions. See calib input.mov for the
complete input.
3.8.1 Qualitative Comparison of Epipolar Geometry
In Figure 3.4 we show a set of epipolar lines computed by the algorithms. In Figure 3.4(a)
we show an input image captured by camera 1, with a few feature points marked on it.
In Figure 3.4(b) we show the corresponding points in the other image and the epipolar
lines. The solid dark colored epipolar lines are computed using the 8-point algorithm on
the calibration grid data. The dashed black epipolar lines are computed using the two stage
multiple time instant algorithm of Section 3.6. The solid light colored epipolar lines are
computed using the single large optimization algorithm of Section 3.7. Figures 3.4(d) and
(c) are similar for feature points marked in camera 3. While all three sets of epipolar lines
are very similar, the epipolar lines for the single large optimization algorithm are overall
closer to those for the 8-point algorithm than those of the two stage algorithm.
(a) Camera 1 Image (b) Epipolar Lines in Camera 2
(c) Epipolar Lines in Camera 2 (d) Camera 3 Image
Figure 3.4: Qualitative comparison between our AAM-based calibration algorithms and the 8-
point algorithm [22]. (a) An input image captured by the first camera with several feature points
marked on it. (b) The corresponding points and epipolar lines of the other image. The solid dark
colored epipolar lines are computed using the 8-point algorithm, the dashed black epipolar lines
using the two stage multiple time instant algorithm, and the solid light colored epipolar lines are
computed using the optimization algorithm. (d) Shows the input image of the third camera, and
(c) the corresponding points and epipolar lines for the second camera.
3.8.2 Quantitative Comparison of Epipolar Geometry
In Figures 3.5 and 3.6 we present the results of a quantitative comparison of the fundamental
matrices by extracting a set of ground-truth feature point correspondences and computing
the RMS distance between each feature point and the corresponding epipolar line predicted
[Figure 3.5 panels: Cam 1–Cam 2, Cam 1–Cam 3, Cam 2–Cam 3. Y-axis: RMS epipolar residual (pixels). X-axis: image number. Legend: Calibration Grid + 8-Point Alg.; Face Data + Two Stage Alg.; Face Data + Single Optimization.]
Figure 3.5: Quantitative comparison between our AAM-based calibration algorithms and the
8-point algorithm [22] using a calibration grid. The evaluation is performed on 10 images of a
calibration grid (data similar to, but not used by the 8-point algorithm). The ground-truth is
extracted using a corner detector. We plot the RMS distance error between epipolar lines and the
corresponding feature points for each of the 10 images.
by the fundamental matrix. In Figure 3.5 we present results on 10 images of a calibration
grid, similar (but not identical) to that used by the calibration grid algorithm. The ground-
truth correspondences are extracted using a corner detector. In Figure 3.6 we present results
on 1400 images of a face at different scales. The ground-truth correspondences are extracted
by fitting a single-view AAM independently to each image (i.e. without using the multi-view
geometry.)
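The RMS point-to-epipolar-line residual used in this evaluation can be computed as follows (a sketch; the function name is ours, and the toy fundamental matrix corresponds to a pure horizontal camera translation, so corresponding points lie on the same image row):

```python
import numpy as np

def rms_epipolar_residual(F, pts1, pts2):
    """RMS distance from each point in image 2 to the epipolar line F x1 of
    its correspondence in image 1 (points given as (n, 2) pixel arrays)."""
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous coordinates
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    lines = x1 @ F.T                                 # epipolar lines l = F x1
    # point-to-line distance |l . x2| / sqrt(a^2 + b^2) for l = (a, b, c)
    d = np.abs(np.sum(lines * x2, axis=1)) / np.linalg.norm(lines[:, :2], axis=1)
    return np.sqrt(np.mean(d ** 2))

# Toy fundamental matrix F = [t]_x for translation t = (1, 0, 0): the
# epipolar line of (u, v) is the row v' = v.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
pts1 = np.array([[10.0, 20.0], [30.0, 40.0]])
pts2 = pts1 + np.array([5.0, 0.0])  # shifted along the rows
rms = rms_epipolar_residual(F, pts1, pts2)
# points lie exactly on their epipolar lines, so rms == 0
```

With real, noisy correspondences the residual is nonzero, and it is this quantity that is plotted in Figures 3.5 and 3.6.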
[Figure 3.6 panels: Cam 1–Cam 2, Cam 1–Cam 3, Cam 2–Cam 3. Y-axis: RMS epipolar residual (pixels). X-axis: image number. Legend: Calibration Grid + 8-Point Alg.; Face Data + Two Stage Alg.; Face Data + Single Optimization.]
Figure 3.6: Quantitative comparison between our AAM-based calibration algorithms and the 8-
point algorithm [22] using a calibration grid. The evaluation is performed on 1400 images of a
face. The ground-truth is extracted using a single-view AAM fitting algorithm. We plot the RMS
distance error between epipolar lines and the corresponding feature points for each of the 1400
images.
Although the optimization algorithm of Section 3.7 performs significantly better than
the two stage algorithm in Section 3.6, both AAM-based algorithms perform slightly worse
than the 8-point algorithm on the calibration grid data in Figure 3.5. The main reason is
probably that the ground-truth calibration grid data covers a similar volume to the data
used by the 8-point algorithm, but a much larger volume than the face data used by the
AAM-based algorithms. When compared on the face data in Figure 3.6 (which covers a
similar volume to that used by the AAM-based algorithm), the 8-point algorithm and the
optimization algorithm of Section 3.7 perform comparably well.
3.9 Calibrated Multi-View Fitting
Once we have calibrated the cameras and computed i^n, j^n, f^n, o^n_u, o^n_v, and o^n_z, we can then use
a weak perspective calibrated multi-view fitting algorithm to fit a given AAM to multiple
images. We optimize:
\sum_{n=1}^{N} \left( \left\| A_0(u) + \sum_{i=1}^{l} \lambda_i^n A_i(u) - I^n(W(u;p^n)) \right\|^2 + K \cdot \left\| s_0 + \sum_{i=1}^{m} p_i^n s_i - P_{wp}^n\!\left( R \left( \bar{s}_0 + \sum_{j=1}^{\bar{m}} \bar{p}_j \bar{s}_j \right) + T \right) \right\|^2 \right)
with respect to the N sets of 2D shape parameters p^n, the N sets of appearance parameters λ_i^n, the one global set of 3D shape parameters \bar{p}_j, the global rotation R, and the global translation T. In this optimization, P_{wp}^n is defined by Equation (3.4) where z = k^n · T. It is also possible to formulate a similar scaled orthographic calibrated algorithm in which P_{wp}^n is replaced with P_{so}^n defined in Equation (2.8) and the optimization is also performed over the additional N scales σ^n. Note that in these calibrated fitting algorithms, the calibration parameters i^n, j^n, f^n, o_u^n, o_v^n, and o_z^n are constant and not optimized. As shown below, this leads to a lower dimensional optimization and more robust fitting.
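To make the structure of this objective concrete, the following sketch evaluates the second (3D consistency) term for one camera in numpy. The (A, b) parameterization of the weak perspective projection and all argument names are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def shape_constraint_cost(s2d, P_wp, R, T, s3d_base, s3d_modes, p3d, K=1.0):
    """Second term of the calibrated objective for one camera n:
    K * || s2d - P_wp( R(s0_bar + sum_j p_j_bar s_j_bar) + T ) ||^2.
    s2d: 2xV observed 2D shape; P_wp: (A, b) with A a 2x3 linear part and b a
    2-vector offset (an assumed weak-perspective parameterization);
    s3d_base: 3xV base shape; s3d_modes: (m, 3, V); p3d: (m,) 3D parameters."""
    A, b = P_wp
    s3d = s3d_base + np.tensordot(p3d, s3d_modes, axes=1)  # deformed 3D shape
    proj = A @ (R @ s3d + T[:, None]) + b[:, None]         # project to 2D
    return K * np.sum((s2d - proj) ** 2)
```

The cost is zero exactly when the 2D shape in that view agrees with the projected global 3D shape, which is the consistency the calibrated algorithms enforce.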
3.10 Empirical Evaluation
3.10.1 Qualitative Results
An example of using our calibrated multi-view fitting algorithm to track by fitting a single
2D+3D AAM to three concurrently captured images of a face is shown in Figure 3.7. The
complete fitting sequence is included in the movie calib fitting.mov. The top row of the
figure shows the tracking result for one frame. The bottom row shows the tracking result
for a frame later in the sequence. In each case, all three input images are overlaid with the
[Figure 3.7 layout: rows Time 1, Time 2; columns Left Camera, Center Camera, Right Camera.]
Figure 3.7: An example of using our calibrated multi-view fitting algorithm to fit a single 2D+3D
AAM to three simultaneously captured images of a face. Each image is overlaid with the corre-
sponding 2D shape for that image in dark dots. The single 3D shape p for the current triple of
images is displayed in the top right of the center image. This 3D shape is also projected into each
image using the corresponding Pn, and displayed as a white mesh. The single head pose (extracted
from the rotation matrix R) is displayed in the top left of the center image as roll, pitch, and yaw.
This should be compared with the algorithm in Chapter 3 in which there is a separate head pose
for each camera. See the movie calib fitting.mov for the complete fitting sequence.
2D shape pn plotted in dark dots. The view-dependent camera projection of this 3D shape
is also plotted as a white mesh overlaid on the face. The single 3D shape p at the current
frame is displayed in the top-right of the center image. We also display the recovered roll,
pitch, and yaw of the face (extracted from the global rotation matrix R) in the top left of the
center image. The three cameras combine to compute a single head pose, unlike Figure 3.2
where the pose is view dependent.
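The roll, pitch, and yaw display can be reproduced by decomposing R with a standard Euler-angle convention. The thesis does not state which convention it uses, so the Z-Y-X choice below is an assumption:

```python
import numpy as np

def pose_angles(R):
    """Extract (roll, pitch, yaw) in degrees from a 3x3 rotation matrix
    using the common Z-Y-X (yaw-pitch-roll) Euler convention."""
    yaw = np.degrees(np.arctan2(R[1, 0], R[0, 0]))    # rotation about z
    pitch = np.degrees(np.arcsin(-R[2, 0]))           # rotation about y
    roll = np.degrees(np.arctan2(R[2, 1], R[2, 2]))   # rotation about x
    return roll, pitch, yaw
```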
3.10.2 Quantitative Results
In Figure 3.8 we show quantitative results to demonstrate the increased robustness and
convergence rate of our calibrated multi-view fitting algorithms. In experiments similar
to those in [28], we generated a large number of test cases by randomly perturbing from a
ground-truth obtained by tracking the face in the multi-view video sequences. The global 3D
shape parameters p, global rotation matrix R, and global translation T were all perturbed
and projected into each of the three views. This ensures the initial perturbation is a valid
starting point for all algorithms. We then run each algorithm from the same perturbed
starting point and determine whether they converged or not by computing the RMS error
between the mesh location of the fit and the ground-truth mesh coordinates. The algorithm
is considered to have converged if the total spatial error is less than 2.0 pixels. We repeat
the experiment 20 times for each set of 3 images and average over all 300 image triples in
the test sequences. This procedure is repeated for different values of perturbation energy.
The magnitude of the perturbation is chosen to vary on average from 0 to 4 times the 3D
shape standard deviation. The global rotation R and global translation T are perturbed by scalar multiples α and β of this value. The values of α and β were chosen so that the rotation and translation components introduce the same amount of perturbation energy as the shape component [28].
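The perturbation experiment can be sketched as a small Monte-Carlo harness. Here `fit` and `project_mesh` are hypothetical placeholders for the actual fitting algorithm and mesh projection, and the perturbation is drawn isotropically rather than with the α, β weighting described above:

```python
import numpy as np

def convergence_rate(fit, ground_truth_params, project_mesh, sigma,
                     n_trials=20, threshold=2.0, rng=None):
    """Perturb the ground-truth parameters, refit, and count a trial as
    converged if the RMS mesh location error is below `threshold` pixels."""
    rng = rng or np.random.default_rng(0)
    gt_mesh = project_mesh(ground_truth_params)
    converged = 0
    for _ in range(n_trials):
        start = ground_truth_params + sigma * rng.standard_normal(ground_truth_params.shape)
        fitted = fit(start)
        err = np.sqrt(np.mean((project_mesh(fitted) - gt_mesh) ** 2))
        if err < threshold:
            converged += 1
    return converged / n_trials
```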
In Figure 3.8(a) we plot a graph of the likelihood (frequency) of convergence against the
magnitude of the random perturbation for the 2D+3D single-view fitting algorithm [44] ap-
plied independently to each camera, the uncalibrated multi-view fitting algorithm described
in Chapter 3 and the two calibrated multi-view fitting algorithms: scaled orthographic and
weak perspective. The results clearly show that the calibrated multi-view algorithms are
more robust than the uncalibrated multi-view algorithm, which is more robust than the
2D+3D single-view algorithm. Overall, the weak perspective calibrated multi-view fitting
[Figure 3.8 plots: (a) percentage of trials converged vs. magnitude of perturbation, for the 2D+3D single view, uncalibrated multi-view, scaled orthographic, and weak perspective algorithms; (b) RMS point location error vs. iteration number for the same four algorithms.]

(a) Frequency of Convergence (b) Rate of Convergence
Figure 3.8: (a) The likelihood (frequency) of convergence plot against the magnitude of a ran-
dom perturbation to the ground-truth fitting results computed by tracking through a trinocular
sequence. The results show that the calibrated multi-view algorithms are more robust than the
uncalibrated multi-view algorithm discussed in Chapter 3, which itself is more robust than the
2D+3D single-view algorithm [44]. (b) The rate of convergence is estimated by plotting the aver-
age error after each iteration against the iteration number. The results show that the calibrated
multi-view algorithms converge faster than the uncalibrated algorithm, which converges faster than
the single-view 2D+3D algorithm.
algorithm performs the best. The main source of the increased robustness of the calibrated
multi-view fitting algorithms is imposing the constraint that the head pose is consistent
across all N cameras. We also compute how fast the algorithms converge by computing
the average RMS mesh location error after each iteration. Only trials that actually con-
verge are used in this computation. The results for two different magnitudes of perturbation
(0.8 and 2.0) to the ground-truth are included in Figure 3.8(b). The results indicate that
the calibrated multi-view algorithms converge faster than the uncalibrated algorithm, which
converges faster than the single-view 2D+3D algorithm.
Algorithm Time per frame Iterations per frame Time per iteration
2D+3D single-view 33.808 2.5209 13.401
uncalibrated multi-view 152.33 3.2915 46.247
scaled orthographic 152.94 3.2178 47.534
weak perspective 125.94 2.6131 48.158
Table 3.1: This table shows the timing results for our Matlab implementations of the four fitting
algorithms evaluated in Section 3.10.2 in milliseconds. The results were obtained on a dual 2.5 GHz
Power Mac G5 machine and were averaged over 600 image triples with VGA (640 x 480) resolution.
Each algorithm was allowed to iterate until convergence over each image triple. Note that the
results for the single-view algorithm are just the cost of processing one image from the image triple.
We include the movie compare.mov to demonstrate a few examples of the perturba-
tion experiments. The movie illustrates how the calibrated multi-view algorithms impose a
consistent head pose (cf. the uncalibrated algorithm) and a single 3D face shape (cf. the 2D+3D algorithm). As a result, the calibrated algorithms sometimes converge when the other algorithms diverge. The speed of convergence is also visibly faster.
In Table 3.1 we include timing results for our Matlab implementations of the four fitting
algorithms compared in this section. The results were obtained on a dual 2.5 GHz Power
Mac G5 machine and were averaged over 600 image triples with VGA (640 x 480) resolution.
Each algorithm was allowed to iterate until convergence over each image triple. Note that
the results for the single-view algorithm are just the cost of processing one image from the image triple. (The single-view algorithm can be implemented in real-time, at approximately 60Hz, in C [29].) The multi-view algorithms are therefore all approximately 3 times slower than the single-view algorithm, as should be expected. Also note that since the weak perspective algorithm is more constrained, it converges more quickly than the uncalibrated and scaled orthographic multi-view algorithms. The single-view algorithm requires slightly fewer iterations than all of the multi-view algorithms because it does not have to impose consistency on the 2D shapes in the different views.
Chapter 4
Multi-View 3D Model Construction
In the previous chapter we showed that the performance of AAM fitting can be improved
by using multiple views and calibration information. Similarly, a 3D AAM can be constructed
more reliably using multiple calibrated cameras. In this chapter, we outline a calibrated
multi-view motion-stereo algorithm for 3D AAM construction and compare its performance
with other existing single-view and multi-view non-rigid structure-from-motion algorithms.
4.1 Non-Rigid Structure-from-Motion
One way to build a deformable 3D face model is to use 3D range data. In [5], the 3D
mesh vertices s are first located in a set of “training” 3D range scans. Principal Component
Analysis is then used to extract the base (or mean) shape s0 and the m dominant shape
modes sj. More recently, however, the task of building deformable face models from a video
captured by a single camera using non-rigid structure-from-motion has received a great deal
of attention [9, 8].
Suppose that we have a sequence of images I^t of a face captured across time t = 1, . . . , T. Either the face, the camera, or both may be moving. Assume we can track K 2D feature points in the 2D images I^t. Denote the tracking results:
u^t = \begin{pmatrix} u_1^t & u_2^t & \cdots & u_K^t \\ v_1^t & v_2^t & \cdots & v_K^t \end{pmatrix}
Also denote the camera matrix of the camera at time t by P^t. Non-rigid structure-from-motion can then be posed as minimizing:
\sum_{t=1}^{T} \left\| P^t \left( s_0 + \sum_{j=1}^{m} p_j^t s_j \right) - u^t \right\|^2 \qquad (4.1)
with respect to the base shape s_0, the shape modes s_j, the shape parameters p_j^t, and the camera matrices P^t. If P^t is a perspective camera model, the above optimization is non-linear, but can be solved using an appropriate non-linear optimization algorithm [46]. If P^t is a linear camera model, such as the scaled orthographic model (P = P_{so}), the above optimization can be solved using a linear algorithm [9, 8, 45].
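For the linear case, the core computation is a rank-3(m+1) factorization of the stacked track matrix. The sketch below shows only this factorization step; recovering the true shape basis additionally requires the corrective constraints of the cited linear algorithms, which are omitted here:

```python
import numpy as np

def nr_sfm_factorize(u, m):
    """Rank-3(m+1) factorization underlying linear NR-SFM.
    u: (T, 2, K) tracked points. Returns a motion factor M (2T x 3(m+1))
    and a shape factor S (3(m+1) x K), defined up to a linear ambiguity."""
    T, _, K = u.shape
    W = u.reshape(2 * T, K)
    W = W - W.mean(axis=1, keepdims=True)      # remove per-frame translation
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = 3 * (m + 1)                            # expected rank of W
    M = U[:, :r] * s[:r]                       # motion x shape coefficients
    S = Vt[:r]                                 # shape basis (uncorrected)
    return M, S
```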
4.2 Multi-view Structure-from-Motion
The single-view non-rigid structure-from-motion (NR-SFM) paradigm can be extended to include information from multiple views/cameras to yield a multi-view non-rigid structure-from-motion (MV-SFM) algorithm [40].
Suppose we have a set of N > 1 cameras that simultaneously capture videos I^{n,t} for n = 1, . . . , N across time t = 1, . . . , T. Denote the unknown camera matrices by P^n for n = 1, . . . , N and the global 3D rotation and translation of the face across time by R^t and T^t. Assume that we can track K feature points across time in the videos I^{n,t}. Denote the tracking results as:
u^{n,t} = \begin{pmatrix} u_1^{n,t} & u_2^{n,t} & \cdots & u_K^{n,t} \\ v_1^{n,t} & v_2^{n,t} & \cdots & v_K^{n,t} \end{pmatrix} \qquad (4.2)
The problem then becomes one of minimizing:
\sum_{n=1}^{N} \sum_{t=1}^{T} \left\| P^n \left( R^t \left( s_0 + \sum_{j=1}^{m} p_j^t s_j \right) + T^t \right) - u^{n,t} \right\|^2 \qquad (4.3)
with respect to the base shape s_0, the shape modes s_j, the shape parameters p_j^t, the camera matrices P^n, and the global 3D rotation R^t and translation T^t of the face across time.
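A direct numpy evaluation of the cost in Equation (4.3), useful for checking candidate parameters. For simplicity the sketch assumes linear 2x3 camera matrices and folds any camera translation into the tracks:

```python
import numpy as np

def mv_sfm_residual(u, P, R, Tr, s0, modes, p):
    """Evaluate the multi-view cost for given parameters.
    u: (N, T, 2, K) tracks; P: (N, 2, 3) linear cameras; R: (T, 3, 3);
    Tr: (T, 3) translations; s0: (3, K); modes: (m, 3, K); p: (T, m)."""
    total = 0.0
    N, T = u.shape[0], u.shape[1]
    for t in range(T):
        s = s0 + np.tensordot(p[t], modes, axes=1)     # deformed 3D shape
        x = R[t] @ s + Tr[t][:, None]                   # rigid motion
        for n in range(N):
            total += np.sum((P[n] @ x - u[n, t]) ** 2)  # reprojection error
    return total
```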
4.3 Stereo
Both the single-view and multi-view structure-from-motion algorithms suffer from the Bas
Relief ambiguity [47, 38, 36, 23]. The Bas Relief ambiguity is an ambiguity between the
motion (translation or small rotation) of the cameras and the depths of the points in the
scene. In both the single-view and multi-view cases, the camera matrices must be solved
for as well as the structure of the scene. So, the ambiguity can manifest itself in the form
of scaled depths and motion between the cameras. If we have multiple calibrated cameras,
however, it is possible to derive better algorithms that do not suffer from the Bas-Relief
ambiguity. As we now describe, the simplest approach is to use stereo to fulfill the same role
as a range-scanner.
Suppose now that we have a calibrated stereo rig with N > 1 cameras in it. Denote the known (calibrated) camera matrices P^n for n = 1, . . . , N. Suppose that the nth camera captures the images I^{n,t} across time t = 1, . . . , T as the face (and possibly the stereo rig) moves. Assume that we can track K feature points across time in the videos I^{n,t}, and also compute correspondences between the cameras. Denote the tracked feature points as:
u^{n,t} = \begin{pmatrix} u_1^{n,t} & u_2^{n,t} & \cdots & u_K^{n,t} \\ v_1^{n,t} & v_2^{n,t} & \cdots & v_K^{n,t} \end{pmatrix} \qquad (4.4)
A stereo algorithm (similar to those in [10, 18]) to compute the deformable model is then as
follows:
42
1. Perform stereo at each time t by minimizing:

\sum_{n=1}^{N} \left\| P^n(s^t) - u^{n,t} \right\|^2

with respect to the 3D static shape s^t.
2. Align the 3D static shapes s^t with a transformation consisting of a 3D rigid transformation (6 degrees of freedom) and a single scale (1 degree of freedom); i.e. perform a 3D "Procrustes" alignment.

3. Compute s_0 and s_j using Principal Component Analysis.
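The three steps can be sketched for linear 2x3 cameras as follows. The 3D similarity Procrustes alignment of step 2 is omitted from this sketch (i.e. the head is assumed roughly static), so only the triangulation and PCA steps are shown:

```python
import numpy as np

def build_model_by_stereo(u, P, m):
    """Stereo model construction. u: (N, T, 2, K) tracks; P: (N, 2, 3)
    calibrated linear cameras; m: number of shape modes to keep."""
    N, T, _, K = u.shape
    A = P.reshape(2 * N, 3)                                # stacked projections
    shapes = np.empty((T, 3, K))
    for t in range(T):
        b = u[:, t].reshape(2 * N, K)
        shapes[t], *_ = np.linalg.lstsq(A, b, rcond=None)  # step 1: stereo
    X = shapes.reshape(T, 3 * K)
    s0 = X.mean(axis=0).reshape(3, K)                      # base shape
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    modes = Vt[:m].reshape(m, 3, K)                        # step 3: PCA modes
    return s0, modes
```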
4.4 Motion-Stereo
The above stereo algorithm can be improved upon by posing the problem as a single large
optimization, a generalization of the non-rigid structure-from-motion formulation in Equa-
tion (4.1). The input to the motion-stereo algorithm is the same as the stereo algorithm,
namely the camera matrices Pn and the tracked feature points un,t. Denote the global 3D
rotation and translation of the face across time by Rt and Tt. In the stereo algorithm above,
Rt and Tt are computed by the 3D similarity Procrustes algorithm. The model construction
problem can then be posed as minimizing:
\sum_{n=1}^{N} \sum_{t=1}^{T} \left\| P^n \left( R^t \left( s_0 + \sum_{j=1}^{m} p_j^t s_j \right) + T^t \right) - u^{n,t} \right\|^2 \qquad (4.5)
with respect to the base shape s_0, the shape modes s_j, the shape parameters p_j^t, the global rotations R^t, and the global translations T^t. The construction goal in Equation (4.5) can be minimized using the following motion-stereo algorithm:
1. Initialize using the stereo algorithm in Section 4.3:

(a) 3D similarity Procrustes → R^t, T^t.

(b) Principal Component Analysis → s_0, s_j, p_j^t.

2. Iterate the following two steps until convergence:

(a) Fix s_0, s_j; solve for R^t, T^t, p_j^t.

(b) Fix p_j^t, R^t, T^t; solve for s_0, s_j.

3. Project out any scale, rotation, or translation components left in the 3D shape modes s_j.
In Step 2a, the optimization can be broken down into separate optimizations for each time t; i.e. for each t minimize:

\sum_{n=1}^{N} \left\| P^n \left( R^t \left( s_0 + \sum_{j=1}^{m} p_j^t s_j \right) + T^t \right) - u^{n,t} \right\|^2

with respect to R^t, T^t, and p_j^t. In Step 2b, we break the optimization down into m + 1 sub-steps. We first solve for the mean shape s_0 and then for each shape mode s_j in turn.
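The alternation in step 2 has the following overall skeleton; `solve_pose` and `solve_shape` are placeholders for the two sub-problems described above, not implementations of them:

```python
import numpy as np

def motion_stereo(u, P, s0, modes, solve_pose, solve_shape, n_iters=10):
    """Alternating minimization skeleton for Equation (4.5).
    solve_pose: step 2a (per-frame R^t, T^t, p^t with the shape basis fixed).
    solve_shape: step 2b (re-estimate s0 and each mode in turn)."""
    T = u.shape[1]
    for _ in range(n_iters):
        poses = [solve_pose(u[:, t], P, s0, modes) for t in range(T)]  # 2a
        s0, modes = solve_shape(u, P, poses, s0, modes)                # 2b
    return s0, modes, poses
```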
4.5 Experimental Evaluation
4.5.1 Input
The input to our four face model construction algorithms consists of a set of 2D tracked facial
feature points u^{n,t} (see Equation (4.4)) in 312 images captured by n = 1, 2, 3 synchronized
cameras at t = 1, . . . , 104 time instants. We tracked 68 feature points independently in
each video sequence using a 2D Active Appearance Model (AAM) [13, 28]. Example results
for 9 images (3 cameras × 3 time instants) are shown in Figure 4.1. We also include the
movie 2d track.mov showing the complete tracked input sequence. Note that the head pose
[Figure 4.1 layout: rows Frame 1, Frame 2, Frame 3; columns Camera 1, Camera 2, Camera 3.]
Figure 4.1: Three example frames from each of three synchronized stereo cameras. In total, we
tracked the head independently through 104 frames in each camera using a 68 point 2D AAM
[13, 28]. The pose variation in the three sequences is the most that a single 2D AAM can cope
with before it fails. See the movie 2d track.mov for the complete tracked input sequence.
variation is substantial, but not too extreme. None of the videos contain any full profiles. The
input sequences were carefully chosen to maximize the head pose variation, while not causing
the 2D AAM to fail. In our experience, the head pose variation shown in Figure 4.1 is the
most that a single 2D AAM can cope with. While more sophisticated tracking algorithms,
which can cope with occlusions, severe foreshortening, and non-Lambertian reflectance have
been proposed, the pose variation in Figure 4.1 is about the most that can be tracked using
the basic algorithm.
4.5.2 Qualitative Multi-View Model Construction Comparison
The results of applying each of the four algorithms: 1) non-rigid structure-from-motion
(NR-SFM) [45], 2) multi-view non-rigid structure-from-motion (MV-SFM) [40], 3) stereo,
and 4) motion-stereo are summarized in Figure 4.2. Note that the input to the NR-SFM is
generated by stacking together the image sequences from each of the three cameras. All four
algorithms therefore use exactly the same set of input image data.
For each model, we display the mean shape (s0) and the first two shape modes (s1,
s2) from two viewpoints to help the reader visualize the 3D structure. The main thing
to note in Figure 4.2 is how “stretched” the NR-SFM and the MV-SFM models are. The
depth (z) values of all of the points in the mean shape appear to have been scaled by a
constant multiplier. The underlying cause of this stretching is the Bas-Relief ambiguity which
occurs when applying (non-rigid) structure-from-motion to data with little pose variation
[47, 38, 36, 23]. The problem manifests itself for both linear (NR-SFM) [9, 8, 45] and non-
linear (MV-SFM) [40] algorithms. The MV-SFM model is slightly better than the NR-SFM
model but the ambiguity persists as the problem is in the data. (Because the problem is
an ambiguity, it is possible that by chance the scale may be chosen more accurately. The
chance of accurate estimation of scale increases the more pose variation there is, and the
less noise there is [47, 38, 36, 23].) The motion-stereo and stereo models do not suffer from
this problem. In the next section we present a quantitative comparison using the calibration
algorithm derived in Section 3.3.
[Figure 4.2 layout: rows NR-SFM, MV-SFM, Stereo, Motion-Stereo; columns Mean Shape s_0, Shape Mode s_1, Shape Mode s_2.]
Figure 4.2: This figure shows the mean shape and first two shape modes of the single-view and
multi-view non-rigid structure-from-motion models, the stereo model and the motion-stereo model.
The main thing to note is that the non-rigid structure-from-motion models are “stretched” in the
depth direction.
4.5.3 Quantitative Comparison using Camera Calibration
In this section we quantitatively compare the performance of the four 3D face model con-
struction algorithms in terms of how well the resulting models can be used to perform camera
calibration using the algorithm in Section 3.7. One possible way of obtaining quantitative
results might be to capture range data as ground-truth. This approach, however, requires
(1) calibrating and (2) aligning the range data to the image data. Static range data also
cannot be used to evaluate the deformable 3D shape modes. Ideally, we would like a way of
evaluating the 3D fidelity of the face models using video data of a moving face.
[Figure 4.3 bar charts: relative yaw between each pair of cameras (top row) and focal length of each camera (bottom row), comparing the NR-SFM, MV-SFM, Stereo, and Motion-Stereo models against the ground truth (GT).]
Figure 4.3: A quantitative evaluation of the 3D fidelity of the models, obtained by using the models
to calibrate the cameras using the algorithm in Section 3.7. The results show the motion-stereo
algorithm to perform the best. The single-view non-rigid structure-from-motion model results in
estimates of the yaw and focal length that are both off by a large factor. The two error factors
are roughly the same. Using multi-view non-rigid structure-from-motion does help in reducing the
errors to a significant degree, but the results are still not as good as the motion-stereo model. GT
refers to the ground truth values computed using the Matlab camera calibration toolbox [7].
The algorithm in Section 3.7 is used to calibrate weak perspective camera matrices for a
set of stereo cameras using a 3D face model. By comparing the results of this algorithm with
ground-truth calibration data, we can indirectly measure the 3D fidelity of the face models.
The relative orientation component of the calibration primarily measures the pose estimation
accuracy of the algorithms, without any absolute head pose ground-truth. Estimating the
focal lengths and the epipolar geometry requires more than the relative orientation. Accurate
focal lengths and epipolar geometry require the accurate non-rigid 3D tracking of the face
in an extended sequence.
We implemented the multi-view single optimization calibration algorithm in Section 3.7
and compared the results with a calibration performed using a standard calibration grid and
the Matlab Camera Calibration Toolbox [7]. In Figure 4.3 we present results for the yaw
rotation (about the vertical axis) between each pair of the three cameras and for each of the
three focal lengths. The yaw between each pair of the three cameras was computed from
the relative rotation matrices of the three cameras. We include results for each of the four
models, and compare them to the ground-truth. The results in Figure 4.3 clearly show the
motion-stereo algorithm to perform the best. The results for the NR-SFM model are a long
way off. The yaw is underestimated by a large factor, and the focal length overestimated by
a similar factor. Based on the results in Figure 4.2, this is to be expected. The face model is
too deep, so a medium amount of parallax is generated by a too small yaw angle. Similarly,
a scaling of the model is interpreted as a too large motion in the depth direction and so too
large a focal length. The MV-SFM model also suffers from the same problem due to the
scaled nature of the model, albeit generating better results than the NR-SFM model. Overall, the motion-stereo algorithm clearly outperforms both these algorithms and gives estimates

Footnote 1: The results for the pitch and roll between each pair of cameras are omitted. The pitch and roll are very close to zero, and so there is little difference between any of the algorithms.

Footnote 2: Since the motion-stereo algorithm is the best among the four algorithms that we compared, we used the motion-stereo model for all the fitting and calibration experiments described in the previous sections.
Relative Yaw Focal Length
Cam 12 Cam 13 Cam 23 Cam 1 Cam 2 Cam 3
NR-SFM 62.1% 66.2% 68.9% 193.5% 201.7% 214.8%
MV-SFM 8.6% 18.8% 25.7% 30.9% 35.5% 41.2%
Stereo 30.2% 15.4% 5.5% 23.9% 18.2% 15.1%
Motion-Stereo 21.7% 7.8% 1.5% 8.7% 3.0% 1.1%
Table 4.1: This table summarizes the results presented in Figure 4.3. For each 3D model we
compute the percentage deviation of the relative “yaw” between each pair of cameras and focal
length of each camera from the ground-truth data (computed using the Matlab camera calibration
toolbox [7].) The motion-stereo model results in estimates of yaw and focal length that are both
comparable to the ground-truth values whereas the estimates from the non-rigid structure-from-
motion (NR-SFM) model are both off by a large factor. The multi-view non-rigid structure-from-
motion (MV-SFM) model performs better than the NR-SFM model but overall the motion-stereo
model performs the best.
of yaw and focal lengths that are comparable to ground-truth calibration data (computed
using the Matlab camera calibration toolbox [7].) To further emphasize this observation, we
compute the percentage deviation of the yaw and focal length estimates of each 3D model
from the ground-truth data. Although the bar graphs in Figure 4.3 may look similar, the
motion-stereo results for the focal length are several times better than the stereo or MV-SFM
results by the relative error measure in Table 4.1.
Chapter 5
Dense Face Model Construction
In this chapter we outline an algorithm to build dense Active Appearance Models (AAMs) [12,
14, 13, 17, 26]. Our algorithm builds a dense model by iteratively building a face model,
fitting the model to image data and then refining the model. In the following section we
detail the refinement process of the algorithm.
5.1 Model Densification
In this section we describe our algorithm to construct a dense AAM. There are two main
reasons why we work with AAMs rather than 3D Morphable Models (3DMMs) [5, 25, 33,
41, 8]: (1) it allows us to avoid the issue of 3D data and instead focus on the core model refinement algorithm, and (2) we already have an implementation of AAMs in our lab. With some
work, our algorithm could be extended to 3DMMs. However, no conceptual advancement is
required to do so; just a re-application of the same ideas to the 3D range scan and texture
map data.
The input to our algorithm can come from two different sources: (1) the vertices of a sparse AAM, or (2) the vertices output by a rigid tracker or a face
detector. Our algorithm then constructs a dense AAM by iterating three important steps: (1) Model Construction (Section 2.1), (2) Model Fitting (Section 2.2), and (3) Model Refinement (described in this section). A flow diagram of our algorithm is given in Figure 5.1. The model
refinement step is the key part of the algorithm. We refine the model in three different ways:
(1) we add more mesh vertices to the AAM, (2) we improve the mesh connectivity by re-
triangulating the mesh, and (3) we refine the shape modes of the AAM. We give a detailed
description of each of these steps in the following sections.
1. Adding mesh vertices: The first step in the iterative refinement process is to add
more mesh vertices. There are a number of ways to choose a mesh triangle and also the
location within the triangle to add the points. We adopt a simple but effective way to
ensure that we end up with similarly sized triangles. See Figure 5.2. At each iteration, we
look at the current mesh triangulation and choose the mesh triangle with the longest edge.
Once we choose the triangle, a new point is added at the mid-point of the longest edge.
By making sure that the longest edge keeps being reduced we avoid the formation of “long
thin” triangles. Figure 5.3 illustrates the addition of two points to the mesh. To maintain
symmetry we add a pair of points simultaneously to both halves of the face mesh at each
step.
One extension of this algorithm might be to explore other heuristics to choose where to
add the new points such as choosing the triangle with the largest average coding error, and
trying to place points on structural discontinuities. However, it should be noted that as the
mesh gets more and more dense, the choice of a specific heuristic becomes less important as
there are vertices close to any point on the face.
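A minimal sketch of one such refinement step on a 2D mean-shape mesh: find the globally longest triangle edge, insert its mid-point, and split every triangle sharing that edge. Propagating the new point into the training images via the warp is not shown:

```python
import numpy as np

def add_longest_edge_midpoint(verts, tris):
    """verts: (V, 2) vertex positions; tris: list of (i, j, k) index triples.
    Returns the extended vertex array and updated triangle list."""
    best, best_len = None, -1.0
    for tri in tris:
        for a, b in [(tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])]:
            length = np.linalg.norm(verts[a] - verts[b])
            if length > best_len:
                best, best_len = (a, b), length
    a, b = best
    verts = np.vstack([verts, 0.5 * (verts[a] + verts[b])])  # new mid-point
    new_idx = len(verts) - 1
    new_tris = []
    for i, j, k in tris:
        if {a, b} <= {i, j, k}:                 # triangle contains the edge
            c = ({i, j, k} - {a, b}).pop()      # opposite vertex
            new_tris += [(a, new_idx, c), (new_idx, b, c)]  # split in two
        else:
            new_tris.append((i, j, k))
    return verts, new_tris
```

Repeatedly shrinking the longest edge in this way is what keeps the triangles similarly sized and avoids "long thin" triangles.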
2. Image Consistent Re-Triangulation: Once we have the new points in place we
improve the mesh connectivity by doing an image consistent re-triangulation. This step is
inspired by the work in [30]. We look at each pair of adjacent triangles and flip the common
[Figure 5.1 flow diagram: Initial Training Images and Sparse Landmark Points → Model Build → Model Fit → Model Refinement (Shape Mode Refinement, Add Points, Image Consistent Triangulation) → iterate back to Model Build → Final Dense Models.]
Figure 5.1: An overview of our deformable dense model construction algorithm. The algorithm
is initialized using a set of sparse hand-labeled mesh points. The algorithm then iterates through
model building, model fitting and model refinement steps to produce the dense model. The refine-
ment step is further split into refining the shape modes, adding mesh vertices and image consistent
re-triangulation.
edge. We look at the RMS model reconstruction error:

\sqrt{\frac{1}{T} \sum_{t=1}^{T} \sum_{u \in s_0} \left[ A_0(u) + \sum_{i=1}^{l} \lambda_i^t A_i(u) - I^t(W(u; p^t)) \right]^2} \qquad (5.1)
across the training data to determine whether the flip was optimal or not. We repeat this step for each pair of adjacent triangles formed by the newly added points. In Figure 5.5 we show a dense mesh after adding a number of points and performing image consistent re-triangulation.
We present the pseudo code for our mesh vertex addition algorithm below:

(1) Initialize using a sparse triangulated mesh.
(2) Choose the triangle with the longest edge in the mean shape s_0.
(3) Add a new point at the mid-point of the longest edge.
(4) Propagate the new point to all the training images from the mean shape s_0 using the current estimate of the warp W(u; p).
Figure 4. A pair of images showing the mesh before and after adding a number of mesh points and performing image consistent re-triangulation. The final triangulation is optimal with respect to the training data.

Note that in order to make the mesh look better, we make sure that the symmetry of the mesh is maintained. The pseudo code for the image consistent re-triangulation step is presented below:
Initialize with the current mesh topology.
Repeat:
    Get the initial image RMS error.
    For each pair of adjacent triangles that includes the new point:
        Flip the common edge of the quadrilateral.
        Note the image RMS error.
    Get the edge flip corresponding to the minimum image RMS error.
    Flip the edge, checking for thin or flipped triangles.
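A greedy sketch of this loop, with `coding_error` standing in for the RMS model reconstruction error of Equation (5.1) evaluated over the training images; the check for thin or flipped triangles is omitted:

```python
def image_consistent_flips(tris, adjacent_pairs, coding_error, max_passes=5):
    """tris: list of (i, j, k) triangles; adjacent_pairs: indices into tris
    of triangle pairs sharing an edge; coding_error: callable scoring a
    candidate triangulation. Keeps a flip only if the error decreases."""
    def flip(t1, t2):
        shared = list(set(t1) & set(t2))                 # the common edge
        others = [v for v in t1 + t2 if v not in shared]  # opposite vertices
        return ((others[0], others[1], shared[0]),
                (others[0], others[1], shared[1]))        # swap the diagonal
    for _ in range(max_passes):
        improved = False
        for idx1, idx2 in adjacent_pairs:
            trial = list(tris)
            trial[idx1], trial[idx2] = flip(tris[idx1], tris[idx2])
            if coding_error(trial) < coding_error(tris):
                tris, improved = trial, True
        if not improved:
            break
    return tris
```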
3
Figure 2. The algorithm to add mesh vertices.
Figure 3. A pair of images showing the mesh before and afteradding two new mesh points to the longest edges. The newlyadded mesh points are highlighted. Note that adding the new ver-tices causes the triangles to split.
are a number of ways to choose a mesh triangle and alsothe location within the triangle to add the points. We adopta simple but efficient strategy to mesh densification. Ateach iteration, we look at the current mesh triangulationand choose the mesh triangle with the longest edge. Thisallows sub-division of triangles based on their sizes andtherefore ensures that no areas on the face are with far toofew mesh points. Once we choose the triangle a new pointis added on the mid-point of the longest edge. By makingsure that the longest edge keeps being reduced we avoid theformation of “long thin” triangles. Figure 3 illustrates theaddition of two points to the mesh. To maintain symmetrywe add a pair of points simultaneously to both halves of theface mesh at each step.
One extension of this algorithm might be to explore otherways to choose where to add the new points such as choos-ing the triangle with the largest average coding error, andtrying to place points on structural discontinuities. How-ever, it should be noted that as the mesh gets more and moredense, the choice of how the mesh is subdivided gets lessimportant because the final mesh is very dense and so thereare vertices close to any point on the face.
We present the pseudo code for our mesh vertex additionalgorithm in Figure 2.
(1) Initialize using a sparse triangulated mesh.(2) Choose triangle with longest edge in mean shape s0.(3) Add new point to the mid point of the longest edge.(4) Propagate new point to all training images from meanshape s0 using current estimate of the warp W(u;p).
2. Image Consistent Re-Triangulation: Once we havethe new points in place we improve the mesh connectivityby doing an image consistent re-triangulation. This stepis inspired by the work in [15]. We look at each pair ofadjacent triangles and flip the common edge. We look atthe image RMS error across the training data to determine
Initial Training Images
Sparse Landmark Points
Model Build
Model Fit
Free Fit
Add Points
Image Consistent
Triangulation
Final Dense Models
Model Refinement
Iterate
Iterate
Figure 4. A representation of our deformable dense model con-struction algorithm. The algorithm is initialized using a set ofsparse hand labeled mesh points. The algorithm then iteratesthrough model building, model fitting and model refinement stepsto produce the dense model. The refinement step is further splitinto refining the shape modes or free fit, adding mesh vertices andimage consistent re-triangulation.
Figure 5. A pair of images showing the mesh before and afteradding a number of mesh points and performing image consistentre-triangulation. The final triangulation is optimal with respecte tothe training data.
whether the flip was optimal or not. We repeat this stepfor each pair of adjacent triangles formed by the newlyadded points. In Figure 5 we show a dense mesh afteradding a number of points and performing image consistentre-triangulation. Note that in order to make the meshlook better we make sure that the symmetry of the meshis maintained. The pseudo code for the image consistentre-triangulation step is presented below:
Initialize with the current mesh topologyRepeat:
Get initial image RMS errorRepeat (For each pair of adjacent triangles that include
the new point)Flip the common edge of the quadrilateralNote the image RMS error
Get the edge flip corresponding to the minimum image
3
Figure 5.2: The algorithm to add mesh vertices.
Figure 5.3: A pair of images showing the mesh before and after adding two new mesh points to
the longest edges. The newly added mesh points and the edges are highlighted. Note that adding
the new vertices causes two adjacent triangles to split.
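As a concrete illustration of the longest-edge subdivision step described by Figures 5.2 and 5.3, the following sketch adds a midpoint vertex and splits the triangles sharing the longest edge. The list-based mesh representation and the function names are illustrative assumptions, not the thesis implementation (which additionally propagates the new point to all training images via the warp and mirrors it to preserve symmetry):

```python
import numpy as np

def longest_edge(vertices, triangles):
    """Return the (i, j) vertex-index pair of the longest edge in the mesh."""
    best, best_len = None, -1.0
    for tri in triangles:
        for a, b in ((tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])):
            d = np.linalg.norm(vertices[a] - vertices[b])
            if d > best_len:
                best, best_len = (a, b), d
    return best

def split_longest_edge(vertices, triangles):
    """Add a vertex at the midpoint of the longest edge and split every
    triangle containing that edge (two triangles for an interior edge)."""
    a, b = longest_edge(vertices, triangles)
    mid = 0.5 * (vertices[a] + vertices[b])
    vertices = np.vstack([vertices, mid])
    new_idx = len(vertices) - 1
    new_tris = []
    for tri in triangles:
        if a in tri and b in tri:
            c = [v for v in tri if v not in (a, b)][0]  # opposite vertex
            new_tris.append((a, c, new_idx))
            new_tris.append((b, c, new_idx))
        else:
            new_tris.append(tri)
    return vertices, new_tris
```

Splitting the shared longest edge of two adjacent triangles, as in Figure 5.3, turns two triangles into four.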
In Figure 5.5 we show the mesh before and after performing an image consistent re-triangulation. Note that to keep the mesh visually regular we ensure that its symmetry is maintained. The algorithm for image consistent re-triangulation is presented in Figure 5.4.
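The edge-flip test at the heart of image consistent re-triangulation can be sketched as below. Here `coding_error` is a hypothetical stand-in for the routine that warps the training images with a candidate triangulation and returns the image RMS error; a real implementation would also reject thin or flipped triangles:

```python
def flip_edge(tri1, tri2, shared):
    """Flip the shared edge (a, b) of two adjacent triangles to the
    opposite diagonal (c, d), returning the two new triangles."""
    a, b = shared
    c = [v for v in tri1 if v not in shared][0]
    d = [v for v in tri2 if v not in shared][0]
    return (a, c, d), (b, c, d)

def image_consistent_flip(tri1, tri2, shared, coding_error):
    """Keep the edge flip only if it lowers the image RMS coding error
    across the training data."""
    flipped = flip_edge(tri1, tri2, shared)
    if coding_error(list(flipped)) < coding_error([tri1, tri2]):
        return flipped
    return tri1, tri2
```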
3. Shape Mode Refinement: The third step of the refinement process is to refine the
shape modes. Since we are iteratively refining the model and building a new one, shape
mode refinement is equivalent to refining the locations of the mesh vertices in the training
data. The model fit step of our algorithm allows the mesh vertices to move around but the
movement is limited to the shape subspace of the face model. If we allow the mesh vertices to
Figure 5.4: The algorithm for image consistent re-triangulation.
Figure 5.5: A pair of images showing the mesh before and after performing an image consistent
re-triangulation of the mesh [30]. The new points as well as the edges that were flipped are
highlighted.
move outside the shape subspace of the model then we can potentially learn new deformations, and hence the mesh vertices can better explain the face data. For this we
perform a model fit step similar to the one described in Section 2.2 except that we replace
the shape modes with identity bases that span the entire 2D space. As indicated before, the
optimization equation is similar to Equation 2.3 except that the 2D shape s is now defined
using these basis vectors that allow all the points to move in both x and y directions:
[s1 . . . s2M] = I_{2M×2M}, the 2M × 2M identity matrix,
where M is the number of mesh vertices. Even though the shape mode refinement step
is initialized by the model fit at the previous density, it is still very high dimensional and
so prone to local minima. Hence we regularize this step with two priors. The first is a
smoothness constraint. The second is a constraint that the initial sparse vertices cannot
move too far from the input (hand-marked) locations.
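A minimal sketch of the identity bases used in the free fit, assuming a NumPy representation of the concatenated (x, y) shape vectors; the toy value of M is illustrative:

```python
import numpy as np

M = 5                       # number of mesh vertices (toy value)
S = np.eye(2 * M)           # columns s_1 ... s_2M span the entire 2D space

# With this basis the fitted shape is s = s0 + S @ p = s0 + p, so every
# vertex can move independently in both the x and y directions.
s0 = np.zeros(2 * M)
p = np.linspace(-1.0, 1.0, 2 * M)
s = s0 + S @ p
```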
The smoothness constraint prevents the newly added points from moving too far from their initial position with respect to the triangle that they were added in. Figure 5.6 illustrates the mesh vertices that go into the optimization. The minimization
goal is given by:
∑_{∀v4} ‖v1 + λ (v2 − v1) + µ (v3 − v1) − v4‖²    (5.2)
with respect to all newly added mesh vertices v4. The λ and µ coefficients are the barycentric
coordinates [6] with respect to the base triangle in the mean shape s0.
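The smoothness term of Equation 5.2 can be sketched as follows; `smoothness_residual` is a hypothetical helper that evaluates the barycentric prediction of a newly added vertex against its current position:

```python
import numpy as np

def smoothness_residual(v1, v2, v3, v4, lam, mu):
    """Squared distance between the new vertex v4 and its barycentric
    prediction v1 + lam*(v2 - v1) + mu*(v3 - v1) inside the base
    triangle (Equation 5.2). lam and mu are fixed in the mean shape s0."""
    pred = v1 + lam * (v2 - v1) + mu * (v3 - v1)
    return float(np.sum((pred - v4) ** 2))
```

The residual is zero exactly when the vertex sits at its barycentric prediction.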
The second constraint restricts the movement of the initial points, enforcing that they do not stray too far from their initial hand-specified locations. This constraint is represented for a single image as:
‖s0 + ∑_{i=1}^{m} pi si − s‖²    (5.3)
where the mean shape s0, 2D shape parameters p and the eigenvectors si are all defined over
the number of initial vertices and s is the initial hand-labeled mesh vertex locations [x y]
for a given image I.
The final optimization equation is a combination of the terms in Equations 2.3, 5.2
Figure 5.6: This figure shows the triangle vertices used to impose the smoothness constraints. The
new vertex is constrained by Equation 5.2 based on the location of the other three vertices.
and 5.3. The minimization goal is thus given by:

∑_{t=1}^{T} ∑_{u ∈ s0} [ I^t(W(u; p^t)) − (A0(u) + ∑_{i=1}^{l} λ_i^t Ai(u)) ]²

+ K1 · ∑_{t=1}^{T} ‖s0 + ∑_{i=1}^{m} p_i^t si − s^t‖²

+ K2 · ∑_{t=1}^{T} ∑_{∀v4^t} ‖v1^t + λ (v2^t − v1^t) + µ (v3^t − v1^t) − v4^t‖²    (5.4)
with respect to the 2D shape p and appearance λi parameters. The optimization is done
using the algorithm in [4] but with a different prior. The weights K1 and K2 are chosen by running the algorithm with different weight values and selecting the weighting that gives the best performance. We also need a suitable stopping criterion to keep the model from becoming unnecessarily dense. We choose a stopping criterion based on the image coding error: when the coding error stops decreasing we terminate the algorithm and output the dense model obtained at that iteration. Since our
model construction technique is an iterative offline algorithm, a typical iteration run of our
MATLAB implementation of the algorithm on a Mac PowerPC G5 2.4 GHz machine takes
approximately 30 minutes.
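The iterate-until-the-coding-error-stops-decreasing control loop can be sketched as below; `refine_step` and `coding_error` are placeholders for the model build/fit/refinement stages, not the actual MATLAB implementation:

```python
def densify(model, refine_step, coding_error, max_iters=50):
    """Iterate the refinement step until the image coding error stops
    decreasing, then return the model from the best iteration."""
    best_model, best_err = model, coding_error(model)
    for _ in range(max_iters):
        model = refine_step(model)
        err = coding_error(model)
        if err >= best_err:        # coding error stopped decreasing
            break
        best_model, best_err = model, err
    return best_model
```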
5.2 Experimental Results
In this section we present results from a number of experiments to show improved perfor-
mance of our densification algorithm. There are two ways to initialize our algorithm. One
way is to start with an existing sparse AAM and then increase the mesh density. In Sec-
tions 5.2.1, 5.2.2 and 5.2.3 we present results for this case. We could also automatically
construct a dense AAM using the output of a rigid tracker as initialization. In Section 5.2.4
we present tracking results using this approach.
5.2.1 Quantitative Evaluation
In this section we present quantitative comparisons to demonstrate the improved perfor-
mance of our algorithm in building dense models. We evaluate our model construction
algorithm using the implicitly computed dense correspondence, comparing it with the correspondences estimated by standard optical flow techniques [24, 27, 31]. We perform our comparison
with AAMs, but similar results could be obtained with 3DMMs.
5.2.1.1 Ground-Truth Data Collection
We collected high-resolution face data using Canon EOS SLR cameras capable of capturing
6 megapixel images. We obtained facial ground-truth data using a form of hidden markers
on the face. See [39] for a different way of embedding hidden ground-truth. These ground-
truth points have to be small so as not to interfere with the working of the algorithms. To this end we mark a number of very small black dots on the face. We then record the facial
deformations along with the marked ground-truth points using the high-resolution cameras.
Figure 5.7 shows one such high resolution image with ground-truth points marked on it along
with a zoomed in version highlighting the ground-truth point locations. The input data to
Figure 5.7: On the left is an example of the high resolution image obtained using the experimental
setup described in Section 5.2.1.1. The hand-marked ground-truth points on the face are highlighted
using dark circles. On the right are two examples of the down sampled images. Notice that the
ground-truth points are almost invisible in the down sampled images.
all algorithms consists of all the high resolution images (3072 x 2040) down sampled to one fourth their size (768 x 510). The ground-truth points are no longer visible in these low
resolution images and hence do not influence the working of the algorithms. Two example
down sampled images are also shown in Figure 5.7.
Note that we use only a single person’s ground-truthed data for the quantitative com-
parisons. The reason for this is that the notion of corresponding points is not well defined
across different people. We cannot determine which point on the face of person B a given point on the face of person A should correspond to. Also note that we cannot use range data
to help with this process since the important aspect of the ground-truth is the non-rigid
mapping from frame to frame. Knowing the perfect 3D depth from range data does not
provide us with this information.
5.2.1.2 Images used for Optical Flow Computation
Optical flow can be particularly hard when the motion is large. In our case, the head moves
around quite a bit in the input image. Our algorithm keeps track of where the original
head locations were and so implicitly avoids this large search space. We provide this same
information to the optical flow algorithms by warping all the input images into the coordinate
frame of the mean face for the initial sparse model. This means that the maximum flow for
all of the images is of the order of 3-4 pixels, well within the search ranges of most optical flow
algorithms. Another issue that can cause difficulty for optical flow algorithms is boundary
effects. We avoid this by also warping a boundary region around the face. We present
examples of the face mesh and the original and warped images in Figure 5.9. Observe that
the warped images are closer to each other, which makes the task easier for the optical flow algorithms.
5.2.1.3 2D Ground-Truth Points Prediction Results
We compare the performance of four different algorithms: 1) our densification algorithm, 2)
the optical flow algorithm by Horn and Schunck [24], 3) the optical flow algorithm by Lucas and Kanade [27], and 4) optical flow with diffused connectivity (Openvis3D) [31], based on their
ability to generate accurate feature point locations which are used to predict ground-truth
data point locations. We use the OpenCV implementations [1] for algorithms (2) and (3).
The evaluation methodology we adopt is based on ground-truth prediction. We use the
dense correspondence obtained from our algorithm and the optical flow algorithms to predict
the locations of the ground-truth points in all other images, given their position in one image.
We repeat this procedure for each image and finally average the predicted locations of the
ground-truth points. Once we have the predicted locations of the ground-truth points in
all images we compute the RMS spatial error between the predicted and the actual ground-truth point locations. To perform a fair comparison among different algorithms we do all
the above computations in the mean shape by warping all images and correspondence onto
the mean shape.
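The RMS prediction error computed in the mean-shape frame can be sketched as (a standard computation, with the predicted and actual point sets assumed to be (N, 2) arrays):

```python
import numpy as np

def rms_prediction_error(predicted, actual):
    """RMS spatial error between predicted and actual ground-truth point
    locations, both given as (N, 2) arrays in the mean-shape frame."""
    d2 = np.sum((predicted - actual) ** 2, axis=1)
    return float(np.sqrt(np.mean(d2)))
```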
We present the results of our algorithmic comparisons in Figure 5.8 for two different people. We plot the RMS ground-truth prediction error versus the number of mesh points (algorithm iterations). The number of ground-truth points used for evaluation is 21 in the first case and 13 in the second. The results indicate that the densification algorithm produces dense correspondences that lead to greater accuracy in predicting ground-truth
data. The optical flow algorithms clearly perform worse. This validates our claim that many
standard optical flow techniques are poor predictors of point locations for images taken under varying illumination, with significant object deformations and sparsely textured data.
5.2.1.4 3D Ground-Truth Points Prediction Results
In this section we perform comparisons similar to the ones in the previous section to evaluate
the 3D consistency of the correspondence computed by our algorithm with respect to the
ground-truth. In this case we evaluate our algorithm on trinocular stereo data. We repeat
the experimental setup described in Section 5.2.1.1 except that now we have a stereo rig with
calibrated cameras [7]. We use the initial sparse correspondence (the input to our algorithm)
and the dense correspondence from our algorithm and triangulate them to obtain 3D point
locations. We also triangulate the 2D ground-truth points to obtain 3D ground-truth points.
We compare the 3D fidelity of the sparse and the dense correspondences by computing the
distance of each 3D ground-truth point from the corresponding triangular plane formed by the sparse and dense mesh vertices. We find that the ground-truth points are closer (in
the depth direction) to the dense triangular mesh planes than the sparse ones, indicating
[Figure 5.8 plots; x-axis: Number of Mesh Vertices; y-axis: Ground Truth Point Location Error (RMS); curves: Hand-labeled landmarks, Optical Flow − Openvis3D, Optical Flow − Lucas and Kanade, Optical Flow − Horn and Schunck, Densification Algorithm Output; panels: Person 1, Person 2]
Figure 5.8: A comparison of the algorithms on their ability to generate landmarks that lead to better ground-truth point location prediction for two different people. On the x-axis we plot algorithm iterations (each iteration adds 10 mesh points) against the RMS ground-truth point prediction error on the y-axis. The densification algorithm clearly performs the best.
that our densification algorithm generated mesh vertices with higher 3D fidelity. We plot
the results of our quantitative comparison in Figure 5.10.
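The point-to-plane distance used in this comparison can be sketched as below (a standard geometric computation, not code from the thesis):

```python
import numpy as np

def point_to_plane_distance(p, a, b, c):
    """Distance from 3D point p to the plane of triangle (a, b, c),
    measured along the triangle's unit normal."""
    n = np.cross(b - a, c - a)
    n = n / np.linalg.norm(n)
    return float(abs(np.dot(p - a, n)))
```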
5.2.2 Fitting Robustness
In Figure 5.13 we show quantitative results to demonstrate the increased robustness of our
dense AAMs. In experiments similar to those in [28], we generated 1800 test cases (20
trials each for 90 images) by randomly perturbing the 2D shape model from a ground-truth
obtained by tracking the face in video sequences and allowing the algorithms to converge. The
2D shape and similarity parameters obtained from the dense AAM tracks were perturbed
and the perturbations were projected on to the ground-truth tracks of the sparse AAMs.
This ensures that the initial perturbation is a valid starting point for all algorithms. We
(a) (b) (c)
Figure 5.9: (a) An example of the mesh used to warp input images onto the mean shape for
computing optical flow. The face mesh is extended to eliminate boundary effects for optical flow
algorithms. (b) The original input images to our algorithm. Note that it is difficult for optical
flow algorithms to work on these images with varying head locations. (c) The two images from (b)
warped onto the mean shape using the mesh from (a). Note that by warping the images to the mean shape we make it easier for the optical flow algorithms.
then run each algorithm (one using the dense AAM and the other with the sparse AAM)
from the same perturbed starting point and determine their convergence by computing the
RMS error between the mesh location of the fit and the ground-truth mesh coordinates.
The algorithm is considered to have converged if the RMS spatial error is less than 2.0
pixels. The magnitude of the perturbation is chosen to vary on average from 0 to 4 times
the 2D shape standard deviation. The perturbation results were obtained on the trinocular
stereo data (Section 5.2.1.1) for each of the three camera views and the average frequency of
convergence is reported in Figure 5.13. The results show that the dense AAM converges to
ground truth more often than the sparse AAM. The increased robustness of the dense AAM
may be surprising given its apparent increased flexibility. But note that both the sparse and
dense AAMs have the same number of shape modes. The increased robustness of the dense
AAM is because it is a better (more compact) coding of the underlying phenomenon. Also
note that since both the sparse and the dense AAMs have the same number of parameters
[Figure 5.10 plot; x-axis: Image; y-axis: Distance in mm; curves: Sparse Correspondence, Dense Correspondence]
Figure 5.10: The distance of the triangulated 3D ground truth points from the 3D mesh plane
for each 3-frame. The values were computed for six 3-frames. The smallest triangle in which the
ground-truth point lies in 2D was computed. The distance was computed between the triangular
plane (formed by the 3D mesh vertices) and the corresponding 3D ground truth points. This was
repeated for 21 ground-truth points and the sum of the distances was computed. The average
distance across images for the sparse correspondence (69 mesh points) is 27.84 mm whereas for
the dense correspondence (168 mesh points) it is 13.625 mm.
that are optimized during the fit, the dense AAM fitting is as fast as the sparse AAM fitting.
The additional overheads, such as computing the affine warp for composition [28], hardly affect the speed of fitting. A typical dense AAM (168 points) fit iteration takes 0.25 secs
using a MATLAB implementation on a Mac PowerPC G5 2.4 GHz machine, while fitting to
an image of VGA (640 x 480) resolution.
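The convergence criterion used in these robustness experiments (RMS spatial error under 2.0 pixels) can be written down directly; a minimal sketch, with function names of our own choosing:

```python
import numpy as np

CONVERGENCE_THRESHOLD = 2.0  # pixels, as in the experiments above

def rms_error(fit_mesh, gt_mesh):
    """RMS distance between fitted mesh vertices and ground-truth
    vertices (both Nx2 arrays of 2D image coordinates)."""
    d = np.linalg.norm(fit_mesh - gt_mesh, axis=1)  # per-vertex error
    return np.sqrt(np.mean(d ** 2))

def has_converged(fit_mesh, gt_mesh):
    return rms_error(fit_mesh, gt_mesh) < CONVERGENCE_THRESHOLD

gt = np.zeros((68, 2))
print(has_converged(gt + [1.0, 0.0], gt))  # True  (RMS = 1.0 pixel)
print(has_converged(gt + [3.0, 4.0], gt))  # False (RMS = 5.0 pixels)
```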
Figure 5.11: Our algorithm can be applied to data of multiple people. Here we show a few frames
of a dense multi-person AAM being used to track three different people. See face track.mov for
the complete tracking sequences.
5.2.3 Face Tracking
In Section 5.2.1 we compared our algorithm on single-person data to allow a quantitative
comparison on ground-truth. Our algorithm can of course be applied to data of any number of
Figure 5.12: [Contd. from Figure 5.11] Here we show a few frames of a dense multi-person AAM
being used to track two other people. See face track.mov for the complete tracking sequences.
people. In this section we present a qualitative evaluation of the tracking ability of the dense
AAM constructed using our algorithm. We collect tracking data of five different subjects
using a video camera and use our algorithm to compute a dense multi-person AAM. We then
use the dense AAM thus computed to track test data with varied facial expressions across
multiple subjects. We find that the dense AAM reliably tracks the test data we presented;
in particular, the dense model generalizes well to unseen expressions. We include the movie
face track.mov to illustrate this. A few snapshots from the tracking movie are presented in
Figures 5.11 and 5.12.
[Plot for Figure 5.13: x-axis "Avg. Shape Sigma" (0-4), y-axis "Percentage of Trials Converged" (50-100); curves: Sparse model, Dense model.]
Figure 5.13: Model fitting robustness results comparing the sparse AAM and the dense AAM. We
see an increase in frequency of convergence using the dense AAM.
Figure 5.14: A few snapshots from the first tracking sequence. The first image shows the rigid
tracker initialization to our algorithm. We track using the automatically constructed dense AAM.
The dense AAM was built by running our algorithm initialized with a face tracker. Note that the
tracking is fairly accurate; in particular, the mesh region around the mouth deforms well with
the change in expression. The complete tracking sequence is included in the movie auto1.mov.
Figure 5.15: A few snapshots from the second tracking sequence. The first image shows the rigid
tracker initialization to our algorithm. We track using the automatically constructed dense AAM.
The complete tracking sequence is included in the movie auto2.mov.
5.2.4 Application to Rigid Tracker Output
In this section we illustrate how our densification algorithm can be used to perform un-
supervised dense AAM construction. Our algorithm differs from the previous automatic
construction algorithm [4] in two important ways: (1) it does not need a hand-specified
mesh, since the mesh topology is computed by the algorithm itself, and (2) it works
far better because of the progressive model refinement. The results obtained by the previous
authors [4] were fairly limited and mostly consisted of simple rigid motions, whereas we apply
our algorithm to face data with widely varying non-rigid deformations.
We use a rigid blob tracker, based on a skin-color model, to detect the faces in
video sequences, and use the output of this rigid tracker (an affine-warped planar grid) as
initialization for our algorithm. The dense deformable models generated by our algorithm
can then be used for various tasks such as AAM fitting and 3D model construction.
As a demonstration, we used the dense models to track the faces in two different video
sequences. The two video sequences were captured under different illumination conditions
68
and are presented for two different subjects. A few snaps shots from the tracking video
are shown in Figures 5.14 and 5.15. The complete tracking sequences are included in the
movies auto1.mov and auto2.mov. Observe that the tracking is reliable, with the face mesh
deforming according to the variation in expression and pose.
Chapter 6
Conclusion
6.1 Summary
In this thesis we have studied three important topics: (1) multi-view 3D AAM fitting,
(2) multi-view 3D AAM construction, and (3) automatic dense face model construction.
In Chapter 3 we have described an algorithm to fit a single 2D+3D AAM to N images
captured simultaneously by N uncalibrated cameras. In the process, our algorithm com-
putes: 2D shape parameters for each image, a single set of global 3D shape parameters, the
scaled orthographic camera matrix for each view, and appearance parameters for each image
(which may be different due to different camera response functions). Our algorithm enforces
the constraints that all of these quantities are physically consistent in the 3D scene. The
algorithm operates approximately N times slower than the real-time single image 2D+3D
AAM fitting algorithm [29, 44]. We have shown that our multi-view 2D+3D AAM algorithm
is both slightly more robust and converges more quickly than the single-view 2D+3D AAM
algorithm, which is itself more robust than the single-view 2D AAM algorithm [28].
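As a reminder of the imaging model the fitting algorithm calibrates, a scaled orthographic camera rotates a 3D point, keeps its first two coordinates, scales, and translates. A minimal illustration (the function name is ours):

```python
import numpy as np

def scaled_orthographic_project(X, R, s, t):
    """Project Nx3 points X with a scaled orthographic camera:
    rotation R (3x3; only the first two rows are used), scale s,
    and 2D translation t. Returns Nx2 image coordinates."""
    return s * (X @ R[:2].T) + t

R = np.eye(3)  # identity rotation for illustration
print(scaled_orthographic_project(np.array([[1.0, 2.0, 3.0]]),
                                  R, 2.0, np.array([1.0, 1.0])))  # [[3. 5.]]
```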
In Section 3.3 we have shown how the multi-view face model fitting algorithm can be
extended to calibrate a weak perspective (or full perspective) camera model. In essence, we
use the human face as a (non-rigid) calibration grid.
We demonstrated that the resulting calibration is of comparable accuracy to that obtained
using a calibration grid. We have also shown in Section 3.9 how the calibration algorithms
described in this thesis can be used to improve the performance of multi-view face model
fitting. The calibrated multi-view algorithms perform better than the uncalibrated multi-
view algorithm, which performs better than the 2D+3D single-view algorithm in terms of
frequency of convergence and rate of convergence towards ground-truth when perturbed from
the ground-truth data.
In Chapter 4 we proposed a calibrated multi-view 3D model construction algorithm that is
superior to existing single-view and multi-view algorithms. We have shown that constructing
a 3D face model using a single-view or multi-view non-rigid structure-from-motion algorithm
suffers from the Bas-Relief ambiguity that may result in a “scaled” (stretched/compressed)
model when applied to data containing pose variation typical of that which can be obtained
using a standard face tracker such as a 2D Active Appearance Model [13, 28]. We have
shown how using calibrated multi-view motion-stereo can eliminate this ambiguity and yield
face models with higher 3D fidelity. In Section 4.5.3 we quantitatively compared the fidelity
of the 3D models described in Chapter 4 using the calibration algorithm in Section 3.7 and
showed that the calibrated multi-view motion-stereo algorithm performs best for calibration
of camera relative orientations and focal lengths.
In Chapter 5 we have outlined an algorithm that can be used to construct dense de-
formable face models. Although we demonstrate our algorithm using AAMs, the concept
can equally be applied to 3DMMs. In Section 5.2 we performed experiments showing
that our algorithm results in dense AAMs that predict ground truth significantly better than
those obtained using off-the-shelf optical flow techniques. We showed that the dense AAMs
obtained using our algorithm perform better in many model-based tasks
such as fitting robustness and 2D tracking. We also showed that our algorithm can be used
to construct dense AAMs using the initialization from a rigid tracker.
6.2 Discussion
In this thesis we have shown how multi-view data can be used to improve both the fitting
and construction of face models. Multiple images always provide more information, but it
is not always obvious how best to take advantage of it. One of the interesting results is
that camera calibration considerably improves the performance of multi-view model fitting
and construction. In fact the results in Figures 3.8 and 4.3 show that the benefit of using
calibrated multi-view over uncalibrated multi-view is in most cases even larger
than the benefit of using uncalibrated multi-view over single-view. As model construction
is typically performed offline it is not a problem to use calibrated cameras. However, in the
case of model fitting, assuming calibration is not so easy: the cameras may have been moved,
they may pan and tilt, or it may not be possible to enter the scene. Automatic calibration is
therefore important in many applications, and it dramatically improves fitting performance.
In this thesis, we have also outlined an automatic dense model construction algorithm.
Note that although the primary goal of our algorithm is to compute a dense model of the
face, in the process it computes correspondence between the training images. Hence it can
be regarded as a batch optical flow algorithm that works by iteratively building a model of
the face (scene) and fitting the model to the images.
6.3 Future Work
In terms of multi-view 3D model construction, one limitation of our motion-stereo algorithm
is that it only computes the shape model for 68 points on the face. One area for future work
would be to extend our algorithm to compute dense 3D shape models. One possibility is to
use dense stereo to compute the 3D model, assuming calibrated cameras, followed by optical
flow methods [8, 25] or automatic construction methods [4] to find the relationship between
views.
In terms of multi-view fitting, one area of future work is batch fitting over time to a
video sequence. The main difference between a video sequence and a set of simultaneously
captured multi-view images is that the face cannot be assumed to have the same 3D shape
in all images. However, it is possible that the multi-view algorithms can be extended to
temporal sequences by imposing the constraint that the 3D shape does not change very fast;
i.e. impose soft constraints on the 3D shape over time instead of the hard constraint that it
is exactly the same in each of the views.
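One simple way to realize such a soft constraint is to add a quadratic penalty on the frame-to-frame change of the 3D shape parameters to the fitting objective; as the weight tends to infinity this recovers the hard constraint that the shape is identical in every frame. A sketch under these assumptions (the weight and the form of the penalty are ours, not a worked-out algorithm):

```python
import numpy as np

def temporal_shape_penalty(p_bar, lam=1.0):
    """Quadratic penalty on frame-to-frame changes of the 3D shape
    parameters p_bar (T x n array, one parameter vector per frame),
    weighted by lam. A constant shape over time incurs no penalty."""
    diffs = np.diff(p_bar, axis=0)   # p[t+1] - p[t] for each frame t
    return lam * np.sum(diffs ** 2)

# A shape that does not change over time costs nothing
print(temporal_shape_penalty(np.ones((5, 10))))  # 0.0
```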
In terms of dense deformable face model construction, one area for future work is to
extend this approach to more general scenes.
Bibliography
[1] Intel Open Source Computer Vision Library. http://opencvlibrary.sourceforge.net, 2005.
[2] J. Ahlberg. Using the active appearance algorithm for face and facial feature tracking. In
Proc. International Conference on Computer Vision Workshop on Recognition, Analysis
and Tracking of Faces and Gestures in Real-Time Systems, pages 68–72, 2001.
[3] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. Interna-
tional Journal of Computer Vision, 56(3):221–255, 2004.
[4] S. Baker, I. Matthews, and J. Schneider. Automatic construction of active appear-
ance models as an image coding problem. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 26(10):1380–1384, October 2004.
[5] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc.
SIGGRAPH, 1999.
[6] O. Bottema. On the area of a triangle in barycentric coordinates. Crux Mathematicorum,
8:228–231, 1982.
[7] J.-Y. Bouguet. Camera calibration toolbox for Matlab.
http://www.vision.caltech.edu/bouguetj/calib_doc, 2005.
[8] M. Brand. Morphable 3D models from video. In Proc. IEEE Computer Society Con-
ference on Computer Vision and Pattern Recognition, volume 2, pages 456–463, 2001.
[9] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3D shape from image
streams. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, pages 690–696, 2000.
[10] T. Cootes, E. Di Mauro, C. Taylor, and A. Lanitis. Flexible 3D models from uncalibrated
cameras. Image and Vision Computing, 14:581–587, 1996.
[11] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. In Proceedings of
the European Conference on Computer Vision, volume 2, pages 484–498, 1998.
[12] T. Cootes, G. Edwards, and C. Taylor. A comparative evaluation of active appearance
model algorithms. In Proc. of the British Machine Vision Conference, 1998.
[13] T. Cootes, G. Edwards, and C. Taylor. Active Appearance Models. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.
[14] T. Cootes and P. Kittipanya-ngam. Comparing variations on the active appearance
model algorithm. In Proc. of the British Machine Vision Conference, volume 2, pages
837–846, 2002.
[15] T. Cootes, G. Wheeler, K. Walker, and C. Taylor. Coupled-view Active Appearance
Models. In Proc. of the British Machine Vision Conference, volume 1, pages 52–61,
2000.
[16] F. Dornaika and J. Ahlberg. Fast and reliable active appearance model search for 3D
face tracking. IEEE Transactions on Systems, Man and Cybernetics, 34:1838–1853,
2004.
[17] G. J. Edwards. Learning to Identify Faces in Images and Video Sequences. PhD thesis,
University of Manchester, Division of Imaging Science and Biomedical Engineering,
1999.
[18] S. Gokturk, J. Bouget, and R. Grzeszczuk. A data driven model for monocular face
tracking. In Proc. of the IEEE International Conference on Computer Vision, 2001.
[19] R. Gross, I. Matthews, and S. Baker. Appearance-based face recognition and light-fields.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4):449–465, 2004.
[20] R. Gross, I. Matthews, and S. Baker. Active Appearance Models with occlusion. Image
and Vision Computing, 24(6):593–604, 2006.
[21] G. Hager and P. Belhumeur. Efficient region tracking with parametric models of geome-
try and illumination. IEEE Transactions on Pattern Analysis and Machine Intelligence,
20:1025–1039, 1998.
[22] R. Hartley. In defence of the 8-point algorithm. In Proc. International Conference on
Computer Vision, pages 1064–1070, 1995.
[23] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge
University Press, 2000.
[24] B. Horn and B. Schunck. Determining Optical Flow. Technical report, Massachusetts
Institute of Technology, Cambridge, MA, USA, April 1980.
[25] M. Jones and T. Poggio. Multidimensional morphable models: A framework for repre-
senting and matching object classes. In Proc. of the IEEE International Conference on
Computer Vision, pages 683–688, 1998.
[26] A. Lanitis, C. J. Taylor, and T. F. Cootes. Automatic interpretation and coding of face
images using flexible models. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 19(7):743 – 756, 1997.
[27] B. Lucas and T. Kanade. An iterative image registration technique with application to
stereo vision. In DARPA Image Understanding, pages 121–130, 1981.
[28] I. Matthews and S. Baker. Active Appearance Models revisited. International Journal
of Computer Vision, 60(2):135–164, 2004. Also appeared as Carnegie Mellon University
Robotics Institute Technical Report CMU-RI-TR-03-02.
[29] I. Matthews, J. Xiao, and S. Baker. On the Dimensionality of Deformable Face Models.
International Journal of Computer Vision, Under review 2006.
[30] D. D. Morris and T. Kanade. Image-consistent surface triangulation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 1:332 – 338, 2004.
[31] A. Ogale and Y. Aloimonos. Shape and the stereo correspondence problem. Interna-
tional Journal of Computer Vision, 65(1):147–162, 2005.
[32] F. H. Pighin, R. Szeliski, and D. Salesin. Resynthesizing facial animation through 3d
model-based tracking. In Proc. International Conference on Computer Vision, pages
143–150, 1999.
[33] S. Romdhani and T. Vetter. Efficient, robust and accurate fitting of a 3D morphable
model. In Proc. of the International Conference on Computer Vision, pages 59–66,
2003.
[34] S. Sclaroff and J. Isidoro. Active blobs. In Proc. of the IEEE International Conference
on Computer Vision, pages 1146–1153, 1998.
[35] S. Sclaroff and J. Isidoro. Active blobs: region-based, deformable appearance models.
Computer Vision and Image Understanding, 89(2/3):197–225, Feb. 2003.
[36] S. Soatto and R. Brockett. Optimal structure from motion: local ambiguities and
global estimates. In Proc. of the IEEE Conference on Computer Vision and Pattern
Recognition, 1998.
[37] J. Sung and D. Kim. Extension of aam with 3d shape model for facial shape tracking. In
Proc. IEEE International Conference on Image Processing, volume 5, pages 3363–3366,
2004.
[38] R. Szeliski and S.-B. Kang. Shape ambiguities in structure from motion. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 19(5), 1997.
[39] M. F. Tappen, E. H. Adelson, and W. T. Freeman. Estimating intrinsic component
images using non-linear regression. In Proc. IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, volume 2, pages 1992–1999, 2006.
[40] L. Torresani, D. Yang, G. Alexander, and C. Bregler. Tracking and modeling non-
rigid objects with rank constraints. In Proc. IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pages 493–500, 2001.
[41] T. Vetter and T. Poggio. Linear object classes and image synthesis from a single example
image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):733–
742, 1997.
[42] A. Waxman and J. Duncan. Binocular image flows: Steps toward stereo-motion fusion.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):715–729, 1986.
[43] Z. Wen and T. S. Huang. Capturing Subtle Facial Motions in 3D Face Tracking. In
Proc. International Conference on Computer Vision, page 1343, 2003.
[44] J. Xiao, S. Baker, I. Matthews, and T. Kanade. Real-time combined 2D+3D Active
Appearance Models. In Proc. IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, volume 2, pages 535–542, 2004.
[45] J. Xiao, J. Chai, and T. Kanade. A closed-form solution to non-rigid shape and motion
recovery. In Proc. European Conference on Computer Vision, pages 573–587, 2004.
[46] J. Xiao and T. Kanade. Uncalibrated perspective reconstruction of deformable struc-
tures. In Proc. of the IEEE International Conference on Computer Vision, 2005.
[47] Z. Zhang and O. Faugeras. 3D Dynamic Scene Analysis. Springer-Verlag, 1992.
[48] Z. Zhang and O. Faugeras. Estimation of displacements from two 3-D frames ob-
tained from stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence,
14(12):1141–1156, 1992.