Speech, Audio, Image and Video Technology Laboratory
School of Engineering Systems
LIPREADING ACROSS MULTIPLE VIEWS
Patrick Joseph Lucey
B.Eng(Hons)
SUBMITTED AS A REQUIREMENT OF
THE DEGREE OF
DOCTOR OF PHILOSOPHY
AT
QUEENSLAND UNIVERSITY OF TECHNOLOGY
BRISBANE, QUEENSLAND
6 SEPTEMBER 2007
Keywords
Audio-visual automatic speech recognition, lipreading, frontal pose, profile pose,
multi-view, visual front-end, visual feature extraction, pose-invariance, multi-
stream fusion
Abstract
Visual information from a speaker’s mouth region is known to improve automatic
speech recognition (ASR) robustness, especially in the presence of acoustic noise.
Currently, the vast majority of audio-visual ASR (AVASR) studies assume frontal
images of the speaker’s face, which is a rather restrictive human-computer
interaction (HCI) scenario. The lack of research into AVASR across multiple
views stems from the scarcity of large corpora that contain varying
pose/viewpoint speech data. Recently, research has concentrated on recognising
human behaviours within “meeting” or “lecture” type scenarios via
“smart-rooms”. This has resulted in the collection of audio-visual speech data
which allows visual speech to be recognised from both frontal and non-frontal views.
Using this data, the main focus of this thesis was to investigate and develop
various methods within the confines of a lipreading system which can recognise
visual speech across multiple views. This research constitutes the first
published work within the field which examines this particular aspect of AVASR.
The task of recognising visual speech from non-frontal views (i.e. profile) is in
principle very similar to that of frontal views, requiring the lipreading system to
initially locate and track the mouth region and subsequently extract visual fea-
tures. However, this task is far more complicated than the frontal case, because
the facial features required to locate and track the mouth lie in a much more lim-
ited spatial plane. Nevertheless, accurate mouth region tracking can be achieved
by employing techniques similar to frontal facial feature localisation. Once the
mouth region has been extracted, the same visual feature extraction process can
take place as in the frontal view. A novel contribution of this thesis is to
quantify the degradation in lipreading performance between the frontal and
profile views. In addition, novel patch-based analysis of the various views is
conducted, and as a result a novel multi-stream patch-based representation is formulated.
Having a lipreading system which can recognise visual speech from both
frontal and profile views is a novel contribution to the field of AVASR.
However, given both the frontal and profile viewpoints, this raises the
question: is there any benefit in having the additional viewpoint? Another
major contribution of this thesis is an exploration of a novel multi-view
lipreading system. This system shows that there does exist complementary
information in the additional viewpoint (possibly that of lip protrusion), with
superior performance achieved in the multi-view system compared to the frontal-only system.
Even though having a multi-view lipreading system which can recognise visual
speech from both front and profile views is very beneficial, it can hardly be
considered realistic, as each particular viewpoint is dedicated to a single
pose (i.e. front or profile). In an effort to make the lipreading system more
realistic, a unified system based on a single camera was developed which enables a lipreading
system to recognise visual speech from both frontal and profile poses. This is
called pose-invariant lipreading. Pose-invariant lipreading can be performed on
either stationary or continuous tasks. Methods which effectively normalise the
various poses into a single pose were investigated for the stationary scenario
and, in another contribution of this thesis, an algorithm based on regularised
linear regression was employed to project all the visual speech features into a
uniform pose. This particular method is shown to be beneficial when the
lipreading system is biased towards the dominant pose (i.e. frontal). The final contribution
of this thesis is the formulation of a continuous pose-invariant lipreading system
which contains a pose-estimator at the start of the visual front-end. This system
highlights the complexity of developing such a system, as introducing more
flexibility within the lipreading system invariably means the introduction of
more error.
All the work contained in this thesis presents novel and innovative
contributions to the field of AVASR, which will hopefully aid the future
deployment of an AVASR system in realistic scenarios.
Contents
Keywords i
Abstract iii
List of Tables ix
List of Figures xi
Acronyms & Abbreviations xix
Authorship xxi
Acknowledgements xxiii
1 Introduction 1
1.1 Motivation and Overview . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Scope of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Original Contributions of Thesis . . . . . . . . . . . . . . . . . . . 6
1.5 Publications Resulting from Research . . . . . . . . . . . . . . . . 8
1.5.1 Book Chapters . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.2 International Conference Publications . . . . . . . . . . . . 9
2 A Holistic View of AVASR 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 The History of AVASR . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Anatomy of the Human Speech Production System . . . . . . . . 15
2.4 Linguistics of Visual Speech . . . . . . . . . . . . . . . . . . . . . 17
2.5 Visual Speech Perception by Humans . . . . . . . . . . . . . . . . 18
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Classification of Visual Speech 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Classifiers for Lipreading . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Hidden Markov Models (HMMs) . . . . . . . . . . . . . . . . . . . 25
3.3.1 Viterbi Recognition . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 HMM Parameter Estimation . . . . . . . . . . . . . . . . . 28
3.4 Stream Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Feature Fusion Techniques . . . . . . . . . . . . . . . . . . 33
3.4.2 Decision Fusion Techniques . . . . . . . . . . . . . . . . . 34
3.5 HMM Parameters Used in Thesis . . . . . . . . . . . . . . . . . . 37
3.5.1 Measuring Lipreading Performance . . . . . . . . . . . . . 38
3.6 Current Audio-Visual Databases . . . . . . . . . . . . . . . . . . . 39
3.6.1 Review of Audio-Visual Databases . . . . . . . . . . . . . 39
3.6.2 IBM Smart-Room Database . . . . . . . . . . . . . . . . . 42
3.6.3 CUAVE Database . . . . . . . . . . . . . . . . . . . . . . . 45
3.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Visual Front-End 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Front-End Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Visual Front-End Challenges . . . . . . . . . . . . . . . . . . . . . 51
4.4 Brief Review of Visual Front-Ends . . . . . . . . . . . . . . . . . . 53
4.5 Viola-Jones algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.3 Cascading the Classifiers . . . . . . . . . . . . . . . . . . . 62
4.6 Visual Front-End for Frontal View . . . . . . . . . . . . . . . . . 64
4.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 Visual Feature Extraction 73
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Review of Visual Feature Extraction Techniques . . . . . . . . . . 74
5.2.1 Appearance Based Representations . . . . . . . . . . . . . 75
5.2.2 Contour Based Representations . . . . . . . . . . . . . . . 77
5.2.3 Combination of Features . . . . . . . . . . . . . . . . . . . 78
5.2.4 Appearance vs Contour vs Combination . . . . . . . . . . 79
5.3 Cascading Appearance-Based Features . . . . . . . . . . . . . . . 81
5.3.1 Static Feature Capture . . . . . . . . . . . . . . . . . . . . 82
5.3.2 Dynamic Feature Capture . . . . . . . . . . . . . . . . . . 91
5.4 Lipreading from Frontal Views . . . . . . . . . . . . . . . . . . . . 92
5.4.1 Static Feature Analysis . . . . . . . . . . . . . . . . . . . . 93
5.4.2 Dynamic Feature Analysis . . . . . . . . . . . . . . . . . . 96
5.5 Making use of ROI Symmetry . . . . . . . . . . . . . . . . . . . . 98
5.5.1 Experimental Results . . . . . . . . . . . . . . . . . . . . . 101
5.6 Patch-Based Analysis of Visual Speech . . . . . . . . . . . . . . . 104
5.6.1 Experimental Results . . . . . . . . . . . . . . . . . . . . . 105
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6 Frontal vs Profile Lipreading 111
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.2 Visual Front-End for Profile View . . . . . . . . . . . . . . . . . . 113
6.3 Profile vs Frontal Lipreading . . . . . . . . . . . . . . . . . . . . . 119
6.4 Patch-Based Analysis of Profile Visual Speech . . . . . . . . . . . 122
6.5 Multi-view Lipreading . . . . . . . . . . . . . . . . . . . . . . . . 126
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7 Pose-Invariant Lipreading 129
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.2 Pose-Invariant Techniques . . . . . . . . . . . . . . . . . . . . . . 131
7.2.1 Linear Regression for Pose-Invariant Lipreading . . . . . . 132
7.2.2 The Importance of the Regularisation Term (λ) . . . . . . 135
7.3 Stationary Pose-Invariant Experiments . . . . . . . . . . . . . . . 138
7.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 138
7.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . 139
7.3.3 Biased Towards Frontal Pose . . . . . . . . . . . . . . . . . 142
7.3.4 Inclusion of Additional Pose . . . . . . . . . . . . . . . . . 145
7.3.5 Limitations of Pose-Normalising Step . . . . . . . . . . . . 147
7.4 Continuous Pose-Invariant Lipreading . . . . . . . . . . . . . . . . 147
7.4.1 Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . 149
7.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 152
7.4.3 Pose Estimate Results . . . . . . . . . . . . . . . . . . . . 153
7.4.4 Multi-Pose Localisation Results . . . . . . . . . . . . . . . 154
7.4.5 Continuous Pose-Invariant Lipreading Results . . . . . . . 156
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8 Conclusions and Future Research 161
8.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . 161
8.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Bibliography 166
A Dynamic Parameter Analysis 191
List of Tables
2.1 The mapping of the 44 phonemes from the HTK set, to 13 visemes
used in the Johns Hopkins University summer workshop [127]. . . 18
4.1 Facial feature point detection accuracy results for frontal pose . . 68
5.1 Lipreading performance of the various regions of the ROI . . . . . 106
5.2 Lipreading performance of fusing the various side patches of the
ROI together. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3 Lipreading performance of the smaller 16× 16 pixel patches of the
ROI (overlapping by 50%) . . . . . . . . . . . . . . . . . . . . . . 108
5.4 Lipreading performance of each individual patch fused together
with the holistic representation of the ROI using the SMSHMM . 109
6.1 Facial feature localisation accuracy results on the validation set of
profile images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.2 Lipreading performance of the various regions of the profile ROI . 123
6.3 Lipreading performance of fusing the various side patches of the
profile ROI together. . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4 Lipreading performance of the smaller 16× 16 pixel patches of the
profile ROI (overlapping by 50%) . . . . . . . . . . . . . . . . . . 125
6.5 Lipreading performance of each individual patch fused to-
gether with the holistic representation of the profile ROI using
the SMSHMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.6 Multi-view lipreading performance compared against the single
view performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.1 Lipreading results in WER (%) showing the effect that an addi-
tional pose has on performance for Q = 20. As the left and right
profile WER were the same, profile refers to both poses. The
combined(80-10-10) test set refers to frontal (80%), right (10%)
and left (10%) profile poses. . . . . . . . . . . . . . . . . . . . . . 147
7.2 Pose estimate results on the CUAVE validation set, which consisted
of 39 images for each pose. . . . . . . . . . . . . . . . . . . . . . . 153
7.3 Facial feature localisation accuracy results for all poses on the
CUAVE validation set. . . . . . . . . . . . . . . . . . . . . . . . . 155
7.4 The upper part of the table shows the average lipreading perfor-
mance for each individual task, whilst the bottom part compares
the performance for the combined individual, combined all and
pose normalised tasks, across the 10 different train/test sets. . . . 157
List of Figures
1.1 Block diagram of an AVASR system, which is a combination of an
audio-only and visual-only speech recognition (lipreading) system.
For this thesis, the modules within the lipreading system will be
focussed on. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Schematic representation of the complete physiological mechanism
of speech production highlighting the externally visible area (taken
from Rabiner and Juang [147]). . . . . . . . . . . . . . . . . . . . 16
2.2 Examples showing that the phonemes /p/, /b/ and /m/ look
visemically similar. Each of these visemes is shown in images
(a), (b) and (c) respectively. . . . . . . . . . . . . . . . . . . . . 17
2.3 Examples showing that the visemes of the acoustically similar
phonemes /m/ and /n/ look different in the visual domain. The
viseme /m/ is shown in (a) and /n/ is shown in (b). . . . . . . . . 17
3.1 Block diagram of a lipreading system. . . . . . . . . . . . . . . . 24
3.2 Discrete states in a Markov model are represented by nodes and
the transition probability by links. . . . . . . . . . . . . . . . . . 25
3.3 The IBM smart room developed for the purpose of the CHIL
project. Notice the fixed and PTZ cameras, as well as the far-
field table-top and array microphones. . . . . . . . . . . . . . . . 43
3.4 Examples of image views captured by the IBM smart room cam-
eras. In contrast to the four corner cameras (two upper rows), the
two PTZ cameras (lower row) provide closer views of the lecturer,
albeit not necessarily frontal (see also Figure 3.3). . . . . . . . . . 44
3.5 Examples of synchronous frontal and profile video frames of four
subjects from the IBM smart-room database. . . . . . . . . . . . . 45
3.6 Examples of sequences from the CUAVE database, which consists
of 36 individual speakers and 20 group speakers. The top row
gives examples of the individual sequences, whilst the bottom row gives
examples of the group speaker sequences. . . . . . . . . . . . . . . 46
3.7 Examples of the CUAVE individual sequences. The top three rows
give examples of the speaker rotating from left profile to right pro-
file. The bottom three rows give examples of the speaker moving
whilst in the frontal pose. . . . . . . . . . . . . . . . . . . . . . . 47
4.1 Block diagram of a visual front-end for a lipreading system. It is
essentially a three-step process, face localisation being step 1 and
step 2 consisting of locating the mouth ROI. Step 3 is tracking the
ROI over the video sequence. . . . . . . . . . . . . . . . . . . . . 50
4.2 Depiction of the cascading front-end effect. . . . . . . . . . . . . . 51
4.3 Comparison of the feature sets used by: (a) Viola and Jones with
the original 4 haar-like features; and (b) Lienhart and Maydt with
their extended set of 14 haar-like features including their rotated
features. It is worth noting that the diagonal line feature in (a) is
not utilised in (b). . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Example of how the integral image can be used for computing
upright rectangular features. . . . . . . . . . . . . . . . . . . . . . 60
4.5 Example of how the rotated integral image can be used for com-
puting rotated features. . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6 Example of the first feature selected by AdaBoost. It has selected
the feature across the eye, nose and cheek areas, possibly due to
the contrast in colour. . . . . . . . . . . . . . . . . . . . . . . . . 63
4.7 Example of a face localiser based on a boosted cascade of 20 simple
classifiers. If the hit rate for each classifier is 0.9998 and the false-
alarm rate is set to 0.5 then the overall localiser should be able
to yield a hit rate of 0.9998^20 = 0.9960 and a false-alarm rate of
0.5^20 = 9.54 × 10^−7. . . . . . . . . . . . . . . . . . . . . . . . . .
4.8 Points used for facial feature localisation on the face: (a) right eye,
(b) left eye, (c) nose, (d) right mouth corner, (e) top mouth, (f)
left mouth corner, (g) bottom mouth, (h) mouth center, and (i) chin. 65
4.9 Example of the 16 × 16 frontal faces from the IBM smart-room
database used for this thesis. . . . . . . . . . . . . . . . . . . . . . 66
4.10 Example of the negative images used for training of the face classifier. 67
4.11 Example of the templates used for the training of the frontal facial
features. The ROI shown on the right is an example of the mouth
center template. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.12 Example of negative images used for the training of the frontal
facial feature classifiers. . . . . . . . . . . . . . . . . . . . . . . . . 69
4.13 Block diagram of the visual front-end for the frontal pose. . . . . 70
4.14 Mouth ROI extraction examples. The upper rows show examples
of the localised face, eyes, mouth region and mouth corners. The
lower row shows the corresponding normalised mouth ROI’s (32×32 pixels). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Appearance based features utilise the entire ROI given on the left.
Contour based features require further localisation to yield features
based on the physical shape of the mouth, such as mouth height
and width which is depicted on the right. . . . . . . . . . . . . . . 77
5.2 Block diagram depicting the cascading approach used by Potmi-
anos et al. [145] to extract appearance based features from the
mouth ROI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3 Block diagram showing the capturing of the static features of a
ROI frame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 Diagram showing the zig-zag scheme used in reading in the coeffi-
cients from an encoded two-dimensional DCT image. . . . . . . 84
5.5 Examples showing the reconstructed ROI’s using the top M coef-
ficients from the DCT: (a) original, (b) M = 10, (c) M = 30, (d)
M = 50 and (e) M = 100. . . . . . . . . . . . . . . . . . . . . . . 84
5.6 Plot showing the speaker information contained within the features
without normalisation, for the digits “zero”, “one” and “two”. . . 85
5.7 Block diagram showing the feature mean normalisation (FMN)
step of the cascading process, resulting in y_t^II . . . . . . . . . . 86
5.8 Plot showing that with FMN the unwanted speaker information
contained within the features is effectively removed, for the digits
“zero”, “one” and “two”. . . . . . . . . . . . . . . . . . . . . . . . 87
5.9 Block diagram showing the augmented static feature capture sys-
tem using the FMN in the image domain rather than the feature
domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.10 Block diagram showing the capturing of the dynamic features cen-
tered at each ROI frame. . . . . . . . . . . . . . . . . . . . . . . . 91
5.11 Plot showing the effect that FMN has on the lipreading performance. 94
5.12 Plot comparing the lipreading performance of both the image based
and feature based FMN methods. . . . . . . . . . . . . . . . . . . 95
5.13 Plot of the lipreading results showing the effect that LDA has
on improving speech classification on the final static features over
various values of N . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.14 Plots of the lipreading results for the dynamic and final features
on the MRDCT (a) and MRDiff (b) features using various values
for J and P using N = 30 input features. . . . . . . . . . . . . . 97
5.15 Examples showing the reconstructed ROI’s using the top M coef-
ficients for: (a) original, (b) M = 10, (c) M = 30, (d) M = 50
and (e) M = 100. The images on top refer to the reconstructed
ROI’s using MRDCT coefficients. The images on bottom refer to
the reconstructed ROI’s using the MRDCT with the odd frequency
components removed (MRDCT-OFR). . . . . . . . . . . . . . . . 100
5.16 Examples showing the reconstructed half ROI’s using the top M
coefficients from the MRDCT for each side: (a) original, (b) M =
10, (c) M = 30, (d) M = 50 and (e) M = 100. The top refers to
the reconstructed images of the right side of the ROI. The bottom
refers to the reconstructed images of the left side of the ROI. These
images are all of size 16× 32 pixels . . . . . . . . . . . . . . . . . 101
5.17 Results showing that removing the odd frequency components of
the MRDCT features helps improve lipreading performance. . . . 102
5.18 Plot of the results showing that LDA effectively nullifies the benefit
of the MRDCT-OFR in the previous step. . . . . . . . . . . . . . 103
5.19 Examples of the ROI broken up into: (a) top, bottom, left and
right side patches; and (b) 9 patches, where the top band refers
to patches 1, 2 and 3; the middle band to patches 4, 5 and 6;
and the bottom band to patches 7, 8 and 9. . . . . . . . . . . . . 105
6.1 Synchronous (a) frontal and (b) profile views of a subject recorded
in the IBM smart room (see Chapter 3). In the latter, visible
facial features are “compacted” within approximately half the area
compared to the frontal face case, thus increasing tracking difficulty. 112
6.2 Example of the points labeled on the face: (a) left eye, (b) nose,
(c) top mouth, (d) mouth center, (e) bottom mouth, (f) left mouth
corner, and (g) chin. The center of depicted bounding box around
the eye defines the actual feature location. . . . . . . . . . . . . . 114
6.3 Examples of the facial feature templates of the profile view used
to train up the respective facial feature classifiers. . . . . . . . . . 115
6.4 Examples of the profile face templates used to train up the profile
face classifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.5 Block diagram of the face and mouth localisation and tracking
system for profile views. . . . . . . . . . . . . . . . . . . . . . . . 117
6.6 (a) An example of face localisation. (b) Based on the face lo-
calisation result, a search area to locate the left eye and nose
is obtained. The face box is lengthened or shortened according to
metric1. (c) The left mouth corner is located within the generalised
mouth region. The ratio (metric2) is then used for normalising the
ROI. (d) An example of the scaled normalised located ROI of size
(48× 48) ·metric2 pixels. . . . . . . . . . . . . . . . . . . . . . . 118
6.7 Examples of accurate (a-d) and inaccurate (e,f) results of the lo-
calisation and tracking system. In (f), it can be seen that the
subject exhibits a somewhat more frontal pose compared to the
profile view of the other subjects. . . . . . . . . . . . . . . . . . . 119
6.8 Results comparing the front and profile lipreading performance at
various stages of the static feature capture. . . . . . . . . . . . . . 120
6.9 Comparison of the lipreading performance between the frontal (a)
and profile (b) dynamic and final features using various values for
J and P using M = 30 input features. . . . . . . . . . . . . . . . 121
6.10 Examples of the ROI broken up into: (a) top, bottom, left and
right side patches; and (b) 9 patches, where the top band refers
to patches 1, 2 and 3; the middle band to patches 4, 5 and 6;
and the bottom band to patches 7, 8 and 9. . . . . . . . . . . . . 123
6.11 Block diagram depicting the various lipreading systems that can
function when 2 cameras are synchronously capturing a speaker
from different views. The lipreading system can use only one view
(either frontal or profile in this case), or combine both views to form
a multi-view lipreading system (which is depicted by the dashed
lines and bold typeface). The multi-view features can either be
fused at an early stage using feature fusion or in the intermediate
level via a synchronous multi-stream HMM (SMSHMM). . . . . . 127
7.1 Given one camera, the lipreading system has to be able to lipread
from any pose. In this example, those poses are either frontal or
profile poses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.2 Schematic of the proposed pose-invariant lipreading scheme: Vi-
sual speech features xn extracted from an undesired pose (e.g. pro-
file) are transformed into visual features tn in the target pose space
(e.g. frontal) via a linear regression matrix W, calculated offline
based on synchronised multi-pose training data T and X of fea-
tures extracted from the different poses. . . . . . . . . . . . . . . 132
7.3 Given one camera, the lipreading system has to be able to lipread
from any pose. In this example, those poses are either frontal or
profile poses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.4 Plots showing the impact that normalising the pose has on lipread-
ing performance for the: (a) frontal and combined(50-50) systems;
and (b) profile and combined(50-50) systems. These systems are
tested across various numbers of features Q = 10 − 60. In the
legend, the first label refers to the test set and the label within the
bracket denotes the system’s name. . . . . . . . . . . . . . . . . . 139
7.5 Plot showing the impact that normalising the pose has on lipread-
ing performance for the frontal, profile and combined(50-50) sys-
tems. These systems are tested across various numbers of features
Q = 10−60. In the legend, the first label refers to the test set and
the label within the bracket denotes the system’s name. . . . . . . 141
7.6 Plot showing the impact that biasing the system to the frontal pose
has on the lipreading performance for the frontal and combined(80-
20) systems. These systems are tested across various numbers of
features Q = 10 − 60. In the legend, the first label refers to the
test set and the label within the bracket denotes the system’s name. 143
7.7 Plot showing the impact that normalising the pose has on lipread-
ing performance for the frontal, profile and combined(50-50) sys-
tems. These systems are tested across various numbers of features
Q = 10−60. In the legend, the first label refers to the test set and
the label within the bracket denotes the system’s name. . . . . . . 144
7.8 In these experiments, the lipreading system has to lipread from the
frontal, right and left profile poses, instead of just the frontal and
profile (right) poses. . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.9 Block diagram of the continuous pose-invariant lipreading system. 148
7.10 Block diagram of the pose estimator which incorporates the pose
estimation with the face localisation. . . . . . . . . . . . . . . . . 151
7.11 Example showing the function of the nearest neighbour variable in
the face localiser. . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.12 Examples of results from the pose estimator. The first two rows
give results for the frontal pose. The third and fourth rows give
the results for the right profile pose and the last two rows give the
results for the left profile pose. The last column gives examples of
false estimates and miss estimates. . . . . . . . . . . . . . . . . . 154
7.13 Examples of face and facial feature localisation from the multi-pose
visual front-end. The bottom row gives the associated examples of
the extracted 32× 32 ROI’s . . . . . . . . . . . . . . . . . . . . . 156
A.1 Plots of the lipreading results for the dynamic and final features
on the MRDCT (a) and MRDiff (b) features using various values
for J and P using N = 10 input features. . . . . . . . . . . . . . 191
A.2 Plots of the lipreading results for the dynamic and final features
on the MRDCT (a) and MRDiff (b) features using various values
for J and P using N = 20 input features. . . . . . . . . . . . . . 192
A.3 Plots of the lipreading results for the dynamic and final features
on the MRDCT (a) and MRDiff (b) features using various values
for J and P using N = 30 input features. . . . . . . . . . . . . . 192
A.4 Plots of the lipreading results for the dynamic and final features
on the MRDCT (a) and MRDiff (b) features using various values
for J and P using N = 40 input features. . . . . . . . . . . . . . 193
Acronyms & Abbreviations
AAM Active appearance model
ANN Artificial neural network
ASM Active shape model
ASR Automatic speech recognition
AVASR Audio-visual automatic speech recognition
CMS Cepstral mean subtraction
CUAVE Clemson University audio-visual experiments database
DBN Dynamic Bayesian network
DCT Discrete cosine transform
Diff Discrete cosine transform of difference images
DTW Dynamic time warping
DWT Discrete wavelet transform
EI Early integration
EM Expectation-maximisation
FA False alarm
FMN Feature mean normalisation
GMM Gaussian mixture model
HiLDA Hierarchical linear discriminant analysis
HMM Hidden Markov model
HTK Hidden Markov model toolkit
LDA Linear discriminant analysis
LI Late integration
MI Middle integration
MRDCT Mean removed discrete cosine transform
MRDiff Mean removed discrete cosine transform on difference images
PCA Principal component analysis
ROI Region of interest
SMSHMM Synchronous multi-stream hidden Markov model
SNR Signal-to-noise ratio
WER Word error rate
Authorship
The work contained in this thesis has not been previously submitted for a degree
or diploma at any other higher education institution. To the best of my knowledge
and belief, the thesis contains no material previously published or written by
another person except where due reference is made.
Signed:
Date:
Acknowledgements
It is not possible to thank everybody who has had an involvement with me during
the course of my PhD. However, there are some people who must be thanked.
Firstly and most importantly, I would like to thank my parents who have been,
and still are my biggest supporters. They have sacrificed so much to give me every
opportunity to succeed in life. Their unwavering belief in my ability, their never
ending support, as well as their guidance, comfort, compassion and perspective
have allowed me to achieve more than I ever thought I could. I am forever
indebted to them for everything they have done for me and they will never know
how much of a positive influence they have been on my life. I should be so lucky
to turn out to be half the people they are.
I would also like to thank my principal supervisor, Professor Sridha Sridharan
for his guidance and encouragement throughout my course of study. The research
environment he has created in the SAIVT laboratory, as well as the many oppor-
tunities to visit foreign institutions and international conferences, is testimony to
his commitment to excellence in research and development, and for that I am very
thankful. It should also be mentioned that part of this PhD was supported
by Australian Research Council Grant No. LP0562101.
During my PhD, I was fortunate to visit two overseas research institutions. I
would like to thank Dr. Gerasimos “Makis” Potamianos for giving me the chance
to work with him at IBM’s T.J. Watson Research Center in New York in 2006.
My time at IBM was one of the best experiences of my life and proved to be a
turning point in my research career. His constant feedback and flexibility
in allowing me to focus on various aspects of visual speech has been a major
reason why this thesis could be completed. I would also like to thank Professor
Tsuhan Chen for giving me the opportunity to come and study at the prestigious
Carnegie Mellon University in 2005. This was an invaluable experience, which
really opened my eyes to how research should be conducted.
The past and present members of the SAIVT laboratory must also be ac-
knowledged, for the great atmosphere they created as well as their expertise in
research which has made it a pleasure to be their colleague and friend. I would
particularly like to thank my colleague David Dean for his help throughout my
thesis as he was so often my co-pilot in trying to disambiguate the many prob-
lems associated with audio-visual speech processing. Special mention must also
go to Terry Martin, Brendan Baker, Chris McCool, Robbie Vogt, Jason Dowling,
David Dean, Frank Lin, Simon Denman, Jamie Cook, Michael Mason, Tristan
Kleinschmidt, Eddie Wong, Clinton Fookes, Ivan Drago and Ruwan Lakemond. A
lasting memory will be the endless days Jason and I spent playing, talking,
debating, living and reliving our various cricket dreams.
The person I would like to thank the most though, is my big brother Simon.
He has most certainly been the biggest help through my PhD and words can't
describe how thankful I am for having such a brilliant and helpful mentor. How-
ever, despite his brilliance it still baffles me to this day how he lacks the ability
to bowl a standard orthodox leg break. I would also like to thank my brother
Owen for the laughter and support you have given me over the years. You will
never know how proud I am of the man and father you have become. Your son
Caelin is the most amazing person I have encountered, which is a true reflection
of you. I would also like to thank my brother Jedrow who is simply a unique
and loyal being.
Finally, I would like to acknowledge my extended family and friends who have
put up with me over the years. Sorry for regurgitating so many Simpsons quotes,
I promise I will come up with some unique material one day.
Chapter 1
Introduction
1.1 Motivation and Overview
As computer technology is becoming more and more advanced, consumers are
seeking ways to interact with it to make their lives more comfortable. One of
the key technologies which allows human-to-computer interaction (HCI) to take
place is automatic speech recognition (ASR). ASR has the lofty goal of allowing
a user to interface with a computer by understanding the content of the user's
instructions and then carrying them out. Probably the best example of ASR in
action is the car KITT 1 from the 1980’s television series “Knight Rider”. In this
show, KITT is capable of conducting a natural conversation with the driver as
well as acting on any command given to it. Unfortunately, ASR systems as capable
as KITT are a long way off, as nearly all current systems rely solely on the audio channel
for input, which is often corrupted by a number of environmental factors, most
notably acoustic noise. As most “real-world” applications involve some type of
noise, these ASR systems are of limited use in these applications due to their poor
performance. Invariably, these audio-only systems fail to make use of the bimodal
nature of speech. As visual speech is immune to these acoustic environmental
factors, utilising this visual information in conjunction with the ASR system has
the potential to make systems like KITT a very real possibility in the future. This
area of research is called audio-visual automatic speech recognition (AVASR).
1KITT stands for Knight Industries Two Thousand and is the name of a fictional computer that controls the high-tech Knight 2000, a black Pontiac Firebird Trans Am T-top automobile in the science fiction television series Knight Rider [183].
AVASR is by no means a new research field. In actual fact, the first work in the
field was conducted over fifty years ago and continuous research in this field has
been ongoing for the past twenty years with notable progress being made. Over
this period of time, the need for the visual modality in ASR systems has been
established theoretically and most of the issues involved with AVASR have been
identified. Prototype systems have been built that have demonstrated improved
performance over audio-only systems under laboratory conditions. However,
practical AVASR systems that would be useful in a variety of “real-world”
applications have not yet emerged. As the main benefit of using the visual
modality in ASR systems is to counteract the problems associated with “real-
world” environments, it is quite interesting to see that the majority of research
conducted in AVASR neglected this fact.
The major reason behind the lack of progress in getting a “real-world” AVASR
system deployed, is that most research that has been conducted has neglected
addressing variabilities in the visual domain such as viewpoint, with nearly all of
the present work being conducted on video of a speaker’s fully frontal face. This
is mainly due to the lack of any large corpora that can accommodate poses other
than frontal. However, as more work is being concentrated within the confines of
a “meeting room” or “smart room” environment [52, 131], data is now becoming
available that allows visual speech recognition or lipreading from multiple views
to become a viable research avenue. This has provided the motivation for
the work in this thesis.
A system which can lipread from any viewpoint or pose would be of major
benefit to AVASR. By loosening the constraint on the speaker's
pose, it allows a more pervasive or “real-world” technology to develop, which
would be of major benefit to many applications. Other than the smart room
scenario, this type of technology would be of benefit for in-car AVASR, video
conferencing (via the internet or video phone) and transcribing speech data. How-
ever, allowing more flexibility in the system by including non-frontal visual speech
data introduces more complexity. All aspects of developing a lipreading system
which can cope with these added complexities are investigated in this thesis.
Figure 1.1: Block diagram of an AVASR system, which is a combination of an audio-only and visual-only speech recognition (lipreading) system. For this thesis, the modules within the lipreading system will be focussed on.
1.2 Scope of Thesis
An AVASR system is the combination of an audio-only speech recognition system
and a lipreading system, as depicted in Figure 1.1. A major reason stymieing the
full deployment of an AVASR system in “real-world” applications is the lack of
research being conducted in the field of AVASR that focuses on the unwanted
variabilities that lie within the visual domain, most notably head pose. In an at-
tempt to remedy this situation, the work in this thesis has solely concentrated on
researching and developing methods within the lipreading portion of an AVASR
system to allow visual speech to be recognised across multiple views. Within this
multi-faceted problem, the scope of this thesis was constrained to the following
objectives:
1. Recognise visual speech from profile views and compare it to its synchronous
counterpart in the frontal view,
2. Determine if there is any complementary information within the profile view-
point by combining both frontal and profile features together to form a
multi-view lipreading system, and
3. Develop a pose-invariant lipreading system which can recognise visual speech
regardless of the head pose from a single camera.
All the work contained in this thesis is designed to address each of these novel
and previously unsolved problems.
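The modular structure of Figure 1.1 can be illustrated with a short sketch. All function bodies below are toy stand-ins, not the actual MFCC extraction, Viola-Jones visual front-end or HMM decoders used in this thesis; they exist only to make the data flow between the modules concrete.

```python
import numpy as np

def acoustic_features(audio, frame_len=160):
    """'Acoustic Feature Extraction' stand-in: one crude energy value per frame."""
    frames = audio[: len(audio) // frame_len * frame_len]
    return frames.reshape(-1, frame_len).std(axis=1)

def visual_front_end(video):
    """Locate/track the mouth ROI; here reduced to a fixed crop per frame."""
    return video[:, 8:24, 8:24]

def visual_features(rois):
    """'Visual Feature Extraction' stand-in: mean intensity per ROI frame."""
    return rois.reshape(len(rois), -1).mean(axis=1)

def score(features, templates):
    """Stand-in for HMM decoding: negative distance to each word template."""
    return {w: -abs(features.mean() - t) for w, t in templates.items()}

def avasr(audio, video, audio_templates, visual_templates, lam=0.7):
    """Decision-level combination of the two single-modality recognisers."""
    sa = score(acoustic_features(audio), audio_templates)
    sv = score(visual_features(visual_front_end(video)), visual_templates)
    fused = {w: lam * sa[w] + (1 - lam) * sv[w] for w in sa}
    return max(fused, key=fused.get)

audio = np.zeros(1600)               # toy audio signal
video = np.full((10, 32, 32), 0.9)   # 10 frames with a bright mouth region
word = avasr(audio, video,
             audio_templates={"yes": 0.0, "no": 0.5},
             visual_templates={"yes": 0.9, "no": 0.1})
print(word)  # prints "yes"
```

The point of the sketch is structural: the lipreading branch (visual front-end, feature extraction, classification) is exactly the portion of the system this thesis concentrates on, and it plugs into the audio branch only at the fusion stage.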
1.3 Outline of Thesis
The remainder of this thesis is organised as follows:
Chapter 2 gives a high-level overview of the various topics of AVASR, detail-
ing its history as well as the physiological, linguistic and psychological as-
pects. The many questions pertaining to why the visual modality is useful
to recognising speech, as well as what visual representations are effective
for lipreading are addressed. This forms the motivation behind the
lipreading system presented in this thesis.
Chapter 3 provides an in-depth review of current classifier theory. In this chap-
ter the topic of classifying visual speech is broached, with the hidden Markov
model (HMM) being detailed as the classifier of choice. Various integration
strategies that can be employed for combining synchronous visual features
together using feature fusion methods or decision fusion methods are also
discussed. The chapter also gives a relatively thorough review of the
audio-visual databases which are currently available. Specifically, the IBM
smart-room and CUAVE databases, the two databases used in this thesis,
are described along with their respective protocols.
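The two integration families mentioned above can be sketched in a few lines. The dimensions, stream weight and function names below are arbitrary illustrations, not the configurations used in the thesis experiments.

```python
import numpy as np

def feature_fusion(feat_a, feat_b):
    """Feature (early) fusion: concatenate synchronous feature vectors
    frame by frame, then train a single classifier on the result."""
    return np.concatenate([feat_a, feat_b], axis=1)

def decision_fusion(loglik_a, loglik_b, lam=0.6):
    """Decision (late) fusion, as in a synchronous multi-stream HMM state:
    per-stream log-likelihoods combined with exponent weights lam and 1-lam."""
    return lam * loglik_a + (1.0 - lam) * loglik_b

fa = np.zeros((5, 30))                # 5 frames of 30-dim features, stream A
fb = np.zeros((5, 20))                # 5 frames of 20-dim features, stream B
print(feature_fusion(fa, fb).shape)   # (5, 50)
print(decision_fusion(-10.0, -20.0))  # -14.0
```

Feature fusion lets the classifier model dependencies between the streams but inflates the feature dimensionality, whereas decision fusion keeps separate models per stream and exposes an explicit reliability weight, which is why both families are considered when combining synchronous visual features.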
Chapter 4 gives a comprehensive evaluation of various visual front-ends, which
can automatically locate and track a speaker’s mouth ROI. This task is
shown to be difficult due to the many variations the visual front-end has
to deal with such as pose, illumination, appearance and occlusion. These
variations can affect the overall lipreading performance due to the front-
end effect. With these variations in mind, the visual front-end is developed
using the Viola-Jones algorithm and this system is presented for the frontal
pose scenario. This method is shown to be extremely rapid and accurate,
which is imperative for a real-time application such as lipreading.
Chapter 5 gives a detailed review of all visual feature extraction techniques
for lipreading. From this review it is shown that the appearance based fea-
tures are the representation of choice, and the cascade of appearance based
features is revealed as the current state-of-the-art technique. Novel analysis
of each stage of the cascade is then conducted on the frontal view data,
which shows the impact each stage of the cascade has on the lipreading
performance. This analysis includes an observation on the effect that both
the feature mean normalisation (FMN) step and the dimensionality of the
input feature vectors have on lipreading performance. A variant of the FMN
step is then introduced, showing that performing the normalisation step in
the image domain rather than the feature domain is slightly advantageous.
As the ROI for the frontal pose is symmetrical, an algorithm presented by
Potamianos and Scanlon [144] which exploits this symmetry is implemented.
It is shown that exploiting this symmetry can improve lipreading at an
early level within the cascading framework. Motivated by
this work, analysis of the various regions of the ROI is then conducted using
patches, which is the first analysis of its type. As a means of making use
of this prior knowledge, a novel patch-based multi-stream representation of
the ROI is introduced.
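The FMN comparison above can be made concrete with a stripped-down sketch. The ROI size, the orthonormal DCT implementation and the utterance-level mean are illustrative simplifications of the actual feature cascade; in this minimal linear setting the two variants coincide exactly, and the differences analysed in Chapter 5 emerge only within the full pipeline.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    C = np.cos(np.pi * (2 * np.arange(n)[None, :] + 1) * k / (2 * n))
    C[0] *= np.sqrt(1.0 / n)
    C[1:] *= np.sqrt(2.0 / n)
    return C

def dct2(img, C):
    """2-D DCT of one ROI frame via the separable basis."""
    return C @ img @ C.T

rng = np.random.default_rng(1)
rois = rng.random((20, 8, 8))   # 20 mouth-ROI frames of an utterance, 8x8 pixels
C = dct_matrix(8)

# Feature-domain FMN: take DCT features, then subtract the utterance mean.
feats = np.stack([dct2(r, C) for r in rois])
fmn_feat = feats - feats.mean(axis=0)

# Image-domain variant: subtract the mean ROI image, then take the DCT.
fmn_img = np.stack([dct2(r - rois.mean(axis=0), C) for r in rois])

# The DCT is linear, so here the two normalisations agree numerically.
print(np.allclose(fmn_feat, fmn_img))  # True
```

Either way, the normalisation removes the speaker-dependent static appearance of the mouth region, leaving the dynamic component that carries the visual speech information.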
Chapter 6 develops a lipreading system which is capable of extracting and
recognising visual speech information from profile views. These results are
compared to their synchronous counterparts in the frontal view. This consti-
tutes the first published work which quantifies the performance degradation
of lipreading in the profile view compared to the frontal view. In the exper-
iments, it is demonstrated that the profile view contains significant visual
speech information, although it is less pronounced than in the frontal view.
This profile information is not totally redundant to the frontal video, as the
multi-view lipreading system shows. The multi-view system presented is
unique to the field of AVASR as it is the first lipreading system published
which has more than one camera at its input. Patch-based analysis of the
profile ROIs is also conducted, and the pertinent regions of the ROIs are
fused together to gain a better representation of the profile speech.
Chapter 7 introduces the novel problem of pose-invariant lipreading. Two sce-
narios of the problem are visited, i.e. stationary and continuous. The first
part of the chapter deals with the stationary scenario. In the experiments it
is shown that when the features of one pose were tested on the other pose,
the train/test mismatch between the two is large and the lipreading per-
formance severely degrades as a consequence. To overcome this problem, a
pose-invariant or pose-normalising technique using linear regression is used
to project all the features of the unwanted pose into the wanted pose. This
technique is shown to reduce the train/test mismatch between the different
poses, and is shown to be of particular benefit when one pose is more preva-
lent than the other (i.e. frontal over right profile) due to over-generalisation.
In the latter part of the chapter, the more realistic continuous scenario is
investigated. In this novel contribution, the pose-estimator is developed in
conjunction with the face localiser. For these experiments, it is shown that
the addition of the pose-estimator impacts on the lipreading results due to
the front-end effect.
Chapter 8 summarises the work contained in this thesis, highlighting major
research findings. Avenues for future work and development are also dis-
cussed.
1.4 Original Contributions of Thesis
In this thesis a number of original contributions are made to the field of lipreading
and AVASR in general. These are summarised as:
(i) Generic single-stream and multi-stream combination strategies using HMMs
for the novel task of fusing multiple sets of synchronous visual features
together are proposed in Chapter 3.
(ii) Protocols for the IBM smart-room and CUAVE databases which contain
frontal as well as non-frontal views of a speaker’s face are presented in
Chapter 3.
(iii) A comprehensive evaluation of various visual front-ends, specifically for
lipreading, along with the formation of a complete visual front-end using
the Viola-Jones algorithm on the frontal view is undertaken in Chapter 4.
(iv) Results showing the effect of each stage of the cascade of appearance based
features, which is the current state-of-the-art visual feature extraction for
lipreading, are presented in Chapter 5. The performance is also compared
against the number of features used, which displays the problem of dimen-
sionality in lipreading using a HMM classifier.
(v) Analysis of the feature mean normalisation (FMN) step is undertaken in
Chapter 5, showing the effect a person’s appearance has on the lipreading
performance. In this analysis, a comparison of the FMN step in the image
domain to the feature domain is conducted, showing that the image-based
approach is slightly superior.
(vi) Determining the saliency of the various regions of the frontal ROIs to
lipreading is undertaken in Chapter 5 via patch-based analysis. In this
innovative analysis, it is shown that the middle patch, containing the most
visible articulators such as the lips, teeth and tongue, is the most salient.
(vii) A new lipreading approach, fusing the more salient patches of the mouth
together via single and multi-stream HMMs is proposed at the end of
Chapter 5.
(viii) A novel visual front-end which is able to locate and track a profile mouth
ROI using the Viola-Jones algorithm is presented in Chapter 6.
(ix) A comparison of the synchronous frontal and profile lipreading performances
is given in Chapter 6. This comparison is unique as it shows that reasonable
lipreading performance can be obtained from the profile view, however, it
is degraded when compared to its frontal counterpart.
(x) In Chapter 6, patch-based analysis of the profile ROIs is conducted and
the most informative patch is shown to be the middle patch containing
the center of the mouth and the protrusion of the lips. The more salient
patches are then combined to gain a better representation of the profile
visual speech.
(xi) A multi-view lipreading system is presented at the end of Chapter 6. This
novel approach to lipreading shows that by fusing the synchronous frontal
and profile visual features together, improved performance over the frontal
only scenario can be obtained.
(xii) A unified approach to lipreading is presented in Chapter 7, normalising all
poses to a single uniform pose. Given only one camera, this pose-invariant
lipreading system uses a transformation matrix based on linear regression
to project the features of the unwanted pose (profile) into the wanted pose
(frontal). These experiments were performed for the stationary scenario,
where the speaker was fixed in one pose (i.e. frontal or profile) for the
entire utterance and the pose of the speaker was assumed. This technique
is shown to be of benefit when the speaker is in one dominant pose such as
the frontal pose. When more non-dominant poses are included, the pose-
normalising step also proves to be of benefit.
(xiii) A continuous pose-invariant lipreading system, which allows the speaker
to move their head during the utterance is proposed in the latter part of
Chapter 7. In this system, a novel pose-estimator is developed in conjunc-
tion with the face localiser, which then cues the visual front-end for the
respective pose. As the pose-estimation step is at the front of the lipread-
ing system, it introduces extra error which affects the overall lipreading
performance.
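The pose-normalising transformation of contribution (xii) can be sketched on synthetic data. The feature dimensions, the noise level and the bias column below are illustrative choices, not those of the thesis experiments; only the core idea of a least-squares transformation matrix learned on synchronous frontal/profile pairs is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic synchronous training pairs: profile features and the
# corresponding frontal features (related by an unknown linear map).
X_profile = rng.standard_normal((500, 20))
A_true = rng.standard_normal((20, 30))
X_frontal = X_profile @ A_true + 0.01 * rng.standard_normal((500, 30))

# Least-squares estimate of the transformation matrix (with a bias term).
Xp = np.hstack([X_profile, np.ones((500, 1))])
W, *_ = np.linalg.lstsq(Xp, X_frontal, rcond=None)

def pose_normalise(profile_feats, W):
    """Project profile-view features into the frontal feature space."""
    ones = np.ones((len(profile_feats), 1))
    return np.hstack([profile_feats, ones]) @ W

projected = pose_normalise(X_profile, W)
err = np.abs(projected - X_frontal).mean()
print(round(err, 3))  # small residual on this synthetic data
```

At test time, profile features are projected through W before decoding, so a single set of frontal-pose models can be used regardless of the observed head pose; this is what reduces the train/test mismatch described above.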
1.5 Publications Resulting from Research
The following fully-refereed publications have been produced as a result of the
work in this thesis:
1.5.1 Book Chapters
(i) P.Lucey, G. Potamianos and S.Sridharan, “Visual Speech Recognition Across
Multiple Views”, to appear in Visual Speech Recognition: Lip Segmentation
and Mapping (A. Liew and S. Wang, eds.), IGI Global, 2007 [proposal ac-
cepted].
1.5.2 International Conference Publications
(i) P. Lucey, G. Potamianos and S. Sridharan, “A Unified Approach to Multi-
Pose Audio-Visual ASR”, to appear in Proceedings of Interspeech, (Antwerp,
Belgium), August 2007 [awarded best student paper ].
(ii) P.Lucey, G. Potamianos and S.Sridharan, “An Extended Pose-Invariant
Lipreading System”, to appear in Proceedings of the International Workshop
on Auditory-Visual Speech Processing (AVSP), (Hilvarenbeek, The Nether-
lands), August 2007 [abstract].
(iii) D. Dean, P. Lucey, S. Sridharan and T. Wark, “Fused HMM-Adaptation of
Multi-Stream HMMs for Audio-Visual Speech Recognition”, to appear in
Proceedings of Interspeech, (Antwerp, Belgium), August 2007.
(iv) D. Dean, P.Lucey, S.Sridharan and T. Wark, “Weighting and Normalisation
of Synchronous HMMs for Audio-Visual Speech Recognition”, to appear in
Proceedings of the International Workshop on Auditory-Visual Speech Pro-
cessing (AVSP), (Hilvarenbeek, The Netherlands), August 2007 [abstract].
(v) P. Lucey and G. Potamianos, “ Lipreading Using Profile Versus Frontal
Views”, in Proceedings of the International Workshop on Multimedia and
Signal Processing (MMSP), (Victoria, Canada), pp. 24-28, 2006.
(vi) P. Lucey and S. Sridharan,“Patch-based Representation of Visual Speech”,
in HCSNet Workshop on the Use of Vision in Human-Computer Interaction
(VisHCI 2006), (R. Goecke, A. Robles-Kelly, and T. Caelli, eds.), vol. 56
of CRPIT, (Canberra, Australia), pp. 79-85, ACS, 2006.
(vii) G. Potamianos and P. Lucey, “Audio-Visual ASR from Multiple Views in-
side Smart Rooms”, Proceedings of the International Conference on Multi-
sensor Fusion and Integration for Intelligent Systems (MFI), (Heidelberg,
Germany), pp. 35-40, 2006.
(viii) P. Lucey, S. Lucey and S. Sridharan,“Using a Free-Parts Representation for
Visual Speech Recognition”, in Proceedings of Digital Imaging Computing:
Techniques and Applications (DICTA), (Cairns, Australia), pp. 379-384,
2005.
(ix) P. Lucey, D. Dean and S. Sridharan,“Problems associated with current area-
based visual speech feature extraction techniques”, in Proceedings of Inter-
national Conference on Auditory-Visual Speech Processing (AVSP), (British
Columbia, Canada), pp. 73-78, 2005.
(x) S. Lucey and P. Lucey,“Improved speech reading through a free-parts rep-
resentation”, in Proceedings of the International Conference on Auditory-
Visual Speech Processing (AVSP), (British Columbia, Canada), pp. 85-86,
2005.
(xi) D. Dean, P. Lucey and S. Sridharan,“Audio-Visual Speaker Identification
using the CUAVE Database”, in Proceedings of the International Conference
on Auditory-Visual Speech Processing (AVSP), (British Columbia, Canada),
pp. 97-101, 2005.
(xii) D. Dean, P. Lucey, S. Sridharan and T. Wark,“Comparing audio and visual
information for speech processing”, in International Symposium of Signal
Processing and its Applications (ISSPA), (Sydney, Australia), pp. 58-61,
2005.
(xiii) P. Lucey, T. Martin and S. Sridharan,“Confusability of phonemes grouped
according to their viseme classes in noisy environments”, in Proceedings
of the International Conference on Speech, Science and Technology (SST),
(Sydney, Australia), pp. 265-270, 2004.
Chapter 2
A Holistic View of AVASR
2.1 Introduction
AVASR is a very broad and diverse research field. Areas such as linguistics,
psychology and physiology in addition to the machine learning/computer vision
area are all incorporated under the same AVASR umbrella. Being such a broad
area of work, it is imperative for researchers to have some grasp of the key elements
in each of these individual areas, so as to obtain the best representation of the
visual signal. This is necessary so that the performance of the final lipreading
system for this thesis is maximised.
This chapter is intended to give a holistic view of the field of AVASR. The first
part of this chapter traces the history of AVASR, initially giving a brief timeline
covering the last half century, focussing more on the key papers and research
that has led to the development of the current state-of-the-art AVASR system.
The review then concentrates on the recent advances that have been made over
the past five or so years in terms of the application of this technology. The
chapter then focuses on the linguistics and speech production aspects of audio-
visual speech. The final part of this chapter details the various psychological and
cognitive facets associated with the human perception of audio-visual speech.
Having some kind of insight into these “non-machine learning” areas will aid in
the understanding of the final structure of lipreading system proposed in this
thesis.
2.2 The History of AVASR
Understanding speech in noisy environments has been a topic of interest for engi-
neers since the 1890s [167]. This interest heightened in the war years, especially
during the 1940s and 1950s with the rapid growth in military and civil aviation.
An important application that was of interest to engineers working in this field
at the time was improving ways that air traffic controllers could communicate
with pilots. This interest led to the first known work
on audio-visual speech processing, which was published by Sumby and Pollack
in 1954 [166]. In this work, Sumby and Pollack examined the contribution of
visual factors to oral speech intelligibility as a function of the “speech-to-noise”
ratio and the size of the vocabulary. Their motivation came from two
observations: humans can tolerate higher noise levels in speech when lip
information is available than when it is not, and speech intelligibility
diminishes as the message or vocabulary size increases. The results from
this work found that seeing the face
of the talker was equivalent to an effective improvement in the speech to noise
ratio of up to 15 dB.
From the point of view of speech intelligibility, Sumby and Pollack showed
that adding the visual information to the audio signal improved it greatly. But it
wasn’t yet known how the visual modality contributed to the audio signal, until
the work on McGurk and MacDonald in 1976 [118]. In their paper, McGurk
and MacDonald were able to aptly demonstrate the bimodal nature of speech
via the McGurk effect. The McGurk effect essentially shows that when humans
are presented with conflicting acoustic and visual stimuli, the perceived sound
may not exist in either modality. It demonstrates the phenomenon whereby a per-
son sees repeated utterances of the syllable /ga/ with the sound /ba/ being
dubbed onto the lip movements. Often the person perceives neither /ga/
nor /ba/, but instead perceives the sound /da/. This work highlights that not only
does the visual signal improve speech intelligibility but it does so by providing
complementary information, which is the key motivation behind AVASR.
It must be said that over this period of time, it was commonly acknowledged
that the hearing impaired used visual speech to increase speech intelligibility but
these pieces of work were not significant in terms of helping the deaf directly.
It did however, give an indication of what role the visual modality has to play
in terms of providing complementary information to the acoustic channel. This
fact motivated the first actual implementation of an AVASR system developed by
Petajan in 1984 [132]. In this initial system, Petajan extracted simple black and
white images of a speaker’s mouth and took the mouth height, width, perime-
ter and area as his features. The next major progress in AVASR came a decade
later, when Bregler and Konig [13] published their work using eigenlips. Shortly
following this work Duchnowski et al. [45] extended this technique by employing
linear discriminant analysis for the visual feature extraction. In the mid to late
90's, most of the pioneering work in AVASR was coming from the Institut de
la Communication Parlée (ICP) in Grenoble, France [142]. At ICP, they inves-
tigated the problem of fusing the audio and visual modalities together and this
resulted in many benchmark papers by Adjoudani and Benoît [1] and Adjoudani
et al. [2].
Although considerable work on the topic of AVASR was published in the
1990's, little of it was significant in terms of getting an AVASR sys-
tem deployed in a "real-world" scenario. A major restriction stemmed
from the lack of a large audio-visual corpus which could be used to develop
AVASR systems for the task of speaker-independent, large vocabulary continu-
ous speech recognition. In a major effort to remedy this situation, IBM’s Human
Language Technologies Department at the T.J. Watson Research Center coor-
dinated a workshop at the Johns Hopkins University in Baltimore, USA, where
leading researchers from around the world converged in the summer of 2000 to col-
lect such a database and to further improve techniques associated with AVASR.
A full description of this workshop as well as the results are given in [127].
As AVASR is a technology which is driven by data, most of the recent progress
in the field has centered on the work conducted by IBM due to their ability to
capture high quality audio-visual data. Most of the recent notable research out-
comes have stemmed from the work spearheaded by Gerasimos Potamianos and
his colleagues. In addition to the large vocabulary experiments, Potamianos et
al. in 2003 conducted AVASR experiments in challenging environments, where
data was captured in office and in-car scenarios [141]. In this work, they found
that the word error rates in both modalities more than doubled; however,
the visual modality still remained beneficial in recognising speech [141].
In an effort to deploy a real-time AVASR system, the researchers at IBM
developed in 2003 a real-time prototype for small-vocabulary AVASR [32]. In
this work, they obtained real-time performance using a Pentium 4, 1.8GHz
processor. With the same goal in mind in 2004, IBM then produced an AVASR
system which used an infra-red headset [77]. As the extraction of visual speech
information from full-face videos is computationally expensive as well as being
difficult due to visual variabilities such as pose, lighting and background, the mo-
tive behind this work was to bypass these problems by using a special wearable
audio-visual headset which is constantly focussed on the speaker’s mouth [77].
The added benefit of using infra-red illumination was that it also provided ro-
bustness to severe lighting variations. In this work they found that this approach
gave comparable results to normal AVASR systems, which suggested this was a
viable approach.
In the last couple of years, due to the reduction in cost of capturing and
storing audio-visual data, more databases containing data resembling that
encountered in “real-world” noisy conditions are becoming publicly available
for researchers to use. This is in stark contrast to the case
five years ago, where all data captured was in ideal laboratory conditions. This is
essential so that “real-world” phenomena which can greatly affect the performance
of AVASR systems, such as the “Lombard effect” [86] (the phenomenon whereby
a speaker alters their speech to communicate more effectively in noisy environ-
ments), can be investigated. Such an investigation was carried out by Huang and Chen
[75]. Examples of recently collected “real-world” databases include the in-car
audio-visual data of the AVICAR database [93], the stereo data of the AVOZES
database [58], speaker movement in the CUAVE database [130], and the smart-
room data contained in the IBM smart-room database [138]. The availability of
the latter two databases have allowed for work to be completed that forms the
basis of this thesis (see Chapter 3.6 for full description of various audio-visual
corpora).
In addition to using the visual modality to improve speech recognition, the
video signal has been used for various other applications such as speaker recogni-
tion [24, 42, 182], visual text-to-speech [23, 30, 35], speech event detection [39],
video indexing and retrieval [76], speech enhancement [55, 58], signal separation
[53] and speaker localisation [16, 181]. Improvements in these areas will result in
more robust and natural speech recognition and human-computer interaction in
general [142].
To summarise, compared to the state of AVASR a decade ago, the field
can now be said to be a more mature and substantial field of
research. So much so, that there are now many review papers [24, 142] and books
[178] solely focussed on this topic. However, for the future success of AVASR to
be realised, large databases like the IBM Via Voice database, which is suitable
for large vocabulary continuous speech recognition, have to be collected for use
in scenarios where it is hoped to be employed such as in-car environments.
2.3 Anatomy of the Human Speech Production
System
A comprehensive understanding of the human speech production system is imper-
ative in creating a successful lipreading system, so that the final system extracts
all the pertinent visual speech information emanating from the visible articula-
tors. The components which make up the human speech production system are
depicted in Figure 2.1. The human speech signal starts when air is forced out of
the lungs into the vocal tract, which consists of the pharyngeal and mouth cavities.
As air leaves the lungs, it passes through the bronchi and trachea and then flows
past the vocal cords, which determine whether the sound produced is voiced or
unvoiced. Voiced sounds are produced when the vocal cords are tensed, causing them
to vibrate in the air flow. Unvoiced sounds occur when the vocal cords do not
vibrate, as is the case when whispering.
Figure 2.1: Schematic representation of the complete physiological mechanism of speech production highlighting the externally visible area (taken from Rabiner and Juang [147]).
After passing the vocal cords, the final sound is determined by the restrictions
placed in the vocal tract. The main components in the vocal tract responsible
for this are the velum, tongue, teeth, lips and jaw. Each of these can change very
quickly independently of each other, which allows for a large array of sounds to
be produced. From the overall speech production system depicted in Figure 2.1,
it is evident that the only visible articulators of this process are the lips,
teeth, jaw and a portion of the tongue, with the vocal cords, velum and full
tongue shape being unseen. As such, for the final lipreading system the area
around the mouth that contains these visible articulators should be extracted,
as was the case for this thesis (see Chapter 4 for details).
Figure 2.2: Examples showing that the phonemes /p/, /b/ and /m/ look visemically similar. Each of these visemes is shown in images (a), (b) and (c) respectively.
Figure 2.3: Examples showing that the visemes of the acoustically similar phonemes /m/ and /n/ look different in the visual domain. The viseme /m/ is shown in (a) and /n/ is shown in (b).
2.4 Linguistics of Visual Speech
The basic unit of acoustic speech is called the phoneme [146]. In the visual do-
main, the basic unit of visual speech is called the viseme [22]. Generally
speaking, there is a many-to-one mapping between phonemes and visemes, with
many phonemes being assigned to a single viseme. For example, the phonemes
/p/, /b/ and /m/ all look similar in the visual domain and as such are assigned
to the same viseme class, as can be seen in Figure 2.2. By the same token, there
are many phonemes that are acoustically ambiguous yet visually distinct. For
example, the phonemes /n/ and /m/ sound similar in the acoustic domain, but
their respective visemes look distinctly different, as shown in Figure 2.3.
This last example shows another benefit of utilising the visual channel.
At the audio-visual speech recognition summer workshop held at Johns
Hopkins University in 2000 [127], the 44 phonemes in the HTK phone set [189]
were mapped to 13 visemes. These phoneme-to-viseme mappings are given in
Table 2.1. However, this mapping between the two domains is unnecessary
according to Potamianos et al. [142], as having different classes for the audio and
Table 2.1: The mapping of the 44 phonemes from the HTK set to the 13 visemes used in the Johns Hopkins University summer workshop [127].
video components only complicates the fusion process, with unclear performance
gains. Because of this, most of the research conducted in the literature has used
just the acoustic phoneme classes for the visual modality. These different sub-word
classes did not affect the work in this thesis, however, as word models were used.
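The many-to-one phoneme-to-viseme relationship described above can be illustrated with a small sketch. The mapping below is a hypothetical three-class subset with invented class names, not the actual 44-to-13 workshop table of [127]:

```python
# A hypothetical, illustrative subset of a many-to-one phoneme-to-viseme
# mapping (NOT the exact 44-to-13 table of [127]; class names are invented).
PHONEME_TO_VISEME = {
    "p": "V1", "b": "V1", "m": "V1",   # bilabials look alike on the lips
    "f": "V2", "v": "V2",              # labiodentals
    "t": "V3", "d": "V3", "n": "V3",   # alveolars
}

def to_viseme_sequence(phonemes):
    """Collapse a phoneme sequence into its viseme classes."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

# /p/, /b/ and /m/ collapse to one class, while the acoustically
# confusable pair /m/ and /n/ stay visually distinct:
print(to_viseme_sequence(["p", "b", "m"]))  # ['V1', 'V1', 'V1']
print(to_viseme_sequence(["m", "n"]))       # ['V1', 'V3']
```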
2.5 Visual Speech Perception by Humans
Summerfield [167] cites the following three key reasons why lipreading benefits
human speech perception:
1. It helps speaker localisation
2. It provides complementary information about the place of articulation, such
as the tongue, teeth and lips
3. It contains segmental information that supplements the audio.
The first and second points are of particular benefit to those with poor hearing,
who would normally use lipreading as their primary source of speech information.
Some people are so adept at lipreading that they
can almost achieve perfect speech perception [168]. However, as the above three
points note and the McGurk effect shows [118], even people with normal hearing
use lipreading in conjunction with the audio signal to improve speech intelligi-
bility. This phenomenon is often heightened in the presence of acoustic noise as
first noted by Sumby and Pollack [166].
Humans use visual speech information to improve speech intelligibility from
a very young age. Aronson and Rosenblum [4] observed that infants as
young as 3 months old are aware of the bimodal nature of speech, while Dodd
[44] observed that toddlers at the age of 19 months actually perform lipreading.
Mills [120] showed that blind children are slower than sighted children in
acquiring speech production for those sounds which have visible articulators.
Even though these facts make for interesting reading, they do not in themselves
provide much assistance in developing an automatic lipreading system. To obtain
such assistance, a number of questions need to be answered, such as:
• What parts of the face give the most speech information?
• How important is the temporal nature of the visual speech signal?
• How much of an impact does the integration of the audio and video signals
have on speech perception?
Each of these questions is examined in the following subsections with respect
to human perception studies. These findings will be of use when developing the
lipreading system later in this thesis.
Pertinent Areas of the Face for Lipreading
It is largely agreed that most information pertaining to visual speech stems from
the areas around the lips, even though visual speech is located throughout the
human face to some extent [92]. McGrath et al. [117] showed that the human
lips alone carry more than half the visual information provided by the face of an
English speaker. Benoît et al. [7] found that the lips alone contain on average
two thirds of the speech intelligibility carried by a French speaker’s face. Benoît
et al. [7] also showed that a combined lip/jaw model gave a noticeable gain in
performance over a lip-only model, with the combined model performing only
slightly worse than a model of the entire face. Brooke and Summerfield
[14] found that visible articulators such as the teeth and tongue improved the
perception of vowels, while Finn [48] found that for consonants the most important
features were the size and shape of the lips.
Brooke and Summerfield [14] performed perceptual tests using a synthetic
face, synthesising the outer and inner lip contours and the chin. Human
speechreading performance for vowels using the synthetic face proved to be
significantly worse than with a natural face. It was concluded that additional
cues, such as the visibility of the teeth and tongue, were required for more
accurate recognition
of vowels. Finn [48] sought to determine the appropriate oral-cavity features for
consonant recognition. The most important features determined were height and
width of oral cavity opening, the vertical spreading of the upper and lower lips
and the cornering of the lips (puckering).
Temporal Nature of Visual Speech
As speech is a temporal signal, it is intuitive that temporal features would be
of most use. This is certainly the case for audio-only speech recognition, where
delta and acceleration coefficients are appended to the static features to improve
recognition performance. In human perception studies, many experiments have been
carried out testing this theory on visual features [17, 18, 36, 60, 66, 179]. In
the work carried out by Rosenblum and Saldana [151], face kinematics were found
to be more useful than shape parameters. The frame rate of the visual speech is
also important in lipreading, as shown by Frowein et al. [50], who showed that
speech recognition performance drops markedly below 15Hz.
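The delta coefficients mentioned above, appended to static features in audio-only ASR, can be computed with the standard regression formula. This is a sketch assuming a (T, D) feature matrix and an HTK-style regression window of half-width theta:

```python
import numpy as np

def delta_features(X, theta=2):
    """HTK-style delta (velocity) coefficients for a (T, D) feature matrix:
    d_t = sum_{k=1}^{theta} k (x_{t+k} - x_{t-k}) / (2 sum_k k^2),
    with edge frames replicated (the window half-width theta is an
    assumption; HTK's default regression window is comparable)."""
    T, D = X.shape
    denom = 2.0 * sum(k * k for k in range(1, theta + 1))
    # Replicate first/last frames so the regression is defined at the edges.
    padded = np.vstack([np.repeat(X[:1], theta, axis=0), X,
                        np.repeat(X[-1:], theta, axis=0)])
    deltas = np.zeros((T, D))
    for k in range(1, theta + 1):
        deltas += k * (padded[theta + k:theta + k + T]
                       - padded[theta - k:theta - k + T])
    return deltas / denom
```

A linearly increasing feature gives a constant delta of 1 away from the edges; appending `[X, delta]` to the static stream doubles the feature dimensionality, as in standard ASR front-ends.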
Impact of Audio-Visual Integration
Lavagetto [92] demonstrated that acoustic and visual speech stimuli are not
synchronous, at least at a feature-based level. It was shown that the visible
articulators during an utterance start and complete their trajectories
asynchronously, exhibiting both forward and backward coarticulation with
respect to the acoustic speech wave. Intuitively, this makes sense, as the
visual articulators (i.e. lips,
tongue, jaw etc.) have to position themselves correctly before and after the start
and end of an acoustic utterance. This time delay is known as the voice-onset-
time (VOT) [47], which is defined as the time delay between the burst sound,
coming from the plosive part of a consonant, and the movement of the vocal folds
for the voiced part of a voiced consonant or subsequent vowel. McGrath et al.
[117] also found that an audio lead of less than 80ms or a lag of less than
140ms could not be detected during speech. However, if the audio was delayed
by more than
160ms it no longer contributed useful information. It was concluded that, in
practice, delays of up to 40ms are acceptable, which in normal PAL video (25
frames per second) corresponds to a single frame of asynchrony. This signifies
that some degree of asynchrony can be tolerated in continuous audio-visual
speech perception.
Seeing that this thesis is concerned only with the visual channel, the apparent
asynchrony between the two modalities could have implications for the lipreading
system. This is because the models for the visual modality are generally
bootstrapped from the time-labelled transcriptions taken from the audio-only
channel. Even though there is some error associated with this method, it still
appears to be the best way of approximating the initial visual models, given
that in clean conditions the audio channel is the most reliable source of
information.
2.6 Summary
This chapter has given insights into the various aspects associated with AVASR.
The history of AVASR was detailed, citing the major works which have influenced
this field of research over the past fifty years or so. One of these major works
described, was that of the McGurk effect [118], which highlights the bimodal
nature of speech. From the timeline presented, it was shown that the main
driving force behind progress in AVASR has been the availability of quality
data to train and test the various systems. The basic mechanics of the human
speech production system, as well as the linguistics associated with the audio
and visual modalities, were then presented. The complementary nature of the
acoustic and visual speech signals was then analysed, extending the work
conducted by McGurk and MacDonald. Questions pertaining to which visual
articulators and representations are effective for lipreading were then raised.
Each of these questions was explored in turn, and the answers used to form the
motivation behind the lipreading system presented later in this thesis.
Chapter 3
Classification of Visual Speech
3.1 Introduction
A lipreading system can be considered as a sequential, modular system consisting
of the visual front-end, visual feature extraction and visual feature
classification modules, as depicted in Figure 3.1. Even though the visual
front-end and visual feature extraction modules are the major focus of this
thesis, classification remains a central part of a lipreading system and this
chapter is dedicated to that particular area. Classification is a difficult
task, made more difficult still by the addition of extra streams. Due to this
complexity, almost all systems in the AVASR literature have utilised a
two-stream approach, with one stream dedicated to the audio and the other to
the visual signal. However, such techniques are generic and can be applied to
any number of streams, a property that will be made use of throughout this
thesis. The first part of this chapter conducts a brief review of the
classification techniques available for lipreading before detailing the hidden
Markov model (HMM); the second part looks at the various integration methods
which can be used to fuse multiple streams of visual features together. The
final part of the chapter then conducts a thorough review of the currently
available audio-visual corpora, with particular emphasis placed on describing
the data and protocols of the IBM smart-room and CUAVE databases, the two
databases which contain multi-view/pose visual speech data.
Figure 3.1: Block diagram of a lipreading system (Video In → Visual Front-End → Visual Feature Extraction → Classification → Visual-Only Speech Recognition (Lipreading)).
3.2 Classifiers for Lipreading
In the literature, the most widely used classifier for modelling and recognising
audio and visual speech data has been the hidden Markov model (HMM). Discriminant
classifiers such as artificial neural networks (ANNs) have also been used [90].
Heckmann et al. [69] developed a combination of the ANN and HMM to form a hybrid
ANN-HMM classifier. Similarly, Bregler et al. [12] and Duchnowski et al. [45]
devised a hybrid ANN-DTW (dynamic time warping) classifier. Recently, Gowdy et
al. [62] and Saenko et al. [157] have proposed the use of dynamic Bayesian
networks (DBNs) for AVASR. It should be noted, though, that the DBN is not a
classifier as such, but a unifying framework for combining different classifiers
together. In [62] and [157], the DBN was used as a framework for combining the
single-stream HMMs of the respective streams.
Even though all the classifiers mentioned above have enjoyed some success in
their respective AVASR1 tasks, the vast majority of systems employ HMMs with
a continuous observation probability density, modelled as a mixture of Gaussian
densities [142]. The reason for this is simply that HMMs have proven themselves
the best classifier for modelling and recognising the temporal nature of speech
in both the audio and visual signals. As such, the HMM is used as the classifier
of choice in this thesis. The next section gives a detailed description of the
HMM and how it can be utilised for the task of lipreading.
1As lipreading is a subset of the overall task of AVASR, it should be noted that it is implied that the classifiers mentioned above have also been used for the sole task of lipreading.
3.3 Hidden Markov Models (HMMs)
Hidden Markov models (HMMs) are a powerful statistical tool for modelling a
temporal signal based on observations, which are assumed to be generated by a
Markov process whose internal states are unknown, or hidden. A Markov process
may be described at any time as being in one of a set of N distinct states
[146], as depicted in Figure 3.2. At regularly spaced, sampled intervals, the
Markov process
Figure 3.2: Discrete states in a Markov model are represented by nodes and the transition probabilities by links.
undergoes a change of state according to a set of probabilities associated with
the state. So given the sequence of states q, defined as
q = {q1, . . . , qT}, qt ∈ [1, . . . , N ] (3.1)
where qt is the state at time t, a probabilistic description that the model λ
generated the sequence q would require specification of the current state at
time t, as well as all the previous states. However, for a first-order Markov
chain, only the preceding state is used, such that

Pr(qt = j | qt−1 = i, qt−2 = k, . . .) = Pr(qt = j | qt−1 = i) (3.2)

Equation 3.2 can be simplified as the right side is independent of time, leading
to the set of state transition probabilities A = {aij} of the form

aij = Pr(qt = j | qt−1 = i) (3.3)
with the following properties

aij ≥ 0, ∀ i, j and ∑_{j=1}^{N} aij = 1, ∀ i
At time t = 1, there has to be initial state probabilities πi
πi = Pr(q1 = i), 1 ≤ i ≤ N (3.4)
So, given a state sequence q and a Markov model λ = (A, π), the a posteriori
probability that q was created by λ is given by

Pr(q|λ) = Pr(q1) ∏_{t=2}^{T} Pr(qt | qt−1)
        = πq1 aq1q2 aq2q3 . . . aqT−1qT (3.5)
Normally, however, there is no way of knowing the state sequence q. Instead,
the observation feature vectors within the observation set O, defined by

O = {o1, o2, . . . , oT} (3.6)

where ot is an n-dimensional vector at time t, can be used to estimate the
state sequence from the data being modelled. Therefore, the a posteriori
probability that λ generated the observation O can be given by
Pr(O|λ) = ∑_{all q} Pr(O|q, λ) Pr(q|λ) (3.7)
For the case of continuous observations, it is easier to evaluate Equation 3.7
using conditional density functions. This is referred to as a continuous HMM,
which can be defined as
p(O|λ) = ∑_{all q} p(O|q, λ) Pr(q|λ) (3.8)
Modelling continuous observations, rather than using discrete quantised
observations, is preferred as it has been found to be more effective for the
task of lipreading and other speech processing applications [142, 146]. To
evaluate Equation 3.8, a value for p(O|q, λ) is required, which can be
expressed, assuming statistical independence of observations, as
p(O|q, λ) = ∏_{t=1}^{T} p(ot|qt, λ) = ∏_{t=1}^{T} bqt(ot) (3.9)
where B = {bj(ot)} is the compact notation expressing the likelihood of
observation ot lying in state j. The form of bj(ot) can be described as a
mixture of Mj Gaussian components bjm(ot) via

bj(ot) = ∑_{m=1}^{Mj} cjm bjm(ot) = ∑_{m=1}^{Mj} cjm N(ot; µjm, Σjm) (3.10)
where c is the mixture weight, µ is the mixture mean and Σ is the mixture
covariance matrix for mixture m and state j. These are known as Gaussian
mixture models (GMMs). Now that parameters A, B and π have been defined,
the HMM λ can be represented by the compact parameter set of
λ = (A,B, π) (3.11)
Using the model parameters in Equation 3.11, the likelihood of the observation
O can be evaluated using Equation 3.7. This is most often computationally
intractable, so an approximation using the Viterbi decoding algorithm is used
instead. This is described in the next subsection.
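The Gaussian-mixture emission density of Equation 3.10 can be sketched directly. The parameters passed in are hypothetical and the covariances are assumed diagonal:

```python
import numpy as np

def gmm_likelihood(o, weights, means, variances):
    """Emission likelihood b_j(o_t) for one HMM state, modelled as a
    diagonal-covariance Gaussian mixture (a sketch of Equation 3.10;
    the parameter values are assumed already trained)."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        d = len(mu)
        norm = (2.0 * np.pi) ** (-d / 2) / np.sqrt(np.prod(var))
        total += c * norm * np.exp(-0.5 * np.sum((o - mu) ** 2 / var))
    return total

# A single standard Gaussian evaluated at its mean gives 1/sqrt(2*pi):
b = gmm_likelihood(np.zeros(1), [1.0], [np.zeros(1)], [np.ones(1)])
```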
3.3.1 Viterbi Recognition
Recognising the most likely hidden state sequence, q∗, using Equation 3.7 is
computationally prohibitive, as it requires the likelihood of every possible
path to be calculated. There are several ways to find the optimal path q∗
associated with a given observation sequence. The problem arises
in the definition of the optimal state sequence. The most common optimality
criterion [46, 146] is to choose the states qt that are individually most likely
at each time t. Although this criterion is locally optimal, there is no
guarantee that the path is a valid one, as it might not be consistent with the
underlying model λ. However, it has been shown [46, 146] that this locally
optimal solution works effectively in practice, and it can be formalised into
what is known as the Viterbi algorithm [46, 146, 147], given as follows:
1. Initialisation:

δi(1) = πi bi(o1), 1 ≤ i ≤ N
ψi(1) = 0 (3.12)

2. Recursion:

δj(t) = bj(ot) max_{i=1..N} [δi(t−1) aij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
ψj(t) = arg max_{i=1..N} [δi(t−1) aij], 2 ≤ t ≤ T, 1 ≤ j ≤ N (3.13)

3. Termination:

p(O|q∗, λ) = max_{i=1..N} δi(T)
q∗T = arg max_{i=1..N} δi(T) (3.14)

4. Path backtracking:

q∗t = ψ_{q∗t+1}(t + 1), t = T − 1, T − 2, . . . , 1 (3.15)
where δi(t) is the best score along a single path ending in state i at time t,
and ψi(t) is an array keeping track of the argument that maximised each step.
In practice, a closely related algorithm using logarithms [146] is employed,
thus negating the need for any multiplications and reducing the computational
load considerably.
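A minimal log-domain sketch of Equations 3.12 to 3.15 follows; the log-probabilities passed in are assumed precomputed:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Log-domain Viterbi decoding (Equations 3.12-3.15), replacing the
    multiplications with additions as described above.

    log_pi: (N,) initial state log-probabilities
    log_A:  (N, N) state transition log-probabilities
    log_B:  (T, N) per-frame emission log-likelihoods log b_j(o_t)
    Returns the optimal state path q* and its log score."""
    T, N = log_B.shape
    delta = np.empty((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]                      # initialisation
    for t in range(1, T):                             # recursion
        scores = delta[t - 1][:, None] + log_A        # candidate (i -> j) scores
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_B[t]
    path = [int(np.argmax(delta[-1]))]                # termination
    for t in range(T - 1, 0, -1):                     # path backtracking
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```

For a two-state model whose emissions strongly favour state 0, the decoded path stays in state 0 throughout, as expected.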
3.3.2 HMM Parameter Estimation
The parameters of a HMM, λ = (A,B, π), can be learnt from a set of training
observations Or, 1 ≤ r ≤ R, where R is the number of training sequence observa-
tions. Even though there is no method to analytically solve for the model param-
eter set that globally maximises the likelihood of the observation in a closed form
[146], a good solution can be determined by a straightforward technique known
as the Baum-Welch or forward-backward algorithm [46]. The Baum-Welch train-
ing algorithm is an instance of the generalised expectation maximisation (EM)
algorithm [43], which essentially chooses a λ that locally maximises the likeli-
hood p(O|λ). Since this algorithm requires a starting guess for λ, some form of
initialisation must be performed, which is generally done with Viterbi training.
Viterbi Training
Viterbi training for HMMs places a hard boundary on the observation sequence
O. When a new HMM is initialised, the Viterbi segmentation is replaced by
a uniform segmentation (i.e. each training observation is divided into N equal
segments) for the first iteration. After the first iteration, each training sequence
Or is segmented using a state alignment procedure which results from using the
Viterbi decoding algorithm described in the previous subsection to get the optimal
state sequence qr∗. If Aij represents the total number of transitions from state i
to state j in qr∗ for all R observation sequences, then the transition probabilities
can be estimated from the relative frequencies
aij = Aij / ∑_{k=1}^{N} Aik (3.16)
Within each state, a further alignment of observations to mixture components
is made by associating each observation ot with the mixture component with
the highest likelihood. On the first iteration, unsupervised k-means clustering
is employed to gain an initial estimate of bj(ot). Viterbi training is repeated
until there is minimal change in the model parameter estimate λ.
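The two initialisation steps described above, the first-iteration uniform segmentation and the relative-frequency transition estimate of Equation 3.16, can be sketched as:

```python
import numpy as np

def uniform_segmentation(T, N):
    """First-iteration hard alignment: divide T frames into N roughly
    equal state segments, as described above."""
    seg = int(np.ceil(T / N))
    return np.repeat(np.arange(N), seg)[:T]

def estimate_transitions(alignments, N):
    """Relative-frequency transition estimate of Equation 3.16 from a
    set of hard state alignments (one array of state indices per
    training observation sequence)."""
    counts = np.zeros((N, N))
    for q in alignments:
        for i, j in zip(q[:-1], q[1:]):
            counts[i, j] += 1.0                 # accumulate A_ij
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts),
                     where=totals > 0)
```

On subsequent iterations the uniform alignment would be replaced by the optimal state sequence from Viterbi decoding, as described in the text.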
Baum-Welch Re-Estimation
The Baum-Welch re-estimation algorithm uses a soft boundary, denoted by L,
representing the likelihood of an observation being associated with any given
Gaussian mixture component. This soft segmentation replaces the hard boundary
used in Viterbi training. The new likelihood is known as the occupation
likelihood [189] and is computed from the forward and backward
variables. The forward variable αi(t) is defined as
αi(t) = p(o1o2 . . .ot, qt = i|λ) (3.17)
that is, the likelihood of the partial observation sequence o1 o2 . . . ot and
state i at time t, given the model λ. This can be solved inductively as follows:
1. Initialisation:
αi(1) = πibi(o1), 1 ≤ i ≤ N (3.18)
2. Induction:

αj(t + 1) = [∑_{i=1}^{N} αi(t) aij] bj(ot+1), 1 ≤ j ≤ N, 1 ≤ t ≤ T − 1 (3.19)
In a similar manner, the backward variable βi(t) is defined as
βi(t) = p(ot+1 ot+2 . . . oT | qt = i, λ) (3.20)

that is, the likelihood of the partial observation sequence ot+1 ot+2 . . . oT ,
given state i at time t and the model λ. Again, this can be solved inductively:
1. Initialisation:
βi(T ) = 1, 1 ≤ i ≤ N (3.21)
2. Induction:

βi(t) = ∑_{j=1}^{N} aij bj(ot+1) βj(t + 1), 1 ≤ i ≤ N, 1 ≤ t ≤ T − 1 (3.22)
From Equations 3.17 and 3.20 it can be seen that αi(t) is a joint likelihood
whereas βi(t) is a conditional likelihood, such that

p(O, qt = j|λ) = p(o1 o2 . . . ot, qt = j|λ) p(ot+1 ot+2 . . . oT | qt = j, λ)
             = αj(t) βj(t) (3.23)
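The forward and backward recursions of Equations 3.18 to 3.22 can be sketched numerically, with the emission likelihoods assumed precomputed per frame:

```python
import numpy as np

def forward_backward(pi, A, B):
    """Forward and backward variables of Equations 3.17-3.22, with the
    emission likelihoods precomputed as B[t, j] = b_j(o_t)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[0]                              # Equation 3.18
    for t in range(1, T):                             # Equation 3.19
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    for t in range(T - 2, -1, -1):                    # Equation 3.22
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return alpha, beta

# By Equation 3.23, sum_j alpha_j(t) * beta_j(t) equals the total
# likelihood p(O | lambda) at every frame t.
```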
Using this result, the likelihood of qt = j can be defined as Lj(t) in terms
of αj(t), βj(t) and λ via

L_j^r(t) = p(q_t^r = j | O^r, λ)
         = p(O^r, q_t^r = j | λ) / p(O^r | λ)
         = (1/P_r) α_j^r(t) β_j^r(t) (3.24)
The likelihood of q_t^r = j for mixture component m can also be defined as

L_jm^r(t) = (1/P_r) [∑_{i=1}^{N} α_i^r(t − 1) a_ij] c_jm b_jm(o_t^r) β_j^r(t) (3.25)
where P_r is the total likelihood p(O^r|λ) of the r’th observation sequence,
which can be calculated as

P_r = α_N^r(T) = β_1^r(1) (3.26)
The transition probabilities A = {a_ij} can now be re-estimated using

a_ij = [∑_{r=1}^{R} (1/P_r) ∑_{t=1}^{Tr−1} α_i^r(t) a_ij b_j(o_{t+1}^r) β_j^r(t + 1)] / [∑_{r=1}^{R} (1/P_r) ∑_{t=1}^{Tr−1} α_i^r(t) β_i^r(t)] (3.27)
Given Equations 3.24, 3.25 and 3.26, the mixture components of
B = {bj(ot) = (c_jm, µ_jm, Σ_jm)} can be re-estimated by

µ_jm = [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_jm^r(t) o_t^r] / [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_jm^r(t)] (3.28)

Σ_jm = [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_jm^r(t) (o_t^r − µ_jm)(o_t^r − µ_jm)′] / [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_jm^r(t)] (3.29)

c_jm = [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_jm^r(t)] / [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_j^r(t)] (3.30)
Equations 3.27 to 3.30 are iterated until convergence occurs.
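A simplified sketch of the occupation likelihood and one of these updates follows, reduced to a single observation sequence and a single Gaussian per state (so Equation 3.28 without the mixture index):

```python
import numpy as np

def occupancy(alpha, beta):
    """State occupation likelihoods L_j(t) of Equation 3.24 for a single
    observation sequence (r = 1), with the total likelihood taken as the
    sum of the final forward variables."""
    P = alpha[-1].sum()
    return alpha * beta / P                           # shape (T, N)

def reestimate_means(L, O):
    """Occupancy-weighted mean update in the spirit of Equation 3.28,
    simplified to one Gaussian per state and one sequence:
    mu_j = sum_t L_j(t) o_t / sum_t L_j(t)."""
    return (L.T @ O) / L.sum(axis=0)[:, None]
```

Iterating the soft alignment and these weighted averages until the likelihood stops improving is exactly the EM loop described above.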
3.4 Stream Integration
Fusing the audio and visual streams within an AVASR system is a prime example
of the general classifier combination problem [79]. As such, techniques used for
the integration of the audio and visual streams can generalise to any particular
problem, such as combining two or more visual streams of visual data. Combining
multiple visual streams together within a lipreading system, may have the ben-
efit of improving the visual speech representation, inturn improving the overall
lipreading performance. This particular facet is focussed on with the patch-based
and multi-view experiments conducted later on in this thesis (see Chapters 5 and
6).
The goal of combining various streams or classifiers is to give superior
performance over each of the single-stream classifiers. However, great care has
to be taken when combining classifiers due to the risk of catastrophic fusion
[101], which occurs when the performance of an ensemble of combined classifiers
is worse than that of any of the classifiers individually.
According to Potamianos et al. [142], stream integration can be performed
either by feature fusion or by decision fusion methods2. Feature fusion methods
are based on concatenating the features of the various streams into a single
feature vector. The main benefit of feature fusion is that only one classifier
is used, so it can easily be employed within an existing lipreading system.
However, if one data stream is not as reliable or as informative as the
other(s), its features cannot be weighted accordingly. Feature fusion methods
also assume that all streams are synchronous and therefore cannot model any
asynchrony between them. This is not an issue for lipreading, however, as all
streams of data should be synchronous in the visual domain.
In comparison, decision fusion methods model each stream separately and use
their respective classifier outputs to recognise the given speech. Even though
decision fusion methods are more complex than feature fusion methods, they
2It should be noted that these groupings given by Potamianos et al. [142] refer to the integration of both audio and video streams. However, they can be generalised to any number of different data streams with the same temporal nature.
do allow for the weighting of the various streams, which is a very useful
characteristic when the levels of information contained in each stream vary.
Decision fusion methods also allow the various streams to be somewhat
asynchronous, although this is not a requirement for lipreading, as previously
mentioned. The following subsections describe algorithms for both the feature
fusion and decision fusion methods.
3.4.1 Feature Fusion Techniques
Feature fusion methods can be implemented either by using the plain
concatenated feature vector [1], or by transforming the concatenated feature
vector into a more compact representation [143]. Both techniques are discussed
shortly. However, it is worth noting that feature fusion methods can also be
used to convert features of one modality into another. Examples of this are the
audio enhancement work of Girin et al. [54] and Barker and Berthommier [5], who
used visual features to estimate audio features. Goecke et al. [58] and Girin
et al. [55] later extended this work by converting the concatenated
audio-visual features into plain audio features. Although the audio enhancement
work falls outside the scope of this thesis, it was a motivating factor for the
pose-invariant experiments conducted in Chapter 7.
Concatenative Feature Fusion
Given the time-synchronous observation vectors of the various input streams,
i.e. o_t^{(1)}, . . . , o_t^{(M)}, with dimensionalities D^{(1)}, . . . , D^{(M)}
respectively, the joint concatenated visual feature vector at time t becomes

o_t^{(C)} = [o_t^{(1)}, . . . , o_t^{(M)}]^T ∈ R^D (3.31)
where D = D^{(1)} + . . . + D^{(M)}. As with all feature fusion methods, these
features are fed into a single-stream HMM. Concatenative feature fusion
constitutes a simple approach for combining different features, and can be
implemented without much change to existing systems. However, the dimensionality
D can be rather high (especially if M > 2), causing inadequate modelling by the
HMM, due to the curse of dimensionality [8]. The curse of dimensionality
essentially refers to the inability of classifiers such as HMMs to converge when
given observations of high dimensionality. To overcome this problem, the
dimensionality of such observations has to be kept low (≤ 60). This can be
done by employing the following feature fusion technique.
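Concatenative fusion itself reduces to stacking synchronous per-frame features; a minimal sketch of Equation 3.31:

```python
import numpy as np

def concatenate_streams(streams):
    """Concatenative feature fusion (Equation 3.31): stack M time-
    synchronous feature streams, each of shape (T, D_m), into a single
    (T, D) observation matrix with D = D_1 + ... + D_M."""
    T = streams[0].shape[0]
    assert all(s.shape[0] == T for s in streams), "streams must be synchronous"
    return np.hstack(streams)
```

Note how quickly D grows with the number of streams, which is exactly the modelling problem the next technique addresses.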
Hierarchical Linear Discriminant Analysis (HiLDA) Feature Fusion
To overcome dimensionality constraints, it is desirable to get a lower dimensional
representation of Equation 3.31. Potamianos et al. [139] proposed the use of
linear discriminant analysis (LDA) to obtain such a reduction on the concate-
nated audio-visual feature vector (see Chapter 5.3.1 for full explanation). In this
work, they combined the LDA step with a maximum likelihood linear transfor-
mation (MLLT) step. As these steps were performed in the audio and visual
streams prior to concatenation, they termed this feature fusion technique hierar-
chical linear discriminant analysis or HiLDA. In this thesis, the HiLDA technique
was implemented without the MLLT step, so the resulting combined observation
vector can be expressed as
o{HiLDA}t = W × o
{C}t (3.32)
where W is the LDA transformation matrix. The dimensionality of Equation 3.32
can be set to a value which ensures that the single-stream HMM converges.
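A sketch of estimating W from labelled training frames follows; the scatter-matrix formulation used here is the standard LDA derivation, not necessarily the exact implementation of [139]:

```python
import numpy as np

def lda_matrix(X, y, out_dim):
    """Estimate an LDA projection matrix W for Equation 3.32 (the HiLDA
    step without the MLLT stage, as used in this thesis): find directions
    maximising between-class over within-class scatter of the
    concatenated features X (T, D) with class labels y."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))           # within-class scatter
    Sb = np.zeros_like(Sw)                            # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # Leading eigenvectors of Sw^{-1} Sb give the discriminant directions.
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)[:out_dim]
    return vecs.real[:, order].T                      # (out_dim, D)

# Each concatenated frame is then projected as o_hilda = W @ o_c.
```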
3.4.2 Decision Fusion Techniques
Even though feature fusion techniques have been shown to work well [126],
they cannot take into account the relative reliability or usefulness of the
various streams. This is important, as the amount of information contained in
each stream can vary considerably [142]. Decision fusion techniques, however,
provide a framework for imparting such information on the respective streams.
According to Potamianos et al. [142], the most commonly used decision fusion
techniques for AVASR model each stream in parallel, using adaptive weights, and
combine the log-likelihoods linearly. The most likely speech class or word is then
derived using the appropriate weights, as was done in [1, 47, 68, 70, 85, 126, 135].
This task is relatively easy for isolated speech recognition; for continuous
speech recognition, however, it poses a very difficult problem, as the sequences
of classes (such as HMM states or words) need to be estimated [142]. As such,
there are three possible temporal levels at which the various stream likelihoods
can be combined. These are:
1. Early integration (EI), combines the stream likelihoods at the HMM state
level, which gives rise to the multi-stream HMM classifier [10], [189]. At the
EI level of integration, synchrony between the various streams is enforced.
2. Middle integration (MI), is implemented by means of the product HMM
[177], or coupled HMM [11, 125], which force HMM synchrony at the phone,
or word boundaries. The MI approach has been used to good effect in
AVASR as it can compensate for the slight lag between the audio and
visual streams due to the voice onset time (VOT) [92]. Examples of this
integration strategy being used for AVASR can be found in [29, 47,
63, 75, 107, 123, 125, 126, 175].
3. Late integration (LI), typically where a number of n-best hypotheses are
rescored by the log-likelihood combination of the various streams. During
the classification process there is no interaction between the various streams,
with only the final classifier likelihood scores being combined. In this
approach temporal information between the speech modalities is lost. In AVASR,
this integration strategy has been used in [1, 37, 68].
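As a toy illustration of the LI strategy, an n-best list can be rescored by linearly combining per-stream log-likelihoods; the hypotheses, scores, and weight below are purely illustrative:

```python
import numpy as np

# Hypothetical 3-best lists: per-hypothesis log-likelihoods from two
# independently decoded streams (all values are illustrative).
hypotheses = ["one two", "one too", "won two"]
audio_loglikes = np.array([-12.0, -11.5, -14.0])
visual_loglikes = np.array([-20.0, -23.0, -21.0])

# Late integration: rescore the n-best list with a weighted linear
# combination of stream log-likelihoods, then pick the best hypothesis.
alpha = 0.7  # reliability weight for the audio stream
combined = alpha * audio_loglikes + (1 - alpha) * visual_loglikes
best = hypotheses[int(np.argmax(combined))]
print(best)  # one two
```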
As mentioned previously, in lipreading all the visual streams should be
synchronous, as the problem is constrained to one modality rather than two as in
AVASR. Integrating the streams according to the EI strategy is therefore best
suited to lipreading. The following describes the multi-stream HMM, with special
reference to the synchronous case, as used in this thesis.
Multi-Stream HMMs
Multi-stream HMMs use separate independently trained HMMs and combine
them into a single HMM in such a way that one stream may have some temporal
dependence on the other during decoding, without the disadvantage of training
both sequences together. Multi-stream HMMs can be used to model the MI inte-
gration strategy as they provide relative independence between streams statically
with a loose temporal dependence dynamically. There are two main ways to
build a multi-stream HMM: synchronously or asynchronously. Although the
asynchronous multi-stream HMM is useful for AVASR applications, the
synchronous multi-stream HMM (SMSHMM) is of more benefit to lipreading due
to its synchronous nature. Even though the SMSHMM is more complicated than
its single-stream cousin, it can be implemented as a similarly structured joint
HMM. When decoding an SMSHMM, state transitions must occur synchronously
between HMM streams. A necessary condition of this type of multi-stream HMM
is for all HMM streams to have the same number of states N . The i’th initial
state distribution πi, observation emission likelihood bj(ot) and the transition
probability ai,j for the joint SMSHMM can be expressed in terms of
π_i = (π_i^{I})^{α^{I}} (π_i^{II})^{α^{II}} · · · (π_i^{M})^{α^{M}}                         (3.33)

b_j(o_t) = (b_j^{I}(o_t^{I}))^{α^{I}} (b_j^{II}(o_t^{II}))^{α^{II}} · · · (b_j^{M}(o_t^{M}))^{α^{M}},   1 ≤ j ≤ N   (3.34)

a_{i,j} = (a_{i,j}^{I})^{α^{I}} (a_{i,j}^{II})^{α^{II}} · · · (a_{i,j}^{M})^{α^{M}},   1 ≤ i, j ≤ N   (3.35)
where {I}, {II}, . . . , {M} refer to the respective streams³. The weighting factor
α^{s} is an exponential weight reflecting the usefulness or reliability of the
respective stream; each weight is constrained to lie between zero and one, with
∑_{s=I}^{M} α^{s} = 1. Once the new emission likelihoods and transition probabilities have
been found, decoding can take place using the Viterbi algorithm [146] to obtain
an estimate of p(O_C | λ_i).
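In the log domain, the combined emission likelihood of Equation 3.34 is simply a weighted sum of the per-stream log-likelihoods; the following minimal sketch (with illustrative values) shows this computation for one state and frame:

```python
import numpy as np

def smshmm_log_emission(stream_loglikes, alphas):
    """Combined SMSHMM emission log-likelihood for one state and frame,
    i.e. the log of Eq. 3.34: log b_j(o_t) = sum_s alpha_s * log b_j^s(o_t^s)."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.isclose(alphas.sum(), 1.0), "stream weights must sum to one"
    return float(np.dot(alphas, np.asarray(stream_loglikes, dtype=float)))

# Two visual streams: the more reliable stream is weighted 0.7.
print(round(smshmm_log_emission([-4.0, -10.0], [0.7, 0.3]), 6))  # -5.8
```

Because the weights act as exponents on the likelihoods, a stream given a small α has correspondingly little influence on the state scores used during decoding.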
The SMSHMM has been considered in audio-only and visual-only speech
recognition, where the given static features as well as their first and second or-
der derivatives have been assigned their own stream [102, 189]. For AVASR,
³It is worth noting that the number of streams in HTK [189], the HMM decoder used for this thesis, is restricted to four.
researchers have used the SMSHMM as a two-stream HMM, with one stream for
the audio and the other for the visual [47, 85, 107, 126, 135]. In this thesis, a novel
extension to the SMSHMM will be investigated by using different visual features
from different poses, as well as different parts of the mouth region for each stream.
Full description of these experiments and results are given in Chapters 5 and 6.
It is worth noting that Dean et al. [41] have recently proposed the use of the
fused HMM, which is similar to the SMSHMM. In this work, they use just the
most reliable stream (i.e. the audio) to train the various streams of the HMMs
to obtain the best possible state sequence. Decoding is then performed on the
combined streams. This presents another option for integrating the various visual
streams in a lipreading system.
3.5 HMM Parameters Used in Thesis
A HMM can be employed to represent a sub-word unit like a phoneme (or viseme)
or tri-phone, a word, or a sentence. For large vocabulary tasks, tri-phone models
are used, as the number of word models needed becomes very large and the
training set is typically not large enough to build good models for all word classes
of the vocabulary. However, this was not a concern for this thesis, as the task
of connected word recognition was performed⁴. As such, all word HMMs were
modeled using 9 states in a left-to-right topology, with 7 Gaussian mixtures per
state using the hidden Markov model toolkit HTK [189]. This HMM configura-
tion was used as experimental and heuristic evidence showed that this was the
optimal configuration. A silence and short-pause model were also employed. All
models were bootstrapped from a segmentation of the parallel audio channel,
obtained by an audio-only HMM with identical topology. The audio-only HMMs
were trained on 39-dimensional acoustic features which were extracted to rep-
resent the acoustic signal at the rate of 100 Hz. These were perceptual linear
prediction (PLP) based cepstral features, obtained using a 25 ms Hamming win-
dow, and augmented by their first and second derivatives. The HTK toolkit was
utilized for all training and testing. These HMMs were designed to recognise
⁴Connected speech recognition refers to connecting whole-word HMMs together in sequence, compared to continuous speech recognition, which connects sub-word HMMs [189].
the connected-digit sequences (ten-word vocabulary with no grammar), and they
were based on single-stream HMMs of a variable dimension (dimensionality of
the visual features is described in Chapter 5.4).
Using HTK, each word HMM was built using HInit and HRest, tools which
estimate the initial parameters and re-estimate them using the Baum-Welch
algorithm, respectively. The initial soft boundaries for this process
were obtained from the hand-labelled time transcriptions supplied with the
databases (descriptions of the databases are given in the next section). This
process is regarded as the bootstrap operation, and it is extremely good at
determining the isolated word models. As connected words were to be recognised
for this thesis, the additional HTK tool HERest performed embedded training.
The embedded training uses the same Baum-Welch procedure as for the isolated
case but, rather than training each model individually, all models are trained in
parallel, using the full transcriptions [189]. In the recognition phase, the Viterbi
algorithm is then used to decode the likely state sequence and the words. In
HTK, this was performed using the HVite command.
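The decoding step that HVite performs internally is the Viterbi algorithm; the following self-contained sketch (a toy two-state model, not the HTK implementation) illustrates it in log-space:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most-likely HMM state path (the decoding step performed by HVite).
    log_pi: (N,) initial state log-probabilities.
    log_A:  (N, N) transition log-probabilities, log_A[i, j] = log P(j | i).
    log_B:  (T, N) per-frame emission log-likelihoods."""
    T, N = log_B.shape
    delta = log_pi + log_B[0]          # best score ending in each state
    psi = np.zeros((T, N), dtype=int)  # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A      # (from-state, to-state)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())

# Tiny 2-state left-to-right example: emissions favour state 0 for the
# first two frames and state 1 for the last frame.
log_pi = np.log([0.99, 0.01])
log_A = np.log([[0.5, 0.5], [0.001, 0.999]])
log_B = np.array([[0.0, -10.0], [0.0, -10.0], [-10.0, 0.0]])
path, score = viterbi(log_pi, log_A, log_B)
print(path)  # [0, 0, 1]
```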
3.5.1 Measuring Lipreading Performance
As mentioned previously, in this thesis all lipreading experiments conducted were
for the small-vocabulary task of connected-digit recognition. Lipreading results
were reported as a word-error-rate (WER) percentage, calculated by

WER = (1 − (N − D − S − I)/N) × 100%                     (3.36)
where N is the total number of actual words, D is the number of words deleted,
S is the number of words substituted and I is the number of incorrectly inserted
words.
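In practice, the counts D, S and I in Equation 3.36 are obtained by aligning the recognised word string against the reference with a Levenshtein (minimum edit distance) alignment; the following sketch computes the WER directly from such an alignment (the example strings are illustrative):

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein alignment, matching Eq. 3.36:
    WER = (D + S + I) / N * 100, with N = len(reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edit cost between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("one two three four", "one too three"))  # 50.0: 1 sub + 1 del, N = 4
```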
3.6 Current Audio-Visual Databases
A major restriction to the progress of lipreading, and AVASR in general, has
been the limited availability to researchers of large audio-visual databases
containing many speakers across numerous environmental conditions with respect
to both the audio and video modalities. This can be attributed to the large cost
associated with collecting, storing and distributing audio-visual data. For
example, the storage requirement for capturing video at the full size, frame rate,
and high image quality is enormous, making the widespread database distribution
a non-trivial task [142]. Even in cases where a large audio-visual database
with many speakers does exist, it has often been developed by a corporation,
which limits its accessibility to the research community due to
proprietary issues. Consequently, most existing databases appear to have been
produced with a specific application in mind by a small group of researchers with
limited resources, rather than to address the many sources of variability
associated with audio-visual speech. However, as the capture and storage of
audio-visual data become more affordable, more databases are becoming available
which investigate these variabilities. In this section, a review of
the currently available audio-visual corpora is conducted. As part of this review,
particular emphasis will be placed on the two databases which contain multi-view
visual speech data that were used in this thesis.
3.6.1 Review of Audio-Visual Databases
Coinciding with the first ever AVASR system, Petajan [132] collected a database
consisting of a single subject uttering 2-10 repetitions of 100 isolated English
words. Since then, similar single subject databases have been developed for re-
searchers conducting experiments of limited size for different languages [1, 27, 40,
60, 68, 150, 164, 165, 172, 175]. In addition to these single subject databases,
many multiple speaker databases have been collected over the past decade. Due
to the cost of capturing and storing video data, these have only been concerned
with small vocabulary tasks, such as isolated or connected digit, letter or word
recognition. One of the first was the Tulips 1 database [122], which contains
recordings of 12 subjects uttering digits “one” to “four”. A 10 subject isolated
letter dataset for English has also been collected which has been used by Matthews
et al. [113] and Cox et al. [37]. Chen [22] collected a 10-subject isolated word
database with a 78-word vocabulary at the AMP laboratory at Carnegie Mellon
University. The AMP/CMU database has been freely
available for researchers to use and as such, has been extensively used in literature
[22, 29, 75, 125, 191].
With the improvement in computer technology and the reduction in cost of
video capturing and storage devices in recent years, the multi-speaker databases
have been extended to include many more speakers (i.e. > 35 speakers compared
to ≤ 10). However, most of these databases are still concerned with small
vocabulary tasks. One of the first of these databases was the AT&T [135] database,
which is a 49 subject database based on connected letters. The University of Illi-
nois at Urbana-Champaign has also collected a 100 subject database for the task
of connected digit recognition [28]. Due to its availability to all researchers, one
of the most popular databases in the late 1990s was the M2VTS database [133].
The M2VTS database consisted of 37 speakers saying the digits “zero” to “nine”
in French. The work involving the M2VTS database was later extended to form
the XM2VTS database [119], which contains 295 subjects. Four sessions of each
subject were taken to account for natural changes in the appearance of the speakers. In
each session, three sequences of audio-visual speech were taken, where two were
of the digits “0” to “9” in different order. The third sequence was the utterance
“Joe took father’s green shoe bench out”, which was designed to maximise visible
articulatory movements. The XM2VTS database is currently the largest publicly
available audio-visual database in terms of speakers but is still only constrained
to small vocabulary tasks. However, due to its size it is rather expensive to
purchase.
The recently released VidTIMIT database [159] consists of
43 speakers reciting 10 TIMIT sentences each, and has been used for multi-modal
speaker verification [160]. VidTIMIT was collected in an effort to provide a
more phonetically balanced dataset. Motivated by this work, Saenko collected
the AVTIMIT database, which consisted of 223 speakers who spoke 20 different
TIMIT sentences [156]. To cater for Australian English, Goecke and Millar [57]
collected the AVOZES audio-visual data corpus which consisted of 20 speakers
uttering phonemes and visemes of Australian English. AVOZES was also the first
audio-video database with stereo-video recordings, which enables 3D coordinates
of facial features to be recovered accurately [57].
Recently, in-car audio-visual data has been collected as part of the AVICAR
database [93]. The AVICAR database is the first publicly available audio-visual
corpus of in-car data, and constitutes the first publicly available database which
resembles noisy or “real-world” conditions. It consists of 100 speakers (50 male
and 50 female), captured under five different noise conditions: idling, driving at
35 mph with windows open and closed, and driving at 55 mph with windows open
and closed. The speech data consists of isolated digits, isolated letters, phone
numbers and sentences, all in English. The AVICAR database is a multi-sensory
database, consisting of an eight-microphone array to collect the audio signal and
four video cameras on the dashboard (all with the speaker in frontal pose).
To date, there is only one database which exists for large vocabulary tasks.
This database is the IBM Via Voice audio-visual database [127] and it consists of
290 fully frontal subjects uttering continuous speech with mostly verbalised punc-
tuation, dictation style [142]. The duration of the database is approximately 50
hours and it contains 24325 transcribed utterances with a 10403 word vocabulary.
In addition to this, a 50 subject connected digit database was collected that con-
tained 6689 utterances which relates to approximately 10 hours of speech data.
The Via Voice database was recorded in a quiet, studio-like environment, with
uniform lighting and background. In an effort to capture visual data in more
challenging environments, additional office data was captured using 109 subjects
and in-car data was also captured using 87 subjects, consisting of 6295 and 1485
utterances respectively [141]. These additional databases were of connected digits
only, in the same format as the digits portion of the Via Voice database. However,
due to commercial constraints this database is not available publicly.
The databases presented above were all of a speaker’s frontal face. As the
motivation for this thesis was to determine the effect that pose variation had
on lipreading performance, datasets which facilitated this type of variation had
to be used. Fortunately, the IBM smart-room [138], CUAVE [130] and DAVID
[25] databases all have such pose variation. For this thesis, the IBM smart-room
and CUAVE databases were used and are described in detail in the following
subsections. The DAVID database was not used, however, as it was relatively
small in size and was produced with the task of audio-visual speaker recognition
in mind. It is also worth noting that smaller datasets such as the “CMU Audio-
Visual Profile and Frontal View” database [91] and the profile speech database
collected by Yoshi et al. [188] have recently been collected, which would also be
useful for such tasks.
3.6.2 IBM Smart-Room Database
With the ever-decreasing cost associated with collecting synchronous audio and
video data, it is becoming viable to collect data of meeting or lecture events inside
a smart room [52, 131] equipped with a number of far-field audio-visual sensors,
including microphone arrays and fixed and pan-tilt-zoom (PTZ) cameras. This
scenario is of central interest in the “Computers in the Human Interaction Loop”
(CHIL) integrated project currently funded by the European Union [26]. A
schematic diagram of one of the smart rooms developed for this project, in
particular the one located at IBM Research, from which the IBM smart-room
database takes its name, is depicted in Figure 3.3⁵.
Clearly, audio-visual speech technologies, such as speech activity detection,
source separation, and speech recognition, are of prime interest in this scenario,
due to overlapping and noisy speech, typical in multi-person interaction, cap-
tured by far-field microphones. Data from the smart room fixed cameras are of
insufficient quality to be used for this purpose, as they typically capture the par-
ticipants’ faces in low resolution (see also Figure 3.4). On the other hand, video
captured by the PTZ cameras can provide high resolution data, assuming that
successful active camera control is employed, based on tracking the person(s) of
interest [192]. Nevertheless, since the PTZ cameras are fixed in space, they can-
not necessarily obtain frontal views of the speaker. Clearly therefore, lipreading
⁵This work was supported by the European Commission under the integrated project CHIL, Computers in the Human Interaction Loop, contract number 506909.
from non-frontal views is required in this scenario, as well as fusion of multiple
camera views, if available. This scenario is the prime focus for this thesis, and
methods to solve these problems will be looked at throughout this document⁶.

Figure 3.3: The IBM smart room developed for the purpose of the CHIL project.
Notice the fixed and PTZ cameras, as well as the far-field table-top and array
microphones.
A total of 38 subjects uttering connected digit strings have been recorded
inside the IBM smart room, using two microphones and three PTZ cameras.
Of the two microphones, one is head-mounted (close-talking channel – see also
Figure 3.5) and the other is omni-directional, located on a wall close to the
recorded subject (far-field channel). The three PTZ cameras record frontal and
⁶The IBM smart-room database was able to be used as part of this thesis due to the candidate's internship with the IBM T.J. Watson Research Center.
two side views of the subject, and feed a single video channel into a laptop via
a quad-splitter and an S-video-to-DV converter. As a result, two synchronous
audio streams at 22 kHz and three visual streams at 30 Hz with 368×240-pixel
frames are available.

Figure 3.4: Examples of image views captured by the IBM smart room cameras.
In contrast to the four corner cameras (two upper rows), the two PTZ cameras
(lower row) provide closer views of the lecturer, albeit not necessarily frontal
(see also Figure 3.3).
Among these available streams, the far-field audio channel and two video
views, i.e. the frontal and the right profile (the view closest to the
profile pose, see Figure 3.5), were used in this thesis. Unless specified otherwise,
a total of 1661 utterances were used in this thesis, partitioned using a multi-
speaker paradigm into 1247 sequences for training (1 hr 51 min in duration), 250
for testing (23 min), and 164 sequences (15 min) that are allocated to a held-out
set.
Ideally, a speaker-independent paradigm would be used; however, this is very
hard to do due to the small number of speakers (38) in this dataset. In audio-only
speech recognition experiments, an all-in one-out (leave-one-out) type
arrangement can be developed to
overcome this problem, which would result in 38 different train/test sets. This
can also be carried out for the task of lipreading; however, it is much more
difficult, as a different visual front-end has to be developed for each training
set. This requirement is prohibitive due to the amount of time and computation
required to optimise each visual front-end for each set of training data. As
such, the multi-speaker paradigm has been preferred in lipreading and AVASR
experiments, as only one visual front-end is required. In the multi-speaker
paradigm, all speakers in the data set are represented in both the training and
test sets, although different sequences are used for each. This ensures that a
global model is obtained which provides good information about lipreading
performance.

Figure 3.5: Examples of synchronous frontal and profile video frames of four
subjects from the IBM smart-room database.
3.6.3 CUAVE Database
Another audio-visual database which contains speakers talking in non-frontal
poses is the Clemson University Audio-Visual Experiments or CUAVE database
[130]. The main motivation behind the creation of the CUAVE database was
to create a flexible, realistic and easily distributable database that allows for
representative and fairly comprehensive testing [130]. The CUAVE database
consists of two sections, with the first being the individual and the second being
the group section (see Figure 3.6). The individual section is designed to give
realistic conditions such as speaker movement, whilst the group section is included
to look at pairs of simultaneous speakers, which is the first data of its kind. As
the scope of this thesis is constrained to lipreading of a single speaker across
different poses, only the individual section was used as part of this thesis⁷.

Figure 3.6: Examples of sequences from the CUAVE database, which consists of
36 individual speakers and 20 group speakers. The top line gives examples of the
individual sequences, whilst the bottom gives examples of the group speaker
sequences.
The individual section of the CUAVE database was broken into two parts: the
first for isolated digits and the second for connected digits. As no profile
data was included in the connected-digits section, only the isolated-digits portion
was used. Each isolated-digits sequence was broken into the following four tasks:
1. Normal, where each speaker spoke 50 digits whilst standing still naturally,
2. Moving, where each speaker was asked to move side-to-side, back-and-forth,
or tilt the head while speaking 30 digits,
3. Right profile, where each speaker utters 10 digits in the right profile pose,
and
4. Left profile, where each speaker utters 10 digits in the left profile pose.
Examples of these tasks are given in Figure 3.7. In addition to performing
experiments on these four individual tasks, experiments on the combination of
all the tasks were also undertaken (see Chapter 8.4). As such, continuous video
data across all these tasks was required (i.e. the speaker in shot at all times).
Unfortunately, only 33 of the 36 speakers could be used for this task. As only
one sequence was available per speaker, a multi-speaker paradigm could not be
used for these experiments. As such, a quasi speaker-independent paradigm was
⁷Special mention must go to Clemson University for freely supplying their CUAVE database for this work.
used, which consisted of 10 different train/test sets, each with 25 speakers for
training and 8 speakers for testing. This is termed quasi speaker-independent
because it is not a fully speaker-independent task, as only one visual front-end
was developed.

Figure 3.7: Examples of the CUAVE individual sequences. The top three rows
give examples of the speaker rotating from left profile to right profile. The
bottom three rows give examples of the speaker moving whilst in the frontal pose.
3.7 Chapter Summary
In this chapter the topic of classifying visual speech was broached. The theory
behind the HMM was documented in detail, with the training and decoding
of the models described via the Baum-Welch and Viterbi algorithms
respectively. Both the feature and decision fusion strategies for combining
various streams were then analysed. Feature fusion methods were described
as the easier to implement, as they are simply a concatenation of the various
streams of data, although they cannot model the usefulness or reliability of the
various streams. In contrast, this can be done using decision-based methods.
Even though many decision-based methods exist, it was found that
the synchronous multi-stream HMM (SMSHMM) was the fusion technique best
suited to combining the various visual streams. The details of the HMM
parameters used in this thesis were then given, along with a brief mention of the
commands used to train and decode the HMMs using HTK [189].
This chapter concluded with a relatively thorough review of the audio-visual
databases currently available. From this review, it was found that, due to the
cost of capturing, storing and distributing audio-visual databases, nearly all
available corpora are restricted to fully frontal data for a small number of
speakers on small vocabulary tasks such as connected digit recognition. However,
as the cost of capturing and storing such databases falls, collecting audio-visual
data with various visual variabilities is not as major an issue as it once was. An
indication of this comes in the form of recently developed databases such as the
IBM smart-room and CUAVE databases, which contain variability with respect
to the speaker's pose. As this type of variability in the visual domain is the focus
of this thesis, these two databases were described as a prelude to the experiments
conducted in this thesis.
Chapter 4
Visual Front-End
4.1 Introduction
For a lipreading system to be of use, it has to be able to locate and track the
visible articulators which cause human speech. It is widely agreed upon that the
majority of these visible articulators emanate from the region around a speaker’s
mouth, otherwise known as the region-of-interest (ROI) [92]. The visual front-end
is responsible for locating, tracking and normalising a speaker’s ROI and can be
considered the most important module of a lipreading system. The reason for this
is that if the visual front-end does not accurately locate and track the speaker’s
ROI, then this error will filter throughout the system and cause erroneous results.
This effect is known as the front-end effect. There are many different factors which
can heighten this phenomenon such as pose, occlusion and illumination. In this
chapter, all the various aspects of a visual front-end for a lipreading system are
reviewed. As part of this review, a survey of current algorithms which can be used
as part of the visual front-end are examined and the algorithm chosen for this
thesis is fully described. This algorithm is known as the Viola-Jones algorithm
which is based on a boosted cascade of simple classifiers. The chapter concludes
by evaluating the implemented visual front-end on frontal pose data.
Before proceeding, it would be prudent to give a high-level description of the
visual front-end, as there is some conflict in the literature as to what it actually
constitutes. Some researchers [142] consider the visual front-end to consist of
locating and tracking the ROI and then extracting features from the ROI. For
this thesis, however, the visual front-end refers to just locating and tracking the
ROI. A depiction of this process is shown in Figure 4.1.

Figure 4.1: Block diagram of a visual front-end for a lipreading system. It is
essentially a three-step process: step 1 is face localisation, step 2 consists of
locating the mouth ROI, and step 3 is tracking the ROI over the video sequence.

As can be seen in
this figure, the visual front-end is essentially a three-phase, hierarchical process,
starting with locating a speaker’s face. Once the face has been located, facial
features such as the eyes, nose and mouth corners are then located. Based on the
positions of these facial features, the ROI is then defined. Once the ROI has been
found, tracking can be performed and the ROI can be extracted from each video
frame. Ideally, this process would be conducted on each video frame to give the
most accurate and up-to-date positions but, depending on the visual front-end
algorithm, this can prove too computationally expensive to perform. All
these issues are discussed in this chapter.
4.2 Front-End Effect
For the task of lipreading, the front-end effect can be formally defined as the
dependence of the lipreading system on having the ROI successfully located. The
impact of the front-end effect is best illustrated in Figure 4.2. From this figure
it can be seen that, if the ROI is located poorly, this noisy or corrupt input will
cascade throughout the system, which will most likely recognise the visual speech
incorrectly.

Figure 4.2: Depiction of the cascading front-end effect.

This effect can be expressed mathematically as
ηo = ηd × ηc (4.1)
where ηd is the probability that the ROI has been successfully located, ηc is the
probability that a correct decision is made given the ROI has been successfully
located and ηo is the overall probability that the system will recognise the correct
speech. Inspecting Equation 4.1, it can be seen that the performance of the
overall classification process ηo can be severely affected by the performance ηd of
the visual front-end.
In an ideal scenario, ηd = 1, so that more effort can be concentrated on im-
proving the performance of ηc, thus improving the overall lipreading performance.
A very simple way to ensure ηd approaches unity is through manual labeling of
the ROI. Unfortunately, due to the amount of visual data that needs to be dealt
with in a lipreading application, manual labeling is not a valid option. The
requirement for manually labeling the ROI also brings the purpose of any
lipreading system into question, due to the need for human supervision. With
these thoughts in mind, an integral part of any lipreading application is the
ability to make ηd approach unity via a highly accurate visual front-end.
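The multiplicative nature of Equation 4.1 can be illustrated numerically (the probabilities below are purely illustrative):

```python
def overall_probability(eta_d, eta_c):
    """Front-end effect (Eq. 4.1): the overall probability of correct
    recognition is the product of ROI-localisation success (eta_d) and
    classification success given a well-located ROI (eta_c)."""
    return eta_d * eta_c

# A strong classifier (90% correct on well-located ROIs) is dragged
# down by a front-end that only localises the ROI 80% of the time.
print(round(overall_probability(0.80, 0.90), 2))  # 0.72
print(round(overall_probability(1.00, 0.90), 2))  # 0.9
```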
4.3 Visual Front-End Challenges
Unfortunately, getting ηd towards unity is very difficult to achieve due to the
many variants that a visual front-end has to encounter. According to the survey
conducted by Yang et al. [186], the challenges associated with the visual front-end
can be attributed to the following six factors:
• Pose: The images of a face vary due to the relative camera-face pose
(frontal, profile etc.), and some facial features such as an eye or the nose
may become partially or fully occluded.
• Presence or absence of structural components: Facial features such
as beards, moustaches, and glasses may or may not be present and there is
a great deal of variability among these components, including shape, colour
and size.
• Facial expression: The appearance of a person’s face can vary due to
their expression (happy, sad, angry etc.).
• Occlusion: Faces may be partially occluded by other objects. In an image
with a group of people, some faces may partially occlude other faces.
• Image orientation: Face images directly vary for different rotations about
the camera’s optical axis.
• Imaging conditions: When the image is formed factors such as lighting
and camera characteristics affect the appearance of a face.
As the factors listed above show, the task of the visual front-end is quite complex.
As this is the case, some of the work conducted in lipreading has neglected the
visual front-end by manually locating the ROI, or has artificially located the ROI
via chroma-key methods [142]. Most of the work has focussed on data which has
been captured in ideal laboratory conditions. Almost all of the work has neglected
the variations of pose and orientation. The lack of work in these areas has stymied
the full deployment of a lipreading system, as the visual front-end cannot deal
with a whole array of difficult conditions.
However, as mentioned in the first chapter, this thesis is focussed on reme-
dying this situation by attempting to overcome the problems of pose as well as
normalizing for the various speaker structural components and image conditions.
As such, there is only one restriction placed on this work. That is:
• there is only one speaker in each video sequence and he/she is present
during the entire sequence
As there is only one speaker in shot in any video sequence, this task is referred to
as “localisation”, as only the position of the face and subsequent facial features has to be
found [186]. This is in contrast to the term “detection”, which refers to the much
more difficult task of first determining how many faces are in a video sequence
and then determining the location of the faces and their facial features [186].
As shown in the survey by Yang et al., there are over 150 published articles in
the well-established field of face and facial feature localisation/detection. Unfortunately,
from all this research there is still no single technique that works best in
all circumstances. In the next section, a brief review of the most popular visual
front-ends is given.
4.4 Brief Review of Visual Front-Ends
Yang et al. [186] categorize locating a person’s face and facial features into four
broad groups. These being:
1. Knowledge-based methods: These rule-based methods encode human
knowledge of what constitutes a typical face. Usually, the rules capture the
relationships between facial features. An example of such a method is the
multiresolution rule-based method [185].
2. Feature invariant approaches: The aim in this approach is to find struc-
tural features that exist even when the viewpoint, illumination or pose of
the person varies. Such features include colour, texture and edge informa-
tion. A variety of shape models can be employed from very simple expert
geometric models [190], to snakes [27], B-splines [148] or point distribution
models [34].
3. Template matching methods: These methods share characteristics of
the feature invariant and rigid template paradigms. In a similar fashion to
the rigid template approach, the geometric form and intensity information
of the object are dependent on each other. In this approach, however, the
template used to evaluate the intensity information of an object is non-
rigid. The intensity model is evaluated by gaining a cost function of how
similar the intensity values around or within the template are to the in-
tensity model describing the object. Unlike the rigid template approach,
an exhaustive search of the image using a deformable template is compu-
tationally intractable due to the exponential increase in the search caused
by the template being allowed to vary in both shape and position. Such
an operation can be made computationally tractable by employing quicker
minimisation techniques such as steepest descent [89, 190], downhill simplex
[89, 107] and genetic algorithms [33].
4. Appearance-based methods: In contrast to template matching, the
models (or templates) are learnt from a set of training images which should
capture the representative variability of facial appearance. These learned
models are then used for detection. These methods are designed mainly
for face detection (e.g. the eigenface [176], the distribution method [170], neural
networks [154], SVMs [98], the naive Bayes classifier [163], hidden Markov models
[124] and information-theoretic approaches [31, 95]).
The choice of visual front-end is dependent on the type of application it is being
used for and the conditions under which the video was captured. In lipreading
literature, appearance based approaches have been widely used achieving good
results [96, 121, 154, 184]. A major reason for this is that they are well suited
to many different objects (face, eyes, nose, mouth corners etc.), under many dif-
ferent conditions due to their probabilistic nature. These techniques are good at
finding a crude ROI which is all that is required for the appearance-based visual
feature extraction process (see Chapter 5.2.1). Feature invariant approaches have
been widely used for lip contour localisation and tracking. Under this approach,
methods based on colour [20, 99, 148, 174], edges [149] as well as localised tex-
ture [99] have been used to gain a geometric model of the lips. However, these
approaches require extremely precise localisation of lip features and are highly
susceptible to errors in conditions of poor illumination and speaker movement.
The template matching method was first applied by Yuille et al. [190] for
mouth and eye localisation using expert based appearance and shape models. In
this approach an expert deformable template of the eyes and labial contour is
fitted to an intensity model, by calculating a cost function based on the grayscale
intensity edges, valleys and peaks around the template boundary. The search
strategy uses the steepest descent algorithm to fit the template. Unfortunately,
due to the heuristic nature of the shape and intensity models, the approach
has poor performance when applied across a large number of subjects. Cootes et
al. [34] devised a similar technique for building a deformable template incorporat-
ing texture and shape models through exemplar learning. The technique used a
deformable template known as an active shape model (ASM). The ASM was able
to statistically learn allowable variations in shape of an object from pre-labeled
object shapes in a point distribution model (PDM) [33, 34]. Intensity informa-
tion about the object was also statistically learnt. In this approach a number
of grayscale profile vectors were extracted normal to set points around the de-
formable template. All these vectors were concatenated into a matrix known as
global profile vectors from which variations in intensity were statistically modeled
as a grey level profile distribution model (GLDM) [33, 115]. Luettin [107] ap-
plied ASMs to lip contour localisation, using the downhill simplex minimisation
technique to fit the lip shape model described by a PDM to an image containing
a mouth.
Matthews et al. [115] used another type of statistically learnt deformable
template approach to fit a lip shape model to an image containing a mouth. This
type of deformable template is referred to as an active appearance model (AAM)
and was first developed in [33]. This approach, similar in many respects to ASMs,
uses a PDM to statistically learn the shape variations of the object. The inten-
sity model for the object is learnt by warping the intensity information contained
within the deformable template back to the mean shape position. This warped
intensity information is then used to statistically model the distribution of inten-
sity values of an object whose shape has been normalised. The statistical nature
of the intensity model allows AAMs to be used for detection as well as location
purposes. AAMs have been applied to the task of lip contour detection/location,
using a genetic algorithm for minimisation [115]. ASMs and AAMs have been
used with much success in whole facial feature location, where an entire model
of the face (i.e. including eyes, lips, nose and jawline) is located. Unfortunately,
the minimisation techniques required to fit ASMs and AAMs are highly
sensitive to initialisation and do not guarantee convergence to an acceptable minimum.
Although deformable template approaches, namely ASMs and AAMs, have
been shown in the literature to be useful for face/eye detection and mouth location/tracking
in lipreading applications [107, 108, 115], the problems associated with
searching for a minimum make detection/location performance largely unreliable,
and these approaches require a massive amount of annotated training data. This was highlighted in
the large-vocabulary experiments conducted by Matthews et al. [116], where this
was suspected to be the case.
All these methods just presented have assumed a single camera. Recently
however, Goecke [56] has presented a novel real-time lip localisation and track-
ing algorithm based on video data captured on a stereo camera using colour
information and prior knowledge of the mouth area. By using stereo vision in a
calibrated camera system, the 3D coordinates of object points could be recovered
which enabled speakers to act normally and move freely within a constrained
environment. This approach is extremely attractive as it lends itself to tracking
of a speaker’s mouth ROI across multiple views. A caveat on this however, is
that the video must be captured via a stereo camera which greatly limits the use
of this approach.
As mentioned previously, no one visual front-end has shown itself to be superior
to the others for the task of lipreading. This may be because they are not
robust across different conditions, are only useful for certain speakers, or are
too computationally expensive to run in real-time. Recently however, Viola
and Jones [180] introduced an algorithm based on a boosted cascade of simple
classifiers. Through this novel technique, they were able to obtain extremely
high accuracy in real-time. Seeing that this framework is extremely quick and
generic, it is amenable to having multiple visual front-ends running in parallel
on any type of visual data, which allows multiple-pose face and facial feature
localisation to take place [82]. Even though there are a small number of visual
front-ends which can handle non-frontal data [97, 155, 163], the Viola-Jones algo-
rithm provides a framework that can localise faces and facial features regardless
of pose and in real-time. For these reasons, this particular method was chosen as
the visual front-end for use in this thesis, as it satisfied all the requirements of
the thesis objectives. The Viola-Jones algorithm is described in the next
section.
4.5 Viola-Jones algorithm
In 2001, Viola and Jones [180] proposed a rapid object detection scheme based
on a boosted cascade of simple “haar-like” features. Since then, this work has
revolutionised the field of computer vision, as it has provided an object detection/localisation
framework that is extremely quick and accurate, which is imperative
for real-time tasks. As it is based on a set of training examples, it can be
used for any object detection/localisation task. This is especially beneficial for
the case of lipreading, where extremely quick face and facial feature localisation
is required, as well as robustness to the variations associated with pose and
environmental conditions. This section is devoted to giving a brief description
of the algorithm and how it is applicable to a lipreading visual front-end. The
Viola-Jones algorithm is essentially a three-step process. Each of the steps
listed below is described in the following subsections:
1. Feature representation of images using “haar-like” features;
2. Selecting “weak” classification functions using a learning algorithm known
as “boosting”; and
3. Cascading the “weak” classifiers into a final “strong” classifier.
4.5.1 Features
The Viola-Jones algorithm employs a feature representation of the images instead
of pixels. This is done for two reasons. Firstly, it is much quicker than using pixels
and secondly, the features encode knowledge within the image which is difficult
to learn using a finite amount of training data [180]. The feature representation
used in this algorithm are termed “haar-like” because they are similar to the over-
complete Haar basis functions used by Papageorgiou et al. [129]. The original
(a) Original set of haar-like features
(b) Extended set of haar-like features
Figure 4.3: Comparison of the feature sets used by: (a) Viola and Jones with the original 4 haar-like features; and (b) Lienhart and Maydt with their extended set of 14 haar-like features, including their rotated features. It is worth noting that the diagonal line feature in (a) is not utilised in (b).
set of four features was later extended by Lienhart and Maydt [94] to fourteen,
by introducing features which were rotated by 45°. The motivation behind using
these extended features was that they add additional domain knowledge to the
learning framework which is otherwise hard to learn. Lienhart and Maydt showed
that improved performance is achieved with this set of extended features, with an
average 10% reduction in the false-alarm rate at a given hit rate. A comparison
of these two feature sets is given in Figure 4.3. The value of these haar-like
features is calculated by subtracting the sum of the pixels within the white rectangles
from the sum of the pixels in the black rectangles.
If the object of interest, say a face, was 16 × 16 pixels within an image, the
number of features derived could be well over 100 000 for that face. This is
because the features shown in Figure 4.3(b) are found by sliding over the face
at different locations and scales in both the x and y directions. These features
can however be computed extremely rapidly using the integral image [180]. The
upright integral image at location (x, y) contains the sum of the pixels above and
to the left of (x, y) inclusive:

iiu(x, y) = Σ_{x′≤x, y′≤y} i(x′, y′)    (4.2)

where iiu(x, y) is the integral image and i(x, y) is the original image. The integral
image for the upright rectangle features can be computed in one pass over the
original image using
iiu(x, y) = iiu(x, y − 1) + iiu(x − 1, y) + i(x, y) − iiu(x − 1, y − 1)    (4.3)
where iiu(−1, y) = 0 and iiu(x,−1) = 0. Figure 4.4 shows how the integral image
can be used to determine the rectangular sum using four point references. Given
that: the value at point 1 is the sum of the pixels in A; point 2 is the sum of
the pixels within A + B; point 3 is the sum of the pixels within A + C; and
point 4 is the value of the pixels within A + B + C + D; the sum of the pixels
in rectangle D can be computed as 4 + 1 − (2 + 3). Since the two-rectangle features
defined above involve adjacent rectangular sums, they can be computed in six
array references, eight in the case of the three-rectangle features, and nine for
four-rectangle features. All these values, once computed, are stored in a look-up
table and can be accessed to calculate the features.
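As a concrete sketch of these ideas (not the thesis implementation; all names are illustrative), the integral image of Equation 4.2, the four-reference rectangle sum of Figure 4.4, and the value of a horizontal two-rectangle haar-like feature can be computed as follows:

```python
import numpy as np

def integral_image(img):
    """Upright integral image (Eq. 4.2): entry (y, x) holds the sum of all
    pixels above and to the left of (x, y), inclusive."""
    return np.cumsum(np.cumsum(np.asarray(img, dtype=np.int64), axis=0), axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum over the w x h rectangle with top-left pixel (x, y), using the
    four array references of Figure 4.4: D = 4 + 1 - (2 + 3)."""
    def at(r, c):                 # references outside the image count as zero
        return int(ii[r, c]) if r >= 0 and c >= 0 else 0
    return (at(y + h - 1, x + w - 1) + at(y - 1, x - 1)
            - at(y - 1, x + w - 1) - at(y + h - 1, x - 1))

def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle haar-like feature: the sum over one
    w x h half minus the sum over the adjacent half."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)
```

For any rectangle, `rect_sum` needs only four look-ups regardless of the rectangle's size, which is what makes exhaustive feature evaluation tractable.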
The integral image can also be computed easily for the rotated features. At a
given point in the image, the sum of the pixels of a 45° rotated rectangle, with its
bottom-most corner at the given point and extending upwards to the boundaries
of the image, is calculated in the same manner as in Equation 4.2. The rotated integral
image, iir(x, y), can also be calculated in one pass from left to right and top to
bottom over all pixels by:
Figure 4.4: Example of how the integral image can be used for computing upright rectangular features.
Figure 4.5: Example of how the rotated integral image can be used for computing rotated features.
iir(x, y) = iir(x − 1, y − 1) + iir(x + 1, y − 1) − iir(x, y − 2) + i(x, y) + i(x, y − 1)    (4.4)

where iir(−1, y), iir(x, −1) and iir(x, −2) = 0. Like the example shown for the
upright case in Figure 4.4, the rotated integral image can also be used to calculate
any rotated sum by four point references as shown in Figure 4.5. In this example,
the area within D can be found exactly the same as the previous example with
D = 4 + 1− (2 + 3).
4.5.2 Classification
As mentioned in the previous subsection, there are over 100 000 features asso-
ciated with an object of size 16 × 16 pixels within an image. Even though the
integral image allows for quick computation of these features, this number is pro-
hibitively large to process. To counter this, Viola and Jones hypothesized that
only a small number of these features were required to successfully detect/locate
the object of interest. They overcame the challenge of selecting which features
to use via “AdaBoost” which was initially proposed by Freund and Schapire
[49]. AdaBoost is a learning algorithm which combines the performance of many
“weak” classifiers to produce a “strong” final classifier which has good gener-
alisation performance [129, 162]. Viola and Jones used a variant of AdaBoost
by constraining the weak classifier to be dependent on a single feature. The
AdaBoost procedure chose this single feature as it was the best in separating
the positive and negative examples in the training dataset (see next section for
description on training). This was done by determining the optimal threshold
classification function for each feature such that the minimum number of exam-
ples are misclassified. A weak classifier hj(x) thus consisted of a feature fj, a
threshold θj and a polarity pj indicating the direction of the inequality sign, so
that
hj(x) = 1 if pj fj(x) < pj θj, and 0 otherwise    (4.5)
where x is a sub-window of the image. Taken from [180], the process for finding
these parameters is as follows:
1. Given n example images (x1, y1), . . . , (xn, yn), where xi is the sub-window
image of the object/background within the entire image and yi = 0, 1 for
negative and positive examples respectively.
2. Initialize weights w1,i = 1/(2m) for yi = 0 and w1,i = 1/(2l) for yi = 1, where m is the
number of negative examples and l is the number of positive examples.
3. For t = 1, . . . , T :
(a) Normalize the weights, wt,i ← wt,i / Σ_{j=1}^{n} wt,j, so that wt is a probability distribution.
(b) For each feature j, train a classifier hj which is restricted to using a
single feature. The error is evaluated with respect to wt: εj = Σi wt,i |hj(xi) − yi|.
(c) Choose the classifier ht with the lowest error εt.

(d) Update the weights:

wt+1,i = wt,i βt^(1−ei)

where ei = 0 if example xi is classified correctly, ei = 1 otherwise, and
βt = εt/(1 − εt).
4. The final strong classifier is:

h(x) = 1 if Σ_{t=1}^{T} αt ht(x) ≥ (1/2) Σ_{t=1}^{T} αt, and 0 otherwise,

where αt = log(1/βt).
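The boosting loop above can be sketched compactly for single-feature threshold stumps. This is an illustrative toy implementation on generic feature vectors (exhaustive threshold search over observed values, with a small epsilon guarding βt when the error reaches zero), not the thesis's training code:

```python
import numpy as np

def train_stump(X, y, w):
    """Exhaustive search for the single-feature threshold stump (Eq. 4.5)
    with minimum weighted error. Returns (error, feature, threshold, polarity)."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for p in (1, -1):
                pred = (p * X[:, j] < p * theta).astype(int)
                err = np.sum(w * np.abs(pred - y))
                if err < best[0]:
                    best = (err, j, theta, p)
    return best

def adaboost(X, y, T):
    """Boosted combination of T single-feature stumps, following steps 1-4."""
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))       # step 2
    stumps, alphas = [], []
    for _ in range(T):
        w = w / w.sum()                                       # step 3(a)
        err, j, theta, p = train_stump(X, y, w)               # steps 3(b)-(c)
        beta = max(err, 1e-10) / max(1.0 - err, 1e-10)
        e = np.abs((p * X[:, j] < p * theta).astype(int) - y)
        w = w * beta ** (1 - e)                               # step 3(d)
        stumps.append((j, theta, p))
        alphas.append(np.log(1.0 / beta))
    def classify(x):                                          # step 4
        score = sum(a for a, (j, th, p) in zip(alphas, stumps)
                    if p * x[j] < p * th)
        return int(score >= 0.5 * sum(alphas))
    return classify
```

In the Viola-Jones setting, each column of `X` would be the response of one haar-like feature over all training sub-windows.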
Viola and Jones gave an example of what the selected features from this
process were for the task of face localisation. In the example they gave, they
mentioned that the first feature selected was across areas of the eyes, nose and
cheeks. It was suggested that this feature was chosen because the eyes, being dark,
contrast with the lighter areas of the cheeks and nose. It was also mentioned that
the first feature was relatively large with respect to the face, and was insensitive
to the size and location of the face. This characteristic highlights the ability of the
features to scale well and adapt to the face being in various locations.
This example is replicated from [180] in Figure 4.6 to highlight this point.
4.5.3 Cascading the Classifiers
Instead of having one “strong” classifier to detect/localise all objects of interest,
Viola and Jones proposed cascading a series of “weak” classifiers,
increasing in complexity at each stage, to dramatically increase the speed
Figure 4.6: Example of the first feature selected by AdaBoost. It has selected the feature across the eye, nose and cheek areas, possibly due to the contrast in colour.
Figure 4.7: Example of a face localiser based on a boosted cascade of 20 simple classifiers. If the hit rate for each classifier is 0.9998 and the false-alarm rate is set to 0.5, then the overall localiser should be able to yield a hit rate of 0.9998^20 = 0.9960 and a false-alarm rate of 0.5^20 = 9.54 × 10^−7.
of the detector/localiser. This is achieved by focussing the attention of the detector/localiser
on the regions of the image which are most likely to contain the
object of interest, which is performed via the cascade. The cascade, which essentially
takes the form of a decision tree, is a simple process where a positive result
from the first classifier on a given sub-window triggers the second classifier, and
so on. A negative outcome at any stage of the cascade leads to the rejection of
that sub-window. An illustration of a 20-stage cascade is shown in Figure 4.7,
for the task of face localisation.
If the false-alarm rate was set to 0.5, then after the first five stages, the
detector/localiser would have eliminated nearly 97% of the non-object windows,
which allows more computation on the areas of the image which may contain the
object. This cascading framework, allows extremely rapid detection/localisation
of objects, as it is very efficient in determining whether a sub-window is possibly
the object of interest or not. Since, in face and facial feature localisation, most sub-windows
are not objects of interest, this particular characteristic bodes well for
the localisation of these objects in every frame. By setting the hit rate high and
the false-alarm rate to a reasonable value, very good performance can be
obtained. In Figure 4.7, a 20-stage boosted classifier is shown. If the hit rate is
0.9998 and the false-alarm rate is set to 0.5, then the overall localiser can obtain
a hit rate of 0.9998^20 = 0.9960 with a false-alarm rate of 0.5^20 = 9.54 × 10^−7.
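The cascade arithmetic quoted above is easy to verify directly, under the stated assumption that a window must pass every stage and that stage decisions are independent:

```python
def cascade_rates(stage_hit, stage_fa, n_stages):
    """Overall hit and false-alarm rates of an n-stage cascade: a window is
    accepted only if every stage accepts it, so each overall rate is the
    per-stage rate raised to the number of stages."""
    return stage_hit ** n_stages, stage_fa ** n_stages

hit, fa = cascade_rates(0.9998, 0.5, 20)
print(f"overall hit rate: {hit:.4f}")          # 0.9960
print(f"overall false-alarm rate: {fa:.2e}")   # 9.54e-07
# Non-object windows surviving the first five stages: 0.5**5 = 3.125%,
# i.e. nearly 97% of background windows are rejected after only five stages.
```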
From this section, it can be seen that the Viola-Jones algorithm gives a frame-
work for which accurate yet quick object detection/localisation can take place.
Probably the key to this algorithm is to choose simple classifiers which can reject
the majority of the sub-windows before more complex classifiers are called into
action. However, it must be noted that these simple classifiers are determined
from the positive and negative examples that are given to them in the training
phase, so it is imperative that an exhaustive set of positive and negative images
is provided so that good generalisation is achieved.
The next section describes this training process of the Viola-Jones algorithm
through the implementation of the visual front-end for the frontal pose. This
visual front-end produced the ROIs which were used as the basis of this thesis.
4.6 Visual Front-End for Frontal View
The visual front-end for the frontal view was implemented and tested on the
frontal view data from the IBM smart-room database (see Chapter 3.6.2). This
visual front-end and the eventual ROI extraction were devised using a similar
hierarchical strategy to that of Cristinacce et al. [38], where the facial feature
localisation was based on the search areas defined by the previously localised feature
points. Both the face and the facial features were localised using the boosted
cascade of classifiers based on the work described in the previous section. The
classifiers were generated using OpenCV libraries [128].
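The coarse-to-fine search that such a hierarchical front-end performs can be sketched generically. The function below is illustrative only: the detector arguments stand in for trained boosted-cascade classifiers (e.g. OpenCV's `CascadeClassifier.detectMultiScale`), and restricting the mouth search to the lower half of the face is an assumption of the sketch rather than the thesis's exact search regions:

```python
import numpy as np

def locate_mouth(gray, detect_faces, detect_mouths):
    """Hierarchical search: localise the face first, then search for the
    mouth centre only within the lower half of the located face.
    `detect_faces` / `detect_mouths` return (x, y, w, h) boxes."""
    for (x, y, w, h) in detect_faces(gray):
        lower_face = gray[y + h // 2 : y + h, x : x + w]
        for (mx, my, mw, mh) in detect_mouths(lower_face):
            # Map the mouth box back into full-image coordinates.
            return (x + mx, y + h // 2 + my, mw, mh)
    return None  # no face (or no mouth within a face) was found
```

Constraining each feature search to a region defined by the previous localisation both speeds up the search and reduces false detections, which is the point of the hierarchical strategy.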
The positive examples used for training these classifiers were obtained from
a set of 847 training images taken from the training set of visual speech utter-
ances, with 17 manually labeled points for each face. As only the ROI was to
Figure 4.8: Points used for facial feature localisation on the face: (a) right eye, (b) left eye, (c) nose, (d) right mouth corner, (e) top mouth, (f) left mouth corner, (g) bottom mouth, (h) mouth center, and (i) chin.
be extracted, it was decided that 9 of the 17 manually labeled points were to be
used to somewhat simplify the process. These points were the: left eye; right eye;
nose; right mouth corner; top mouth; left mouth corner; bottom mouth; center
mouth; and chin; and are depicted in Figure 4.8. This provided 847 positive
examples for all 9 facial features.
The resulting positive examples for the face were further augmented by including
rotations in the image plane by ±5 and ±10 degrees, as well as mirroring the
images, providing 5082 positive examples. As a number of the facial features were
located so close to each other (a matter of pixels in some cases), it was decided
not to include rotated examples of the facial features. The positive examples for
the face were all normalized to 16 × 16 pixels, based on the distance of 6 pixels
between the eyes. Examples of the face templates are shown in Figure 4.9.
The negative face examples consisted of a random collection of approximately
5000 images which did not contain any faces. Some of them were of the back-
ground within the face images, as well as random objects. A small array of these
examples are shown in 4.10. The majority of these images were of a high resolu-
tion in comparison to the face images (around 360× 240 pixels), so that enough
negative sub-windows could be used to train up the classifier adequately. This
was very important, as the Viola-Jones algorithm disregards most of the negative
examples in the first few stages, so it was vital that there was an abundance of
Figure 4.9: Example of the 16 × 16 frontal faces from the IBM smart-room database used for this thesis.
background examples to satisfy this requirement. Having background images of
high resolution was one way of overcoming this.
The eye classifiers were trained using image templates of size 20×20, the nose
and chin using templates of 15 × 15, and the right, top, left and bottom mouth
templates were of size 10× 10. The mouth center templates were of size 24× 24,
and this classifier was used to find a coarse ROI so that further refinement could
take place, hence the larger template size. All these templates were taken from
normalised face images of size 64× 64, based on a distance of 32 pixels between
the eyes. Figure 4.11 gives an example of the facial feature templates used to
train the various classifiers. As the face localisation step reduces the search
space for the facial feature localisation, the negative examples for the various
facial features consisted of images of other facial features. This was done to
alleviate the confusion that might have occurred due to various facial features
Figure 4.10: Example of the negative images used for training of the face classifier.
looking alike, i.e. the mouth open can appear like an eye in various illumination
conditions. Examples of the negative images used to train up the facial feature
classifiers are shown in Figure 4.12.
Due to the lack of manually labeled faces available, all classifiers were tested
on a small validation set of 37 images, which would give an indication of which
particular features would give the best chance of reliably tracking the localised
features. The results are shown in Table 4.1. Normally, the distance between
the eyes (deye) is used as a measure of performance for the task of facial feature
localisation as they have been long regarded as an accurate measure of the scale
of a face [80]. As such, a facial feature was not considered located if the distance
between the estimated position of the feature (p̂f) and its manually annotated position
(pf) was more than 10% of the annotated distance between the eyes (i.e. for a feature
to be deemed located, ‖p̂f − pf‖ < 0.1 × deye) [81].
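This localisation criterion can be written down directly; a minimal sketch with illustrative function and parameter names:

```python
import numpy as np

def feature_located(p_est, p_true, eye_left, eye_right, tol=0.1):
    """A feature counts as located when the estimate lies within 10% of the
    annotated inter-eye distance of its ground-truth position, i.e.
    ||p_est - p_true|| < tol * d_eye."""
    d_eye = np.linalg.norm(np.subtract(eye_right, eye_left))
    return bool(np.linalg.norm(np.subtract(p_est, p_true)) < tol * d_eye)
```

Normalising the error by the inter-eye distance makes the criterion independent of how large the face appears in the image.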
As can be seen from this table, most of the facial features were located at
Figure 4.11: Example of the templates used for the training of the frontal facial features. The ROI shown on the right is an example of the mouth center template.
Facial Feature Accuracy (%)
Right Eye 91.08
Left Eye 89.47
Nose 89.47
Right Mouth 91.08
Top Mouth 81.08
Left Mouth 89.47
Bottom Mouth 83.78
Center Mouth 89.47
Chin 67.57
Table 4.1: Facial feature point detection accuracy results for frontal pose
a high rate, with the exception of the chin and the top and bottom mouth points. As the final
extracted mouth ROI needed to be normalised for scale and rotation to enforce
alignment across all the different ROI images, two geometrically aligned points
had to be found for this to happen. In the literature, eye locations are normally used
for such alignment. However, it was found heuristically that this metric was not
ideal for scaling the mouth, as there is a great deal of variability in mouth shape
and size, which did not appear to be correlated with the distance between the
eyes. As such, it was determined that the left and right mouth corners would
be used as these gave much better reference points for the scale and rotation
normalisation to occur. Upon inspection, the face localisation accuracy on this
Figure 4.12: Example of negative images used for the training of the frontal facial feature classifiers: (a) eyes, (b) nose, (c) mouth region, (d) right mouth, (e) top mouth, (f) left mouth, (g) bottom mouth, (h) chin.
validation set was 100%.
The visual front-end used to extract the mouth ROI for the frontal pose is
outlined in Figure 4.13. Given the video of a spoken utterance, face localisation
is first applied to estimate the position of the speaker’s face. As the classifier is
able to scale well, an image pyramid approach to search at different scales was not
required. Once the face was located, the eyes were searched over specific regions
of the face (based on training data statistics). Once these eye location were found,
a general mouth search region was specified. The mouth center classifier was then
used to refine this search region. The resulting mouth region was then used as the
search region to locate the right and left mouth corners. Once these two points
were found, the extracted mouth ROI was then rotated so that these two points
Figure 4.13: Block diagram of the visual front-end for the frontal pose.
were aligned horizontally and scaled to be 20 pixels apart to yield a final 32× 32
pixel ROI to be used in the lipreading system. The final ROI contained most of
the lower part of the face. In a comprehensive review conducted by Potamianos
and Neti [140], they found that improved lipreading results can be obtained by
having the jaw and cheeks included in the final ROI compared to the ROI just
containing the lips. This finding is also supported by human perception studies
which show that visual speech perception is improved when the entire lower face
is visible [169]. It must also be noted that the final 32 × 32 ROI was downsampled
from a much higher resolution (on average approximately 80 × 80 pixels) in order
to keep the dimensionality low. Such
a method is not expected to affect lipreading performance, according to the
work conducted by Jordan and Sergeant [83].
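The geometric normalisation described above (rotating the ROI so the mouth corners lie on a horizontal line, 20 pixels apart, within a 32 × 32 pixel output) amounts to a similarity transform. A minimal sketch in Python/NumPy, assuming the corner midpoint is placed at the centre of the output ROI (that placement is an illustrative assumption, not a detail from the thesis):

```python
import numpy as np

def mouth_roi_transform(left_corner, right_corner, out_size=32, corner_gap=20):
    """Similarity transform mapping the detected mouth corners onto a
    horizontal line, corner_gap pixels apart, centred in an
    out_size x out_size ROI (target placement is an assumption)."""
    left = np.asarray(left_corner, dtype=float)
    right = np.asarray(right_corner, dtype=float)
    d = right - left
    angle = np.arctan2(d[1], d[0])      # rotation needed to level the corners
    scale = corner_gap / np.hypot(*d)   # scale so corners end up corner_gap apart
    c, s = np.cos(-angle), np.sin(-angle)
    R = scale * np.array([[c, -s], [s, c]])
    # place the midpoint of the two corners at the centre of the output ROI
    mid = (left + right) / 2.0
    t = np.array([out_size / 2.0, out_size / 2.0]) - R @ mid
    return R, t

R, t = mouth_roi_transform((40.0, 60.0), (110.0, 75.0))
print(R @ np.array([40.0, 60.0]) + t)   # left corner maps to (6, 16)
print(R @ np.array([110.0, 75.0]) + t)  # right corner maps to (26, 16)
```

The ROI pixels would then be resampled through this transform (e.g. via an affine warp) to yield the normalised 32 × 32 image.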
Following the ROI localisation, the ROI is tracked over consecutive frames.
If the detected ROI is too far away from the previous frame's location, it is regarded as
a detection failure and the previous ROI location is used. A mean filter is then
used to smooth the tracking. Due to the speed of the Viola-Jones algorithm, this
process was performed on every frame. Prior to this full process beginning, an
initialization phase is executed to get an initial lock on the location of the various
facial features.
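The tracking logic described above can be sketched as follows; the jump threshold and mean-filter window here are illustrative assumptions, not values taken from the thesis:

```python
import numpy as np

def smooth_track(detections, max_jump=8.0, window=5):
    """Per-frame ROI tracking: a detection that jumps too far from the
    previously accepted location is treated as a detection failure and
    replaced by the previous location; a running mean filter then smooths
    the accepted track."""
    accepted = []
    for det in detections:
        det = np.asarray(det, dtype=float)
        if accepted and np.linalg.norm(det - accepted[-1]) > max_jump:
            det = accepted[-1]          # detection failure: reuse previous ROI
        accepted.append(det)
    smoothed = []
    for i in range(len(accepted)):
        lo = max(0, i - window + 1)     # trailing mean over the last `window` frames
        smoothed.append(np.mean(accepted[lo:i + 1], axis=0))
    return smoothed

# frame 3 jumps wildly and is rejected; the mean filter smooths the rest
track = smooth_track([(100, 100), (101, 100), (160, 140), (102, 101)])
print(track[3])  # [101.0, 100.25]
```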
4.7. Chapter Summary 71
Figure 4.14: Mouth ROI extraction examples. The upper rows show examples of the localised face, eyes, mouth region and mouth corners. The lower row shows the corresponding normalised mouth ROIs (32 × 32 pixels).
Overall, the performance of the visual front-end was very good, with it ap-
pearing to generalise well across all the different variations present in the dataset,
such as appearance and illumination. It should be noted that there were only a
few poorly tracked or mistracked ROIs in the dataset, which could be attributed
to random head movement. As it was assumed that little face movement would
occur, strict thresholds were set to minimise the amount of allowable movement
in the facial features. These failures were alleviated, however, by relaxing such con-
straints and performing localisation on each frame. Figure 4.14 shows face and
facial feature localisation examples from the visual front-end and the final ex-
tracted mouth ROIs.
4.7 Chapter Summary
A visual front-end which can automatically and accurately localise face and facial
features positions quickly is of the utmost importance for a lipreading system.
However, as it was noted in this chapter, this task is difficult due to the many
variations the visual front-end has to deal with such as pose, illumination, ap-
pearance and occlusion. If the system cannot deal with these variations, poor
localisation of the mouth ROI will inevitably take place, which will affect the
overall accuracy of the lipreading system due to the front-end effect. This chapter
reviewed various approaches to the visual front-end, especially focussing on the
72 Chapter 4. Visual Front-End
Viola-Jones algorithm which is both extremely rapid and accurate across all dif-
ferent conditions. This algorithm was then implemented for extracting the mouth
ROIs for the frontal pose scenario, achieving accurate results. The next step after
extracting the mouth ROIs is to extract visual features from them. This process
is described in the next chapter.
Chapter 5
Visual Feature Extraction
5.1 Introduction
The visual feature extraction step seeks to find representations of the given ob-
servations that provide discrimination between the various speech units whilst
providing invariance to irrelevant transforms on the observations that are in the
same class. Ideally, the task of the visual feature extraction step is to yield visual
speech features which make the job of the classifier trivial, i.e. the features would
already be clustered into their separated classes without overlapping. However,
due to the many variations within the mouth ROI, such as illumination, appear-
ance, viewpoint, alignment, speaking style and, of course, the high dimensionality
associated with image/video data, the task of finding features which provide good
speech discrimination is extremely difficult.
Over the past twenty or so years, various sets of visual features for lipreading
have been proposed in the literature [142]. In general, they can be grouped into
three groups: appearance-based, contour-based, or a combination of both. The
first part of this chapter is dedicated to briefly reviewing these. Even though no
one technique has shown itself to be superior to the others, appearance-based
methods have been preferred by many researchers, as they are motivated by human
perception studies and do not require finer localisation and tracking, which
reduces the impact of the front-end effect (see Chapter 4.2). As such, the latter
part of the chapter focusses on an appearance-based technique which is considered
73
74 Chapter 5. Visual Feature Extraction
as the current state-of-the-art. This technique is based on
a cascade of appearance features, and in this chapter each stage of the cascade is
investigated to determine its relative impact, which is a novel contribution. Even
though this technique works well, it has shortcomings, such
as dimensionality constraints. Making use of the laterally symmetrical nature
of the frontal ROI has been shown to alleviate some of these problems and this
is investigated in this chapter as well. Another potential method which shows
promise is through the use of patches. Motivated by the frontal ROI symmetry
work, novel analysis of the ROI via patches is introduced. The idea behind the
use of patches is that if there are areas of the ROI which are more pertinent to the
task of lipreading, then these areas can be weighted higher to improve lipreading
performance. The remainder of the chapter is dedicated to developing a new
multi-stream visual feature extraction technique which fuses the more pertinent
areas of the ROI together to gain a better representation.
5.2 Review of Visual Feature Extraction Tech-
niques
Potamianos et al. [142] divided the ways visual speech could be represented into
the following three categories:
(i) appearance based,
(ii) contour based, and
(iii) combination of appearance and contour based features.
The following section gives a brief review of the progress that has been made over
the past twenty years with respect to these approaches, citing the advantages
and disadvantages associated with them. The section concludes by comparing
the three approaches.
5.2. Review of Visual Feature Extraction Techniques 75
5.2.1 Appearance Based Representations
Appearance based representations are concerned with transforming the whole in-
put ROI image into a single meaningful feature vector. This method is motivated
by the fact that in addition to the lips, the visible speech articulators such as
the teeth, tongue, jaw as well as certain facial muscle movement are informative
about the visual speech [167]. To incorporate such features, the ROI is normally
a square or rectangular region around the speaker’s mouth, as was described in
the previous chapter. However, there is no fixed dimension, shape or size that
the ROI has to conform to. Some researchers have used the entire face for the
ROI [116], a three-dimensional rectangle to capture temporal nature of the signal
[136], or a disc around the speaker's mouth [45]. Instead of raw pixel values, differ-
ence images [161] or optical flow values [65, 112, 171] have also been used as the
ROI representation. As the dimensionality of the ROIs is generally too high to
be applied successfully to
a statistical classifier such as an HMM [147], the goal of these methods has been
to find a compact representation of the ROI, which is low in dimensionality but
retains most, if not all, of the visual speech information. This limitation on the
number of features allowed to be used is also known as the curse of dimensionality
[8].
The dimensionality reduction problem is similar to that encountered in face
recognition, where a compact representation of a face image is desired. As such,
it is not surprising to see that a lot of the work performed in lipreading has mir-
rored the work done in face recognition. After Turk and Pentland published their
ground breaking paper on eigenfaces [176], where principal component analysis
(PCA) was used for feature reduction, Bregler and Konig [13] later introduced
eigenlips. This work was based on the same idea, except that PCA would be per-
formed on the mouth ROI, not the face. Since then, PCA has been a very popular
appearance based method used for lipreading with many researchers achieving
good results [12, 13, 27, 45, 47, 98, 102, 137, 175]. Independent component
analysis (ICA) has also been used for lipreading [64]. The goal of ICA [78] is
to find a transform whose components are statistically independent, not
just uncorrelated. However, results obtained using such a transform were not
significantly better than traditional
76 Chapter 5. Visual Feature Extraction
PCA representations for lipreading [64].
Linear image transforms such as the discrete wavelet transform (DWT) [134,
137] and the discrete cosine transform (DCT) have been employed in various sys-
tems [45, 62, 72, 98, 106, 126, 137, 158, 161]. These non-data driven approaches
do not produce any compression, but make the transformed coefficients more
amenable to quantisation by removing much of the statistical redundancy in the
image. The DCT is the basis of many low-rate algorithms in use today such as
MPEG for High Definition Television (HDTV) and JPEG for still images. These
image transforms usually allow fast implementation, when the image size is a
power of 2, by use of the Fast Fourier Transform (FFT), and lend themselves
to a real-time implementation of a lipreading system [32]. Another benefit of
these approaches is that they are not dependent on a training ensemble; however,
they bring minimal prior knowledge about the mouth to the system if used as
the sole visual feature extractor. Non-linear transforms such as the multiscale
spatial analysis (MSA) technique have also been used to gain a representation
of the mouth ROI [115]. MSA uses a nonlinear scale-space decomposition algo-
rithm called a sieve which is a mathematical morphology serial filter structure
that progressively removes features from an image by increasing the scale.
The methods mentioned above achieve reasonable performance; however, they
are merely concerned with compressing the ROI and do not necessarily take into
consideration the speech content contained within the ROI. A major reason for
this is that they are unsupervised processes and, as such, do not make use of
important available speech information, such as time-labelled transcriptions, which
are used for the training of the HMM models. Linear discriminant analysis (LDA),
on the other hand, is a supervised process which can use information such as the
time-labelled transcription to segment the training data into speech classes (such
as HMM states) and calculate a transform which projects the ROI data to
a lower dimensionality whilst maximising the separation of the speech classes.
First proposed for lipreading by Duchnowski et al. [45], LDA was applied directly
to the pixels of the ROI. However, this can be problematic, as when there are many
training examples, the calculation of the LDA transform matrix can become com-
putationally prohibitive. To counteract this, a dimensionality reduction step of
5.2. Review of Visual Feature Extraction Techniques 77
Figure 5.1: Appearance based features utilise the entire ROI given on the left. Contour based features require further localisation to yield features based on the physical shape of the mouth, such as mouth height and width, as depicted on the right.
PCA [98] or DCT [140] is applied, which has been shown to outperform other appear-
ance based methods. Potamianos et al. [140] further improved performance by
following the LDA step with a maximum likelihood linear
transform (MLLT), which maximises the likelihood of the observations in the LDA
feature space under the assumption of a diagonal covariance.
As mentioned in Chapter 2, the dynamics of visual speech play a very impor-
tant role in human perception [152]. Dynamic speech information, such as first
and second order derivatives of the visual speech feature vector, can be used to
capture this information [189]. LDA can also be used in this capacity [142],
by concatenating ±J adjacent feature vectors around the current frame to form
one large feature vector. LDA is then used to produce a transform which can
maintain the dynamic information whilst producing a final feature vector which
is small enough to allow convergence of the HMM. This process, first proposed by
Potamianos et al. [145], is performed via a multi-stage cascade and is currently
the state-of-the-art in visual feature extraction. As this is the case, this process
will be used as the baseline system for this thesis and is described in Chapter 5.3.
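The ±J frame concatenation that feeds the dynamic-feature LDA stage can be sketched as below; the value of J and the boundary handling (repeating the edge frame) are illustrative assumptions:

```python
import numpy as np

def stack_adjacent(features, J=7):
    """Concatenate the +/-J adjacent feature vectors around each frame into
    one large vector, as in the dynamic-feature stage described above.
    Edges are handled by repeating the boundary frame (an assumption)."""
    T, D = features.shape
    stacked = np.empty((T, (2 * J + 1) * D))
    for t in range(T):
        idx = np.clip(np.arange(t - J, t + J + 1), 0, T - 1)
        stacked[t] = features[idx].ravel()
    return stacked  # an LDA projection then reduces this to a small vector

feats = np.random.randn(50, 30)
print(stack_adjacent(feats).shape)  # (50, 450)
```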
5.2.2 Contour Based Representations
Contour based representations are concerned with representing the mouth based
on the physical shape of the visible articulators. Whereas the appearance based
features just utilise the pixels within the ROI, the contour based approach goes a
step further by specifying the locations of the various visible articulators and using
78 Chapter 5. Visual Feature Extraction
these locations as their features as seen in Figure 5.1. This intuitive approach
has the appeal of being low in dimensionality, however, it does require further
localisation and tracking which can have an adverse effect on the lipreading system
due to the front-end effect.
A common contour based technique is to represent the mouth based on its
physical measurements such as mouth height, width, area etc. [1, 22, 59, 71, 132,
172] or even teeth [56] (see Figure 5.1). In [88], Kaynak et al. have provided a
comparison of these types of techniques. Another popular technique is to use active
shape models (ASM) [109, 126], as discussed in the previous chapter, to represent
the inner and outer lip contour by a set of labelled points. Other parametric
models such as the snake based algorithm [27], lip template parameters [21] and
deformable templates [73] have been used to good effect. Contour based features
based on the MPEG-4 standard's Facial Animation Parameters (FAPs) have also been
proposed [3]. Recently, Rothkrantz et al. [153] introduced the use of lip geometry
estimation (LGE) along with optical flow analysis as another method for visual
feature extraction for lipreading.
5.2.3 Combination of Features
Based on the idea of fusing complementary visual features with the acoustic
features to improve system performance, researchers have applied a similar idea to
combining the appearance and contour based representations. This idea stems
from the hypothesis that the appearance features encode low-level information
and the contour features encode high-level information about visual speech and
by combining them together, improved performance can be sought as they are
complementary to each other. Luettin first did this by combining ASM features
with PCA features [109]. Chiou and Hwang [27] followed this up by combining
their snake contour features with the PCA features. Chan [19] then used geo-
metric features with PCA features. These approaches simply concatenate both
sets of features into a single feature vector. Conversely, the active appearance
model (AAM) creates a single model of both shape and appearance, using PCA
to statistically combine both the ASM with appearance features based on the
pixel intensity values [33, 114, 127]. A disadvantage of using this AAM approach
5.2. Review of Visual Feature Extraction Techniques 79
is that it requires an extremely large number of manually annotated points for
the training examples and does not work well when the speaker is not contained
within the training set.
Recently, Saenko et al. [157] proposed the use of multiple streams of hidden
articulatory features (AFs) to represent the visual speech signal. In this work,
each sound is described by a unique combination of various articulator states,
such as “lip-opened”, “lip-rounded”, “presence of teeth” etc. A problem associ-
ated with this multi-stream approach is the complexity involved as each of these
articulatory states (such as “lip-opened”) requires extra classification (via an SVM
for example) prior to the sound classification, which may make this approach
intractable.
5.2.4 Appearance vs Contour vs Combination
Even though a plethora of research has been conducted within the field of visual
feature extraction for lipreading, it is still not clear which approach is best. A ma-
jor reason for this is that no comprehensive comparison of the various approaches
has yet been conducted. In comparisons of limited size, Matthews et al. [115]
showed that AAMs outperform ASMs. Chiou and Hwang [27] documented that
their combined features were superior to the contour and appearance based fea-
tures. Potamianos et al. [137] and Scanlon and Reilly [161] demonstrated that
the appearance based features outperformed contour features. In experiments
based on the task of large vocabulary speaker independent AVASR, Matthews et
al. [116] showed that the appearance features outperformed AAMs.
In a recent paper, Rothkrantz et al. [153] claim that the contour based
approach is superior to the appearance based approach. Even though they did not
do any experiments to support this statement, they hypothesised that the appear-
ance based features were inferior as they contained a lot of information which may
not pertain to the task of speech recognition but more so to speaker recognition,
as the heavily compressed features relate to speaker information and not speech
information. If a coarse image compression technique such as PCA or DCT is
applied and nothing else, this can be the case. However, if dimensionality reduc-
tion schemes such as LDA are employed, speech classification information can be
80 Chapter 5. Visual Feature Extraction
maintained. Other feature normalisation techniques such as removing the mean
feature vector or image over the utterance also normalise against speaker appear-
ance information. In the current state-of-the-art system devised by Potamianos
et al. [145], they use a cascade of appearance features using LDA as well as a
speaker normalisation step to maximise visual speech information.
Contour representations of the mouth ROI can be said to have certain bene-
fits over appearance based representations. The main benefit can be found in the
invariance provided from the shape information contained within the contours of
a speaker's mouth region. Appearance features tend to suffer from variations
irrelevant to visual speech, due to illumination, speaker appearance and
ROI alignment. However, accurately extracting the contours of
the mouth region is very difficult, making them susceptible to
the front-end effect. By comparison, appearance based features rely on a coarse
detection of the ROI, making them far more stable, especially in difficult con-
ditions. Also, normalisation techniques for speaker appearance and illumination
can be employed to aid in the robustness of these features. AAMs appear to be
the best of the combined feature set, however, as shown during a comprehensive
comparison [116], appearance based features seem to be superior to them at the
moment.
In their review paper [142], Potamianos et al. highlight two very important
points in the argument for appearance based features. Firstly, their use is
well motivated by human perception studies of visual speech as they contain in-
formation about the visible articulators (such as tongue, teeth, muscles around
the jaw etc.), which are not contained just by the contours of the lips [7]. In the
perception studies cited, perception of the mouth using the entire mouth ROI was
far superior to just the lip movement [169]. Secondly, appearance based features
can be computed very quickly, which lends itself to real-time implementation.
This point is probably the most important in terms of deploying a real-world
lipreading system. Another very important point is that appearance based
features are generic and can be applied to mouth ROIs of any viewpoint, whereas
for contour based approaches, specific contours have to be developed for the many
views, which may be a very cumbersome and exhaustive task.
5.3. Cascading Appearance-Based Features 81
Figure 5.2: Block diagram depicting the cascading approach used by Potamianos et al. [145] to extract appearance based features from the mouth ROI.
For all these reasons, the appearance based approach was employed as the
visual feature extraction method of choice for this thesis. In the next section,
the current state-of-the-art technique based on a cascade of appearance-based
features is used as the benchmark, and experiments are conducted to show that
certain measures can be executed to normalise against the numerous variations
that the appearance based features are susceptible to.
5.3 Cascading Appearance-Based Features
The current state-of-the-art in visual feature extraction is that of multi-stage
cascade of appearance features devised by Potamianos et al. [145]. For this
thesis, a system based on this approach is used as the baseline system for all
work conducted. The complete system is depicted in Figure 5.2. From this figure
it can be seen that the system consists of two main stages:
1. static feature capture (features captured per single frame), and
2. dynamic feature capture (features capturing temporal information over
multiple frames).
Each of these steps will be described in detail in the following subsections. How-
ever, as can be seen in Figure 5.2, a preprocessing step is required to convert the
ROI from a colour image into a grayscale image. The curse of dimensionality
explains why grayscale intensity values have been preferred over colour
values, as they give a three times more compact representation.
Research has found that the loss of the chromatic information does not
impact on the speech classification performance [143].
82 Chapter 5. Visual Feature Extraction
Figure 5.3: Block diagram showing the capturing of the static features of a ROI frame.
5.3.1 Static Feature Capture
The goal of the static feature capture module is to capture the maximum amount
of speech information within each ROI frame using the fewest features. This
fine balancing act is due to the dimensionality constraint enforced by the HMM
classifier, as mentioned in the previous section. However, this is achievable
via a three stage cascading module, which is depicted in Figure 5.3. As can
be seen in this figure, the static feature capture starts with a two-dimensional
DCT. As mentioned in the review given in the previous section, compression
techniques such as PCA and DWT can also be used instead of the DCT. However,
it was found in Potamianos et al. [143] that all of these algorithms achieve
approximately the same performance, with the DCT on par with PCA and slightly
outperforming DWT. The DCT and the DWT have the added advantage that
they allow fast implementations if the image resolution is a power of two and
also do not require prior knowledge of the training ROI examples. On the other
hand, PCA does require these training examples, which is very computationally
expensive, especially given the high dimensionality of the images (32 × 32 = 1024)
and the number of training examples normally required (≥ 200,000)
to adequately train the PCA subspace. As such, the DCT was chosen as
the image compression technique for this thesis.
Discrete Cosine Transform (DCT)
Given a grayscale intensity frame of the ROI of dimension D = L × W, the
two-dimensional DCT of the ROI $I_t(i, j)$, denoted $F_t(l, w)$, can be computed
as follows
5.3. Cascading Appearance-Based Features 83
$$
F_t(l, w) = \sqrt{\frac{2}{L}} \sqrt{\frac{2}{W}} \, c_l c_w \sum_{i=0}^{L-1} \sum_{j=0}^{W-1} I_t(i, j) \cos\frac{(2i+1)l\pi}{2L} \cos\frac{(2j+1)w\pi}{2W} \qquad (5.1)
$$

where

$$
l = 0, 1, \ldots, L-1, \qquad w = 0, 1, \ldots, W-1
$$

and

$$
c_{l,w} = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{for } l, w = 0 \\ 1 & \text{for } l, w \neq 0 \end{cases}
$$
where L and W refer to the length and width of the ROI in pixels respectively.
No compression of the DCT image, $F_t(l, w)$, has taken place in the form given
in Equation 5.1. All this has achieved is transforming the image from the spatial
domain into the frequency domain, similar to what the discrete Fourier trans-
form (DFT) does. However, the form of Ft(l, w) lends itself to be compressed
as the coefficients within the transformed image are grouped according to their
importance. Most of the energy or information is contained within the low-order
coefficients, whereas, the higher order coefficients have very low information con-
tained within them and can be discarded. By reorganising the image data in
this way, compression can be achieved by retaining the coefficients with the most
information.
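Equation 5.1 can be written directly in code. A slow reference implementation for clarity only, assuming NumPy; in practice FFT-based routines would be used:

```python
import numpy as np

def dct2(roi):
    """Two-dimensional DCT of a grayscale ROI, computed term-by-term
    from Equation 5.1 (orthonormal DCT-II)."""
    L, W = roi.shape
    F = np.zeros((L, W))
    for l in range(L):
        for w in range(W):
            cl = 1 / np.sqrt(2) if l == 0 else 1.0
            cw = 1 / np.sqrt(2) if w == 0 else 1.0
            i = np.arange(L)[:, None]
            j = np.arange(W)[None, :]
            basis = (np.cos((2 * i + 1) * l * np.pi / (2 * L)) *
                     np.cos((2 * j + 1) * w * np.pi / (2 * W)))
            F[l, w] = np.sqrt(2 / L) * np.sqrt(2 / W) * cl * cw * np.sum(roi * basis)
    return F
```

Because this normalisation makes the basis orthonormal, the transform preserves the energy of the ROI, and a constant image produces a single non-zero DC coefficient.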
A convenient way to scan the two-dimensional DCT is the zig-zag scheme
(see Figure 5.4) used in the JPEG standard [173] because it groups together
coefficients with similar frequency. As such, the top M coefficients according
to this pattern represent information within the image which contains the most
variability or information which is used to represent the given ROI. Figure 5.5
shows examples of the reconstructed images for various values of M. As
can be seen in this figure, low numbers of features such as M = 10 or 30 result
in very little information being maintained from the original ROI. However, as
84 Chapter 5. Visual Feature Extraction
Figure 5.4: Diagram showing the zig-zag scheme used to read in the coefficients from the two-dimensional DCT image.
(a) (b) (c) (d) (e)
Figure 5.5: Examples showing the reconstructed ROIs using the top M coefficients from the DCT: (a) original, (b) M = 10, (c) M = 30, (d) M = 50 and (e) M = 100.
the number of features used is increased to M = 50 and 100, the
reconstructed ROIs look increasingly like the original, with a dimensionality
reduction by a factor of 10 to 20.
Using the top M DCT coefficients, the ROI frame can be expressed as a vector
of the form

$$
y_t^I = [y_1, \ldots, y_M]' \qquad (5.2)
$$

where $y_1, \ldots, y_M$ correspond to the top M coefficients according to the zig-zag
pattern and $y_t^I$ refers to the feature vector after the first stage of the cascade,
which can be seen in Figure 5.3. It is worth noting that the DCT is completely
reversible, that is, performing the inverse DCT on the DCT of an image will
restore the original image; this property allowed the reconstructions in
Figure 5.5.
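The zig-zag read-out of Figure 5.4 and the selection of the top M coefficients can be sketched as:

```python
import numpy as np

def zigzag_top_m(F, M):
    """Read a 2-D DCT image in the JPEG zig-zag order and keep the top M
    coefficients as the stage-one feature vector y_t^I."""
    L, W = F.shape
    # visit anti-diagonals in order; alternate direction within each diagonal
    order = sorted(((i, j) for i in range(L) for j in range(W)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([F[i, j] for i, j in order[:M]])

F = np.arange(16).reshape(4, 4)
print(zigzag_top_m(F, 6))  # [0 1 4 8 5 2]
```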
5.3. Cascading Appearance-Based Features 85
Figure 5.6: Plot showing the speaker information contained within the features without normalisation, for the digits “zero”, “one” and “two”.
Feature Mean Normalisation
Once the DCT step has been executed, the next step is to perform feature mean
normalisation (FMN). The FMN step is very important due to the fact that
appearance based features have an abundance of speaker information contained
within them as noted recently by Rothkrantz et al. [153] and addressed earlier in
[104]. This becomes apparent when the DCT features for two different speakers
are analysed. In Figure 5.6, the second and third DCT coefficients for two speak-
ers are plotted against each other for three spoken digits. It can be seen that
the features are grouped according to their speaker and not the speech classes.
This is a very good result for the task of speaker recognition, however in terms of
lipreading, this information is irrelevant and can be classed as a type of noise. In
audio-only speech recognition a method called cepstral mean subtraction (CMS)
has been used as a method to remove this speaker information, as well as other
environmental variations [100, 189]. Similarly, this can be done in the visual do-
main by subtracting the mean feature vector over the entire utterance, $\bar{y}^I$, thus
effectively removing the speaker information. A simple block diagram depicting
86 Chapter 5. Visual Feature Extraction
Figure 5.7: Block diagram showing the feature mean normalisation (FMN) step of the cascading process, resulting in $y_t^{II}$.
this is given in Figure 5.7. So, given the feature vector from the previous DCT
step, the normalised feature vector can be found via
$$
y_t^{II} = y_t^I - \bar{y}^I \qquad (5.3)
$$

where

$$
\bar{y}^I = \frac{1}{T} \sum_{t=1}^{T} y_t^I \qquad (5.4)
$$
and T is the length of the utterance. This FMN step has shown itself to greatly
improve the performance of appearance based features [137, 140]. As can be
seen in Figure 5.8, the FMN essentially removes the mean of the image, which
contains the redundant speaker information. In this thesis, the result-
ing normalised DCT features $y_t^{II}$ are termed mean-removed DCT (MRDCT)
features.
The MRDCT features can be obtained via a slight augmentation to the static
feature capture module. Instead of removing the mean feature vector, the mean
ROI image over the utterance can be removed. This is done by placing the FMN
step prior to the DCT step, as can be seen in Figure 5.9. In the end, $y_t^{II}$ is still
obtained, essentially resulting in the same output, but the normalisation is done
in the image domain rather than the feature domain. In this configuration, the
mean image over the utterance, $\bar{I}$, is subtracted from the input image $I_t$ to yield $I_t^{II}$.
The two-dimensional DCT is then performed on $I_t^{II}$ to gain the MRDCT features
$y_t^{II}$.
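Because the DCT is linear, performing FMN in the feature domain or in the image domain yields identical normalised features. A small numerical check, using a random linear transform as a stand-in for the flattened 2-D DCT:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 12, 64                      # frames, flattened ROI pixels
rois = rng.normal(size=(T, D))     # stand-in for a grayscale ROI sequence
A = rng.normal(size=(D, D))        # any linear transform; the 2-D DCT is one

# feature-domain FMN: transform each frame, then remove the mean vector
y = rois @ A.T
fmn_features = y - y.mean(axis=0)

# image-domain FMN: remove the mean ROI image, then transform
fmn_images = (rois - rois.mean(axis=0)) @ A.T

# linearity of the transform makes the two orderings equivalent
print(np.allclose(fmn_features, fmn_images))  # True
```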
There are a few reasons why this change is necessary. Firstly, in the current
system the FMN process is relied on as the sole normalisation step. No rotation
normalisation, pose compensation or lighting normalisation is directly applied
5.3. Cascading Appearance-Based Features 87
Figure 5.8: Plot showing that with FMN the unwanted speaker information contained within the features is effectively removed, for the digits “zero”, “one” and “two”.
on the ROI. As mentioned in the previous chapters, these variations have not
been a problem for lipreading and as such there has been no requirement for
accounting for such variations. However, as more work in this field is being ap-
plied on real-world data where these variabilities are an issue, precautions should
be made so that a lipreading system can handle them. The only work found to
deal with such variations in the visual domain is by Potamianos and Neti [141],
who found that lipreading performance degrades significantly when deployed in
challenging environments. This mirrors the findings in face recognition, where il-
lumination and pose variability have shown themselves to be among the biggest sources
of train/test mismatch, causing severe performance degradation [67]. Secondly, by
placing the FMN step directly after the ROI extraction instead of the DCT step,
it is believed that variabilities such as pose and illumination can be dealt with
in a more efficient manner by incorporating an illumination or pose normalisa-
tion step within the FMN module. By allowing this change, the FMN process
essentially acts as a pre-processing step operating in the image domain rather than
the feature domain. This allows the input ROI image to be enhanced prior to
88 Chapter 5. Visual Feature Extraction
Figure 5.9: Block diagram showing the augmented static feature capture system using the FMN in the image domain rather than the feature domain.
any feature extraction and normalise for any unwanted variations within the
input ROI, hopefully reducing the train/test mismatch and thus improving
lipreading performance.
In the next section, experiments will be conducted showing that this basic
change to the system does not affect lipreading performance.
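The image-domain FMN step described above can be sketched in a few lines: subtract the utterance-mean ROI from every frame before any feature extraction, so that static, speaker-dependent appearance is removed. This is a minimal illustration, assuming the utterance is available as a stack of grey-level ROI frames; it is not taken from the thesis implementation.

```python
import numpy as np

def fmn_image_domain(rois):
    """Feature-mean normalisation applied in the image domain (Figure 5.9):
    subtract the utterance-mean ROI from every frame, so the static
    speaker-dependent appearance is removed before the 2D-DCT."""
    rois = np.asarray(rois, dtype=float)   # shape (T, H, W)
    mean_roi = rois.mean(axis=0)           # mean image over the utterance
    return rois - mean_roi                 # mean-removed ROIs

# Toy example: a constant (speaker) component is removed exactly,
# leaving only the time-varying (speech) component, zero-mean over time.
T, H, W = 5, 16, 32
rng = np.random.default_rng(0)
speaker = rng.random((H, W))               # static speaker appearance
speech = rng.standard_normal((T, H, W)) * 0.1
normed = fmn_image_domain(speaker + speech)
```

Because the subtraction happens on images, any illumination or pose normalisation could be inserted at the same point, which is the motivation given in the text.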
Linear Discriminant Analysis
Whilst the previous two steps have extracted and normalised the features for each
ROI frame, they are only coarse image compression and normalisation techniques
and do not identify those features that will give the best discrimination between
the various speech classes (such as words). For a successful lipreading system to
be employed, it is important that features not only describe each class well, but
also allow distinguishing characteristics of each class to be easily identifiable.
Linear discriminant analysis (LDA) aims to find the optimal transformation
matrix $W^{I}_{LDA}$ such that the projected data is well separated. Unlike the DCT,
LDA is a supervised process which uses a predefined set of classes $C$ associated
with the training data vectors $y^{II}_t$ to determine this optimal transform. The
class set $C$ consists of $A$ classes, so that $c(a) \in C$. For the task of
lipreading, where recognition of words is the overall goal, these $A$ classes are
normally associated with words. However, improved performance can be gained
by increasing the number of classes, so HMM states are used as the class
set. The training data is labeled by aligning the training feature vectors
with the state-aligned, time-labeled transcription, which can be obtained from the
audio-only models using HTK [189].
Given a set of $N_T$ training examples, $X^I = [y^{II}_1, \ldots, y^{II}_{N_T}]$, and associated class
labels $C$, the LDA transformation matrix $W^I_{LDA}$ can be found by minimising
the intra-class dispersion whilst maximising the inter-class distance.¹ To formulate
the criteria for class separability, the within-class scatter matrix $S_w$ and the
between-class scatter matrix $S_b$ are used. The within-class scatter matrix describes
the statistics of the data points around their own expected vector, whilst
the between-class scatter matrix describes the distribution statistics of all class
expected vectors. The within-class scatter matrix can be expressed as

$$S_w = \sum_{a=1}^{A} c_a \Sigma_a \tag{5.5}$$

where $c_a$ is the $a$th class mixture weight and $\Sigma_a$ is the $a$th class covariance matrix.
The between-class scatter matrix can then be expressed as

$$S_b = \sum_{a=1}^{A} c_a (\mu_a - \mu_0)(\mu_a - \mu_0)' \tag{5.6}$$

where $\mu_a$ is the $a$th class mean and $\mu_0$ is the mixture mean given by

$$\mu_0 = \sum_{a=1}^{A} c_a \mu_a \tag{5.7}$$

The transformation matrix $W^I_{LDA}$ can then be estimated by maximising
$\mathrm{tr}(W^I_{LDA} S_w^{-1} S_b (W^I_{LDA})')$ [51]. Similar to what occurs in PCA, this translates
to retaining the $N$ greatest eigenvalues and eigenvectors of $S_w^{-1} S_b$. Although both
$S_w^{-1}$ and $S_b$ are symmetric, there is no guarantee that $S_w^{-1} S_b$ will be symmetric, making
normal eigen decomposition impossible. Simultaneous diagonalisation as proposed
by Fukunaga [51] can be used to diagonalise $S_w^{-1} S_b$, where

$$(W^I_{LDA})' S_w W^I_{LDA} = I \tag{5.8}$$

and

$$(W^I_{LDA})' S_b W^I_{LDA} = \Lambda \tag{5.9}$$

¹Note that in this cascading algorithm there are steps which are repeated several times, hence the need for the indexing of the matrices and vectors to avoid confusion, i.e. $W^I_{LDA}$ and $y^{II}_{N_T}$.
where $\Lambda$ and $W^I_{LDA}$ are the eigenvalues and eigenvectors of the matrix $S_w^{-1} S_b$. It
must be said that the resulting eigenvectors in the transformation matrix $W^I_{LDA}$
are not mutually orthonormal or orthogonal. As a result the transform does not
preserve energy, but it does preserve class separability as defined by the within-class
and between-class scatter matrices.
It has been shown that LDA does not work well when it is applied directly
to high-dimensional data such as images [6, 184]. This is mainly due to its
susceptibility to low-energy noise, and to it being computationally prohibitive to
calculate the LDA matrix when the input matrix is extremely large. To alleviate
this problem, a dimensionality reduction step is normally taken to remove this
low-energy noise. This is why the two-dimensional DCT was performed on the
input ROI prior to the LDA step, as it is more effective to work on data of
dimensionality $M$ compared to $D$, where $D \gg M$.
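The dimensionality reduction step above — a 2D-DCT followed by keeping the top $M$ coefficients — can be sketched as follows. The thesis only states that the top $M$ coefficients are retained; the zig-zag ordering from the low-frequency corner used here is a common convention and is assumed, not taken from the text.

```python
import numpy as np
from scipy.fftpack import dct

def top_m_dct_features(roi, m):
    """2D-DCT of the ROI, keeping the top-m coefficients.
    Coefficients are ordered by a zig-zag scan from the low-frequency
    corner -- an assumed convention; the thesis only says the top M
    coefficients are kept."""
    C = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    H, W = C.shape
    # zig-zag order: sort index pairs by diagonal (r + c), then by row
    order = sorted(((r, c) for r in range(H) for c in range(W)),
                   key=lambda rc: (rc[0] + rc[1], rc[0]))
    return np.array([C[r, c] for r, c in order[:m]])

y = top_m_dct_features(np.random.rand(16, 32), m=100)   # D = 512 -> M = 100
```

This realises the $D \gg M$ reduction (here $512 \to 100$) before the LDA matrix is ever estimated.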
LDA is similar to PCA in that the linear transform matrix $W^I_{LDA}$ maps the
input data $X^I$ of dimensionality $M$ to the output matrix $Y^I$, which is of dimension
$N$ (where $M > N$). To achieve this feature reduction, the top $N$ eigenvectors
of $W^I_{LDA}$ corresponding to the largest $N$ eigenvalues of $\Lambda$ are retained, yielding
$\hat{W}^I_{LDA}$. So given the input matrix $X^I$, the output $Y^I$ can be found by simply
applying

$$Y^I = (\hat{W}^I_{LDA})' X^I \tag{5.10}$$

where the output from the LDA step is $Y^I = [y^{III}_1, \ldots, y^{III}_T]$, which corresponds
to the final static feature vectors and the third step in the overall cascading
algorithm. This LDA step has been termed intra-frame LDA due to it
occurring within each individual frame [142].
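The intra-frame LDA estimation described by Equations 5.5–5.10 can be sketched as below. This is an illustrative implementation, not the thesis code: the scatter matrices are built per Equations 5.5 and 5.6, and the simultaneous diagonalisation of Equations 5.8–5.9 is obtained through SciPy's generalised symmetric eigensolver; the small regularisation term on $S_w$ is an added assumption for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def intra_frame_lda(X, labels, n_out):
    """Estimate the intra-frame LDA transform from feature vectors.
    X: (num_samples, M) matrix of normalised vectors; labels: class per row.
    Returns the (M, n_out) matrix of top eigenvectors of S_w^{-1} S_b."""
    mu0 = X.mean(axis=0)                 # mixture mean, Eq. 5.7
    M = X.shape[1]
    Sw = np.zeros((M, M))
    Sb = np.zeros((M, M))
    for c in np.unique(labels):
        Xc = X[labels == c]
        ca = len(Xc) / len(X)            # class mixture weight
        mu = Xc.mean(axis=0)
        Sw += ca * np.cov(Xc, rowvar=False, bias=True)   # Eq. 5.5
        Sb += ca * np.outer(mu - mu0, mu - mu0)          # Eq. 5.6
    # simultaneous diagonalisation: W'SwW = I, W'SbW = Lambda (Eqs 5.8-5.9)
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(M))       # regularised
    order = np.argsort(evals)[::-1]      # largest eigenvalues first
    return evecs[:, order[:n_out]]

# toy two-class example: project M = 20 dimensions down to N = 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, (50, 20)) for c in (0.0, 3.0)])
y = np.repeat([0, 1], 50)
W = intra_frame_lda(X, y, 1)
Y = X @ W
```

In the lipreading system the class labels would be the state-aligned HMM labels rather than the two synthetic classes used here.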
There are constraints associated with using LDA. Firstly, it assumes that
each class is described by the same covariance matrix; this can be a large
problem when the approximation does not hold. Additionally, the rank of the
between-class scatter matrix $S_b$ is at most $A - 1$, limiting the size of the
subspace defined by $W^I_{LDA}$ to $A - 1$. Finally, LDA is only suitable for problems
where classes are separated by their means, not their covariances. When this assumption does
not hold it is possible to find clusters in each class, forcing each distribution to
be described by several unimodal Gaussians sharing the same covariance matrix.
Figure 5.10: Block diagram showing the capturing of the dynamic features centered at each ROI frame (static feature vectors $y^{III}_t$ from frames $t-J$ to $t+J$ are concatenated into $y^{IV}_t$ of dimension $(2J+1)N$ and passed through the inter-frame LDA to give the final dynamic feature vector $y^{V}_t$ of dimension $P$).
In the current state-of-the-art system, the LDA step is normally followed by
a maximum likelihood linear transform (MLLT) step [61]. In this thesis however,
the MLLT step was not performed, as it did not add to the performance of the
lipreading system during preliminary experiments.
5.3.2 Dynamic Feature Capture
The temporal aspect of the visual speech signal is known to help human perception
of visual speech [152], as mentioned in Chapter 2. There are many ways of
incorporating this temporal information. The most popular method of capturing
the dynamic information is via the first and second derivatives of the feature
vectors [189]. Another method which can give improved results is to use LDA
as a means of learning a transformation matrix which can optimally capture the
dynamic nature of speech. Such a method is depicted in Figure 5.10, and this is
used in the current state-of-the-art system employed by Potamianos et al. [140].
It can be seen in this figure that the transformation matrix is found from the
concatenation of $\pm J$ frames centered around the current frame. So each input
frame to the LDA step is represented by

$$y^{IV}_t = [(y^{III}_{t-J})', \ldots, (y^{III}_t)', \ldots, (y^{III}_{t+J})']' \tag{5.11}$$

The LDA transformation matrix for the dynamic features, $\hat{W}^{II}_{LDA}$, is calculated
in exactly the same way as for the static features, with the classes and the
number of training examples remaining the same. The only difference is that the
input feature vectors span across multiple frames and not just within the frame.
For this reason, this step has been termed inter-frame LDA [142].
Similar to the result in Equation 5.10, the output $Y^{II}$ can be found by
simply applying

$$Y^{II} = (\hat{W}^{II}_{LDA})' X^{II} \tag{5.12}$$

where the input is $X^{II} = [y^{IV}_1, \ldots, y^{IV}_T]$ and the output from the LDA step is
$Y^{II} = [y^{V}_1, \ldots, y^{V}_T]$, which corresponds to the final dynamic feature vectors and
the final step in the overall cascading algorithm shown in Figure 5.10. In the
work conducted by Neti et al. [126] and Potamianos et al. [142], it was found that
using 5 adjacent frames ($J = 2$) gave optimal results.²
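The frame concatenation of Equation 5.11 can be sketched as below. The thesis does not state how sequence boundaries are handled, so edge frames are replicated here as one plausible choice; everything else follows the $(2J+1)N$ stacking shown in Figure 5.10.

```python
import numpy as np

def stack_frames(Y, J=2):
    """Concatenate the +/-J neighbouring static vectors around each frame
    (Eq. 5.11). Y: (T, N) static features; returns (T, (2J+1)*N).
    Edge frames are replicated at the boundaries -- an assumption, as the
    thesis does not specify the boundary handling."""
    T, N = Y.shape
    padded = np.pad(Y, ((J, J), (0, 0)), mode='edge')
    # column block j holds frame t - J + j for each row t
    return np.hstack([padded[j:j + T] for j in range(2 * J + 1)])

Y = np.arange(12, dtype=float).reshape(6, 2)   # T = 6 frames, N = 2 features
Z = stack_frames(Y, J=2)                       # (6, 10): (2J+1)*N columns
```

The stacked vectors $y^{IV}_t$ would then be projected by the inter-frame LDA matrix to give the final $P$-dimensional dynamic features.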
5.4 Lipreading from Frontal Views
In this section, experiments are conducted on frontal view data to test the
cascading appearance feature extraction method described in the previous section.
Analysis of the features is carried out at each stage, in an effort to show the
importance of each individual stage. As no analysis like this has been conducted
before, this is very important as it shows what impact each stage of the cascade
has on the overall lipreading performance. This analysis is also important in
working out the parameters to yield the optimal performance from the lipreading
system. These experiments also show some of the limitations and restrictions
associated with some of the stages, which may be overcome with some
modifications.
The frontal pose portion of the IBM smart-room database was used for this
experiment. As for all experiments carried out using this database, the
multi-speaker paradigm using the protocol described in Section 3.6.2 was used. As the
dynamic nature of speech is vital in terms of recognising visual speech, it was
²In these systems the visual features were interpolated to the audio rate of 100Hz to allow easy integration of the audio and visual streams. Because of interpolation, they actually used 15 adjacent frames. In this system however, the focus is solely on lipreading and as such no interpolation was performed, so the value of 5 frames is an approximate equivalent (30Hz vs 100Hz).
decided that difference images would be tested as well as the original images to
see how much impact the temporal aspect of visual speech had on performance.
Given an input ROI image $I_t$, the difference image can be defined as

$$I^*_t = I_t - I_{t-1} \tag{5.13}$$

The features for the difference ROI images are then calculated in the same way
as for the original images, except that $I^*_t$ is used instead of $I_t$. Both the static and
dynamic features are evaluated in these experiments, which are described in the
following subsections.
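Equation 5.13 amounts to a one-frame temporal difference over the ROI sequence. The sketch below assumes the first frame, which has no predecessor, is simply dropped; the thesis does not state its convention for that frame.

```python
import numpy as np

def difference_rois(rois):
    """Difference ROI images I*_t = I_t - I_{t-1} (Eq. 5.13).
    The first frame has no predecessor and is dropped here -- one
    plausible choice; the thesis does not state the convention."""
    rois = np.asarray(rois, dtype=float)   # (T, H, W)
    return rois[1:] - rois[:-1]

# toy sequence whose intensity grows by 1 each frame, so every
# difference image is a constant image of ones
frames = np.cumsum(np.ones((4, 2, 2)), axis=0)
diffs = difference_rois(frames)
```

The resulting $I^*_t$ frames are then passed through the same DCT/FMN/LDA cascade as the original ROIs.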
5.4.1 Static Feature Analysis
The DCT step is a coarse compression technique aimed at reducing the
dimensionality of the ROI without adhering to any class structure. As it is the first
step within the cascading algorithm, it is interesting to see how much speech
discrimination power is contained within these early features. In Figure 5.5 it
was seen that the more DCT coefficients were used to represent an ROI, the
more it resembled the original ROI. It is therefore intuitive that more features
would correspond to improved speech classification; however, this is not possible
as the HMM classifier can only handle a finite number of features. Another factor
to consider is what impact the redundant speaker information has on speech
classification and, to that end, what effect the FMN step has on speech classification.
The results are shown in Figure 5.11.³
Figure 5.11 shows that the raw DCT features do not provide much speech
classification performance, achieving a minimum WER of around 87%. However,
when the FMN step is utilised a massive improvement is gained, with a WER
of 59% achieved using 40 features. This result shows that speaker information
is an unwanted form of noise and great benefit can be gained by removing this
redundant component from the signal. Conversely, the difference DCT features
performed very well, achieving a WER of 48% using 40 features. This highlights
the importance of the temporal nature of visual speech. It is interesting to note

³It is worth noting that "DCT" refers to the results of the DCT of the input ROI and "Diff" refers to the results of the DCT of the difference ROI.
Figure 5.11: Plot showing the effect that FMN has on lipreading performance (WER % against the number of features per feature vector $M$, for the DCT, MRDCT ft, Diff and MRDiff ft features).
however, that the FMN normalisation step did not improve speech classification
for the difference features. This is because the difference ROI has already
undergone an FMN step prior to feature extraction through the subtraction of the
previous ROI; this result is to be expected as there is no redundant speaker
information to normalise for. Another interesting result to be gained from this
experiment is the impact, or lack of impact, that the number of features has on
lipreading performance. Apart from the DCT result, from 10 to 20 features it
can be seen that there is a jump in performance, and this improvement peaks at
around 40 features. There are two possible reasons for this. Either all the visual
speech information is contained in the top 40 features, or the curse of
dimensionality is having an impact on performance. As will be shown later on, it would
appear that the latter is the cause.
Placing the FMN step prior to the DCT step allows the input ROI to
be enhanced, which lets the lipreading system deal more easily with the
variations associated with illumination and pose. The next result shows that by
augmenting the feature extraction step in this way no degradation in lipreading
performance is suffered. In fact, by viewing Figure 5.12, it can be seen that
a marginal improvement in performance at some levels can be obtained by doing
the FMN in the image domain rather than the feature domain. Even though the
improvement is slight (WER of 55.75% compared to 58.89%) for MRDCT using
Figure 5.12: Plot comparing the lipreading performance (WER % against the number of features per feature vector $M$) of both the image-based and feature-based FMN methods (MRDCT ft, MRDCT im, MRDiff ft, MRDiff im).
40 features, there is minimal to no improvement obtained for the mean-removed
difference features (MRDiff). This suggests that using FMN in the image domain
would be of benefit to the overall lipreading system, and so this method was
employed for the rest of the experiments in this thesis.
Using the top $M = 100$ normalised features from the previous step,
intra-frame LDA is performed to further reduce the dimensionality of the features
whilst maintaining speech classification information. The speech classification
information is based on the class set $C$, which for these experiments was the
HMM states. The results for these experiments are shown in Figure 5.13. As can
be seen in this figure, the intra-frame LDA step reduces the best case WER
from 55.75% down to 43.39% for MRDCT and from 47.96% down to 32.37% for
MRDiff. These reductions in WER correspond to significant improvements in
lipreading performance, which highlights the importance of LDA to the task of
lipreading. As can be seen from this plot, optimal performance is gained using
N = 10 to 20 features to represent the visual signal, which is a useful result when
performing the inter-frame LDA which is the next step.
As can be seen from the results shown in this section, significant improvements
in lipreading performance can be obtained at each stage of the cascade.
It is worth noting that even though high performance is gained using just the
Figure 5.13: Plot of the lipreading results (WER % against the number of features per vector: $M$ for DCT, $N$ for LDA) showing the effect that LDA has on improving speech classification on the final static features over various values of $N$ (MRDCT im, MRDCT im/LDA, MRDiff im, MRDiff im/LDA).
static frame data, the temporal nature of visual speech does provide significantly
more speech discrimination through the use of difference ROI's. This point is
highlighted by the fact that after the intra-frame LDA the difference features
obtained a WER of 32.37% compared to 43.39%. This result is very significant
for the case where only the static feature capture can be implemented in a lipreading
system due to real-time constraints. The results in this section also highlight
the curse of dimensionality when dealing with a classifier like a HMM. In Figures
5.11 and 5.12 the best performance was obtained using 40 features. However, in
the previous result shown in Figure 5.13, it is quite obvious that using around
100 features gave more visual speech information.
5.4.2 Dynamic Feature Analysis
In these experiments many different permutations of the number of input features
to the inter-frame LDA step were used in determining what gave the best
lipreading results. Even though in the previous subsection it was found that $N = 10$ to
20 gave the optimal lipreading results for the static features, this does not
necessarily translate to being the best configuration to capture the dynamics of the
speech. This is because there has to be a balance between the number of input
Figure 5.14: Plots of the lipreading results (WER % against the number of features used per feature vector $P$) for the dynamic and final features on the MRDCT (a) and MRDiff (b) features using various values of $J$ (1 to 4) and $P$, with $N = 30$ input features.
features used $N$ and the length of the temporal window $J$. This is very important
as calculating the transformation matrix $W^{II}_{LDA}$ is computationally expensive and
there is a limit on how large the input matrix $X^{II}$ can be (a matrix of fewer than
approximately $6 \times 10^6$ elements).
For the sake of clarity only the results of the best performing configuration
using $N = 30$ for both the MRDCT and MRDiff features are shown in Figure 5.14;
a complete comparison of results using $N = 10$, 20, 30 and 40 is given in Appendix
A. As can be seen in Figure 5.14(a), the performance of the MRDCT features
improves when the temporal window $J$ is increased from 1 to 2. When the value of
$J$ is increased past 2, performance appears to level off with no real improvement
gained. From these results it appears that the best lipreading performance is
obtained when $P = 40$ features are used, even though there is no real difference
between $P = 30$ to 60 features. This is in comparison to Figure 5.14(b), where
the improvement in lipreading performance of the MRDiff features when the
value of $J$ is increased from 1 to 2 is not as large. This is because some temporal
information is already included in the MRDiff features, as they are difference
features. This can explain why there is a considerable discrepancy between the
performance of the difference ROI image and original image features for the static
features. When the value of $J$ is increased past 2, a similar flattening off of the
lipreading performance to the MRDCT features is experienced. This has resulted
in the performance of both sets of features being essentially equivalent, with the
MRDCT features obtaining a best WER of 27.66% with $J = 2$ and $P = 40$,
compared to the difference features achieving a best WER of 27.95% with $J = 3$
and $P = 50$.
Based on the analysis of the different components of the proposed visual
feature extraction system, the configuration of the baseline system which yields
optimal performance is $M = 100$, $N = 30$, $J = 2$ and $P = 40$. This configuration
is for the augmented static capture module as per Figure 5.9 and is used for the
remainder of this thesis. Also, seeing that there is no discernible advantage in
using the difference ROI images compared to the original ROI images, visual
feature extraction will only be performed on the original ROI's. These parameters
agree with those found in [142].
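The full cascade with this optimal configuration can be summarised as a dimensional sketch: $D = 16 \times 32$ ROI pixels $\to M = 100$ DCT features $\to N = 30$ after intra-frame LDA $\to (2J+1)N = 150$ stacked $\to P = 40$ dynamic features. The random matrices below are placeholders standing in for the trained DCT/LDA transforms — only the shapes are meaningful, and the edge padding at sequence boundaries is an added assumption.

```python
import numpy as np

# Dimensional walk-through of the cascade with M=100, N=30, J=2, P=40.
rng = np.random.default_rng(1)
T, D, M, N, J, P = 20, 16 * 32, 100, 30, 2, 40

rois = rng.random((T, D))                     # flattened mean-removed ROIs
W_dct = rng.standard_normal((D, M))           # stands in for the top-M 2D-DCT
W_lda1 = rng.standard_normal((M, N))          # intra-frame LDA transform
W_lda2 = rng.standard_normal(((2*J+1)*N, P))  # inter-frame LDA transform

static = (rois @ W_dct) @ W_lda1              # static features, shape (T, N)
padded = np.pad(static, ((J, J), (0, 0)), mode='edge')
stacked = np.hstack([padded[j:j+T] for j in range(2*J+1)])  # (T, (2J+1)N)
dynamic = stacked @ W_lda2                    # final dynamic features (T, P)
```

Tracing the shapes this way makes the role of each parameter in the configuration explicit.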
5.5 Making use of ROI Symmetry
As was seen in the previous section, using more features to represent the ROI
does not necessarily translate to better lipreading performance, partly due
to the curse of dimensionality. As such, it is imperative that a compact
representation of the ROI is obtained. Although the system presented in this thesis
does this to a certain extent, there are still some measures which can be taken
to maximise the amount of visual speech content captured from each ROI frame.
One such technique is to make use of the symmetrical nature of the ROI for the
frontal pose. If the ROI were perfectly symmetrical around its midpoint then only
half of the ROI would be required to represent it, as the other half would
be identical. This corresponds to a 50% reduction in the number of features that
need to be used. However, accurate ROI localisation is difficult and prone to
error, as was seen earlier. Potamianos and Scanlon [144] proposed a method
of overcoming this constraint by forcing the lateral symmetry of the ROI in the
frequency domain, exploiting the properties of the DCT by removing the odd
frequency components. The description of the algorithm shown in the following
paragraphs is adapted from the description given in [144].
The two-dimensional DCT given back in Equation 5.1 is just a one-dimensional
DCT applied to the ROI rows, followed by a one-dimensional DCT on the columns
of the result. As such, the one-dimensional DCT can be computed as

$$f_t(l) = \sqrt{\frac{2}{L}}\, c_l \sum_{i=0}^{L-1} i_t(i) \cos\frac{(2i+1)l\pi}{2L} \tag{5.14}$$

where $l = 0, 1, \ldots, L-1$ and

$$c_l = \begin{cases} \frac{1}{\sqrt{2}} & \text{for } l = 0 \\ 1 & \text{for } l \neq 0 \end{cases}$$

where $i_t(i)$ refers to the vector within the image that the one-dimensional DCT is being
applied to and $L$ is the dimension of the vector. It is evident that if the input
signal is laterally symmetric around its midpoint, $(L-1)/2$, i.e. $i_t(i) = i_t(L-i-1)$
for $i = 0, 1, \ldots, L/2-1$, then Equation 5.14 can be rewritten as
$$f_t(l) = \sqrt{\frac{2}{L}}\, c_l \sum_{i=0}^{L/2-1} i_t(i) \left[\cos\frac{(2i+1)l\pi}{2L} + \cos\left(l\pi - \frac{(2i+1)l\pi}{2L}\right)\right]$$
$$= 2\sqrt{\frac{2}{L}}\, c_l \sum_{i=0}^{L/2-1} i_t(i) \cos\frac{(2i+1)l\pi}{2L}, \quad \text{if } l \bmod 2 = 0$$
$$= 0, \quad \text{if } l \bmod 2 = 1 \tag{5.15}$$

since

$$\cos\frac{(2i+1)l\pi}{2L} = (-1)^{l} \cos\left(l\pi - \frac{(2i+1)l\pi}{2L}\right) \tag{5.16}$$
Therefore, the odd frequency DCT components of a symmetric one-dimensional
signal are all zero. Similarly, if $f_t(l) = 0$ for $l = 1, 3, \ldots, L-1$ (assuming that
$L$ is a power of 2), then the inverse DCT is

$$i_t(i) = \sqrt{\frac{1}{L}}\, f_t(0) + \sqrt{\frac{2}{L}} \sum_{l=1}^{L/2-1} f_t(2l) \cos\frac{(2i+1)l\pi}{L} \tag{5.17}$$
Figure 5.15: Examples showing the reconstructed ROI's using the top $M$ coefficients for: (a) original, (b) $M = 10$, (c) $M = 30$, (d) $M = 50$ and (e) $M = 100$. The images on top refer to the reconstructed ROI's using MRDCT coefficients. The images on the bottom refer to the reconstructed ROI's using the MRDCT with the odd frequency components removed (MRDCT-OFR).
given that the inverse one-dimensional DCT is given by

$$i_t(i) = \sqrt{\frac{1}{L}}\, f_t(0) + \sqrt{\frac{2}{L}} \sum_{l=1}^{L-1} f_t(l) \cos\frac{(2i+1)l\pi}{2L}$$

for $i = 0, 1, \ldots, L-1$.
Using the trigonometric identity given in Equation 5.16 in Equation 5.17, it
can be shown that zero odd frequency DCT components imply a symmetric
original signal, via

$$i_t(i) - i_t(L-i-1) = \sqrt{\frac{2}{L}} \sum_{l=1}^{L/2-1} f_t(2l) \left[\cos\frac{(2i+1)l\pi}{L} - \cos\left(2l\pi - \frac{(2i+1)l\pi}{L}\right)\right] = 0$$

for $i = 0, 1, \ldots, L/2-1$.
The expression derived in Equation 5.16 can be used in the visual feature
extraction process by substituting it for the normal DCT form. As this is applied
to the mean-removed ROI, this technique has been termed mean-removed DCT
with odd frequencies removed (MRDCT-OFR), as opposed to just MRDCT. Using
the inverse DCT given in Equation 5.17 on the MRDCT-OFR coefficients, the
ROI's can be reconstructed using the top $M$ coefficients. These reconstructions
are compared against the reconstructed ROI's of the MRDCT in Figure 5.15. Upon inspection
Figure 5.16: Examples showing the reconstructed half ROI's using the top $M$ coefficients from the MRDCT for each side: (a) original, (b) $M = 10$, (c) $M = 30$, (d) $M = 50$ and (e) $M = 100$. The top refers to the reconstructed images of the right side of the ROI; the bottom refers to the reconstructed images of the left side of the ROI. These images are all of size 16 × 32 pixels.
it can be seen that the MRDCT-OFR coefficients give more detail about the
ROI than the MRDCT coefficients. This is to be expected, as the MRDCT-OFR
coefficients provide twice as much information as the MRDCT. This is evident when
viewing the MRDCT-OFR ROI reconstruction using $M = 50$ features compared
to the reconstructed MRDCT ROI with $M = 100$, as they appear similar.
In the work conducted by Potamianos and Scanlon [144], they found that the
MRDCT-OFR coefficients also essentially acted as a post-processing step which
could compensate for small ROI localisation errors.
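The idea behind MRDCT-OFR can be demonstrated directly: zeroing the odd horizontal-frequency DCT components of an ROI and inverting yields a laterally symmetric image, exactly as Equations 5.15–5.17 predict. This sketch assumes an orthonormal DCT-II (SciPy's `norm='ortho'`), which matches the scaling in Equation 5.14; it is an illustration, not the thesis implementation.

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(img):
    # separable 2D-DCT: rows (axis 0) then columns (axis 1)
    return dct(dct(img, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coef):
    return idct(idct(coef, axis=1, norm='ortho'), axis=0, norm='ortho')

def remove_odd_frequencies(img):
    """Force lateral symmetry in the frequency domain (the MRDCT-OFR
    idea): zero the odd horizontal-frequency DCT components, which by
    Eq. 5.15 are exactly the components a symmetric ROI would lack."""
    C = dct2(img)
    C[:, 1::2] = 0.0     # odd columns <-> odd horizontal frequencies
    return C

roi = np.random.rand(16, 32)             # H x W mouth ROI
sym = idct2(remove_odd_frequencies(roi))
# the reconstruction is laterally symmetric about the ROI midpoint,
# and equals the symmetric part of the original image
```

Because the even-frequency basis functions are symmetric and the odd ones antisymmetric, the reconstruction is the projection of the ROI onto its laterally symmetric part, which is why small left/right localisation errors are averaged out.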
5.5.1 Experimental Results
Similar to the experiments conducted in Section 5.4, the performance of the
MRDCT-OFR features was tested at both the static and dynamic feature levels
and compared to the MRDCT features. In addition, the lipreading performance
of both the left half of the ROI and the right half of the ROI was evaluated, as
shown in Figure 5.16. This was done to see if the assumption that the ROI's were
symmetrical was valid, and to show the benefit of forcing the lateral symmetry
in the frequency domain rather than the image domain. Ideally, the results for
Figure 5.17: Results (WER % against the number of features per feature vector $M$) showing that removing the odd frequency components of the MRDCT features helps improve lipreading performance (MRDCT, MRDCT-OFR, MRDCT-Left, MRDCT-Right).
both the left and right ROI’s should be equivalent to those of the MRDCT-OFR
features. However, it is anticipated that the results for both the left and right
ROI’s will lag behind as ROI symmetry in the image domain is difficult to attain
as previously described. In terms of the visual front-end, these experiments also
give an indication of how well the ROI’s had been located and/or if there exists
a particular bias within the visual front-end. The results for the respective
features are given in Figure 5.17.
As can be seen in Figure 5.17, it is evident that removing the odd frequency
components to gain a more compact representation of the ROI gives an additional
improvement in lipreading performance. At $M = 40$ features, the MRDCT-OFR
features achieved a WER of 52.30% compared to the MRDCT features, which only
achieved a WER of 55.75%. This result is to be expected, as 40 features for the
MRDCT-OFR scheme is essentially equivalent to 80 MRDCT features, without
the dimensionality restriction enforced by the HMM coming into effect. As
anticipated, the lipreading performance of the left (60.06%) and right (60.79%)
ROI's lags somewhat behind at the same feature level.
The MRDCT features were then subjected to the intra-frame LDA with
$M = 100$. As can be seen from Figure 5.18, the improvement gained from making
use of the symmetrical nature of the ROI is nullified by the intra-frame LDA
Figure 5.18: Plot of the results (WER % against the number of features used per feature vector $N$) showing that LDA effectively nullifies the benefit of the MRDCT-OFR in the previous step (MRDCT/LDA, MRDCT-OFR/LDA, MRDCT-Left/LDA, MRDCT-Right/LDA).
step, with the MRDCT-OFR features obtaining a WER of 43.63% and the MRDCT
features an almost equal WER of 43.39%, both with $N = 10$. This suggests
that the intra-frame LDA does a good job of extracting the relevant
discriminating visual speech information, and that pre-processing steps such as removing
the odd frequency components of the symmetrical ROI are of no real
benefit. Again, the performance of both the left and right ROI features lags
behind the holistic representations, with the left obtaining a WER of 46.49%
and the right 47.04%. This backs up the initial hypothesis that symmetry in
the image domain is almost impossible to obtain due to the problems in obtaining
accurate ROI localisation, which again highlights the impact of the front-end
effect. These results may also suggest that the structure of each speaker's ROI
is important (i.e. lip rounding etc.) and that the full ROI is required for better
speech classification.
These results are also mirrored for the dynamic features, with the same pattern
emerging. Even though improved performance is gained uniformly across each
set of features, both the MRDCT and MRDCT-OFR features attain approximately
the same performance, with the MRDCT achieving its best WER of 27.66%
compared to the MRDCT-OFR's 27.75% WER, using the optimal configuration found
in the previous section of $J = 2$ and $P = 40$. This is compared to the left ROI
which gained a WER of 32.46% and the right with a WER of 33.32%. Even
though the last result suggests that there may be a small bias in the localisation
of the ROI’s towards the left hand side, the difference in performance is so slight
this can be viewed as insignificant, which indicates that the localisation of the
ROI’s was done satisfactorily.
Overall, it can be seen that making use of the symmetrical nature of the
frontal ROI, by removing the odd frequency components of the DCT coefficients,
can improve lipreading performance at an early stage. However, this step is not
really necessary when used in conjunction with LDA, as LDA appears to
do a more effective job of obtaining a compact feature representation of the ROI due
to its ability to make use of the given speech classes. Another apparent advantage of
this method is that enforcing lateral symmetry in the ROI can normalise the
ROI for variations in lighting and errors in localisation. Even though this may be
the case, no improvement in lipreading performance was gained at the final stage.
Another pertinent point is that this particular feature extraction step can only be
applied to the frontal pose, as that is the only pose where this symmetry exists.
As this thesis aims to develop a lipreading system which can be used irrespective
of pose, the ROI symmetry technique described in this section is not investigated
further.
5.6 Patch-Based Analysis of Visual Speech
Motivated by the work in the previous section, it was decided that it would be
desirable to determine which areas of the mouth are most salient for the task of
lipreading. The hypothesis behind this investigation was that if a particular
area of the ROI tends to be more useful for lipreading than the other areas,
then perhaps that area can be weighted more heavily to improve performance
over the current holistic representations. This gives rise to patch-based
analysis.
Patch-based analysis of the ROI is heavily motivated by the work conducted
in face recognition. Techniques that decompose the face into an ensemble of
salient patches have reported superior face recognition performance with respect
Figure 5.19: Examples of the ROI broken up into: (a) top, bottom, left and right side patches; and (b) 9 patches, where, starting from the top, the top band refers to patches 1, 2 and 3; the middle band to patches 4, 5 and 6; and the bottom band to patches 7, 8 and 9.
to approaches that treat the face as a whole [15, 87, 111, 121]. The idea behind
breaking the face into a series of patches is that it is easier to take into account
the local changes in appearance due to the faces complicated three-dimensional
shape, in comparison to treating it holistically [103], which is also the motivation
for this work. Furthermore, as no work like this has been conducted before in
the area of lipreading, this would be a provide an understanding as to which
areas of the ROI are more pertinent to visual speech. Apart from the recent
work by Saenko [157], the proposed multi-stream patch-based approach takes a
different path than the current methods which model the ROI in a holistic single
stream fashion. As a precursor to the work conducted in this thesis, work in
[106] demonstrated that a patch-based method of representing visual speech showed
potential. It must be noted, however, that this work was conducted on a very
small database and constrained to the task of isolated digit recognition.
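The patch decomposition described above can be sketched in code. The following is an illustrative stdlib-Python sketch only (the function name `split_roi` and the data layout are assumptions, not part of the original system): the ROI is assumed to be a 32 × 32 array of pixel intensities stored as a list of rows; the side patches halve the ROI, and the nine 16 × 16 patches are taken on an 8-pixel stride, giving a 50% overlap.

```python
def split_roi(roi):
    """Split a square ROI (list of rows) into top/bottom/left/right side
    patches and a 3x3 grid of half-size patches overlapping by 50%,
    mirroring the layouts shown in Figure 5.19."""
    n = len(roi)                     # assume a square ROI, e.g. n = 32
    h = n // 2
    sides = {
        "top":    [row[:] for row in roi[:h]],
        "bottom": [row[:] for row in roi[h:]],
        "left":   [row[:h] for row in roi],
        "right":  [row[h:] for row in roi],
    }
    stride, size = h // 2, h         # 8-pixel stride, 16x16 patches for n = 32
    grid = [[row[c:c + size] for row in roi[r:r + size]]
            for r in range(0, n - size + 1, stride)
            for c in range(0, n - size + 1, stride)]
    return sides, grid
```

For a 32 × 32 ROI this yields four 16 × 32 / 32 × 16 side patches and exactly nine 16 × 16 patches, since the starting offsets 0, 8 and 16 give three positions per axis.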
5.6.1 Experimental Results
The lipreading performance for the side patches depicted in Figure 5.19(a) is
given in Table 5.1. As can be seen from these results, all the patches achieve
reasonable lipreading performance, albeit well behind the performance of
the holistic representation. It also appears that the top of the ROI is the area
which contains the least amount of information useful for visual speech classifi-
cation.

ROI Region   Word Error Rate (WER %)
Top          38.97
Bottom       34.17
Left         32.46
Right        33.32
Holistic     27.66

Table 5.1: Lipreading performance of the various regions of the ROI.

As the lower areas of the ROI are more prone to move during an
utterance due to the jaw than the upper part of the ROI which is somewhat
fixed, this result supports the hypothesis that the movement within the ROI is
extremely important to lipreading. Although the bottom patch is more prone to
movement due to talking than the other patches, it does not contain much lip
information which is possibly why the bottom patch performance lags behind the
left and right patches.
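All results in this chapter are reported as word error rate (WER). For reference, the standard definition can be sketched as the minimum edit distance between the reference and recognised word strings, normalised by the reference length; this is a generic illustration, not code from the thesis.

```python
def word_error_rate(ref, hyp):
    """WER (%) = (substitutions + deletions + insertions) / len(ref) * 100,
    computed with a dynamic-programming edit distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * d[len(r)][len(h)] / len(r)
```

For example, recognising "one too three" against the reference "one two three four" gives one substitution and one deletion over four reference words, i.e. a WER of 50%.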
Seeing that the performance of these patches is somewhat worse than the
holistic representation, the results suggest that the full structure of the ROI is re-
quired to yield the best result. To test this, the various patches were fused
together to determine if the same lipreading performance could be obtained. For
these experiments, the following patches were combined: top and bottom, left and
right, and the top, bottom, left and right (all patches). These combinations were
fused via two methods. The first was the HiLDA feature fusion approach, whilst
the second was the synchronous multi-stream HMM (SMSHMM). The HiLDA
approach was used to see if similar performance could be gained to the holistic
representation and the SMSHMM was used to make use of the saliency of the
different patches and weight the more salient patches accordingly. Both these
approaches have dimensionality constraints to allow the HMM to converge. As
such, for these experiments the total dimensionality of the combined features was
constrained to a total of 40 features. This was also done so that a fair compar-
ison between the holistic and the combined patch-based systems of the various
patches could take place. For HiLDA fusion, all patches had their 40 features
concatenated into a single feature vector of size 160. HiLDA was then performed
on this feature vector using the same class set, C, as used previously, yielding a
final 40 dimensional feature vector. For the multi-stream approach, the top 20
features from each patch were used for the two-stream experiments, and the top
10 features from each patch were used in the four-stream experiment. The opti-
mal stream weights for the SMSHMM were found heuristically and applied to the
respective streams. The results from these experiments are given in Table 5.2.

ROI Region      HiLDA (WER %)   SMSHMM (WER %)
Top & Bottom    28.94           28.12
Left & Right    28.40           27.28
All Patches     28.31           28.41
Holistic               27.66

Table 5.2: Lipreading performance of fusing the various side patches of the ROI together.
From the results it can be seen that the holistic representation of the ROI
still outperforms the feature fusion results. This suggests that letting the patches
evolve independently over time does not improve lipreading performance. Weight-
ing the various patches with the SMSHMM does seem to have benefits when fusing
the left and right patches together, as there is a slight improvement in performance
(27.28% WER compared to 27.66%). No improvement was observed when using
the SMSHMM with all the patches; however, it must be noted that this configu-
ration was heavily constrained by the dimensionality restriction.
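The stream weighting performed by the SMSHMM can be illustrated with a small sketch. This is a simplified, hypothetical example (a single diagonal-Gaussian state per stream, with invented names; it is not the HMM configuration used in the experiments): the synchronous multi-stream HMM scores each observation by the weighted sum of per-stream log-likelihoods, log b(o) = Σ_s λ_s log b_s(o_s).

```python
import math

def stream_log_likelihood(o, mean, var):
    """Log-likelihood of one stream's observation under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))

def combined_log_likelihood(obs, models, weights):
    """Synchronous multi-stream combination: the per-state score is the
    weighted sum of stream log-likelihoods (a weighted product of likelihoods)."""
    return sum(w * stream_log_likelihood(o, m["mean"], m["var"])
               for o, m, w in zip(obs, models, weights))
```

With a weight of zero, a stream is effectively ignored; raising the weight of a poorly matching stream lowers the combined score, which is why the more salient patches are given the larger weights.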
Even though the results so far give an indication of which general areas of the
ROI are more salient than the others, they do not show what particular areas or
features of the ROI have the most impact on lipreading performance. That is, it
would be interesting to see how much visual speech information is contained in
the periphery of the ROI such as the areas around the lips and the nose. It would
also be interesting to see how much impact the lip corners have on lipreading,
as well as the center of the lips, where teeth and tongue information is prevalent.
As a result, the ROI was broken up into smaller 16 × 16 pixel patches, which were
overlapped by 50%. Examples of these patches are depicted in Figure 5.19(b),
with the results given in Table 5.3.

ROI Region     WER (%) per patch, left to right
Patches 1-3    47.53   54.80   49.19
Patches 4-6    33.98   33.94   33.46
Patches 7-9    39.92   38.55   47.86
Holistic       27.66

Table 5.3: Lipreading performance of the smaller 16 × 16 pixel patches of the ROI (overlapping by 50%).
The results of these extended experiments suggest that most visual speech
information stems from the middle band of the ROI (patches 4-6). This result is
unsurprising, as these areas of the ROI contain the most visible articulators, such
as the lips, teeth and tongue. It can be seen that the area of the ROI which
contains the least amount of visual speech information is patch 2, which contains
the nose and surrounding areas. This result also supports the initial hypothesis
that the top of the ROI is the least effective for lipreading due to its fixed nature.
These results highlight a potential problem with the holistic approach. Seeing
that most of the lipreading performance stems from the center of the ROI (patches
4-6), it is a possibility that when executing the holistic approach, some of this
speech discrimination power that exists in the center of the ROI is diminished in
an effort to incorporate all of the ROI into the representation. To see if this was
the case, it was decided to fuse the holistic representation with each of the
16 × 16 pixel patches. By doing this, it was hoped that any important informa-
tion which was lost or diminished by the holistic representation would be
reinforced by the introduction of the local patch. For these experiments, the
holistic features and one individual patch at a time were combined using a two-
stream SMSHMM, with a total dimensionality of 60: 40 for the holistic stream
and 20 for the patch stream. As 40 features was the optimal number
for the holistic approach, it was deemed appropriate to use 20 for each patch, as
HMM convergence was still obtainable with a dimensionality of 60. Again, the
optimal weights for each stream were found heuristically. The results for these
experiments are given in Table 5.4.

ROI Region     WER (%) per patch, left to right
Patches 1-3    27.70   27.98   27.67
Patches 4-6    26.84   26.76   26.79
Patches 7-9    27.02   27.15   28.21
Holistic       27.66

Table 5.4: Lipreading performance of each individual patch fused with the holistic representation of the ROI using the SMSHMM.
These results suggest that fusing each patch with the holistic representation
achieves a slight improvement over the holistic-only result for most patches (the
exception being patch 2). This appears to support the hypothesis that some
important visual speech classifying information is lost when the visual features
are calculated for the entire ROI, and this loss appears most pronounced in the
more salient regions. By fusing the features of the more salient regions with the
holistic features, some of this important local information can be retained, which
improves the overall lipreading performance. This is highlighted by the perfor-
mance of patch 5 with the holistic features, which achieves a WER of 26.76%
compared to 27.66% for the holistic representation.
Even though some improvement was obtained by fusing the salient patches of
the ROI with the holistic representation, it must be noted that a lot of extra
processing power was required to achieve this slight improvement. When imple-
mented in a full AVASR system, this type of approach would not be worth the
extra complexity, given the small gain in performance. Perhaps because the
frontal-pose ROI is symmetric, there is no real benefit in applying a patch-based
method. However, such a method may be useful for non-symmetric ROIs, such
as those found in non-frontal poses, and may be a viable research avenue.
5.7 Summary
In this chapter, visual feature extraction techniques for lipreading were investi-
gated. From the initial review on the various techniques used for visual feature
extraction, it was deemed that the appearance-based features were the represen-
tation of choice, as they are heavily motivated by human perception studies and
amenable to real-world implementation. The appearance-based features also do
not require further localisation of lip features, making them less susceptible to the
front-end effect than the contour and combination based techniques. The cur-
rent state-of-the-art appearance based visual feature extraction scheme based on
the cascading of features [142] was also presented as the baseline system for this
thesis. Each particular module of this algorithm was analysed and the lipreading
performance was also presented. It was shown that the DCT features contained
a lot of speaker information within them which is irrelevant for lipreading, and
that via FMN this irrelevant information could be removed, thus improving per-
formance. A variant of the FMN step was also presented which normalises in the
image domain rather than the feature domain, which will be useful when different
pose and illumination become a concern. It was shown that this image-based
FMN slightly outperformed the feature-based FMN.
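The FMN step summarised above amounts to subtracting the utterance-mean feature vector (or the mean ROI, in the image-domain variant) from every frame. A minimal stdlib-Python sketch, with invented names, of the feature-domain form:

```python
def feature_mean_normalise(features):
    """Feature mean normalisation (FMN): subtract the utterance-mean
    feature vector from every frame, removing the static bias (e.g. speaker
    appearance or illumination) while leaving the dynamics intact."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in features]
```

After normalisation, each feature dimension sums to zero over the utterance; the image-domain variant applies the same subtraction to the ROI pixels before the DCT rather than to the transformed features.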
As the ROI for the frontal pose is symmetrical, an algorithm presented by
Potamianos and Scanlon [144] was implemented, making use of this character-
istic. It was shown that it can improve lipreading at an early stage within the
cascading framework; however, this gain is nullified by the LDA step at a later
stage. Motivated by this work, analysis of the various regions of the ROI was then
conducted using patches, which is the first known analysis of its type. From this
novel analysis, it was found that the middle band of the ROI contained most in-
formation pertinent to lipreading whilst the top band provided significantly less
visual speech information. As a means of making use of this prior knowledge,
a novel patch-based representation of the ROI was introduced. Although im-
provement was sought by weighting the pertinent regions of the ROI, only a
slight gain was achieved. It was postulated that this would be a more effective
method for non-symmetrical ROIs, such as those found in profile views, which is
the focus of the next chapter.
Chapter 6
Frontal vs Profile Lipreading
6.1 Introduction
In the past two chapters, a lipreading system which recognises visual speech from
a speaker’s fully frontal pose was presented. This mirrors the work that has been
conducted in the field of lipreading over the past two decades. This is mainly
due to the lack of any large corpora which can accommodate poses other than
frontal. But as more work is being concentrated within the confines of a “meeting
room” [52] or “smart room” [131] environment, data is becoming available that
allows visual speech recognition from multiple views to become a viable research
avenue.
In the literature, only three studies were found to be related to lipreading
from side views. In the first work, Yoshinaga et al. [187] extracted lip infor-
mation from the horizontal and vertical variances of the optical flow of the
mouth image. In this paper, no mouth localisation or tracking was performed.
Yoshinaga et al. [188] refined their system in the second work by incorporat-
ing a mouth tracker which utilises Sobel edge detection and binary images, and
by using the lip angle and its derivative as the visual features on a limited data
set. The improvement gained from these primitive features was minimal, as ex-
pected, essentially because only two visual features were used, compared to
most other frontal-view systems that utilise significantly more features [142]. The
third study was a comprehensive psychological study conducted by Jordan and
Thomas [84]. Their findings were rather intuitive, as the authors determined that
human identification of visual speech became more difficult as the angle (from
frontal to profile view) increased.

Figure 6.1: Synchronous (a) frontal and (b) profile views of a subject recorded
in the IBM smart room (see Chapter 3). In the latter, visible facial features are
"compacted" within approximately half the area compared to the frontal face
case, thus increasing tracking difficulty.
Other than these works, no other attempts to solve the problem of lipreading
from non-frontal views have been identified in the literature. To remedy this
situation, this chapter makes a novel contribution to the field of lipreading
by presenting a lipreading system which can recognise visual speech from profile
views. This is the first real attempt at determining how much visual speech
information can be automatically extracted from profile views compared to the
frontal view. This chapter also presents the first multi-view lipreading system.
This system is able to recognise visual speech from two or more cameras which
capture the different views of a speaker synchronously.
The task of recognising visual speech from a profile view is in principle very
similar to that of the frontal view, requiring the system to first locate and track
the mouth ROI
and subsequently extract the visual features. However, this problem is far more
complicated than the frontal case because the facial features which are required to
be localised and tracked lie in a much more limited spatial plane, as can be viewed
in Figure 6.1. Clearly, much less data is available compared to that of a fully
frontal face, as many of the facial features that are of interest (i.e. eyes, mouth,
chin area etc.) are fully or partially occluded. In addition, the search region for all
visible features is approximately halved, as the remaining features are compactly
confined within the profile facial region. These facts remove redundancy in the
facial feature search problem, and therefore make robust ROI localisation and
tracking a much more difficult endeavour.
Nevertheless, ROI localisation and tracking can still be achieved by employ-
ing the visual front-end based on the Viola-Jones [180] algorithm presented in
Chapter 4. All that varies is the selection of facial features to locate and track.
Once these selections have been made, the associated classifiers can be trained
and the visual front-end can be developed. Once the ROIs have been extracted,
the rest of the lipreading system is the same as in the frontal case. The develop-
ment of a visual front-end which can extract profile mouth ROIs is described in
the next section. Following this, the lipreading performance of the profile view
is presented. These results are compared against the frontal view. Similar to the
previous chapter, patch-based analysis is then performed on the profile data to
determine which areas of the ROI are more pertinent to the task of lipreading.
The chapter then concludes by introducing the first known multi-view lipreading
system.
6.2 Visual Front-End for Profile View
The visual front-end for the profile view was developed in a similar manner to
its synchronous frontal counterpart. Due to the compactness of the facial
features within the dataset, only 7 of the 17 manually labeled facial features were
used. These were the left eye, nose, top of the mouth, center of mouth, bottom of
the mouth, left mouth corner and chin, as depicted in Fig. 6.2. Like the frontal
data, a set of 847 images for training and 37 images for validation were available
to develop the profile visual front-end 1. This provided 847 positive examples
for all 7 facial features. The resulting face training set included rotations in
the image plane by ±5 and ±10 degrees, providing 4235 positive examples. A
similar amount of negative examples of the background were also employed in
the training scheme. Approximately 5000 negative examples were used for each
1The 847 training images and 37 validation images were the synchronous counterparts to the frontal images used to train and test the frontal visual front-end in Chapter 4.6.
facial feature. These negative examples consisted of images of the other facial
features that surrounded its location, as these would be most likely to cause false
alarms, as per the frontal visual front-end.

Figure 6.2: Example of the points labeled on the face: (a) left eye, (b) nose,
(c) top mouth, (d) mouth center, (e) bottom mouth, (f) left mouth corner, and
(g) chin. The center of the depicted bounding box around the eye defines the
actual feature location.
One difficulty experienced was selecting appropriate facial feature points to
use for the training image normalisation (scaling and rotation). In the frontal face
scenario, the eyes are predominantly used for this task, but in the profile-view case
there is no such pair of geometrically aligned features to choose from. Instead,
the nose and the chin were used, with a normalised constant (K) distance of 64
pixels between them. This choice was dictated by the head-pose variation within
the dataset, which had less of an effect on this metric than on other possibilities
(such as the eye-to-nose distance).
bottom mouth and left mouth corner were trained on templates of size 10 × 10
pixels, based on normalised training faces. Both nose and chin classifiers were
trained on templates of size 15×15 pixels, and the eye templates were larger,
20×20 pixels. Examples of these facial feature templates are given in Figure 6.3.
The normalised positive face examples were templates of size 16× 16. Examples
of these face templates are shown in Figure 6.4.
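The scale normalisation described above reduces to computing a single factor from one pair of annotated points. An illustrative sketch (the function name and point convention are assumptions, not from the original implementation): for the profile training images, the image is rescaled so that the nose-to-chin distance equals the constant K = 64 pixels.

```python
import math

def scale_for_normalisation(p1, p2, K=64.0):
    """Return the factor that rescales an image so the distance between the
    two annotated normalisation points (e.g. nose and chin) equals K pixels."""
    d = math.hypot(p2[0] - p1[0], p2[1] - p1[1])
    return K / d
```

For example, a face with a 128-pixel nose-to-chin distance would be scaled by 0.5; the same routine serves later in the front-end when the left eye and left mouth corner are used instead, with K changed from 64 to 45.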
Figure 6.3: Examples of the facial feature templates of the profile view used to train the respective facial feature classifiers.
Facial Feature Accuracy (%)
Left Eye 86.49
Nose 81.08
Top Mouth 78.37
Center Mouth 81.08
Bottom Mouth 72.97
Left Mouth Corner 86.49
Chin 62.16
Table 6.1: Facial feature localisation accuracy results on the validation set of profile images.
Due to the lack of manually labeled faces available, all classifiers were tested
on a small validation set of 37 images which were the synchronous profile view
images of the validation set in the frontal domain. The localisation results of
the various facial features from this validation set gave an indication of what
particular features would give the best chance of reliably tracking the localised
features. These results are shown in Table 6.1. A similar performance metric to
the frontal scenario was also employed, whereby a feature was not considered
located if the location error was larger than 10% of the annotated distance between the nose
and the chin. From these localisation results, it can be seen that along with the
left eye, the left mouth corner yielded the best performance. This is somewhat
surprising, as a close-talking microphone was located near the left mouth corner
for all the speakers in the IBM smart-room database; an example of this can be
seen in Figure 6.3. This shows the usefulness of using a corner for facial feature
localisation, as it provides a unique shape within the face which is hard to confuse
with other objects.

Figure 6.4: Examples of the profile face templates used to train the profile face classifier.
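The localisation criterion used for Table 6.1 can be sketched directly. An illustrative stdlib-Python version (names invented), assuming 2-D points as (x, y) tuples: a feature counts as located when its error is at most 10% of the annotated nose-to-chin distance.

```python
import math

def is_localised(pred, truth, nose, chin, tol=0.10):
    """Localisation criterion from the text: a predicted feature location is
    accepted when its distance from the annotated truth does not exceed
    10% of the annotated nose-to-chin distance."""
    err = math.hypot(pred[0] - truth[0], pred[1] - truth[1])
    ref = math.hypot(chin[0] - nose[0], chin[1] - nose[1])
    return err <= tol * ref
```

With the normalised 64-pixel nose-to-chin distance, this admits errors of up to roughly 6.4 pixels.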
As the left eye and left mouth corner yielded the best results, it was decided
to use these two points for scale normalisation. The only difference between using
the left eye and left mouth corner, compared to the nose and chin is changing
the scaling factor K from 64 to 45. The face localisation accuracy on this test
set was 100%. As no manual labels for the face bounding box were available, the
accuracy was determined by visual inspection.
[Figure 6.5 shows the processing flow: video in → face localisation (face classifier, 16 × 16 templates) → define normalised search regions → localise eye (20 × 20 classifier) and nose (15 × 15 classifier) → calculate rescaling metric1; if the metric is outside limits, lengthen/shorten the face box → localise the mouth region below the nose (32 × 32 mouth classifier) → localise the left mouth corner (10 × 10 classifier) → calculate ROI rescaling metric2 → normalise the ROI (48 × 48) based on the left mouth corner → downsample the ROI (32 × 32) → track the mouth with smoothing, retracking every frame → tracked mouth (32 × 32).]

Figure 6.5: Block diagram of the face and mouth localisation and tracking system for profile views.
The final profile ROI localisation and tracking visual front-end is outlined in
Figure 6.5. Given the video of a spoken utterance, face localisation is first applied
to estimate the location of the speaker’s face at different scales as the face size
is unknown. Once the face was located, the left eye and nose were searched over
specific regions of the face (based on training data statistics). During the devel-
opment of this system, it was found that the bottom of the face bounding box
was often far
below the bottom of the subject’s actual face, or well above it. As the face box
defines the search region for the various facial features, this caused the system to
miss locating the lower regions of the face. To overcome this, the ratio (metric1 )
of the vertical eye to nose distance, over the vertical nose to bottom of the face
bounding box distance was used. If metric1 was below a fuzzy threshold (again
determined by training statistics), the box was lengthened, or if it was above
the threshold then it was shortened. It was found that this greatly improved the
localisation of the generalised mouth area (trained on normalised 32× 32 mouth
images), which was located next. This step is illustrated in Figure 6.6(b).
Once the generalised mouth region was found, the left mouth corner was
located. The next step was to define a scaling metric, so that all ROI images would
be normalised to the same size. As mentioned previously, the ratio (metric2 ) of
the vertical left eye to left mouth corner distance over some constant K (45) was
used to achieve this. A (48 × 48) · metric2 normalised ROI based on the left
mouth corner was then extracted (see Figure 6.6). The ROI was then downsam-
pled to 32 × 32 for use in the lipreading system.

Figure 6.6: (a) An example of face localisation. (b) Based on the face localisation
result, a search area to locate the left eye and nose is obtained; the face box is
lengthened or shortened according to metric1 = y1/y2. (c) The left mouth corner
is located within the generalised mouth region; the ratio metric2 = y/K is then
used for normalising the ROI. (d) An example of the scale-normalised located
ROI of size (48 × 48) · metric2 pixels.
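The two rescaling metrics used by this front-end can be sketched using vertical coordinates only, as in the text. An illustrative sketch (function names are invented; y-coordinates increase downwards):

```python
def face_box_metric(eye_y, nose_y, box_bottom_y):
    """metric1: vertical eye-to-nose distance over the vertical nose-to-
    box-bottom distance, used to decide whether the face bounding box
    should be lengthened or shortened."""
    return (nose_y - eye_y) / (box_bottom_y - nose_y)

def roi_side_length(eye_y, mouth_corner_y, K=45.0, base=48):
    """metric2: vertical left-eye to left-mouth-corner distance over K (45);
    the extracted ROI is (48 x 48) * metric2 pixels before being
    downsampled to 32 x 32."""
    metric2 = (mouth_corner_y - eye_y) / K
    return round(base * metric2)
```

For example, an eye-to-mouth-corner distance of exactly 45 pixels gives metric2 = 1 and a 48 × 48 ROI, while a distance of 90 pixels gives a 96 × 96 ROI; both are then downsampled to 32 × 32.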
Following ROI localisation, the ROI is tracked over consecutive frames. If the
located ROI is too far away from the previous frame's location, it is regarded as a
failure and the previous ROI location is used. A mean filter is then used to smooth the
tracking. Due to the speed of the boosted cascade of classifiers, this localisation
and tracking scheme is used for every frame.
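The tracking step above can be sketched as outlier rejection followed by a moving-average filter. This is a simplified illustration (the jump threshold and window size are placeholders, not the values used in the actual system):

```python
def smooth_track(points, max_jump=8.0, window=3):
    """Tracking sketch: a detection that jumps too far from the previous
    frame is treated as a failure and the previous location is reused;
    the resulting track is then smoothed with a mean (moving-average) filter."""
    track = [points[0]]
    for p in points[1:]:
        prev = track[-1]
        jump = ((p[0] - prev[0]) ** 2 + (p[1] - prev[1]) ** 2) ** 0.5
        track.append(p if jump <= max_jump else prev)
    half = window // 2
    smoothed = []
    for i in range(len(track)):
        lo, hi = max(0, i - half), min(len(track), i + half + 1)
        xs = [track[j][0] for j in range(lo, hi)]
        ys = [track[j][1] for j in range(lo, hi)]
        smoothed.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return smoothed
```

A single wild detection (e.g. a false alarm on the background) is thus replaced by the previous location before the mean filter is applied, so it cannot drag the track away from the mouth.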
Overall, the accuracy of the profile visual front-end was very good, with only
a small number of sequences in the dataset being poorly tracked. These poorly
tracked sequences were not used for the lipreading experiments 2. A major factor
affecting performance was random head movement and some head-pose
variability, where some subjects exhibit a somewhat more frontal pose than the profile
2Across the total 1700 available synchronous pairs of sequences in the IBM smart-room database, 1661 pairs of sequences were used for the lipreading experiments, with 39 synchronous pairs being omitted due to poor tracking. For the sake of comparison, both the frontal and profile sequences had to be accurately tracked for them to be used in this thesis. Evaluating the accurately tracked sequences was done via manual inspection.
view of the majority of the subjects – see also Figure 6.7, where examples of
accurately and poorly tracked ROIs are depicted. The latter is also the reason
why no rotation normalisation was employed. Many different configurations were
trialled; however, all seemed to cause more problems than they solved. Rotating
the ROI according to the left eye to left mouth corner angle was also tried;
however, the many different head poses made this very problematic. Another
attempt was to rotate the ROI using the angle between the mouth center and
the left mouth corner. This also failed, as the distance between these two points
was too small (approximately 20 pixels), and any slight mistake in the localisation
phase gave large errors.

Figure 6.7: Examples of accurate (a-d) and inaccurate (e,f) results of the local-
isation and tracking system. In (f), it can be seen that the subject exhibits a
somewhat more frontal pose compared to the profile view of the other subjects.
6.3 Profile vs Frontal Lipreading
Following extraction of the profile mouth ROI image from each frame, the same
visual feature extraction process based on a cascade of appearance features as
used for the frontal view was applied to the synchronous profile view data. The
profile features were modeled using a HMM with the same topology and train/test
sequences as the frontal data (see Chapters 3.5 and 3.6.2 for full details). Figure
6.8 shows the lipreading performance of the profile features for the three stages
of the static feature capture (i.e. DCT, MRDCT and intra-frame LDA steps)
compared to their frontal synchronous counterparts. It appears from these results
that the same trend observed in the frontal pose also occurs in the profile pose:
the FMN step of removing the mean ROI greatly improves speech classification
over the DCT features (WER of 64.02% compared to 87.12% for M = 40), and
the intra-frame LDA step again improves performance (WER of 56.61% for P = 10
compared to 64.02% for M = 40). These profile results are, however, significantly
worse than those of the frontal pose for all numbers of features. This result was
expected, in line with the human lipreading experiments reported in [84].

Figure 6.8: Results comparing the frontal and profile lipreading performance at
various stages of the static feature capture, plotting WER (%) against the number
of features per feature vector (M for DCT & MRDCT, N for LDA) for the DCT,
MRDCT and LDA features of each view.
When the temporal information is included via the inter-frame LDA step, the
profile speech features follow the same trend as in the frontal domain, as can be
seen in Figure 6.9 3. For the profile features (Figure 6.9(b)), it can be seen that the
lipreading performance improves when the temporal window J is increased from
1 to 2. When the value of J is increased past 2, the performance appears to level
off with no real improvement gained from using a larger temporal window. From
this plot it can be seen that the best lipreading performance for the profile view
is gained with P = 40 features and J = 2, achieving a WER of 38.88%. This is
compared to the frontal view, where the best WER of 27.66% was also achieved
using P = 40 features and J = 2 (Figure 6.9(a)). From these results, the difference
in lipreading performance between the synchronous frontal and profile views can
be quantified in terms of WER as 11.22% absolute.
3It is worth noting that the best lipreading performance was obtained using N = 30 input features, which was also the case for the frontal view.

Figure 6.9: Comparison of the lipreading performance between the (a) frontal
and (b) profile dynamic and final features, plotting WER (%) against the number
of features per feature vector (P) for J = 1, 2, 3, 4, using N = 30 input features.
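The inter-frame LDA step referred to above forms its input by stacking each frame's features with those of its J neighbours on either side. A minimal stdlib-Python sketch of this stacking (edge handling by repeating the boundary frames is an assumption; the exact padding used in the thesis is not specified here):

```python
def stack_temporal(features, J):
    """Build the inter-frame LDA input by concatenating each frame with its
    J neighbours on either side (2J + 1 frames in total), repeating the
    boundary frames at the edges of the utterance."""
    T = len(features)
    stacked = []
    for t in range(T):
        window = []
        for k in range(t - J, t + J + 1):
            window.extend(features[min(max(k, 0), T - 1)])
        stacked.append(window)
    return stacked
```

With N = 30 static features per frame and J = 2, each stacked vector has 5 × 30 = 150 dimensions, which the inter-frame LDA then projects down to the final P features.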
The discrepancy between the frontal and profile lipreading performances can
be attributed to the number of visible articulators that the lipreading system
has exposure to in the respective views. For example, in the frontal
scenario, the lipreading system has full exposure to all the possible visible artic-
ulators such as teeth, lips, tongue and jaw. Conversely, for the profile view, the
lipreading system only has the lips and jaw available, which is only a portion of
the visual information. Another restriction with extracting visual speech from
the profile view lies in the background. In the frontal scenario, a slight locali-
sation/tracking error does not cause significant appearance change, due to the
somewhat uniform background around the lips (i.e., the skin of the speaker). In
contrast, in the profile case, poor localisation/tracking may capture an excessive
amount of the background behind the speaker’s mouth causing the appearance of
the ROI to look unlike a speaker's mouth. This non-uniformity in the profile view
may also be a cause of the degradation in lipreading performance, which suggests
that the profile view may be more susceptible to the front-end effect than the
frontal view.
Considering the difficulties that lie within extracting visual speech from the
profile view, a lipreading performance of 38.88% WER is still an extremely useful
result. There are many benefits associated with using the profile view. Firstly,
the profile view may be the only view available. Such a situation could
arise in a car scenario where the only place a camera can be placed is to the
side of the driver. Even though the profile view does not contain as much visual
speech information as the frontal view, in this situation a WER of 38.88% is
still much better than pure chance, especially if this information
was to be fused with the audio channel. Although the combined audio-visual
scenario is outside the scope of this thesis, it is worth mentioning that when the
profile visual speech data was fused with the audio channel, substantial gains
were achieved over the audio-only results, especially in the presence of noise (see
[105] for full details).
Another benefit of using the profile visual speech information might lie in
combining different viewpoints to form a combined representation of visual
speech. As the various views contain essentially different information, there
may be complementary information which exists in one view but not in the
other. Fusing the various views together may therefore give an improvement
over the dominant viewpoint. This is examined at the end of the chapter, with
the introduction of a multi-view lipreading system.
The multi-view system is developed by combining the profile features together
with the frontal features in an attempt to achieve better lipreading performance
than the frontal-only system. The next section however, performs patch-based
analysis of the profile ROIs to determine which areas of the ROI are more perti-
nent for profile lipreading.
6.4 Patch-Based Analysis of Profile Visual Speech
An interesting point to note is that the profile view is not laterally symmetrical,
as the frontal view was. This constitutes a rather different problem, as some
areas of the profile ROI might contain much more visual speech information than
others. As with the patch-based experiments conducted for the frontal view,
weighting the more pertinent patches of the profile view higher than the other
areas may be of benefit in the profile scenario. In
Figure 6.10: Examples of the ROI broken up into: (a) top, bottom, left and right side patches; and (b) 9 patches, where the top band refers to patches 1, 2 and 3; the middle band to patches 4, 5 and 6; and the bottom band to patches 7, 8 and 9.
ROI Region    Word Error Rate (WER %)
Top           56.94
Bottom        46.86
Left          51.19
Right         45.29
Holistic      38.88

Table 6.2: Lipreading performance of the various regions of the profile ROI.
this subsection, patch-based analysis of the profile view is undertaken to explore
this possibility. The patches are numbered in the same fashion as for the frontal
pose, although they correspond to different features, as the examples in Figure
6.10 show.
The lipreading performances for the profile side patches depicted in Figure
6.10(a) are given in Table 6.2. From this, all the patches achieve reasonable
lipreading performance, albeit well behind the performance of the holistic
representation. As with the frontal pose, the top of the ROI seems to give the
least amount of useful visual speech information, again probably due to the lack
of movement within this region compared to the other regions. Also, the left patch
appears to be more useful than the right. A possible reason for this could be the
ROI Region      HiLDA (WER %)    SMSHMM (WER %)
Top & Bottom    41.50            39.97
Left & Right    40.31            39.56
All Patches     41.22            40.56
Holistic        38.88

Table 6.3: Lipreading performance of fusing the various side patches of the profile ROI together.
fact that the right patch contains a lot of the background and little lip
information, as depicted in Figure 6.10(a). Fusing these patches together using
both the HiLDA and SMSHMM integration strategies did not yield improved
performance over the holistic representation, as given in Table 6.3. It is worth
noting that all these patch-based experiments were conducted identically to those
for the frontal pose (see Chapter 5.6.1 for the experiment description).
The results for the 50% overlapping, smaller 16 × 16 patches depicted in
Figure 6.10(b) are given in Table 6.4. These results show that the regions
containing the lips and jaw (patches 5, 6 and 8) are the most useful for
lipreading. This again supports the hypothesis that the movement of the visible
articulators is of most benefit to recognising visual speech. As in the frontal
case, the nose region appears to be of little value for lipreading (patch 2), as
do the regions which contain the background (patches 1 and 7) or the skin around
the lips (patches 3 and 9), although the relative importance of these patches
cannot be quantified yet. This is because some patches may contain important lip
information which is only evident occasionally; for example, the background
patches 1 and 7 may contain important lip protrusion information, which may be
complementary to the frontal pose.
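The decomposition of the ROI into the 3 × 3 grid of 50%-overlapping 16 × 16 patches can be sketched as follows; the function name is illustrative and the 32 × 32 ROI size is taken from the image dimensions used later in the thesis:

```python
import numpy as np

def extract_patches(roi, patch=16, overlap=0.5):
    """Split a square ROI into overlapping square patches, row by row.

    With a 32 x 32 ROI, 16 x 16 patches and 50% overlap this yields the
    3 x 3 grid of nine patches (numbered 1-9) shown in Figure 6.10(b).
    """
    stride = int(patch * (1.0 - overlap))   # 8-pixel step for 50% overlap
    patches = []
    for top in range(0, roi.shape[0] - patch + 1, stride):
        for left in range(0, roi.shape[1] - patch + 1, stride):
            patches.append(roi[top:top + patch, left:left + patch])
    return patches

roi = np.arange(32 * 32, dtype=float).reshape(32, 32)  # stand-in mouth ROI
patches = extract_patches(roi)
print(len(patches))        # 9
print(patches[4].shape)    # (16, 16): patch 5, the centre of the ROI
```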
To determine whether any information in the holistic representation is lost by
including the less pertinent areas of the profile ROI, each of the patches was
fused with the holistic representation using the SMSHMM. The results for these
experiments are shown in Table 6.5. As in the frontal scenario, only a
slight improvement over the holistic-only performance is gained from fusing the
                WER (%)
Patches 1-3     69.94    64.97    61.93
Patches 4-6     55.32    48.48    49.38
Patches 7-9     58.60    49.67    66.49
Holistic        38.88

Table 6.4: Lipreading performance of the smaller 16 × 16 pixel patches of the profile ROI (overlapping by 50%); each row lists the WER of the three patches in that band, in order.
                WER (%)
Patches 1-3     39.83    39.34    39.20
Patches 4-6     39.04    38.51    38.89
Patches 7-9     39.27    38.91    39.53
Holistic        38.88

Table 6.5: Lipreading performance of each individual patch fused with the holistic representation of the profile ROI using the SMSHMM; each row lists the result for the three patches in that band, in order.
middle patch (patch 5) with a WER of 38.71% compared to 38.88%. For all the other
patches, similar or worse performance was obtained with this multi-stream
approach, which suggests that little or no additional information is gained by
using it.
Even though some improvement was obtained by fusing the salient patches of the
profile ROI with the holistic representation, the additional performance expected
from weighting the various parts of the ROI was not as great as hoped, with only
a marginal improvement in performance obtained. Hence, this type of approach
would not be viable in a full AVASR system, as the extra complexity required to
implement this multi-stream patch approach would not be worth the negligible
gain.
6.5 Multi-view Lipreading
From Section 6.3, it was shown that the frontal view contains far more visual
speech information than the profile view (WER of 27.66% compared to a WER
of 38.88%). Even though there is a significant difference in the lipreading
performance between the two views (11.22% absolute), there may exist some elements
in the profile view which are not found in the frontal view. As such, it would
seem likely that fusing the holistic representations of the frontal and profile
features together would improve the overall visual speech intelligibility. As
this multi-modal approach is the key motivation behind AVASR, it is intuitive to
follow a similar path by fusing these two synchronous views together. This gives
rise to multi-view lipreading. Formally, multi-view lipreading can be defined as
the scenario in which there are two or more synchronous views of a speaker's ROI
which can be fused together to form a combined representation of the visual
speech. A block diagram of the multi-view lipreading system is depicted in
Figure 6.11.
The multi-view system presented in this thesis uses two methods of combining
the views. The first was the HiLDA feature-fusion approach, whilst the second
was the synchronous multi-stream HMM (SMSHMM). As mentioned previously,
the HiLDA approach is a single-stream approach, so the two views cannot be
weighted. In contrast, the SMSHMM is a multi-stream approach, which allows the
different views to be weighted. Both of these approaches have dimensionality
constraints to allow the HMM to converge. As such, for these experiments the
total dimensionality of the combined features was constrained to 40. This was
also done so that a fair comparison between the multi-view and single-view
systems could take place.
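As an illustrative sketch of HiLDA-style feature fusion, the two streams can be concatenated and then reduced with a plain linear discriminant analysis implemented in NumPy. The data, labels and dimensions below are toy stand-ins, not the thesis's actual features or class set C; with a large class set, up to 40 discriminants would be available, whereas the 5-class toy set here gives at most 4:

```python
import numpy as np

def lda_projection(X, y, n_components):
    """Plain LDA: leading eigenvectors of pinv(Sw) @ Sb, where Sw and Sb
    are the within- and between-class scatter matrices."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:n_components]]

# Toy stand-ins for the two synchronous 40-dimensional feature streams.
rng = np.random.default_rng(0)
n = 500
labels = rng.integers(0, 5, size=n)            # e.g. HMM-state class labels
frontal = rng.normal(size=(n, 40)) + labels[:, None]
profile = rng.normal(size=(n, 40)) - labels[:, None]

fused = np.hstack([frontal, profile])          # 80-dimensional concatenation
P = lda_projection(fused, labels, n_components=4)  # at most n_classes - 1 dims
features = fused @ P                           # final fused feature vectors
print(features.shape)
```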
For the multi-view lipreading system experiments, the parameters which yielded
the best-performing visual features for each view were used. For both views these
were M = 100, N = 30, J = 2 and P = 40 (the final number of features used).
For HiLDA fusion, the two 40-dimensional feature streams were concatenated into
a single feature vector of dimensionality 80. HiLDA was then performed on this
feature vector using the same class set C as used previously, yielding a final
40-dimensional feature vector. For the multi-stream approach, the top 20 features from each
Figure 6.11: Block diagram depicting the various lipreading systems that can function when two cameras synchronously capture a speaker from different views. The lipreading system can use only one view (either frontal or profile in this case), or combine both views to form a multi-view lipreading system (depicted by the dashed lines and bold typeface). The multi-view features can either be fused at an early stage using feature fusion, or at the intermediate level via a synchronous multi-stream HMM (SMSHMM).
view were used as the input to the SMSHMM, with weights α for the frontal view
and 1 − α for the profile view used as the respective stream weights. The optimal
stream weight was found heuristically to be α = 0.8. The multi-view
lipreading results for both approaches are given in Table 6.6.
From these results, it can be seen that combining the two views using both
of the fusion methods mentioned gave an improvement in lipreading performance
over the frontal-view system. In particular, the SMSHMM approach achieved the
best performance with a WER of 25.36%, an improvement of 2.3% absolute over
the frontal result. This result is quite significant, as it demonstrates that
there exists some information in the profile view which is not captured by the
frontal view. Although it is not known for certain, the added information could
be that of lip protrusion. This seems to be the most likely explanation, as lip
protrusion information is only contained within the profile view.
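The intermediate-level combination used by the SMSHMM can be sketched as a weighted sum of the per-state emission log-likelihoods of the two synchronous streams, with exponents α and 1 − α; the log-likelihood values below are hypothetical:

```python
import numpy as np

def combined_log_likelihood(ll_frontal, ll_profile, alpha=0.8):
    """Synchronous multi-stream score combination: the two streams stay
    state-synchronous and their emission log-likelihoods are weighted
    by alpha and 1 - alpha respectively."""
    return alpha * ll_frontal + (1.0 - alpha) * ll_profile

# Hypothetical per-state emission log-likelihoods for one observation frame.
ll_f = np.array([-12.0, -9.5, -15.2])    # frontal stream, 3 HMM states
ll_p = np.array([-14.0, -13.0, -10.0])   # profile stream, same 3 states

ll = combined_log_likelihood(ll_f, ll_p, alpha=0.8)
best_state = int(np.argmax(ll))
print(best_state)   # with alpha = 0.8 the frontal stream dominates the choice
```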
Viewpoint              WER (%)
Frontal                27.66
Profile                38.88
Multi-view (HiLDA)     27.50
Multi-view (SMSHMM)    25.36

Table 6.6: Multi-view lipreading performance compared against the single-view performance.
The multi-view system presented in this section assumed that both camera
views had the speaker’s mouth in frame for the entire set of utterances. This
may not be the case in a realistic scenario though, as the speaker may randomly
move in and out of shot for a portion of time, or one particular view may be
partially or fully occluded by some object, such as the speaker’s hand. In a single
camera system, this would mean that the visual speech information would be lost.
However, using a multi-view type approach, there would be more of a chance that
the speaker’s mouth would be in at least one of the camera views. This highlights
another benefit of employing a multi-view lipreading system. Future research is
needed to look into this aspect of the multi-view lipreading system.
6.6 Summary
In this chapter, a lipreading system capable of extracting and recognising
visual speech information from profile views was presented. These results
were compared to their synchronous counterparts in the frontal view. This
constituted the first published work which quantifies the performance degradation
of lipreading in the profile view compared to the frontal view. In the
experiments presented, it was demonstrated that profile views contain significant
visual speech information, although, as expected, less than the frontal view.
As the "multi-view" experiments demonstrated, this profile information was found
not to be totally redundant with respect to the frontal video. In addition to
this work, patch-based analysis was also conducted on the profile view. From
the results, it was shown that, as with the frontal view, the most pertinent
area of the ROI in terms of visual speech was the centre.
Chapter 7
Pose-Invariant Lipreading
7.1 Introduction
In the previous chapter, lipreading was applied across multiple views. This gave
rise to the multi-view lipreading system, which refers to a scenario in which
there are two or more views of a speaker's lips which can be combined to form
a representation of the visual speech. From those experiments, it was shown
that an improvement in lipreading performance can be gained when the views are
of different poses (frontal and right profile). However, the multi-view work was
constrained by each viewpoint having its own dedicated lipreading system (i.e.
two systems, one dedicated to the frontal view and another to the profile view).
A more "real-world" solution to this problem would be to have a single lipreading
system recognise visual speech regardless of head pose.¹ This particular problem
is termed pose-invariant lipreading. Formally, pose-invariant lipreading can be
defined as the ability of a lipreading system to recognise visual speech across
multiple poses given a single camera. An example of this is given in Figure 7.1.
Pose-invariant lipreading can occur either when the speaker is stationary (i.e.
the speaker is fixed in one particular pose for the duration of the utterance)
or continuous (i.e. the speaker is not restricted to any one pose and can move

¹In this thesis, the term pose is used instead of view so as to distinguish lipreading systems using a single camera from those using multiple cameras. The term pose denotes the head position of a speaker when only one camera is used in the lipreading system. Conversely, the term view denotes the head position of a speaker in each of the cameras when more than one camera is used in the lipreading system.
Figure 7.1: Given one camera, the lipreading system has to be able to lipread from any pose. In this example, those poses are either frontal or profile poses.
their face during the spoken utterance). The former is the focus of the first
part of this chapter, whilst the latter, referred to as continuous pose-invariant
lipreading, is heavily dependent on the accuracy of the visual front-end, which
incorporates a pose-estimator. This will be investigated at the end of this
chapter.
The implications of a pose-invariant lipreading system are of major benefit to
lipreading and AVASR in general. By loosening the constraint on the speaker's
pose, a much more pervasive or "real-world" technology can develop, which would
be of major benefit to in-car AVASR, for example. However, by allowing more
flexibility in the system, more complexity is also introduced. A possible
solution would be to model and recognise each pose independently of the others,
thus minimising the train/test mismatch. Unfortunately, this is complicated to
achieve in a continuous setting, so a one-model-for-all approach is usually
employed. Having one model which generalises over all poses is also problematic,
as it may over-generalise, causing a large train/test mismatch.
Train/test mismatch can drastically affect the performance of a classifier.
Given that only one model is used, if some sort of invariance in the feature
space of the input signal can be provided, then the entire system will benefit.
A number of approaches have been devised in the acoustic speech domain to lessen
the train/test mismatch caused by channel conditions and noise, such as cepstral
mean subtraction (CMS) [110] and RASTA processing [74]. This type of approach
has been used similarly in the visual domain for face recognition, where
techniques such as linear regression have been used to project an unwanted
non-frontal face image into a frontal face image. Blanz et al. [9] cite the
major advantage of doing this as being that most state-of-the-art face
recognition systems are optimised for frontal poses only, and their performance
drops significantly if the faces in the input images are shown from non-frontal
poses, due to the large train/test mismatch. Linear regression has also been
used in AVASR, with Goecke et al. [58] using a linear regression matrix to
obtain an estimate of the clean audio features from a combined audio-visual
feature vector for audio-only speech enhancement.
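For reference, CMS itself is a one-line normalisation: the per-utterance mean of each feature dimension is subtracted, removing stationary channel effects. A minimal sketch with synthetic features:

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract the per-utterance mean of each feature dimension (CMS),
    removing stationary channel effects from the utterance."""
    return features - features.mean(axis=0, keepdims=True)

# Synthetic utterance: 200 frames of 13 cepstral coefficients with an offset.
utterance = np.random.default_rng(1).normal(loc=3.0, size=(200, 13))
normalised = cepstral_mean_subtraction(utterance)
print(np.allclose(normalised.mean(axis=0), 0.0))   # True
```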
Motivated by these works, this chapter describes a "pose-invariant" lipreading
system which makes use of linear regression to normalise the visual speech
features into a single pose. As no prior work on pose-invariant lipreading has
been carried out, this chapter is also concerned with investigating the various
problems associated with this task.
7.2 Pose-Invariant Techniques
Blanz et al. [9] cite two possible ways of performing pose-invariant face
recognition: either via a viewpoint-transformed or a coefficient-based approach.
The viewpoint-transform approach acts in a pre-processing manner to
transform/warp a face representation (i.e. an image or feature vector) of an
unwanted pose into the desired pose. Coefficient-based recognition attempts to
estimate the face representation under all poses given a single pose (i.e.
frontal and profile in this case), otherwise called the lightfield of the face
[67]. Although it is not clear which approach is superior, for the lipreading
system presented in this thesis the viewpoint-transform approach was employed.
The reason for this choice is that almost all lipreading systems to date have
been optimised for the frontal pose (such as the system described in Chapters 4
and 5). This is similar to the motivation cited by Blanz et al. [9] for their
face recognition system. The most common way to perform the viewpoint transform
is via linear regression. This is described in the following subsection.
Figure 7.2: Schematic of the proposed pose-invariant lipreading scheme. Visual speech features x_n extracted from an undesired pose (e.g. profile) are transformed into visual features t_n in the target pose space (e.g. frontal), via t = Wx, where the linear regression matrix W is calculated offline based on synchronised multi-pose training data T and X of features extracted from the different poses.
7.2.1 Linear Regression for Pose-Invariant Lipreading
The goal of regression is to predict the value of a target variable given an input
variable [8]. This is normally performed via a linear function, which gives rise to
linear regression. For the problem of pose-invariant lipreading, a linear regression
or transformation matrix W can be found which predicts the target features t of
the desired pose given the features x of the undesired pose, by
t = y(x,W) = Wx (7.1)
where x is of dimension P and t is of dimension Q. An example of this process is
shown in Figure 7.2, where x represents the features of the unwanted profile pose
and t the desired frontal pose features. It would be prudent, however, to express
Equation 7.1 as a predictive distribution as this displays the uncertainty about
the predicted value of t, for any new value of x. This can be done in such a way
to minimise the expected value of a chosen loss function. A common choice for
the loss function is the squared loss function, for which the optimal solution is
given by the conditional expectation of t [8]. As such, assume t is given by a
deterministic function y(x,W) with additive Gaussian noise so that
t = y(x, W) + ε    (7.2)

where ε is a zero-mean Gaussian random noise vector with precision (inverse variance)
β. As such, Equation 7.2 can be written as

p(t|x, W, β) = N(t | y(x, W), β⁻¹I)
             = N(t | Wx, β⁻¹I)    (7.3)
Now, given a training set consisting of N offline input examples of the
undesired pose X = {x_1, . . . , x_N} and their synchronised target examples in
the wanted pose T = {[t_1, 1]', . . . , [t_N, 1]'}, Equation 7.3 can be expressed
as the following log-likelihood function

ln p(T|X, W, β) = Σ_{n=1}^{N} ln N(t_n | W x_n, β⁻¹I)
               = (NQ/2) ln(β/2π) − (β/2) Σ_{n=1}^{N} ‖t_n − W x_n‖²    (7.4)
where a unit bias has been added to T to allow for any fixed offset in the data;
no such bias was given to the input matrix X. Maximising Equation 7.4 with
respect to W is equivalent to minimising the sum-of-squares error function
defined by

E_D(W) = (1/2) Σ_{n=1}^{N} ‖t_n − W x_n‖²    (7.5)
A problem with using linear regression, however, is that it is prone to
overfitting [8]. A method of overcoming this phenomenon is to introduce a
regularisation term, so that the total error function takes the form

E_T(W) = E_D(W) + λ E_W(W)    (7.6)

where λ is the regularisation coefficient that controls the relative importance
of the data-dependent error E_D(W) and the regularisation term E_W(W) [8]. One
of the simplest forms of regulariser is given by the sum-of-squares
E_W(W) = (1/2) ‖W‖²    (7.7)

The total error now becomes

E_T(W) = (1/2) Σ_{n=1}^{N} ‖t_n − W x_n‖² + (λ/2) ‖W‖²
       = (1/2) ‖T − WX‖² + (λ/2) ‖W‖²    (7.8)
The sum-of-squares regulariser given in Equation 7.7 encourages weight values
to decay towards zero unless supported by the data, and in machine learning
circles is known as a weight-decay regulariser [8]. Consequently, the solution
for W can be found by minimising Equation 7.8 with respect to W. As the overall
factor of 1/2 does not affect the minimiser, it is dropped in the following:

∂E_T(W)/∂W ∝ ∂/∂W [tr[(T − WX)'(T − WX)] + λ tr(W'W)]
           = ∂/∂W [−2 tr(TX'W') + tr(WXX'W') + λ tr(W'W)]    (7.9)
The above derivatives can be evaluated individually using matrix identities.
The second derivative,

∂/∂W tr(WXX'W')    (7.10)

can be evaluated with the product rule, treating each occurrence of W as
constant in turn. Let

B = XX'W'    (7.11)

so that, by the identity ∂/∂W tr(WB) = B',

∂/∂W tr(WB) = B' = (XX'W')' = WXX'    (7.12)

since XX' is symmetric. Now let A = WXX', so that, by the identity
∂/∂W tr(AW') = A,

∂/∂W tr(AW') = A = WXX'    (7.13)

Therefore

∂/∂W tr(WXX'W') = ∂/∂W tr(WB) + ∂/∂W tr(AW')
                = WXX' + WXX'
                = 2WXX'    (7.14)
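The trace-derivative result used here, namely that the quadratic trace term contributes 2WXX' to the gradient (the term that appears in Equation 7.15), can be checked numerically with a central finite difference on toy matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))   # toy data matrix (P = 3 features, N = 6 examples)
W = rng.normal(size=(4, 3))   # toy transformation matrix (Q = 4, P = 3)

def f(W):
    # The quadratic trace term from the total-error gradient.
    return np.trace(W @ X @ X.T @ W.T)

analytic = 2 * W @ X @ X.T    # the 2WXX' term that appears in Equation 7.15

# Central finite differences, element by element.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (f(W + E) - f(W - E)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```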
Substituting this result back into Equation 7.9 yields

∂E_T(W)/∂W ∝ −2TX' + 2WXX' + 2λW    (7.15)

Setting this to zero gives

0 = −2TX' + 2WXX' + 2λW
TX' = W(XX' + λI)
W = TX'(XX' + λI)⁻¹    (7.16)
The matrix W found above was used to project all visual speech features of
the unwanted pose into the wanted pose domain, in an attempt to normalise for
pose. The next section details the experiments conducted in this thesis to
determine whether this step was of benefit to a pose-invariant lipreading system.
Before describing these experiments, however, the importance of the
regularisation term λ is investigated in the next subsection.
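The closed-form solution of Equation 7.16 and the projection of Equation 7.1 can be sketched in a few lines of NumPy; the data below are synthetic stand-ins for the pose features, not the thesis's corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q, N = 40, 40, 1000            # feature dimensions and training examples
X = rng.normal(size=(P, N))        # "profile" training features, one per column
W_true = rng.normal(size=(Q, P))   # hypothetical true pose mapping
T = W_true @ X + 0.1 * rng.normal(size=(Q, N))   # noisy "frontal" targets

lam = 1.0
# Equation 7.16: W = T X' (X X' + lambda I)^-1
W = T @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(P))

# Equation 7.1: project a new "profile" feature vector into the "frontal" space.
x_new = rng.normal(size=(P,))
t_hat = W @ x_new
print(t_hat.shape)   # (40,)
```

With ample, low-noise training data the recovered W lies close to the mapping that generated the targets, which is what makes the projection useful as a normalisation step.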
7.2.2 The Importance of the Regularisation Term (λ)
The regularisation term, λ, was introduced in the previous subsection to control
the problem of overfitting. Overfitting refers to the situation where a model
has too many parameters; the common behaviour in such a case is that the model
becomes overly tuned to the training data, giving a near-perfect fit to it, but
when the model is applied to the test data, wild oscillations occur. One way of
alleviating the problem of overfitting is to limit the number of parameters in
the model. However, as Bishop notes, there is something rather unsatisfying
about limiting the number of parameters in the model, as it would seem more
reasonable to choose the complexity of the model according to the complexity of
the problem being solved [8]. A method of doing this is to adopt the linear
regression
approach using a regularisation term, as given in the previous subsection. By
utilising this approach, a complex model can be produced, as the regularisation
term λ is able to decay towards zero any weight values which are not supported
by the training data. However, this begs the following questions:

• What value of λ should be used?

• What impact does the amount of training data have on λ?
To answer these questions, a demonstration of the effectiveness of the linear
regression technique over various values of λ and different numbers of training
images is given. For this demonstration, the linear regression matrix W was
learnt from the frontal and profile grayscale ROI images (32 × 32). Different Ws
were calculated for various values of λ = {10⁻², 10⁰, 10²} and for various
numbers of training images, N = 1k, 10k and 75k. These training images were
randomly selected from the entire training set (≈ 200k images).² These different
Ws were used to project the unwanted profile ROI images into the wanted frontal
domain. The results of this demonstration are given in Figure 7.3.
As can be seen from Figure 7.3, an unwanted profile image can be projected
into the wanted frontal image via the linear regression transformation matrix.
However, the likeness between the actual frontal ROI image and the projected
ROI image varies according to the number of training images used and the value
of λ. For example, when only 1k training images were used and λ = 10⁻², the
projected profile ROI resembled a noisy, ghost-like ROI which is a far cry from
the original frontal ROI. This is a prime example of overfitting. In comparison,
when the value of λ was increased to 10⁰ and 10² using 1k training images, the
respective projected profile ROIs looked much more like the original ROI.
This result using 1k training images is in stark contrast to the situation where
the number of training images was increased (10k and 75k), with the value

²In the stationary pose-invariant experiments, it should be noted that visual speech feature vectors were used instead of the images, as only the speech information was of interest. Calculating the transformation matrices using the image data would also have been prohibitive for the full training set due to the increase in dimensionality.
Figure 7.3: Profile ROI images projected into the frontal domain (projected image = W × profile image), for regularisation values λ = 10⁻², 10⁰ and 10² and for 1k, 10k and 75k training images. The unwanted profile image is known, while the wanted frontal image is unknown.
of λ having little to no observable effect on the projected profile ROIs, with
all of them looking similar to the original ROI. This result is intuitive, as
when the number of training examples (> 10k) is far greater than the number of
parameters (1k), a generalised model is usually obtained.

This demonstration highlights the importance of the regularisation term λ, as
it alleviates the problem of overfitting when the number of training examples is
limited. However, when there is an abundance of training examples, the value
of the regularisation term is insignificant, as the large amount of training data
ensures that a model which generalises well across the data can be obtained. As
such, for the experiments conducted in the next section the value of λ was set to
10⁰, even though its value was not important given that the number of training
examples was close to 200k.
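The weight-decay behaviour of λ can also be illustrated numerically: as λ grows, the training residual of the regularised solution (Equation 7.16) increases while the magnitude of W shrinks. The matrices below are small synthetic stand-ins for the ROI data:

```python
import numpy as np

def ridge(T, X, lam):
    """Regularised least-squares solution of Equation 7.16."""
    return T @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(X.shape[0]))

rng = np.random.default_rng(0)
d, n = 64, 32                    # toy stand-ins for the 1024-dim ROI images
X = rng.normal(size=(d, n))      # "profile" training examples, one per column
T = rng.normal(size=(d, n))      # synchronised "frontal" targets

residuals, norms = [], []
for lam in (1e-2, 1e0, 1e2):     # the three values used in the demonstration
    W = ridge(T, X, lam)
    residuals.append(np.linalg.norm(T - W @ X))   # training-data fit
    norms.append(np.linalg.norm(W))               # weight magnitude
    print(f"lambda={lam:g}: residual={residuals[-1]:.3f}, ||W||={norms[-1]:.3f}")
# Larger lambda trades training fit for smaller weights (weight decay).
```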
7.3 Stationary Pose-Invariant Experiments
7.3.1 Experimental Setup
As it was assumed that prior knowledge of the speaker's pose was available,
localisation and tracking of the ROIs was performed as per Chapters 4 and 6 for
the frontal and profile poses respectively. In a "real-world" scenario, the pose
of the speaker would have to be estimated prior to any ROI localisation (this
is investigated later, in Section 7.4). However, as the pose was constant during
each entire utterance, and for the purposes of demonstrating the pose-invariant
technique, this approach was deemed valid. As with the ROI extraction, the visual
features were extracted via the baseline cascading appearance features described
in Chapter 5.3.
In the first round of experiments, three lipreading systems were tested. These
systems were trained on the following data:
• 100% frontal
• 100% profile
• 50% frontal and 50% profile (“combined(50-50)”)
As with the past experiments in this thesis, the same multi-speaker train and
test sets described in Chapter 3.6.2 were utilised, i.e. 1198 training sequences
and 242 test sequences. The frontal system was trained solely on the frontal
features, and the profile system solely on the profile features. The training set
for the combined(50-50) system was made up of 50% frontal features (599) and 50%
right-profile features (599). For the combined(50-50) system, all 1198 different
sequences were accounted for by randomly substituting frontal sequences with
their profile counterparts. These systems were tested on frontal, profile,
projected-profile, projected-frontal and combined test sets. Similar to the
training sets, the combined test set was made up of 50% frontal (121) and 50%
profile (121) data, and is also termed combined(50-50). Additional test sets
consisting of 50% frontal and 50% profile-projected-into-frontal features
("combined-projected profile(50-50)"), and 50% profile and 50%
frontal-projected-into-profile features ("combined-projected frontal(50-50)"),
were also included.
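The construction of the combined(50-50) training set, in which half of the 1198 sequence identities are randomly substituted with their profile counterparts so that every sequence is still represented exactly once, can be sketched as follows (the seed and labels are illustrative):

```python
import random

random.seed(0)
n_train = 1198
all_ids = list(range(n_train))

# Randomly choose half of the sequence identities to come from the frontal
# set; the rest are substituted with their profile counterparts, so every
# one of the 1198 sequences is still represented exactly once.
frontal_ids = set(random.sample(all_ids, n_train // 2))
combined = [("frontal" if i in frontal_ids else "profile", i) for i in all_ids]

n_frontal = sum(1 for view, _ in combined if view == "frontal")
print(n_frontal, n_train - n_frontal)   # 599 599
```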
Figure 7.4: Plots of WER (%) against the number of features used, Q, showing the impact that normalising the pose has on lipreading performance for: (a) the frontal and combined(50-50) systems; and (b) the profile and combined(50-50) systems. The systems are tested across Q = 10 to 60. In the legend, the first label refers to the test set and the label within the brackets denotes the system's name.
In these experiments, one of the goals was to see what effect the number of
features had on the transformed features. To do this, all three systems were
trained and tested using features of dimension Q = 10 to 60. It is worth noting
that two transformation matrices were calculated offline for these experiments
via Equation 7.16. The transformation matrix W_P, which projects the profile
features into the frontal pose, was calculated with the full set of training
frontal features as the target variable T and the full set of training profile
features as the input variable X. The transformation matrix W_F, which projects
the frontal features into the profile pose, was found using the opposite
configuration.
7.3.2 Experimental Results
The results given in Figure 7.4 show the impact that projecting the features into
a single pose has on the lipreading performance. In (a), the frontal system is
compared to the combined(50-50) system, while in (b) the profile system is compared
to the combined(50-50) system. In the former plot, it can be seen that the frontal
system achieves the lowest WER when it is tested on the frontal data (best is
27.66% for Q = 40), while in the latter plot the profile system obtains the
lowest WER for the profile data (best is 38.88% for Q = 40). However, when
each system is tested on features of the other pose, the features are essentially
140 Chapter 7. Pose-Invariant Lipreading
recognised as noise due to the large train/test mismatch (both severely degrading
to approximately 87%). It can be seen that by projecting the profile features into
the frontal domain (a), or by projecting the frontal features into the profile domain
(b), the mismatch between the features and the models is greatly reduced. For
Q = 20, the improvements are quite significant with the WER reducing from
87.07% down to 54.85% for the frontal system in (a) and 87.45% down to 42.97%
for the profile system in (b). However, when the number of features is increased
from Q = 20 to 60, the performance of the projected features steadily degrades,
with the WER increasing from 54.85% to 74.78% and 42.97% to 67.97% in (a)
and (b) respectively.
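For reference, WER figures like those quoted above are computed from the Levenshtein alignment between the recognised and reference word strings. The sketch below is illustrative only; the thesis presumably uses standard HTK-style scoring tools.

```python
def wer(reference, hypothesis):
    """Word error rate (%): edit distance over words divided by the
    reference length, i.e. (substitutions + deletions + insertions) / N."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(r)][len(h)] / len(r)
```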
The drop-off in performance as the number of features increases from Q = 20
to 60 highlights one of the characteristics of using maximum likelihood linear
regression. As the solution for the transformation matrix found in Equation 7.16
was obtained via a Bayesian approach, the effective number of
parameters was adapted automatically to the size of the data set [8]. For these
experiments, it appears that the number of effective parameters or features was
constrained to Q = 20 to 30.
Another observation worth noting is that the performance of the projected
profile test set on the frontal system is well behind that of the profile test set on the
combined(50-50) system in (a), with a performance difference ranging from 7.46%
at Q = 20 to 27.01% at Q = 60. This is in contrast to the improvement the
frontal system enjoys over the combined(50-50) system for the frontal test set,
with an average improvement of 8%. A similar trend is seen in (b),
with the combined(50-50) system on the frontal test set outperforming the profile
system on the projected frontal test set with a performance difference ranging from
6.56% at Q = 20 to 30.85% at Q = 60, while the profile system outperforms the
combined(50-50) system on the profile test set by an average of 8%.
As it was hard to ascertain which overall system was better, it was necessary
to test the systems on the combined(50-50) test set. The results for this ex-
periment are illustrated in Figure 7.5. From this plot, it appears that there is
not much difference between the frontal, profile and combined(50-50) systems
for the combined-projected profile(50-50), combined-projected frontal(50-50) and
[Figure 7.5 appears here: plot titled "Performance of Frontal, Profile and Combined(50−50) systems for varying number of features"; x-axis: number of features used, Q; y-axis: lipreading performance, WER (%).]
Figure 7.5: Plot showing the impact that normalising the pose has on lipreading performance for the frontal, profile and combined(50-50) systems. These systems are tested across various numbers of features Q = 10−60. In the legend, the first label refers to the test set and the label within the bracket denotes the system's name.
combined(50-50) test sets respectively. This is especially the case for Q = 10 to
30. For the frontal system, the best result achieved was with a WER of 42.02%
for Q = 20. Similarly with Q = 20, the profile system achieved its best WER
of 42.42%. However, the best overall result was from the combined(50-50) sys-
tem, with a WER of 40.83% for Q = 30. A possible reason for this is that the
combined(50-50) system is trained on both sets of data equally and by doing this
the system is able to model both poses equally well, thus yielding better overall
results.
By combining the features of the different poses, a model is created which
effectively averages or generalises across the poses. This generalisation has come
at a cost though, as was mentioned earlier, with the frontal system degrading by
an average of approximately 8% and the profile system degrading by an average
of approximately 7%. However, for these experiments, having a system which can
generalise over the two poses is still better than a system which normalises all
poses into a uniform one. This is highlighted by the fact that the performance of
the combined(50-50) system for the frontal and profile test sets heavily outperforms
the performance of the projected features for the frontal and profile systems, which
can be seen back in Figures 7.4(a) and (b). Even though generalising across both
sets of features yielded the best results for these experiments, it must be noted
that this was for the scenario where both poses were equally likely. Generalisation
can be particularly costly, however, if one pose is more prevalent than the other.
This is the focus of the experiments given in the next subsection.
7.3.3 Biased Towards Frontal Pose
Most lipreading systems are set up for fully frontal faces. This is due to the fact
that nearly all audio-visual speech databases have been restricted to the frontal
pose due to the high cost associated with capturing video data (see Chapters 2
and 3). Even though this has been widely acknowledged throughout this thesis,
what has not been recognised until now is “why” most databases chose the frontal
pose over the profile pose. The reason is quite obvious, as most lipreading appli-
cations would expect the speaker to be in the frontal pose for the majority of the
time. Consequently, to reflect this fact, it would be intuitive that a lipreading
system be trained more on the frontal pose than the profile pose to cater for this
bias. A bonus of adopting this approach is that the frontal pose yields better
lipreading performance than the profile pose (27.66% vs 38.88% WER), so the
overall lipreading performance should improve.
To see what impact biasing the system to the frontal pose over the profile pose
has on the lipreading performance, it was decided that a second set of experiments
would be conducted to reflect this scenario. To do this, it was estimated a speaker
would be in the frontal pose for approximately 80% of the time and in the profile
pose for about 20%. For these extended experiments, the frontal system was
still trained solely on the frontal features. A new system was introduced,
however, which was called the "combined(80-20)". As the name suggests, the
combined(80-20) system was trained up on 80% of the frontal data (958 utterances)
and 20% of the profile data (240 utterances). The profile system was not tested
as part of these experiments as they were biased towards the frontal pose. As
such, the frontal and combined(80-20) systems were tested on the frontal,
profile and projected profile test sets. The combined(80-20) test set was made up of
80% frontal test sequences (194) and 20% of profile test sequences (48). Similarly,
[Figure 7.6 appears here: plot titled "Performance of Frontal and Combined(80−20) systems for varying number of features"; x-axis: number of features used, Q; y-axis: lipreading performance, WER (%).]
Figure 7.6: Plot showing the impact that biasing the system to the frontal pose has on the lipreading performance for the frontal and combined(80-20) systems. These systems are tested across various numbers of features Q = 10−60. In the legend, the first label refers to the test set and the label within the bracket denotes the system's name.
the combined-projected profile(80-20) test set consisted of 80% frontal and 20%
projected profile test sequences. It is worth noting that the regression training
sets remained the same due to the limited number of synchronised examples.
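The 80/20 composition described above can be sketched as a simple sampling step. This is an illustration only; `pose_biased_set`, the random sampling policy and the seed are assumptions, not the thesis's actual split procedure.

```python
import random

def pose_biased_set(frontal, profile, n_total, frac_frontal=0.8, seed=0):
    """Draw a pose-biased utterance list, e.g. 80% frontal / 20% profile.

    With n_total = 1198 and the full CUAVE-style lists passed in, this
    would reproduce the 958 frontal / 240 profile training mix used for
    the combined(80-20) system.
    """
    rng = random.Random(seed)
    n_front = round(n_total * frac_frontal)   # e.g. 958 of 1198
    n_prof = n_total - n_front                # e.g. 240 of 1198
    return rng.sample(frontal, n_front) + rng.sample(profile, n_prof)
```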
Figure 7.6 shows the lipreading performance for the frontal and combined(80-
20) systems for the frontal, profile and projected profile test sets. From this plot
it can be seen that the overall lipreading performance of the frontal system
has greatly improved. For the projected profile features, the best result sees the
WER come down to 33.90% from 42.02% for Q = 20. The improvement in the
overall performance of the frontal system can be attributed to the fact that this
system is solely trained on the frontal pose, which is the pose these experiments
are biased towards. This frontal system result outperforms the combined(80-20)
system, whose best WER of 36.61% also occurs at Q = 20. This mark is much better
than the combined(50-50) result recorded in the previous experiment, which was
40.83% for Q = 30. This improvement can also be attributed to the biasing of the
system to the frontal data, as well as lessening the impact of generalisation. This
can be seen in Figure 7.6, as the combined(80-20) system curve for the frontal
test set is relatively close to the frontal system curve. This has come at a cost
[Figure 7.7 appears here: plot titled "Performance of Frontal, Combined50−50 and Combined80−20 systems for varying number of features"; x-axis: number of features used, Q; y-axis: lipreading performance, WER (%).]
Figure 7.7: Plot showing the impact that normalising the pose has on lipreading performance for the frontal, combined(50-50) and combined(80-20) systems. These systems are tested across various numbers of features Q = 10−60. In the legend, the first label refers to the test set and the label within the bracket denotes the system's name.
though, as the profile performance of the combined(80-20) system has degraded
significantly compared to the combined(50-50) system, with the WER increasing
to the range of 62.11% to 62.55%, from the range of 45.53% to 47.77% shown in
Figure 7.4(a). This suggests that the profile data is not adequately represented
in the combined(80-20) system's model, and as such the projected profile features
achieve a better WER than the profile features: 57.78% for Q = 20, compared
to 62.11% for Q = 40. As this is the case, it is not surprising that the frontal
system is now the superior system, as depicted in Figure 7.7. Also, from this
figure it is visible that the projected profile features now perform better
than the profile features, with the optimal value at Q = 20.
From these experiments, it is evident that when the models are biased towards
one particular pose, such as the frontal one, it is advantageous to normalise all
poses into the strongly trained pose. It would be expected that when the number
of non-dominant poses is increased, this result will be even more dramatic, as
these non-dominant poses increase the amount of variation in the train/test set.
This is the focus of the next experiment, which includes the other profile pose.
[Figure 7.8 appears here: diagram showing video input of a speaker in the left profile, frontal, or right profile pose.]
Figure 7.8: In these experiments, the lipreading system has to lipread from the frontal, right and left profile poses, instead of just the frontal and profile (right) poses.
7.3.4 Inclusion of Additional Pose
In the previous section, it was hypothesised that when the number of poses is
increased, the benefit of pose normalisation will be even more pronounced. The
experiments performed in the following section are designed to illustrate this
point. To do this, the left profile pose data was included. With the introduction
of the left profile pose data, the lipreading paradigm has shifted from the one
depicted in Figure 7.1 to the one given in Figure 7.8.
An additional lipreading system was developed to accommodate the additional
pose data. As in the previous experiment for the combined(80-20) system, the
"combined(80-10-10)" system was trained on data which was biased towards the
frontal pose, with 80% frontal (958 utterances), 10% right profile (120) and
10% left profile (120) data. To see the benefit of
pose normalisation over the combined model with the additional pose, the
frontal, combined(80-20) and combined(80-10-10) systems were tested. These
systems were tested on the front, right profile, left profile, right projected pro-
file, left projected profile, combined(80-20), combined-projected profile(80-20),
“combined(80-10-10)” and “combined-projected left and right profile(80-10-10)”
test sets. Like the combined(80-20) and combined-projected profile(80-20) test
sets, the combined(80-10-10) and combined-projected left and right profile(80-
10-10) test sets contained the three different poses, consisting of 80% for the frontal
(194), 10% for the right profile and projected (24) and 10% for the left profile
and projected (24) data respectively. It is worth noting that the right profile data
refers to the profile data mentioned in the previous experiments.
For these experiments, the left profile data set was constructed by horizontally
mirroring the right profile ROI images. Once these ROIs were obtained, the visual
feature extraction step was performed as normal. As the left profile ROIs were just
the mirrored right profile images, the features were effectively the
same due to the DCT step in the visual feature extraction process. As the DCT
is a laterally symmetrical function (see Chapter 5.5), the only
difference between the left and right profile features was that the odd frequency
components were of opposite polarity, which in turn resulted in essentially the
same visual feature vectors being obtained for both profile poses. As such,
the lipreading results for each of these poses were identical and, as this was the
case, they are referred to simply as profile in the results.
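The mirror-symmetry property described above can be verified numerically. The sketch below builds the DCT-II basis directly with NumPy and applies it along the horizontal axis; the 32×32 ROI here is random stand-in data, not thesis imagery.

```python
import numpy as np

def dct2_cols(a):
    """DCT-II applied along the columns (horizontal) axis of a 2D array."""
    n = a.shape[1]
    m = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    # basis[m, k] = cos(pi * (2m + 1) * k / (2n)), the DCT-II kernel
    basis = np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    return a @ basis

rng = np.random.default_rng(42)
roi = rng.random((32, 32))        # stand-in for a 32x32 mouth ROI
mirrored = roi[:, ::-1]           # horizontal mirror (left <-> right profile)

C = dct2_cols(roi)
Cm = dct2_cols(mirrored)

# Odd horizontal-frequency coefficients flip sign; even ones are identical,
# so magnitude-based features are the same for both profile poses.
signs = (-1.0) ** np.arange(32)
assert np.allclose(Cm, C * signs[None, :])
```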
Table 7.1 shows the results for these experiments. As the results from the
previous experiment showed that optimal performance was gained with Q = 20,
this number of features was used for this experiment. From these results it can
be seen that when data of another pose is added, the benefit of normalising the
pose is more substantial when compared to the combined systems. When only two
poses were used, the performance on the combined(80-20) test set showed that the
combined(80-20) system obtained a WER of 37.33%. Contrastingly, when
three poses were included in the combined(80-10-10) test set, the combined(80-10-10)
system obtained a worse WER of 39.96%, which is a degradation of around 2.6%.
It can be seen that the frontal and the projected profile performance remains con-
stant; however, the performance on the profile degrades from a WER of 62.55% for
the two poses to 69.74% for three poses. Like the previous experiments, this can
be attributed to the lack of classification power the system possesses to accurately
model features across the different poses. In comparison, projecting the features
into a uniform pose does not alter the performance of the lipreading systems at
all. It would be expected that further degradations would occur to the combined
                                    System Tested on
System Trained   Frontal  Profile  Proj     Comb     Comb Proj  Comb        Comb Proj
                                   Profile  (80-20)  (80-20)    (80-10-10)  (80-10-10)
Frontal          29.18    87.07    54.85    40.09    33.90      40.07       33.81
Comb(80-20)      32.46    62.55    57.98    37.33    36.61      41.23       40.76
Comb(80-10-10)   32.51    69.74    58.02    38.19    37.31      39.96       36.82
Table 7.1: Lipreading results in WER (%) showing the effect that an additional pose has on performance for Q = 20. As the left and right profile WER were the same, profile refers to both poses. The combined(80-10-10) test set refers to frontal (80%), right (10%) and left (10%) profile poses.
systems when more poses are included into the system (i.e. ±30°, ±45°, ±60°, etc.).
However, by utilising the pose-normalising step as described in this chapter, the
degradation to the overall lipreading performance can be minimised.
7.3.5 Limitations of Pose-Normalising Step
The linear regression method mentioned in this section works quite well in min-
imising the train/test mismatch between the visual speech features of the various
poses. However, it must be mentioned that it is anticipated that this method
would only be useful for small vocabulary tasks, as the single trans-
formation matrix can only learn the differences between the
frontal and non-frontal poses to a certain extent. For applications such as large vocabulary lipreading
with many speakers (> 100), it is expected that this method would be prohibitive
due to the large amount of data required to train up a single transformation ma-
trix. It is also unrealistic to think that a single transformation matrix could
learn the differences between all the different speakers and visual sounds
as well.
7.4 Continuous Pose-Invariant Lipreading
At the start of the chapter it was stated that the main motivation behind pose-
invariant lipreading was to make the lipreading system more “real-world” by
allowing the speaker to be in more than one pose during the utterance. However,
[Figure 7.9 appears here: block diagram. Video in → pose estimate (from face) → pose-specific ROI localisation (front, right profile, left profile) → visual feature extraction → normalise pose (for the profile branches) → HMM → recognise visual speech.]
Figure 7.9: Block diagram of the continuous pose-invariant lipreading system.
all the work presented in this chapter thus far has dealt with the scenario where
the speaker has remained in the same pose for the entire sequence. In addition,
the pose of the speaker was assumed to be known, which hardly makes
the lipreading system "real-world". In an attempt to remedy this situation, this
section of the thesis is dedicated to the development of a continuous pose-invariant
lipreading system. To accommodate this work, the CUAVE database [130] was
used as it contained continuous video of speakers talking in three different poses:
frontal, left profile and right profile (see Chapter 3.6.3 for full details).
The deployment of a continuous pose-invariant lipreading system is very sim-
ilar to the stationary scenario, albeit with one modification. This modification is
the inclusion of a pose-estimator at the front of the visual front-end, as depicted
in Figure 7.9. It can be seen that once the pose of the speaker has been estimated,
this estimation is used to direct the system to locate the ROI of that particu-
lar pose. Once the ROI has been extracted, visual feature extraction can take
place and the features can be combined into a single model or normalised into
the frontal pose as described in the previous section. It must be noted that the
addition of a pose-estimator at the start of the lipreading system may seem like
a simple enough solution; however, this type of approach can be problematic as it
provides another avenue to introduce error into the lipreading system due to the
front-end effect. Ideally, perfect pose-estimation would be achieved which would
result in the lipreading performance not being affected at all. Unfortunately,
this is extremely difficult to achieve and as a consequence, it is expected that
some error will be introduced into the lipreading system via the pose-estimator.
Through a number of experiments on the CUAVE database, the impact of the
pose-estimation module on the entire lipreading system is analysed in this sec-
tion. Prior to this analysis though, a full description of the pose-estimation and
multi-pose visual front-end is given.
7.4.1 Pose Estimation
In Chapter 4, many different visual front-ends were discussed and whilst all have
some advantages associated with them, the Viola-Jones algorithm [180] was se-
lected for use in this thesis as it is extremely rapid, accurate and able
to be used for non-frontal poses as well as frontal. Throughout this thesis, the
benefit of using this algorithm has been illustrated; however, it has only centered
on locating faces and facial features of one specific pose (both frontal and pro-
file). For continuous pose-invariant lipreading, a multi-pose paradigm has to be
adopted. This highlights another benefit of the Viola-Jones framework, as it is able
to accommodate the multi-pose scenario by the inclusion of a pose-estimator,
which still allows for extremely quick localisation of faces and features [82].
According to Jones and Viola [82], the multi-pose visual front-end depicted in
Figure 7.9 is the preferred option amongst researchers. A reason they gave was
that a holistic approach, where a single classifier is trained to detect all poses of
a face, is unlearnable with existing classifiers. In their informal experiments they
found that using the holistic approach yielded extremely inaccurate results. The
initial work in this thesis using this holistic approach also backs up this assertion.
It would appear that, like the previous section where the combined HMM classifier
suffered from over-generalisation, the boosted cascade of simple classifiers suffers
from the same problem. Another disadvantage of using a single classifier across
all poses is that no information about the speaker's pose is gained. This
means that the pose normalising step using linear regression that was described in
the previous section cannot be utilised.
The pose-estimation of a speaker's face is essentially a chicken-and-egg
problem. Firstly, the location of the face has to be known to determine its pose,
but the pose of the face has to be known to find the face. A prudent strategy
would be to solve both of these problems simultaneously. To do
this, a face classifier for each pose has to be constructed, and each classifier then has
to be scanned exhaustively over every position and scale in the image. As this is
extremely expensive in terms of computation, a rapid detection framework like
the Viola-Jones framework has to be employed. In [82], Jones and Viola did
such a thing by building different detectors for different poses of the face. These
classifiers were then placed in a decision tree to determine the pose of the given
window being tested. Rowley et al. [155] employed a similar strategy, but it was
reported to be neither as quick nor as accurate as the one devised by Jones and Viola
[82].
For this thesis, a similar strategy to Jones and Viola was used to develop the
pose estimator. A diagram of the devised pose estimator is depicted in Figure
7.10. From this figure it can be seen that given a frame of a speaker’s face,
all the face classifiers are applied to the image to determine the location of the
face. Once a face has been located by a pose specific classifier, this face and
pose information is then used by the continuous pose-invariant lipreading system
which is described by Figure 7.9. This procedure works well when only one of
the poses is estimated, however, it gets complicated when there is more than
one pose estimated as there is no way of knowing which pose is the correct
one. To counteract this problem, the nearest neighbour variable is used. The
nearest neighbour variable is a parameter in OpenCV’s generic object detector
[Figure 7.10 appears here: flowchart. Video in → set nearest neighbour = 1 → check front, left and right pose face classifiers → how many poses estimated? If one, run the visual front-end for the estimated pose; if more than one, increase the nearest neighbour parameter by 2 and re-check; if none (pose estimation failure), use the previous pose.]
Figure 7.10: Block diagram of the pose estimator which incorporates the pose estimation with the face localisation.
[128], which essentially regulates how much an object has to look like the object
of interest before it is recognised as that object. When an object in an image
looks like the object of interest (i.e. a face), the object detector puts a number of
rectangles around the object. The more the object looks like the object of
interest, the more rectangles are placed around it. For example, in Figure 7.11
a speaker’s face is detected as such by a face classifier, which is symbolised by the
three rectangles around the speaker’s face. If the nearest neighbour parameter is
set to three or less, then the face is deemed to be a face. However, if the nearest
neighbour parameter is set to four or above, then the face is not deemed to be a
face.
For the pose-estimator given in Figure 7.10, the nearest neighbour parameter
is set to one and all the pose specific face classifiers are tested on the given frame.
Figure 7.11: Example showing the function of the nearest neighbour variable in the face localiser.
If only one face/pose is found then that information is used by the lipreading
system. However, if there is more than one pose estimated, the nearest neighbour
parameter is increased by two to determine which is the more likely pose. This
process is continued until only one face/pose is found. If no face/pose is found,
the face and pose information from the previous frame is used.
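The escalation loop just described can be sketched in a detector-agnostic way. The `detectors` callables below stand in for OpenCV cascade classifiers with a minNeighbors-style argument; the function name, the +2 step (from the text) and the cap of 15 iterations are assumptions of this sketch, not the thesis's implementation.

```python
def estimate_pose(frame, detectors, prev_pose=None, max_neighbours=15):
    """Resolve the speaker's pose by escalating the nearest-neighbour
    threshold until exactly one pose-specific classifier fires.

    detectors: dict mapping pose name -> callable(frame, min_neighbours)
    returning a list of detected face boxes (possibly empty).
    Returns (pose, box); (prev_pose, None) if no face is found.
    """
    n = 1
    while n <= max_neighbours:
        hits = {pose: detect(frame, n) for pose, detect in detectors.items()}
        firing = [pose for pose, boxes in hits.items() if boxes]
        if len(firing) == 1:            # unambiguous: use this pose
            return firing[0], hits[firing[0]][0]
        if not firing:                  # nothing found: reuse previous pose
            return prev_pose, None
        n += 2                          # ambiguous: demand stronger evidence
    return prev_pose, None
```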
7.4.2 Experimental Setup
For all the experiments in this section, the isolated digit tasks from the individual
section of the CUAVE database were used. A full description of the database
protocol is given in Chapter 3.6.3. However, it is worth noting that in this data
four different tasks were tested, i.e. normal (frontal), moving (frontal), left
profile and right profile. For the lipreading results, each of these individual tasks
was compared against the combined performance. For the training of the pose-
estimator and pose specific visual front-ends, only the frontal, left-profile and
right-profile poses were considered. The face and facial feature classifiers for each
pose were trained up on 500 manually annotated positive examples and 2000
negative examples, in the same manner as the previous experiments. The set of
500 positive examples for each pose were taken from all the 33 subjects. This was
because there were not enough speakers to create classifiers to achieve accurate
localisation for the ten different train/test sets devised in Chapter 3.6.3. As such,
only one variant of the pose-estimator and visual front-ends was developed for
these experiments. The set of positive examples for each pose was augmented
by including rotations of ±5°, ±10°, providing a set of 2500 positive examples. A
Pose            Correctly       False Alarm   Miss Alarm
                Estimated (%)   Rate (%)      Rate (%)
Front           92.31           0.00          7.69
Right Profile   87.17           5.13          7.69
Left Profile    89.74           5.13          5.13
Total           89.74           3.42          6.84
Table 7.2: Pose estimate results on the CUAVE validation set, which consisted of 39 images for each pose.
separate validation set of 39 annotated images for each specific pose was used
to test the pose-estimator and pose specific visual front-ends. These results are
presented in the following subsections.
7.4.3 Pose Estimate Results
The pose estimate results are shown in Table 7.2. Whether the
pose of the speaker was correct or not was determined by manual inspection. From the
results, it can be seen that the pose-estimator/face localiser achieves reasonable
results; however, it is far from ideal, as an error at this stage will
cause erroneous localisation of the ROI and thus incorrect recognition of the visual
speech. This again shows the impact of the front-end effect.
Most of the false and miss alarms occur when the pose is in transition (i.e. not
quite frontal and not profile). Examples of the pose estimation/face localisation
are shown in Figure 7.12. The top two rows of this figure show the results for
the frontal pose. The more difficult frames for the frontal pose were selected for
testing. This was done as it was expected that these frames would cause the
most trouble for the pose-estimator and for the multi-pose visual front-end to
operate successfully, it would need good performance on such frames. As can be
seen from the frames in the last column, a few miss alarms were incorporated
due to the irregular rotation of the speakers face (i.e. the first one the speaker is
looking upwards, the second is in between front and left profile pose). Overall, it
can be said that the performance for the frontal pose was quite good with only a
small number of miss alarms and no false alarms, however, due to the small size
Figure 7.12: Examples of results from the pose estimator. The first two rows give results for the frontal pose. The third and fourth rows give the results for the right profile pose and the last two rows give the results for the left profile pose. The last column gives examples of false estimates and miss estimates.
of the validation set this cannot be said with any great confidence. The third
and fourth rows give examples for the right profile pose, whilst the bottom two
rows give examples for the left profile pose. The right profile pose gave the worst
performance of all poses, but only marginally. For the left and right
profile poses, there were a few false alarms, with these getting confused with the
frontal pose. As the speakers do not always hold a definite frontal or profile pose,
this ambiguity confuses the pose-estimator and causes the errors.
7.4.4 Multi-Pose Localisation Results
Each pose specific visual front-end was developed in the same fashion as those
developed for the experiments conducted on the frontal and profile poses of the
IBM smart-room databases (see Chapter 4.6 and Chapter 6.2 for full details).
Facial Feature        Accuracy (%)
                      Frontal   Right Profile   Left Profile   Total
Right Eye             87.17     -               82.05          84.61
Left Eye              84.62     82.05           -              83.34
Nose                  79.49     76.92           79.49          78.63
Right Mouth Corner    82.05     -               79.49          80.77
Top Mouth             81.08     74.36           71.79          75.74
Left Mouth Corner     82.05     79.49           -              80.77
Bottom Mouth          76.92     71.79           74.36          74.36
Center Mouth          82.05     79.49           79.49          80.34
Chin                  61.54     53.85           58.97          58.12
Table 7.3: Facial feature localisation accuracy results for all poses on the CUAVE validation set.
The localisation results are given in Table 7.3. It is worth noting that a feature
was deemed to be successfully located if it was within 10% of the manually an-
notated distance between the eyes for the frontal pose, and 10% of the manually
annotated distance between the nose and chin for the profile poses. From the
results it appears that the localisation of the facial features is not as good as
in the experiments conducted on the IBM smart-room data. These results can be
misleading though, as the actual facial feature localisation performance is largely
on par with the smart-room data performance. However, the
performance degradation was caused by the previous step of pose-estimation/face
localisation. In another example of the front-end effect, the false or miss alarms
of the pose-estimator/face localisation module filtered down to the facial feature
localisation step, which caused the degraded results. This was to be expected
though, as this task is much more difficult than the tasks associated with the
smart-room data as the variable of head pose movement is introduced.
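The success criterion used above (error within 10% of a pose-dependent reference distance) can be written down directly; `feature_located` and its argument names are illustrative, not taken from the thesis code.

```python
import math

def feature_located(pred, gt, ref_a, ref_b, tol=0.10):
    """True if the predicted point lies within `tol` of the reference
    distance from the ground truth. For the frontal pose the reference
    points are the two eyes; for the profile poses, the nose and chin.
    Points are (x, y) tuples."""
    return math.dist(pred, gt) <= tol * math.dist(ref_a, ref_b)
```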
Examples of the localised faces and facial features are given in Figure 7.13.
The top row gives examples from the frontal pose, the second row gives examples
from the right profile pose and the third row gives examples from the left profile
pose. The bottom row gives the associated examples of the extracted 32 × 32
Figure 7.13: Examples of face and facial feature localisation from the multi-pose visual front-end. The bottom row gives the associated examples of the extracted 32 × 32 ROIs.
ROIs. For the frontal pose, scale and rotation normalisation was performed using
the left and right mouth corners. As all sequences started with the speaker in
the frontal pose, scale normalisation for the profile poses used the scale
metric determined during the initialisation of the visual front-end.
Unfortunately, no rotation normalisation could be performed for the profile
poses due to the lack of horizontally aligned points (see Chapter 6.2). Once the
ROIs were extracted, the same visual feature extraction process used throughout
this thesis was applied in these experiments.
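The mouth-corner-based scale and rotation normalisation for the frontal pose can be sketched as follows (a minimal illustration; the function name and the 32-pixel target width are assumptions for this example):

```python
import math

def mouth_corner_normalisation(left_corner, right_corner, target_width=32):
    """Derive the in-plane rotation angle and isotropic scale factor that map
    the mouth-corner pair onto a horizontal segment of fixed width, as one
    way to normalise the frontal-pose ROI before extracting the mouth region.
    Corners are (x, y) pixel coordinates."""
    dx = right_corner[0] - left_corner[0]
    dy = right_corner[1] - left_corner[1]
    angle = math.degrees(math.atan2(dy, dx))  # rotation to undo, in degrees
    width = math.hypot(dx, dy)                # current mouth width in pixels
    scale = target_width / width              # scale factor to apply
    return angle, scale
```

For the profile poses, where no horizontally aligned point pair exists, only the scale factor would be available.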
7.4.5 Continuous Pose-Invariant Lipreading Results
The experiments in this section were broken into two parts. The first part
investigated the lipreading performance of the four individual tasks: normal,
moving, right profile and left profile. For each individual task, the models
were trained and tested solely on the data for that task. The second part
investigated the pose-invariant lipreading task. For these experiments, a single
model trained on all the different tasks was used for testing; this was termed
the “combined all” result. In addition, depending on the result from the
pose-estimator, the features were normalised into the frontal pose using the
pose-invariant technique based on linear
Task WER (%)
Normal 46.88
Moving 67.26
Right Profile 71.95
Left Profile 71.54
Combined Individual 57.97
Combined All 61.20
Pose Normalised 61.49
Table 7.4: The upper part of the table shows the average lipreading performance for each individual task, whilst the bottom part compares the performance for the combined individual, combined all and pose normalised tasks, across the 10 different train/test sets.
regression introduced at the start of this chapter. As no synchronous data was
available in the CUAVE database to develop the linear regression matrices, the
left and right profile matrices from the IBM smart-room database were utilised
for this task. These results were referred to as the “pose normalised” results.
Both the “combined all” and “pose normalised” results were compared to the
average of the individual results which was termed “combined individual”.
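The linear-regression pose normalisation can be sketched as a least-squares projection learned from synchronous feature pairs; all names below are illustrative, and the appended bias column is one possible design choice rather than the thesis's exact formulation:

```python
import numpy as np

def train_pose_transform(profile_feats, frontal_feats):
    """Estimate a linear map (with bias) projecting profile-pose features
    onto their synchronous frontal counterparts via least squares.

    profile_feats, frontal_feats: (num_frames, dim) arrays of paired
    visual features extracted from the same utterances."""
    X = np.hstack([profile_feats, np.ones((len(profile_feats), 1))])
    W, *_ = np.linalg.lstsq(X, frontal_feats, rcond=None)
    return W

def normalise_pose(profile_feats, W):
    """Project profile-pose features into the frontal pose."""
    X = np.hstack([profile_feats, np.ones((len(profile_feats), 1))])
    return X @ W
```

Note that, as in the experiments above, the transform is only as good as the paired data it is trained on: matrices estimated on one corpus need not transfer to another.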
The results for the continuous pose-invariant lipreading system are given in
Table 7.4. For the individual tasks, the normal task achieved the best performance
with an average lipreading WER of 46.88%. This was to be expected as this
was the easiest task to perform, due to the speaker being relatively stationary.
Even though the moving task had the speaker in the frontal pose, having the
speaker move their head back and forth whilst speaking degraded the lipreading
performance markedly, to 67.26%. As this task had the speaker moving their
head quite quickly, a major reason for this poor performance can be assumed
to be poor tracking of the ROI. The left and right profile tasks achieved
even worse WERs of 71.54% and 71.95% respectively. It must be noted that the
WERs of 46.88% and 71.95% achieved in this experiment for the normal and
right profile tasks are significantly worse than the 27.66% and 38.88% WERs
achieved in the IBM smart-room database for the similar scenario. There are
two reasons for this. Firstly, due to the small size of the CUAVE database, a
speaker-independent lipreading paradigm had to be used in these experiments
using 10 different train/test sets, compared to the multi-speaker paradigm used
in the IBM smart-room experiments. Better lipreading performance is expected
under the multi-speaker paradigm, as speakers in the training set also appear
in the test set. Secondly, and probably most importantly, due to the
relatively small size of the CUAVE database there was not enough speech data
to adequately train the models for each task. This would be the case especially
for the profile models, as only ten digits were available from each speaker. This
corresponds to only 250 words available to train the models, which would cause
the models to be grossly undertrained.
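The WER figures quoted throughout are the standard Levenshtein-alignment measure over word sequences; a minimal sketch (function name illustrative):

```python
def word_error_rate(reference, hypothesis):
    """WER (%) = (substitutions + deletions + insertions) / reference length,
    computed by Levenshtein alignment over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

For example, recognising "one too three" against the reference "one two three four" gives one substitution and one deletion, i.e. a WER of 50%.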
For the continuous pose-invariant, or combined, experiments, the “combined
individual” result, i.e. the average of the individual tasks, yielded the best
lipreading performance with a WER of 57.97%. However, the “combined
all” results were not far behind with a WER of 61.20%. The pose-invariant step
using linear regression did not improve the performance, achieving a WER of
61.49%. This is probably because the transformation matrices used were trained
on the IBM smart-room database.
Given the small amount of visual speech data contained within the CUAVE
database, it is hard to gauge the significance of the lipreading results obtained
from these experiments. They do, however, indicate that the goal of continuous
pose-invariant lipreading is indeed attainable, as the achieved WER of 61.20%
is much better than pure chance. Regardless of the lipreading results, it is
evident that the development of a continuous pose-invariant lipreading system is
the next step in deploying a fully functional “real-world” AVASR system, and the
key challenge is developing a robust visual front-end with an extremely accurate
pose-estimator.
7.5 Summary
In this chapter, a novel and useful contribution to the field of AVASR was
presented with the introduction of a pose-invariant lipreading system. Pose-
invariant lipreading refers to the situation where, given a single camera, the
system can recognise visual speech regardless of pose. Two scenarios of the problem were
considered: stationary and continuous. The first part of the chapter dealt with
the stationary situation, which refers to the case where the speaker remains in
one pose (frontal or right profile) for the entire utterance. These experiments were
conducted on the IBM smart-room database. In these experiments it was shown
that when the features of one pose were tested on the other pose, the train/test
mismatch between the two was large and the lipreading performance severely
degraded as a consequence. To overcome this problem, a pose-invariant or pose
normalising technique using linear regression was used to project all the features
of the unwanted pose into the wanted pose. This technique was shown to reduce
the train/test mismatch between the different poses, and was shown to be of
particular benefit when one pose was more prevalent than the other (i.e. frontal
over right profile) due to over-generalisation. In some extended experiments, it
was shown that the effect of this pose-invariant technique is more pronounced
when more poses are included (left profile), again due to over-generalisation.
However, a caveat on this approach was the dimensionality of the features used
to determine the linear regression matrix. It was shown that once the dimension
is greater than 30, the benefit of the pose-invariant technique is diminished and
better performance is gained through a combined model of the different poses.
In the latter part of the chapter, the more realistic continuous scenario was
investigated. Continuous pose-invariant lipreading refers to the speaker changing
their head pose whilst they are speaking. This constituted a much more diffi-
cult problem than the stationary scenario as the pose of the speaker had to be
estimated every frame. In this novel system, the pose-estimator was developed
in conjunction with the face localiser and achieved reasonable results. As the
pose-estimation step was at the front of the lipreading system, it introduced ex-
tra error which affected the overall lipreading performance due to the front-end
effect. The results for these experiments, which were conducted on the CUAVE
database, show this to be the case as they are somewhat behind the lipreading
performance of the IBM smart-room database.
Chapter 8
Conclusions and Future Research
8.1 Summary of Contributions
Over the past twenty years, hundreds of articles have been dedicated to
illustrating the benefit of using the visual speech information from a speaker’s
mouth in addition to the audio signal for the task of speech recognition (see Chap-
ter 2). Even though all these works have shown that including the visual channel
to the speech recognition system greatly improves the recognition performance in
the presence of acoustic noise, no serious attempts have been taken in deploying
an AVASR system which can be used in realistic noisy environments, such as an
in-car scenario. A major reason for this is that nearly all the current work carried
out within this field has failed to focus on unwanted variabilities that lie within
the visual domain, such as head pose. In an attempt to remedy this situation,
the work in this thesis has concentrated on researching and developing methods
to recognise visual speech across multiple views. Within this broad problem, the
following specific goals were set as the main objectives for this thesis:
1. Recognise visual speech from profile views and compare it to its synchronous
counterpart in the frontal view,
2. Determine whether there is any complementary information contained within the
profile viewpoint by combining both frontal and profile features together to
form a multi-view lipreading system, and
3. Develop a pose-invariant lipreading system which can recognise visual speech
regardless of the head pose from a single camera.
162 Chapter 8. Conclusions and Future Research
All the work contained in this thesis was performed with the intention of address-
ing these novel and previously unsolved objectives. The major original contribu-
tions resulting from this work are summarised as follows:
(i) Prior to any work on lipreading from non-frontal views being conducted, a
thorough investigation of the cascade of appearance-based features, which
is the current state-of-the-art visual feature extraction technique, was un-
dertaken on the frontal section of the IBM smart-room database in Chapter
6. In this novel investigation, analysis of each stage of the cascade of
appearance-based features was performed, highlighting the problem of di-
mensionality in lipreading using a HMM classifier. Through this analysis
it was shown that certain measures can be taken to maximise the amount
of speech information extracted from the visual domain through the use of
the DCT and LDA techniques. The impact of feature mean normalisation
(FMN) was also quantified in this analysis, with the FMN step shown to
eliminate redundant speaker information which greatly affected the lipread-
ing performance.
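The DCT and FMN stages of this cascade can be sketched as follows. This is a minimal illustration assuming a square grayscale ROI; the function names and the number of retained coefficients are assumptions for this example, and the LDA stage is omitted:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] *= 1 / np.sqrt(2)
    return c * np.sqrt(2 / n)

def extract_features(roi_sequence, keep=10):
    """Per frame: 2-D DCT of the square ROI, retain the top-left `keep` x
    `keep` block (where low-frequency energy concentrates), and flatten.
    Then apply feature mean normalisation (FMN): subtracting the
    per-utterance mean suppresses static speaker/appearance information."""
    n = roi_sequence.shape[1]
    C = dct_matrix(n)
    feats = np.array([(C @ roi @ C.T)[:keep, :keep].ravel()
                      for roi in roi_sequence])
    return feats - feats.mean(axis=0)
```

In practice an LDA projection over concatenated adjacent frames would follow, to compress the features into a dimensionality the HMM classifier can model reliably.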
(ii) A visual front-end based on the extremely rapid Viola-Jones algorithm was
developed which could locate and track a speaker’s mouth ROI from both
the frontal and profile views (Chapters 4 and 6). For both the frontal and
profile views, a hierarchical approach was utilised which used the previously
located facial feature points for ROI extraction. For the frontal pose, the
left and right mouth corners were used for scale and rotation normalisation.
For the profile pose, the left eye and left mouth corner were used for scale
normalisation. Unfortunately, no rotation normalisation could be performed
on the profile pose as no reliable horizontal facial feature points could be
located.
(iii) The lipreading performance from a speaker’s profile view was quantified in
Chapter 6. This lipreading performance was then compared against its syn-
chronous frontal counterpart. This comparison was novel and unique, as it
was the first to show that reasonable lipreading performance can be obtained
from the profile view, albeit degraded when compared to the frontal view
(38.88% vs 27.66% WER).
(iv) A novel analysis technique using patches was employed on both the frontal
and profile mouth ROIs to determine the saliency of the various regions
of both the ROIs to the task of lipreading. In this innovative analysis, it
was shown that the middle patch, containing the most visible articulators
such as the lips, teeth and tongue, gave the most visual speech information
for the frontal view. Similarly, in the profile view the middle patch was also
the most informative; however, it was hypothesised that, in addition to the
lip, teeth and tongue information, the lip protrusion information
was also of benefit. From this patch-based analysis, a new multi-stream
representation of visual speech was developed which fused the most salient
patches of the ROI together via the synchronous multi-stream HMM. Using
this novel approach, it was found that slight gains could be made over the
holistic patch by fusing the holistic patch with the middle patch. This work
was conducted in Chapters 5 and 6.
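The synchronous multi-stream score combination underlying this fusion can be sketched as below; the function name and example weights are illustrative:

```python
def combined_log_likelihood(stream_loglikes, stream_weights):
    """Synchronous multi-stream HMM emission score: a weighted sum of the
    per-stream log-likelihoods for the current state, e.g. one stream for
    the holistic ROI patch and one for the middle patch. The weights
    conventionally sum to one and reflect each stream's reliability."""
    assert len(stream_loglikes) == len(stream_weights)
    return sum(w * ll for ll, w in zip(stream_loglikes, stream_weights))
```

With equal weights the two streams contribute equally; shifting weight toward the more informative patch is how the fusion can outperform either stream alone.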
(v) At the end of Chapter 6, a novel system which fuses both the frontal and
profile synchronous features together was described. This was referred to
as a multi-view lipreading system. The multi-view system presented in this
thesis was unique as it is the first lipreading system to have more than
one camera as its input. From the multi-view experiments, it was shown
that there does exist complementary information in the profile view, which
in turn improved the overall lipreading performance (multi-view WER =
25.66% compared to frontal WER = 27.66%).
(vi) A unified approach to lipreading was presented in Chapter 7, normalising
all poses to a single uniform pose. Given only one camera, this pose-
invariant lipreading system used a transformation matrix based on linear
regression to project the features of the unwanted pose (profile) into the
wanted pose (frontal). These experiments were performed for the station-
ary scenario, where the speaker was fixed in one pose (i.e. frontal or profile)
for the entire utterance and the pose of the speaker was assumed to be known. This
pose-normalising step was shown to lessen the train/test mismatch between
the two poses and was shown to be of particular benefit when the speaker
was in one pose more than the other (i.e. frontal over profile). When more
non-dominant poses were included (such as the other profile pose), the pose-
normalising step also proved to be of benefit.
(vii) A more realistic continuous pose-invariant lipreading system, which allows
the speaker to move their head during the utterance was proposed at the
end of Chapter 7. This constituted a much more difficult problem than the
stationary scenario as the pose of the speaker had to be estimated every
frame. In this novel system, the pose-estimator was developed in conjunc-
tion with the face localiser and achieved reasonable results. As the pose-
estimation step was at the front of the lipreading system, it introduced extra
error which affected the overall lipreading performance due to the front-end
effect. The results for these experiments, which were conducted on the
CUAVE database, show this to be the case as they are somewhat behind
the lipreading performance of the IBM smart-room database.
8.2 Future Research
In this thesis, solutions towards the problem of lipreading from multiple views
were investigated, with results from a multitude of experiments involving non-
frontal views presented for the small-vocabulary task of connected-digit recogni-
tion. Although head pose is a major source of variability, others such as
illumination, appearance, speaking style, image alignment (registration) and
speaker emotion and expression need to be investigated as well. A much more
robust AVASR system could be obtained if lipreading across these variables were
studied. In addition, future research needs to be conducted on large-vocabulary
data for this technology to become a viable option. However, to facilitate this
research, databases which cover large-vocabulary tasks as well as containing
these visual variabilities need to become available.
In Chapters 4 and 6, as part of the visual front-end, the Viola-Jones algorithm
[82, 180] was used for locating a speaker’s face and facial features for both frontal
and non-frontal views. The main motivation behind using this algorithm was
that it was extremely fast and reasonably accurate. Recently, a fast imple-
mentation of active appearance models (AAMs) using a variant of the gradient
descent algorithm has emerged which can run in real-time. As AAMs fit a 3-D
mesh onto a speaker’s face, this method promises to improve locating/tracking
performance as well as the pose-estimation process. Future research needs to be
conducted in this area, as accurate location of a speaker’s ROI is central to
the success of a lipreading system.
In Chapter 7, a viewpoint-transformed method using linear regression to
project visual features from an unwanted viewpoint into a wanted viewpoint was
developed. In addition to the viewpoint-transformed method, coefficient-based
methods, such as the light-field type approach [67], exist to perform the same type
of task. Future research is required to compare the coefficient-based methods to
the viewpoint-transformed methods, to get some kind of indication to which type
of approach is more suited to the task of lipreading.
Further into the future, it is possible that lipreading could evolve into one
of the key technologies used online. With the recent advent of the extremely
popular YouTube 1, users across the world have access to billions of video
clips on the internet. Having a lipreading system
which can automatically detect who is speaking, when they are speaking and what
they are saying within a video clip would be of major benefit for automatically
authenticating and possibly censoring these video clips. Even though this task is
outside the scope of this thesis, it is worth noting some of the potential that this
technology possesses.
1http://www.youtube.com
Bibliography
[1] A. Adjoudani and C. Benoit, “On the integration of auditory and visual
parameters in an HMM-based ASR,” in Speechreading by Humans and Ma-
chines (D. G. Stork and M. E. Hennecke, eds.), pp. 461–471, Berlin, Ger-
many: Springer, 1996.
[2] A. Adjoudani, T. Guiard-Marigny, B. LeGoff, L. Reveret, and C. Benoit,
“A multimedia platform for audio-visual speech processing,” in Proceed-
ings of the European Conference on Speech Communication and Technology,
(Rhodes, Greece), pp. 1671–1674, 1997.
[3] P. Aleksic, J. Williams, Z. Wu, and A. Katsaggelos, “Audiovisual speech
recognition using MPEG-4 compliant visual features,” EURASIP Journal
of Applied Signal Processing: Special Issue on Joint Audio-Visual Speech
Processing, vol. 2002, no. 11, pp. 629–642, 2002.
[4] E. Aronson and S. Rosenblum, “Space perception in early infancy: percep-
tion within a common auditory-visual space,” Science, vol. 172, pp. 1161–
1163, 1971.
[5] J. P. Barker and F. Berthommier, “Estimation of speech acoustics from
visual speech features: A comparison of linear and non-linear models,” in
Proceedings of the International Conference on Auditory-visual Speech Pro-
cessing, (Santa Cruz, USA), pp. 112–117, 1999.
[6] P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs Fisherfaces:
Recognition using class specific linear projection,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
[7] C. Benoit, T. Guiard-Martigny, B. L. Goff, and A. Adjoudani, “Which
components of the face do humans and machines best speechread?,” in
Speechreading by Humans and Machines (D. Stork and M. Hennecke, eds.),
Berlin, Germany: Springer, 1996.
[8] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[9] V. Blanz, P. Grother, P. Phillips, and T. Vetter, “Face recognition based on
frontal views generated from non-frontal images,” in Proceedings of the In-
ternational Conference on Computer Vision and Pattern Recognition, vol. 2,
(San Diego, CA, USA), pp. 454–461, 2005.
[10] H. Bourlard and S. Dupont, “A new ASR approach based on independent
processing and recombination of partial frequency bands,” in Proceedings
of International Conference on Spoken Language Processing, (Philadelphia,
PA, USA), pp. 426–429, 1996.
[11] M. Brand, N. Oliver, and A. Pentland, “Coupled hidden Markov models
for complex action recognition,” in Proceedings of the International Confer-
ence on Computer Vision and Pattern Recognition, (San Juan, Puerto Rico),
pp. 994–999, 1997.
[12] C. Bregler, H. Hild, S. Manke, and A. Waibel, “Improving connected letter
recognition by lipreading,” in Proceedings of the International Conference on
Acoustics, Speech and Signal Processing, (Minneapolis, USA), pp. 557–560,
1993.
[13] C. Bregler and Y. Konig, “Eigenlips for robust speech recognition,” in Pro-
ceedings of the International Conference on Acoustics, Speech and Signal
Processing, vol. 2, (Adelaide, Australia), pp. 669–672, 1994.
[14] N. Brooke and A. Summerfield, “Analysis, synthesis, and perception of vis-
ible articulatory movements,” Journal of Phonetics, pp. 63–76, 1983.
[15] R. Brunelli and T. Poggio, “Face recognition: Features versus templates,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15,
no. 10, pp. 1042–1052, 1993.
[16] U. Bub, M. Hunke, and A. Waibel, “Knowing who to listen to in speech
recognition: Visually guided beamforming,” in Proceedings of International
Conference on Acoustics, Speech, and Signal Processing, (Detroit, MI, USA),
pp. 848–851, 1995.
[17] R. Campbell, “Seeing brains reading speech: A review and speculations,” in
Speechreading by Humans and Machines (D. Stork and M. Hennecke, eds.),
pp. 115–133, Berlin, Germany: Springer-Verlag, 1996.
[18] M. Cathiard, M. Lallouache, and C. Abry, “Does movement on the lips
mean movement in the mind?,” in Speechreading by Humans and Machines
(D. Stork and M. Hennecke, eds.), pp. 211–219, Berlin, Germany: Springer-
Verlag, 1996.
[19] M. Chan, “HMM-based audio-visual speech recognition integrating geomet-
ric and appearance-based visual features,” in Proceedings of the Interna-
tional Workshop on Multimedia Signal Processing, (Cannes, France), pp. 9–
14, 2001.
[20] M. Chan, Y. Zhang, and T. S. Huang, “Real-time lip tracking and bimodal
continuous speech recognition,” in Proceedings of the International Work-
shop on Multimedia Signal Processing, (Los Angeles, CA, USA), pp. 65–70,
1998.
[21] D. Chandramohan and P. Silsbee, “A multiple deformable template approach
for visual speech recognition,” in Proceedings of the International Conference
on Spoken Language Processing, (Philadelphia, PA, USA), pp. 50–53, 1996.
[22] T. Chen, “Audiovisual speech processing,” IEEE Signal Processing Maga-
zine, pp. 9–31, 2001.
[23] T. Chen, H. Graf, and K. Wang, “Lip synchronization using speech-assisted
video processing,” IEEE Signal Processing Letters, vol. 2, no. 4, pp. 57–59,
1995.
[24] C. Chibelushi, F. Deravi, and J. Mason, “A review of speech-based bimodal
recognition,” IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 23–37,
2002.
[25] C. Chibelushi, S. Gandon, J. Mason, F. Deravi, and D. Johnston, “Design
issues for a digital integrated audio-visual database,” in IEE Colloquium on
Integrated Audio-Visual Processing for Recognition, Synthesis and Commu-
nication, (London, UK), pp. 7/1–7/7, 1996.
[26] CHIL: Computers in the Human Interaction Loop.
http://chil.server.de
[27] G. Chiou and J. Hwang, “Lipreading from color video,” IEEE Transactions
on Image Processing, vol. 6, pp. 1192–1195, August 1997.
[28] S. Chu and T. Huang, “Bimodal speech recognition using coupled hidden
Markov models,” in Proceedings of the International Conference on Spoken
Language Processing, (Beijing, China), pp. 747–750, 2000.
[29] S. Chu and T. Huang, “Audio-visual speech modeling using coupled hidden
Markov models,” in Proceedings of International Conference on Acoustics,
Speech and Signal Processing, (Orlando, Fl, USA), pp. 2009–2012, 2002.
[30] M. Cohen and D. Massaro, “What can visual speech synthesis tell visual
speech recognition?,” in Proceedings of the Asilomar Conference on Signals, Sys-
tems, and Computers, (Pacific Grove, CA, USA), 1994.
[31] A. Colmenarez and T. Huang, “Face detection with information-based
maximum discrimination,” in Proceedings of the International Conference
on Computer Vision and Pattern Recognition, (San Juan, Puerto Rico),
pp. 782–787, 1997.
[32] J. Connell, N. Haas, E. Marcheret, C. Neti, G. Potamianos, and S. Veli-
pasalar, “A real-time prototype for small-vocabulary audio-visual ASR,” in
Proceedings of the International Conference on Multimedia Expo, vol. 2, (Bal-
timore, MD, USA), pp. 469–472, 2003.
[33] T. Cootes, G. Edwards, and C. Taylor, “Active appearance models,” in Pro-
ceedings of the European Conference on Computer Vision, vol. 2, (Freiburg,
Germany), pp. 484–498, 1998.
[34] T. Cootes, A. Hill, C. Taylor, and J. Haslam, “Use of active shape models
for locating structures in medical images,” Image and Vision Computing,
vol. 12, pp. 355–365, July/August 1994.
[35] E. Cosatto, G. Potamianos, and H. Graf, “Audio-visual unit selection for the
synthesis of photo-realistic talking-heads,” in Proceedings of the International
Conference on Multimedia and Expo, (New York, NY, USA), pp. 1097–1100,
2000.
[36] P. Cosi and E. Caldognetto, “Lips and jaw movements for vowels and con-
sonants: Spatio-temporal characteristics and bimodal recognition applica-
tions,” in Speechreading by Humans and Machines (D. Stork and M. Hen-
necke, eds.), pp. 291–313, Berlin, Germany: Springer-Verlag, 1996.
[37] S. Cox, I. Matthews, and J. A. Bangham, “Combining noise compensation
with visual information in speech recognition,” in Proceedings of the Work-
shop on Audio-Visual Speech Processing, (Rhodes, Greece), 1997.
[38] D. Cristinacce, T. Cootes, and I. Scott, “A multi-stage approach to facial
feature detection,” in Proceedings of the British Machine Vision Conference,
(London, England), pp. 277–286, 2004.
[39] P. D. Cuetos, C. Neti, and A. Senior, “Audio-visual intent to speak detection
for human computer interaction,” in Proceedings of International Conference
on Acoustics, Speech, and Signal Processing, (Istanbul, Turkey), pp. 1325–
1328, 2000.
[40] L. Czap, “Lip representation by image ellipse,” in International Conference
on Spoken Language Processing, vol. 4, (Beijing, China), pp. 93–96, 2000.
[41] D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Fused HMM-adaptation of
multi-stream HMMs for audio-visual speech recognition,” in Proceedings of
Interspeech (accepted), (Antwerp, Belgium), 2007.
[42] D. Dean, T. Wark, and S. Sridharan, “An examination of audio-visual fused
HMMs for speaker recognition,” in Proceedings of the Second Workshop on Mul-
timodal User Authentication, (Toulouse, France), 2006.
[43] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete
data via the EM algorithm,” Royal Statistical Society, vol. 39, pp. 1–38, 1977.
[44] B. Dodd, “The acquisition of lip-reading skills by normally hearing children,”
in Hearing by Eye: The Psychology of Lipreading (B. Dodd and R. Campbell,
eds.), pp. 163–175, London, England: Lawrence Erlbaum Associates Ltd,
1987.
[45] P. Duchnowski, U. Meier, and A. Waibel, “See me, hear me: Integrating
automatic speech recognition and lip reading,” in Proceedings of the Interna-
tional Conference on Spoken Language Processing, (Yokohama, Japan),
pp. 547–550, 1994.
[46] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley
and Sons, Inc., 2nd ed., 2001.
[47] S. Dupont and J. Luettin, “Audio-visual speech modeling for continu-
ous speech recognition,” IEEE Transactions on Multimedia, vol. 2, no. 3,
pp. 141–151, 2000.
[48] K. Finn, An investigation of visible lip information to be used in automated
speech recognition. PhD thesis, Georgetown University, Washington DC,
USA, 1986.
[49] Y. Freund and R. Schapire, “A decision-theoretic generalization of on-line
learning and an application to boosting,” in Computational Learning Theory:
Eurocolt ’95, pp. 23–37, Springer-Verlag, 1995.
[50] H. Frowein, G. Smoorenburg, L. Pyters, and D. Schnikel, “Improved speech
recognition through video telephony: Experiments with the hard of hearing,”
IEEE Journal of Selected Areas in Communications, vol. 9, pp. 611–616, May
1991.
[51] K. Fukunaga, Introduction to statistical pattern recognition. Academic Press
Inc., 2nd ed., 1990.
[52] D. Gatica-Perez, G. Lathoud, J.-M. Odobez, and I. McCowan, “Multimodal
multispeaker probabilistic tracking in meetings,” in Proceedings of the In-
ternational Conference on Multimodal Interfaces, 2005.
[53] L. Girin, A. Allard, and J. Schwartz, “Speech signals separation: A new
approach exploiting the coherence of audio and visual speech,” in Proceed-
ings on the Workshop on Multimedia Signal Processing, (Cannes, France),
pp. 631–636, 2001.
[54] L. Girin, G. Feng, and J. Schwartz, “Noisy speech enhancement with fil-
ters estimated from the speaker’s lips,” in European Conference on Speech
Communication and Technology, (Madrid, Spain), pp. 1559–1562, 1995.
[55] L. Girin, J. Schwartz, and G. Feng, “Audio-visual enhancement of speech
in noise,” Journal of the Acoustical Society of America, vol. 109, no. 6,
pp. 3007–3020, 2001.
[56] R. Goecke, A stereo vision lip tracking algorithm and subsequent statistical
analysis of the audio-video correlation in Australian English. PhD thesis,
Australian National University.
[57] R. Goecke and J. Millar, “A detailed description of the AVOZES data cor-
pus,” in Proceedings of the 10th Australian International Conference on
Speech Science and Technology, (Sydney, Australia), pp. 486–491, 2004.
[58] R. Goecke, G. Potamianos, and C. Neti, “Noisy audio feature enhancement
using audio-visual speech data,” in International Conference on Acoustics,
Speech and Signal Processing, (Orlando, FL, USA), pp. 2025–2028, 2002.
[59] A. J. Goldschen, O. N. Garcia, and E. Petajan, “Continuous optical au-
tomatic speech recognition by lipreading,” in Proceedings of the Asilomar
Conference on Signals, Systems and Computers, (Pacific Grove, CA, USA),
pp. 572–577, 1994.
[60] A. Goldschen, O. Garcia, and E. Petajan, “Rationale for phoneme-viseme
mapping and feature selection in visual speech recognition,” in Speechreading
by Humans and Machines (D. Stork and M. Hennecke, eds.), pp. 505–515,
Berlin, Germany: Springer, 1996.
[61] R. Gopinath, “Maximum likelihood modeling with Gaussian distributions for
classification,” in Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, (Seattle, WA, USA), pp. 661–664, 1998.
[62] J. Gowdy, S. Amarnag, C. Bartels, and J. Bilmes, “DBN based multi-stream
models for audio-visual speech recognition,” in Proceedings of the Interna-
tional Conference on Acoustics, Speech and Signal Processing, (Montreal,
Canada), pp. 993–996, 2004.
[63] G. Gravier, G. Potamianos, and C. Neti, “Asynchrony modeling for audio-
visual speech recognition,” in Proceedings of the Human Language Technol-
ogy Conference, (San Diego, CA, USA), pp. 1–6, 2002.
[64] M. Gray, J. Movellan, and T. Sejnowski, “A comparison of local versus global
image decompositions for visual speechreading,” in Fourth Joint Symposium
on Neural Computation, pp. 92–98, 1997.
[65] M. Gray, J. Movellan, and T. Sejnowski, “Dynamic features for visual speech-
reading: A systematic comparison,” in Advances in Neural Information
Processing (M. Mozer, M. Jordan, and T. Petsche, eds.), pp. 751–757, Cam-
bridge, MA: MIT Press, 1997.
[66] K. Green, “The use of auditory and visual information in phonetic percep-
tion,” in Speechreading by Humans and Machines (D. Stork and M. Hen-
necke, eds.), pp. 55–77, Berlin, Germany: Springer-Verlag, 1996.
[67] R. Gross, I. Matthews, and S. Baker, “Appearance-based face recognition
and light-fields,” IEEE Transactions on Pattern Analysis and Machine In-
telligence, vol. 26, pp. 449–465, April 2004.
[68] S. Gurbuz, Z. Tufekci, E. Patterson, and J. Gowdy, “Application of affine-
invariant Fourier descriptors to lipreading for audio-visual speech recogni-
tion,” in Proceedings of the International Conference on Acoustics, Speech
and Signal Processing, (Salt Lake City, UT, USA), pp. 177–180, 2001.
[69] M. Heckmann, F. Berthommier, and K. Kroschel, “A hybrid ANN/HMM
audio-visual speech recognition system,” in Proceedings of the International
Conference on Auditory-Visual Speech Processing, (Aalborg, Denmark),
pp. 190–195, 2001.
[70] M. Heckmann, F. Berthommier, and K. Kroschel, “Optimal weighting of
posteriors for audio-visual speech recognition,” in Proceedings of the Inter-
national Conference on Acoustics, Speech, and Signal Processing, vol. 1, (Salt
Lake City, UT, USA), pp. 161–164, 2001.
[71] M. Heckmann, F. Berthommier, and K. Kroschel, “Noise adaptive stream
weighting in audio-visual speech recognition,” EURASIP Journal on Applied
Signal Processing, vol. 2002, no. 11, pp. 1260–1273, 2002.
[72] M. Heckmann, K. Kroschel, and C. Savariaux, “DCT-based video features for
audio-visual speech recognition,” in Proceedings of International Conference
on Spoken Language and Processing, (Denver, CO, USA), pp. 1925–1928,
2002.
[73] M. Hennecke, D. Stork, and K. Prasad, “Visionary speech: Looking ahead to
practical speechreading systems,” in Speechreading by humans and machines
(D. Stork and M. Hennecke, eds.), pp. 331–349, Berlin, Germany: Springer-
Verlag, 1996.
[74] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans-
actions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[75] F. Huang and T. Chen, “Consideration of Lombard effect for speechread-
ing,” in Proceedings of the Workshop on Multimedia Signal Processing, (Cannes,
France), pp. 613–618, 2001.
[76] J. Huang, Z. Liu, Y. Wang, and E. Wong, “Integration of multimodal
features for video scene classification based on HMM,” in Proceedings of
the Workshop on Multimedia Signal Processing, (Copenhagen, Denmark),
pp. 53–58, 1999.
[77] J. Huang, G. Potamianos, J. Connell, and C. Neti, “Audio-visual speech
recognition using an infrared headset,” Speech Communication, vol. 44, no. 4,
pp. 83–96, 2004.
[78] A. Hyvarinen and E. Oja, “Independent component analysis: Algorithms
and applications,” Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000.
[79] A. Jain, R. Duin, and J. Mao, “Statistical pattern recognition: A review,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22,
no. 1, pp. 4–37, 2000.
[80] O. Jesorsky, K. Kirchberg, and R. Frischholz, “Robust face detection using
the Hausdorff distance,” in Proceedings of the International Conference on
Audio and Video Biometric Person Authentication, (Halmstad, Sweden),
pp. 90–95, June 2001.
[81] J. Jiang, G. Potamianos, H. Nock, G. Iyengar, and C. Neti, “Improved face
and feature finding for audio-visual speech recognition in visually challenging
environments,” in Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, vol. 5, (Montreal, Canada), pp. 873–876, 2004.
[82] M. Jones and P. Viola, “Fast multi-view face detection,” Tech. Rep. TR2003-
96, MERL, June 2003.
[83] T. Jordan and P. Sergeant, “Effects of facial image size on visual and audio-
visual speech,” in Hearing by Eye II (R. Campbell, B. Dodd, and D. Burn-
ham, eds.), pp. 155–176, Hove: Psychology Press Ltd. Publishers, 1998.
[84] T. R. Jordan and S. M. Thomas, “Effects of horizontal viewing angle on
visual and audiovisual speech recognition,” Journal of Experimental Psy-
chology: Human Perception and Performance, vol. 27, pp. 1386–1403, 2001.
[85] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, “Acoustic-labial speaker
verification,” in Proceedings of the International Conference on Audio- and
Video-Based Biometric Person Authentication, pp. 319–326, 1997.
[86] J. Junqua, “The Lombard reflex and its role on human listeners and auto-
matic speech recognizers,” Journal of the Acoustical Society of America,
vol. 93, pp. 510–524, 1993.
[87] T. Kanade and A. Yamada, “Multi-subregion based probabilistic approach
towards pose-invariant face recognition,” vol. 2, (Kobe, Japan), pp. 954–959,
2003.
[88] M. Kaynak, Q. Zhi, A. Cheok, K. Sengupta, Z. Jian, and K. Chung, “Lip geo-
metric features for human-computer interaction using bimodal speech recog-
nition: Comparison and analysis,” Speech Communication, vol. 43, no. 1-2,
pp. 1–16, 2004.
[89] E. Kreyszig, Advanced Engineering Mathematics. John Wiley and Sons, Inc,
7 ed., 1993.
[90] G. Krone, B. Talle, A. Wichert, and G. Palm, “Neural architectures for sen-
sor fusion in speech recognition,” in European Tutorial Workshop on Audio-
Visual Speech Processing, (Rhodes, Greece), pp. 57–60, 1997.
[91] K. Kumar, T. Chen, and R. Stern, “Profile view lip reading,” in Proceedings
of the International Conference on Acoustics, Speech and Signal Processing,
vol. 4, (Honolulu, Hawaii), pp. 429–432, 2007.
[92] F. Lavagetto, “Converting speech into lip movements: a multimedia tele-
phone for hard of hearing people,” IEEE Transactions on Rehabilitation
Engineering, vol. 3, no. 1, pp. 90–102, 1995.
[93] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu,
and T. Huang, “AVICAR: An audiovisual speech corpus in a car environ-
ment,” in Proceedings of the International Conference on Spoken Language
Processing, (Jeju Island, Korea), pp. 2489–2492, 2004.
[94] R. Lienhart and J. Maydt, “An extended set of Haar-like features,” in Pro-
ceedings of the International Conference on Image Processing, (Rochester,
NY, USA), pp. 900–903, 2002.
[95] M. Lew, “Information theoretic view-based and modular face detection,” in
Proceedings of the International Conference on Automatic Face and Gesture
Recognition, (Killington, VT, USA), pp. 198–203, 1996.
[96] S. Li, J. Sherrah, and H. Liddell, “Multi-view face detection using support
vector machines and eigenspace modelling,” in Proceedings of the Interna-
tional Conference on Knowledge-Based Intelligent Engineering Systems and
Allied Technologies, (Brighton, UK), pp. 241–244, 2000.
[97] S. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum, “Statistical learn-
ing of multi-view face detection,” in Proceedings of the European Conference
on Computer Vision, (Copenhagen, Denmark), pp. 38–44, May 2002.
[98] L. Liang, X. Liu, Y. Zhao, X. Pi, and A. Nefian, “Speaker independent audio-
visual continuous speech recognition,” in Proceedings of the International
Conference on Multimedia and Expo, vol. 2, (Lausanne, Switzerland), pp. 25–
28, August 2002.
[99] M. Lievin and F. Luthon, “Unsupervised lip segmentation under natural
conditions,” in Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, (Phoenix, AZ, USA), pp. 3065–3068, March
1999.
[100] F. Liu, R. Stern, X. Huang, and A. Acero, “Efficient cepstral normalization
for robust speech recognition,” in Proceedings of the Workshop on Human
Language Technology, (Morristown, NJ, USA), pp. 69–74, 1993.
[101] S. Lucey, Audio-Visual Speech Processing. PhD thesis, Queensland Univer-
sity of Technology, Brisbane, Australia, 2002.
[102] S. Lucey, “An evaluation of visual speech features for the tasks of speech
and speaker recognition,” in Proceedings of the International Conference of
Audio- and Video-Based Person Authentication, (Guildford, U.K.), pp. 260–
267, 2003.
[103] S. Lucey and T. Chen, “Learning patch dependencies for improved pose
mismatched face verification,” in Proceedings of the International Conference
on Computer Vision and Pattern Recognition, vol. 1, (New York, NY, USA),
pp. 909–915, June 2006.
[104] P. Lucey, D. Dean, and S. Sridharan, “Problems associated with current
area-based visual speech feature extraction techniques,” in Proceedings of the
International Conference on Auditory-Visual Speech Processing, (Vancouver
Island, Canada), pp. 73–78, 2005.
[105] P. Lucey and G. Potamianos, “Lipreading using profile versus frontal
views,” in Proceedings of the IEEE International Workshop on Multimedia
Signal Processing, (Victoria, BC, Canada), pp. 24–28, 2006.
[106] P. Lucey and S. Sridharan, “Patch-based representation of visual speech,”
in HCSNet Workshop on the Use of Vision in Human-Computer Interaction,
(VisHCI 2006) (R. Goecke, A. Robles-Kelly, and T. Caelli, eds.), vol. 56 of
CRPIT, (Canberra, Australia), pp. 79–85, ACS, 2006.
[107] J. Luettin, G. Potamianos, and C. Neti, “Asynchronous stream modeling
for large vocabulary audio-visual speech recognition,” in Proceedings of the
International Conference on Acoustics, Speech, and Signal Processing, vol. 1,
(Salt Lake City, UT, USA), pp. 169–172, 2001.
[108] J. Luettin, N. Thacker, and S. Beet, “Speaker identification by lipreading,”
in Proceeding of the International Conference on Spoken Language Process-
ing, vol. 1, (Philadelphia, PA, USA), pp. 62–65, 1996.
[109] J. Luettin, N. A. Thacker, and S. W. Beet, “Speechreading using shape
and intensity information,” in Proceedings of the International Conference
on Spoken Language Processing, vol. 1, (Philadelphia, PA, USA), pp. 58–61,
1996.
[110] R. J. Mammone, X. Zhang, and R. P. Ramachandran, “Robust speaker
recognition: A feature based approach,” IEEE Signal Processing Magazine,
vol. 13, pp. 58–70, September 1996.
[111] A. M. Martinez, “Recognizing imprecisely localized, partially occluded, and
expression variant faces from a single sample per class,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 748–763,
2002.
[112] K. Mase and A. Pentland, “Automatic lipreading by optical-flow analysis,”
Systems and Computers in Japan, vol. 22, no. 6, pp. 67–76, 1991.
[113] I. Matthews, J. Bangham, and S. Cox, “Audio-visual speech recognition us-
ing multiscale nonlinear image decomposition,” in International Conference
on Spoken Language Processing, (Philadelphia, PA, USA), pp. 38–41, 1996.
[114] I. Matthews, T. Cootes, J. Bangham, S. Cox, and R. Harvey, “Extraction
of visual features for lipreading,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 24, no. 2, pp. 198–213, 2002.
[115] I. Matthews, T. Cootes, S. Cox, R. Harvey, and J. A. Bangham, “Lipreading
using shape, shading and scale,” in Proceedings of the International Confer-
ence on Auditory-Visual Speech Processing, (Sydney, Australia), pp. 73–78,
1998.
[116] I. Matthews, G. Potamianos, C. Neti, and J. Luettin, “A comparison
of model and transform-based visual features for audio-visual LVCSR,” in
Proceedings of International Conference on Multimedia and Expo, (Tokyo,
Japan), 2001.
[117] M. McGrath and Q. Summerfield, “Intermodal timing relations and audio-
visual speech recognition,” Journal of the Acoustical Society of America,
vol. 77, pp. 678–685, February 1985.
[118] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature,
vol. 264, pp. 746–748, December 1976.
[119] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB:
The extended M2VTS database,” in International Conference on Audio and
Video-based Biometric Person Authentication, (Washington D.C., USA),
1999.
[120] A. Mills, “The development of phonology in blind children,” in Hearing
by Eye: The Psychology of Lipreading (B. Dodd and R. Campbell, eds.),
pp. 145–161, London, England: Lawrence Erlbaum Associates Ltd, 1987.
[121] B. Moghaddam and A. Pentland, “Probabilistic Visual Learning for Ob-
ject Representation,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 19, pp. 696–710, July 1997.
[122] J. R. Movellan and G. Chadderdon, “Channel separability in the audio
visual integration of speech: A bayesian approach,” in Speechreading by Hu-
mans and Machines (D. G. Stork and M. E. Hennecke, eds.), pp. 473–487,
Berlin: Springer, 1996.
[123] S. Nakamura, “Fusion of audio-visual information for integrated speech pro-
cessing,” in Audio and Video-based Biometric Person Authentication (J. Bi-
gun and F. Smearaldi, eds.), pp. 127–143, Berlin, Germany: Springer-Verlag,
2001.
[124] A. Nefian and M. Hayes, “Face detection and recognition using hidden
Markov models,” in Proceedings of the International Conference on Image
Processing, (Chicago, IL, USA), pp. 141–145, 1998.
[125] A. Nefian, L. Liang, X. Pi, X. Liu, and C. Mao, “A coupled HMM for audio-
visual speech recognition,” in Proceedings of the International Conference
on Acoustics, Speech and Signal Processing, (Orlando, FL, USA), pp. 2013–
2016, 2002.
[126] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, and D. Vergyri,
“Large-vocabulary audio-visual speech recognition: A summary of the Johns
Hopkins summer 2000 workshop,” in Proceedings of the Workshop on Multi-
media Signal Processing, Special Section on Joint Audio-Visual Processing,
(Cannes, France), 2001.
[127] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri,
J. Sison, A. Mashari, and J. Zhou, “Audio-Visual Speech Recognition, Fi-
nal Workshop 2000 Report,” tech. rep., Center for Language and Speech
Processing, The Johns Hopkins University, Baltimore, 2000.
[128] Open Source Computer Vision Library.
http://www.intel.com/research/mrl/research/opencv
[129] C. Papageorgiou, M. Oren, and T. Poggio, “A general framework for object
detection,” in Proceedings of International Conference on Computer Vision,
(Bombay, India), 1998.
[130] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy, “CUAVE: A new audio-
visual database for multimodal human-computer interface research,” in Pro-
ceedings of the International Conference on Acoustics, Speech and Signal
Processing, (Orlando, FL, USA), 2002.
[131] A. Pentland, “Smart rooms, smart clothes,” in Proceedings of the Inter-
national Conference on Pattern Recognition, vol. 2, (Brisbane, Australia),
pp. 949–953, 1998.
[132] E. Petajan, “Automatic lipreading to enhance speech recognition,” in IEEE
Global Telecommunications Conference, (Atlanta, GA, USA), pp. 265–272,
IEEE, 1984.
[133] S. Pigeon and L. Vandendorpe, “The M2VTS multimodal face database,”
in Proceedings of the International Conference on Audio and Video-based
Biometric Person Authentication, (Crans-Montana, Switzerland), 1997.
[134] G. Potamianos, E. Cosatto, H. Graf, and D. Roe, “Speaker independent au-
diovisual database for bimodal ASR,” in Proceedings of the European Tuto-
rial Workshop on Audiovisual Speech Processing, (Rhodes, Greece), pp. 65–
68, 1997.
[135] G. Potamianos and H. P. Graf, “Discriminative training of HMM stream
exponents for audio-visual speech recognition,” in Proceedings of the Inter-
national Conference on Acoustics, Speech and Signal Processing, (Seattle,
WA, USA), pp. 3733–3736, 1998.
[136] G. Potamianos and H. Graf, “Linear discriminant analysis for speechread-
ing,” in Proceedings of the Workshop on Multimedia and Signal Processing,
(Los Angeles, CA, USA), pp. 221–226, 1998.
[137] G. Potamianos, H. Graf, and E. Cosatto, “An image transform approach
for HMM based automatic lipreading,” in Proceedings of International Con-
ference on Image Processing, vol. 3, (Chicago, IL, USA), pp. 173–177, 1998.
[138] G. Potamianos and P. Lucey, “Audio-visual ASR from multiple views inside
smart rooms,” in Proceedings of the International Conference on Multisen-
sor Fusion and Integration for Intelligent Systems, (Heidelberg, Germany),
pp. 35–40, 2006.
[139] G. Potamianos, J. Luettin, and C. Neti, “Hierarchical discriminant features
for audio-visual LVCSR,” in Proceedings of the International Conference on
Acoustics, Speech and Signal Processing, pp. 165–168, 2001.
[140] G. Potamianos and C. Neti, “Improved ROI and within frame discriminant
features for lipreading,” in Proceedings of International Conference on Image
Processing, vol. 3, (Thessaloniki, Greece), pp. 250–253, 2001.
[141] G. Potamianos and C. Neti, “Audio-visual speech recognition in challenging
environments,” in Proceedings of the European Conference on Speech Com-
munication and Technology, (Geneva, Switzerland), pp. 1293–1296, 2003.
[142] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, “Recent
advances in the automatic recognition of audio-visual speech,” Proceedings
of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003.
[143] G. Potamianos, C. Neti, G. Iyengar, A. Senior, and A. Verma, “A cascade
visual front end for speaker independent automatic speechreading,” Inter-
national Journal of Speech Technology, vol. 4, no. 3-4, pp. 193–208, 2001.
[144] G. Potamianos and P. Scanlon, “Exploiting lower face symmetry in
appearance-based automatic speechreading,” in Proceedings of the Auditory-
Visual Speech Processing International Conference 2005, (British Columbia,
Canada), pp. 79–84, 2005.
[145] G. Potamianos, A. Verma, C. Neti, G. Iyengar, and S. Basu, “A cascade
image transform for speaker independent automatic speechreading,” in Pro-
ceedings of the International Conference on Multimedia and Expo, vol. 2,
(New York, NY, USA), pp. 1097–1100, 2000.
[146] L. R. Rabiner, “A tutorial on hidden Markov models and selected applica-
tions in speech recognition,” Proceedings of the IEEE, vol. 77, pp. 257–286,
February 1989.
[147] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Englewood
Cliffs, N.J.: Prentice Hall, 1993.
[148] M. Ramos Sanchez, J. Matas, and J. Kittler, “Statistical chromaticity mod-
els for lip tracking with B-splines,” in Proceedings of the International Con-
ference on Audio and Video based Biometric Person Authentication, (Crans-Montana, Switzerland), pp. 69–76, 1997.
[149] R. Rao and R. Mersereau, “Lip modelling for visual speech recognition,” in
Proceedings of the Asilomar Conference on Signals, Systems and Computers,
vol. 1, (Pacific Grove, CA, USA), pp. 587–590, 1994.
[150] J. Robert-Ribes, J. Schwartz, T. Lallouache, and P. Escudier, “Comple-
mentarity and synergy in bimodal speech: Auditory, visual, and audio-visual
identification of French oral vowels in noise,” Journal of the Acoustical Society
of America, vol. 103, no. 6, pp. 3677–3689, 1998.
[151] L. Rosenblum and H. Saldaña, “An audiovisual test of kinematic primitives
for visual speech perception,” Journal of Experimental Psychology: Human
Perception and Performance, vol. 22, no. 2, pp. 318–331, 1996.
[152] L. Rosenblum and H. Saldaña, “Time-varying information for visual speech
perception,” in Hearing by Eye II (R. Campbell, B. Dodd, and D. Burnham,
eds.), pp. 61–81, Hove, United Kingdom: Psychology Press Ltd. Publishers,
1998.
[153] L. Rothkrantz, J. Wojdel, and P. Wiggers, “Comparison between different
feature extraction techniques in lipreading applications,” in Proceedings of
the International Conference Speech and Computer, (St. Petersburg, Russia),
2006.
[154] H. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detec-
tion,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 20, pp. 23–38, 1998.
[155] H. Rowley, S. Baluja, and T. Kanade, “Rotation invariant neural-network
based face detection,” in Proceedings of the International Conference on
Computer Vision and Pattern Recognition, (Santa Barbara, CA, USA),
pp. 38–44, 1998.
[156] K. Saenko, “Articulatory features for robust visual speech recognition,”
Master’s thesis, Massachusetts Institute of Technology, MA, USA, 2004.
[157] K. Saenko, T. Darrel, and J. Glass, “Articulatory features for robust vi-
sual speech recognition,” in Proceedings of the International Conference on
Multimodal Interfaces, (State College, PA, USA), pp. 152–158, 2004.
[158] K. Saenko and K. Livescu, “An asynchronous DBN for audio-visual speech
recognition,” in Proceedings of the Workshop on Spoken Language Technol-
ogy, (Palm Beach, Aruba), pp. 92–98, 2006.
[159] C. Sanderson, “The VidTIMIT database,” in IDIAP Communication,
(Martigny, Switzerland), 2002.
[160] C. Sanderson, Automatic person verification using speech and face information. PhD thesis, Griffith University, Brisbane, Australia, 2004.
[161] P. Scanlon and R. Reilly, “Feature analysis for automatic speechreading,” in
Proceedings of the Workshop on Multimedia and Signal Processing, (Cannes,
France), pp. 625–630, 2001.
[162] R. Schapire, Y. Freund, P. Bartlett, and W. Lee, “Boosting the margin: A
new explanation for the effectiveness of voting methods,” in Proceedings of
the International Conference on Machine Learning, (Nashville, TN, USA),
pp. 322–330, 1997.
[163] H. Schneiderman and T. Kanade, “A histogram-based method for detec-
tion of faces and cars,” in Proceedings of the International Conference on
Computer Vision and Pattern Recognition, (Hilton Head Island, SC, USA),
pp. 504–507, 2000.
[164] P. Silsbee and A. Bovik, “Computer lipreading for improved accuracy in au-
tomatic speech recognition,” IEEE Transactions on Speech and Audio Pro-
cessing, pp. 337–351, 1996.
[165] Q. Su and P. Silsbee, “Robust audiovisual integration using semicontinuous
hidden Markov models,” in International Conference on Spoken Language
Processing, (Philadelphia, PA, USA), 1996.
[166] W. Sumby and I. Pollack, “Visual contribution to speech intelligibility,”
Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212–215,
1954.
[167] A. Summerfield, “Some preliminaries to a comprehensive account of audio-
visual speech perception,” in Hearing by Eye: The Psychology of Lip-Reading
(B. Dodd and R. Campbell, eds.), pp. 3–51, London, United Kingdom: Lawrence Erlbaum Associates, 1987.
[168] A. Summerfield, “Lipreading and audio-visual speech perception,” Philo-
sophical Transactions of the Royal Society of London, Series B, pp. 71–78,
1992.
[169] A. Summerfield, A. MacLeod, M. McGrath, and M. Brooke, “Lips, teeth,
and the benefits of lipreading,” in Handbook of Research on Face Processing
(A. Young and H. Ellis, eds.), pp. 223–233, Amsterdam, The Netherlands:
Elsevier Science Publishers, 1989.
[170] K. Sung and T. Poggio, “Example-based learning for view-based human
face detection,” IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, vol. 20, pp. 39–51, 1998.
[171] S. Tamura, K. Iwano, and S. Furui, “Multi-modal speech recognition us-
ing optical-flow analysis for lip images,” Journal of VLSI Signal Processing
Systems, vol. 36, no. 2-3, pp. 117–124, 2004.
[172] P. Teissier, J. Robert-Ribes, J. Schwartz, and A. Guérin-Dugué, “Compar-
ing models for audiovisual fusion in a noisy-vowel recognition task,” IEEE
Transactions on Speech and Audio Processing, vol. 7, no. 6, pp. 629–642, 1999.
[173] A. M. Tekalp, Digital Video Processing. Prentice-Hall, 1995.
[174] Y. Tian, T. Kanade, and J. Cohn, “Robust lip tracking by combining shape
color and motion,” in Proceedings of the Asian Conference on Computer
Vision, (Taipei, Taiwan), pp. 1040–1045, 2000.
[175] M. Tomlinson, M. Russell, and N. Brooke, “Integrating audio and visual
information to provide highly robust speech recognition,” in Proceedings of
the International Conference on Acoustics, Speech and Signal Processing,
vol. 2, (Atlanta, GA, USA), pp. 821–824, 1996.
[176] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive
Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[177] A. Varga and R. Moore, “Hidden Markov model decomposition of speech
and noise,” in Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, vol. 2, (Albuquerque, NM, USA), pp. 845–
848, 1990.
[178] E. Vatikiotis-Bateson, G. Bailly, and P. Perrier, eds., Audio-Visual Speech
Processing. MIT Press, 2006.
[179] E. Vatikiotis-Bateson, K. Munhall, M. Hirayama, Y. Lee, and D. Terzopou-
los, “The dynamics of audiovisual behaviour in speech,” in Speechreading
by Humans and Machines (D. Stork and M. Hennecke, eds.), pp. 221–232,
Berlin, Germany: Springer-Verlag, 1996.
[180] P. Viola and M. Jones, “Rapid object detection using a boosted cascade
of simple features,” in Proceedings of the International Conference on Com-
puter Vision and Pattern Recognition, vol. 1, (Kauai, HI, USA), pp. 511–518,
2001.
[181] C. Wang and M. Brandstein, “Multi-source face tracking with audio and vi-
sual data,” in Proceedings of the Workshop on Multimedia Signal Processing,
(Copenhagen, Denmark), pp. 475–481, 1999.
[182] T. Wark, Multi-modal Speech Processing for Automatic Speaker Recogni-
tion. PhD Thesis, Queensland University of Technology, Brisbane, Australia,
2001.
[183] Wikipedia, “KITT — Wikipedia, the free encyclopedia,” 2007.
http://en.wikipedia.org/wiki/KITT
[Online; accessed 02-September-2007]
[184] M. Yang, N. Ahuja, and D. Kriegman, “Mixtures of linear subspaces for
face detection,” in Proceedings of the International Conference on Automatic
Face and Gesture Recognition, (Grenoble, France), pp. 70–76, 2000.
[185] G. Yang and T. Huang, “Human face detection in complex background,”
Pattern Recognition, vol. 27, no. 1, pp. 53–63, 1994.
[186] M. Yang, D. Kriegman, and N. Ahuja, “Detecting faces in images: A
survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 24, no. 1, pp. 34–58, 2002.
[187] T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui, “Audio visual speech
recognition using lip movement extracted from side-face images,” in Pro-
ceedings of the Workshop on Auditory Visual Speech Processing, (St Jorioz,
France), pp. 117–120, 2003.
[188] T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui, “Audio visual speech
recognition using new lip features extracted from side-face images,” in Pro-
ceedings of the Workshop on Robustness Issues in Conversational Interac-
tion, (Norwich, England), 2004.
[189] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ol-
lason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK
Version 3.2.1). Entropic Ltd, 2002.
[190] A. Yuille, P. Hallinan, and D. Cohen, “Feature extraction from faces using
deformable templates,” International Journal of Computer Vision, vol. 8,
no. 2, pp. 99–111, 1992.
[191] X. Zhang, C. Broun, R. Mersereau, and M. Clements, “Automatic
speechreading with applications to human-computer interfaces,” EURASIP
Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1228–1247, 2002.
[192] Z. Zhang, G. Potamianos, S. Chu, J. Tu, and T. Huang, “Person tracking
in smart rooms using dynamic programming and adaptive subspace learn-
ing,” in Proceedings of the International Conference on Multimedia and Expo,
(Toronto, Canada), pp. 2061–2064, 2006.
Appendix A
Dynamic Parameter Analysis
Figure A.1: Plots of the lipreading results for the dynamic and final features on the MRDCT (a) and MRDiff (b) features, using various values for J and P with N = 10 input features. Both panels plot lipreading performance, WER (%), against the number of features used per feature vector (P), for J = 4 to 7.
As mentioned in Section 5.4.2, many different permutations of input features
to the inter-frame LDA step were tested to determine the optimal lipreading results.
To obtain the best lipreading performance from the final dynamic feature vector,
a trade-off has to be made between the length of the input static feature vector,
N, and the number of adjacent frames, J. This balance is required because
calculating the transformation matrix, WIILDA, is quite computationally expensive
and there is a limit on how large the input matrix XI can be (approximately
a 6-million-element matrix). In Figures A.1(a) and (b), only N = 10 static input
features were used, across J = 4 to 7 adjacent frames. From these figures it can
be seen that the performance for the MRDCT static features hovers just below
Figure A.2: Plots of the lipreading results for the dynamic and final features on the MRDCT (a) and MRDiff (b) features, using various values for J and P with N = 20 input features. Both panels plot lipreading performance, WER (%), against the number of features used per feature vector (P), for J = 1 to 5.
Figure A.3: Plots of the lipreading results for the dynamic and final features on the MRDCT (a) and MRDiff (b) features, using various values for J and P with N = 30 input features. Both panels plot lipreading performance, WER (%), against the number of features used per feature vector (P), for J = 1 to 4.
the 30% WER mark, whilst the MRDiff static features sit just above it. Compared
to the best-case WER of 27.66% obtained using N = 30 and J = 2, it can be
seen that these parameters do not give the optimal performance. It is also worth
noting that there is no discernible difference in performance among J = 4 to 7.
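The trade-off between N and J described above can be sketched in Python. This is an illustrative reconstruction, not the thesis code: it assumes the temporal window spans J frames either side of the current frame (so each stacked vector has (2J + 1)N elements), and that the element count of the stacked input matrix must stay under roughly six million.

```python
import numpy as np

def stack_temporal_window(static_feats, J):
    """Stack each frame with its J neighbours on either side, giving
    (2J+1)*N-dimensional vectors as input to the inter-frame LDA step.

    static_feats: (T, N) array of per-frame static features.
    Returns: (T - 2J, (2J+1)*N) array (edge frames are dropped).
    """
    T, N = static_feats.shape
    rows = [np.concatenate(static_feats[t - J:t + J + 1])
            for t in range(J, T - J)]
    return np.asarray(rows)

# Rough check against the ~6-million-element limit on the LDA input
# matrix mentioned above (the frame count is illustrative only).
N, J, frames = 30, 2, 20000
elements = frames * (2 * J + 1) * N
assert elements < 6e6, "input matrix too large for LDA training"
```

For example, with N = 10 even J = 7 keeps the stacked matrix comfortably under the limit, which is consistent with the range of J values tried in Figure A.1.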
In Figures A.2(a) and (b), N = 20 input static features are used across J = 1
to 5 adjacent frames. In Figure A.2(a), it can be seen that when the temporal win-
dow J is increased from 1 to 2, the lipreading performance improves significantly
(by an average of 5%). Increasing J past 2 yielded no real improvement.
In Figure A.2(b), as some temporal information was already
Figure A.4: Plots of the lipreading results for the dynamic and final features on the MRDCT (a) and MRDiff (b) features, using various values for J and P with N = 40 input features. Both panels plot lipreading performance, WER (%), against the number of features used per feature vector (P), for J = 1 to 3.
included in the difference features, no real benefit was gained from increasing
the amount of temporal information included in the final dynamic feature vector.
Across both plots, the best lipreading performance achieved was a WER of around 28.5%.
Even though an improvement was gained by increasing the number of input
static features from N = 10 to 20, the best lipreading performance was obtained
using N = 30 input MRDCT features, with a WER of 27.66% for P = 40. This
can be seen in Figure A.3(a), with J = 2. The number of static features was
increased to N = 40 in Figures A.4(a) and (b); however, the performance using
these parameters was not quite as good as that for N = 30. As a result of these
experiments, the optimal parameters chosen for this thesis were M = 100,
N = 30 and P = 40, with a temporal window of J = 2. These parameters were
used for all experiments in this thesis, unless stated otherwise.
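The parameter sweep summarised in this appendix can be expressed as a simple grid search. The sketch below is hypothetical: `evaluate_wer` stands in for training and decoding the lipreading system with a given (N, J, P) configuration, and `frames` is an illustrative training-set size used to enforce the approximate 6-million-element limit on the inter-frame LDA input matrix.

```python
def grid_search(evaluate_wer, frames=20000):
    """Sweep over (N, J, P), mirroring the experiments in Figures A.1-A.4,
    and return the lowest-WER configuration as (wer, N, J, P)."""
    best = None
    for N in (10, 20, 30, 40):               # static feature vector lengths
        for J in range(1, 8):                # temporal window half-widths
            if frames * (2 * J + 1) * N >= 6e6:
                continue                     # LDA input matrix would be too large
            for P in range(10, 71, 10):      # final feature vector lengths
                wer = evaluate_wer(N, J, P)  # hypothetical train-and-decode call
                if best is None or wer < best[0]:
                    best = (wer, N, J, P)
    return best
```

Under the results reported here, such a sweep bottoms out at N = 30, J = 2, P = 40, with a WER of 27.66%.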