Speech, Audio, Image and Video Technology Laboratory
School of Engineering Systems
LIPREADING ACROSS MULTIPLE VIEWS
Patrick Joseph Lucey
B.Eng(Hons)
SUBMITTED AS A REQUIREMENT OF
THE DEGREE OF
DOCTOR OF PHILOSOPHY
AT
QUEENSLAND UNIVERSITY OF TECHNOLOGY
BRISBANE, QUEENSLAND
6 SEPTEMBER 2007
Keywords
Audio-visual automatic speech recognition, lipreading, frontal pose, profile pose,
multi-view, visual front-end, visual feature extraction, pose-invariance, multi-
stream fusion
Abstract
Visual information from a speaker’s mouth region is known to improve automatic
speech recognition (ASR) robustness, especially in the presence of acoustic noise.
Currently, the vast majority of audio-visual ASR (AVASR) studies assume frontal
images of the speaker’s face, which is a rather restrictive human-computer
interaction (HCI) scenario. The lack of research into AVASR across multiple
views stems from the scarcity of large corpora that contain varying
pose/viewpoint speech data. Recently, research has concentrated on recognising
human behaviours within “meeting” or “lecture” type scenarios via
“smart-rooms”. This has resulted in the collection of audio-visual speech data
which allows visual speech to be recognised from both frontal and non-frontal views.
Using this data, the main focus of this thesis was to investigate and develop
various methods within the confines of a lipreading system which can recognise
visual speech across multiple views. This research constitutes the first
published work within the field which examines this particular aspect of AVASR.
The task of recognising visual speech from non-frontal views (i.e. profile) is in
principle very similar to that of frontal views, requiring the lipreading system to
initially locate and track the mouth region and subsequently extract visual fea-
tures. However, this task is far more complicated than the frontal case, because
the facial features required to locate and track the mouth lie in a much more lim-
ited spatial plane. Nevertheless, accurate mouth region tracking can be achieved
by employing techniques similar to frontal facial feature localisation. Once the
mouth region has been extracted, the same visual feature extraction process can
take place as in the frontal view. A novel contribution of this thesis is to
quantify the degradation in lipreading performance between the frontal and
profile views. In addition, novel patch-based analysis of the various views is
conducted, and as a result a novel multi-stream patch-based representation is formulated.
Having a lipreading system which can recognise visual speech from both
frontal and profile views is a novel contribution to the field of AVASR.
However, given both the frontal and profile viewpoints, this raises the
question: is there any benefit in having the additional viewpoint? Another
major contribution of this thesis is an exploration of a novel multi-view
lipreading system. This system shows that there does exist complementary
information in the additional viewpoint (possibly that of lip protrusion), with
superior performance achieved in the multi-view system compared to the frontal-only system.
Even though having a multi-view lipreading system which can recognise visual
speech from both front and profile views is very beneficial, it can hardly be
considered realistic, as each particular viewpoint is dedicated to a single
pose (i.e. front or profile). In an effort to make the lipreading system more
realistic, a unified system based on a single camera was developed which enables a lipreading
system to recognise visual speech from both frontal and profile poses. This is
called pose-invariant lipreading. Pose-invariant lipreading can be performed on
either stationary or continuous tasks. Methods which effectively normalise the
various poses into a single pose were investigated for the stationary scenario
and, in another contribution of this thesis, an algorithm based on regularised
linear regression was employed to project all the visual speech features into a
uniform pose. This particular method is shown to be beneficial when the
lipreading system is biased towards the dominant pose (i.e. frontal). The final contribution
of this thesis is the formulation of a continuous pose-invariant lipreading system
which contains a pose-estimator at the start of the visual front-end. This system
highlights the complexity of developing such a system, as introducing more
flexibility within the lipreading system invariably means the introduction of
more error.
All the work contained in this thesis presents novel and innovative
contributions to the field of AVASR, which will hopefully aid the future
deployment of an AVASR system in realistic scenarios.
Contents
Keywords i
Abstract iii
List of Tables ix
List of Figures xi
Acronyms & Abbreviations xix
Authorship xxi
Acknowledgements xxiii
1 Introduction 1
1.1 Motivation and Overview . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Scope of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Original Contributions of Thesis . . . . . . . . . . . . . . . . . . . 6
1.5 Publications Resulting from Research . . . . . . . . . . . . . . . . 8
1.5.1 Book Chapters . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.2 International Conference Publications . . . . . . . . . . . . 9
2 A Holistic View of AVASR 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 The History of AVASR . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Anatomy of the Human Speech Production System . . . . . . . . 15
2.4 Linguistics of Visual Speech . . . . . . . . . . . . . . . . . . . . . 17
2.5 Visual Speech Perception by Humans . . . . . . . . . . . . . . . . 18
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Classification of Visual Speech 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Classifiers for Lipreading . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Hidden Markov Models (HMMs) . . . . . . . . . . . . . . . . . . . 25
3.3.1 Viterbi Recognition . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 HMM Parameter Estimation . . . . . . . . . . . . . . . . . 28
3.4 Stream Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Feature Fusion Techniques . . . . . . . . . . . . . . . . . . 33
3.4.2 Decision Fusion Techniques . . . . . . . . . . . . . . . . . 34
3.5 HMM Parameters Used in Thesis . . . . . . . . . . . . . . . . . . 37
3.5.1 Measuring Lipreading Performance . . . . . . . . . . . . . 38
3.6 Current Audio-Visual Databases . . . . . . . . . . . . . . . . . . . 39
3.6.1 Review of Audio-Visual Databases . . . . . . . . . . . . . 39
3.6.2 IBM Smart-Room Database . . . . . . . . . . . . . . . . . 42
3.6.3 CUAVE Database . . . . . . . . . . . . . . . . . . . . . . . 45
3.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 Visual Front-End 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Front-End Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Visual Front-End Challenges . . . . . . . . . . . . . . . . . . . . . 51
4.4 Brief Review of Visual Front-Ends . . . . . . . . . . . . . . . . . . 53
4.5 Viola-Jones algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.3 Cascading the Classifiers . . . . . . . . . . . . . . . . . . . 62
4.6 Visual Front-End for Frontal View . . . . . . . . . . . . . . . . . 64
4.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 Visual Feature Extraction 73
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Review of Visual Feature Extraction Techniques . . . . . . . . . . 74
5.2.1 Appearance Based Representations . . . . . . . . . . . . . 75
5.2.2 Contour Based Representations . . . . . . . . . . . . . . . 77
5.2.3 Combination of Features . . . . . . . . . . . . . . . . . . . 78
5.2.4 Appearance vs Contour vs Combination . . . . . . . . . . 79
5.3 Cascading Appearance-Based Features . . . . . . . . . . . . . . . 81
5.3.1 Static Feature Capture . . . . . . . . . . . . . . . . . . . . 82
5.3.2 Dynamic Feature Capture . . . . . . . . . . . . . . . . . . 91
5.4 Lipreading from Frontal Views . . . . . . . . . . . . . . . . . . . . 92
5.4.1 Static Feature Analysis . . . . . . . . . . . . . . . . . . . . 93
5.4.2 Dynamic Feature Analysis . . . . . . . . . . . . . . . . . . 96
5.5 Making use of ROI Symmetry . . . . . . . . . . . . . . . . . . . . 98
5.5.1 Experimental Results . . . . . . . . . . . . . . . . . . . . . 101
5.6 Patch-Based Analysis of Visual Speech . . . . . . . . . . . . . . . 104
5.6.1 Experimental Results . . . . . . . . . . . . . . . . . . . . . 105
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6 Frontal vs Profile Lipreading 111
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.2 Visual Front-End for Profile View . . . . . . . . . . . . . . . . . . 113
6.3 Profile vs Frontal Lipreading . . . . . . . . . . . . . . . . . . . . . 119
6.4 Patch-Based Analysis of Profile Visual Speech . . . . . . . . . . . 122
6.5 Multi-view Lipreading . . . . . . . . . . . . . . . . . . . . . . . . 126
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7 Pose-Invariant Lipreading 129
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.2 Pose-Invariant Techniques . . . . . . . . . . . . . . . . . . . . . . 131
7.2.1 Linear Regression for Pose-Invariant Lipreading . . . . . . 132
7.2.2 The Importance of the Regularisation Term (λ) . . . . . . 135
7.3 Stationary Pose-Invariant Experiments . . . . . . . . . . . . . . . 138
7.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 138
7.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . 139
7.3.3 Biased Towards Frontal Pose . . . . . . . . . . . . . . . . . 142
7.3.4 Inclusion of Additional Pose . . . . . . . . . . . . . . . . . 145
7.3.5 Limitations of Pose-Normalising Step . . . . . . . . . . . . 147
7.4 Continuous Pose-Invariant Lipreading . . . . . . . . . . . . . . . . 147
7.4.1 Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . 149
7.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 152
7.4.3 Pose Estimate Results . . . . . . . . . . . . . . . . . . . . 153
7.4.4 Multi-Pose Localisation Results . . . . . . . . . . . . . . . 154
7.4.5 Continuous Pose-Invariant Lipreading Results . . . . . . . 156
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8 Conclusions and Future Research 161
8.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . 161
8.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Bibliography 166
A Dynamic Parameter Analysis 191
List of Tables
2.1 The mapping of the 44 phonemes from the HTK set, to 13 visemes
used in the Johns Hopkins University summer workshop [127]. . . 18
4.1 Facial feature point detection accuracy results for frontal pose . . 68
5.1 Lipreading performance of the various regions of the ROI . . . . . 106
5.2 Lipreading performance of fusing the various side patches of the
ROI together. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3 Lipreading performance of the smaller 16× 16 pixel patches of the
ROI (overlapping by 50%) . . . . . . . . . . . . . . . . . . . . . . 108
5.4 Lipreading performance of each individual patch fused together
with the holistic representation of the ROI using the SMSHMM . 109
6.1 Facial feature localisation accuracy results on the validation set of
profile images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.2 Lipreading performance of the various regions of the profile ROI . 123
6.3 Lipreading performance of fusing the various side patches of the
profile ROI together. . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4 Lipreading performance of the smaller 16× 16 pixel patches of the
profile ROI (overlapping by 50%) . . . . . . . . . . . . . . . . . . 125
6.5 Lipreading performance of each individual patch fused to-
gether with the holistic representation of the profile ROI using
the SMSHMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.6 Multi-view lipreading performance compared against the single
view performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.1 Lipreading results in WER (%) showing the effect that an addi-
tional pose has on performance for Q = 20. As the left and right
profile WER were the same, profile refers to both poses. The
combined(80-10-10) test set refers to frontal (80%), right (10%)
and left (10%) profile poses. . . . . . . . . . . . . . . . . . . . . . 147
7.2 Pose estimate results on the CUAVE validation set, which consisted
of 39 images for each pose. . . . . . . . . . . . . . . . . . . . . . . 153
7.3 Facial feature localisation accuracy results for all poses on the
CUAVE validation set. . . . . . . . . . . . . . . . . . . . . . . . . 155
7.4 The upper part of the table shows the average lipreading perfor-
mance for each individual task, whilst the bottom part compares
the performance for the combined individual, combined all and
pose normalised tasks, across the 10 different train/test sets. . . . 157
List of Figures
1.1 Block diagram of an AVASR system, which is a combination of an
audio-only and visual-only speech recognition (lipreading) system.
For this thesis, the modules within the lipreading system will be
focussed on. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Schematic representation of the complete physiological mechanism
of speech production highlighting the externally visible area (taken
from Rabiner and Juang [147]). . . . . . . . . . . . . . . . . . . . 16
2.2 Examples showing that the phonemes /p/, /b/ and /m/ look
visemically similar. Each of these visemes is shown in images
(a), (b) and (c) respectively. . . . . . . . . . . . . . . . . . . . . 17
2.3 Examples showing that the visemes of the acoustically similar
phonemes /m/ and /n/ look different in the visual domain. The
viseme /m/ is shown in (a) and /n/ is shown in (b). . . . . . . . . 17
3.1 Block diagram of a lipreading system. . . . . . . . . . . . . . . . 24
3.2 Discrete states in a Markov model are represented by nodes and
the transition probability by links. . . . . . . . . . . . . . . . . . 25
3.3 The IBM smart room developed for the purpose of the CHIL
project. Notice the fixed and PTZ cameras, as well as the far-
field table-top and array microphones. . . . . . . . . . . . . . . . 43
3.4 Examples of image views captured by the IBM smart room cam-
eras. In contrast to the four corner cameras (two upper rows), the
two PTZ cameras (lower row) provide closer views of the lecturer,
albeit not necessarily frontal (see also Figure 3.3). . . . . . . . . . 44
3.5 Examples of synchronous frontal and profile video frames of four
subjects from the IBM smart-room database. . . . . . . . . . . . . 45
3.6 Examples of sequences from the CUAVE database, which consists
of 36 individual speakers and 20 group speakers. The top row
gives examples of the individual sequences, whilst the bottom row gives
examples of the group speaker sequences. . . . . . . . . . . . . . . 46
3.7 Examples of the CUAVE individual sequences. The top three rows
give examples of the speaker rotating from left profile to right pro-
file. The bottom three rows give examples of the speaker moving
whilst in the frontal pose. . . . . . . . . . . . . . . . . . . . . . . 47
4.1 Block diagram of a visual front-end for a lipreading system. It is
essentially a three-step process, face localisation being step 1 and
step 2 consisting of locating the mouth ROI. Step 3 is tracking the
ROI over the video sequence. . . . . . . . . . . . . . . . . . . . . 50
4.2 Depiction of the cascading front-end effect. . . . . . . . . . . . . . 51
4.3 Comparison of the feature sets used by: (a) Viola and Jones with
the original 4 haar-like features; and (b) Lienhart and Maydt with
their extended set of 14 haar-like features including their rotated
features. It is worth noting that the diagonal line feature in (a) is
not utilised in (b). . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Example of how the integral image can be used for computing
upright rectangular features. . . . . . . . . . . . . . . . . . . . . . 60
4.5 Example of how the rotated integral image can be used for com-
puting rotated features. . . . . . . . . . . . . . . . . . . . . . . . . 60
4.6 Example of the first feature selected by AdaBoost. It has selected
the feature across the eye, nose and cheek areas, possibly due to
the contrast in colour. . . . . . . . . . . . . . . . . . . . . . . . . 63
4.7 Example of a face localiser based on a boosted cascade of 20 simple
classifiers. If the hit rate for each classifier is 0.9998 and the false-
alarm rate is set to 0.5 then the overall localiser should be able
to yield a hit rate of 0.9998^20 = 0.9960 and a false-alarm rate of
0.5^20 = 9.54 × 10^−7. . . . . . . . . . . . . . . . . . . . . . . . . .
4.8 Points used for facial feature localisation on the face: (a) right eye,
(b) left eye, (c) nose, (d) right mouth corner, (e) top mouth, (f)
left mouth corner, (g) bottom mouth, (h) mouth center, and (i) chin. 65
4.9 Example of the 16 × 16 frontal faces from the IBM smart-room
database used for this thesis. . . . . . . . . . . . . . . . . . . . . . 66
4.10 Example of the negative images used for training of the face classifier. 67
4.11 Example of the templates used for the training of the frontal facial
features. The ROI shown on the right is an example of the mouth
center template. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.12 Example of negative images used for the training of the frontal
facial feature classifiers. . . . . . . . . . . . . . . . . . . . . . . . . 69
4.13 Block diagram of the visual front-end for the frontal pose. . . . . 70
4.14 Mouth ROI extraction examples. The upper rows show examples
of the localised face, eyes, mouth region and mouth corners. The
lower row shows the corresponding normalised mouth ROI’s (32×32 pixels). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Appearance based features utilise the entire ROI given on the left.
Contour based features require further localisation to yield features
based on the physical shape of the mouth, such as mouth height
and width which is depicted on the right. . . . . . . . . . . . . . . 77
5.2 Block diagram depicting the cascading approach used by Potmi-
anos et al. [145] to extract appearance based features from the
mouth ROI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3 Block diagram showing the capturing of the static features of a
ROI frame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 Diagram showing the zig-zag scheme used in reading in the coeffi-
cients from an encoded two-dimensional DCT image. . . . . . . 84
5.5 Examples showing the reconstructed ROI’s using the top M coef-
ficients from the DCT: (a) original, (b) M = 10, (c) M = 30, (d)
M = 50 and (e) M = 100. . . . . . . . . . . . . . . . . . . . . . . 84
5.6 Plot showing the speaker information contained within the features
without normalisation, for the digits “zero”, “one” and “two”. . . 85
5.7 Block diagram showing the feature mean normalisation (FMN)
step of the cascading process, resulting in y_t^II . . . . . . . . . . 86
5.8 Plot showing that with FMN the unwanted speaker information
contained within the features is effectively removed, for the digits
“zero”, “one” and “two”. . . . . . . . . . . . . . . . . . . . . . . . 87
5.9 Block diagram showing the augmented static feature capture sys-
tem using the FMN in the image domain rather than the feature
domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.10 Block diagram showing the capturing of the dynamic features cen-
tered at each ROI frame. . . . . . . . . . . . . . . . . . . . . . . . 91
5.11 Plot showing the effect that FMN has on the lipreading performance. 94
5.12 Plot comparing the lipreading performance of both the image based
and feature based FMN methods. . . . . . . . . . . . . . . . . . . 95
5.13 Plot of the lipreading results showing the effect that LDA has
on improving speech classification on the final static features over
various values of N . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.14 Plots of the lipreading results for the dynamic and final features
on the MRDCT (a) and MRDiff (b) features using various values
for J and P using N = 30 input features. . . . . . . . . . . . . . 97
5.15 Examples showing the reconstructed ROI’s using the top M coef-
ficients for: (a) original, (b) M = 10, (c) M = 30, (d) M = 50
and (e) M = 100. The images on top refer to the reconstructed
ROI’s using MRDCT coefficients. The images on bottom refer to
the reconstructed ROI’s using the MRDCT with the odd frequency
components removed (MRDCT-OFR). . . . . . . . . . . . . . . . 100
5.16 Examples showing the reconstructed half ROI’s using the top M
coefficients from the MRDCT for each side: (a) original, (b) M =
10, (c) M = 30, (d) M = 50 and (e) M = 100. The top refers to
the reconstructed images of the right side of the ROI. The bottom
refers to the reconstructed images of the left side of the ROI. These
images are all of size 16× 32 pixels . . . . . . . . . . . . . . . . . 101
5.17 Results showing that removing the odd frequency components of
the MRDCT features helps improve lipreading performance. . . . 102
5.18 Plot of the results showing that LDA effectively nullifies the benefit
of the MRDCT-OFR in the previous step. . . . . . . . . . . . . . 103
5.19 Examples of the ROI broken up into: (a) top, bottom, left and
right side patches; and (b) 9 patches, where the top band refers
to patches 1, 2 and 3; the middle band to patches 4, 5 and 6;
and the bottom band to patches 7, 8 and 9. . . . . . . . . . . . . 105
6.1 Synchronous (a) frontal and (b) profile views of a subject recorded
in the IBM smart room (see Chapter 3). In the latter, visible
facial features are “compacted” within approximately half the area
compared to the frontal face case, thus increasing tracking difficulty. 112
6.2 Example of the points labeled on the face: (a) left eye, (b) nose,
(c) top mouth, (d) mouth center, (e) bottom mouth, (f) left mouth
corner, and (g) chin. The center of depicted bounding box around
the eye defines the actual feature location. . . . . . . . . . . . . . 114
6.3 Examples of the facial feature templates of the profile view used
to train up the respective facial feature classifiers. . . . . . . . . . 115
6.4 Examples of the profile face templates used to train up the profile
face classifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.5 Block diagram of the face and mouth localisation and tracking
system for profile views. . . . . . . . . . . . . . . . . . . . . . . . 117
6.6 (a) An example of face localisation. (b) Based on the face lo-
calisation result, a search area to locate the left eye and nose
is obtained. The face box is lengthened or shortened according to
metric1. (c) The left mouth corner is located within the generalised
mouth region. The ratio (metric2) is then used for normalising the
ROI. (d) An example of the scaled normalised located ROI of size
(48× 48) ·metric2 pixels. . . . . . . . . . . . . . . . . . . . . . . 118
6.7 Examples of accurate (a-d) and inaccurate (e,f) results of the lo-
calisation and tracking system. In (f), it can be seen that the
subject exhibits a somewhat more frontal pose compared to the
profile view of the other subjects. . . . . . . . . . . . . . . . . . . 119
6.8 Results comparing the front and profile lipreading performance at
various stages of the static feature capture. . . . . . . . . . . . . . 120
6.9 Comparison of the lipreading performance between the frontal (a)
and profile (b) dynamic and final features using various values for
J and P using M = 30 input features. . . . . . . . . . . . . . . . 121
6.10 Examples of the ROI broken up into: (a) top, bottom, left and
right side patches; and (b) 9 patches, where the top band refers
to patches 1, 2 and 3; the middle band to patches 4, 5 and 6;
and the bottom band to patches 7, 8 and 9. . . . . . . . . . . . . 123
6.11 Block diagram depicting the various lipreading systems that can
function when 2 cameras are synchronously capturing a speaker
from different views. The lipreading system can use only one view
(either frontal or profile in this case), or combine both views to form
a multi-view lipreading system (which is depicted by the dashed
lines and bold typeface). The multi-view features can either be
fused at an early stage using feature fusion or in the intermediate
level via a synchronous multi-stream HMM (SMSHMM). . . . . . 127
7.1 Given one camera, the lipreading system has to be able to lipread
from any pose. In this example, those poses are either frontal or
profile poses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.2 Schematic of the proposed pose-invariant lipreading scheme: Vi-
sual speech features xn extracted from an undesired pose (e.g. pro-
file) are transformed into visual features tn in the target pose space
(e.g. frontal) via a linear regression matrix W, calculated offline
based on synchronised multi-pose training data T and X of fea-
tures extracted from the different poses. . . . . . . . . . . . . . . 132
7.3 Given one camera, the lipreading system has to be able to lipread
from any pose. In this example, those poses are either frontal or
profile poses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.4 Plots showing the impact that normalising the pose has on lipread-
ing performance for the: (a) frontal and combined(50-50) systems;
and (b) profile and combined(50-50) systems. These systems are
tested across various numbers of features Q = 10 − 60. In the
legend, the first label refers to the test set and the label within the
bracket denotes the system’s name. . . . . . . . . . . . . . . . . . 139
7.5 Plot showing the impact that normalising the pose has on lipread-
ing performance for the frontal, profile and combined(50-50) sys-
tems. These systems are tested across various numbers of features
Q = 10−60. In the legend, the first label refers to the test set and
the label within the bracket denotes the system’s name. . . . . . . 141
7.6 Plot showing the impact that biasing the system to the frontal pose
has on the lipreading performance for the frontal and combined(80-
20) systems. These systems are tested across various numbers of
features Q = 10 − 60. In the legend, the first label refers to the
test set and the label within the bracket denotes the system’s name. 143
7.7 Plot showing the impact that normalising the pose has on lipread-
ing performance for the frontal, profile and combined(50-50) sys-
tems. These systems are tested across various numbers of features
Q = 10−60. In the legend, the first label refers to the test set and
the label within the bracket denotes the system’s name. . . . . . . 144
7.8 In these experiments, the lipreading system has to lipread from the
frontal, right and left profile poses, instead of just the frontal and
profile (right) poses. . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.9 Block diagram of the continuous pose-invariant lipreading system. 148
7.10 Block diagram of the pose estimator which incorporates the pose
estimation with the face localisation. . . . . . . . . . . . . . . . . 151
7.11 Example showing the function of the nearest neighbour variable in
the face localiser. . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.12 Examples of results from the pose estimator. The first two rows
give results for the frontal pose. The third and fourth rows give
the results for the right profile pose and the last two rows give the
results for the left profile pose. The last column gives examples of
false estimates and miss estimates. . . . . . . . . . . . . . . . . . 154
7.13 Examples of face and facial feature localisation from the multi-pose
visual front-end. The bottom row gives the associated examples of
the extracted 32× 32 ROI’s . . . . . . . . . . . . . . . . . . . . . 156
A.1 Plots of the lipreading results for the dynamic and final features
on the MRDCT (a) and MRDiff (b) features using various values
for J and P using N = 10 input features. . . . . . . . . . . . . . 191
A.2 Plots of the lipreading results for the dynamic and final features
on the MRDCT (a) and MRDiff (b) features using various values
for J and P using N = 20 input features. . . . . . . . . . . . . . 192
A.3 Plots of the lipreading results for the dynamic and final features
on the MRDCT (a) and MRDiff (b) features using various values
for J and P using N = 30 input features. . . . . . . . . . . . . . 192
A.4 Plots of the lipreading results for the dynamic and final features
on the MRDCT (a) and MRDiff (b) features using various values
for J and P using N = 40 input features. . . . . . . . . . . . . . 193
Acronyms & Abbreviations
AAM Active appearance model
ANN Artificial neural network
ASM Active shape model
ASR Automatic speech recognition
AVASR Audio-visual automatic speech recognition
CMS Cepstral mean subtraction
CUAVE Clemson University audio-visual experiments database
DBN Dynamic Bayesian network
DCT Discrete cosine transform
Diff Discrete cosine transform of difference images
DTW Dynamic time warping
DWT Discrete wavelet transform
EI Early integration
EM Expectation-maximisation
FA False alarm
FMN Feature mean normalisation
GMM Gaussian mixture model
HiLDA Hierarchical linear discriminant analysis
HMM Hidden Markov model
HTK Hidden Markov model toolkit
LDA Linear discriminant analysis
LI Late integration
MI Middle integration
MRDCT Mean removed discrete cosine transform
MRDiff Mean removed discrete cosine transform on difference images
PCA Principal component analysis
ROI Region of interest
SMSHMM Synchronous multi-stream hidden Markov model
SNR Signal-to-noise ratio
WER Word error rate
Authorship
The work contained in this thesis has not been previously submitted for a degree
or diploma at any other higher education institution. To the best of my knowledge
and belief, the thesis contains no material previously published or written by
another person except where due reference is made.
Signed:
Date:
Acknowledgements
It is not possible to thank everybody who has had an involvement with me during
the course of my PhD. However, there are some people who must be thanked.
Firstly and most importantly, I would like to thank my parents who have been,
and still are my biggest supporters. They have sacrificed so much to give me every
opportunity to succeed in life. Their unwavering belief in my ability, their never
ending support, as well as their guidance, comfort, compassion and perspective
have allowed me to achieve more than I ever thought I could. I am forever
indebted to them for everything they have done for me and they will never know
how much of a positive influence they have been on my life. I should be so lucky
to turn out to be half the people they are.
I would also like to thank my principal supervisor, Professor Sridha Sridharan
for his guidance and encouragement throughout my course of study. The research
environment he has created in the SAIVT laboratory, as well as the many oppor-
tunities to visit foreign institutions and international conferences, is testimony to
his commitment to excellence in research and development, and for that I am very
thankful. It should also be mentioned that part of this PhD was supported
by Australian Research Council Grant No. LP0562101.
During my PhD, I was fortunate to visit two overseas research institutions. I
would like to thank Dr. Gerasimos “Makis” Potamianos for giving me the chance
to work with him at IBM’s T.J. Watson Research Center in New York in 2006.
My time at IBM was one of the best experiences of my life and proved to be a
turning point in my research career. His constant feedback and flexibility
in allowing me to focus on various aspects of visual speech has been a major
reason why this thesis could be completed. I would also like to thank Professor
Tsuhan Chen for giving me the opportunity to come and study at the prestigious
Carnegie Mellon University in 2005. This was an invaluable experience, which
really opened my eyes to how research should be conducted.
The past and present members of the SAIVT laboratory must also be ac-
knowledged, for the great atmosphere they created as well as their expertise in
research which has made it a pleasure to be their colleague and friend. I would
particularly like to thank my colleague David Dean for his help throughout my
thesis as he was so often my co-pilot in trying to disambiguate the many prob-
lems associated with audio-visual speech processing. Special mention must also
go to Terry Martin, Brendan Baker, Chris McCool, Robbie Vogt, Jason Dowling,
David Dean, Frank Lin, Simon Denman, Jamie Cook, Michael Mason, Tristan
Kleinschmidt, Eddie Wong, Clinton Fookes, Ivan Drago and Ruwan Lakemond. A
lasting memory will be the endless days Jason and I spent playing, talking,
debating, living and reliving our various cricket dreams.
The person I would like to thank the most though, is my big brother Simon.
He has most certainly been the biggest help through my PhD and words can't
describe how thankful I am for having such a brilliant and helpful mentor. How-
ever, despite his brilliance it still baffles me to this day how he lacks the ability
to bowl a standard orthodox leg break. I would also like to thank my brother
Owen for the laughter and support you have given me over the years. You will
never know how proud I am of the man and father you have become. Your son
Caelin is the most amazing person I have encountered, which is a true reflection
of you. I would also like to thank my brother Jedrow who is simply a unique
and loyal being.
Finally, I would like to acknowledge my extended family and friends who have
put up with me over the years. Sorry for regurgitating so many Simpsons quotes,
I promise I will come up with some unique material one day.
Chapter 1
Introduction
1.1 Motivation and Overview
As computer technology is becoming more and more advanced, consumers are
seeking ways to interact with it to make their lives more comfortable. One of
the key technologies which allows human-to-computer interaction (HCI) to take
place is automatic speech recognition (ASR). ASR has the lofty goal of allowing
a user to interface with a computer by understanding the content of the user's
instructions and then carrying them out. Probably the best example of ASR in
action is the car KITT 1 from the 1980’s television series “Knight Rider”. In this
show, KITT is capable of conducting a natural conversation with the driver as
well as acting on any command given to it. Unfortunately, ASR systems as capable
as KITT are a long way off, as nearly all current systems rely solely on the audio channel
for input, which is often corrupted by a number of environmental factors, most
notably acoustic noise. As most “real-world” applications involve some type of
noise, these ASR systems are of limited use in these applications due to their poor
performance. Invariably, these audio-only systems fail to make use of the bimodal
nature of speech. As visual speech is immune to these acoustic environmental
factors, utilising this visual information in conjunction with the ASR system has
the potential to make systems like KITT a very real possibility in the future. This
area of research is called audio-visual automatic speech recognition (AVASR).
1KITT stands for Knight Industries Two Thousand and is the name of a fictional computer that controls the high-tech Knight 2000, a black Pontiac Firebird Trans Am T-top automobile in the science fiction television series Knight Rider [183].
AVASR is by no means a new research field. In actual fact, the first work in the
field was conducted over fifty years ago and continuous research in this field has
been ongoing for the past twenty years with notable progress being made. Over
this period of time, the need for the visual modality in ASR systems has been
established theoretically and most of the issues involved with AVASR have been
identified. Prototype systems have been built that have demonstrated improved
performance over audio-only systems under laboratory conditions. However,
practical AVASR systems that would be useful in a variety of “real-world”
applications have not yet emerged. As the main benefit of using the visual
modality in ASR systems is to counteract the problems associated with “real-
world” environments, it is quite interesting to see that the majority of research
conducted in AVASR neglected this fact.
The major reason behind the lack of progress in getting a “real-world” AVASR
system deployed, is that most research that has been conducted has neglected
addressing variabilities in the visual domain such as viewpoint, with nearly all of
the present work being conducted on video of a speaker’s fully frontal face. This
is mainly due to the lack of any large corpora that can accommodate poses other
than frontal. However, as more work is being concentrated within the confines of
a “meeting room” or “smart room” environment [52, 131], data is now becoming
available that allows visual speech recognition or lipreading from multiple views
to become a viable research avenue. This has provided the motivation for
the work in this thesis.
A system which can lipread from any viewpoint or pose would be of major
benefit to AVASR. By loosening the constraint on the speaker's
pose, it allows a more pervasive or “real-world” technology to develop, which
would be of major benefit to many applications. Other than the smart room
scenario, this type of technology would be of benefit for in-car AVASR, video
conferencing (via the internet or video phone) and transcribing speech data. How-
ever, allowing more flexibility in the system by including non-frontal visual speech
data introduces more complexity. All aspects of developing a lipreading system
which can cope with these added complexities are investigated in this thesis.
Figure 1.1: Block diagram of an AVASR system, which is a combination of an audio-only and visual-only speech recognition (lipreading) system. For this thesis, the modules within the lipreading system will be focussed on.
1.2 Scope of Thesis
An AVASR system is the combination of an audio-only speech recognition system
and a lipreading system, as depicted in Figure 1.1. A major reason stymieing the
full deployment of an AVASR system in “real-world” applications is the lack of
research being conducted in the field of AVASR that focuses on the unwanted
variabilities that lie within the visual domain, most notably head pose. In an at-
tempt to remedy this situation, the work in this thesis has solely concentrated on
researching and developing methods within the lipreading portion of an AVASR
system to allow visual speech to be recognised across multiple views. Within this
multi-faceted problem, the scope of this thesis was constrained to the following
objectives:
1. Recognise visual speech from profile views and compare it to its synchronous
counterpart in the frontal view,
2. Determine if there is any complementary information within the profile view-
point by combining both frontal and profile features together to form a
multi-view lipreading system, and
3. Develop a pose-invariant lipreading system which can recognise visual speech
regardless of the head pose from a single camera.
All the work contained in this thesis is designed to address each of these novel
and previously unsolved problems.
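The modular structure of Figure 1.1 can be illustrated with a short sketch. All function bodies below are toy stand-ins, not the actual MFCC extraction, Viola-Jones visual front-end or HMM decoders used in this thesis; they exist only to make the data flow between the modules concrete.

```python
import numpy as np

def acoustic_features(audio, frame_len=160):
    """'Acoustic Feature Extraction' stand-in: one crude energy value per frame."""
    frames = audio[: len(audio) // frame_len * frame_len]
    return frames.reshape(-1, frame_len).std(axis=1)

def visual_front_end(video):
    """Locate/track the mouth ROI; here reduced to a fixed crop per frame."""
    return video[:, 8:24, 8:24]

def visual_features(rois):
    """'Visual Feature Extraction' stand-in: mean intensity per ROI frame."""
    return rois.reshape(len(rois), -1).mean(axis=1)

def score(features, templates):
    """Stand-in for HMM decoding: negative distance to each word template."""
    return {w: -abs(features.mean() - t) for w, t in templates.items()}

def avasr(audio, video, audio_templates, visual_templates, lam=0.7):
    """Decision-level combination of the two single-modality recognisers."""
    sa = score(acoustic_features(audio), audio_templates)
    sv = score(visual_features(visual_front_end(video)), visual_templates)
    fused = {w: lam * sa[w] + (1 - lam) * sv[w] for w in sa}
    return max(fused, key=fused.get)

audio = np.zeros(1600)               # toy audio signal
video = np.full((10, 32, 32), 0.9)   # 10 frames with a bright mouth region
word = avasr(audio, video,
             audio_templates={"yes": 0.0, "no": 0.5},
             visual_templates={"yes": 0.9, "no": 0.1})
print(word)  # prints "yes"
```

The point of the sketch is structural: the lipreading branch (visual front-end, feature extraction, classification) is exactly the portion of the system this thesis concentrates on, and it plugs into the audio branch only at the fusion stage.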
1.3 Outline of Thesis
The remainder of this thesis is organised as follows:
Chapter 2 gives a high-level overview of the various topics of AVASR, detail-
ing its history as well as the physiological, linguistic and psychological as-
pects. The many questions pertaining to why the visual modality is useful
to recognising speech, as well as what visual representations are effective
for lipreading are addressed. This forms the motivation behind the
lipreading system presented in this thesis.
Chapter 3 provides an in-depth review of current classifier theory. In this chap-
ter the topic of classifying visual speech is broached, with the hidden Markov
model (HMM) being detailed as the classifier of choice. Various integration
strategies that can be employed for combining synchronous visual features
together using feature fusion methods or decision fusion methods are also
discussed. The chapter also gives a relatively thorough review of the
audio-visual databases which are currently available. Specifically, the IBM
smart-room and CUAVE databases, the two databases used in this thesis,
are described along with their respective protocols.
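The two integration families mentioned above can be sketched in a few lines. The dimensions, stream weight and function names below are arbitrary illustrations, not the configurations used in the thesis experiments.

```python
import numpy as np

def feature_fusion(feat_a, feat_b):
    """Feature (early) fusion: concatenate synchronous feature vectors
    frame by frame, then train a single classifier on the result."""
    return np.concatenate([feat_a, feat_b], axis=1)

def decision_fusion(loglik_a, loglik_b, lam=0.6):
    """Decision (late) fusion, as in a synchronous multi-stream HMM state:
    per-stream log-likelihoods combined with exponent weights lam and 1-lam."""
    return lam * loglik_a + (1.0 - lam) * loglik_b

fa = np.zeros((5, 30))                # 5 frames of 30-dim features, stream A
fb = np.zeros((5, 20))                # 5 frames of 20-dim features, stream B
print(feature_fusion(fa, fb).shape)   # (5, 50)
print(decision_fusion(-10.0, -20.0))  # -14.0
```

Feature fusion lets the classifier model dependencies between the streams but inflates the feature dimensionality, whereas decision fusion keeps separate models per stream and exposes an explicit reliability weight, which is why both families are considered when combining synchronous visual features.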
Chapter 4 gives a comprehensive evaluation of various visual front-ends, which
can automatically locate and track a speaker’s mouth ROI. This task is
shown to be difficult due to the many variations the visual front-end has
to deal with such as pose, illumination, appearance and occlusion. These
variations can affect the overall lipreading performance due to the front-
end effect. With these variations in mind, the visual front-end is developed
using the Viola-Jones algorithm and this system is presented for the frontal
pose scenario. This method is shown to be extremely rapid and accurate,
which is imperative for a real-time application such as lipreading.
Chapter 5 gives a detailed review of all visual feature extraction techniques
for lipreading. From this review it is shown that the appearance based fea-
tures are the representation of choice, and the cascade of appearance based
features is revealed as the current state-of-the-art technique. Novel analysis
of each stage of the cascade is then conducted on the frontal view data,
which shows the impact each stage of the cascade has on the lipreading
performance. This analysis includes an observation on the effect that both
the feature mean normalisation (FMN) step and the dimensionality of the
input feature vectors have on lipreading performance. A variant of the FMN
step is then introduced, showing that performing the normalisation step in
the image domain rather than the feature domain is slightly advantageous.
As the ROI for the frontal pose is symmetrical, an algorithm presented by
Potamianos and Scanlon [144] which exploits this symmetry is implemented.
It is shown that exploiting this symmetry can improve lipreading at an
early level within the cascading framework. Motivated by
this work, analysis of the various regions of the ROI is then conducted using
patches, which is the first analysis of its type. As a means of making use
of this prior knowledge, a novel patch-based multi-stream representation of
the ROI is introduced.
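The FMN comparison above can be made concrete with a stripped-down sketch. The ROI size, the orthonormal DCT implementation and the utterance-level mean are illustrative simplifications of the actual feature cascade; in this minimal linear setting the two variants coincide exactly, and the differences analysed in Chapter 5 emerge only within the full pipeline.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    C = np.cos(np.pi * (2 * np.arange(n)[None, :] + 1) * k / (2 * n))
    C[0] *= np.sqrt(1.0 / n)
    C[1:] *= np.sqrt(2.0 / n)
    return C

def dct2(img, C):
    """2-D DCT of one ROI frame via the separable basis."""
    return C @ img @ C.T

rng = np.random.default_rng(1)
rois = rng.random((20, 8, 8))   # 20 mouth-ROI frames of an utterance, 8x8 pixels
C = dct_matrix(8)

# Feature-domain FMN: take DCT features, then subtract the utterance mean.
feats = np.stack([dct2(r, C) for r in rois])
fmn_feat = feats - feats.mean(axis=0)

# Image-domain variant: subtract the mean ROI image, then take the DCT.
fmn_img = np.stack([dct2(r - rois.mean(axis=0), C) for r in rois])

# The DCT is linear, so here the two normalisations agree numerically.
print(np.allclose(fmn_feat, fmn_img))  # True
```

Either way, the normalisation removes the speaker-dependent static appearance of the mouth region, leaving the dynamic component that carries the visual speech information.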
Chapter 6 develops a lipreading system which is capable of extracting and
recognising visual speech information from profile views. These results are
compared to their synchronous counterparts in the frontal view. This consti-
tutes the first published work which quantifies the performance degradation
of lipreading in the profile view compared to the frontal view. In the exper-
iments, it is demonstrated that the profile view contains significant visual
speech information, although it is less pronounced than in the frontal view.
This profile information is not totally redundant to the frontal video, as the
multi-view lipreading system shows. The multi-view system presented is
unique to the field of AVASR as it is the first lipreading system published
which has more than one camera at its input. Patch-based analysis of the
profile ROIs is also conducted, and the pertinent regions of the ROIs are
fused together to gain a better representation of the profile speech.
Chapter 7 introduces the novel problem of pose-invariant lipreading. Two sce-
narios of the problem are visited, i.e. stationary and continuous. The first
part of the chapter deals with the stationary scenario. In the experiments it
is shown that when the features of one pose were tested on the other pose,
the train/test mismatch between the two is large and the lipreading per-
formance severely degrades as a consequence. To overcome this problem, a
pose-invariant or pose-normalising technique using linear regression is used
to project all the features of the unwanted pose into the wanted pose. This
technique is shown to reduce the train/test mismatch between the different
poses, and is shown to be of particular benefit when one pose is more preva-
lent than the other (i.e. frontal over right profile) due to over-generalisation.
In the latter part of the chapter, the more realistic continuous scenario is
investigated. In this novel contribution, the pose-estimator is developed in
conjunction with the face localiser. For these experiments, it is shown that
the addition of the pose-estimator impacts on the lipreading results due to
the front-end effect.
Chapter 8 summarises the work contained in this thesis, highlighting major
research findings. Avenues for future work and development are also dis-
cussed.
1.4 Original Contributions of Thesis
In this thesis a number of original contributions are made to the field of lipreading
and AVASR in general. These are summarised as:
(i) Generic single-stream and multi-stream combination strategies using HMMs
for the novel task of fusing multiple sets of synchronous visual features
together are proposed in Chapter 3.
(ii) Protocols for the IBM smart-room and CUAVE databases which contain
frontal as well as non-frontal views of a speaker’s face are presented in
Chapter 3.
(iii) A comprehensive evaluation of various visual front-ends, specifically for
lipreading, along with the formation of a complete visual front-end using
the Viola-Jones algorithm on the frontal view is undertaken in Chapter 4.
(iv) Results showing the effect of each stage of the cascade of appearance based
features, which is the current state-of-the-art visual feature extraction for
lipreading, are presented in Chapter 5. The performance is also compared
against the number of features used, which displays the problem of dimen-
sionality in lipreading using a HMM classifier.
(v) Analysis of the feature mean normalisation (FMN) step is undertaken in
Chapter 5, showing the effect a person’s appearance has on the lipreading
performance. In this analysis, a comparison of the FMN step in the image
domain to the feature domain is conducted, showing that the image-based
approach is slightly superior.
(vi) Determining the saliency of the various regions of the frontal ROIs to
lipreading is undertaken in Chapter 5 via patch-based analysis. In this
innovative analysis, it is shown that the middle patch, containing the most
visible articulators such as the lips, teeth and tongue, is the most salient.
(vii) A new lipreading approach, fusing the more salient patches of the mouth
together via single and multi-stream HMMs is proposed at the end of
Chapter 5.
(viii) A novel visual front-end which is able to locate and track a profile mouth
ROI using the Viola-Jones algorithm is presented in Chapter 6.
(ix) A comparison of the synchronous frontal and profile lipreading performances
is given in Chapter 6. This comparison is unique as it shows that reasonable
lipreading performance can be obtained from the profile view, however, it
is degraded when compared to its frontal counterpart.
(x) In Chapter 6, patch-based analysis of the profile ROIs is conducted and
the most informative patch is shown to be the middle patch containing
the center of the mouth and the protrusion of the lips. The more salient
patches are then combined to gain a better representation of the profile
visual speech.
(xi) A multi-view lipreading system is presented at the end of Chapter 6. This
novel approach to lipreading shows that by fusing the synchronous frontal
and profile visual features together, improved performance over the frontal
only scenario can be obtained.
(xii) A unified approach to lipreading is presented in Chapter 7, normalising all
poses to a single uniform pose. Given only one camera, this pose-invariant
lipreading system uses a transformation matrix based on linear regression
to project the features of the unwanted pose (profile) into the wanted pose
(frontal). These experiments were performed for the stationary scenario,
where the speaker was fixed in one pose (i.e. frontal or profile) for the
entire utterance and the pose of the speaker was assumed. This technique
is shown to be of benefit when the speaker is in one dominant pose such as
the frontal pose. When more non-dominant poses are included, the pose-
normalising step also proves to be of benefit.
(xiii) A continuous pose-invariant lipreading system, which allows the speaker
to move their head during the utterance is proposed in the latter part of
Chapter 7. In this system, a novel pose-estimator is developed in conjunc-
tion with the face localiser, which then cues the visual front-end for the
respective pose. As the pose-estimation step is at the front of the lipread-
ing system, it introduces extra error which affects the overall lipreading
performance.
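The pose-normalising transformation of contribution (xii) can be sketched on synthetic data. The feature dimensions, the noise level and the bias column below are illustrative choices, not those of the thesis experiments; only the core idea of a least-squares transformation matrix learned on synchronous frontal/profile pairs is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic synchronous training pairs: profile features and the
# corresponding frontal features (related by an unknown linear map).
X_profile = rng.standard_normal((500, 20))
A_true = rng.standard_normal((20, 30))
X_frontal = X_profile @ A_true + 0.01 * rng.standard_normal((500, 30))

# Least-squares estimate of the transformation matrix (with a bias term).
Xp = np.hstack([X_profile, np.ones((500, 1))])
W, *_ = np.linalg.lstsq(Xp, X_frontal, rcond=None)

def pose_normalise(profile_feats, W):
    """Project profile-view features into the frontal feature space."""
    ones = np.ones((len(profile_feats), 1))
    return np.hstack([profile_feats, ones]) @ W

projected = pose_normalise(X_profile, W)
err = np.abs(projected - X_frontal).mean()
print(round(err, 3))  # small residual on this synthetic data
```

At test time, profile features are projected through W before decoding, so a single set of frontal-pose models can be used regardless of the observed head pose; this is what reduces the train/test mismatch described above.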
1.5 Publications Resulting from Research
The following fully-refereed publications have been produced as a result of the
work in this thesis:
1.5.1 Book Chapters
(i) P.Lucey, G. Potamianos and S.Sridharan, “Visual Speech Recognition Across
Multiple Views”, to appear in Visual Speech Recognition: Lip Segmentation
and Mapping (A. Liew and S. Wang, eds.), IGI Global, 2007 [proposal ac-
cepted].
1.5.2 International Conference Publications
(i) P. Lucey, G. Potamianos and S. Sridharan, “A Unified Approach to Multi-
Pose Audio-Visual ASR”, to appear in Proceedings of Interspeech, (Antwerp,
Belgium), August 2007 [awarded best student paper ].
(ii) P.Lucey, G. Potamianos and S.Sridharan, “An Extended Pose-Invariant
Lipreading System”, to appear in Proceedings of the International Workshop
on Auditory-Visual Speech Processing (AVSP), (Hilvarenbeek, The Nether-
lands), August 2007 [abstract].
(iii) D. Dean, P. Lucey, S. Sridharan and T. Wark, “Fused HMM-Adaptation of
Multi-Stream HMMs for Audio-Visual Speech Recognition”, to appear in
Proceedings of Interspeech, (Antwerp, Belgium), August 2007.
(iv) D. Dean, P.Lucey, S.Sridharan and T. Wark, “Weighting and Normalisation
of Synchronous HMMs for Audio-Visual Speech Recognition”, to appear in
Proceedings of the International Workshop on Auditory-Visual Speech Pro-
cessing (AVSP), (Hilvarenbeek, The Netherlands), August 2007 [abstract].
(v) P. Lucey and G. Potamianos, “ Lipreading Using Profile Versus Frontal
Views”, in Proceedings of the International Workshop on Multimedia and
Signal Processing (MMSP), (Victoria, Canada), pp. 24-28, 2006.
(vi) P. Lucey and S. Sridharan,“Patch-based Representation of Visual Speech”,
in HCSNet Workshop on the Use of Vision in Human-Computer Interaction
(VisHCI 2006), (R. Goecke, A. Robles-Kelly, and T. Caelli, eds.), vol. 56
of CRPIT, (Canberra, Australia), pp. 79-85, ACS, 2006.
(vii) G. Potamianos and P. Lucey, “Audio-Visual ASR from Multiple Views in-
side Smart Rooms”, Proceedings of the International Conference on Multi-
sensor Fusion and Integration for Intelligent Systems (MFI), (Heidelberg,
Germany), pp. 35-40, 2006.
(viii) P. Lucey, S. Lucey and S. Sridharan,“Using a Free-Parts Representation for
Visual Speech Recognition”, in Proceedings of Digital Imaging Computing:
Techniques and Applications (DICTA), (Cairns, Australia), pp. 379-384,
2005.
(ix) P. Lucey, D. Dean and S. Sridharan,“Problems associated with current area-
based visual speech feature extraction techniques”, in Proceedings of Inter-
national Conference on Auditory-Visual Speech Processing (AVSP), (British
Columbia, Canada), pp. 73-78, 2005.
(x) S. Lucey and P. Lucey,“Improved speech reading through a free-parts rep-
resentation”, in Proceedings of the International Conference on Auditory-
Visual Speech Processing (AVSP), (British Columbia, Canada), pp. 85-86,
2005.
(xi) D. Dean, P. Lucey and S. Sridharan,“Audio-Visual Speaker Identification
using the CUAVE Database”, in Proceedings of the International Conference
on Auditory-Visual Speech Processing (AVSP), (British Columbia, Canada),
pp. 97-101, 2005.
(xii) D. Dean, P. Lucey, S. Sridharan and T. Wark,“Comparing audio and visual
information for speech processing”, in International Symposium of Signal
Processing and its Applications (ISSPA), (Sydney, Australia), pp. 58-61,
2005.
(xiii) P. Lucey, T. Martin and S. Sridharan,“Confusability of phonemes grouped
according to their viseme classes in noisy environments”, in Proceedings
of the International Conference on Speech, Science and Technology (SST),
(Sydney, Australia), pp. 265-270, 2004.
Chapter 2
A Holistic View of AVASR
2.1 Introduction
AVASR is a very broad and diverse research field. Areas such as linguistics,
psychology and physiology in addition to the machine learning/computer vision
area are all incorporated under the same AVASR umbrella. Being such a broad
area of work, it is imperative for researchers to have some grasp of the key elements
in each of these individual areas, so as to obtain the best representation of the
visual signal. This is necessary so that the performance of the final lipreading
system for this thesis is maximised.
This chapter is intended to give a holistic view of the field of AVASR. The first
part of this chapter traces the history of AVASR, initially giving a brief timeline
covering the last half century, focussing more on the key papers and research
that has led to the development of the current state-of-the-art AVASR system.
The review then concentrates on the recent advances that have been made over
the past five or so years in terms of the application of this technology. The
chapter then focuses on the linguistics and speech production aspects of audio-
visual speech. The final part of this chapter details the various psychological and
cognitive facets associated with the human perception of audio-visual speech.
Having some kind of insight into these “non-machine learning” areas will aid in
the understanding of the final structure of lipreading system proposed in this
thesis.
2.2 The History of AVASR
Understanding speech in noisy environments has been a topic of interest for engi-
neers since the 1890s [167]. This interest heightened in the war years, especially
during the 1940s and 1950s with the rapid growth in military and civil aviation.
An important application that was of interest to engineers working in this field
at the time was improving ways that air traffic controllers could communicate
with pilots. This interest led to the first known work
on audio-visual speech processing, which was published by Sumby and Pollack
in 1954 [166]. In this work, Sumby and Pollack examined the contribution of
visual factors to oral speech intelligibility as a function of the “speech-to-noise”
ratio and the size of the vocabulary. Their motivation came from two
observations: humans can tolerate higher noise levels in speech when lip
information is available than when it is not, and speech intelligibility
diminishes as the message or vocabulary size increases. The results from
this work found that seeing the face
of the talker was equivalent to an effective improvement in the speech to noise
ratio of up to 15 dB.
From the point of view of speech intelligibility, Sumby and Pollack showed
that adding the visual information to the audio signal improved it greatly. But it
wasn’t yet known how the visual modality contributed to the audio signal, until
the work on McGurk and MacDonald in 1976 [118]. In their paper, McGurk
and MacDonald were able to aptly demonstrate the bimodal nature of speech
via the McGurk effect. The McGurk effect essentially shows that when humans
are presented with conflicting acoustic and visual stimuli, the perceived sound
may not exist in either modality. It demonstrates the phenomenon whereby a per-
son sees repeated utterances of the syllable /ga/ with the sound /ba/ being
dubbed onto the lip movements. Often the person perceives neither /ga/
nor /ba/, but instead perceives the sound /da/. This work highlights that not only
does the visual signal improve speech intelligibility but it does so by providing
complementary information, which is the key motivation behind AVASR.
It must be said that over this period of time, it was commonly acknowledged
that the hearing impaired used visual speech to increase speech intelligibility but
these pieces of work were not significant in terms of helping the deaf directly.
It did however, give an indication of what role the visual modality has to play
in terms of providing complementary information to the acoustic channel. This
fact motivated the first actual implementation of an AVASR system developed by
Petajan in 1984 [132]. In this initial system, Petajan extracted simple black and
white images of a speaker’s mouth and took the mouth height, width, perime-
ter and area as his features. The next major progress in AVASR came a decade
later, when Bregler and Konig [13] published their work using eigenlips. Shortly
following this work Duchnowski et al. [45] extended this technique by employing
linear discriminant analysis for the visual feature extraction. In the mid to late
90's, most of the pioneering work in AVASR was coming from the Institut de
la Communication Parlée (ICP) in Grenoble, France [142]. At ICP, they inves-
tigated the problem of fusing the audio and visual modalities together and this
resulted in many benchmark papers by Adjoudani and Benoît [1] and Adjoudani
et al. [2].
Although considerable work on the topic of AVASR was published in the
1990's, little of it was significant in terms of getting an AVASR sys-
tem deployed in a "real-world" scenario. A major restriction stemmed
from the lack of a large audio-visual corpus which could be used to develop
AVASR systems for the task of speaker-independent, large vocabulary continu-
ous speech recognition. In a major effort to remedy this situation, IBM’s Human
Language Technologies Department at the T.J. Watson Research Center coor-
dinated a workshop at the Johns Hopkins University in Baltimore, USA, where
leading researchers from around the world converged in the summer of 2000 to col-
lect such a database and to further improve techniques associated with AVASR.
A full description of this workshop as well as the results are given in [127].
As AVASR is a technology which is driven by data, most of the recent progress
in the field has centered on the work conducted by IBM due to their ability to
capture high quality audio-visual data. Most of the recent notable research out-
comes have stemmed from the work spearheaded by Gerasimos Potamianos and
his colleagues. In addition to the large vocabulary experiments, Potamianos et
al. in 2003 conducted AVASR experiments in challenging environments, where
data was captured in office and in-car scenarios [141]. In this work, they found
that the word error rates in both modalities more than doubled; however,
the visual modality still remained beneficial in recognising speech [141].
In an effort to deploy a real-time AVASR system, the researchers at IBM
developed in 2003 a real-time prototype for small-vocabulary AVASR [32]. In
this work, they obtained real-time performance using a Pentium 4, 1.8GHz
processor. With the same goal in mind in 2004, IBM then produced an AVASR
system which used an infra-red headset [77]. As the extraction of visual speech
information from full-face videos is computationally expensive as well as being
difficult due to visual variabilities such as pose, lighting and background, the mo-
tive behind this work was to bypass these problems by using a special wearable
audio-visual headset which is constantly focussed on the speaker’s mouth [77].
The added benefit of using infra-red illumination was that it also provided ro-
bustness to severe lighting variations. In this work they found that this approach
gave comparable results to normal AVASR systems, which suggested this was a
viable approach.
In the last couple of years, due to the reduction in cost of capturing and
storing audio-visual data, more databases containing data resembling that
encountered in “real-world” noisy conditions are becoming publicly available
for researchers to use. This is in stark contrast to the case
five years ago, where all data captured was in ideal laboratory conditions. This is
essential so that “real-world” phenomena which can greatly affect the performance
of AVASR systems, such as the “Lombard effect” [86] (the phenomenon whereby
a speaker alters their speech to communicate more effectively in noisy environ-
ments), can be investigated. Such an investigation was carried out by Huang and Chen
[75]. Examples of recently collected “real-world” databases include the in-car
audio-visual data of the AVICAR database [93], the stereo data of the AVOZES
database [58], speaker movement in the CUAVE database [130], and the smart-
room data contained in the IBM smart-room database [138]. The availability of
the latter two databases have allowed for work to be completed that forms the
basis of this thesis (see Chapter 3.6 for full description of various audio-visual
corpora).
In addition to using the visual modality to improve speech recognition, the
video signal has been used for various other applications such as speaker recogni-
tion [24, 42, 182], visual text-to-speech [23, 30, 35], speech event detection [39],
video indexing and retrieval [76], speech enhancement [55, 58], signal separation
[53] and speaker localisation [16, 181]. Improvements in these areas will result in
more robust and natural speech recognition and human-computer interaction in
general [142].
To summarise, compared to the state of AVASR a decade ago, the field
can now be said to be a more mature and substantial field of
research. So much so, that there are now many review papers [24, 142] and books
[178] solely focussed on this topic. However, for the future success of AVASR to
be realised, large databases like the IBM Via Voice database, which is suitable
for large vocabulary continuous speech recognition, have to be collected for use
in scenarios where it is hoped to be employed such as in-car environments.
2.3 Anatomy of the Human Speech Production
System
A comprehensive understanding of the human speech production system is imper-
ative in creating a successful lipreading system, so that the final system extracts
all the pertinent visual speech information emanating from the visible articula-
tors. The components which make up the human speech production system are
depicted in Figure 2.1. The human speech signal starts when air is forced out of
the lungs into the vocal tract, which consists of the pharyngeal and mouth cavities.
As air leaves the lungs, it passes through the bronchi and trachea and then flows
past the vocal cords, which determine whether the sound produced is voiced or
unvoiced. Voiced sounds are produced when the vocal cords are tensed, causing them
to vibrate in the air flow. Unvoiced sounds occur when the vocal cords do not
vibrate, as is the case when whispering.
Figure 2.1: Schematic representation of the complete physiological mechanism of speech production highlighting the externally visible area (taken from Rabiner and Juang [147]).
After passing the vocal cords, the final sound is determined by the restrictions
placed in the vocal tract. The main components in the vocal tract responsible
for this are the velum, tongue, teeth, lips and jaw. Each of these can change very
quickly independently of each other, which allows for a large array of sounds to
be produced. From the overall speech production system depicted in Figure 2.1,
it is evident that the only visible articulators of this process are the lips,
teeth, jaw and a portion of the tongue, with the vocal cords, velum and full
tongue shape being unseen. As such, for the final lipreading system the area
around the mouth that contains these visible articulators should be extracted,
as was the case for this thesis (see Chapter 4 for details).
Figure 2.2: Examples showing that the phonemes /p/, /b/ and /m/ look visemically similar. Each of these visemes is shown in images (a), (b) and (c) respectively.
Figure 2.3: Examples showing that the visemes of the acoustically similar phonemes /m/ and /n/ look different in the visual domain. The viseme /m/ is shown in (a) and /n/ is shown in (b).
2.4 Linguistics of Visual Speech
The basic unit of acoustic speech is called the phoneme [146]. In the visual do-
main, the basic unit of visual speech is called the viseme [22]. Generally
speaking, there is a many-to-one mapping between phonemes and visemes, with
many phonemes being assigned to a single viseme. For example, the phonemes
/p/, /b/ and /m/ all look similar in the visual domain and as such are assigned
to the same viseme class, as can be seen in Figure 2.2. By the same token, there
are many phonemes that are acoustically ambiguous yet visually distinct. For
example, the phonemes /n/ and /m/ sound similar in the acoustic domain, but
their respective visemes look distinctly different, as shown in Figure 2.3.
This last example shows another benefit of utilising the visual channel.
At the audio-visual speech recognition summer workshop held at Johns
Hopkins University in 2000 [127], the 44 phonemes in the HTK phone set [189]
were mapped to 13 visemes. These phoneme-to-viseme mappings are given in
Table 2.1. However, this mapping between the two domains is unnecessary
according to Potamianos et al. [142], as having different classes for the audio and
Table 2.1: The mapping of the 44 phonemes from the HTK set to the 13 visemes used in the Johns Hopkins University summer workshop [127].
video components only complicates the fusion process, with unclear performance
gains. Because of this, most of the research conducted in the literature has used
just the acoustic phoneme classes for the visual modality. These different sub-word
classes did not affect the work in this thesis, however, as word models were used.
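The many-to-one phoneme-to-viseme relationship described above can be illustrated with a small sketch. The mapping below is a hypothetical three-class subset with invented class names, not the actual 44-to-13 workshop table of [127]:

```python
# A hypothetical, illustrative subset of a many-to-one phoneme-to-viseme
# mapping (NOT the exact 44-to-13 table of [127]; class names are invented).
PHONEME_TO_VISEME = {
    "p": "V1", "b": "V1", "m": "V1",   # bilabials look alike on the lips
    "f": "V2", "v": "V2",              # labiodentals
    "t": "V3", "d": "V3", "n": "V3",   # alveolars
}

def to_viseme_sequence(phonemes):
    """Collapse a phoneme sequence into its viseme classes."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

# /p/, /b/ and /m/ collapse to one class, while the acoustically
# confusable pair /m/ and /n/ stay visually distinct:
print(to_viseme_sequence(["p", "b", "m"]))  # ['V1', 'V1', 'V1']
print(to_viseme_sequence(["m", "n"]))       # ['V1', 'V3']
```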
2.5 Visual Speech Perception by Humans
Summerfield [167] cites the following three key reasons why lipreading benefits
human speech perception:
1. It helps speaker localisation
2. It provides complementary information about the place of articulation, such
as the tongue, teeth and lips
3. It contains segmental information that supplements the audio.
The first and second points are of particular benefit to those with poor hearing,
who would normally use lipreading as their primary source of speech information.
Some people are so adept at lipreading that they
can almost achieve perfect speech perception [168]. However, as the above three
points note and the McGurk effect shows [118], even people with normal hearing
use lipreading in conjunction with the audio signal to improve speech intelligi-
bility. This phenomenon is often heightened in the presence of acoustic noise as
first noted by Sumby and Pollack [166].
Humans use visual speech information to improve speech intelligibility from
a very young age. Aronson and Rosenblum [4] observed that infants as
young as 3 months old are aware of the bimodal nature of speech, while Dodd
[44] observed that toddlers at the age of 19 months actually perform lipreading.
Mills [120] showed that blind children are slower than sighted children in
acquiring speech production for those sounds which have visible articulators.
Even though these facts make for interesting reading, they do not in themselves
provide much assistance in developing an automatic lipreading system. To obtain
such assistance, a number of questions need to be answered, such as:
• What parts of the face give the most speech information?
• How important is the temporal nature of the visual speech signal?
• How much of an impact does the integration of the audio and video signals
have on speech perception?
Each of these questions is examined in the following subsections with respect
to human perception studies. These findings will be of use when developing the
lipreading system later in this thesis.
Pertinent Areas of the Face for Lipreading
It is largely agreed that most information pertaining to visual speech stems from
the areas around the lips, even though visual speech is located throughout the
human face to some extent [92]. McGrath et al. [117] showed that the human
lips alone carry more than half the visual information provided by the face of an
English speaker. Benoît et al. [7] found that the lips alone contain on average
two thirds of the speech intelligibility carried by a French speaker’s face. Benoît
et al. [7] also showed that a combined lip/jaw model gave a noticeable gain in
performance over a lip-only model, with the combined model performing only
slightly worse than a model of the entire face. Brooke and Summerfield
[14] found that visible articulators such as the teeth and tongue improved the
perception of vowels, while Finn [48] found that for consonants the most important
features were the size and shape of the lips.
Brooke and Summerfield [14] performed perceptual tests using a synthetic
face, synthesising the outer and inner lip contours and the chin. Human
speechreading performance for vowels using the synthetic face proved to be
significantly worse than with a natural face. It was concluded that additional
cues, such as the visibility of the teeth and tongue, were required for more
accurate recognition
of vowels. Finn [48] sought to determine the appropriate oral-cavity features for
consonant recognition. The most important features determined were height and
width of oral cavity opening, the vertical spreading of the upper and lower lips
and the cornering of the lips (puckering).
Temporal Nature of Visual Speech
As speech is a temporal signal, it is intuitive that temporal features would be
of most use. This is certainly the case for audio-only speech recognition, where
delta and acceleration coefficients are appended to the static features to improve
recognition performance. In human perception studies, many experiments have been
carried out testing this theory on visual features [17, 18, 36, 60, 66, 179]. In
the work carried out by Rosenblum and Saldana [151], face kinematics were found
to be more useful than shape parameters. The frame rate of the visual speech is
also important in lipreading, as shown by Frowein et al. [50], who showed that
speech recognition performance drops markedly below 15Hz.
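The delta coefficients mentioned above, appended to static features in audio-only ASR, can be computed with the standard regression formula. This is a sketch assuming a (T, D) feature matrix and an HTK-style regression window of half-width theta:

```python
import numpy as np

def delta_features(X, theta=2):
    """HTK-style delta (velocity) coefficients for a (T, D) feature matrix:
    d_t = sum_{k=1}^{theta} k (x_{t+k} - x_{t-k}) / (2 sum_k k^2),
    with edge frames replicated (the window half-width theta is an
    assumption; HTK's default regression window is comparable)."""
    T, D = X.shape
    denom = 2.0 * sum(k * k for k in range(1, theta + 1))
    # Replicate first/last frames so the regression is defined at the edges.
    padded = np.vstack([np.repeat(X[:1], theta, axis=0), X,
                        np.repeat(X[-1:], theta, axis=0)])
    deltas = np.zeros((T, D))
    for k in range(1, theta + 1):
        deltas += k * (padded[theta + k:theta + k + T]
                       - padded[theta - k:theta - k + T])
    return deltas / denom
```

A linearly increasing feature gives a constant delta of 1 away from the edges; appending `[X, delta]` to the static stream doubles the feature dimensionality, as in standard ASR front-ends.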
Impact of Audio-Visual Integration
Lavagetto [92] demonstrated that acoustic and visual speech stimuli are not
synchronous, at least at a feature-based level. It was shown that the visible
articulators during an utterance start and complete their trajectories
asynchronously, exhibiting both forward and backward coarticulation with
respect to the acoustic speech wave. Intuitively, this makes sense, as the
visual articulators (i.e. lips,
tongue, jaw etc.) have to position themselves correctly before and after the start
and end of an acoustic utterance. This time delay is known as the voice-onset-
time (VOT) [47], which is defined as the time delay between the burst sound,
coming from the plosive part of a consonant, and the movement of the vocal folds
for the voiced part of a voiced consonant or subsequent vowel. McGrath et al.
[117] also found that an audio lead of less than 80ms or a lag of less than
140ms could not be detected during speech. However, if the audio was delayed
by more than
160ms it no longer contributed useful information. It was concluded that, in
practice, delays of up to 40ms are acceptable, which in normal PAL video (25
frames per second) corresponds to a single frame of asynchrony. This signifies
that some degree of asynchrony can be tolerated in continuous audio-visual
speech perception.
Seeing that this thesis is concerned only with the visual channel, the apparent
asynchrony between the two modalities could have implications for the lipreading
system. This is because the models for the visual modality are generally
bootstrapped from the time-labelled transcriptions taken from the audio-only
channel. Even though there is some error associated with this method, it still
appears to be the best way of approximating the initial visual models, given
that in clean conditions the audio channel is the most reliable source of
information.
2.6 Summary
This chapter has given insights into the various aspects associated with AVASR.
The history of AVASR was detailed, citing the major works which have influenced
this field of research over the past fifty years or so. One of these major works
described, was that of the McGurk effect [118], which highlights the bimodal
nature of speech. From the timeline presented, it was shown that the main
driving force behind progress in AVASR has been the availability of quality
data to train and test the various systems. The basic mechanics of the human
speech production system, as well as the linguistics associated with the audio
and visual modalities, were then presented. The complementary nature of the
acoustic and visual speech signals was then analysed, extending the work
conducted by McGurk and MacDonald. Questions pertaining to which visual
articulators and representations are effective for lipreading were then raised.
Each of these questions was explored in turn, and the answers used to form the
motivation behind the lipreading system presented later in this thesis.
Chapter 3
Classification of Visual Speech
3.1 Introduction
A lipreading system can be considered as a sequential, modular system consisting
of the visual front-end, visual feature extraction and visual feature
classification modules, as depicted in Figure 3.1. Even though the visual
front-end and visual feature extraction modules are the major focus of this
thesis, classification remains a central part of a lipreading system and this
chapter is dedicated to that particular area. Classification is a difficult
task, made more difficult still by the addition of extra streams. Due to this
complexity, almost all systems in the AVASR literature have utilised a
two-stream approach, with one stream dedicated to the audio and the other to
the visual signal. However, such techniques are generic and can be applied to
any number of streams, a property that will be made use of throughout this
thesis. The first part of this chapter conducts a brief review of the
classification techniques available for lipreading before detailing the hidden
Markov model (HMM); the second part looks at the various integration methods
which can be used to fuse multiple streams of visual features together. The
final part of the chapter then conducts a thorough review of the currently
available audio-visual corpora, with particular emphasis placed on describing
the data and protocols of the IBM smart-room and CUAVE databases, the two
databases which contain multi-view/pose visual speech data.
Figure 3.1: Block diagram of a lipreading system (Video In → Visual Front-End → Visual Feature Extraction → Classification → Visual-Only Speech Recognition (Lipreading)).
3.2 Classifiers for Lipreading
In the literature, the most widely used classifier for modelling and recognising
audio and visual speech data has been the hidden Markov model (HMM). Discriminant
classifiers such as artificial neural networks (ANNs) have also been used [90].
Heckmann et al. [69] developed a combination of the ANN and HMM to form a hybrid
ANN-HMM classifier. Similarly, Bregler et al. [12] and Duchnowski et al. [45]
devised a hybrid ANN-DTW (dynamic time warping) classifier. Recently, Gowdy et
al. [62] and Saenko et al. [157] have proposed the use of dynamic Bayesian
networks (DBNs) for AVASR. It should be noted, though, that the DBN is not a
classifier as such, but a unifying framework for combining different classifiers
together. In [62] and [157], the DBN was used as a framework for combining the
single-stream HMMs of the respective streams.
Even though all the classifiers mentioned above have enjoyed some success in
their respective AVASR1 tasks, the vast majority of systems employ HMMs with
a continuous observation probability density, modelled as a mixture of Gaussian
densities [142]. The reason for this is simply that HMMs have proven themselves
the best classifier for modelling and recognising the temporal nature of speech
in both the audio and visual signals. As such, the HMM is used as the classifier
of choice in this thesis. The next section gives a detailed description of the
HMM and how it can be utilised for the task of lipreading.
1As lipreading is a subset of the overall task of AVASR, it should be noted that it is implied that the classifiers mentioned above have also been used for the sole task of lipreading.
3.3 Hidden Markov Models (HMMs)
Hidden Markov models (HMMs) are a powerful statistical tool for modelling a
temporal signal based on observations, which are assumed to be generated by a
Markov process whose internal states are unknown, or hidden. A Markov process
may be described at any time as being in one of a set of N distinct states
[146], as depicted in Figure 3.2. At regularly spaced, sampled intervals, the
Markov process
Figure 3.2: Discrete states in a Markov model are represented by nodes and the transition probabilities by links.
undergoes a change of state according to a set of probabilities associated with
the state. So given the sequence of states q, defined as
q = {q1, . . . , qT}, qt ∈ [1, . . . , N ] (3.1)
where qt is the state at time t, a probabilistic description that the model λ
generated the sequence q would require specification of the current state at
time t, as well as all the previous states. However, for a first-order Markov
chain, only the preceding state is used, such that

Pr(qt = j | qt−1 = i, qt−2 = k, . . .) = Pr(qt = j | qt−1 = i) (3.2)

Equation 3.2 can be simplified as the right side is independent of time, leading
to the set of state transition probabilities A = {aij} of the form

aij = Pr(qt = j | qt−1 = i) (3.3)
with the following properties

aij ≥ 0, ∀ i, j and ∑_{j=1}^{N} aij = 1, ∀ i
At time t = 1, there has to be initial state probabilities πi
πi = Pr(q1 = i), 1 ≤ i ≤ N (3.4)
So, given a state sequence q and a Markov model λ = (A, π), the a posteriori
probability that q was created by λ is given by

Pr(q|λ) = Pr(q1) ∏_{t=2}^{T} Pr(qt | qt−1)
        = πq1 aq1q2 aq2q3 . . . aqT−1qT (3.5)
Normally, however, there is no way of knowing the state sequence q. Instead,
the observation feature vectors within the observation set O, defined by

O = {o1, o2, . . . , oT} (3.6)

where ot is an n-dimensional vector at time t, can be used to estimate the
state sequence from the data being modelled. Therefore, the a posteriori
probability that λ generated the observation O can be given by
Pr(O|λ) = ∑_{all q} Pr(O|q, λ) Pr(q|λ) (3.7)
For the case of continuous observations, it is easier to evaluate Equation 3.7
using conditional density functions. This is referred to as a continuous HMM,
which can be defined as
p(O|λ) = ∑_{all q} p(O|q, λ) Pr(q|λ) (3.8)
Modelling continuous observations, rather than using discrete quantised
observations, is preferred as it has been found to be more effective for the
task of lipreading and other speech processing applications [142, 146]. To
evaluate Equation 3.8, a value for p(O|q, λ) is required, which can be
expressed, assuming statistical independence of observations, as
p(O|q, λ) = ∏_{t=1}^{T} p(ot|qt, λ) = ∏_{t=1}^{T} bqt(ot) (3.9)
where B = {bj(ot)} is the compact notation expressing the likelihood of
observation ot lying in state j. The form of bj(ot) can be described as a
mixture of Mj Gaussian components bjm(ot) via

bj(ot) = ∑_{m=1}^{Mj} cjm bjm(ot) = ∑_{m=1}^{Mj} cjm N(ot; µjm, Σjm) (3.10)
where c is the mixture weight, µ is the mixture mean and Σ is the mixture
covariance matrix for mixture m and state j. These are known as Gaussian
mixture models (GMMs). Now that parameters A, B and π have been defined,
the HMM λ can be represented by the compact parameter set of
λ = (A,B, π) (3.11)
Using the model parameters in Equation 3.11, the likelihood of the observation
O can be evaluated using Equation 3.7. This is most often computationally
intractable, so an approximation using the Viterbi decoding algorithm is used
instead. This is described in the next subsection.
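The Gaussian-mixture emission density of Equation 3.10 can be sketched directly. The parameters passed in are hypothetical and the covariances are assumed diagonal:

```python
import numpy as np

def gmm_likelihood(o, weights, means, variances):
    """Emission likelihood b_j(o_t) for one HMM state, modelled as a
    diagonal-covariance Gaussian mixture (a sketch of Equation 3.10;
    the parameter values are assumed already trained)."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        d = len(mu)
        norm = (2.0 * np.pi) ** (-d / 2) / np.sqrt(np.prod(var))
        total += c * norm * np.exp(-0.5 * np.sum((o - mu) ** 2 / var))
    return total

# A single standard Gaussian evaluated at its mean gives 1/sqrt(2*pi):
b = gmm_likelihood(np.zeros(1), [1.0], [np.zeros(1)], [np.ones(1)])
```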
3.3.1 Viterbi Recognition
Recognising the most likely hidden state sequence, q∗, using Equation 3.7 is
computationally prohibitive, as it requires the likelihood of every possible
path to be calculated. There are several ways to find the optimal path q∗
associated with a given observation sequence. The problem arises
in the definition of the optimal state sequence. The most common optimality
criterion [46, 146] is to choose the states qt that are individually most likely
at each time t. Although this criterion is locally optimal, there is no
guarantee that the path is a valid one, as it might not be consistent with the
underlying model λ. However, it has been shown [46, 146] that this locally
optimal solution works effectively in practice, and it can be formalised into
what is known as the Viterbi algorithm [46, 146, 147], given as follows:
1. Initialisation:

δi(1) = πi bi(o1), 1 ≤ i ≤ N
ψi(1) = 0 (3.12)

2. Recursion:

δj(t) = bj(ot) max_{i=1..N} [δi(t−1) aij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
ψj(t) = arg max_{i=1..N} [δi(t−1) aij], 2 ≤ t ≤ T, 1 ≤ j ≤ N (3.13)

3. Termination:

p(O|q∗, λ) = max_{i=1..N} δi(T)
q∗T = arg max_{i=1..N} δi(T) (3.14)

4. Path backtracking:

q∗t = ψ_{q∗t+1}(t + 1), t = T − 1, T − 2, . . . , 1 (3.15)
where δi(t) is the best score along a single path ending in state i at time t,
and ψi(t) is an array keeping track of the argument that maximised each step.
In practice, a closely related algorithm using logarithms [146] is employed,
thus negating the need for any multiplications and reducing the computational
load considerably.
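A minimal log-domain sketch of Equations 3.12 to 3.15 follows; the log-probabilities passed in are assumed precomputed:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Log-domain Viterbi decoding (Equations 3.12-3.15), replacing the
    multiplications with additions as described above.

    log_pi: (N,) initial state log-probabilities
    log_A:  (N, N) state transition log-probabilities
    log_B:  (T, N) per-frame emission log-likelihoods log b_j(o_t)
    Returns the optimal state path q* and its log score."""
    T, N = log_B.shape
    delta = np.empty((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]                      # initialisation
    for t in range(1, T):                             # recursion
        scores = delta[t - 1][:, None] + log_A        # candidate (i -> j) scores
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(N)] + log_B[t]
    path = [int(np.argmax(delta[-1]))]                # termination
    for t in range(T - 1, 0, -1):                     # path backtracking
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```

For a two-state model whose emissions strongly favour state 0, the decoded path stays in state 0 throughout, as expected.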
3.3.2 HMM Parameter Estimation
The parameters of a HMM, λ = (A,B, π), can be learnt from a set of training
observations Or, 1 ≤ r ≤ R, where R is the number of training sequence observa-
tions. Even though there is no method to analytically solve for the model param-
eter set that globally maximises the likelihood of the observation in a closed form
[146], a good solution can be determined by a straightforward technique known
as the Baum-Welch or forward-backward algorithm [46]. The Baum-Welch train-
ing algorithm is an instance of the generalised expectation maximisation (EM)
algorithm [43], which essentially chooses a λ that locally maximises the likeli-
hood p(O|λ). Since this algorithm requires a starting guess for λ, some form of
initialisation must be performed, which is generally done with Viterbi training.
Viterbi Training
Viterbi training for HMMs places a hard boundary on the observation sequence
O. When a new HMM is initialised, the Viterbi segmentation is replaced by
a uniform segmentation (i.e. each training observation is divided into N equal
segments) for the first iteration. After the first iteration, each training sequence
Or is segmented using a state alignment procedure which results from using the
Viterbi decoding algorithm described in the previous subsection to get the optimal
state sequence qr∗. If Aij represents the total number of transitions from state i
to state j in qr∗ for all R observation sequences, then the transition probabilities
can be estimated from the relative frequencies
aij = Aij / ∑_{k=1}^{N} Aik (3.16)
Within each state, a further alignment of observations to mixture components
is made by associating each observation ot with the mixture component with
the highest likelihood. On the first iteration, unsupervised k-means clustering
is employed to gain an initial estimate of bj(ot). Viterbi training is repeated
until there is minimal change in the model parameter estimate λ.
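The two initialisation steps described above, the first-iteration uniform segmentation and the relative-frequency transition estimate of Equation 3.16, can be sketched as:

```python
import numpy as np

def uniform_segmentation(T, N):
    """First-iteration hard alignment: divide T frames into N roughly
    equal state segments, as described above."""
    seg = int(np.ceil(T / N))
    return np.repeat(np.arange(N), seg)[:T]

def estimate_transitions(alignments, N):
    """Relative-frequency transition estimate of Equation 3.16 from a
    set of hard state alignments (one array of state indices per
    training observation sequence)."""
    counts = np.zeros((N, N))
    for q in alignments:
        for i, j in zip(q[:-1], q[1:]):
            counts[i, j] += 1.0                 # accumulate A_ij
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts),
                     where=totals > 0)
```

On subsequent iterations the uniform alignment would be replaced by the optimal state sequence from Viterbi decoding, as described in the text.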
Baum-Welch Re-Estimation
The Baum-Welch re-estimation algorithm uses a soft boundary, denoted by L,
representing the likelihood of an observation being associated with any given
Gaussian mixture component. This soft segmentation replaces the hard boundary
used in Viterbi training. The new likelihood is known as the occupation
likelihood [189] and is computed from the forward and backward
variables. The forward variable αi(t) is defined as
αi(t) = p(o1o2 . . .ot, qt = i|λ) (3.17)
that is, the likelihood of the partial observation sequence o1 o2 . . . ot and
state i at time t, given the model λ. This can be solved inductively as follows:
1. Initialisation:
αi(1) = πibi(o1), 1 ≤ i ≤ N (3.18)
2. Induction:

αj(t + 1) = [∑_{i=1}^{N} αi(t) aij] bj(ot+1), 1 ≤ j ≤ N, 1 ≤ t ≤ T − 1 (3.19)
In a similar manner, the backward variable βi(t) is defined as
βi(t) = p(ot+1 ot+2 . . . oT | qt = i, λ) (3.20)

that is, the likelihood of the partial observation sequence ot+1 ot+2 . . . oT ,
given state i at time t and the model λ. Again, this can be solved inductively:
1. Initialisation:
βi(T ) = 1, 1 ≤ i ≤ N (3.21)
2. Induction:

βi(t) = ∑_{j=1}^{N} aij bj(ot+1) βj(t + 1), 1 ≤ i ≤ N, 1 ≤ t ≤ T − 1 (3.22)
From Equations 3.17 and 3.20 it can be seen that αi(t) is a joint likelihood
whereas βi(t) is a conditional likelihood, such that

p(O, qt = j|λ) = p(o1 o2 . . . ot, qt = j|λ) p(ot+1 ot+2 . . . oT | qt = j, λ)
             = αj(t) βj(t) (3.23)
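The forward and backward recursions of Equations 3.18 to 3.22 can be sketched numerically, with the emission likelihoods assumed precomputed per frame:

```python
import numpy as np

def forward_backward(pi, A, B):
    """Forward and backward variables of Equations 3.17-3.22, with the
    emission likelihoods precomputed as B[t, j] = b_j(o_t)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[0]                              # Equation 3.18
    for t in range(1, T):                             # Equation 3.19
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    for t in range(T - 2, -1, -1):                    # Equation 3.22
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return alpha, beta

# By Equation 3.23, sum_j alpha_j(t) * beta_j(t) equals the total
# likelihood p(O | lambda) at every frame t.
```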
Using this result, the likelihood of qt = j can be defined as Lj(t) in terms
of αj(t), βj(t) and λ via

L_j^r(t) = p(q_t^r = j | O^r, λ)
         = p(O^r, q_t^r = j | λ) / p(O^r | λ)
         = (1/P_r) α_j^r(t) β_j^r(t) (3.24)
The likelihood of q_t^r = j for mixture component m can also be defined as

L_jm^r(t) = (1/P_r) [∑_{i=1}^{N} α_i^r(t − 1) a_ij] c_jm b_jm(o_t^r) β_j^r(t) (3.25)
where P_r is the total likelihood p(O^r|λ) of the r’th observation sequence,
which can be calculated as

P_r = α_N^r(T) = β_1^r(1) (3.26)
The transition probabilities A = {a_ij} can now be re-estimated using

a_ij = [∑_{r=1}^{R} (1/P_r) ∑_{t=1}^{Tr−1} α_i^r(t) a_ij b_j(o_{t+1}^r) β_j^r(t + 1)] / [∑_{r=1}^{R} (1/P_r) ∑_{t=1}^{Tr−1} α_i^r(t) β_i^r(t)] (3.27)
Given Equations 3.24, 3.25 and 3.26, the mixture components of
B = {bj(ot) = (c_jm, µ_jm, Σ_jm)} can be re-estimated by

µ_jm = [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_jm^r(t) o_t^r] / [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_jm^r(t)] (3.28)

Σ_jm = [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_jm^r(t) (o_t^r − µ_jm)(o_t^r − µ_jm)′] / [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_jm^r(t)] (3.29)

c_jm = [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_jm^r(t)] / [∑_{r=1}^{R} ∑_{t=1}^{Tr} L_j^r(t)] (3.30)
Equations 3.27 to 3.30 are iterated until convergence occurs.
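A simplified sketch of the occupation likelihood and one of these updates follows, reduced to a single observation sequence and a single Gaussian per state (so Equation 3.28 without the mixture index):

```python
import numpy as np

def occupancy(alpha, beta):
    """State occupation likelihoods L_j(t) of Equation 3.24 for a single
    observation sequence (r = 1), with the total likelihood taken as the
    sum of the final forward variables."""
    P = alpha[-1].sum()
    return alpha * beta / P                           # shape (T, N)

def reestimate_means(L, O):
    """Occupancy-weighted mean update in the spirit of Equation 3.28,
    simplified to one Gaussian per state and one sequence:
    mu_j = sum_t L_j(t) o_t / sum_t L_j(t)."""
    return (L.T @ O) / L.sum(axis=0)[:, None]
```

Iterating the soft alignment and these weighted averages until the likelihood stops improving is exactly the EM loop described above.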
3.4 Stream Integration
Fusing the audio and visual streams within an AVASR system is a prime example
of the general classifier combination problem [79]. As such, techniques used for
the integration of the audio and visual streams can generalise to any particular
problem, such as combining two or more visual streams of visual data. Combining
multiple visual streams together within a lipreading system, may have the ben-
efit of improving the visual speech representation, inturn improving the overall
lipreading performance. This particular facet is focussed on with the patch-based
and multi-view experiments conducted later on in this thesis (see Chapters 5 and
6).
The goal of combining various streams or classifiers is to give superior
performance over each of the single-stream classifiers. However, great care has
to be taken when combining classifiers due to the risk of catastrophic fusion
[101], which occurs when the performance of an ensemble of combined classifiers
is worse than that of any of the classifiers individually.
According to Potamianos et al. [142], stream integration can be performed
either by feature fusion or by decision fusion methods2. Feature fusion methods
are based on concatenating the features of the various streams into a single
feature vector. The main benefit of feature fusion is that only one classifier
is used, so it can easily be employed within an existing lipreading system.
However, if one data stream is not as reliable or as informative as the
other(s), its features cannot be weighted accordingly. Feature fusion methods
also assume that all streams are synchronous and therefore cannot model any
asynchrony between them. This is not an issue for lipreading, however, as all
streams of data should be synchronous in the visual domain.
In comparison, decision fusion methods model each stream separately and use
their respective classifier outputs to recognise the given speech. Even though
decision fusion methods are more complex than feature fusion methods, they
2It should be noted that these groupings given by Potamianos et al. [142] refer to the integration of both audio and video streams. However, they can be generalised to any number of different data streams with the same temporal nature.
do allow for the weighting of the various streams, which is a very useful
characteristic when the levels of information contained in each stream vary.
Decision fusion methods also allow the various streams to be somewhat
asynchronous, although this is not a requirement for lipreading, as previously
mentioned. The following subsections describe algorithms for both the feature
fusion and decision fusion methods.
3.4.1 Feature Fusion Techniques
Feature fusion methods can be implemented either by using the plain
concatenated feature vector [1], or by transforming the concatenated feature
vector into a more compact representation [143]. Both techniques are discussed
shortly. However, it is worth noting that feature fusion methods can also be
used to convert features of one modality into another. Examples of this are the
audio enhancement work of Girin et al. [54] and Barker and Berthommier [5], who
used visual features to estimate audio features. Goecke et al. [58] and Girin
et al. [55] later extended this work by converting the concatenated
audio-visual features into plain audio features. Although the audio enhancement
work falls outside the scope of this thesis, it was a motivating factor for the
pose-invariant experiments conducted in Chapter 7.
Concatenative Feature Fusion
Given the time-synchronous observation vectors of the various input streams,
i.e. o_t^{(1)}, . . . , o_t^{(M)}, with dimensionalities D^{(1)}, . . . , D^{(M)}
respectively, the joint concatenated visual feature vector at time t becomes

o_t^{(C)} = [o_t^{(1)}, . . . , o_t^{(M)}]^T ∈ R^D (3.31)
where D = D^{(1)} + . . . + D^{(M)}. As with all feature fusion methods, these
features are fed into a single-stream HMM. Concatenative feature fusion
constitutes a simple approach for combining different features, and can be
implemented without much change to existing systems. However, the dimensionality
D can be rather high (especially if M > 2), causing inadequate modelling by the
HMM, due to the curse of dimensionality [8]. The curse of dimensionality
essentially refers to the inability of classifiers such as HMMs to converge when
given observations of high dimensionality. To overcome this problem, the
dimensionality of such observations has to be kept low (≤ 60). This can be
done by employing the following feature fusion technique.
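Concatenative fusion itself reduces to stacking synchronous per-frame features; a minimal sketch of Equation 3.31:

```python
import numpy as np

def concatenate_streams(streams):
    """Concatenative feature fusion (Equation 3.31): stack M time-
    synchronous feature streams, each of shape (T, D_m), into a single
    (T, D) observation matrix with D = D_1 + ... + D_M."""
    T = streams[0].shape[0]
    assert all(s.shape[0] == T for s in streams), "streams must be synchronous"
    return np.hstack(streams)
```

Note how quickly D grows with the number of streams, which is exactly the modelling problem the next technique addresses.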
Hierarchical Linear Discriminant Analysis (HiLDA) Feature Fusion
To overcome dimensionality constraints, it is desirable to get a lower dimensional
representation of Equation 3.31. Potamianos et al. [139] proposed the use of
linear discriminant analysis (LDA) to obtain such a reduction on the concate-
nated audio-visual feature vector (see Chapter 5.3.1 for full explanation). In this
work, they combined the LDA step with a maximum likelihood linear transfor-
mation (MLLT) step. As these steps were performed in the audio and visual
streams prior to concatenation, they termed this feature fusion technique hierar-
chical linear discriminant analysis or HiLDA. In this thesis, the HiLDA technique
was implemented without the MLLT step, so the resulting combined observation
vector can be expressed as
o{HiLDA}t = W × o
{C}t (3.32)
where W is the LDA transformation matrix. The dimensionality of Equation 3.32
can be set to a value which ensures that the single-stream HMM converges.
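A sketch of estimating W from labelled training frames follows; the scatter-matrix formulation used here is the standard LDA derivation, not necessarily the exact implementation of [139]:

```python
import numpy as np

def lda_matrix(X, y, out_dim):
    """Estimate an LDA projection matrix W for Equation 3.32 (the HiLDA
    step without the MLLT stage, as used in this thesis): find directions
    maximising between-class over within-class scatter of the
    concatenated features X (T, D) with class labels y."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))           # within-class scatter
    Sb = np.zeros_like(Sw)                            # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # Leading eigenvectors of Sw^{-1} Sb give the discriminant directions.
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)[:out_dim]
    return vecs.real[:, order].T                      # (out_dim, D)

# Each concatenated frame is then projected as o_hilda = W @ o_c.
```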
3.4.2 Decision Fusion Techniques
Even though feature fusion techniques have been shown to work well [126],
they cannot take into account the relative reliability or usefulness of the
various streams. This is important, as the amount of information contained in
each stream can vary considerably [142]. Decision fusion techniques, however,
provide a framework for imparting such information on the respective streams.
According to Potamianos et al. [142], the most commonly used decision fusion
techniques for AVASR model each stream in parallel, using adaptive weights, and
combine the log-likelihoods linearly. The most likely speech class or word is then
derived using the appropriate weights, as was done in [1, 47, 68, 70, 85, 126, 135].
This task is relatively easy for isolated speech recognition; for continuous
speech recognition, however, it poses a very difficult problem, as the sequences
of classes (such as HMM states or words) need to be estimated [142]. As such,
there are three possible temporal levels at which the various stream likelihoods
can be combined. These are:
1. Early integration (EI), combines the stream likelihoods at the HMM state
level, which gives rise to the multi-stream HMM classifier [10], [189]. At the
EI level of integration, synchrony between the various streams is enforced.
2. Middle integration (MI), is implemented by means of the product HMM
[177], or coupled HMM [11, 125], which force HMM synchrony at the phone,
or word boundaries. The MI approach has been used to good effect in
AVASR as it can compensate for the slight lag between the audio and
visual streams due to the voice onset time (VOT) [92]. Examples of this
integration strategy being used for AVASR can be found in [29, 47,
63, 75, 107, 123, 125, 126, 175].
3. Late integration (LI), typically where a number of n-best hypotheses are
rescored by the log-likelihood combination of the various streams. During
the classification process there is no interaction between the various streams,
with only the final classifier likelihood scores being combined. In this
approach temporal information between the speech modalities is lost. In AVASR,
this integration strategy has been used in [1, 37, 68].
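As a toy illustration of the LI strategy, an n-best list can be rescored by linearly combining per-stream log-likelihoods; the hypotheses, scores, and weight below are purely illustrative:

```python
import numpy as np

# Hypothetical 3-best lists: per-hypothesis log-likelihoods from two
# independently decoded streams (all values are illustrative).
hypotheses = ["one two", "one too", "won two"]
audio_loglikes = np.array([-12.0, -11.5, -14.0])
visual_loglikes = np.array([-20.0, -23.0, -21.0])

# Late integration: rescore the n-best list with a weighted linear
# combination of stream log-likelihoods, then pick the best hypothesis.
alpha = 0.7  # reliability weight for the audio stream
combined = alpha * audio_loglikes + (1 - alpha) * visual_loglikes
best = hypotheses[int(np.argmax(combined))]
print(best)  # one two
```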
As mentioned previously, in lipreading all the visual streams should be
synchronous, as the problem is constrained to one modality rather than two as in
AVASR. Integrating the streams according to the EI strategy is therefore best
suited to lipreading. The following describes the multi-stream HMM, with special
reference to the synchronous case, as used in this thesis.
Multi-Stream HMMs
Multi-stream HMMs use separate independently trained HMMs and combine
them into a single HMM in such a way that one stream may have some temporal
dependence on the other during decoding, without the disadvantage of training
both sequences together. Multi-stream HMMs can be used to model the MI inte-
gration strategy as they provide relative independence between streams statically
with a loose temporal dependence dynamically. There are two main ways to
build a multi-stream HMM: synchronously or asynchronously. Although the
asynchronous multi-stream HMM is useful for AVASR applications, the
synchronous multi-stream HMM (SMSHMM) is of more benefit to lipreading due
to its synchronous nature. Even though the SMSHMM is more complicated than
its single-stream cousin, it can be implemented as a similarly structured joint
HMM. When decoding an SMSHMM, state transitions must occur synchronously
between HMM streams. A necessary condition of this type of multi-stream HMM
is for all HMM streams to have the same number of states N . The i’th initial
state distribution πi, observation emission likelihood bj(ot) and the transition
probability ai,j for the joint SMSHMM can be expressed in terms of
π_i = (π_i^{I})^{α^{I}} (π_i^{II})^{α^{II}} · · · (π_i^{M})^{α^{M}}                         (3.33)

b_j(o_t) = (b_j^{I}(o_t^{I}))^{α^{I}} (b_j^{II}(o_t^{II}))^{α^{II}} · · · (b_j^{M}(o_t^{M}))^{α^{M}},   1 ≤ j ≤ N   (3.34)

a_{i,j} = (a_{i,j}^{I})^{α^{I}} (a_{i,j}^{II})^{α^{II}} · · · (a_{i,j}^{M})^{α^{M}},   1 ≤ i, j ≤ N   (3.35)
where {I}, {II}, . . . , {M} refer to the respective streams³. The weighting factor
α^{s} is an exponential weight reflecting the usefulness or reliability of the
respective stream; each weight is constrained to lie between zero and one, with
∑_{s=I}^{M} α^{s} = 1. Once the new emission likelihoods and transition probabilities have
been found, decoding can take place using the Viterbi algorithm [146] to obtain
an estimate of p(O_C | λ_i).
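In the log domain, the combined emission likelihood of Equation 3.34 is simply a weighted sum of the per-stream log-likelihoods; the following minimal sketch (with illustrative values) shows this computation for one state and frame:

```python
import numpy as np

def smshmm_log_emission(stream_loglikes, alphas):
    """Combined SMSHMM emission log-likelihood for one state and frame,
    i.e. the log of Eq. 3.34: log b_j(o_t) = sum_s alpha_s * log b_j^s(o_t^s)."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.isclose(alphas.sum(), 1.0), "stream weights must sum to one"
    return float(np.dot(alphas, np.asarray(stream_loglikes, dtype=float)))

# Two visual streams: the more reliable stream is weighted 0.7.
print(round(smshmm_log_emission([-4.0, -10.0], [0.7, 0.3]), 6))  # -5.8
```

Because the weights act as exponents on the likelihoods, a stream given a small α has correspondingly little influence on the state scores used during decoding.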
The SMSHMM has been considered in audio-only and visual-only speech
recognition, where the given static features as well as their first and second or-
der derivatives have been assigned their own stream [102, 189]. For AVASR,
³It is worth noting that the number of streams in HTK [189], the HMM decoder used for this thesis, is restricted to four.
researchers have used the SMSHMM as a two-stream HMM, with one stream for
the audio and the other for the visual [47, 85, 107, 126, 135]. In this thesis, a novel
extension to the SMSHMM will be investigated by using different visual features
from different poses, as well as different parts of the mouth region for each stream.
Full description of these experiments and results are given in Chapters 5 and 6.
It is worth noting that Dean et al. [41] have recently proposed the use of the
fused HMM, which is similar to the SMSHMM. In this work, they use just the
most reliable stream (i.e. the audio) to train the various streams of the HMMs
to obtain the best possible state sequence. Decoding is then performed on the
combined streams. This presents another option for integrating the various visual
streams in a lipreading system.
3.5 HMM Parameters Used in Thesis
A HMM can be employed to represent a sub-word unit like a phoneme (or viseme)
or tri-phone, a word, or a sentence. For large vocabulary tasks, tri-phone models
are used, as the number of word models needed becomes very large and the
training set is typically not large enough to build good models for all word classes
of the vocabulary. However, this was not a concern for this thesis, as the task
of connected word recognition was performed⁴. As such, all word HMMs were
modeled using 9 states in a left-to-right topology, with 7 Gaussian mixtures per
state using the hidden Markov model toolkit HTK [189]. This HMM configura-
tion was used as experimental and heuristic evidence showed that this was the
optimal configuration. A silence and short-pause model were also employed. All
models were bootstrapped from a segmentation of the parallel audio channel,
obtained by an audio-only HMM with identical topology. The audio-only HMMs
were trained on 39-dimensional acoustic features which were extracted to rep-
resent the acoustic signal at the rate of 100 Hz. These were perceptual linear
prediction (PLP) based cepstral features, obtained using a 25 ms Hamming win-
dow, and augmented by their first and second derivatives. The HTK toolkit was
utilized for all training and testing. These HMMs were designed to recognise
⁴Connected speech recognition refers to connecting whole-word HMMs together in sequence, compared to continuous speech recognition, which connects sub-word HMMs [189].
the connected-digit sequences (ten-word vocabulary with no grammar), and they
were based on single-stream HMMs of a variable dimension (dimensionality of
the visual features is described in Chapter 5.4).
Using HTK, each word HMM was built using HInit and HRest, tools which
estimate the initial parameters and re-estimate them using the Baum-Welch
algorithm, respectively. The initial soft boundaries for this process
were obtained from the hand-labelled time transcriptions supplied with the
databases (descriptions of the databases are given in the next section). This
process is regarded as the bootstrap operation, and it is extremely good at
determining the isolated word models. As connected words were to be recognised
for this thesis, the additional HTK tool HERest performed embedded training.
The embedded training uses the same Baum-Welch procedure as for the isolated
case but, rather than training each model individually, all models are trained in
parallel, using the full transcriptions [189]. In the recognition phase, the Viterbi
algorithm is then used to decode the likely state sequence and the words. In
HTK, this was performed using the HVite command.
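The decoding step that HVite performs internally is the Viterbi algorithm; the following self-contained sketch (a toy two-state model, not the HTK implementation) illustrates it in log-space:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most-likely HMM state path (the decoding step performed by HVite).
    log_pi: (N,) initial state log-probabilities.
    log_A:  (N, N) transition log-probabilities, log_A[i, j] = log P(j | i).
    log_B:  (T, N) per-frame emission log-likelihoods."""
    T, N = log_B.shape
    delta = log_pi + log_B[0]          # best score ending in each state
    psi = np.zeros((T, N), dtype=int)  # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A      # (from-state, to-state)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())

# Tiny 2-state left-to-right example: emissions favour state 0 for the
# first two frames and state 1 for the last frame.
log_pi = np.log([0.99, 0.01])
log_A = np.log([[0.5, 0.5], [0.001, 0.999]])
log_B = np.array([[0.0, -10.0], [0.0, -10.0], [-10.0, 0.0]])
path, score = viterbi(log_pi, log_A, log_B)
print(path)  # [0, 0, 1]
```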
3.5.1 Measuring Lipreading Performance
As mentioned previously, in this thesis all lipreading experiments conducted were
for the small-vocabulary task of connected-digit recognition. Lipreading results
were reported as a word-error-rate (WER) percentage, calculated by

WER = (1 − (N − D − S − I)/N) × 100%                     (3.36)
where N is the total number of actual words, D is the number of words deleted,
S is the number of words substituted and I is the number of incorrectly inserted
words.
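In practice, the counts D, S and I in Equation 3.36 are obtained by aligning the recognised word string against the reference with a Levenshtein (minimum edit distance) alignment; the following sketch computes the WER directly from such an alignment (the example strings are illustrative):

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein alignment, matching Eq. 3.36:
    WER = (D + S + I) / N * 100, with N = len(reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edit cost between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("one two three four", "one too three"))  # 50.0: 1 sub + 1 del, N = 4
```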
3.6 Current Audio-Visual Databases
A major restriction to the progress of lipreading, and AVASR in general, has
been the limited availability to researchers of large audio-visual databases
containing many speakers across numerous environmental conditions with respect
to both the audio and video modalities. This can be attributed to the large cost
associated with collecting, storing and distributing audio-visual data. For
example, the storage requirement for capturing video at the full size, frame rate,
and high image quality is enormous, making the widespread database distribution
a non-trivial task [142]. Even in cases where a large audio-visual database
with many speakers does exist, it has often been developed by a corporation,
which limits its accessibility to the research community due to
proprietary issues. Consequently, most existing databases appear to have been
produced with a specific application in mind by a small group of researchers with
limited resources, rather than to address the many sources of variability
associated with audio-visual speech. However, as the capture and storage of
audio-visual data become more affordable, more databases are becoming available
which investigate these variabilities. In this section, a review of
the currently available audio-visual corpora is conducted. As part of this review,
particular emphasis will be placed on the two databases which contain multi-view
visual speech data that were used in this thesis.
3.6.1 Review of Audio-Visual Databases
Coinciding with the first ever AVASR system, Petajan [132] collected a database
consisting of a single subject uttering 2-10 repetitions of 100 isolated English
words. Since then, similar single subject databases have been developed for re-
searchers conducting experiments of limited size for different languages [1, 27, 40,
60, 68, 150, 164, 165, 172, 175]. In addition to these single subject databases,
many multiple speaker databases have been collected over the past decade. Due
to the cost of capturing and storing video data, these have only been concerned
with small vocabulary tasks, such as isolated or connected digit, letter or word
recognition. One of the first was the Tulips 1 database [122], which contains
recordings of 12 subjects uttering digits “one” to “four”. A 10 subject isolated
letter dataset for English has also been collected which has been used by Matthews
et al. [113] and Cox et al. [37]. Chen [22] collected a 10-subject isolated word
database with a 78-word vocabulary at the AMP laboratory at Carnegie Mellon
University. The AMP/CMU database has been freely
available for researchers to use and as such, has been extensively used in literature
[22, 29, 75, 125, 191].
With the improvement in computer technology and the reduction in cost of
video capturing and storage devices in recent years, the multi-speaker databases
have been extended to include many more speakers (i.e. > 35 speakers compared
to ≤ 10). However, most of these databases are still concerned with small
vocabulary tasks. One of the first of these databases was the AT&T [135] database,
which is a 49 subject database based on connected letters. The University of Illi-
nois at Urbana-Champaign has also collected a 100 subject database for the task
of connected digit recognition [28]. Due to its availability to all researchers, one
of the most popular databases in the late 1990s was the M2VTS database [133].
The M2VTS database consisted of 37 speakers saying the digits “zero” to “nine”
in French. The work involving the M2VTS database was later extended to form
the XM2VTS database [119], which contains 295 subjects. Four sessions of each
subject were taken to account for natural changes in the appearance of the speakers. In
each session, three sequences of audio-visual speech were taken, where two were
of the digits “0” to “9” in different order. The third sequence was the utterance
“Joe took father’s green shoe bench out”, which was designed to maximise visible
articulatory movements. The XM2VTS database is currently the largest publicly
available audio-visual database in terms of speakers but is still only constrained
to small vocabulary tasks. However, due to its size it is rather expensive to
purchase.
The recently released VidTIMIT database [159] consists of
43 speakers reciting 10 TIMIT sentences each, and has been used for multi-modal
speaker verification [160]. VidTIMIT was collected in an effort to provide a
more phonetically balanced dataset. Motivated by this work, Saenko collected
the AVTIMIT database, which consisted of 223 speakers who spoke 20 different
TIMIT sentences [156]. To cater for Australian English, Goecke and Millar [57]
collected the AVOZES audio-visual data corpus which consisted of 20 speakers
uttering phonemes and visemes of Australian English. AVOZES was also the first
audio-video database with stereo-video recordings, which enables 3D coordinates
of facial features to be recovered accurately [57].
Recently, in-car audio-visual data has been collected as part of the AVICAR
database [93]. The AVICAR database is the first publicly available audio-visual
corpus of in-car data, and constitutes the first publicly available database which
resembles noisy or “real-world” conditions. It consists of 100 speakers (50 male
and 50 female), captured under five different noise conditions: idling, driving at
35 mph with windows open and closed, and driving at 55 mph with windows open
and closed. The speech data consists of isolated digits, isolated letters, phone
numbers and sentences, all in English. The AVICAR database is a multi-sensory
database, consisting of an eight-microphone array to collect the audio signal and
four video cameras on the dashboard (all with the speaker in frontal pose).
To date, there is only one database which exists for large vocabulary tasks.
This database is the IBM Via Voice audio-visual database [127] and it consists of
290 fully frontal subjects uttering continuous speech with mostly verbalised punc-
tuation, dictation style [142]. The duration of the database is approximately 50
hours and it contains 24325 transcribed utterances with a 10403 word vocabulary.
In addition to this, a 50 subject connected digit database was collected that con-
tained 6689 utterances which relates to approximately 10 hours of speech data.
The Via Voice database was recorded in a quiet, studio-like environment, with
uniform lighting and background. In an effort to capture visual data in more
challenging environments, additional office data was captured using 109 subjects
and in-car data was also captured using 87 subjects, consisting of 6295 and 1485
utterances respectively [141]. These additional databases were of connected digits
only, in the same format as the digits portion of the Via Voice database. However,
due to commercial constraints this database is not available publicly.
The databases presented above were all of a speaker’s frontal face. As the
motivation for this thesis was to determine the effect that pose variation had
on lipreading performance, datasets which facilitated this type of variation had
to be used. Fortunately, the IBM smart-room [138], CUAVE [130] and DAVID
[25] databases all have such pose variation. For this thesis, the IBM smart-room
and CUAVE databases were used and are described in detail in the following
subsections. The DAVID database was not used, however, as it was relatively
small in size and was produced with the task of audio-visual speaker recognition
in mind. It is also worth noting that smaller datasets such as the “CMU Audio-
Visual Profile and Frontal View” database [91] and the profile speech database
collected by Yoshi et al. [188] have recently been collected, which would also be
useful for such tasks.
3.6.2 IBM Smart-Room Database
With the ever-decreasing cost associated with collecting synchronous audio and
video data, it is becoming viable to collect data of meeting or lecture events inside
a smart room [52, 131] equipped with a number of far-field audio-visual sensors,
including microphone arrays and fixed and pan-tilt-zoom (PTZ) cameras. This
scenario is of central interest in the “Computers in the Human Interaction Loop”
(CHIL) integrated project currently funded by the European Union [26]. A
schematic diagram of one of the smart rooms developed for this project, in
particular the one located at IBM Research, from which the IBM smart-room
database takes its name, is depicted in Figure 3.3⁵.
Clearly, audio-visual speech technologies, such as speech activity detection,
source separation, and speech recognition, are of prime interest in this scenario,
due to overlapping and noisy speech, typical in multi-person interaction, cap-
tured by far-field microphones. Data from the smart room fixed cameras are of
insufficient quality to be used for this purpose, as they typically capture the par-
ticipants’ faces in low resolution (see also Figure 3.4). On the other hand, video
captured by the PTZ cameras can provide high resolution data, assuming that
successful active camera control is employed, based on tracking the person(s) of
interest [192]. Nevertheless, since the PTZ cameras are fixed in space, they can-
not necessarily obtain frontal views of the speaker. Clearly therefore, lipreading
⁵This work was supported by the European Commission under the integrated project CHIL, Computers in the Human Interaction Loop, contract number 506909.
from non-frontal views is required in this scenario, as well as fusion of multiple
camera views, if available. This scenario is the prime focus for this thesis, and
methods to solve these problems will be looked at throughout this document⁶.

Figure 3.3: The IBM smart room developed for the purpose of the CHIL project.
Notice the fixed and PTZ cameras, as well as the far-field table-top and array
microphones.
A total of 38 subjects uttering connected digit strings have been recorded
inside the IBM smart room, using two microphones and three PTZ cameras.
Of the two microphones, one is head-mounted (close-talking channel – see also
Figure 3.5) and the other is omni-directional, located on a wall close to the
recorded subject (far-field channel). The three PTZ cameras record frontal and
⁶The IBM smart-room database was able to be used as part of this thesis due to the candidate's internship with the IBM T.J. Watson Research Center.
two side views of the subject, and feed a single video channel into a laptop via
a quad-splitter and an S-video-to-DV converter. As a result, two synchronous
audio streams at 22 kHz and three visual streams at 30 Hz with 368×240-pixel
frames are available.

Figure 3.4: Examples of image views captured by the IBM smart room cameras.
In contrast to the four corner cameras (two upper rows), the two PTZ cameras
(lower row) provide closer views of the lecturer, albeit not necessarily frontal
(see also Figure 3.3).
Among these available streams, the far-field audio channel and two video
views, i.e. the frontal and the right profile (the view closest to the
profile pose, see Figure 3.5), were used in this thesis. Unless specified otherwise,
a total of 1661 utterances were used in this thesis, partitioned using a multi-
speaker paradigm into 1247 sequences for training (1 hr 51 min in duration), 250
for testing (23 min), and 164 sequences (15 min) that are allocated to a held-out
set.
Ideally, a speaker-independent paradigm would be used; however, this is very
hard to do due to the small number of speakers (38) in this dataset. In audio-only
speech recognition experiments, an all-in one-out (leave-one-out) type
arrangement can be developed to
overcome this problem, which would result in 38 different train/test sets. This
can also be carried out for the task of lipreading; however, it is much more
difficult, as a different visual front-end has to be developed for each training
set. This requirement is prohibitive due to the amount of time and computation
required to optimise each visual front-end for each set of training data. As
such, the multi-speaker paradigm has been preferred in lipreading and AVASR
experiments, as only one visual front-end is required. In the multi-speaker
paradigm, all speakers in the data set are represented in both the training and
test sets, although different sequences are used for each. This ensures that a
global model is obtained which provides good information about lipreading
performance.

Figure 3.5: Examples of synchronous frontal and profile video frames of four
subjects from the IBM smart-room database.
3.6.3 CUAVE Database
Another audio-visual database which contains speakers talking in non-frontal
poses is the Clemson University Audio-Visual Experiments or CUAVE database
[130]. The main motivation behind the creation of the CUAVE database was
to create a flexible, realistic and easily distributable database that allows for
representative and fairly comprehensive testing [130]. The CUAVE database
consists of two sections, with the first being the individual and the second being
the group section (see Figure 3.6). The individual section is designed to give
realistic conditions such as speaker movement, whilst the group section is included
to look at pairs of simultaneous speakers, which is the first data of its kind. As
the scope of this thesis is constrained to lipreading of a single speaker across
different poses, only the individual section was used as part of this thesis⁷.

Figure 3.6: Examples of sequences from the CUAVE database, which consists of
36 individual speakers and 20 group speakers. The top line gives examples of the
individual sequences, whilst the bottom gives examples of the group speaker
sequences.
The individual section of the CUAVE database was broken into two parts: the
first for isolated digits and the second for connected digits. As no profile
data was included in the connected-digits section, only the isolated-digits portion
was used. Each isolated-digits sequence was broken into the following four tasks:
1. Normal, where each speaker spoke 50 digits whilst standing still naturally,
2. Moving, where each speaker was asked to move side-to-side, back-and-forth,
or tilt the head while speaking 30 digits,
3. Right profile, where each speaker utters 10 digits in the right profile pose,
and
4. Left profile, where each speaker utters 10 digits in the left profile pose.
Examples of these tasks are given in Figure 3.7. In addition to performing
experiments on these four individual tasks, experiments on the combination of
all the tasks were also undertaken (see Chapter 8.4). As such, continuous video
data across all these tasks was required (i.e. the speaker in shot at all times).
Unfortunately, only 33 of the 36 speakers could be used for this task. As only
one sequence was available per speaker, a multi-speaker paradigm could not be
used for these experiments. As such, a quasi speaker-independent paradigm was
⁷Special mention must go to Clemson University for freely supplying their CUAVE database for this work.
used, which consisted of 10 different train/test sets, each with 25 speakers for
training and 8 speakers for testing. This is termed quasi speaker-independent
because it is not a fully speaker-independent task, as only one visual front-end
was developed.

Figure 3.7: Examples of the CUAVE individual sequences. The top three rows
give examples of the speaker rotating from left profile to right profile. The
bottom three rows give examples of the speaker moving whilst in the frontal pose.
3.7 Chapter Summary
In this chapter the topic of classifying visual speech was broached. The theory
behind the HMM was documented in detail, with the training and decoding
of the models described via the Baum-Welch and Viterbi algorithms
respectively. Both the feature and decision fusion strategies for combining
various streams were then analysed. Feature fusion methods were described
as the easier to implement, as they are simply a concatenation of the various
streams of data, although they cannot model the usefulness or reliability of the
various streams. In contrast, this can be done using decision-based methods.
Even though many decision-based methods exist, it was found that
the synchronous multi-stream HMM (SMSHMM) was the fusion technique best
suited to combining the various visual streams. The details of the HMM
parameters used in this thesis were then given, along with a brief mention of the
commands used to train and decode the HMMs using HTK [189].
This chapter concluded with a relatively thorough review of the audio-visual
databases currently available. From this review, it was found that, due to the
cost of capturing, storing and distributing audio-visual databases, nearly all
available corpora are restricted to fully frontal data for a small number of
speakers on small vocabulary tasks such as connected digit recognition. However,
as the cost of capturing and storing such databases falls, collecting audio-visual
data with various visual variabilities is not as major an issue as it once was. An
indication of this comes in the form of recently developed databases such as the
IBM smart-room and CUAVE databases, which contain variability with respect
to the speaker's pose. As this type of variability in the visual domain is the focus
of this thesis, these two databases were described as a prelude to the experiments
conducted in this thesis.
Chapter 4
Visual Front-End
4.1 Introduction
For a lipreading system to be of use, it has to be able to locate and track the
visible articulators which cause human speech. It is widely agreed upon that the
majority of these visible articulators emanate from the region around a speaker’s
mouth, otherwise known as the region-of-interest (ROI) [92]. The visual front-end
is responsible for locating, tracking and normalising a speaker’s ROI and can be
considered the most important module of a lipreading system. The reason for this
is that if the visual front-end does not accurately locate and track the speaker’s
ROI, then this error will filter throughout the system and cause erroneous results.
This effect is known as the front-end effect. There are many different factors which
can heighten this phenomenon such as pose, occlusion and illumination. In this
chapter, all the various aspects of a visual front-end for a lipreading system are
reviewed. As part of this review, a survey of current algorithms which can be used
as part of the visual front-end are examined and the algorithm chosen for this
thesis is fully described. This algorithm is known as the Viola-Jones algorithm
which is based on a boosted cascade of simple classifiers. The chapter concludes
by evaluating the implemented visual front-end on frontal pose data.
Before proceeding, it would be prudent to give a high-level description of the
visual front-end, as there is some conflict in the literature as to what it actually
constitutes. Some researchers [142] consider the visual front-end to consist of
locating and tracking the ROI and then extracting features from the ROI. For
this thesis, however, the visual front-end refers to just locating and tracking the
ROI. A depiction of this process is shown in Figure 4.1.

Figure 4.1: Block diagram of a visual front-end for a lipreading system. It is
essentially a three-step process: step 1 is face localisation, step 2 consists of
locating the mouth ROI, and step 3 is tracking the ROI over the video sequence.

As can be seen in
this figure, the visual front-end is essentially a three-phase, hierarchical process,
starting with locating a speaker’s face. Once the face has been located, facial
features such as the eyes, nose and mouth corners are then located. Based on the
positions of these facial features, the ROI is then defined. Once the ROI has been
found, tracking can be performed and the ROI can be extracted from each video
frame. Ideally, this process would be conducted on each video frame to give the
most accurate and up-to-date positions but, depending on the visual front-end
algorithm, this can prove too computationally expensive to perform. All
these issues are discussed in this chapter.
4.2 Front-End Effect
For the task of lipreading, the front-end effect can be formally defined as the
dependence of the lipreading system on having the ROI successfully located. The
impact of the front-end effect is best illustrated in Figure 4.2. From this figure
it can be seen that, if the ROI is located poorly, this noisy or corrupt input will
cascade throughout the system, which will most likely recognise the visual speech
incorrectly.

Figure 4.2: Depiction of the cascading front-end effect.

This effect can be expressed mathematically as
ηo = ηd × ηc (4.1)
where ηd is the probability that the ROI has been successfully located, ηc is the
probability that a correct decision is made given the ROI has been successfully
located and ηo is the overall probability that the system will recognise the correct
speech. Inspecting Equation 4.1, it can be seen that the performance of the
overall classification process ηo can be severely affected by the performance ηd of
the visual front-end.
In an ideal scenario, ηd = 1, so that more effort can be concentrated on im-
proving the performance of ηc, thus improving the overall lipreading performance.
A very simple way to ensure ηd approaches unity is through manual labeling of
the ROI. Unfortunately, due to the amount of visual data that needs to be dealt
with in a lipreading application, manual labeling is not a valid option. The
requirement for manually labeling the ROI also brings the purpose of any
lipreading system into question, due to the need for human supervision. With
these thoughts in mind, an integral part of any lipreading application is the
ability to make ηd approach unity via a highly accurate visual front-end.
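The multiplicative nature of Equation 4.1 can be illustrated numerically (the probabilities below are purely illustrative):

```python
def overall_probability(eta_d, eta_c):
    """Front-end effect (Eq. 4.1): the overall probability of correct
    recognition is the product of ROI-localisation success (eta_d) and
    classification success given a well-located ROI (eta_c)."""
    return eta_d * eta_c

# A strong classifier (90% correct on well-located ROIs) is dragged
# down by a front-end that only localises the ROI 80% of the time.
print(round(overall_probability(0.80, 0.90), 2))  # 0.72
print(round(overall_probability(1.00, 0.90), 2))  # 0.9
```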
4.3 Visual Front-End Challenges
Unfortunately, getting ηd towards unity is very difficult to achieve due to the
many variants that a visual front-end has to encounter. According to the survey
conducted by Yang et al. [186], the challenges associated with the visual front-end
can be attributed to the following six factors:
• Pose: The images of a face vary due to the relative camera-face pose
(frontal, profile etc.), and some facial features such as an eye or the nose
may become partially or fully occluded.
• Presence or absence of structural components: Facial features such
as beards, moustaches, and glasses may or may not be present and there is
a great deal of variability among these components, including shape, colour
and size.
• Facial expression: The appearance of a person’s face can vary due to
their expression (happy, sad, angry etc.).
• Occlusion: Faces may be partially occluded by other objects. In an image
with a group of people, some faces may partially occlude other faces.
• Image orientation: Face images directly vary for different rotations about
the camera’s optical axis.
• Imaging conditions: When the image is formed factors such as lighting
and camera characteristics affect the appearance of a face.
As the factors listed above show, the task of the visual front-end is quite complex.
As this is the case, some of the work conducted in lipreading has neglected the
visual front-end by manually locating the ROI, or has artificially located the ROI
via chroma-key methods [142]. Most of the work has focussed on data which has
been captured in ideal laboratory conditions. Almost all of the work has neglected
the variations of pose and orientation. The lack of work in these areas has stymied
the full deployment of a lipreading system, as the visual front-end cannot deal
with a whole array of difficult conditions.
However, as mentioned in the first chapter, this thesis is focussed on reme-
dying this situation by attempting to overcome the problems of pose as well as
normalizing for the various speaker structural components and image conditions.
As such, there is only one restriction placed on this work. That is:
• there is only one speaker in each video sequence and he/she is present
during the entire sequence
As there is only one speaker in shot in any video sequence, this task is referred to
as “localisation”, as only the position of the face and subsequent facial features has to be
found [186]. This is in contrast to the term “detection”, which refers to the much
more difficult task of first determining how many faces are in a video sequence
and then determining the location of the faces and their facial features [186].
As shown in the survey by Yang et al., there are over 150 published articles in
the well-established field of face and facial feature localisation/detection. Unfortunately,
from all this research there is still no single technique that works best in
all circumstances. In the next section, a brief review of the most popular visual
front-ends is given.
4.4 Brief Review of Visual Front-Ends
Yang et al. [186] categorize locating a person’s face and facial features into four
broad groups. These being:
1. Knowledge-based methods: These rule-based methods encode human
knowledge of what constitutes a typical face. Usually, the rules capture the
relationships between facial features. An example of such a method is the
multiresolution rule-based method [185].
2. Feature invariant approaches: The aim in this approach is to find struc-
tural features that exist even when the viewpoint, illumination or pose of
the person varies. Such features include colour, texture and edge informa-
tion. A variety of shape models can be employed from very simple expert
geometric models [190], to snakes [27], B-splines [148] or point distribution
models [34].
3. Template matching methods: These methods share characteristics of
the feature invariant and rigid template paradigms. In a similar fashion to
the rigid template approach, the geometric form and intensity information
of the object are dependent on each other. In this approach, however, the
template used to evaluate the intensity information of an object is non-
rigid. The intensity model is evaluated by gaining a cost function of how
similar the intensity values around or within the template are to the in-
tensity model describing the object. Unlike the rigid template approach,
an exhaustive search of the image using a deformable template is compu-
tationally intractable due to the exponential increase in the search caused
by the template being allowed to vary in both shape and position. Such
an operation can be made computationally tractable by employing quicker
minimisation techniques such as steepest descent [89, 190], downhill simplex
[89, 107] and genetic algorithms [33].
4. Appearance-based methods: In contrast to template matching, the
models (or templates) are learnt from a set of training images which should
capture the representative variability of facial appearance. These learned
models are then used for detection. These methods are designed mainly
for face detection (e.g. the eigenface [176], the distribution method [170], neural
networks [154], SVMs [98], the naive Bayes classifier [163], hidden Markov models
[124] and information-theoretic approaches [31, 95]).
The choice of visual front-end is dependent on the type of application it is being
used for and the conditions under which the video was captured. In lipreading
literature, appearance based approaches have been widely used achieving good
results [96, 121, 154, 184]. A major reason for this is that they are well suited
to many different objects (face, eyes, nose, mouth corners etc.), under many dif-
ferent conditions due to their probabilistic nature. These techniques are good at
finding a crude ROI which is all that is required for the appearance-based visual
feature extraction process (see Chapter 5.2.1). Feature invariant approaches have
been widely used for lip contour localisation and tracking. Under this approach,
methods based on colour [20, 99, 148, 174], edges [149] as well as localised tex-
ture [99] have been used to gain a geometric model of the lips. However, these
approaches require extremely precise localisation of lip features and are highly
susceptible to errors in conditions of poor illumination and speaker movement.
The template matching method was first applied by Yuille et al. [190] for
mouth and eye localisation using expert based appearance and shape models. In
this approach an expert deformable template of the eyes and labial contour is
fitted to an intensity model, by calculating a cost function based on the grayscale
intensity edges, valleys and peaks around the template boundary. The search
strategy uses the steepest descent algorithm to fit the template. Unfortunately,
due to the heuristic nature of the shape and intensity models, the approach
has poor performance when applied across a large number of subjects. Cootes et
al. [34] devised a similar technique for building a deformable template incorporat-
ing texture and shape models through exemplar learning. The technique used a
deformable template known as an active shape model (ASM). The ASM was able
to statistically learn allowable variations in shape of an object from pre-labeled
object shapes in a point distribution model (PDM) [33, 34]. Intensity informa-
tion about the object was also statistically learnt. In this approach a number
of grayscale profile vectors were extracted normal to set points around the de-
formable template. All these vectors were concatenated into a matrix known as
global profile vectors from which variations in intensity were statistically modeled
as a grey level profile distribution model (GLDM) [33, 115]. Luettin [107] ap-
plied ASMs to lip contour localisation, using the downhill simplex minimisation
technique to fit the lip shape model described by a PDM to an image containing
a mouth.
Matthews et al. [115] used another type of statistically learnt deformable
template approach to fit a lip shape model to an image containing a mouth. This
type of deformable template is referred to as an active appearance model (AAM)
and was first developed in [33]. This approach, similar in many respects to ASMs,
uses a PDM to statistically learn the shape variations of the object. The inten-
sity model for the object is learnt by warping the intensity information contained
within the deformable template back to the mean shape position. This warped
intensity information is then used to statistically model the distribution of inten-
sity values of an object whose shape has been normalised. The statistical nature
of the intensity model allows AAMs to be used for detection as well as location
purposes. AAMs have been applied to the task of lip contour detection/location,
using a genetic algorithm for minimisation [115]. ASMs and AAMs have been
used with much success in whole facial feature location, where an entire model
of the face (i.e. including eyes, lips, nose and jawline) is located. Unfortunately,
the minimisation techniques required to fit ASMs and AAMs are highly
sensitive to initialisation and do not guarantee convergence to an acceptable minimum.
Although deformable template approaches, namely ASMs and AAMs, have
been shown in the literature to be useful for face/eye detection and mouth location/tracking
in lipreading applications [107, 108, 115], the problems associated with
searching for a minimum make detection/location performance largely unreliable,
and these approaches require a massive amount of annotated training data. This was highlighted in
the large-vocabulary experiments conducted by Matthews et al. [116], where this
was suspected to be the case.
All these methods just presented have assumed a single camera. Recently
however, Goecke [56] has presented a novel real-time lip localisation and track-
ing algorithm based on video data captured on a stereo camera using colour
information and prior knowledge of the mouth area. By using stereo vision in a
calibrated camera system, the 3D coordinates of object points could be recovered
which enabled speakers to act normally and move freely within a constrained
environment. This approach is extremely attractive as it lends itself to tracking
of a speaker’s mouth ROI across multiple views. A caveat on this however, is
that the video must be captured via a stereo camera which greatly limits the use
of this approach.
As mentioned previously, no one visual front-end has shown itself to be superior
to the others for the task of lipreading. This may be because they are not
robust across different conditions, are only useful for certain speakers, or are
too computationally expensive to run in real-time. Recently however, Viola
and Jones [180] introduced an algorithm based on a boosted cascade of simple
classifiers. Through this novel technique, they were able to obtain extremely
high accuracy in real-time. Seeing that this framework is extremely quick and
generic, it is amenable to having multiple visual front-ends running in parallel
on any type of visual data, which allows multiple-pose face and facial feature
localisation to take place [82]. Even though there are a small number of visual
front-ends which can handle non-frontal data [97, 155, 163], the Viola-Jones algo-
rithm provides a framework that can localise faces and facial features regardless
of pose and in real-time. For these reasons, this particular method was chosen as
the visual front-end for use in this thesis, as it satisfied all the requirements of
the thesis objectives. The Viola-Jones algorithm is described in the next
section.
4.5 Viola-Jones algorithm
In 2001, Viola and Jones [180] proposed a rapid object detection scheme based
on a boosted cascade of simple “haar-like” features. Since then, this work has
revolutionised the field of computer vision, as it has provided an object detection/localisation
framework that is extremely quick and accurate, which is imperative
for real-time tasks. As it is based on a set of training examples, it can be
used for any object detection/localisation task. This is especially beneficial for
the case of lipreading, where extremely quick face and facial feature localisation
is required, as well as robustness to the variations associated with pose and
environmental conditions. This section is devoted to giving a brief description
of the algorithm and how it is applicable to a lipreading visual front-end. The
Viola-Jones algorithm is essentially a three-step process. Each of the steps
listed below is described in the following subsections:
1. Feature representation of images using “haar-like” features;
2. Selecting “weak” classification functions using a learning algorithm known
as “boosting”; and
3. Cascading the “weak” classifiers into a final “strong” classifier.
4.5.1 Features
The Viola-Jones algorithm employs a feature representation of the images instead
of pixels. This is done for two reasons. Firstly, it is much quicker than using pixels
and secondly, the features encode knowledge within the image which is difficult
to learn using a finite amount of training data [180]. The feature representation
used in this algorithm are termed “haar-like” because they are similar to the over-
complete Haar basis functions used by Papageorgiou et al. [129]. The original
(a) Original set of haar-like features
(b) Extended set of haar-like features
Figure 4.3: Comparison of the feature sets used by: (a) Viola and Jones with the original 4 haar-like features; and (b) Lienhart and Maydt with their extended set of 14 haar-like features, including their rotated features. It is worth noting that the diagonal line feature in (a) is not utilised in (b).
set of four features was later extended by Lienhart and Maydt [94] to fourteen,
by introducing features which were rotated by 45°. The motivation behind using
these extended features was that they add additional domain knowledge to the
learning framework which is otherwise hard to learn. Lienhart and Maydt showed
that improved performance is achieved with this set of extended features, with an
average 10% reduction in the false-alarm rate at a given hit rate. A comparison
of these two feature sets is given in Figure 4.3. The value of these haar-like
features is calculated by subtracting the sum of the pixels within the white rectangles
from the sum of the pixels in the black rectangles.
If the object of interest, say a face, was 16 × 16 pixels within an image, the
number of features derived could be well over 100 000 for that face. This is
because the features shown in Figure 4.3(b) are found by sliding over the face
at different locations and scales in both the x and y directions. These features
can however be computed extremely rapidly using the integral image [180]. The
upright integral image at location (x, y) contains the sum of the pixels above and
to the left of (x, y) inclusive:

iiu(x, y) = Σ_{x′≤x, y′≤y} i(x′, y′)    (4.2)

where iiu(x, y) is the integral image and i(x, y) is the original image. The integral
image for the upright rectangle features can be computed in one pass over the
original image using
iiu(x, y) = iiu(x, y − 1) + iiu(x − 1, y) + i(x, y) − iiu(x − 1, y − 1)    (4.3)
where iiu(−1, y) = 0 and iiu(x,−1) = 0. Figure 4.4 shows how the integral image
can be used to determine the rectangular sum using four point references. Given
that: the value at point 1 is the sum of the pixels in A; point 2 is the sum of
the pixels within A + B; point 3 is the sum of the pixels within A + C; and
point 4 is the value of the pixels within A + B + C + D; the sum of the pixels
in rectangle D can be computed as 4 + 1 − (2 + 3). Since the two-rectangle features
defined above involve adjacent rectangular sums, they can be computed in six
array references, eight in the case of the three-rectangle features, and nine for
four-rectangle features. All these values, once computed, are stored in a look-up
table and can be accessed to calculate the features.
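As a concrete sketch of these ideas (not the thesis implementation; all names are illustrative), the integral image of Equation 4.2, the four-reference rectangle sum of Figure 4.4, and the value of a horizontal two-rectangle haar-like feature can be computed as follows:

```python
import numpy as np

def integral_image(img):
    """Upright integral image (Eq. 4.2): entry (y, x) holds the sum of all
    pixels above and to the left of (x, y), inclusive."""
    return np.cumsum(np.cumsum(np.asarray(img, dtype=np.int64), axis=0), axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum over the w x h rectangle with top-left pixel (x, y), using the
    four array references of Figure 4.4: D = 4 + 1 - (2 + 3)."""
    def at(r, c):                 # references outside the image count as zero
        return int(ii[r, c]) if r >= 0 and c >= 0 else 0
    return (at(y + h - 1, x + w - 1) + at(y - 1, x - 1)
            - at(y - 1, x + w - 1) - at(y + h - 1, x - 1))

def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle haar-like feature: the sum over one
    w x h half minus the sum over the adjacent half."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)
```

For any rectangle, `rect_sum` needs only four look-ups regardless of the rectangle's size, which is what makes exhaustive feature evaluation tractable.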
The integral image can also be computed easily for the rotated features. At a
given point in the image, the sum of the pixels of a 45° rotated rectangle, with its
bottom-most corner at the given point and extending upwards to the boundaries
of the image, is calculated in the same manner as in Equation 4.2. The rotated integral
image, iir(x, y), can also be calculated in one pass from left to right and top to
bottom over all pixels by:
Figure 4.4: Example of how the integral image can be used for computing upright rectangular features.
Figure 4.5: Example of how the rotated integral image can be used for computing rotated features.
iir(x, y) = iir(x − 1, y − 1) + iir(x + 1, y − 1) − iir(x, y − 2) + i(x, y) + i(x, y − 1)    (4.4)

where iir(−1, y), iir(x, −1) and iir(x, −2) = 0. Like the example shown for the
upright case in Figure 4.4, the rotated integral image can also be used to calculate
any rotated sum by four point references as shown in Figure 4.5. In this example,
the area within D can be found exactly the same as the previous example with
D = 4 + 1− (2 + 3).
4.5.2 Classification
As mentioned in the previous subsection, there are over 100 000 features asso-
ciated with an object of size 16 × 16 pixels within an image. Even though the
integral image allows for quick computation of these features, this number is pro-
hibitively large to process. To counter this, Viola and Jones hypothesized that
only a small number of these features were required to successfully detect/locate
the object of interest. They overcame the challenge of selecting which features
to use via “AdaBoost” which was initially proposed by Freund and Schapire
[49]. AdaBoost is a learning algorithm which combines the performance of many
“weak” classifiers to produce a “strong” final classifier which has good gener-
alisation performance [129, 162]. Viola and Jones used a variant of AdaBoost
by constraining the weak classifier to be dependent on a single feature. The
AdaBoost procedure chose this single feature as it was the best in separating
the positive and negative examples in the training dataset (see next section for
description on training). This was done by determining the optimal threshold
classification function for each feature such that the minimum number of exam-
ples are misclassified. A weak classifier hj(x) thus consisted of a feature fj, a
threshold θj and a polarity pj indicating the direction of the inequality sign, so
that
hj(x) = 1 if pj fj(x) < pj θj, and 0 otherwise    (4.5)
where x is a sub-window of the image. Taken from [180], the process for finding
these parameters is as follows:
1. Given n example images (x1, y1), . . . , (xn, yn), where xi is the sub-window
image of the object/background within the entire image and yi = 0, 1 for
negative and positive examples respectively.
2. Initialize weights w1,i = 1/(2m) for yi = 0 and w1,i = 1/(2l) for yi = 1, where m is the
number of negative examples and l is the number of positive examples.
3. For t = 1, . . . , T :
(a) Normalize the weights, wt,i ← wt,i / Σ_{j=1}^{n} wt,j, so that wt is a probability distribution.
(b) For each feature j, train a classifier hj which is restricted to using a
single feature. The error is evaluated with respect to wt: εj = Σi wt,i |hj(xi) − yi|.
(c) Choose the classifier ht with the lowest error εt.

(d) Update the weights:

wt+1,i = wt,i βt^(1−ei)

where ei = 0 if example xi is classified correctly, ei = 1 otherwise, and
βt = εt/(1 − εt).
4. The final strong classifier is:

h(x) = 1 if Σ_{t=1}^{T} αt ht(x) ≥ (1/2) Σ_{t=1}^{T} αt, and 0 otherwise,

where αt = log(1/βt).
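The boosting loop above can be sketched compactly for single-feature threshold stumps. This is an illustrative toy implementation on generic feature vectors (exhaustive threshold search over observed values, with a small epsilon guarding βt when the error reaches zero), not the thesis's training code:

```python
import numpy as np

def train_stump(X, y, w):
    """Exhaustive search for the single-feature threshold stump (Eq. 4.5)
    with minimum weighted error. Returns (error, feature, threshold, polarity)."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for p in (1, -1):
                pred = (p * X[:, j] < p * theta).astype(int)
                err = np.sum(w * np.abs(pred - y))
                if err < best[0]:
                    best = (err, j, theta, p)
    return best

def adaboost(X, y, T):
    """Boosted combination of T single-feature stumps, following steps 1-4."""
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))       # step 2
    stumps, alphas = [], []
    for _ in range(T):
        w = w / w.sum()                                       # step 3(a)
        err, j, theta, p = train_stump(X, y, w)               # steps 3(b)-(c)
        beta = max(err, 1e-10) / max(1.0 - err, 1e-10)
        e = np.abs((p * X[:, j] < p * theta).astype(int) - y)
        w = w * beta ** (1 - e)                               # step 3(d)
        stumps.append((j, theta, p))
        alphas.append(np.log(1.0 / beta))
    def classify(x):                                          # step 4
        score = sum(a for a, (j, th, p) in zip(alphas, stumps)
                    if p * x[j] < p * th)
        return int(score >= 0.5 * sum(alphas))
    return classify
```

In the Viola-Jones setting, each column of `X` would be the response of one haar-like feature over all training sub-windows.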
Viola and Jones gave an example of what the selected features from this
process were for the task of face localisation. In the example they gave, they
mentioned that the first feature selected was across areas of the eyes, nose and
cheeks. It was suggested that this feature was chosen because the eyes, being dark,
contrast with the lighter areas of the cheeks and nose. It was also mentioned that
the first feature was relatively large with respect to the face, and was insensitive
to the size and location of the face. This characteristic highlights the ability of the
features to scale well and adapt to the face being in various locations.
This example is replicated from [180] in Figure 4.6 to highlight this point.
4.5.3 Cascading the Classifiers
Instead of having one “strong” classifier to detect/localise all objects of interest,
Viola and Jones proposed cascading a series of “weak” classifiers,
increasing in complexity at each stage, to dramatically increase the speed
Figure 4.6: Example of the first feature selected by AdaBoost. It has selected the feature across the eye, nose and cheek areas, possibly due to the contrast in colour.
Figure 4.7: Example of a face localiser based on a boosted cascade of 20 simple classifiers. If the hit rate for each classifier is 0.9998 and the false-alarm rate is set to 0.5, then the overall localiser should be able to yield a hit rate of 0.9998^20 = 0.9960 and a false-alarm rate of 0.5^20 = 9.54 × 10^−7.
of the detector/localiser. This is achieved by focussing the attention of the detector/localiser
on the regions of the image which are most likely to contain the
object of interest, which is performed via the cascade. The cascade, which essentially
takes the form of a decision tree, is a simple process where a positive result
from the first classifier on a given sub-window triggers the second classifier, and
so on. A negative outcome at any stage of the cascade leads to the rejection of
that sub-window. An illustration of a 20-stage cascade is shown in Figure 4.7,
for the task of face localisation.
If the false-alarm rate was set to 0.5, then after the first five stages, the
detector/localiser would have eliminated nearly 97% of the non-object windows,
which allows more computation on the areas of the image which may contain the
object. This cascading framework, allows extremely rapid detection/localisation
of objects, as it is very efficient in determining whether a sub-window is possibly
the object of interest or not. Since, in face and facial feature localisation, most sub-windows
are not objects of interest, this particular characteristic bodes well for
the localisation of these objects in every frame. By setting the hit rate high and
the false-alarm rate to a reasonable value, very good performance can be
obtained. In Figure 4.7, a 20-stage boosted classifier is shown. If the hit rate is
0.9998 and the false-alarm rate is set to 0.5, then the overall localiser can obtain
a hit rate of 0.9998^20 = 0.9960 with a false-alarm rate of 0.5^20 = 9.54 × 10^−7.
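The cascade arithmetic quoted above is easy to verify directly, under the stated assumption that a window must pass every stage and that stage decisions are independent:

```python
def cascade_rates(stage_hit, stage_fa, n_stages):
    """Overall hit and false-alarm rates of an n-stage cascade: a window is
    accepted only if every stage accepts it, so each overall rate is the
    per-stage rate raised to the number of stages."""
    return stage_hit ** n_stages, stage_fa ** n_stages

hit, fa = cascade_rates(0.9998, 0.5, 20)
print(f"overall hit rate: {hit:.4f}")          # 0.9960
print(f"overall false-alarm rate: {fa:.2e}")   # 9.54e-07
# Non-object windows surviving the first five stages: 0.5**5 = 3.125%,
# i.e. nearly 97% of background windows are rejected after only five stages.
```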
From this section, it can be seen that the Viola-Jones algorithm gives a frame-
work for which accurate yet quick object detection/localisation can take place.
Probably the key to this algorithm is to choose simple classifiers which can reject
the majority of the sub-windows before more complex classifiers are called into
action. However, it must be noted that these simple classifiers are determined
from the positive and negative examples that are given to them in the training
phase, so it is imperative that an exhaustive set of positive and negative images
is provided so that good generalisation is achieved.
The next section describes this training process of the Viola-Jones algorithm
through the implementation of the visual front-end for the frontal pose. This
visual front-end produced the ROIs which were used as the basis of this thesis.
4.6 Visual Front-End for Frontal View
The visual front-end for the frontal view was implemented and tested on the
frontal view data from the IBM smart-room database (see Chapter 3.6.2). This
visual front-end and the eventual ROI extraction were devised using a similar
hierarchical strategy to that of Cristinacce et al. [38], where the facial feature
localisation was based on the search areas defined by the previously localised feature
points. Both the face and the facial features were localised using the boosted
cascade of classifiers based on the work described in the previous section. The
classifiers were generated using OpenCV libraries [128].
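The coarse-to-fine search that such a hierarchical front-end performs can be sketched generically. The function below is illustrative only: the detector arguments stand in for trained boosted-cascade classifiers (e.g. OpenCV's `CascadeClassifier.detectMultiScale`), and restricting the mouth search to the lower half of the face is an assumption of the sketch rather than the thesis's exact search regions:

```python
import numpy as np

def locate_mouth(gray, detect_faces, detect_mouths):
    """Hierarchical search: localise the face first, then search for the
    mouth centre only within the lower half of the located face.
    `detect_faces` / `detect_mouths` return (x, y, w, h) boxes."""
    for (x, y, w, h) in detect_faces(gray):
        lower_face = gray[y + h // 2 : y + h, x : x + w]
        for (mx, my, mw, mh) in detect_mouths(lower_face):
            # Map the mouth box back into full-image coordinates.
            return (x + mx, y + h // 2 + my, mw, mh)
    return None  # no face (or no mouth within a face) was found
```

Constraining each feature search to a region defined by the previous localisation both speeds up the search and reduces false detections, which is the point of the hierarchical strategy.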
The positive examples used for training these classifiers were obtained from
a set of 847 training images taken from the training set of visual speech utter-
ances, with 17 manually labeled points for each face. As only the ROI was to
Figure 4.8: Points used for facial feature localisation on the face: (a) right eye, (b) left eye, (c) nose, (d) right mouth corner, (e) top mouth, (f) left mouth corner, (g) bottom mouth, (h) mouth center, and (i) chin.
be extracted, it was decided that 9 of the 17 manually labeled points were to be
used to somewhat simplify the process. These points were the: left eye; right eye;
nose; right mouth corner; top mouth; left mouth corner; bottom mouth; center
mouth; and chin; and are depicted in Figure 4.8. This provided 847 positive
examples for all 9 facial features.
The resulting positive examples for the face were further augmented by including
rotations in the image plane by ±5 and ±10 degrees, as well as mirroring the
images, providing 5082 positive examples. As a number of the facial features were
located so close to each other (a matter of pixels in some cases), it was decided
not to include rotated examples of the facial features. The positive examples for
the face were all normalized to 16 × 16 pixels, based on the distance of 6 pixels
between the eyes. Examples of the face templates are shown in Figure 4.9.
The negative face examples consisted of a random collection of approximately
5000 images which did not contain any faces. Some of them were of the back-
ground within the face images, as well as random objects. A small array of these
examples are shown in 4.10. The majority of these images were of a high resolu-
tion in comparison to the face images (around 360× 240 pixels), so that enough
negative sub-windows could be used to train up the classifier adequately. This
was very important, as the Viola-Jones algorithm disregards most of the negative
examples in the first few stages, so it was vital that there was an abundance of
Figure 4.9: Example of the 16 × 16 frontal faces from the IBM smart-room database used for this thesis.
background examples to satisfy this requirement. Having background images of
high resolution was one way of overcoming this.
The eye classifiers were trained using image templates of size 20×20, the nose
and chin using templates of 15 × 15, and the right, top, left and bottom mouth
templates were of size 10× 10. The mouth center templates were of size 24× 24,
and this classifier was used to find a coarse ROI so that further refinement could
take place, hence the larger template size. All these templates were taken from
normalised face images of size 64× 64, based on a distance of 32 pixels between
the eyes. Figure 4.11 gives an example of the facial feature templates used to
train the various classifiers. As the face localisation step reduces the search
space for the facial feature localisation, the negative examples for the various
facial features consisted of images of other facial features. This was done to
alleviate the confusion that might have occurred due to various facial features
Figure 4.10: Example of the negative images used for training of the face classifier.
looking alike, i.e. the mouth open can appear like an eye in various illumination
conditions. Examples of the negative images used to train up the facial feature
classifiers are shown in Figure 4.12.
Due to the lack of manually labeled faces available, all classifiers were tested
on a small validation set of 37 images, which would give an indication of which
particular features would give the best chance of reliably tracking the localised
features. The results are shown in Table 4.1. Normally, the distance between
the eyes (deye) is used as a measure of performance for the task of facial feature
localisation as they have been long regarded as an accurate measure of the scale
of a face [80]. As such, a facial feature was not considered located if the distance
between the estimated position of the feature (p̂f) and its manually annotated position
(pf) was more than 10% of the annotated distance between the eyes (i.e. for a feature
to be deemed located, ‖p̂f − pf‖ < 0.1 × deye) [81].
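This localisation criterion can be written down directly; a minimal sketch with illustrative function and parameter names:

```python
import numpy as np

def feature_located(p_est, p_true, eye_left, eye_right, tol=0.1):
    """A feature counts as located when the estimate lies within 10% of the
    annotated inter-eye distance of its ground-truth position, i.e.
    ||p_est - p_true|| < tol * d_eye."""
    d_eye = np.linalg.norm(np.subtract(eye_right, eye_left))
    return bool(np.linalg.norm(np.subtract(p_est, p_true)) < tol * d_eye)
```

Normalising the error by the inter-eye distance makes the criterion independent of how large the face appears in the image.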
As can be seen from this table, most of the facial features were located at
Figure 4.11: Example of the templates used for the training of the frontal facial features. The ROI shown on the right is an example of the mouth center template.
Facial Feature Accuracy (%)
Right Eye 91.08
Left Eye 89.47
Nose 89.47
Right Mouth 91.08
Top Mouth 81.08
Left Mouth 89.47
Bottom Mouth 83.78
Center Mouth 89.47
Chin 67.57
Table 4.1: Facial feature point detection accuracy results for frontal pose
a high rate, with the exception of the chin and the top and bottom mouth points. As the final
extracted mouth ROI needed to be normalised for scale and rotation to enforce
alignment across all the different ROI images, two geometrically aligned points
had to be found for this to happen. In the literature, eye locations are normally used
for such alignment. However, it was found heuristically that this metric was not
ideal for scaling the mouth, as there is a great deal of variability in mouth shape
and size, which did not appear to be correlated with the distance between the
eyes. As such, it was determined that the left and right mouth corners would
be used as these gave much better reference points for the scale and rotation
normalisation to occur. Upon inspection, the face localisation accuracy on this
Figure 4.12: Example of negative images used for the training of the frontal facial feature classifiers: (a) eyes, (b) nose, (c) mouth region, (d) right mouth, (e) top mouth, (f) left mouth, (g) bottom mouth, (h) chin.
validation set was 100%.
The visual front-end used to extract the mouth ROI for the frontal pose is
outlined in Figure 4.13. Given the video of a spoken utterance, face localisation
is first applied to estimate the position of the speaker’s face. As the classifier is
able to scale well, an image pyramid approach to search at different scales was not
required. Once the face was located, the eyes were searched over specific regions
of the face (based on training data statistics). Once these eye location were found,
a general mouth search region was specified. The mouth center classifier was then
used to refine this search region. The resulting mouth region was then used as the
search region to locate the right and left mouth corners. Once these two points
were found, the extracted mouth ROI was then rotated so that these two points
Figure 4.13: Block diagram of the visual front-end for the frontal pose.
were aligned horizontally and scaled to be 20 pixels apart to yield a final 32× 32
pixel ROI to be used in the lipreading system. The final ROI contained most of
the lower part of the face. In a comprehensive review conducted by Potamianos
and Neti [140], they found that improved lipreading results can be obtained by
having the jaw and cheeks included in the final ROI compared to the ROI just
containing the lips. This finding is also supported by human perception studies
which show that visual speech perception is improved when the entire lower face
is visible [169]. It must also be noted that the final 32 × 32 ROI was downsampled
from a much higher resolution (on average approximately 80 × 80 pixels) in order
to keep the dimensionality low. Such
a method is not expected to affect lipreading performance, according to the
work conducted by Jordan and Sergeant [83].
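The geometric normalisation described above (rotating the ROI so the mouth corners lie on a horizontal line, 20 pixels apart, within a 32 × 32 pixel output) amounts to a similarity transform. A minimal sketch in Python/NumPy, assuming the corner midpoint is placed at the centre of the output ROI (that placement is an illustrative assumption, not a detail from the thesis):

```python
import numpy as np

def mouth_roi_transform(left_corner, right_corner, out_size=32, corner_gap=20):
    """Similarity transform mapping the detected mouth corners onto a
    horizontal line, corner_gap pixels apart, centred in an
    out_size x out_size ROI (target placement is an assumption)."""
    left = np.asarray(left_corner, dtype=float)
    right = np.asarray(right_corner, dtype=float)
    d = right - left
    angle = np.arctan2(d[1], d[0])      # rotation needed to level the corners
    scale = corner_gap / np.hypot(*d)   # scale so corners end up corner_gap apart
    c, s = np.cos(-angle), np.sin(-angle)
    R = scale * np.array([[c, -s], [s, c]])
    # place the midpoint of the two corners at the centre of the output ROI
    mid = (left + right) / 2.0
    t = np.array([out_size / 2.0, out_size / 2.0]) - R @ mid
    return R, t

R, t = mouth_roi_transform((40.0, 60.0), (110.0, 75.0))
print(R @ np.array([40.0, 60.0]) + t)   # left corner maps to (6, 16)
print(R @ np.array([110.0, 75.0]) + t)  # right corner maps to (26, 16)
```

The ROI pixels would then be resampled through this transform (e.g. via an affine warp) to yield the normalised 32 × 32 image.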
Following the ROI localisation, the ROI is tracked over consecutive frames.
If the detected ROI is too far away from the previous frame's location, it is regarded as
a detection failure and the previous ROI location is used. A mean filter is then
used to smooth the tracking. Due to the speed of the Viola-Jones algorithm, this
process was performed on every frame. Prior to this full process beginning, an
initialization phase is executed to get an initial lock on the location of the various
facial features.
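The tracking logic described above can be sketched as follows; the jump threshold and mean-filter window here are illustrative assumptions, not values taken from the thesis:

```python
import numpy as np

def smooth_track(detections, max_jump=8.0, window=5):
    """Per-frame ROI tracking: a detection that jumps too far from the
    previously accepted location is treated as a detection failure and
    replaced by the previous location; a running mean filter then smooths
    the accepted track."""
    accepted = []
    for det in detections:
        det = np.asarray(det, dtype=float)
        if accepted and np.linalg.norm(det - accepted[-1]) > max_jump:
            det = accepted[-1]          # detection failure: reuse previous ROI
        accepted.append(det)
    smoothed = []
    for i in range(len(accepted)):
        lo = max(0, i - window + 1)     # trailing mean over the last `window` frames
        smoothed.append(np.mean(accepted[lo:i + 1], axis=0))
    return smoothed

# frame 3 jumps wildly and is rejected; the mean filter smooths the rest
track = smooth_track([(100, 100), (101, 100), (160, 140), (102, 101)])
print(track[3])  # [101.0, 100.25]
```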
4.7. Chapter Summary 71
Figure 4.14: Mouth ROI extraction examples. The upper rows show examples of the localised face, eyes, mouth region and mouth corners. The lower row shows the corresponding normalised mouth ROIs (32 × 32 pixels).
Overall, the performance of the visual front-end was very good, with it ap-
pearing to generalise well across all the different variations present in the dataset,
such as appearance and illumination. It should be noted that there were only a
few poorly tracked or mistracked ROIs in the dataset, which could be attributed
to random head movement. As it was assumed that little face movement would
occur, strict thresholds were set to minimise the amount of allowable movement
in the facial features. These failures were alleviated, however, by relaxing such con-
straints and performing localisation on each frame. Figure 4.14 shows face and
facial feature localisation examples from the visual front-end and the final ex-
tracted mouth ROIs.
4.7 Chapter Summary
A visual front-end which can automatically and accurately localise face and facial
features positions quickly is of the utmost importance for a lipreading system.
However, as it was noted in this chapter, this task is difficult due to the many
variations the visual front-end has to deal with such as pose, illumination, ap-
pearance and occlusion. If the system cannot deal with these variations, poor
localisation of the mouth ROI will inevitably take place, which will affect the
overall accuracy of the lipreading system due to the front-end effect. This chapter
reviewed various approaches to the visual front-end, especially focussing on the
72 Chapter 4. Visual Front-End
Viola-Jones algorithm which is both extremely rapid and accurate across all dif-
ferent conditions. This algorithm was then implemented for extracting the mouth
ROIs for the frontal pose scenario, achieving accurate results. The next step after
extracting the mouth ROIs is to extract visual features from them. This process
is described in the next chapter.
Chapter 5
Visual Feature Extraction
5.1 Introduction
The visual feature extraction step seeks to find representations of the given ob-
servations that provide discrimination between the various speech units whilst
providing invariance to irrelevant transforms on the observations that are in the
same class. Ideally, the task of the visual feature extraction step is to yield visual
speech features which make the job of the classifier trivial, i.e. the features would
already be clustered into their separated classes without overlapping. However,
due to the many variations within the mouth ROI, such as illumination, appear-
ance, viewpoint, alignment, speaking style and, of course, the high dimensionality
associated with image/video data, the task of finding features which provide good
speech discrimination is extremely difficult.
Over the past twenty or so years, various sets of visual features for lipreading
have been proposed in the literature [142]. In general, they can be grouped into
three groups: appearance-based, contour-based, or a combination of both. The
first part of this chapter is dedicated to briefly reviewing these. Even though no
one technique has shown itself to be superior to the others, appearance-based
methods have been preferred by many researchers, as they are motivated by human
perception studies and do not require finer localisation and tracking, which
reduces the impact of the front-end effect (see Chapter 4.2). As such, the latter
part of the chapter focusses on an appearance-based technique which is considered
73
74 Chapter 5. Visual Feature Extraction
as the current state-of-the-art. This technique is based on
a cascade of appearance features, and in this chapter each stage of the cascade is
investigated to determine its relative impact, which is a novel contribution. Even
though this technique works well, it has shortcomings, such
as dimensionality constraints. Making use of the laterally symmetrical nature
of the frontal ROI has been shown to alleviate some of these problems and this
is investigated in this chapter as well. Another potential method which shows
promise is through the use of patches. Motivated by the frontal ROI symmetry
work, novel analysis of the ROI via patches is introduced. The idea behind the
use of patches is that if there are areas of the ROI which are more pertinent to the
task of lipreading, then these areas can be weighted higher to improve lipreading
performance. The remainder of the chapter is dedicated to developing a new
multi-stream visual feature extraction technique which fuses the more pertinent
areas of the ROI together to gain a better representation.
5.2 Review of Visual Feature Extraction Tech-
niques
Potamianos et al. [142] divided the ways visual speech could be represented into
the following three categories:
(i) appearance based,
(ii) contour based, and
(iii) combination of appearance and contour based features.
The following section gives a brief review of the progress that has been made over
the past twenty years with respect to these approaches, citing the advantages
and disadvantages associated with them. The section concludes by comparing
the three approaches.
5.2. Review of Visual Feature Extraction Techniques 75
5.2.1 Appearance Based Representations
Appearance based representations are concerned with transforming the whole in-
put ROI image into a single meaningful feature vector. This method is motivated
by the fact that in addition to the lips, the visible speech articulators such as
the teeth, tongue, jaw as well as certain facial muscle movement are informative
about the visual speech [167]. To incorporate such features, the ROI is normally
a square or rectangular region around the speaker’s mouth, as was described in
the previous chapter. However, there is no fixed dimension, shape or size that
the ROI has to conform to. Some researchers have used the entire face for the
ROI [116], a three-dimensional rectangle to capture temporal nature of the signal
[136], or a disc around the speaker's mouth [45]. Instead of raw pixel values, differ-
ence images [161] or optical flow values [65, 112, 171] have also been used as the
ROI representation. As the dimensionality of the ROIs is generally too high to
be applied successfully to
a statistical classifier such as an HMM [147], the goal of these methods has been
to find a compact representation of the ROI, which is low in dimensionality but
retains most, if not all, of the visual speech information. This limitation on the
number of features allowed to be used is also known as the curse of dimensionality
[8].
The dimensionality reduction problem is similar to that encountered in face
recognition, where a compact representation of a face image is desired. As such,
it is not surprising to see that a lot of the work performed in lipreading has mir-
rored the work done in face recognition. After Turk and Pentland published their
ground breaking paper on eigenfaces [176], where principal component analysis
(PCA) was used for feature reduction, Bregler and Konig [13] later introduced
eigenlips. This work was based on the same idea, except that PCA would be per-
formed on the mouth ROI, not the face. Since then, PCA has been a very popular
appearance based method used for lipreading with many researchers achieving
good results [12, 13, 27, 45, 47, 98, 102, 137, 175]. Independent component
analysis (ICA) has also been used for lipreading [64]. The goal of ICA [78] is
to find a transform whose components are statistically independent, not
just uncorrelated. However, results obtained using such a transform were not
significantly better than traditional
76 Chapter 5. Visual Feature Extraction
PCA representations for lipreading [64].
Linear image transforms such as the discrete wavelet transform (DWT) [134,
137] and the discrete cosine transform (DCT) have been employed in various sys-
tems [45, 62, 72, 98, 106, 126, 137, 158, 161]. These non-data driven approaches
do not produce any compression, but make the transformed coefficients more
amenable to quantisation by removing much of the statistical redundancy in the
image. The DCT is the basis of many low-rate algorithms in use today such as
MPEG for High Definition Television (HDTV) and JPEG for still images. These
image transforms usually allow fast implementation, when the image size is a
power of 2, by use of the Fast Fourier Transform (FFT), and lend themselves
to a real-time implementation of a lipreading system [32]. Another benefit of
these approaches is that they are not dependent on a training ensemble; however,
they bring minimal prior knowledge about the mouth to the system if used as
the sole visual feature extractor. Non-linear transforms such as the multiscale
spatial analysis (MSA) technique have also been used to gain a representation
of the mouth ROI [115]. MSA uses a nonlinear scale-space decomposition algo-
rithm called a sieve which is a mathematical morphology serial filter structure
that progressively removes features from an image by increasing the scale.
The methods mentioned above achieve reasonable performance; however, they
are merely concerned with compressing the ROI and do not necessarily take into
consideration the speech content contained within the ROI. A major reason for
this is that they are unsupervised processes and, as such, do not make use of
important available speech information, such as time-labelled transcriptions, which
are used for the training of the HMM models. Linear discriminant analysis (LDA),
on the other hand, is a supervised process which can use information such as the
time-labelled transcription to segment the training data into speech classes (such
as HMM states) and calculate a transform which projects the ROI data to
a lower dimensionality whilst maximising the separation of the speech classes.
First proposed for lipreading by Duchnowski et al. [45], LDA was applied directly
to the pixels of the ROI. However, this can be problematic, as when there are many
training examples, the calculation of the LDA transform matrix can become com-
putationally prohibitive. To counteract this, a dimensionality reduction step of
5.2. Review of Visual Feature Extraction Techniques 77
Figure 5.1: Appearance based features utilise the entire ROI given on the left. Contour based features require further localisation to yield features based on the physical shape of the mouth, such as mouth height and width, as depicted on the right.
PCA [98] or DCT [140] is applied, which has been shown to outperform other appear-
ance based methods. Potamianos et al. [140] further improved performance by
following the LDA step with a maximum likelihood linear
transform (MLLT), which maximises the likelihood of the observations in the LDA
feature space under the assumption of a diagonal covariance.
As mentioned in Chapter 2, the dynamics of visual speech play a very impor-
tant role in human perception [152]. Dynamic speech information, such as first
and second order derivatives of the visual speech feature vector, can be used to
capture this information [189]. LDA can also be used in this capacity [142],
by concatenating ±J adjacent feature vectors around the current frame to form
one large feature vector. LDA is then used to produce a transform which can
maintain the dynamic information whilst producing a final feature vector which
is small enough to allow convergence of the HMM. This process, first proposed by
Potamianos et al. [145], is performed via a multi-stage cascade and is currently
the state-of-the-art in visual feature extraction. As this is the case, this process
will be used as the baseline system for this thesis and is described in Chapter 5.3.
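The ±J frame concatenation that feeds the dynamic-feature LDA stage can be sketched as below; the value of J and the boundary handling (repeating the edge frame) are illustrative assumptions:

```python
import numpy as np

def stack_adjacent(features, J=7):
    """Concatenate the +/-J adjacent feature vectors around each frame into
    one large vector, as in the dynamic-feature stage described above.
    Edges are handled by repeating the boundary frame (an assumption)."""
    T, D = features.shape
    stacked = np.empty((T, (2 * J + 1) * D))
    for t in range(T):
        idx = np.clip(np.arange(t - J, t + J + 1), 0, T - 1)
        stacked[t] = features[idx].ravel()
    return stacked  # an LDA projection then reduces this to a small vector

feats = np.random.randn(50, 30)
print(stack_adjacent(feats).shape)  # (50, 450)
```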
5.2.2 Contour Based Representations
Contour based representations are concerned with representing the mouth based
on the physical shape of the visible articulators. Whereas the appearance based
features just utilise the pixels within the ROI, the contour based approach goes a
step further by specifying the locations of the various visible articulators and using
78 Chapter 5. Visual Feature Extraction
these locations as their features as seen in Figure 5.1. This intuitive approach
has the appeal of being low in dimensionality, however, it does require further
localisation and tracking which can have an adverse effect on the lipreading system
due to the front-end effect.
A common contour based technique is to represent the mouth based on its
physical measurements such as mouth height, width, area etc. [1, 22, 59, 71, 132,
172] or even teeth [56] (see Figure 5.1). In [88], Kaynak et al. have provided a
comparison of these types of techniques. Another popular technique is to use active
shape models (ASM) [109, 126], as discussed in the previous chapter, to represent
the inner and outer lip contour by a set of labelled points. Other parametric
models such as the snake based algorithm [27], lip template parameters [21] and
deformable templates [73] have been used to good effect. Contour based features
based on the MPEG-4 standard's Facial Animation Parameters (FAPs) have also been
proposed [3]. Recently, Rothkrantz et al. [153] introduced the use of lip geometry
estimation (LGE) along with optical flow analysis as another method for visual
feature extraction for lipreading.
5.2.3 Combination of Features
Based on the idea of fusing complementary visual features with the acoustic
features to improve system performance, researchers have applied a similar idea to
combining the appearance and contour based representations. This idea stems
from the hypothesis that the appearance features encode low-level information
and the contour features encode high-level information about visual speech and
by combining them together, improved performance can be sought as they are
complementary to each other. Luettin first did this by combining ASM features
with PCA features [109]. Chiou and Hwang [27] followed this up by combining
their snake contour features with the PCA features. Chan [19] then used geo-
metric features with PCA features. These approaches simply concatenate both
sets of features into a single feature vector. Conversely, the active appearance
model (AAM) creates a single model of both shape and appearance, using PCA
to statistically combine both the ASM with appearance features based on the
pixel intensity values [33, 114, 127]. A disadvantage of using this AAM approach
5.2. Review of Visual Feature Extraction Techniques 79
is that it requires an extremely large number of manually annotated points for
the training examples and does not work well when the speaker is not contained
within the training set.
Recently, Saenko et al. [157] proposed the use of multiple streams of hidden
articulatory features (AFs) to represent the visual speech signal. In this work,
each sound is described by a unique combination of various articulator states,
such as “lip-opened”, “lip-rounded”, “presence of teeth” etc. A problem associ-
ated with this multi-stream approach is the complexity involved as each of these
articulatory states (such as “lip-opened”) requires extra classification (via an SVM
for example) prior to the sound classification, which may make this approach
intractable.
5.2.4 Appearance vs Contour vs Combination
Even though a plethora of research has been conducted within the field of visual
feature extraction for lipreading, it is still not clear which approach is best. A ma-
jor reason for this is that no comprehensive comparison of the various approaches
has yet been conducted. In comparisons of limited size, Matthews et al. [115]
showed that AAMs outperform ASMs. Chiou and Hwang [27] documented that
their combined features were superior to the contour and appearance based fea-
tures. Potamianos et al. [137] and Scanlon and Reilly [161] demonstrated that
the appearance based features outperformed contour features. In experiments
based on the task of large vocabulary speaker independent AVASR, Matthews et
al. [116] showed that the appearance features outperformed AAMs.
In a recent paper, Rothkrantz et al. [153] claim that the contour based
approach is superior to the appearance based approach. Even though they did not
do any experiments to support this statement, they hypothesised that the appear-
ance based features were inferior as they contained a lot of information which may
not pertain to the task of speech recognition but more so to speaker recognition,
as the heavily compressed features relate to speaker information and not speech
information. If a coarse image compression technique such as PCA or DCT is
applied and nothing else, this can be the case. However, if dimensionality reduc-
tion schemes such as LDA are employed, speech classification information can be
80 Chapter 5. Visual Feature Extraction
maintained. Other feature normalisation techniques such as removing the mean
feature vector or image over the utterance also normalise against speaker appear-
ance information. In the current state-of-the-art system devised by Potamianos
et al. [145], they use a cascade of appearance features using LDA as well as a
speaker normalisation step to maximise visual speech information.
Contour representations of the mouth ROI can be said to have certain bene-
fits over appearance based representations. The main benefit can be found in the
invariance provided from the shape information contained within the contours of
a speaker's mouth region. Appearance features tend to suffer from variations
irrelevant to visual speech, due to illumination, speaker appearance and
ROI alignment. However, accurately extracting the contours of
the mouth region is very difficult, making them susceptible to
the front-end effect. By comparison, appearance based features rely on a coarse
detection of the ROI, making them far more stable, especially in difficult con-
ditions. Also, normalisation techniques for speaker appearance and illumination
can be employed to aid in the robustness of these features. AAMs appear to be
the best of the combined feature set, however, as shown during a comprehensive
comparison [116], appearance based features seem to be superior to them at the
moment.
In their review paper [142], Potamianos et al. highlight two very important
points in the argument for appearance based features. Firstly, their use is
well motivated by human perception studies of visual speech as they contain in-
formation about the visible articulators (such as tongue, teeth, muscles around
the jaw etc.), which are not contained just by the contours of the lips [7]. In the
perception studies cited, perception of the mouth using the entire mouth ROI was
far superior to just the lip movement [169]. Secondly, appearance based features
can be computed very quickly, which lends itself to real-time implementation.
This point is probably the most important in terms of deploying a real-world
lipreading system. Another very important point is that appearance based
features are generic and can be applied to mouth ROIs of any viewpoint, whereas
for contour based approaches, specific contours have to be developed for the many
views, which may be a very cumbersome and exhaustive task.
5.3. Cascading Appearance-Based Features 81
Figure 5.2: Block diagram depicting the cascading approach used by Potamianos et al. [145] to extract appearance based features from the mouth ROI.
For all these reasons, the appearance based approach was employed as the
visual feature extraction method of choice for this thesis. In the next section,
the current state-of-the-art technique based on a cascade of appearance-based
features is used as the benchmark, and experiments are conducted to show that
certain measures can be executed to normalise against the numerous variations
that the appearance based features are susceptible to.
5.3 Cascading Appearance-Based Features
The current state-of-the-art in visual feature extraction is that of multi-stage
cascade of appearance features devised by Potamianos et al. [145]. For this
thesis, a system based on this approach is used as the baseline system for all
work conducted. The complete system is depicted in Figure 5.2. From this figure
it can be seen that the system consists of two main stages:
1. static feature capture (features captured per single frame), and
2. dynamic feature capture (features capturing temporal information over
multiple frames).
Each of these steps will be described in detail in the following subsections. How-
ever, as can be seen in Figure 5.2, a preprocessing step is required to convert the
ROI from a colour image into a grayscale image. The curse of dimensionality
explains why grayscale intensity values have been preferred over colour
values, as they give a three times more compact representation.
Research has found that the loss of the chromatic information does not
impact on the speech classification performance [143].
82 Chapter 5. Visual Feature Extraction
Figure 5.3: Block diagram showing the capturing of the static features of a ROI frame.
5.3.1 Static Feature Capture
The goal of the static feature capture module is to capture the maximum amount
of speech information within each ROI frame using the fewest features. This
fine balancing act is due to the dimensionality constraint enforced by the HMM
classifier, as mentioned in the previous section. However, this is achievable
via a three stage cascading module, which is depicted in Figure 5.3. As can
be seen in this figure, the static feature capture starts with a two-dimensional
DCT. As mentioned in the review given in the previous section, compression
techniques such as PCA and DWT can also be used instead of the DCT. However,
it was found in Potamianos et al. [143] that all of these algorithms achieve
approximately the same performance, with the DCT on par with PCA and slightly
outperforming DWT. The DCT and the DWT have the added advantage that
they allow fast implementations if the image resolution is a power of two and
also do not require prior knowledge of the training ROI examples. On the other
hand, PCA does require these training examples, which is very computationally
expensive, especially given the high dimensionality of the images (32 × 32 = 1024)
and the number of training examples normally required (≥ 200,000)
to adequately train the PCA subspace. As such, the DCT was chosen as
the image compression technique for this thesis.
Discrete Cosine Transform (DCT)
Given a grayscale intensity frame of the ROI of dimension D = L × W, the
two-dimensional DCT of the ROI $I_t(i, j)$, denoted $F_t(l, w)$, can be computed
as follows
5.3. Cascading Appearance-Based Features 83
$$
F_t(l, w) = \sqrt{\frac{2}{L}} \sqrt{\frac{2}{W}} \, c_l c_w \sum_{i=0}^{L-1} \sum_{j=0}^{W-1} I_t(i, j) \cos\frac{(2i+1)l\pi}{2L} \cos\frac{(2j+1)w\pi}{2W} \qquad (5.1)
$$

where

$$
l = 0, 1, \ldots, L-1, \qquad w = 0, 1, \ldots, W-1
$$

and

$$
c_{l,w} = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{for } l, w = 0 \\ 1 & \text{for } l, w \neq 0 \end{cases}
$$
where L and W refer to the length and width of the ROI in pixels respectively.
No compression of the DCT image, $F_t(l, w)$, has taken place in the form given
in Equation 5.1. All this has achieved is transforming the image from the spatial
domain into the frequency domain, similar to what the discrete Fourier trans-
form (DFT) does. However, the form of Ft(l, w) lends itself to be compressed
as the coefficients within the transformed image are grouped according to their
importance. Most of the energy or information is contained within the low-order
coefficients, whereas, the higher order coefficients have very low information con-
tained within them and can be discarded. By reorganising the image data in
this way, compression can be achieved by retaining the coefficients with the most
information.
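Equation 5.1 can be written directly in code. A slow reference implementation for clarity only, assuming NumPy; in practice FFT-based routines would be used:

```python
import numpy as np

def dct2(roi):
    """Two-dimensional DCT of a grayscale ROI, computed term-by-term
    from Equation 5.1 (orthonormal DCT-II)."""
    L, W = roi.shape
    F = np.zeros((L, W))
    for l in range(L):
        for w in range(W):
            cl = 1 / np.sqrt(2) if l == 0 else 1.0
            cw = 1 / np.sqrt(2) if w == 0 else 1.0
            i = np.arange(L)[:, None]
            j = np.arange(W)[None, :]
            basis = (np.cos((2 * i + 1) * l * np.pi / (2 * L)) *
                     np.cos((2 * j + 1) * w * np.pi / (2 * W)))
            F[l, w] = np.sqrt(2 / L) * np.sqrt(2 / W) * cl * cw * np.sum(roi * basis)
    return F
```

Because this normalisation makes the basis orthonormal, the transform preserves the energy of the ROI, and a constant image produces a single non-zero DC coefficient.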
A convenient way to scan the two-dimensional DCT is the zig-zag scheme
(see Figure 5.4) used in the JPEG standard [173] because it groups together
coefficients with similar frequency. As such, the top M coefficients according
to this pattern represent information within the image which contains the most
variability or information which is used to represent the given ROI. Figure 5.5
shows examples of the reconstructed images for various values of M. As
can be seen in this figure, low numbers of features such as M = 10 or 30 result
in very little information being maintained from the original ROI. However, as
84 Chapter 5. Visual Feature Extraction
Figure 5.4: Diagram showing the zig-zag scheme used to read in the coefficients from the two-dimensional DCT image.
(a) (b) (c) (d) (e)
Figure 5.5: Examples showing the reconstructed ROIs using the top M coefficients from the DCT: (a) original, (b) M = 10, (c) M = 30, (d) M = 50 and (e) M = 100.
the number of features used is increased to M = 50 and 100, the
reconstructed ROIs look increasingly like the original, with a dimensionality
reduction by a factor of 10 to 20.
Using the top M DCT coefficients, the ROI frame can be expressed as a vector
of the form

$$
y_t^I = [y_1, \ldots, y_M]' \qquad (5.2)
$$

where $y_1, \ldots, y_M$ correspond to the top M coefficients according to the zig-zag
pattern and $y_t^I$ refers to the feature vector after the first stage of the cascade,
which can be seen in Figure 5.3. It is worth noting that the DCT is completely
reversible, that is, performing the inverse DCT on the DCT of an image will
restore the original image; this property allowed the reconstructions in
Figure 5.5.
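The zig-zag read-out of Figure 5.4 and the selection of the top M coefficients can be sketched as:

```python
import numpy as np

def zigzag_top_m(F, M):
    """Read a 2-D DCT image in the JPEG zig-zag order and keep the top M
    coefficients as the stage-one feature vector y_t^I."""
    L, W = F.shape
    # visit anti-diagonals in order; alternate direction within each diagonal
    order = sorted(((i, j) for i in range(L) for j in range(W)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([F[i, j] for i, j in order[:M]])

F = np.arange(16).reshape(4, 4)
print(zigzag_top_m(F, 6))  # [0 1 4 8 5 2]
```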
5.3. Cascading Appearance-Based Features 85
Figure 5.6: Plot showing the speaker information contained within the features without normalisation, for the digits “zero”, “one” and “two”.
Feature Mean Normalisation
Once the DCT step has been executed, the next step is to perform feature mean
normalisation (FMN). The FMN step is very important due to the fact that
appearance based features have an abundance of speaker information contained
within them as noted recently by Rothkrantz et al. [153] and addressed earlier in
[104]. This becomes apparent when the DCT features for two different speakers
are analysed. In Figure 5.6, the second and third DCT coefficients for two speak-
ers are plotted against each other for three spoken digits. It can be seen that
the features are grouped according to their speaker and not the speech classes.
This is a very good result for the task of speaker recognition, however in terms of
lipreading, this information is irrelevant and can be classed as a type of noise. In
audio-only speech recognition a method called cepstral mean subtraction (CMS)
has been used as a method to remove this speaker information, as well as other
environmental variations [100, 189]. Similarly, this can be done in the visual do-
main by subtracting the mean feature vector over the entire utterance, $\bar{y}^I$, thus
effectively removing the speaker information. A simple block diagram depicting
86 Chapter 5. Visual Feature Extraction
Figure 5.7: Block diagram showing the feature mean normalisation (FMN) step of the cascading process, resulting in $y_t^{II}$.
this is given in Figure 5.7. So, given the feature vector from the previous DCT
step, the normalised feature vector can be found via
$$
y_t^{II} = y_t^I - \bar{y}^I \qquad (5.3)
$$

where

$$
\bar{y}^I = \frac{1}{T} \sum_{t=1}^{T} y_t^I \qquad (5.4)
$$
and T is the length of the utterance. This FMN step has shown itself to greatly
improve the performance of appearance based features [137, 140]. As can be
seen in Figure 5.8, the FMN essentially removes the mean of the image, which
contains the redundant speaker information. In this thesis, the result-
ing normalised DCT features $y_t^{II}$ are termed mean-removed DCT (MRDCT)
features.
The MRDCT features can be obtained via a slight augmentation to the static
feature capture module. Instead of removing the mean feature vector, the mean
ROI image over the utterance can be removed. This is done by placing the FMN
step prior to the DCT step, as can be seen in Figure 5.9. In the end, $y_t^{II}$ is still
obtained, essentially resulting in the same output, but the normalisation is done
in the image domain rather than the feature domain. In this configuration, the
mean image over the utterance, $\bar{I}$, is subtracted from the input image $I_t$ to yield $I_t^{II}$.
The two-dimensional DCT is then performed on $I_t^{II}$ to gain the MRDCT features
$y_t^{II}$.
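Because the DCT is linear, performing FMN in the feature domain or in the image domain yields identical normalised features. A small numerical check, using a random linear transform as a stand-in for the flattened 2-D DCT:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 12, 64                      # frames, flattened ROI pixels
rois = rng.normal(size=(T, D))     # stand-in for a grayscale ROI sequence
A = rng.normal(size=(D, D))        # any linear transform; the 2-D DCT is one

# feature-domain FMN: transform each frame, then remove the mean vector
y = rois @ A.T
fmn_features = y - y.mean(axis=0)

# image-domain FMN: remove the mean ROI image, then transform
fmn_images = (rois - rois.mean(axis=0)) @ A.T

# linearity of the transform makes the two orderings equivalent
print(np.allclose(fmn_features, fmn_images))  # True
```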
There are a few reasons why this change is necessary. Firstly, in the current
system the FMN process is relied on as the sole normalisation step. No rotation
normalisation, pose compensation or lighting normalisation is directly applied
5.3. Cascading Appearance-Based Features 87
Figure 5.8: Plot showing that with FMN the unwanted speaker information contained within the features is effectively removed, for the digits “zero”, “one” and “two”.
on the ROI. As mentioned in the previous chapters, these variations have not
been a problem for lipreading and as such there has been no requirement for
accounting for such variations. However, as more work in this field is being ap-
plied on real-world data where these variabilities are an issue, precautions should
be made so that a lipreading system can handle them. The only work found to
deal with such variations in the visual domain is by Potamianos and Neti [141],
who found that lipreading performance degrades significantly when deployed in
challenging environments. This mirrors the findings in face recognition, where il-
lumination and pose variability have shown themselves to be among the biggest sources
of train/test mismatch, causing severe performance degradation [67]. Secondly, by
placing the FMN step directly after the ROI extraction instead of the DCT step,
it is believed that variabilities such as pose and illumination can be dealt with
in a more efficient manner by incorporating an illumination or pose normalisa-
tion step within the FMN module. By allowing this change, the FMN process
essentially acts as a pre-processing step operating in the image domain rather than
the feature domain. This allows the input ROI image to be enhanced prior to
88 Chapter 5. Visual Feature Extraction
Figure 5.9: Block diagram showing the augmented static feature capture system using the FMN in the image domain rather than the feature domain.
any feature extraction and normalise for any unwanted variations within the
input ROI, hopefully reducing the train/test mismatch and thus improving
lipreading performance.
In the next section, experiments will be conducted showing that this basic
change to the system does not affect lipreading performance.
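The image-domain FMN step described above can be sketched in a few lines: subtract the utterance-mean ROI from every frame before any feature extraction, so that static, speaker-dependent appearance is removed. This is a minimal illustration, assuming the utterance is available as a stack of grey-level ROI frames; it is not taken from the thesis implementation.

```python
import numpy as np

def fmn_image_domain(rois):
    """Feature-mean normalisation applied in the image domain (Figure 5.9):
    subtract the utterance-mean ROI from every frame, so the static
    speaker-dependent appearance is removed before the 2D-DCT."""
    rois = np.asarray(rois, dtype=float)   # shape (T, H, W)
    mean_roi = rois.mean(axis=0)           # mean image over the utterance
    return rois - mean_roi                 # mean-removed ROIs

# Toy example: a constant (speaker) component is removed exactly,
# leaving only the time-varying (speech) component, zero-mean over time.
T, H, W = 5, 16, 32
rng = np.random.default_rng(0)
speaker = rng.random((H, W))               # static speaker appearance
speech = rng.standard_normal((T, H, W)) * 0.1
normed = fmn_image_domain(speaker + speech)
```

Because the subtraction happens on images, any illumination or pose normalisation could be inserted at the same point, which is the motivation given in the text.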
Linear Discriminant Analysis
Whilst the previous two steps have extracted and normalised the features for each
ROI frame, they are only coarse image compression and normalisation techniques
and do not identify those features that will give the best discrimination between
the various speech classes (such as words). For a successful lipreading system to
be employed, it is important that features not only describe each class well, but
also allow distinguishing characteristics of each class to be easily identifiable.
Linear discriminant analysis (LDA) aims to find the optimal transformation
matrix $W^{I}_{LDA}$ such that the projected data is well separated. Unlike the DCT,
LDA is a supervised process which uses a predefined set of classes $C$ associated
with the training data vectors $y^{II}_t$ to determine this optimal transform. The
class set $C$ consists of $A$ classes, so that $c(a) \in C$. For the task of
lipreading, where recognition of words is the overall goal, these $A$ classes are
normally associated with words. However, improved performance can be gained
by increasing the number of classes, so HMM states are used as the class
set. The training data is labeled by aligning the training feature vectors
with the state-aligned, time-labeled transcription, which can be obtained from the
audio-only models using HTK [189].
Given a set of $N_T$ training examples, $X^I = [y^{II}_1, \ldots, y^{II}_{N_T}]$, and associated class
labels $C$, the LDA transformation matrix $W^I_{LDA}$ can be found by minimising
the intra-class dispersion whilst maximising the inter-class distance.¹ To formulate
the criteria for class separability, the within-class scatter matrix $S_w$ and the
between-class scatter matrix $S_b$ are used. The within-class scatter matrix describes
the statistics of the data points around their own expected vector, whilst
the between-class scatter matrix describes the distribution statistics of all class
expected vectors. The within-class scatter matrix can be expressed as

$$S_w = \sum_{a=1}^{A} c_a \Sigma_a \tag{5.5}$$

where $c_a$ is the $a$th class mixture weight and $\Sigma_a$ is the $a$th class covariance matrix.
The between-class scatter matrix can then be expressed as

$$S_b = \sum_{a=1}^{A} c_a (\mu_a - \mu_0)(\mu_a - \mu_0)' \tag{5.6}$$

where $\mu_a$ is the $a$th class mean and $\mu_0$ is the mixture mean given by

$$\mu_0 = \sum_{a=1}^{A} c_a \mu_a \tag{5.7}$$

The transformation matrix $W^I_{LDA}$ can then be estimated by maximising
$\mathrm{tr}(W^I_{LDA} S_w^{-1} S_b (W^I_{LDA})')$ [51]. Similar to what occurs in PCA, this translates
to retaining the $N$ greatest eigenvalues and eigenvectors of $S_w^{-1} S_b$. Although both
$S_w^{-1}$ and $S_b$ are symmetric, there is no guarantee that $S_w^{-1} S_b$ will be symmetric, making
normal eigen decomposition impossible. Simultaneous diagonalisation as proposed
by Fukunaga [51] can be used to diagonalise $S_w^{-1} S_b$, where

$$(W^I_{LDA})' S_w W^I_{LDA} = I \tag{5.8}$$

and

$$(W^I_{LDA})' S_b W^I_{LDA} = \Lambda \tag{5.9}$$

¹Note that in this cascading algorithm there are steps which are repeated several times, hence the need for the indexing of the matrices and vectors to avoid confusion, i.e. $W^I_{LDA}$ and $y^{II}_{N_T}$.
where $\Lambda$ and $W^I_{LDA}$ are the eigenvalues and eigenvectors of the matrix $S_w^{-1} S_b$. It
must be said that the resulting eigenvectors in the transformation matrix $W^I_{LDA}$
are not mutually orthonormal or orthogonal. As a result the transform does not
preserve energy, but it does preserve class separability as defined by the within-class
and between-class scatter matrices.
It has been shown that LDA does not work well when it is applied directly
to high-dimensional data such as images [6, 184]. This is mainly due to its
susceptibility to low-energy noise, and to it being computationally prohibitive to
calculate the LDA matrix when the input matrix is extremely large. To alleviate
this problem, a dimensionality reduction step is normally taken to remove this
low-energy noise. This is why the two-dimensional DCT was performed on the
input ROI prior to the LDA step, as it is more effective to work on data of
dimensionality $M$ compared to $D$, where $D \gg M$.
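The dimensionality reduction step above — a 2D-DCT followed by keeping the top $M$ coefficients — can be sketched as follows. The thesis only states that the top $M$ coefficients are retained; the zig-zag ordering from the low-frequency corner used here is a common convention and is assumed, not taken from the text.

```python
import numpy as np
from scipy.fftpack import dct

def top_m_dct_features(roi, m):
    """2D-DCT of the ROI, keeping the top-m coefficients.
    Coefficients are ordered by a zig-zag scan from the low-frequency
    corner -- an assumed convention; the thesis only says the top M
    coefficients are kept."""
    C = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    H, W = C.shape
    # zig-zag order: sort index pairs by diagonal (r + c), then by row
    order = sorted(((r, c) for r in range(H) for c in range(W)),
                   key=lambda rc: (rc[0] + rc[1], rc[0]))
    return np.array([C[r, c] for r, c in order[:m]])

y = top_m_dct_features(np.random.rand(16, 32), m=100)   # D = 512 -> M = 100
```

This realises the $D \gg M$ reduction (here $512 \to 100$) before the LDA matrix is ever estimated.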
LDA is similar to PCA in that the linear transform matrix $W^I_{LDA}$ maps the
input data $X^I$ of dimensionality $M$ to the output matrix $Y^I$, which is of dimension
$N$ (where $M > N$). To achieve this feature reduction, the top $N$ eigenvectors
of $W^I_{LDA}$ corresponding to the largest $N$ eigenvalues of $\Lambda$ are retained, yielding
$\hat{W}^I_{LDA}$. So given the input matrix $X^I$, the output $Y^I$ can be found by simply
applying

$$Y^I = (\hat{W}^I_{LDA})' X^I \tag{5.10}$$

where the output from the LDA step is $Y^I = [y^{III}_1, \ldots, y^{III}_T]$, which corresponds
to the final static feature vectors and the third step in the overall cascading
algorithm. This LDA step has been termed intra-frame LDA due to it
occurring within each individual frame [142].
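The intra-frame LDA estimation described by Equations 5.5–5.10 can be sketched as below. This is an illustrative implementation, not the thesis code: the scatter matrices are built per Equations 5.5 and 5.6, and the simultaneous diagonalisation of Equations 5.8–5.9 is obtained through SciPy's generalised symmetric eigensolver; the small regularisation term on $S_w$ is an added assumption for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def intra_frame_lda(X, labels, n_out):
    """Estimate the intra-frame LDA transform from feature vectors.
    X: (num_samples, M) matrix of normalised vectors; labels: class per row.
    Returns the (M, n_out) matrix of top eigenvectors of S_w^{-1} S_b."""
    mu0 = X.mean(axis=0)                 # mixture mean, Eq. 5.7
    M = X.shape[1]
    Sw = np.zeros((M, M))
    Sb = np.zeros((M, M))
    for c in np.unique(labels):
        Xc = X[labels == c]
        ca = len(Xc) / len(X)            # class mixture weight
        mu = Xc.mean(axis=0)
        Sw += ca * np.cov(Xc, rowvar=False, bias=True)   # Eq. 5.5
        Sb += ca * np.outer(mu - mu0, mu - mu0)          # Eq. 5.6
    # simultaneous diagonalisation: W'SwW = I, W'SbW = Lambda (Eqs 5.8-5.9)
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(M))       # regularised
    order = np.argsort(evals)[::-1]      # largest eigenvalues first
    return evecs[:, order[:n_out]]

# toy two-class example: project M = 20 dimensions down to N = 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, (50, 20)) for c in (0.0, 3.0)])
y = np.repeat([0, 1], 50)
W = intra_frame_lda(X, y, 1)
Y = X @ W
```

In the lipreading system the class labels would be the state-aligned HMM labels rather than the two synthetic classes used here.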
There are constraints associated with using LDA. Firstly, it assumes that
each class is described by the same covariance matrix; this can be a large
problem when the approximation does not hold. Additionally, the rank of the
between-class scatter matrix $S_b$ is at most $A - 1$, limiting the size of the
subspace defined by $W^I_{LDA}$ to $A - 1$. Finally, LDA is only suitable for problems
where classes are separated by their means, not their covariances. When this assumption does
not hold it is possible to find clusters in each class, forcing each distribution to
be described by several unimodal Gaussians sharing the same covariance matrix.
Figure 5.10: Block diagram showing the capturing of the dynamic features centered at each ROI frame (static feature vectors $y^{III}_t$ from frames $t-J$ to $t+J$ are concatenated into $y^{IV}_t$ of dimension $(2J+1)N$ and passed through the inter-frame LDA to give the final dynamic feature vector $y^{V}_t$ of dimension $P$).
In the current state-of-the-art system, the LDA step is normally followed by
a maximum likelihood linear transform (MLLT) step [61]. In this thesis however,
the MLLT step was not performed, as it did not add to the performance of the
lipreading system during preliminary experiments.
5.3.2 Dynamic Feature Capture
The temporal aspect of the visual speech signal is known to help human perception
of visual speech [152], as mentioned in Chapter 2. There are many ways of
incorporating this temporal information. The most popular method of capturing
the dynamic information is via the first and second derivatives of the feature
vectors [189]. Another method which can give improved results is to use LDA
as a means of learning a transformation matrix which can optimally capture the
dynamic nature of speech. Such a method is depicted in Figure 5.10, and this is
used in the current state-of-the-art system employed by Potamianos et al. [140].
It can be seen in this figure that the transformation matrix is found from the
concatenation of $\pm J$ frames centered around the current frame. So each input
frame to the LDA step is represented by

$$y^{IV}_t = [(y^{III}_{t-J})', \ldots, (y^{III}_t)', \ldots, (y^{III}_{t+J})']' \tag{5.11}$$

The LDA transformation matrix for the dynamic features, $\hat{W}^{II}_{LDA}$, is calculated
in exactly the same way as for the static features, with the classes and the
number of training examples remaining the same. The only difference is that the
input feature vectors span across multiple frames and not just within the frame.
For this reason, this step has been termed inter-frame LDA [142].
Similar to the result in Equation 5.10, the output $Y^{II}$ can be found by
simply applying

$$Y^{II} = (\hat{W}^{II}_{LDA})' X^{II} \tag{5.12}$$

where the input is $X^{II} = [y^{IV}_1, \ldots, y^{IV}_T]$ and the output from the LDA step is
$Y^{II} = [y^{V}_1, \ldots, y^{V}_T]$, which corresponds to the final dynamic feature vectors and
the final step in the overall cascading algorithm shown in Figure 5.10. In the
work conducted by Neti et al. [126] and Potamianos et al. [142], it was found that
using 5 adjacent frames ($J = 2$) gave optimal results.²
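The frame concatenation of Equation 5.11 can be sketched as below. The thesis does not state how sequence boundaries are handled, so edge frames are replicated here as one plausible choice; everything else follows the $(2J+1)N$ stacking shown in Figure 5.10.

```python
import numpy as np

def stack_frames(Y, J=2):
    """Concatenate the +/-J neighbouring static vectors around each frame
    (Eq. 5.11). Y: (T, N) static features; returns (T, (2J+1)*N).
    Edge frames are replicated at the boundaries -- an assumption, as the
    thesis does not specify the boundary handling."""
    T, N = Y.shape
    padded = np.pad(Y, ((J, J), (0, 0)), mode='edge')
    # column block j holds frame t - J + j for each row t
    return np.hstack([padded[j:j + T] for j in range(2 * J + 1)])

Y = np.arange(12, dtype=float).reshape(6, 2)   # T = 6 frames, N = 2 features
Z = stack_frames(Y, J=2)                       # (6, 10): (2J+1)*N columns
```

The stacked vectors $y^{IV}_t$ would then be projected by the inter-frame LDA matrix to give the final $P$-dimensional dynamic features.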
5.4 Lipreading from Frontal Views
In this section, experiments are conducted on frontal view data to test the
cascading appearance feature extraction method described in the previous section.
Analysis of the features is carried out at each stage, in an effort to show the
importance of each individual stage. As no analysis like this has been conducted
before, this is very important as it shows what impact each stage of the cascade
has on the overall lipreading performance. This analysis is also important in
working out the parameters to yield the optimal performance from the lipreading
system. These experiments also show some of the limitations and restrictions
associated with some of the stages, which may be overcome with some
modifications.
The frontal pose portion of the IBM smart-room database was used for this
experiment. As for all experiments carried out using this database, the
multi-speaker paradigm using the protocol described in Section 3.6.2 was used. As the
dynamic nature of speech is vital in terms of recognising visual speech, it was
²In these systems the visual features were interpolated to the audio rate of 100Hz to allow easy integration of the audio and visual streams. Because of interpolation, they actually used 15 adjacent frames. In this system however, the focus is solely on lipreading and as such no interpolation was performed, so the value of 5 frames is an approximate equivalent (30Hz vs 100Hz).
decided that difference images would be tested as well as the original images to
see how much impact the temporal aspect of visual speech had on performance.
Given an input ROI image $I_t$, the difference image can be defined as

$$I^*_t = I_t - I_{t-1} \tag{5.13}$$

The features for the difference ROI images are then calculated in the same way
as for the original images, except that $I^*_t$ is used instead of $I_t$. Both the static and
dynamic features are evaluated in these experiments, which are described in the
following subsections.
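Equation 5.13 amounts to a one-frame temporal difference over the ROI sequence. The sketch below assumes the first frame, which has no predecessor, is simply dropped; the thesis does not state its convention for that frame.

```python
import numpy as np

def difference_rois(rois):
    """Difference ROI images I*_t = I_t - I_{t-1} (Eq. 5.13).
    The first frame has no predecessor and is dropped here -- one
    plausible choice; the thesis does not state the convention."""
    rois = np.asarray(rois, dtype=float)   # (T, H, W)
    return rois[1:] - rois[:-1]

# toy sequence whose intensity grows by 1 each frame, so every
# difference image is a constant image of ones
frames = np.cumsum(np.ones((4, 2, 2)), axis=0)
diffs = difference_rois(frames)
```

The resulting $I^*_t$ frames are then passed through the same DCT/FMN/LDA cascade as the original ROIs.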
5.4.1 Static Feature Analysis
The DCT step is a coarse compression technique aimed at reducing the
dimensionality of the ROI without adhering to any class structure. As it is the first
step within the cascading algorithm, it is interesting to see how much speech
discrimination power is contained within these early features. In Figure 5.5 it
was seen that the more DCT coefficients were used to represent an ROI, the
more it resembled the original ROI. It is therefore intuitive that more features
would correspond to improved speech classification; however, this is not possible
as the HMM classifier can only handle a finite number of features. Another factor
to consider is what impact the redundant speaker information has on speech
classification and, to that end, what effect the FMN step has on speech classification.
The results are shown in Figure 5.11.³
Figure 5.11 shows that the raw DCT features do not provide much speech
classification performance, achieving a minimum WER of around 87%. However,
when the FMN step is utilised a massive improvement is gained, with a WER
of 59% achieved using 40 features. This result shows that speaker information
is an unwanted form of noise and great benefit can be gained by removing this
redundant component from the signal. Conversely, the difference DCT features
performed very well, achieving a WER of 48% using 40 features. This highlights
the importance of the temporal nature of visual speech. It is interesting to note

³It is worth noting that "DCT" refers to the results of the DCT of the input ROI and "Diff" refers to the results of the DCT of the difference ROI.
Figure 5.11: Plot showing the effect that FMN has on lipreading performance (WER % against the number of features per feature vector $M$, for the DCT, MRDCT ft, Diff and MRDiff ft features).
however, that the FMN normalisation step did not improve speech classification
for the difference features. This is because the difference ROI has already
undergone an FMN step prior to feature extraction through the subtraction of the
previous ROI; this result is to be expected as there is no redundant speaker
information to normalise for. Another interesting result to be gained from this
experiment is the impact, or lack of impact, that the number of features has on
lipreading performance. Apart from the DCT result, from 10 to 20 features it
can be seen that there is a jump in performance, and this improvement peaks at
around 40 features. There are two possible reasons for this. Either all the visual
speech information is contained in the top 40 features, or the curse of
dimensionality is having an impact on performance. As will be shown later on, it would
appear that the latter is the cause.
Placing the FMN step prior to the DCT step allows the input ROI to
be enhanced, which lets the lipreading system deal more easily with the
variations associated with illumination and pose. The next result shows that by
augmenting the feature extraction step in this way no degradation in lipreading
performance is suffered. In fact, by viewing Figure 5.12, it can be seen that
a marginal improvement in performance at some levels can be obtained by doing
the FMN in the image domain rather than the feature domain. Even though the
improvement is slight (WER of 55.75% compared to 58.89%) for MRDCT using
Figure 5.12: Plot comparing the lipreading performance (WER % against the number of features per feature vector $M$) of both the image-based and feature-based FMN methods (MRDCT ft, MRDCT im, MRDiff ft, MRDiff im).
40 features, there is minimal to no improvement obtained for the mean-removed
difference features (MRDiff). This suggests that using FMN in the image domain
would be of benefit to the overall lipreading system, and so this method was
employed for the rest of the experiments in this thesis.
Using the top $M = 100$ normalised features from the previous step,
intra-frame LDA is performed to further reduce the dimensionality of the features
whilst maintaining speech classification information. The speech classification
information is based on the class set $C$, which for these experiments was the
HMM states. The results for these experiments are shown in Figure 5.13. As can
be seen in this figure, the intra-frame LDA step reduces the best case WER
from 55.75% down to 43.39% for MRDCT and from 47.96% down to 32.37% for
MRDiff. These reductions in WER correspond to significant improvements in
lipreading performance, which highlights the importance of LDA to the task of
lipreading. As can be seen from this plot, optimal performance is gained using
N = 10 to 20 features to represent the visual signal, which is a useful result when
performing the inter-frame LDA which is the next step.
As can be seen from the results shown in this section, significant improvements
in lipreading performance can be obtained at each stage of the cascade.
It is worth noting that even though high performance is gained using just the
Figure 5.13: Plot of the lipreading results (WER % against the number of features per vector: $M$ for DCT, $N$ for LDA) showing the effect that LDA has on improving speech classification on the final static features over various values of $N$ (MRDCT im, MRDCT im/LDA, MRDiff im, MRDiff im/LDA).
static frame data, the temporal nature of visual speech does provide significantly
more speech discrimination through the use of difference ROI's. This point is
highlighted by the fact that after the intra-frame LDA the difference features
obtained a WER of 32.37% compared to 43.39%. This result is very significant
for the case where only the static feature capture can be implemented in a lipreading
system due to real-time constraints. The results in this section also highlight
the curse of dimensionality when dealing with a classifier like a HMM. In Figures
5.11 and 5.12 the best performance was obtained using 40 features. However, in
the previous result shown in Figure 5.13, it is quite obvious that using around
100 features gave more visual speech information.
5.4.2 Dynamic Feature Analysis
In these experiments many different permutations of the number of input features
to the inter-frame LDA step were used in determining what gave the best
lipreading results. Even though in the previous subsection it was found that $N = 10$ to
20 gave the optimal lipreading results for the static features, this does not
necessarily translate to being the best configuration to capture the dynamics of the
speech. This is because there has to be a balance between the number of input
Figure 5.14: Plots of the lipreading results (WER % against the number of features used per feature vector $P$) for the dynamic and final features on the MRDCT (a) and MRDiff (b) features using various values of $J$ (1 to 4) and $P$, with $N = 30$ input features.
features used $N$ and the length of the temporal window $J$. This is very important
as calculating the transformation matrix $W^{II}_{LDA}$ is computationally expensive and
there is a limit on how large the input matrix $X^{II}$ can be (a matrix of fewer than
approximately $6 \times 10^6$ elements).
For the sake of clarity only the results of the best performing configuration
using $N = 30$ for both the MRDCT and MRDiff features are shown in Figure 5.14;
a complete comparison of results using $N = 10$, 20, 30 and 40 is given in Appendix
A. As can be seen in Figure 5.14(a), the performance of the MRDCT features
improves when the temporal window $J$ is increased from 1 to 2. When the value of
$J$ is increased past 2, performance appears to level off with no real improvement
gained. From these results it appears that the best lipreading performance is
obtained when $P = 40$ features are used, even though there is no real difference
between $P = 30$ to 60 features. This is in comparison to Figure 5.14(b), where
the improvement in lipreading performance of the MRDiff features when the
value of $J$ is increased from 1 to 2 is not as large. This is because some temporal
information is already included in the MRDiff features, as they are difference
features. This can explain why there is a considerable discrepancy between the
performance of the difference ROI image and original image features for the static
features. When the value of $J$ is increased past 2, a similar flattening off of the
lipreading performance to the MRDCT features is experienced. This has resulted
in the performance of both sets of features being essentially equivalent, with the
MRDCT features obtaining a best WER of 27.66% with $J = 2$ and $P = 40$,
compared to the difference features achieving a best WER of 27.95% with $J = 3$
and $P = 50$.
Based on the analysis of the different components of the proposed visual
feature extraction system, the configuration of the baseline system which yields
optimal performance is $M = 100$, $N = 30$, $J = 2$ and $P = 40$. This configuration
is for the augmented static capture module as per Figure 5.9 and is used for the
remainder of this thesis. Also, seeing that there is no discernible advantage in
using the difference ROI images compared to the original ROI images, visual
feature extraction will only be performed on the original ROI's. These parameters
agree with those found in [142].
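The full cascade with this optimal configuration can be summarised as a dimensional sketch: $D = 16 \times 32$ ROI pixels $\to M = 100$ DCT features $\to N = 30$ after intra-frame LDA $\to (2J+1)N = 150$ stacked $\to P = 40$ dynamic features. The random matrices below are placeholders standing in for the trained DCT/LDA transforms — only the shapes are meaningful, and the edge padding at sequence boundaries is an added assumption.

```python
import numpy as np

# Dimensional walk-through of the cascade with M=100, N=30, J=2, P=40.
rng = np.random.default_rng(1)
T, D, M, N, J, P = 20, 16 * 32, 100, 30, 2, 40

rois = rng.random((T, D))                     # flattened mean-removed ROIs
W_dct = rng.standard_normal((D, M))           # stands in for the top-M 2D-DCT
W_lda1 = rng.standard_normal((M, N))          # intra-frame LDA transform
W_lda2 = rng.standard_normal(((2*J+1)*N, P))  # inter-frame LDA transform

static = (rois @ W_dct) @ W_lda1              # static features, shape (T, N)
padded = np.pad(static, ((J, J), (0, 0)), mode='edge')
stacked = np.hstack([padded[j:j+T] for j in range(2*J+1)])  # (T, (2J+1)N)
dynamic = stacked @ W_lda2                    # final dynamic features (T, P)
```

Tracing the shapes this way makes the role of each parameter in the configuration explicit.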
5.5 Making use of ROI Symmetry
As was seen in the previous section, using more features to represent the ROI
does not necessarily translate to better lipreading performance, partly due
to the curse of dimensionality. As such, it is imperative that a compact
representation of the ROI is obtained. Although the system presented in this thesis
does this to a certain extent, there are still some measures which can be taken
to maximise the amount of visual speech content captured from each ROI frame.
One such technique is to make use of the symmetrical nature of the ROI for the
frontal pose. If the ROI were perfectly symmetrical around its midpoint then only
half of the ROI would be required to represent it, as the other half would
be identical. This corresponds to a 50% reduction in the number of features that
need to be used. However, accurate ROI localisation is difficult and prone to
error, as was seen earlier. Potamianos and Scanlon [144] proposed a method
of overcoming this constraint by forcing the lateral symmetry of the ROI in the
frequency domain, exploiting the properties of the DCT by removing the odd
frequency components. The description of the algorithm shown in the following
paragraphs is adapted from the description given in [144].
The two-dimensional DCT given back in Equation 5.1 is just a one-dimensional
DCT applied to the ROI rows, followed by a one-dimensional DCT on the columns
of the result. As such, the one-dimensional DCT can be computed as

$$f_t(l) = \sqrt{\frac{2}{L}}\, c_l \sum_{i=0}^{L-1} i_t(i) \cos\frac{(2i+1)l\pi}{2L} \tag{5.14}$$

where $l = 0, 1, \ldots, L-1$ and

$$c_l = \begin{cases} \frac{1}{\sqrt{2}} & \text{for } l = 0 \\ 1 & \text{for } l \neq 0 \end{cases}$$

where $i_t(i)$ refers to the vector within the image that the one-dimensional DCT is being
applied to and $L$ is the dimension of the vector. It is evident that if the input
signal is laterally symmetric around its midpoint, $(L-1)/2$, i.e. $i_t(i) = i_t(L-i-1)$
for $i = 0, 1, \ldots, L/2-1$, then Equation 5.14 can be rewritten as
$$f_t(l) = \sqrt{\frac{2}{L}}\, c_l \sum_{i=0}^{L/2-1} i_t(i) \left[\cos\frac{(2i+1)l\pi}{2L} + \cos\left(l\pi - \frac{(2i+1)l\pi}{2L}\right)\right]$$
$$= 2\sqrt{\frac{2}{L}}\, c_l \sum_{i=0}^{L/2-1} i_t(i) \cos\frac{(2i+1)l\pi}{2L}, \quad \text{if } l \bmod 2 = 0$$
$$= 0, \quad \text{if } l \bmod 2 = 1 \tag{5.15}$$

since

$$\cos\frac{(2i+1)l\pi}{2L} = (-1)^{l} \cos\left(l\pi - \frac{(2i+1)l\pi}{2L}\right) \tag{5.16}$$
Therefore, the odd frequency DCT components of a symmetric one-dimensional
signal are all zero. Similarly, if $f_t(l) = 0$ for $l = 1, 3, \ldots, L-1$ (assuming that
$L$ is a power of 2), then the inverse DCT is

$$i_t(i) = \sqrt{\frac{1}{L}}\, f_t(0) + \sqrt{\frac{2}{L}} \sum_{l=1}^{L/2-1} f_t(2l) \cos\frac{(2i+1)l\pi}{L} \tag{5.17}$$
Figure 5.15: Examples showing the reconstructed ROI's using the top $M$ coefficients for: (a) original, (b) $M = 10$, (c) $M = 30$, (d) $M = 50$ and (e) $M = 100$. The images on top refer to the reconstructed ROI's using MRDCT coefficients. The images on the bottom refer to the reconstructed ROI's using the MRDCT with the odd frequency components removed (MRDCT-OFR).
given that the inverse one-dimensional DCT is given by

$$i_t(i) = \sqrt{\frac{1}{L}}\, f_t(0) + \sqrt{\frac{2}{L}} \sum_{l=1}^{L-1} f_t(l) \cos\frac{(2i+1)l\pi}{2L}$$

for $i = 0, 1, \ldots, L-1$.
Using the trigonometric identity given in Equation 5.16 in Equation 5.17, it
can be shown that zero odd frequency DCT components imply a symmetric
original signal, via

$$i_t(i) - i_t(L-i-1) = \sqrt{\frac{2}{L}} \sum_{l=1}^{L/2-1} f_t(2l) \left[\cos\frac{(2i+1)l\pi}{L} - \cos\left(2l\pi - \frac{(2i+1)l\pi}{L}\right)\right] = 0$$

for $i = 0, 1, \ldots, L/2-1$.
The expression derived in Equation 5.16 can be used in the visual feature
extraction process by substituting it for the normal DCT form. As this is applied
to the mean-removed ROI, this technique has been termed mean-removed DCT
with odd frequencies removed (MRDCT-OFR), as opposed to just MRDCT. Using
the inverse DCT given in Equation 5.17 on the MRDCT-OFR coefficients, the
ROI's can be reconstructed using the top $M$ coefficients. These reconstructions
are compared against the reconstructed ROI's of the MRDCT in Figure 5.15. Upon inspection
Figure 5.16: Examples showing the reconstructed half ROI's using the top $M$ coefficients from the MRDCT for each side: (a) original, (b) $M = 10$, (c) $M = 30$, (d) $M = 50$ and (e) $M = 100$. The top refers to the reconstructed images of the right side of the ROI; the bottom refers to the reconstructed images of the left side of the ROI. These images are all of size 16 × 32 pixels.
it can be seen that the MRDCT-OFR coefficients give more detail about the
ROI than the MRDCT coefficients. This is to be expected, as the MRDCT-OFR
coefficients provide twice as much information as the MRDCT. This is evident when
viewing the MRDCT-OFR ROI reconstruction using $M = 50$ features compared
to the reconstructed MRDCT ROI with $M = 100$, as they appear similar.
In the work conducted by Potamianos and Scanlon [144], they found that the
MRDCT-OFR coefficients also essentially acted as a post-processing step which
could compensate for small ROI localisation errors.
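The idea behind MRDCT-OFR can be demonstrated directly: zeroing the odd horizontal-frequency DCT components of an ROI and inverting yields a laterally symmetric image, exactly as Equations 5.15–5.17 predict. This sketch assumes an orthonormal DCT-II (SciPy's `norm='ortho'`), which matches the scaling in Equation 5.14; it is an illustration, not the thesis implementation.

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(img):
    # separable 2D-DCT: rows (axis 0) then columns (axis 1)
    return dct(dct(img, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coef):
    return idct(idct(coef, axis=1, norm='ortho'), axis=0, norm='ortho')

def remove_odd_frequencies(img):
    """Force lateral symmetry in the frequency domain (the MRDCT-OFR
    idea): zero the odd horizontal-frequency DCT components, which by
    Eq. 5.15 are exactly the components a symmetric ROI would lack."""
    C = dct2(img)
    C[:, 1::2] = 0.0     # odd columns <-> odd horizontal frequencies
    return C

roi = np.random.rand(16, 32)             # H x W mouth ROI
sym = idct2(remove_odd_frequencies(roi))
# the reconstruction is laterally symmetric about the ROI midpoint,
# and equals the symmetric part of the original image
```

Because the even-frequency basis functions are symmetric and the odd ones antisymmetric, the reconstruction is the projection of the ROI onto its laterally symmetric part, which is why small left/right localisation errors are averaged out.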
5.5.1 Experimental Results
Similar to the experiments conducted in Section 5.4, the performance of the
MRDCT-OFR features was tested at both the static and dynamic feature levels
and compared to the MRDCT features. In addition, the lipreading performance
of both the left half of the ROI and the right half of the ROI was evaluated, as
shown in Figure 5.16. This was done to see if the assumption that the ROI's were
symmetrical was valid, and to show the benefit of forcing the lateral symmetry
in the frequency domain rather than the image domain. Ideally, the results for
Figure 5.17: Results (WER % against the number of features per feature vector $M$) showing that removing the odd frequency components of the MRDCT features helps improve lipreading performance (MRDCT, MRDCT-OFR, MRDCT-Left, MRDCT-Right).
both the left and right ROI’s should be equivalent to those of the MRDCT-OFR
features. However, it is anticipated that the results for both the left and right
ROI’s will lag behind as ROI symmetry in the image domain is difficult to attain
as previously described. In terms of the visual front-end, these experiments also
give an indication of how well the ROI’s had been located and/or if there exists
a particular bias within the visual front-end. The results for the respective
features are given in Figure 5.17.
As can be seen in Figure 5.17, it is evident that removing the odd frequency
components to gain a more compact representation of the ROI gives an additional
improvement in lipreading performance. At $M = 40$ features, the MRDCT-OFR
features achieved a WER of 52.30% compared to the MRDCT features, which only
achieved a WER of 55.75%. This result is to be expected, as 40 features for the
MRDCT-OFR scheme is essentially equivalent to 80 MRDCT features, without
the dimensionality restriction enforced by the HMM coming into effect. As
anticipated, the lipreading performance of the left (60.06%) and right (60.79%)
ROI's lags somewhat behind at the same feature level.
The MRDCT features were then subjected to the intra-frame LDA with
$M = 100$. As can be seen from Figure 5.18, the improvement gained from making
use of the symmetrical nature of the ROI is nullified by the intra-frame LDA
Figure 5.18: Plot of the results (WER % against the number of features used per feature vector $N$) showing that LDA effectively nullifies the benefit of the MRDCT-OFR in the previous step (MRDCT/LDA, MRDCT-OFR/LDA, MRDCT-Left/LDA, MRDCT-Right/LDA).
step, with the MRDCT-OFR features obtaining a WER of 43.63% and the MRDCT
features an almost equal WER of 43.39%, both with $N = 10$. This suggests
that the intra-frame LDA does a good job of extracting the relevant
discriminating visual speech information, and that pre-processing steps such as removing
the odd frequency components of the symmetrical ROI are of no real
benefit. Again, the performance of both the left and right ROI features lags
behind the holistic representations, with the left obtaining a WER of 46.49%
and the right 47.04%. This backs up the initial hypothesis that symmetry in
the image domain is almost impossible to obtain due to the problems in obtaining
accurate ROI localisation, which again highlights the impact of the front-end
effect. These results may also suggest that the structure of each speaker's ROI
is important (i.e. lip rounding etc.) and that the full ROI is required for better
speech classification.
These results are also mirrored for the dynamic features, with the same pattern
emerging. Even though improved performance is gained uniformly across each
set of features, both the MRDCT and MRDCT-OFR features attain approximately
the same performance, with the MRDCT achieving its best WER of 27.66%
compared to the MRDCT-OFR's 27.75% WER, using the optimal configuration found
in the previous section of $J = 2$ and $P = 40$. This is compared to the left ROI
which gained a WER of 32.46% and the right with a WER of 33.32%. Even
though the last result suggests that there may be a small bias in the localisation
of the ROI’s towards the left hand side, the difference in performance is so slight
this can be viewed as insignificant, which indicates that the localisation of the
ROI’s was done satisfactorily.
Overall, it can be seen that making use of the symmetrical nature of the
frontal ROI, by removing the odd frequency components of the DCT coefficients,
can improve lipreading performance at an early stage. However, this step is not
really necessary when used in conjunction with LDA, as LDA appears to
do a more effective job of obtaining a compact feature representation of the ROI due
to its ability to make use of the given speech classes. Another apparent advantage of
this method is that enforcing lateral symmetry in the ROI can normalise the
ROI for variations in lighting and errors in localisation. Even though this may be
the case, no improvement in lipreading performance was gained at the final stage.
Another pertinent point is that this particular feature extraction step can only be
applied to the frontal pose, as that is the only pose where this symmetry exists.
As this thesis aims to develop a lipreading system which can be used irrespective
of pose, the ROI symmetry technique described in this section is not investigated
further.
5.6 Patch-Based Analysis of Visual Speech
Motivated by the work in the previous section, it was decided that it would be
desirable to determine which areas of the mouth are most salient for the task of
lipreading. The hypothesis behind this investigation was that if a particular
area of the ROI tends to be more useful for lipreading than the other areas,
then perhaps that area can be weighted more heavily to improve performance
over the current holistic representations. This gives rise to patch-based
analysis.
Patch-based analysis of the ROI is heavily motivated by the work conducted
in face recognition. Techniques that decompose the face into an ensemble of
salient patches have reported superior face recognition performance with respect
Figure 5.19: Examples of the ROI broken up into: (a) top, bottom, left and right side patches; and (b) 9 patches, where, starting from the top, the top band refers to patches 1, 2 and 3; the middle band to patches 4, 5 and 6; and the bottom band to patches 7, 8 and 9.
to approaches that treat the face as a whole [15, 87, 111, 121]. The idea behind
breaking the face into a series of patches is that it is easier to take into account
the local changes in appearance due to the faces complicated three-dimensional
shape, in comparison to treating it holistically [103], which is also the motivation
for this work. Furthermore, as no work like this has been conducted before in
the area of lipreading, this would be a provide an understanding as to which
areas of the ROI are more pertinent to visual speech. Apart from the recent
work by Saenko [157], the proposed multi-stream patch-based approach takes a
different path than the current methods which model the ROI in a holistic single
stream fashion. As a precursor to the work conducted in this thesis, work in
[106] demonstrated that a patch-based method of representing visual speech showed
potential. It must be noted, however, that this work was conducted on a very
small database and constrained to the task of isolated digit recognition.
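The patch decomposition described above can be sketched in code. The following is an illustrative stdlib-Python sketch only (the function name `split_roi` and the data layout are assumptions, not part of the original system): the ROI is assumed to be a 32 × 32 array of pixel intensities stored as a list of rows; the side patches halve the ROI, and the nine 16 × 16 patches are taken on an 8-pixel stride, giving a 50% overlap.

```python
def split_roi(roi):
    """Split a square ROI (list of rows) into top/bottom/left/right side
    patches and a 3x3 grid of half-size patches overlapping by 50%,
    mirroring the layouts shown in Figure 5.19."""
    n = len(roi)                     # assume a square ROI, e.g. n = 32
    h = n // 2
    sides = {
        "top":    [row[:] for row in roi[:h]],
        "bottom": [row[:] for row in roi[h:]],
        "left":   [row[:h] for row in roi],
        "right":  [row[h:] for row in roi],
    }
    stride, size = h // 2, h         # 8-pixel stride, 16x16 patches for n = 32
    grid = [[row[c:c + size] for row in roi[r:r + size]]
            for r in range(0, n - size + 1, stride)
            for c in range(0, n - size + 1, stride)]
    return sides, grid
```

For a 32 × 32 ROI this yields four 16 × 32 / 32 × 16 side patches and exactly nine 16 × 16 patches, since the starting offsets 0, 8 and 16 give three positions per axis.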
5.6.1 Experimental Results
The lipreading performance for the side patches depicted in Figure 5.19(a) is
given in Table 5.1. As can be seen from these results, all the patches achieve
reasonable lipreading performance, albeit well behind the performance of
the holistic representation. It also appears that the top of the ROI is the area
which contains the least amount of information useful for visual speech classifi-
cation.

ROI Region   Word Error Rate (WER %)
Top          38.97
Bottom       34.17
Left         32.46
Right        33.32
Holistic     27.66

Table 5.1: Lipreading performance of the various regions of the ROI.

As the lower areas of the ROI are more prone to move during an
utterance due to the jaw than the upper part of the ROI which is somewhat
fixed, this result supports the hypothesis that the movement within the ROI is
extremely important to lipreading. Although the bottom patch is more prone to
movement due to talking than the other patches, it does not contain much lip
information which is possibly why the bottom patch performance lags behind the
left and right patches.
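All results in this chapter are reported as word error rate (WER). For reference, the standard definition can be sketched as the minimum edit distance between the reference and recognised word strings, normalised by the reference length; this is a generic illustration, not code from the thesis.

```python
def word_error_rate(ref, hyp):
    """WER (%) = (substitutions + deletions + insertions) / len(ref) * 100,
    computed with a dynamic-programming edit distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * d[len(r)][len(h)] / len(r)
```

For example, recognising "one too three" against the reference "one two three four" gives one substitution and one deletion over four reference words, i.e. a WER of 50%.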
Seeing that the performance of these patches is somewhat worse than the
holistic representation, the results suggest that the full structure of the ROI is re-
quired to yield the best result. To test this, the various patches were fused
together to determine if the same lipreading performance could be obtained. For
these experiments, the following patches were combined: top and bottom, left and
right, and the top, bottom, left and right (all patches). These combinations were
fused via two methods. The first was the HiLDA feature fusion approach, whilst
the second was the synchronous multi-stream HMM (SMSHMM). The HiLDA
approach was used to see if similar performance could be gained to the holistic
representation and the SMSHMM was used to make use of the saliency of the
different patches and weight the more salient patches accordingly. Both these
approaches have dimensionality constraints to allow the HMM to converge. As
such, for these experiments the total dimensionality of the combined features was
constrained to a total of 40 features. This was also done so that a fair compar-
ison between the holistic and the combined patch-based systems of the various
patches could take place. For HiLDA fusion, all patches had their 40 features
concatenated into a single feature vector of size 160. HiLDA was then performed
on this feature vector using the same class set, C, as used previously, yielding a
final 40 dimensional feature vector. For the multi-stream approach, the top 20
features from each patch were used for the two-stream experiments, and the top
10 features from each patch were used in the four-stream experiment. The opti-
mal stream weights for the SMSHMM were found heuristically and applied to the
respective streams. The results from these experiments are given in Table 5.2.

ROI Region      HiLDA (WER %)   SMSHMM (WER %)
Top & Bottom    28.94           28.12
Left & Right    28.40           27.28
All Patches     28.31           28.41
Holistic               27.66

Table 5.2: Lipreading performance of fusing the various side patches of the ROI together.
From the results it can be seen that the holistic representation of the ROI
still outperforms the feature fusion results. This suggests that letting the patches
evolve independently over time does not improve lipreading performance. Weight-
ing the various patches with the SMSHMM does seem to have benefits when fusing
the left and right patches together, as there is a slight improvement in performance
(27.28% WER compared to 27.66%). No improvement was observed when using
the SMSHMM with all the patches; however, it must be noted that this configu-
ration was heavily constrained by the dimensionality restriction.
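The stream weighting performed by the SMSHMM can be illustrated with a small sketch. This is a simplified, hypothetical example (a single diagonal-Gaussian state per stream, with invented names; it is not the HMM configuration used in the experiments): the synchronous multi-stream HMM scores each observation by the weighted sum of per-stream log-likelihoods, log b(o) = Σ_s λ_s log b_s(o_s).

```python
import math

def stream_log_likelihood(o, mean, var):
    """Log-likelihood of one stream's observation under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))

def combined_log_likelihood(obs, models, weights):
    """Synchronous multi-stream combination: the per-state score is the
    weighted sum of stream log-likelihoods (a weighted product of likelihoods)."""
    return sum(w * stream_log_likelihood(o, m["mean"], m["var"])
               for o, m, w in zip(obs, models, weights))
```

With a weight of zero, a stream is effectively ignored; raising the weight of a poorly matching stream lowers the combined score, which is why the more salient patches are given the larger weights.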
Even though the results so far give an indication of which general areas of the
ROI are more salient than the others, they do not show what particular areas or
features of the ROI have the most impact on lipreading performance. That is, it
would be interesting to see how much visual speech information is contained in
the periphery of the ROI such as the areas around the lips and the nose. It would
also be interesting to see how much impact the lip corners have on lipreading,
as well as the center of the lips, where teeth and tongue information is prevalent.
As a result, the ROI was broken up into smaller 16 × 16 pixel patches, which were
overlapped by 50%. Examples of these patches are depicted in Figure 5.19(b),
with the results given in Table 5.3.

ROI Region     WER (%) per patch, left to right
Patches 1-3    47.53   54.80   49.19
Patches 4-6    33.98   33.94   33.46
Patches 7-9    39.92   38.55   47.86
Holistic       27.66

Table 5.3: Lipreading performance of the smaller 16 × 16 pixel patches of the ROI (overlapping by 50%).
The results of these extended experiments suggest that most visual speech
information stems from the middle band of the ROI (patches 4-6). This result is
unsurprising, as these areas of the ROI contain the most visible articulators, such
as the lips, teeth and tongue. It can be seen that the area of the ROI which
contains the least amount of visual speech information is patch 2, which contains
the nose and surrounding areas. This result also supports the initial hypothesis
that the top of the ROI is the least effective for lipreading due to its fixed nature.
These results highlight a potential problem with the holistic approach. Seeing
that most of the lipreading performance stems from the center of the ROI (patches
4-6), it is a possibility that when executing the holistic approach, some of this
speech discrimination power that exists in the center of the ROI is diminished in
an effort to incorporate all of the ROI into the representation. To see if this was
the case, it was decided to fuse the holistic representation with each of the
16 × 16 pixel patches. By doing this, it was hoped that any important informa-
tion which was lost or diminished by the holistic representation would be
reinforced by the introduction of the local patch. For these experiments, the
holistic features and one individual patch at a time were combined using a two-
stream SMSHMM, with a total dimensionality of 60: 40 for the holistic stream
and 20 for the patch stream. As 40 features was the optimal number
for the holistic approach, it was deemed appropriate to use 20 for each patch, as
HMM convergence was still obtainable with a dimensionality of 60. Again, the
optimal weights for each stream were found heuristically. The results for these
experiments are given in Table 5.4.

ROI Region     WER (%) per patch, left to right
Patches 1-3    27.70   27.98   27.67
Patches 4-6    26.84   26.76   26.79
Patches 7-9    27.02   27.15   28.21
Holistic       27.66

Table 5.4: Lipreading performance of each individual patch fused with the holistic representation of the ROI using the SMSHMM.
These results suggest that fusing each patch with the holistic representation
achieves a slight improvement over the holistic-only result for most patches (the
exception being patch 2). This appears to support the hypothesis that some
important visual speech classifying information is lost when the visual features
are calculated for the entire ROI, and this loss appears most pronounced in the
more salient regions. By fusing the features of the more salient regions with the
holistic features, some of this important local information can be retained, which
improves the overall lipreading performance. This is highlighted by the perfor-
mance of patch 5 with the holistic features, which achieves a WER of 26.76%
compared to 27.66% for the holistic representation.
Even though some improvement was obtained by fusing the salient patches of
the ROI with the holistic representation, it must be noted that a lot of extra
processing power was required to achieve this slight improvement. When imple-
mented in a full AVASR system, this type of approach would not be worth the
extra complexity, given the small gain in performance. Perhaps because the
frontal-pose ROI is symmetric, there is no real benefit in applying a patch-based
method. However, such a method may be useful for non-symmetric ROIs, such
as those found in non-frontal poses, and may be a viable research avenue.
5.7 Summary
In this chapter, visual feature extraction techniques for lipreading were investi-
gated. From the initial review on the various techniques used for visual feature
extraction, it was deemed that the appearance-based features were the represen-
tation of choice, as they are heavily motivated by human perception studies and
amenable to real-world implementation. The appearance-based features also do
not require further localisation of lip features, making them less susceptible to the
front-end effect than the contour and combination based techniques. The cur-
rent state-of-the-art appearance based visual feature extraction scheme based on
the cascading of features [142] was also presented as the baseline system for this
thesis. Each particular module of this algorithm was analysed and the lipreading
performance was also presented. It was shown that the DCT features contained
a lot of speaker information within them which is irrelevant for lipreading, and
that via FMN this irrelevant information could be removed, thus improving per-
formance. A variant of the FMN step was also presented which normalises in the
image domain rather than the feature domain, which will be useful when different
pose and illumination become a concern. It was shown that this image-based
FMN slightly outperformed the feature-based FMN.
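The FMN step summarised above amounts to subtracting the utterance-mean feature vector (or the mean ROI, in the image-domain variant) from every frame. A minimal stdlib-Python sketch, with invented names, of the feature-domain form:

```python
def feature_mean_normalise(features):
    """Feature mean normalisation (FMN): subtract the utterance-mean
    feature vector from every frame, removing the static bias (e.g. speaker
    appearance or illumination) while leaving the dynamics intact."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in features]
```

After normalisation, each feature dimension sums to zero over the utterance; the image-domain variant applies the same subtraction to the ROI pixels before the DCT rather than to the transformed features.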
As the ROI for the frontal pose is symmetrical, an algorithm presented by
Potamianos and Scanlon [144] was implemented, making use of this character-
istic. It was shown that it can improve lipreading at an early stage within the
cascading framework; however, this gain is nullified by the LDA step at a later
stage. Motivated by this work, analysis of the various regions of the ROI was then
conducted using patches, which is the first known analysis of its type. From this
novel analysis, it was found that the middle band of the ROI contained most in-
formation pertinent to lipreading whilst the top band provided significantly less
visual speech information. As a means of making use of this prior knowledge,
a novel patch-based representation of the ROI was introduced. Although im-
provement was sought by weighting the pertinent regions of the ROI, only a
slight gain was achieved. It was postulated that this would be a more effective
method for non-symmetrical ROIs, such as those found in profile views, which is
the focus of the next chapter.
Chapter 6
Frontal vs Profile Lipreading
6.1 Introduction
In the past two chapters, a lipreading system which recognises visual speech from
a speaker’s fully frontal pose was presented. This mirrors the work that has been
conducted in the field of lipreading over the past two decades. This is mainly
due to the lack of any large corpora which can accommodate poses other than
frontal. But as more work is being concentrated within the confines of a “meeting
room” [52] or “smart room” [131] environment, data is becoming available that
allows visual speech recognition from multiple views to become a viable research
avenue.
In the literature, only three studies were found to be related to lipreading
from side views. In the first work, Yoshinaga et al. [187] extracted lip infor-
mation from the horizontal and vertical variances of the optical flow of the
mouth image. In this paper, no mouth localisation or tracking was performed.
Yoshinaga et al. [188] refined their system in the second work by incorporat-
ing a mouth tracker which utilises Sobel edge detection and binary images, and
by using the lip angle and its derivative as the visual features on a limited data
set. The improvement gained from these primitive features was minimal, as ex-
pected, essentially because only two visual features were used, compared to
most other frontal-view systems that utilise significantly more features [142]. The
third study was a comprehensive psychological study conducted by Jordan and
Thomas [84]. Their findings were rather intuitive, as the authors determined that
human identification of visual speech became more difficult as the angle (from
frontal to profile view) increased.

Figure 6.1: Synchronous (a) frontal and (b) profile views of a subject recorded
in the IBM smart room (see Chapter 3). In the latter, visible facial features are
"compacted" within approximately half the area compared to the frontal face
case, thus increasing tracking difficulty.
Other than these works, no other attempts to solve the problem of lipreading
from non-frontal views have been identified in the literature. To remedy this
situation, this chapter makes a novel contribution to the field of lipreading
by presenting a lipreading system which can recognise visual speech from profile
views. This is the first real attempt at determining how much visual speech
information can be automatically extracted from profile views compared to the
frontal view. This chapter also presents the first multi-view lipreading system.
This system is able to recognise visual speech from two or more cameras which
capture the different views of a speaker synchronously.
The task of recognising visual speech from a profile view is in principle very
similar to that of the frontal view, requiring the system to first locate and track
the mouth ROI
and subsequently extract the visual features. However, this problem is far more
complicated than the frontal case because the facial features which are required to
be localised and tracked lie in a much more limited spatial plane, as can be viewed
in Figure 6.1. Clearly, much less data is available compared to that of a fully
frontal face, as many of the facial features that are of interest (i.e. eyes, mouth,
chin area etc.) are fully or partially occluded. In addition, the search region for all
visible features is approximately halved, as the remaining features are compactly
confined within the profile facial region. These facts remove redundancy in the
facial feature search problem, and therefore make robust ROI localisation and
tracking a much more difficult endeavour.
Nevertheless, ROI localisation and tracking can still be achieved by employ-
ing the visual front-end based on the Viola-Jones [180] algorithm presented in
Chapter 4. All that varies is the selection of facial features to locate and track.
Once these selections have been made, the associated classifiers can be trained
and the visual front-end can be developed. Once the ROIs have been extracted,
the rest of the lipreading system is the same as in the frontal case. The develop-
ment of a visual front-end which can extract profile mouth ROIs is described in
the next section. Following this, the lipreading performance of the profile view
is presented. These results are compared against the frontal view. Similar to the
previous chapter, patch-based analysis is then performed on the profile data to
determine which areas of the ROI are more pertinent to the task of lipreading.
The chapter then concludes by introducing the first known multi-view lipreading
system.
6.2 Visual Front-End for Profile View
The visual front-end for the profile view was developed in a similar manner to
its synchronous frontal counterpart. Due to the compactness of the facial
features within the dataset, only 7 of the 17 manually labeled facial features were
used. These were the left eye, nose, top of the mouth, center of mouth, bottom of
the mouth, left mouth corner and chin, as depicted in Fig. 6.2. Like the frontal
data, a set of 847 images for training and 37 images for validation were available
to develop the profile visual front-end 1. This provided 847 positive examples
for all 7 facial features. The resulting face training set included rotations in
the image plane by ±5 and ±10 degrees, providing 4235 positive examples. A
similar amount of negative examples of the background were also employed in
the training scheme. Approximately 5000 negative examples were used for each
1The 847 training images and 37 validation images were the synchronous counterparts to the frontal images used to train and test the frontal visual front-end in Chapter 4.6.
facial feature. These negative examples consisted of images of the other facial
features that surrounded its location, as these would be most likely to cause false
alarms, as per the frontal visual front-end.

Figure 6.2: Example of the points labeled on the face: (a) left eye, (b) nose,
(c) top mouth, (d) mouth center, (e) bottom mouth, (f) left mouth corner, and
(g) chin. The center of the depicted bounding box around the eye defines the
actual feature location.
One difficulty experienced was selecting appropriate facial feature points to
use for the training image normalisation (scaling and rotation). In the frontal face
scenario, the eyes are predominantly used for this task, but in the profile-view case
there is no such pair of geometrically aligned features to choose from. Instead,
the nose and the chin were used, with a normalised constant (K) distance of 64
pixels between them. This choice was dictated by the head-pose variation within
the dataset, which had less of an effect on this metric than on other possibilities
(such as the eye-to-nose distance).
bottom mouth and left mouth corner were trained on templates of size 10 × 10
pixels, based on normalised training faces. Both nose and chin classifiers were
trained on templates of size 15×15 pixels, and the eye templates were larger,
20×20 pixels. Examples of these facial feature templates are given in Figure 6.3.
The normalised positive face examples were templates of size 16× 16. Examples
of these face templates are shown in Figure 6.4.
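The scale normalisation described above reduces to computing a single factor from one pair of annotated points. An illustrative sketch (the function name and point convention are assumptions, not from the original implementation): for the profile training images, the image is rescaled so that the nose-to-chin distance equals the constant K = 64 pixels.

```python
import math

def scale_for_normalisation(p1, p2, K=64.0):
    """Return the factor that rescales an image so the distance between the
    two annotated normalisation points (e.g. nose and chin) equals K pixels."""
    d = math.hypot(p2[0] - p1[0], p2[1] - p1[1])
    return K / d
```

For example, a face with a 128-pixel nose-to-chin distance would be scaled by 0.5; the same routine serves later in the front-end when the left eye and left mouth corner are used instead, with K changed from 64 to 45.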
Figure 6.3: Examples of the facial feature templates of the profile view used to train the respective facial feature classifiers.
Facial Feature Accuracy (%)
Left Eye 86.49
Nose 81.08
Top Mouth 78.37
Center Mouth 81.08
Bottom Mouth 72.97
Left Mouth Corner 86.49
Chin 62.16
Table 6.1: Facial feature localisation accuracy results on the validation set of profile images.
Due to the lack of manually labeled faces available, all classifiers were tested
on a small validation set of 37 images which were the synchronous profile view
images of the validation set in the frontal domain. The localisation results of
the various facial features from this validation set gave an indication of what
particular features would give the best chance of reliably tracking the localised
features. These results are shown in Table 6.1. A similar performance metric to
the frontal scenario was also employed, whereby a feature was not considered
located if the location error was larger than 10% of the annotated distance between the nose
and the chin. From these localisation results, it can be seen that along with the
left eye, the left mouth corner yielded the best performance. This is somewhat
surprising, as a close-talking microphone was located near the left mouth corner
for all the speakers in the IBM smart-room database; an example of this can be
seen in Figure 6.3. This shows the usefulness of using a corner for facial feature
localisation, as it provides a unique shape within the face which is hard to confuse
with other objects.

Figure 6.4: Examples of the profile face templates used to train the profile face classifier.
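The localisation criterion used for Table 6.1 can be sketched directly. An illustrative stdlib-Python version (names invented), assuming 2-D points as (x, y) tuples: a feature counts as located when its error is at most 10% of the annotated nose-to-chin distance.

```python
import math

def is_localised(pred, truth, nose, chin, tol=0.10):
    """Localisation criterion from the text: a predicted feature location is
    accepted when its distance from the annotated truth does not exceed
    10% of the annotated nose-to-chin distance."""
    err = math.hypot(pred[0] - truth[0], pred[1] - truth[1])
    ref = math.hypot(chin[0] - nose[0], chin[1] - nose[1])
    return err <= tol * ref
```

With the normalised 64-pixel nose-to-chin distance, this admits errors of up to roughly 6.4 pixels.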
As the left eye and left mouth corner yielded the best results, it was decided
to use these two points for scale normalisation. The only difference between using
the left eye and left mouth corner, compared to the nose and chin is changing
the scaling factor K from 64 to 45. The face localisation accuracy on this test
set was 100%. As no manual labels for the face bounding box were available, the
accuracy was determined by visual inspection.
[Figure 6.5 shows the processing flow: video in → face localisation (face classifier, 16 × 16 templates) → define normalised search regions → localise eye (20 × 20 classifier) and nose (15 × 15 classifier) → calculate rescaling metric1; if the metric is outside limits, lengthen/shorten the face box → localise the mouth region below the nose (32 × 32 mouth classifier) → localise the left mouth corner (10 × 10 classifier) → calculate ROI rescaling metric2 → normalise the ROI (48 × 48) based on the left mouth corner → downsample the ROI (32 × 32) → track the mouth with smoothing, retracking every frame → tracked mouth (32 × 32).]

Figure 6.5: Block diagram of the face and mouth localisation and tracking system for profile views.
The final profile ROI localisation and tracking visual front-end is outlined in
Figure 6.5. Given the video of a spoken utterance, face localisation is first applied
to estimate the location of the speaker’s face at different scales as the face size
is unknown. Once the face was located, the left eye and nose were searched over
specific regions of the face (based on training data statistics). During the devel-
opment of this system, it was found that the bottom of the face bounding box
was often far
below the bottom of the subject’s actual face, or well above it. As the face box
defines the search region for the various facial features, this caused the system to
miss locating the lower regions of the face. To overcome this, the ratio (metric1 )
of the vertical eye to nose distance, over the vertical nose to bottom of the face
bounding box distance was used. If metric1 was below a fuzzy threshold (again
determined by training statistics), the box was lengthened, or if it was above
the threshold then it was shortened. It was found that this greatly improved the
localisation of the generalised mouth area (trained on normalised 32× 32 mouth
images), which was located next. This step is illustrated in Figure 6.6(b).
Once the generalised mouth region was found, the left mouth corner was
located. The next step was to define a scaling metric, so that all ROI images would
be normalised to the same size. As mentioned previously, the ratio (metric2 ) of
the vertical left eye to left mouth corner distance over some constant K (45) was
used to achieve this. A (48 × 48) · metric2 normalised ROI based on the left
mouth corner was then extracted (see Figure 6.6). The ROI was then downsam-
pled to 32 × 32 for use in the lipreading system.

Figure 6.6: (a) An example of face localisation. (b) Based on the face localisation
result, a search area to locate the left eye and nose is obtained; the face box is
lengthened or shortened according to metric1 = y1/y2. (c) The left mouth corner
is located within the generalised mouth region; the ratio metric2 = y/K is then
used for normalising the ROI. (d) An example of the scale-normalised located
ROI of size (48 × 48) · metric2 pixels.
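The two rescaling metrics used by this front-end can be sketched using vertical coordinates only, as in the text. An illustrative sketch (function names are invented; y-coordinates increase downwards):

```python
def face_box_metric(eye_y, nose_y, box_bottom_y):
    """metric1: vertical eye-to-nose distance over the vertical nose-to-
    box-bottom distance, used to decide whether the face bounding box
    should be lengthened or shortened."""
    return (nose_y - eye_y) / (box_bottom_y - nose_y)

def roi_side_length(eye_y, mouth_corner_y, K=45.0, base=48):
    """metric2: vertical left-eye to left-mouth-corner distance over K (45);
    the extracted ROI is (48 x 48) * metric2 pixels before being
    downsampled to 32 x 32."""
    metric2 = (mouth_corner_y - eye_y) / K
    return round(base * metric2)
```

For example, an eye-to-mouth-corner distance of exactly 45 pixels gives metric2 = 1 and a 48 × 48 ROI, while a distance of 90 pixels gives a 96 × 96 ROI; both are then downsampled to 32 × 32.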
Following ROI localisation, the ROI is tracked over consecutive frames. If the
located ROI is too far away from the previous frame's location, it is regarded as a
failure and the previous ROI location is used. A mean filter is then used to smooth the
tracking. Due to the speed of the boosted cascade of classifiers, this localisation
and tracking scheme is used for every frame.
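The tracking step above can be sketched as outlier rejection followed by a moving-average filter. This is a simplified illustration (the jump threshold and window size are placeholders, not the values used in the actual system):

```python
def smooth_track(points, max_jump=8.0, window=3):
    """Tracking sketch: a detection that jumps too far from the previous
    frame is treated as a failure and the previous location is reused;
    the resulting track is then smoothed with a mean (moving-average) filter."""
    track = [points[0]]
    for p in points[1:]:
        prev = track[-1]
        jump = ((p[0] - prev[0]) ** 2 + (p[1] - prev[1]) ** 2) ** 0.5
        track.append(p if jump <= max_jump else prev)
    half = window // 2
    smoothed = []
    for i in range(len(track)):
        lo, hi = max(0, i - half), min(len(track), i + half + 1)
        xs = [track[j][0] for j in range(lo, hi)]
        ys = [track[j][1] for j in range(lo, hi)]
        smoothed.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return smoothed
```

A single wild detection (e.g. a false alarm on the background) is thus replaced by the previous location before the mean filter is applied, so it cannot drag the track away from the mouth.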
Overall, the accuracy of the profile visual front-end was very good, with only
a small number of sequences in the dataset being poorly tracked. These poorly
tracked sequences were not used for the lipreading experiments 2. A major factor
affecting performance was random head movement and some head-pose
variability, where some subjects exhibit a somewhat more frontal pose than the profile
2Across the total 1700 available synchronous pairs of sequences in the IBM smart-room database, 1661 pairs of sequences were used for the lipreading experiments, with 39 synchronous pairs being omitted due to poor tracking. For the sake of comparison, both the frontal and profile sequences had to be accurately tracked for them to be used in this thesis. Evaluating the accurately tracked sequences was done via manual inspection.
view of the majority of the subjects – see also Figure 6.7, where examples of
accurately and poorly tracked ROIs are depicted. The latter is also the reason
why no rotation normalisation was employed. Many different configurations were
trialled; however, all seemed to cause more problems than they solved. Rotating
the ROI according to the left eye to left mouth corner angle was also tried;
however, the many different head poses made this very problematic. Another
attempt was to rotate the ROI using the angle between the mouth center and
the left mouth corner. This also failed, as the distance between these two points
was too small (approximately 20 pixels), and any slight mistake in the localisation
phase gave large errors.

Figure 6.7: Examples of accurate (a-d) and inaccurate (e,f) results of the local-
isation and tracking system. In (f), it can be seen that the subject exhibits a
somewhat more frontal pose compared to the profile view of the other subjects.
6.3 Profile vs Frontal Lipreading
Following extraction of the profile mouth ROI image from each frame, the same
visual feature extraction process based on a cascade of appearance features as
used for the frontal view was applied to the synchronous profile view data. The
profile features were modeled using a HMM with the same topology and train/test
sequences as the frontal data (see Chapters 3.5 and 3.6.2 for full details). Figure
6.8 shows the lipreading performance of the profile features for the three stages
of the static feature capture (i.e. DCT, MRDCT and intra-frame LDA steps)
compared to their frontal synchronous counterparts. It appears from these results
that the same trend observed in the frontal pose also occurs in the profile pose:
the FMN step of removing the mean ROI greatly improves speech classification
over the DCT features (WER of 64.02% compared to 87.12% for M = 40), and
the intra-frame LDA step again improves performance (WER of 56.61% for P = 10
compared to 64.02% for M = 40). These profile results are, however, significantly
worse than those of the frontal pose for all numbers of features. This result was
expected, in line with the human lipreading experiments reported in [84].

Figure 6.8: Results comparing the frontal and profile lipreading performance at
various stages of the static feature capture, plotting WER (%) against the number
of features per feature vector (M for DCT & MRDCT, N for LDA) for the DCT,
MRDCT and LDA features of each view.
When the temporal information is included via the inter-frame LDA step, the
profile speech features follow the same trend as in the frontal domain, as can be
seen in Figure 6.9 3. For the profile features (Figure 6.9(b)), it can be seen that the
lipreading performance improves when the temporal window J is increased from
1 to 2. When the value of J is increased past 2, the performance appears to level
off with no real improvement gained from using a larger temporal window. From
this plot it can be seen that the best lipreading performance for the profile view
is gained with P = 40 features and J = 2, achieving a WER of 38.88%. This is
compared to the frontal view, where the best WER of 27.66% was also achieved
using P = 40 features and J = 2 (Figure 6.9(a)). From these results, the difference
in lipreading performance between the synchronous frontal and profile views can
be quantified in terms of WER as 11.22% absolute.
3It is worth noting that the best lipreading performance was obtained using N = 30 input features, which was also the case for the frontal view.

Figure 6.9: Comparison of the lipreading performance between the (a) frontal
and (b) profile dynamic and final features, plotting WER (%) against the number
of features per feature vector (P) for J = 1, 2, 3, 4, using N = 30 input features.
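The inter-frame LDA step referred to above forms its input by stacking each frame's features with those of its J neighbours on either side. A minimal stdlib-Python sketch of this stacking (edge handling by repeating the boundary frames is an assumption; the exact padding used in the thesis is not specified here):

```python
def stack_temporal(features, J):
    """Build the inter-frame LDA input by concatenating each frame with its
    J neighbours on either side (2J + 1 frames in total), repeating the
    boundary frames at the edges of the utterance."""
    T = len(features)
    stacked = []
    for t in range(T):
        window = []
        for k in range(t - J, t + J + 1):
            window.extend(features[min(max(k, 0), T - 1)])
        stacked.append(window)
    return stacked
```

With N = 30 static features per frame and J = 2, each stacked vector has 5 × 30 = 150 dimensions, which the inter-frame LDA then projects down to the final P features.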
The discrepancy between the frontal and profile lipreading performances can
be attributed to the number of visible articulators that the lipreading system
has exposure to in the respective views. For example, in the frontal
scenario, the lipreading system has full exposure to all the possible visible artic-
ulators such as teeth, lips, tongue and jaw. Conversely, for the profile view, the
lipreading system only has the lips and jaw available, which is only a portion of
the visual information. Another restriction with extracting visual speech from
the profile view lies in the background. In the frontal scenario, a slight locali-
sation/tracking error does not cause significant appearance change, due to the
somewhat uniform background around the lips (i.e., the skin of the speaker). In
contrast, in the profile case, poor localisation/tracking may capture an excessive
amount of the background behind the speaker’s mouth causing the appearance of
the ROI to look unlike a speaker's mouth. This non-uniformity in the profile view
may also be a cause of the degradation in lipreading performance, which suggests
that the profile view may be more susceptible to the front-end effect than the
frontal view.
Considering the difficulties that lie within extracting visual speech from the
profile view, a lipreading performance of 38.88% WER is still an extremely useful
result. There are many benefits associated with using the profile view. Firstly,
the profile view may be the only view available. Such a situation could
arise in a car scenario where the only place a camera can be placed is to the
side of the driver. Even though the profile view does not contain as much visual
speech information as the frontal view, in this situation a WER of 38.88% is
still much better than pure chance, especially if this information
was to be fused with the audio channel. Although the combined audio-visual
scenario is outside the scope of this thesis, it is worth mentioning that when the
profile visual speech data was fused with the audio channel, substantial gains
were achieved over the audio-only results, especially in the presence of noise (see
[105] for full details).
Another benefit of using the profile visual speech information might lie in
combining different viewpoints to form a combined representation of visual
speech. As the various views contain essentially different information, there
may be complementary information which exists in one view but not in the
other. Fusing the various views together may therefore give an improvement
over the dominant viewpoint. This is examined at the end of the chapter, with
the introduction of a multi-view lipreading system.
The multi-view system is developed by combining the profile features together
with the frontal features in an attempt to achieve better lipreading performance
than the frontal-only system. The next section however, performs patch-based
analysis of the profile ROIs to determine which areas of the ROI are more perti-
nent for profile lipreading.
6.4 Patch-Based Analysis of Profile Visual Speech
An interesting point to note is that the profile view is not laterally symmetrical,
as the frontal view was. This constitutes a rather different problem, as some
areas of the profile ROI might contain much more visual speech information than
others. As with the patch-based experiments conducted for the frontal view,
weighting the more pertinent patches of the profile view higher than the other
areas may be of benefit in the profile scenario. In
Figure 6.10: Examples of the ROI broken up into: (a) top, bottom, left and right side patches; and (b) 9 patches, where the top band refers to patches 1, 2 and 3; the middle band to patches 4, 5 and 6; and the bottom band to patches 7, 8 and 9.
ROI Region    Word Error Rate (WER %)
Top           56.94
Bottom        46.86
Left          51.19
Right         45.29
Holistic      38.88

Table 6.2: Lipreading performance of the various regions of the profile ROI.
this subsection, patch-based analysis of the profile view is undertaken to explore
this possibility. The patches are numbered in the same fashion as for the frontal
pose, although they correspond to different features, as the examples in Figure
6.10 show.
The lipreading performances for the profile side patches depicted in Figure
6.10(a) are given in Table 6.2. From this, all the patches achieve reasonable
lipreading performance, albeit well behind the performance of the holistic
representation. As with the frontal pose, the top of the ROI seems to give the
least amount of useful visual speech information, again probably due to the lack
of movement within this region compared to the other regions. Also, the left patch
appears to be more useful than the right. A possible reason for this could be the
ROI Region      HiLDA (WER %)    SMSHMM (WER %)
Top & Bottom    41.50            39.97
Left & Right    40.31            39.56
All Patches     41.22            40.56
Holistic        38.88

Table 6.3: Lipreading performance of fusing the various side patches of the profile ROI together.
fact that the right patch contains a lot of the background and little lip
information, as depicted in Figure 6.10(a). Fusing these patches together using
both the HiLDA and SMSHMM integration strategies did not yield improved
performance over the holistic representation, as given in Table 6.3. It is worth
noting that all these patch-based experiments were conducted identically to those
for the frontal pose (see Chapter 5.6.1 for the experiment description).
The results for the 50% overlapping, smaller 16 × 16 patches depicted in
Figure 6.10(b) are given in Table 6.4. These results show that the regions
containing the lips and jaw (patches 5, 6 and 8) are the most useful for
lipreading. This again supports the hypothesis that the movement of the visible
articulators is of most benefit to recognising visual speech. As in the frontal
case, the nose region appears to be of little value for lipreading (patch 2), as
do the regions which contain the background (patches 1 and 7) or the skin around
the lips (patches 3 and 9), although the relative importance of these patches
cannot be quantified yet. This is because some patches may contain important lip
information which is only evident occasionally; for example, the background
patches 1 and 7 may contain important lip protrusion information, which may be
complementary to the frontal pose.
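The decomposition of the ROI into the 3 × 3 grid of 50%-overlapping 16 × 16 patches can be sketched as follows; the function name is illustrative and the 32 × 32 ROI size is taken from the image dimensions used later in the thesis:

```python
import numpy as np

def extract_patches(roi, patch=16, overlap=0.5):
    """Split a square ROI into overlapping square patches, row by row.

    With a 32 x 32 ROI, 16 x 16 patches and 50% overlap this yields the
    3 x 3 grid of nine patches (numbered 1-9) shown in Figure 6.10(b).
    """
    stride = int(patch * (1.0 - overlap))   # 8-pixel step for 50% overlap
    patches = []
    for top in range(0, roi.shape[0] - patch + 1, stride):
        for left in range(0, roi.shape[1] - patch + 1, stride):
            patches.append(roi[top:top + patch, left:left + patch])
    return patches

roi = np.arange(32 * 32, dtype=float).reshape(32, 32)  # stand-in mouth ROI
patches = extract_patches(roi)
print(len(patches))        # 9
print(patches[4].shape)    # (16, 16): patch 5, the centre of the ROI
```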
To determine whether any information in the holistic representation is lost by
including the less pertinent areas of the profile ROI, each of the patches was
fused with the holistic representation using the SMSHMM. The results for these
experiments are shown in Table 6.5. As in the frontal scenario, only a
slight improvement over the holistic-only performance is gained from fusing the
                WER (%)
Patches 1-3     69.94    64.97    61.93
Patches 4-6     55.32    48.48    49.38
Patches 7-9     58.60    49.67    66.49
Holistic        38.88

Table 6.4: Lipreading performance of the smaller 16 × 16 pixel patches of the profile ROI (overlapping by 50%); each row lists the WER of the three patches in that band, in order.
                WER (%)
Patches 1-3     39.83    39.34    39.20
Patches 4-6     39.04    38.51    38.89
Patches 7-9     39.27    38.91    39.53
Holistic        38.88

Table 6.5: Lipreading performance of each individual patch fused with the holistic representation of the profile ROI using the SMSHMM; each row lists the result for the three patches in that band, in order.
middle patch (patch 5) with a WER of 38.71% compared to 38.88%. For all the other
patches, similar or worse performance was obtained with this multi-stream
approach, which suggests that little or no additional information is gained by
using it.
Even though some improvement was obtained by fusing the salient patches of the
profile ROI with the holistic representation, the additional performance expected
from weighting the various parts of the ROI was not as great as hoped, with only
a marginal improvement in performance obtained. Hence, this type of approach
would not be viable in a full AVASR system, as the extra complexity required to
implement this multi-stream patch approach would not be worth the negligible
gain.
6.5 Multi-view Lipreading
From Section 6.3, it was shown that the frontal view contains far more visual
speech information than the profile view (WER of 27.66% compared to a WER
of 38.88%). Even though there is a significant difference in the lipreading
performance between the two views (11.22% absolute), there may exist some elements
in the profile view which are not found in the frontal view. As such, it would
seem likely that fusing the holistic representations of the frontal and profile
features together would improve the overall visual speech intelligibility. As
this multi-modal approach is the key motivation behind AVASR, it is intuitive to
follow a similar path by fusing these two synchronous views together. This gives
rise to multi-view lipreading. Formally, multi-view lipreading can be defined as
the scenario in which there are two or more synchronous views of a speaker's ROI
which can be fused together to form a combined representation of the visual
speech. A block diagram of the multi-view lipreading system is depicted in
Figure 6.11.
The multi-view system presented in this thesis uses two methods of combining
the views. The first was the HiLDA feature-fusion approach, whilst the second
was the synchronous multi-stream HMM (SMSHMM). As mentioned previously,
the HiLDA approach is a single-stream approach, so the two views cannot be
weighted. In contrast, the SMSHMM is a multi-stream approach, which allows the
different views to be weighted. Both of these approaches have dimensionality
constraints to allow the HMM to converge. As such, for these experiments the
total dimensionality of the combined features was constrained to 40. This was
also done so that a fair comparison between the multi-view and single-view
systems could take place.
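As an illustrative sketch of HiLDA-style feature fusion, the two streams can be concatenated and then reduced with a plain linear discriminant analysis implemented in NumPy. The data, labels and dimensions below are toy stand-ins, not the thesis's actual features or class set C; with a large class set, up to 40 discriminants would be available, whereas the 5-class toy set here gives at most 4:

```python
import numpy as np

def lda_projection(X, y, n_components):
    """Plain LDA: leading eigenvectors of pinv(Sw) @ Sb, where Sw and Sb
    are the within- and between-class scatter matrices."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:n_components]]

# Toy stand-ins for the two synchronous 40-dimensional feature streams.
rng = np.random.default_rng(0)
n = 500
labels = rng.integers(0, 5, size=n)            # e.g. HMM-state class labels
frontal = rng.normal(size=(n, 40)) + labels[:, None]
profile = rng.normal(size=(n, 40)) - labels[:, None]

fused = np.hstack([frontal, profile])          # 80-dimensional concatenation
P = lda_projection(fused, labels, n_components=4)  # at most n_classes - 1 dims
features = fused @ P                           # final fused feature vectors
print(features.shape)
```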
For the multi-view lipreading system experiments, the parameters which yielded
the best-performing visual features for each view were used. For both views these
were M = 100, N = 30, J = 2 and P = 40 (the final number of features used).
For HiLDA fusion, the two 40-dimensional feature streams were concatenated into
a single feature vector of dimensionality 80. HiLDA was then performed on this
feature vector using the same class set C as used previously, yielding a final
40-dimensional feature vector. For the multi-stream approach, the top 20 features from each
Figure 6.11: Block diagram depicting the various lipreading systems that can function when two cameras synchronously capture a speaker from different views. The lipreading system can use only one view (either frontal or profile in this case), or combine both views to form a multi-view lipreading system (depicted by the dashed lines and bold typeface). The multi-view features can either be fused at an early stage using feature fusion, or at the intermediate level via a synchronous multi-stream HMM (SMSHMM).
view were used as the input to the SMSHMM, with weights α for the frontal view
and 1 − α for the profile view used as the respective stream weights. The optimal
stream weight was found heuristically to be α = 0.8. The multi-view
lipreading results for both approaches are given in Table 6.6.
From these results, it can be seen that combining the two views using both
of the fusion methods mentioned gave an improvement in lipreading performance
over the frontal-view system. In particular, the SMSHMM approach achieved the
best performance with a WER of 25.36%, an improvement of 2.3% absolute over
the frontal result. This result is quite significant, as it demonstrates that
there exists some information in the profile view which is not captured by the
frontal view. Although it is not known for certain, the added information could
be that of lip protrusion. This seems to be the most likely explanation, as lip
protrusion information is only contained within the profile view.
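The intermediate-level combination used by the SMSHMM can be sketched as a weighted sum of the per-state emission log-likelihoods of the two synchronous streams, with exponents α and 1 − α; the log-likelihood values below are hypothetical:

```python
import numpy as np

def combined_log_likelihood(ll_frontal, ll_profile, alpha=0.8):
    """Synchronous multi-stream score combination: the two streams stay
    state-synchronous and their emission log-likelihoods are weighted
    by alpha and 1 - alpha respectively."""
    return alpha * ll_frontal + (1.0 - alpha) * ll_profile

# Hypothetical per-state emission log-likelihoods for one observation frame.
ll_f = np.array([-12.0, -9.5, -15.2])    # frontal stream, 3 HMM states
ll_p = np.array([-14.0, -13.0, -10.0])   # profile stream, same 3 states

ll = combined_log_likelihood(ll_f, ll_p, alpha=0.8)
best_state = int(np.argmax(ll))
print(best_state)   # with alpha = 0.8 the frontal stream dominates the choice
```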
Viewpoint              WER (%)
Frontal                27.66
Profile                38.88
Multi-view (HiLDA)     27.50
Multi-view (SMSHMM)    25.36

Table 6.6: Multi-view lipreading performance compared against the single-view performance.
The multi-view system presented in this section assumed that both camera
views had the speaker’s mouth in frame for the entire set of utterances. This
may not be the case in a realistic scenario though, as the speaker may randomly
move in and out of shot for a portion of time, or one particular view may be
partially or fully occluded by some object, such as the speaker’s hand. In a single
camera system, this would mean that the visual speech information would be lost.
However, using a multi-view type approach, there would be more of a chance that
the speaker’s mouth would be in at least one of the camera views. This highlights
another benefit of employing a multi-view lipreading system. Future research is
needed to look into this aspect of the multi-view lipreading system.
6.6 Summary
In this chapter, a lipreading system capable of extracting and recognising
visual speech information from profile views was presented. These results
were compared to their synchronous counterparts in the frontal view. This
constituted the first published work which quantifies the performance degradation
of lipreading in the profile view compared to the frontal view. In the
experiments presented, it was demonstrated that profile views contain significant
visual speech information, although, as expected, less than the frontal view.
As the "multi-view" experiments demonstrated, this profile information was found
not to be totally redundant with respect to the frontal video. In addition to
this work, patch-based analysis was also conducted on the profile view. From
the results, it was shown that, as with the frontal view, the most pertinent
area of the ROI in terms of visual speech was the centre.
Chapter 7
Pose-Invariant Lipreading
7.1 Introduction
In the previous chapter, lipreading was applied across multiple views. This gave
rise to the multi-view lipreading system, which refers to a scenario in which
there are two or more views of a speaker's lips which can be combined to form
a representation of the visual speech. From those experiments, it was shown
that an improvement in lipreading performance can be gained when the views are
of different poses (frontal and right profile). However, the multi-view work was
constrained by each viewpoint having its own dedicated lipreading system (i.e.
two systems, one dedicated to the frontal view and another to the profile view).
A more "real-world" solution to this problem would be to have a single lipreading
system recognise visual speech regardless of head pose.¹ This particular problem
is termed pose-invariant lipreading. Formally, pose-invariant lipreading can be
defined as the ability of a lipreading system to recognise visual speech across
multiple poses given a single camera. An example of this is given in Figure 7.1.
Pose-invariant lipreading can occur either when the speaker is stationary (i.e.
the speaker is fixed in one particular pose for the duration of the utterance)
or continuous (i.e. the speaker is not restricted to any one pose and can move

¹In this thesis, the term pose is used instead of view so as to distinguish lipreading systems using a single camera from those using multiple cameras. The term pose denotes the head position of a speaker when only one camera is used in the lipreading system. Conversely, the term view denotes the head position of a speaker in each of the cameras when more than one camera is used in the lipreading system.
Figure 7.1: Given one camera, the lipreading system has to be able to lipread from any pose. In this example, those poses are either frontal or profile poses.
their face during the spoken utterance). The former is the focus of the first
part of this chapter, whilst the latter, referred to as continuous pose-invariant
lipreading, is heavily dependent on the accuracy of the visual front-end, which
incorporates a pose-estimator. This will be investigated at the end of this
chapter.
The implications of a pose-invariant lipreading system are of major benefit to
lipreading and AVASR in general. By loosening the constraint on the speaker's
pose, a much more pervasive or "real-world" technology can develop, which would
be of major benefit to in-car AVASR, for example. However, by allowing more
flexibility in the system, more complexity is also introduced. A possible
solution would be to model and recognise each pose independently of the others,
thus minimising the train/test mismatch. Unfortunately, this is complicated to
achieve in a continuous setting, so a one-model-for-all approach is usually
employed. Having one model which generalises over all poses is also problematic,
as it may over-generalise, causing a large train/test mismatch.
Train/test mismatch can drastically affect the performance of a classifier.
Given that only one model is used, if some sort of invariance in the feature
space of the input signal can be provided, then the entire system will benefit.
A number of approaches have been devised in the acoustic speech domain to lessen
the train/test mismatch caused by channel conditions and noise, such as cepstral
mean subtraction (CMS) [110] and RASTA processing [74]. This type of approach
has been used similarly in the visual domain for face recognition, where
techniques such as linear regression have been used to project an unwanted
non-frontal face image into a frontal face image. Blanz et al. [9] cite the
major advantage of doing this as being that most state-of-the-art face
recognition systems are optimised for frontal poses only, and their performance
drops significantly if the faces in the input images are shown from non-frontal
poses, due to the large train/test mismatch. Linear regression has also been
used in AVASR, with Goecke et al. [58] using a linear regression matrix to
obtain an estimate of the clean audio features from a combined audio-visual
feature vector for audio-only speech enhancement.
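For reference, CMS itself is a one-line normalisation: the per-utterance mean of each feature dimension is subtracted, removing stationary channel effects. A minimal sketch with synthetic features:

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract the per-utterance mean of each feature dimension (CMS),
    removing stationary channel effects from the utterance."""
    return features - features.mean(axis=0, keepdims=True)

# Synthetic utterance: 200 frames of 13 cepstral coefficients with an offset.
utterance = np.random.default_rng(1).normal(loc=3.0, size=(200, 13))
normalised = cepstral_mean_subtraction(utterance)
print(np.allclose(normalised.mean(axis=0), 0.0))   # True
```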
Motivated by these works, this chapter describes a "pose-invariant" lipreading
system which makes use of linear regression to normalise the visual speech
features into a single pose. As no prior work on pose-invariant lipreading has
been carried out, this chapter is also concerned with investigating the various
problems associated with this task.
7.2 Pose-Invariant Techniques
Blanz et al. [9] cite two possible ways of performing pose-invariant face
recognition: either via a viewpoint-transformed or a coefficient-based approach.
The viewpoint-transform approach acts in a pre-processing manner to
transform/warp a face representation (i.e. an image or feature vector) of an
unwanted pose into the desired pose. Coefficient-based recognition attempts to
estimate the face representation under all poses given a single pose (i.e.
frontal and profile in this case), otherwise called the lightfield of the face
[67]. Although it is not clear which approach is superior, for the lipreading
system presented in this thesis the viewpoint-transform approach was employed.
The reason for this choice is that almost all lipreading systems to date have
been optimised for the frontal pose (such as the system described in Chapters 4
and 5). This is similar to the motivation cited by Blanz et al. [9] for their
face recognition system. The most common way to perform the viewpoint transform
is via linear regression. This is described in the following subsection.
Figure 7.2: Schematic of the proposed pose-invariant lipreading scheme. Visual speech features x_n extracted from an undesired pose (e.g. profile) are transformed into visual features t_n in the target pose space (e.g. frontal), via t = Wx, where the linear regression matrix W is calculated offline based on synchronised multi-pose training data T and X of features extracted from the different poses.
7.2.1 Linear Regression for Pose-Invariant Lipreading
The goal of regression is to predict the value of a target variable given an input
variable [8]. This is normally performed via a linear function, which gives rise to
linear regression. For the problem of pose-invariant lipreading, a linear regression
or transformation matrix W can be found which predicts the target features t of
the desired pose given the features x of the undesired pose, by
t = y(x,W) = Wx (7.1)
where x is of dimension P and t is of dimension Q. An example of this process is
shown in Figure 7.2, where x represents the features of the unwanted profile pose
and t the desired frontal pose features. It would be prudent, however, to express
Equation 7.1 as a predictive distribution as this displays the uncertainty about
the predicted value of t, for any new value of x. This can be done in such a way
to minimise the expected value of a chosen loss function. A common choice for
the loss function is the squared loss function, for which the optimal solution is
given by the conditional expectation of t [8]. As such, assume t is given by a
deterministic function y(x,W) with additive Gaussian noise so that
t = y(x, W) + ε    (7.2)

where ε is a zero-mean Gaussian random noise vector with precision (inverse variance)
β. As such, Equation 7.2 can be written as

p(t|x, W, β) = N(t | y(x, W), β⁻¹I)
             = N(t | Wx, β⁻¹I)    (7.3)
Now, given a training set consisting of N offline input examples of the
undesired pose X = {x_1, . . . , x_N} and their synchronised target examples in
the wanted pose T = {[t_1, 1]', . . . , [t_N, 1]'}, Equation 7.3 can be expressed
as the following log-likelihood function

ln p(T|X, W, β) = Σ_{n=1}^{N} ln N(t_n | W x_n, β⁻¹I)
               = (NQ/2) ln(β/2π) − (β/2) Σ_{n=1}^{N} ‖t_n − W x_n‖²    (7.4)
where a unit bias has been added to T to allow for any fixed offset in the data;
no such bias was given to the input matrix X. Maximising Equation 7.4 with
respect to W is equivalent to minimising the sum-of-squares error function
defined by

E_D(W) = (1/2) Σ_{n=1}^{N} ‖t_n − W x_n‖²    (7.5)
A problem with using linear regression, however, is that it is prone to
overfitting [8]. A method of overcoming this phenomenon is to introduce a
regularisation term, so that the total error function takes the form

E_T(W) = E_D(W) + λ E_W(W)    (7.6)

where λ is the regularisation coefficient that controls the relative importance
of the data-dependent error E_D(W) and the regularisation term E_W(W) [8]. One
of the simplest forms of regulariser is given by the sum-of-squares
E_W(W) = (1/2) ‖W‖²    (7.7)

The total error now becomes

E_T(W) = (1/2) Σ_{n=1}^{N} ‖t_n − W x_n‖² + (λ/2) ‖W‖²
       = (1/2) ‖T − WX‖² + (λ/2) ‖W‖²    (7.8)
The sum-of-squares regulariser given in Equation 7.7 encourages weight values
to decay towards zero unless supported by the data, and in machine learning
circles is known as a weight-decay regulariser [8]. Consequently, the solution
for W can be found by minimising Equation 7.8 with respect to W. As the overall
factor of 1/2 does not affect the minimiser, it is dropped in the following:

∂E_T(W)/∂W ∝ ∂/∂W [tr[(T − WX)'(T − WX)] + λ tr(W'W)]
           = ∂/∂W [−2 tr(TX'W') + tr(WXX'W') + λ tr(W'W)]    (7.9)
The above derivatives can be evaluated individually using matrix identities.
The second derivative,

∂/∂W tr(WXX'W')    (7.10)

can be evaluated with the product rule, treating each occurrence of W as
constant in turn. Let

B = XX'W'    (7.11)

so that, by the identity ∂/∂W tr(WB) = B',

∂/∂W tr(WB) = B' = (XX'W')' = WXX'    (7.12)

since XX' is symmetric. Now let A = WXX', so that, by the identity
∂/∂W tr(AW') = A,

∂/∂W tr(AW') = A = WXX'    (7.13)

Therefore

∂/∂W tr(WXX'W') = ∂/∂W tr(WB) + ∂/∂W tr(AW')
                = WXX' + WXX'
                = 2WXX'    (7.14)
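The trace-derivative result used here, namely that the quadratic trace term contributes 2WXX' to the gradient (the term that appears in Equation 7.15), can be checked numerically with a central finite difference on toy matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))   # toy data matrix (P = 3 features, N = 6 examples)
W = rng.normal(size=(4, 3))   # toy transformation matrix (Q = 4, P = 3)

def f(W):
    # The quadratic trace term from the total-error gradient.
    return np.trace(W @ X @ X.T @ W.T)

analytic = 2 * W @ X @ X.T    # the 2WXX' term that appears in Equation 7.15

# Central finite differences, element by element.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (f(W + E) - f(W - E)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```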
Substituting this result back into Equation 7.9 yields

∂E_T(W)/∂W ∝ −2TX' + 2WXX' + 2λW    (7.15)

Setting this to zero gives

0 = −2TX' + 2WXX' + 2λW
TX' = W(XX' + λI)
W = TX'(XX' + λI)⁻¹    (7.16)
The matrix W found above was used to project all visual speech features of
the unwanted pose into the wanted pose domain, in an attempt to normalise for
pose. The next section details the experiments conducted in this thesis to
determine whether this step was of benefit to a pose-invariant lipreading system.
Before describing these experiments, however, the importance of the
regularisation term λ is investigated in the next subsection.
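The closed-form solution of Equation 7.16 and the projection of Equation 7.1 can be sketched in a few lines of NumPy; the data below are synthetic stand-ins for the pose features, not the thesis's corpus:

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q, N = 40, 40, 1000            # feature dimensions and training examples
X = rng.normal(size=(P, N))        # "profile" training features, one per column
W_true = rng.normal(size=(Q, P))   # hypothetical true pose mapping
T = W_true @ X + 0.1 * rng.normal(size=(Q, N))   # noisy "frontal" targets

lam = 1.0
# Equation 7.16: W = T X' (X X' + lambda I)^-1
W = T @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(P))

# Equation 7.1: project a new "profile" feature vector into the "frontal" space.
x_new = rng.normal(size=(P,))
t_hat = W @ x_new
print(t_hat.shape)   # (40,)
```

With ample, low-noise training data the recovered W lies close to the mapping that generated the targets, which is what makes the projection useful as a normalisation step.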
7.2.2 The Importance of the Regularisation Term (λ)
The regularisation term, λ, was introduced in the previous subsection to control
the problem of overfitting. Overfitting refers to the situation where a model
has too many parameters; the common behaviour in such a case is that the model
becomes overly tuned to the training data, giving a near-perfect fit to it, but
when the model is applied to the test data, wild oscillations occur. One way of
alleviating the problem of overfitting is to limit the number of parameters in
the model. However, as Bishop notes, there is something rather unsatisfying
about limiting the number of parameters in the model, as it would seem more
reasonable to choose the complexity of the model according to the complexity of
the problem being solved [8]. A method of doing this is to adopt the linear
regression
approach using a regularisation term, as given in the previous subsection. By
utilising this approach, a complex model can be produced, as the regularisation
term λ is able to decay towards zero any weight values which are not supported
by the training data. However, this begs the following questions:

• What value of λ should be used?

• What impact does the amount of training data have on λ?
To answer these questions, a demonstration of the effectiveness of the linear
regression technique over various values of λ and different numbers of training
images is given. For this demonstration, the linear regression matrix W was
learnt from the frontal and profile grayscale ROI images (32 × 32). Different Ws
were calculated for various values of λ = {10⁻², 10⁰, 10²} and for various
numbers of training images, N = 1k, 10k and 75k. These training images were
randomly selected from the entire training set (≈ 200k images).² These different
Ws were used to project the unwanted profile ROI images into the wanted frontal
domain. The results of this demonstration are given in Figure 7.3.
As can be seen from Figure 7.3, an unwanted profile image can be projected
into the wanted frontal image via the linear regression transformation matrix.
However, the likeness between the actual frontal ROI image and the projected
ROI image varies according to the number of training images used and the value
of λ. For example, when only 1k training images were used and λ = 10⁻², the
projected profile ROI resembled a noisy, ghost-like ROI which is a far cry from
the original frontal ROI. This is a prime example of overfitting. In comparison,
when the value of λ was increased to 10⁰ and 10² using 1k training images, the
respective projected profile ROIs looked much more like the original ROI.
This result using 1k training images is in stark contrast to the situation where
the number of training images was increased (10k and 75k), with the value

²In the stationary pose-invariant experiments, it should be noted that visual speech feature vectors were used instead of the images, as only the speech information was of interest. Calculating the transformation matrices using the image data would also have been prohibitive for the full training set due to the increase in dimensionality.
Figure 7.3: Profile ROI images projected into the frontal domain (projected image = W × profile image), for regularisation values λ = 10⁻², 10⁰ and 10² and for 1k, 10k and 75k training images. The unwanted profile image is known, while the wanted frontal image is unknown.
of λ having little to no observable effect on the projected profile ROIs, with
all of them looking similar to the original ROI. This result is intuitive, as
when the number of training examples (> 10k) is far greater than the number of
parameters (1k), a generalised model is usually obtained.

This demonstration highlights the importance of the regularisation term λ, as
it alleviates the problem of overfitting when the number of training examples is
limited. However, when there is an abundance of training examples, the value
of the regularisation term is insignificant, as the large amount of training data
ensures that a model which generalises well across the data can be obtained. As
such, for the experiments conducted in the next section the value of λ was set to
10⁰, even though its value was not important given that the number of training
examples was close to 200k.
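The weight-decay behaviour of λ can also be illustrated numerically: as λ grows, the training residual of the regularised solution (Equation 7.16) increases while the magnitude of W shrinks. The matrices below are small synthetic stand-ins for the ROI data:

```python
import numpy as np

def ridge(T, X, lam):
    """Regularised least-squares solution of Equation 7.16."""
    return T @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(X.shape[0]))

rng = np.random.default_rng(0)
d, n = 64, 32                    # toy stand-ins for the 1024-dim ROI images
X = rng.normal(size=(d, n))      # "profile" training examples, one per column
T = rng.normal(size=(d, n))      # synchronised "frontal" targets

residuals, norms = [], []
for lam in (1e-2, 1e0, 1e2):     # the three values used in the demonstration
    W = ridge(T, X, lam)
    residuals.append(np.linalg.norm(T - W @ X))   # training-data fit
    norms.append(np.linalg.norm(W))               # weight magnitude
    print(f"lambda={lam:g}: residual={residuals[-1]:.3f}, ||W||={norms[-1]:.3f}")
# Larger lambda trades training fit for smaller weights (weight decay).
```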
7.3 Stationary Pose-Invariant Experiments
7.3.1 Experimental Setup
As it was assumed that prior knowledge of the speaker's pose was available,
localisation and tracking of the ROIs was performed as per Chapters 4 and 6 for
the frontal and profile poses respectively. In a "real-world" scenario, the pose
of the speaker would have to be estimated prior to any ROI localisation (this
is investigated later, in Section 7.4). However, as the pose was constant during
each entire utterance, and for the purposes of demonstrating the pose-invariant
technique, this approach was deemed valid. As with the ROI extraction, the visual
features were extracted via the baseline cascading appearance features described
in Chapter 5.3.
In the first round of experiments, three lipreading systems were tested. These
systems were trained on the following data:
• 100% frontal
• 100% profile
• 50% frontal and 50% profile (“combined(50-50)”)
As with the past experiments in this thesis, the same multi-speaker train and
test sets described in Chapter 3.6.2 were utilised, i.e. 1198 training sequences
and 242 test sequences. The frontal system was trained solely on the frontal
features, and the profile system solely on the profile features. The training set
for the combined(50-50) system was made up of 50% frontal features (599) and 50%
right-profile features (599). For the combined(50-50) system, all 1198 different
sequences were accounted for by randomly substituting frontal sequences with
their profile counterparts. These systems were tested on frontal, profile,
projected-profile, projected-frontal and combined test sets. Similar to the
training sets, the combined test set was made up of 50% frontal (121) and 50%
profile (121) data, and is also termed combined(50-50). Additional test sets
consisting of 50% frontal and 50% profile-projected-into-frontal features
("combined-projected profile(50-50)"), and 50% profile and 50%
frontal-projected-into-profile features ("combined-projected frontal(50-50)"),
were also included.
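The construction of the combined(50-50) training set, in which half of the 1198 sequence identities are randomly substituted with their profile counterparts so that every sequence is still represented exactly once, can be sketched as follows (the seed and labels are illustrative):

```python
import random

random.seed(0)
n_train = 1198
all_ids = list(range(n_train))

# Randomly choose half of the sequence identities to come from the frontal
# set; the rest are substituted with their profile counterparts, so every
# one of the 1198 sequences is still represented exactly once.
frontal_ids = set(random.sample(all_ids, n_train // 2))
combined = [("frontal" if i in frontal_ids else "profile", i) for i in all_ids]

n_frontal = sum(1 for view, _ in combined if view == "frontal")
print(n_frontal, n_train - n_frontal)   # 599 599
```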
Figure 7.4: Plots of WER (%) against the number of features used, Q, showing the impact that normalising the pose has on lipreading performance for: (a) the frontal and combined(50-50) systems; and (b) the profile and combined(50-50) systems. The systems are tested across Q = 10 to 60. In the legend, the first label refers to the test set and the label within the brackets denotes the system's name.
In these experiments, one of the goals was to see what effect the number of
features had on the transformed features. To do this, all three systems were
trained and tested using features of dimension Q = 10 to 60. It is worth noting
that two transformation matrices were calculated offline for these experiments
via Equation 7.16. The transformation matrix W_P, which projects the profile
features into the frontal pose, was calculated with the full set of training
frontal features as the target variable T and the full set of training profile
features as the input variable X. The transformation matrix W_F, which projects
the frontal features into the profile pose, was found using the opposite
configuration.
7.3.2 Experimental Results
The results given in Figure 7.4 show the impact that projecting the features into
a single pose has on the lipreading performance. In (a), the frontal system is
compared to the combined(50-50) system, while in (b) the profile system is compared
to the combined(50-50) system. In the former plot, it can be seen that the frontal
system achieves the lowest WER when it is tested on the frontal data (best is
27.66% for Q = 40), while in the latter plot the profile system obtains the
lowest WER for the profile data (best is 38.88% for Q = 40). However, when
each system is tested on features of the other pose, the features are essentially
140 Chapter 7. Pose-Invariant Lipreading
recognised as noise due to the large train/test mismatch (both severely degrading
to approximately 87%). It can be seen that by projecting the profile features into
the frontal domain (a), or by projecting the frontal features into the profile domain
(b), the mismatch between the features and the models is greatly reduced. For
Q = 20, the improvements are quite significant with the WER reducing from
87.07% down to 54.85% for the frontal system in (a) and 87.45% down to 42.97%
for the profile system in (b). However, when the number of features is increased
from Q = 20 to 60, the performance of the projected features steadily degrades,
with the WER increasing from 54.85% to 74.78% and 42.97% to 67.97% in (a)
and (b) respectively.
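For reference, WER figures like those quoted above are computed from the Levenshtein alignment between the recognised and reference word strings. The sketch below is illustrative only; the thesis presumably uses standard HTK-style scoring tools.

```python
def wer(reference, hypothesis):
    """Word error rate (%): edit distance over words divided by the
    reference length, i.e. (substitutions + deletions + insertions) / N."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(r)][len(h)] / len(r)
```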
The drop-off in performance as the number of features increases from Q = 20
to 60 highlights one of the characteristics of using maximum likelihood linear
regression. As the solution for the transformation matrix found in Equation 7.16
was obtained via a Bayesian approach, the effective number of
parameters was adapted automatically to the size of the data set [8]. For these
experiments, it appears that the number of effective parameters or features was
constrained to Q = 20 to 30.
Another observation worth noting is that the performance of the projected
profile test set on the frontal system is well behind that of the profile test set on the
combined(50-50) system in (a), with a performance difference ranging from 7.46%
at Q = 20 to 27.01% at Q = 60. This is in contrast to the improvement the
frontal system enjoys over the combined(50-50) system for the frontal test set,
with an average improvement of 8%. A similar trend is seen in (b),
with the combined(50-50) system on the frontal test set outperforming the profile
system on the projected frontal test set with a performance difference ranging from
6.56% at Q = 20 to 30.85% at Q = 60, while the profile system outperforms the
combined(50-50) system on the profile test set by an average of 8%.
As it was hard to ascertain which overall system was better, it was necessary
to test the systems on the combined(50-50) test set. The results for this ex-
periment are illustrated in Figure 7.5. From this plot, it appears that there is
not much difference between the frontal, profile and combined(50-50) systems
for the combined-projected profile(50-50), combined-projected frontal(50-50) and
[Figure 7.5 appears here: plot titled "Performance of Frontal, Profile and Combined(50−50) systems for varying number of features"; x-axis: number of features used, Q; y-axis: lipreading performance, WER (%).]
Figure 7.5: Plot showing the impact that normalising the pose has on lipreading performance for the frontal, profile and combined(50-50) systems. These systems are tested across various numbers of features Q = 10−60. In the legend, the first label refers to the test set and the label within the bracket denotes the system's name.
combined(50-50) test sets respectively. This is especially the case for Q = 10 to
30. For the frontal system, the best result achieved was with a WER of 42.02%
for Q = 20. Similarly with Q = 20, the profile system achieved its best WER
of 42.42%. However, the best overall result was from the combined(50-50) sys-
tem, with a WER of 40.83% for Q = 30. A possible reason for this is that the
combined(50-50) system is trained on both sets of data equally and by doing this
the system is able to model both poses equally well, thus yielding better overall
results.
By combining the features of the different poses, a model is created which
effectively averages or generalises across the poses. This generalisation has come
at a cost though, as was mentioned earlier, with the frontal system degrading by
an average of approximately 8% and the profile system degrading by an average
of approximately 7%. However, for these experiments, having a system which can
generalise over the two poses is still better than a system which normalises all
poses into a uniform one. This is highlighted by the fact that the performance of
the combined(50-50) system for the frontal and profile test sets heavily outperforms
the performance of the projected features for the frontal and profile systems, which
can be seen back in Figures 7.4(a) and (b). Even though generalising across both
sets of features yielded the best results for these experiments, it must be noted
that this was for the scenario where both poses were equally likely. Generalisation
can be particularly costly, however, if one pose is more prevalent than the other.
This is the focus of the experiments given in the next subsection.
7.3.3 Biased Towards Frontal Pose
Most lipreading systems are set up for fully frontal faces. This is due to the fact
that nearly all audio-visual speech databases have been restricted to the frontal
pose due to the high cost associated with capturing video data (see Chapters 2
and 3). Even though this has been widely acknowledged throughout this thesis,
what has not been recognised until now is “why” most databases chose the frontal
pose over the profile pose. The reason is quite obvious, as most lipreading appli-
cations would expect the speaker to be in the frontal pose for the majority of the
time. Consequently, to reflect this fact, it would be intuitive that a lipreading
system be trained more on the frontal pose than the profile pose to cater for this
bias. A bonus of adopting this approach is that the frontal pose yields better
lipreading performance than the profile pose (27.66% vs 38.88% WER), so the
overall lipreading performance should improve.
To see what impact biasing the system to the frontal pose over the profile pose
has on the lipreading performance, it was decided that a second set of experiments
would be conducted to reflect this scenario. To do this, it was estimated a speaker
would be in the frontal pose for approximately 80% of the time and in the profile
pose for about 20%. For these extended experiments, the frontal system was
still trained solely on the frontal features. A new system was introduced,
however, which was called the "combined(80-20)". As the name suggests, the
combined(80-20) system was trained up on 80% of the frontal data (958 utterances)
and 20% of the profile data (240 utterances). The profile system was not tested
as part of these experiments as they were biased towards the frontal pose. As
such, the frontal and combined(80-20) systems were tested on the frontal,
profile and projected profile test sets. The combined(80-20) test set was made up of
80% frontal test sequences (194) and 20% of profile test sequences (48). Similarly,
[Figure 7.6 appears here: plot titled "Performance of Frontal and Combined(80−20) systems for varying number of features"; x-axis: number of features used, Q; y-axis: lipreading performance, WER (%).]
Figure 7.6: Plot showing the impact that biasing the system to the frontal pose has on the lipreading performance for the frontal and combined(80-20) systems. These systems are tested across various numbers of features Q = 10−60. In the legend, the first label refers to the test set and the label within the bracket denotes the system's name.
the combined-projected profile(80-20) test set consisted of 80% frontal and 20%
projected profile test sequences. It is worth noting that the regression training
sets remained the same due to the limited number of synchronised examples.
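The 80/20 composition described above can be sketched as a simple sampling step. This is an illustration only; `pose_biased_set`, the random sampling policy and the seed are assumptions, not the thesis's actual split procedure.

```python
import random

def pose_biased_set(frontal, profile, n_total, frac_frontal=0.8, seed=0):
    """Draw a pose-biased utterance list, e.g. 80% frontal / 20% profile.

    With n_total = 1198 and the full CUAVE-style lists passed in, this
    would reproduce the 958 frontal / 240 profile training mix used for
    the combined(80-20) system.
    """
    rng = random.Random(seed)
    n_front = round(n_total * frac_frontal)   # e.g. 958 of 1198
    n_prof = n_total - n_front                # e.g. 240 of 1198
    return rng.sample(frontal, n_front) + rng.sample(profile, n_prof)
```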
Figure 7.6 shows the lipreading performance for the frontal and combined(80-
20) systems for the frontal, profile and projected profile test sets. From this plot
it can be seen that the overall lipreading performance of the frontal system
has greatly improved. For the projected profile features, the best result sees the
WER come down to 33.90% from 42.02% for Q = 20. The improvement in the
overall performance of the frontal system can be attributed to the fact that this
system is solely trained on the frontal pose, which is the pose these experiments
are biased towards. This frontal system result outperforms the combined(80-20)
system, whose best WER of 36.61% also occurs at Q = 20. This mark is much better
than the combined(50-50) result recorded in the previous experiment, which was
40.83% for Q = 30. This improvement can also be attributed to the biasing of the
system to the frontal data, as well as lessening the impact of generalisation. This
can be seen in Figure 7.6, as the combined(80-20) system curve for the frontal
test set is relatively close to the frontal system curve. This has come at a cost
[Figure 7.7 appears here: plot titled "Performance of Frontal, Combined50−50 and Combined80−20 systems for varying number of features"; x-axis: number of features used, Q; y-axis: lipreading performance, WER (%).]
Figure 7.7: Plot showing the impact that normalising the pose has on lipreading performance for the frontal, combined(50-50) and combined(80-20) systems. These systems are tested across various numbers of features Q = 10−60. In the legend, the first label refers to the test set and the label within the bracket denotes the system's name.
though, as the profile performance of the combined(80-20) system has degraded
significantly compared to the combined(50-50) system, with the WER increasing
to the range of 62.11% to 62.55%, from the range of 45.53% to 47.77% shown in
Figure 7.4(a). This suggests that the profile data is not adequately represented
in the combined(80-20) system's model, and as such the projected profile features
achieve a better WER than the profile features: 57.78% for Q = 20, compared
to 62.11% for Q = 40. As this is the case, it is not surprising that the frontal
system is now the superior system, as depicted in Figure 7.7. Also, from this
figure it is visible that the projected profile features now perform better
than the profile features, with the optimal value at Q = 20.
From these experiments, it is evident that when the models are biased towards
one particular pose, such as the frontal one, it is advantageous to normalise all
poses into the strongly trained pose. It would be expected that when the number
of non-dominant poses is increased, this result will be even more dramatic, as
these non-dominant poses increase the amount of variation in the train/test set.
This is the focus of the next experiment, which includes the other profile pose.
[Figure 7.8 appears here: diagram showing video input of a speaker in the left profile, frontal, or right profile pose.]
Figure 7.8: In these experiments, the lipreading system has to lipread from the frontal, right and left profile poses, instead of just the frontal and profile (right) poses.
7.3.4 Inclusion of Additional Pose
In the previous section, it was hypothesised that when the number of poses is
increased, the benefit of pose normalisation will be even more pronounced. The
experiments performed in the following section are designed to illustrate this
point. To do this, the left profile pose data was included. With the introduction
of the left profile pose data, the lipreading paradigm has shifted from the one
depicted in Figure 7.1 to the one given in Figure 7.8.
An additional lipreading system was developed to accommodate the additional
pose data. As in the previous experiment for the combined(80-20) system, the
"combined(80-10-10)" system was trained on data which was biased towards the
frontal pose, with 80% frontal (958 utterances), 10% right profile (120) and
10% left profile (120) data. To see the benefit of
pose normalisation over the combined model with the additional pose, the
frontal, combined(80-20) and combined(80-10-10) systems were tested. These
systems were tested on the front, right profile, left profile, right projected pro-
file, left projected profile, combined(80-20), combined-projected profile(80-20),
“combined(80-10-10)” and “combined-projected left and right profile(80-10-10)”
test sets. Like the combined(80-20) and combined-projected profile(80-20) test
sets, the combined(80-10-10) and combined-projected left and right profile(80-
10-10) test sets contained the three different poses, consisting of 80% for the frontal
(194), 10% for the right profile and projected (24) and 10% for the left profile
and projected (24) data respectively. It is worth noting that the right profile data
refers to the profile data mentioned in the previous experiments.
For these experiments, the left profile data set was constructed by horizontally
mirroring the right profile ROI images. Once these ROIs were obtained, the visual
feature extraction step was performed as normal. As the left profile ROIs were just
the mirrored right profile images, the features were effectively the
same due to the DCT step in the visual feature extraction process. As the DCT
is a laterally symmetrical function (see Chapter 5.5), the only
difference between the left and right profile features was that the odd frequency
components were of opposite polarity, which in turn resulted in essentially the
same visual feature vectors being obtained for both profile poses. As such,
the lipreading results for each of these poses were identical and, as this was the
case, they are referred to simply as profile in the results.
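The mirror-symmetry property described above can be verified numerically. The sketch below builds the DCT-II basis directly with NumPy and applies it along the horizontal axis; the 32×32 ROI here is random stand-in data, not thesis imagery.

```python
import numpy as np

def dct2_cols(a):
    """DCT-II applied along the columns (horizontal) axis of a 2D array."""
    n = a.shape[1]
    m = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    # basis[m, k] = cos(pi * (2m + 1) * k / (2n)), the DCT-II kernel
    basis = np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    return a @ basis

rng = np.random.default_rng(42)
roi = rng.random((32, 32))        # stand-in for a 32x32 mouth ROI
mirrored = roi[:, ::-1]           # horizontal mirror (left <-> right profile)

C = dct2_cols(roi)
Cm = dct2_cols(mirrored)

# Odd horizontal-frequency coefficients flip sign; even ones are identical,
# so magnitude-based features are the same for both profile poses.
signs = (-1.0) ** np.arange(32)
assert np.allclose(Cm, C * signs[None, :])
```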
Table 7.1 shows the results for these experiments. As the results from the
previous experiment showed that optimal performance was gained with Q = 20,
this number of features was used for this experiment. From these results it can
be seen that when data of another pose is added, the benefit of normalising the
pose is more substantial when compared to the combined systems. When only two
poses were used, the performance on the combined(80-20) test set showed that the
combined(80-20) system obtained a WER of 37.33%. Contrastingly, when
three poses were included in the combined(80-10-10) test set, the combined(80-10-10)
system obtained a worse WER of 39.96%, which is a degradation of around 2.6%.
It can be seen that the frontal and the projected profile performance remains con-
stant; however, the performance on the profile degrades from a WER of 62.55% for
the two poses to 69.74% for three poses. Like the previous experiments, this can
be attributed to the lack of classification power the system possesses to accurately
model features across the different poses. In comparison, projecting the features
into a uniform pose does not alter the performance of the lipreading systems at
all. It would be expected that further degradations would occur to the combined
                                    System Tested on
System Trained   Frontal  Profile  Proj     Comb     Comb Proj  Comb        Comb Proj
                                   Profile  (80-20)  (80-20)    (80-10-10)  (80-10-10)
Frontal          29.18    87.07    54.85    40.09    33.90      40.07       33.81
Comb(80-20)      32.46    62.55    57.98    37.33    36.61      41.23       40.76
Comb(80-10-10)   32.51    69.74    58.02    38.19    37.31      39.96       36.82
Table 7.1: Lipreading results in WER (%) showing the effect that an additional pose has on performance for Q = 20. As the left and right profile WER were the same, profile refers to both poses. The combined(80-10-10) test set refers to frontal (80%), right (10%) and left (10%) profile poses.
systems when more poses are included into the system (i.e. ±30°, ±45°, ±60°, etc.).
However, by utilising the pose-normalising step as described in this chapter, the
degradation to the overall lipreading performance can be minimised.
7.3.5 Limitations of Pose-Normalising Step
The linear regression method mentioned in this section works quite well in min-
imising the train/test mismatch between the visual speech features of the various
poses. However, it must be mentioned that it is anticipated that this method
would only be useful for small vocabulary tasks, as the single trans-
formation matrix can only learn the differences between the
frontal and non-frontal poses to a certain extent. For applications such as large vocabulary lipreading
with many speakers (> 100), it is expected that this method would be prohibitive
due to the large amount of data required to train up a single transformation ma-
trix. It is also unrealistic to think that a single transformation matrix could
learn the differences between all the different speakers and visual sounds
as well.
7.4 Continuous Pose-Invariant Lipreading
At the start of the chapter it was stated that the main motivation behind pose-
invariant lipreading was to make the lipreading system more “real-world” by
allowing the speaker to be in more than one pose during the utterance. However,
[Figure 7.9 appears here: block diagram. Video in → pose estimate (from face) → pose-specific ROI localisation (front, right profile, left profile) → visual feature extraction → normalise pose (for the profile branches) → HMM → recognise visual speech.]
Figure 7.9: Block diagram of the continuous pose-invariant lipreading system.
all the work presented in this chapter thus far has dealt with the scenario where
the speaker has remained in the same pose for the entire sequence. In addition,
the pose of the speaker was assumed to be known, which hardly makes
the lipreading system "real-world". In an attempt to remedy this situation, this
section of the thesis is dedicated to the development of a continuous pose-invariant
lipreading system. To accommodate this work, the CUAVE database [130] was
used as it contained continuous video of speakers talking in three different poses:
frontal, left profile and right profile (see Chapter 3.6.3 for full details).
The deployment of a continuous pose-invariant lipreading system is very sim-
ilar to the stationary scenario, albeit with one modification. This modification is
the inclusion of a pose-estimator at the front of the visual front-end, as depicted
in Figure 7.9. It can be seen that once the pose of the speaker has been estimated,
this estimation is used to direct the system to locate the ROI of that particu-
lar pose. Once the ROI has been extracted, visual feature extraction can take
place and the features can be combined into a single model or normalised into
the frontal pose as described in the previous section. It must be noted that the
addition of a pose-estimator at the start of the lipreading system may seem like
a simple enough solution; however, this type of approach can be problematic as it
provides another avenue to introduce error into the lipreading system due to the
front-end effect. Ideally, perfect pose-estimation would be achieved which would
result in the lipreading performance not being affected at all. Unfortunately,
this is extremely difficult to achieve and as a consequence, it is expected that
some error will be introduced into the lipreading system via the pose-estimator.
Through a number of experiments on the CUAVE database, the impact of the
pose-estimation module on the entire lipreading system is analysed in this sec-
tion. Prior to this analysis though, a full description of the pose-estimation and
multi-pose visual front-end is given.
7.4.1 Pose Estimation
In Chapter 4, many different visual front-ends were discussed and whilst all have
some advantages associated with them, the Viola-Jones algorithm [180] was se-
lected for use in this thesis as it is extremely rapid, accurate and able
to be used for non-frontal poses as well as frontal. Throughout this thesis, the
benefit of using this algorithm has been illustrated; however, it has only centered
on locating faces and facial features of one specific pose (both frontal and pro-
file). For continuous pose-invariant lipreading, a multi-pose paradigm has to be
adopted. This highlights another benefit of the Viola-Jones framework, as it is able
to accommodate the multi-pose scenario by the inclusion of a pose-estimator,
which still allows for extremely quick localisation of faces and features [82].
According to Jones and Viola [82], the multi-pose visual front-end depicted in
Figure 7.9 is the preferred option amongst researchers. A reason they gave was
that a holistic approach, where a single classifier is trained to detect all poses of
a face, is unlearnable with existing classifiers. In their informal experiments they
found that using the holistic approach yielded extremely inaccurate results. The
initial work in this thesis using this holistic approach also backs up this assertion.
It would appear that, like the previous section where the combined HMM classifier
suffered from over-generalisation, the boosted cascade of simple classifiers suffers
from the same problem. Another disadvantage of using a single classifier across
all poses is that no information about the speaker's pose is gained. This
means that the pose normalising step using linear regression that was described in
the previous section cannot be utilised.
The pose-estimation of a speaker's face is essentially a chicken-and-egg
problem. Firstly, the location of the face has to be known to determine its pose,
but the pose of the face has to be known to find the face. A prudent strategy
would be to solve both of these problems simultaneously. To do
this, a face classifier for each pose has to be constructed, and each classifier then has
to be scanned exhaustively over every position and scale in the image. As this is
extremely expensive in terms of computation, a rapid detection framework like
the Viola-Jones framework has to be employed. In [82], Jones and Viola did
such a thing by building different detectors for different poses of the face. These
classifiers were then placed in a decision tree to determine the pose of the given
window being tested. Rowley et al. [155] employed a similar strategy, but it was
reported to be neither as quick nor as accurate as the one devised by Jones and Viola
[82].
For this thesis, a similar strategy to Jones and Viola was used to develop the
pose estimator. A diagram of the devised pose estimator is depicted in Figure
7.10. From this figure it can be seen that given a frame of a speaker’s face,
all the face classifiers are applied to the image to determine the location of the
face. Once a face has been located by a pose specific classifier, this face and
pose information is then used by the continuous pose-invariant lipreading system
which is described by Figure 7.9. This procedure works well when only one of
the poses is estimated, however, it gets complicated when there is more than
one pose estimated as there is no way of knowing which pose is the correct
one. To counteract this problem, the nearest neighbour variable is used. The
nearest neighbour variable is a parameter in OpenCV’s generic object detector
[Figure 7.10 appears here: flowchart. Video in → set nearest neighbour = 1 → check front, left and right pose face classifiers → how many poses estimated? If one, run the visual front-end for the estimated pose; if more than one, increase the nearest neighbour parameter by 2 and re-check; if none (pose estimation failure), use the previous pose.]
Figure 7.10: Block diagram of the pose estimator which incorporates the pose estimation with the face localisation.
[128], which essentially regulates how much an object has to look like the object
of interest before it is recognised as that object. When an object in an image
looks like the object of interest (i.e. a face), the object detector puts a number of
rectangles around the object. The more the object looks like the object of
interest, the more rectangles are placed around it. For example, in Figure 7.11
a speaker’s face is detected as such by a face classifier, which is symbolised by the
three rectangles around the speaker’s face. If the nearest neighbour parameter is
set to three or less, then the face is deemed to be a face. However, if the nearest
neighbour parameter is set to four or above, then the face is not deemed to be a
face.
For the pose-estimator given in Figure 7.10, the nearest neighbour parameter
is set to one and all the pose specific face classifiers are tested on the given frame.
Figure 7.11: Example showing the function of the nearest neighbour variable in the face localiser.
If only one face/pose is found then that information is used by the lipreading
system. However, if there is more than one pose estimated, the nearest neighbour
parameter is increased by two to determine which is the more likely pose. This
process is continued until only one face/pose is found. If no face/pose is found,
the face and pose information from the previous frame is used.
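The escalation loop just described can be sketched in a detector-agnostic way. The `detectors` callables below stand in for OpenCV cascade classifiers with a minNeighbors-style argument; the function name, the +2 step (from the text) and the cap of 15 iterations are assumptions of this sketch, not the thesis's implementation.

```python
def estimate_pose(frame, detectors, prev_pose=None, max_neighbours=15):
    """Resolve the speaker's pose by escalating the nearest-neighbour
    threshold until exactly one pose-specific classifier fires.

    detectors: dict mapping pose name -> callable(frame, min_neighbours)
    returning a list of detected face boxes (possibly empty).
    Returns (pose, box); (prev_pose, None) if no face is found.
    """
    n = 1
    while n <= max_neighbours:
        hits = {pose: detect(frame, n) for pose, detect in detectors.items()}
        firing = [pose for pose, boxes in hits.items() if boxes]
        if len(firing) == 1:            # unambiguous: use this pose
            return firing[0], hits[firing[0]][0]
        if not firing:                  # nothing found: reuse previous pose
            return prev_pose, None
        n += 2                          # ambiguous: demand stronger evidence
    return prev_pose, None
```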
7.4.2 Experimental Setup
For all the experiments in this section, the isolated digit tasks from the individual
section of the CUAVE database were used. A full description of the database
protocol is given in Chapter 3.6.3. However, it is worth noting that in this data
four different tasks were tested, i.e. normal (frontal), moving (frontal), left
profile and right profile. For the lipreading results, each of these individual tasks
was compared against the combined performance. For the training of the pose-
estimator and pose specific visual front-ends, only the frontal, left-profile and
right-profile poses were considered. The face and facial feature classifiers for each
pose were trained up on 500 manually annotated positive examples and 2000
negative examples, in the same manner as the previous experiments. The set of
500 positive examples for each pose were taken from all the 33 subjects. This was
because there were not enough speakers to create classifiers to achieve accurate
localisation for the ten different train/test sets devised in Chapter 3.6.3. As such,
only one variant of the pose-estimator and visual front-ends was developed for
these experiments. The set of positive examples for each pose was augmented
by including rotations of ±5°, ±10°, providing a set of 2500 positive examples. A
Pose            Correctly       False Alarm   Miss Alarm
                Estimated (%)   Rate (%)      Rate (%)
Front           92.31           0.00          7.69
Right Profile   87.17           5.13          7.69
Left Profile    89.74           5.13          5.13
Total           89.74           3.42          6.84
Table 7.2: Pose estimate results on the CUAVE validation set, which consisted of 39 images for each pose.
separate validation set of 39 annotated images for each specific pose was used
to test the pose-estimator and pose specific visual front-ends. These results are
presented in the following subsections.
7.4.3 Pose Estimate Results
The pose estimate results are shown in Table 7.2. Whether the
pose of the speaker was correct or not was determined by manual inspection. From the
results, it can be seen that the pose-estimator/face localiser achieves reasonable
results; however, it is far from ideal, as an error at this stage will
cause erroneous localisation of the ROI and thus incorrect recognition of the visual
speech. This again shows the impact of the front-end effect.
Most of the false and miss alarms occur when the pose is in transition (i.e. not
quite frontal and not profile). Examples of the pose estimation/face localisation
are shown in Figure 7.12. The top two rows of this figure show the results for
the frontal pose. The more difficult frames for the frontal pose were selected for
testing. This was done as it was expected that these frames would cause the
most trouble for the pose-estimator and for the multi-pose visual front-end to
operate successfully, it would need good performance on such frames. As can be
seen from the frames in the last column, a few miss alarms were incorporated
due to the irregular rotation of the speakers face (i.e. the first one the speaker is
looking upwards, the second is in between front and left profile pose). Overall, it
can be said that the performance for the frontal pose was quite good with only a
small number of miss alarms and no false alarms, however, due to the small size
Figure 7.12: Examples of results from the pose estimator. The first two rows give results for the frontal pose. The third and fourth rows give the results for the right profile pose and the last two rows give the results for the left profile pose. The last column gives examples of false estimates and miss estimates.
of the validation set this cannot be said with any great confidence. The third
and fourth rows give examples for the right profile pose, whilst the bottom two
rows give examples for the left profile pose. The right profile pose gave the worst
performance of all poses, but only marginally. For the left and right
profile poses, there were a few false alarms, with these getting confused with the
frontal pose. As the speakers do not always hold a definite frontal or profile pose,
this ambiguity confuses the pose-estimator and causes the errors.
7.4.4 Multi-Pose Localisation Results
Each pose specific visual front-end was developed in the same fashion as those
developed for the experiments conducted on the frontal and profile poses of the
IBM smart-room databases (see Chapter 4.6 and Chapter 6.2 for full details).
Facial Feature        Accuracy (%)
                      Frontal   Right Profile   Left Profile   Total
Right Eye             87.17     -               82.05          84.61
Left Eye              84.62     82.05           -              83.34
Nose                  79.49     76.92           79.49          78.63
Right Mouth Corner    82.05     -               79.49          80.77
Top Mouth             81.08     74.36           71.79          75.74
Left Mouth Corner     82.05     79.49           -              80.77
Bottom Mouth          76.92     71.79           74.36          74.36
Center Mouth          82.05     79.49           79.49          80.34
Chin                  61.54     53.85           58.97          58.12
Table 7.3: Facial feature localisation accuracy results for all poses on the CUAVE validation set.
The localisation results are given in Table 7.3. It is worth noting that a feature
was deemed to be successfully located if it was within 10% of the manually an-
notated distance between the eyes for the frontal pose, and 10% of the manually
annotated distance between the nose and chin for the profile poses. From the
results it appears that the localisation of the facial features is not as good as
in the experiments conducted on the IBM smart-room data. These results can be
misleading though, as the actual facial feature localisation performance is largely
on par with the smart-room data performance. However, the
performance degradation was caused by the previous step of pose-estimation/face
localisation. In another example of the front-end effect, the false or miss alarms
of the pose-estimator/face localisation module filtered down to the facial feature
localisation step, which caused the degraded results. This was to be expected
though, as this task is much more difficult than the tasks associated with the
smart-room data as the variable of head pose movement is introduced.
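The success criterion used above (error within 10% of a pose-dependent reference distance) can be written down directly; `feature_located` and its argument names are illustrative, not taken from the thesis code.

```python
import math

def feature_located(pred, gt, ref_a, ref_b, tol=0.10):
    """True if the predicted point lies within `tol` of the reference
    distance from the ground truth. For the frontal pose the reference
    points are the two eyes; for the profile poses, the nose and chin.
    Points are (x, y) tuples."""
    return math.dist(pred, gt) <= tol * math.dist(ref_a, ref_b)
```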
Examples of the localised faces and facial features are given in Figure 7.13.
The top row gives examples from the frontal pose, the second row gives examples
from the right profile pose and the third row gives examples from the left profile
pose. The bottom row gives the associated examples of the extracted 32 × 32
Figure 7.13: Examples of face and facial feature localisation from the multi-pose visual front-end. The bottom row gives the associated examples of the extracted 32 × 32 ROIs.
ROIs. For the frontal pose, scale and rotation normalisation was performed using
the left and right mouth corners. As all sequences started with the speaker in
the frontal pose, scale normalisation for the profile poses used the scale
metric determined during the initialisation of the visual front-end.
Unfortunately, no rotation normalisation could be performed for the profile
poses due to the lack of horizontally aligned points (see Chapter 6.2). Once the
ROIs were extracted, the same visual feature extraction process used throughout
this thesis was applied in these experiments.
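The mouth-corner-based scale and rotation normalisation for the frontal pose can be sketched as follows (a minimal illustration; the function name and the 32-pixel target width are assumptions for this example):

```python
import math

def mouth_corner_normalisation(left_corner, right_corner, target_width=32):
    """Derive the in-plane rotation angle and isotropic scale factor that map
    the mouth-corner pair onto a horizontal segment of fixed width, as one
    way to normalise the frontal-pose ROI before extracting the mouth region.
    Corners are (x, y) pixel coordinates."""
    dx = right_corner[0] - left_corner[0]
    dy = right_corner[1] - left_corner[1]
    angle = math.degrees(math.atan2(dy, dx))  # rotation to undo, in degrees
    width = math.hypot(dx, dy)                # current mouth width in pixels
    scale = target_width / width              # scale factor to apply
    return angle, scale
```

For the profile poses, where no horizontally aligned point pair exists, only the scale factor would be available.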
7.4.5 Continuous Pose-Invariant Lipreading Results
The experiments in this section were broken into two parts. The first part
investigated the lipreading performance of the four individual tasks: normal,
moving, right profile and left profile. For each individual task, the models
were trained and tested solely on the data for that task. The second part
investigated the pose-invariant lipreading task. For these experiments, a single
model trained on all the different tasks was used for testing; this was termed
the “combined all” result. In addition, depending on the result from the
pose-estimator, the features were normalised into the frontal pose using the
pose-invariant technique based on linear
Task WER (%)
Normal 46.88
Moving 67.26
Right Profile 71.95
Left Profile 71.54
Combined Individual 57.97
Combined All 61.20
Pose Normalised 61.49
Table 7.4: The upper part of the table shows the average lipreading performance for each individual task, whilst the bottom part compares the performance for the combined individual, combined all and pose normalised tasks, across the 10 different train/test sets.
regression introduced at the start of this chapter. As no synchronous data was
available in the CUAVE database to develop the linear regression matrices, the
left and right profile matrices from the IBM smart-room database were utilised
for this task. These results were referred to as the “pose normalised” results.
Both the “combined all” and “pose normalised” results were compared to the
average of the individual results which was termed “combined individual”.
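The linear-regression pose normalisation can be sketched as a least-squares projection learned from synchronous feature pairs; all names below are illustrative, and the appended bias column is one possible design choice rather than the thesis's exact formulation:

```python
import numpy as np

def train_pose_transform(profile_feats, frontal_feats):
    """Estimate a linear map (with bias) projecting profile-pose features
    onto their synchronous frontal counterparts via least squares.

    profile_feats, frontal_feats: (num_frames, dim) arrays of paired
    visual features extracted from the same utterances."""
    X = np.hstack([profile_feats, np.ones((len(profile_feats), 1))])
    W, *_ = np.linalg.lstsq(X, frontal_feats, rcond=None)
    return W

def normalise_pose(profile_feats, W):
    """Project profile-pose features into the frontal pose."""
    X = np.hstack([profile_feats, np.ones((len(profile_feats), 1))])
    return X @ W
```

Note that, as in the experiments above, the transform is only as good as the paired data it is trained on: matrices estimated on one corpus need not transfer to another.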
The results for the continuous pose-invariant lipreading system are given in
Table 7.4. For the individual tasks, the normal task achieved the best performance
with an average lipreading WER of 46.88%. This was to be expected as this
was the easiest task to perform, due to the speaker being relatively stationary.
Even though the moving task had the speaker in the frontal pose, having the
speaker move their head back and forth whilst speaking degraded the lipreading
performance markedly, to 67.26%. As this task had the speaker moving their
head quite quickly, a major reason for this poor performance can be assumed
to be poor tracking of the ROI. The left and right profile tasks achieved
even worse WERs of 71.54% and 71.95% respectively. It must be noted that the
WERs of 46.88% and 71.95% achieved in this experiment for the normal and
right profile tasks are significantly worse than the 27.66% and 38.88% WERs
achieved in the IBM smart-room database for the similar scenario. There are
two reasons for this. Firstly, due to the small size of the CUAVE database, a
speaker-independent lipreading paradigm had to be used in these experiments
using 10 different train/test sets, compared to the multi-speaker paradigm used
in the IBM smart-room experiments. Better lipreading performance is expected
under the multi-speaker paradigm, as speakers in the training set also appear
in the test set. Secondly, and probably most importantly, due to the
relatively small size of the CUAVE database there was not enough speech data
to adequately train the models for each task. This would be the case especially
for the profile models, as only ten digits were available from each speaker. This
corresponds to only 250 words available to train the models, which would cause
the models to be grossly undertrained.
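The WER figures quoted throughout are the standard Levenshtein-alignment measure over word sequences; a minimal sketch (function name illustrative):

```python
def word_error_rate(reference, hypothesis):
    """WER (%) = (substitutions + deletions + insertions) / reference length,
    computed by Levenshtein alignment over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

For example, recognising "one too three" against the reference "one two three four" gives one substitution and one deletion, i.e. a WER of 50%.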
For the continuous pose-invariant, or combined, experiments, the “combined
individual” result, i.e. the average of the individual tasks, yielded the best
lipreading performance with a WER of 57.97%. However, the “combined
all” results were not far behind with a WER of 61.20%. The pose-invariant step
using linear regression did not improve the performance, achieving a WER of
61.49%. This is probably because the transformation matrices used were trained
on the IBM smart-room database.
Given the small amount of visual speech data contained within the CUAVE
database, it is hard to gauge the significance of the lipreading results obtained
from these experiments. They do, however, indicate that the goal of continuous
pose-invariant lipreading is indeed attainable, as the achieved WER of 61.20%
is much better than pure chance. Regardless of the lipreading results, it is
evident that the development of a continuous pose-invariant lipreading system is
the next step in deploying a fully functional “real-world” AVASR system, and the
key challenge is developing a robust visual front-end with an extremely accurate
pose-estimator.
7.5 Summary
In this chapter, a novel and useful contribution to the field of AVASR was
presented with the introduction of a pose-invariant lipreading system. Pose-
invariant lipreading refers to the situation where, given a single camera, the
system can recognise visual speech regardless of pose. Two scenarios of the problem were
considered: stationary and continuous. The first part of the chapter dealt with
the stationary situation, which refers to the case where the speaker remains in
one pose (frontal or right profile) for the entire utterance. These experiments were
conducted on the IBM smart-room database. In these experiments it was shown
that when the features of one pose were tested on the other pose, the train/test
mismatch between the two was large and the lipreading performance severely
degraded as a consequence. To overcome this problem, a pose-invariant or pose
normalising technique using linear regression was used to project all the features
of the unwanted pose into the wanted pose. This technique was shown to reduce
the train/test mismatch between the different poses, and was shown to be of
particular benefit when one pose was more prevalent than the other (i.e. frontal
over right profile) due to over-generalisation. In some extended experiments, it
was shown that the effect of this pose-invariant technique is more pronounced
when more poses are included (left profile), again due to over-generalisation.
However, a caveat on this approach was the dimensionality of the features used
to determine the linear regression matrix. It was shown that once the dimension
is greater than 30, the benefit of the pose-invariant technique is diminished and
better performance is gained through a combined model of the different poses.
In the latter part of the chapter, the more realistic continuous scenario was
investigated. Continuous pose-invariant lipreading refers to the speaker changing
their head pose whilst they are speaking. This constituted a much more diffi-
cult problem than the stationary scenario as the pose of the speaker had to be
estimated every frame. In this novel system, the pose-estimator was developed
in conjunction with the face localiser and achieved reasonable results. As the
pose-estimation step was at the front of the lipreading system, it introduced ex-
tra error which affected the overall lipreading performance due to the front-end
effect. The results for these experiments, which were conducted on the CUAVE
database, show this to be the case as they are somewhat behind the lipreading
performance of the IBM smart-room database.
Chapter 8
Conclusions and Future Research
8.1 Summary of Contributions
Over the past twenty years, hundreds of articles have been dedicated to
illustrating the benefit of using the visual speech information from a speaker’s
mouth in addition to the audio signal for the task of speech recognition (see Chap-
ter 2). Even though all these works have shown that including the visual channel
to the speech recognition system greatly improves the recognition performance in
the presence of acoustic noise, no serious attempts have been taken in deploying
an AVASR system which can be used in realistic noisy environments, such as an
in-car scenario. A major reason for this is that nearly all the current work carried
out within this field has failed to focus on unwanted variabilities that lie within
the visual domain, such as head pose. In an attempt to remedy this situation,
the work in this thesis has concentrated on researching and developing methods
to recognise visual speech across multiple views. Within this broad problem, the
following specific goals were set as the main objectives for this thesis:
1. Recognise visual speech from profile views and compare it to its synchronous
counterpart in the frontal view,
2. Determine whether there is any complementary information contained within the
profile viewpoint by combining both frontal and profile features together to
form a multi-view lipreading system, and
3. Develop a pose-invariant lipreading system which can recognise visual speech
regardless of the head pose from a single camera.
162 Chapter 8. Conclusions and Future Research
All the work contained in this thesis was performed with the intention of address-
ing these novel and previously unsolved objectives. The major original contribu-
tions resulting from this work are summarised as follows:
(i) Prior to any work on lipreading from non-frontal views being conducted, a
thorough investigation of the cascade of appearance-based features, which
is the current state-of-the-art visual feature extraction technique, was un-
dertaken on the frontal section of the IBM smart-room database in Chapter
6. In this novel investigation, analysis of each stage of the cascade of
appearance-based features was performed, highlighting the problem of di-
mensionality in lipreading using a HMM classifier. Through this analysis
it was shown that certain measures can be taken to maximise the amount
of speech information extracted from the visual domain through the use of
the DCT and LDA techniques. The impact of feature mean normalisation
(FMN) was also quantified in this analysis, with the FMN step shown to
eliminate redundant speaker information which greatly affected the lipread-
ing performance.
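The DCT and FMN stages of this cascade can be sketched as follows. This is a minimal illustration assuming a square grayscale ROI; the function names and the number of retained coefficients are assumptions for this example, and the LDA stage is omitted:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] *= 1 / np.sqrt(2)
    return c * np.sqrt(2 / n)

def extract_features(roi_sequence, keep=10):
    """Per frame: 2-D DCT of the square ROI, retain the top-left `keep` x
    `keep` block (where low-frequency energy concentrates), and flatten.
    Then apply feature mean normalisation (FMN): subtracting the
    per-utterance mean suppresses static speaker/appearance information."""
    n = roi_sequence.shape[1]
    C = dct_matrix(n)
    feats = np.array([(C @ roi @ C.T)[:keep, :keep].ravel()
                      for roi in roi_sequence])
    return feats - feats.mean(axis=0)
```

In practice an LDA projection over concatenated adjacent frames would follow, to compress the features into a dimensionality the HMM classifier can model reliably.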
(ii) A visual front-end based on the extremely rapid Viola-Jones algorithm was
developed which could locate and track a speaker’s mouth ROI from both
the frontal and profile views (Chapters 4 and 6). For both the frontal and
profile views, a hierarchical approach was utilised which used the previously
located facial feature points for ROI extraction. For the frontal pose, the
left and right mouth corners were used for scale and rotation normalisation.
For the profile pose, the left eye and left mouth corner were used for scale
normalisation. Unfortunately, no rotation normalisation could be performed
on the profile pose as no reliable horizontal facial feature points could be
located.
(iii) The lipreading performance from a speaker’s profile view was quantified in
Chapter 6. This lipreading performance was then compared against its syn-
chronous frontal counterpart. This comparison was novel and unique, as it
was the first to show that reasonable lipreading performance can be obtained
from the profile view, albeit degraded when compared to the frontal view
(38.88% vs 27.66% WER).
(iv) A novel analysis technique using patches was employed on both the frontal
and profile mouth ROIs to determine the saliency of the various regions
of both the ROIs to the task of lipreading. In this innovative analysis, it
was shown that the middle patch, containing the most visible articulators
such as the lips, teeth and tongue, gave the most visual speech information
for the frontal view. Similarly, in the profile view the middle patch was also
the most informative; however, it was hypothesised that, in addition to the
lip, teeth and tongue information, the lip protrusion information
was also of benefit. From this patch-based analysis, a new multi-stream
representation of visual speech was developed which fused the most salient
patches of the ROI together via the synchronous multi-stream HMM. Using
this novel approach, it was found that slight gains could be made over the
holistic patch by fusing the holistic patch with the middle patch. This work
was conducted in Chapters 5 and 6.
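The synchronous multi-stream score combination underlying this fusion can be sketched as below; the function name and example weights are illustrative:

```python
def combined_log_likelihood(stream_loglikes, stream_weights):
    """Synchronous multi-stream HMM emission score: a weighted sum of the
    per-stream log-likelihoods for the current state, e.g. one stream for
    the holistic ROI patch and one for the middle patch. The weights
    conventionally sum to one and reflect each stream's reliability."""
    assert len(stream_loglikes) == len(stream_weights)
    return sum(w * ll for ll, w in zip(stream_loglikes, stream_weights))
```

With equal weights the two streams contribute equally; shifting weight toward the more informative patch is how the fusion can outperform either stream alone.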
(v) At the end of Chapter 6, a novel system which fuses both the frontal and
profile synchronous features together was described. This was referred to
as a multi-view lipreading system. The multi-view system presented in this
thesis was unique as it is the first lipreading system to have more than
one camera as its input. From the multi-view experiments, it was shown
that there does exist complementary information in the profile view, which
in turn improved the overall lipreading performance (multi-view WER =
25.66% compared to frontal WER = 27.66%).
(vi) A unified approach to lipreading was presented in Chapter 7, normalising
all poses to a single uniform pose. Given only one camera, this pose-
invariant lipreading system used a transformation matrix based on linear
regression to project the features of the unwanted pose (profile) into the
wanted pose (frontal). These experiments were performed for the station-
ary scenario, where the speaker was fixed in one pose (i.e. frontal or profile)
for the entire utterance and the pose of the speaker was assumed to be known. This
pose-normalising step was shown to lessen the train/test mismatch between
the two poses and was shown to be of particular benefit when the speaker
was in one pose more than the other (i.e. frontal over profile). When more
non-dominant poses were included (such as the other profile pose), the pose-
normalising step also proved to be of benefit.
(vii) A more realistic continuous pose-invariant lipreading system, which allows
the speaker to move their head during the utterance was proposed at the
end of Chapter 7. This constituted a much more difficult problem than the
stationary scenario as the pose of the speaker had to be estimated every
frame. In this novel system, the pose-estimator was developed in conjunc-
tion with the face localiser and achieved reasonable results. As the pose-
estimation step was at the front of the lipreading system, it introduced extra
error which affected the overall lipreading performance due to the front-end
effect. The results for these experiments, which were conducted on the
CUAVE database, show this to be the case as they are somewhat behind
the lipreading performance of the IBM smart-room database.
8.2 Future Research
In this thesis, solutions towards the problem of lipreading from multiple views
were investigated, with results from a multitude of experiments involving non-
frontal views presented for the small-vocabulary task of connected-digit recogni-
tion. Although head pose is a major source of variability, others such as
illumination, appearance, speaking style, image alignment (registration) and
speaker emotion and expression need to be investigated as well. A much more
robust AVASR system could be obtained if lipreading across these variables were
studied. In addition, future research needs to be conducted on large-vocabulary
data for this technology to become a viable option. However, to facilitate this
research, databases which cover large-vocabulary tasks as well as containing
these visual variabilities need to become available.
In Chapters 4 and 6, as part of the visual front-end, the Viola-Jones algorithm
[82, 180] was used for locating a speaker’s face and facial features for both frontal
and non-frontal views. The main motivation behind using this algorithm was
that it was extremely fast and reasonably accurate. Recently, a fast imple-
mentation of active appearance models (AAMs) using a variant of the gradient
descent algorithm has emerged which can run in real-time. As AAMs fit a 3-D
mesh onto a speaker’s face, this method promises to improve locating/tracking
performance as well as the pose-estimation process. Future research needs to be
conducted in this area, as accurate location of a speaker’s ROI is central to
the success of a lipreading system.
In Chapter 7, a viewpoint-transformed method using linear regression to
project visual features from an unwanted viewpoint into a wanted viewpoint was
developed. In addition to the viewpoint-transformed method, coefficient-based
methods, such as the light-field type approach [67], exist to perform the same type
of task. Future research is required to compare the coefficient-based methods to
the viewpoint-transformed methods, to get some kind of indication to which type
of approach is more suited to the task of lipreading.
Further into the future, it is possible that lipreading could evolve into one
of the key technologies used online. With the recent advent of the extremely
popular YouTube 1, users across the world have access to billions of video
clips on the internet. Having a lipreading system
which can automatically detect who is speaking, when they are speaking and what
they are saying within a video clip would be of major benefit for automatically
authenticating and possibly censoring these video clips. Even though this task is
outside the scope of this thesis, it is worth noting some of the potential that this
technology possesses.
1http://www.youtube.com
Bibliography
[1] A. Adjoudani and C. Benoit, “On the integration of auditory and visual
parameters in an HMM-based ASR,” in Speechreading by Humans and Ma-
chines (D. G. Stork and M. E. Hennecke, eds.), pp. 461–471, Berlin, Ger-
many: Springer, 1996.
[2] A. Adjoudani, T. Guiard-Marigny, B. LeGoff, L. Reveret, and C. Benoit,
“A multimedia platform for audio-visual speech processing,” in Proceed-
ings of the European Conference on Speech Communication and Technology,
(Rhodes, Greece), pp. 1671–1674, 1997.
[3] P. Aleksic, J. Williams, Z. Wu, and A. Katsaggelos, “Audiovisual speech
recognition using MPEG-4 compliant visual features,” EURASIP Journal
of Applied Signal Processing: Special Issue on Joint Audio-Visual Speech
Processing, vol. 2002, no. 11, pp. 629–642, 2002.
[4] E. Aronson and S. Rosenblum, “Space perception in early infancy: percep-
tion within a common auditory-visual space,” Science, vol. 172, pp. 1161–
1163, 1971.
[5] J. P. Barker and F. Berthommier, “Estimation of speech acoustics from
visual speech features: A comparison of linear and non-linear models,” in
Proceedings of the International Conference on Auditory-visual Speech Pro-
cessing, (Santa Cruz, USA), pp. 112–117, 1999.
[6] P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs Fisherfaces:
Recognition using class specific linear projection,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
[7] C. Benoit, T. Guiard-Martigny, B. L. Goff, and A. Adjoudani, “Which
components of the face do humans and machines best speechread?,” in
Speechreading by Humans and Machines (D. Stork and M. Hennecke, eds.),
Berlin, Germany: Springer, 1996.
[8] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[9] V. Blanz, P. Grother, P. Phillips, and T. Vetter, “Face recognition based on
frontal views generated from non-frontal images,” in Proceedings of the In-
ternational Conference on Computer Vision and Pattern Recognition, vol. 2,
(San Diego, CA, USA), pp. 454–461, 2005.
[10] H. Bourlard and S. Dupont, “A new ASR approach based on independent
processing and recombination of partial frequency bands,” in Proceedings
of International Conference on Spoken Language Processing, (Philadelphia,
PA, USA), pp. 426–429, 1996.
[11] M. Brand, N. Oliver, and A. Pentland, “Coupled hidden Markov models
for complex action recognition,” in Proceedings of the International Confer-
ence on Computer Vision and Pattern Recognition, (San Juan, Puerto Rico),
pp. 994–999, 1997.
[12] C. Bregler, H. Hild, S. Manke, and A. Waibel, “Improving connected letter
recognition by lipreading,” in Proceedings of the International Conference on
Acoustics, Speech and Signal Processing, (Minneapolis, USA), pp. 557–560,
1993.
[13] C. Bregler and Y. Konig, “Eigenlips for robust speech recognition,” in Pro-
ceedings of the International Conference on Acoustics, Speech and Signal
Processing, vol. 2, (Adelaide, Australia), pp. 669–672, 1994.
[14] N. Brooke and A. Summerfield, “Analysis, synthesis, and perception of vis-
ible articulatory movements,” Journal of Phonetics, pp. 63–76, 1983.
[15] R. Brunelli and T. Poggio, “Face recognition: Features versus templates,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15,
no. 10, pp. 1042–1052, 1993.
[16] U. Bub, M. Hunke, and A. Waibel, “Knowing who to listen to in speech
recognition: Visually guided beamforming,” in Proceedings of International
Conference on Acoustics, Speech, and Signal Processing, (Detroit, MI, USA),
pp. 848–851, 1995.
[17] R. Campbell, “Seeing brains reading speech: A review and speculations,” in
Speechreading by Humans and Machines (D. Stork and M. Hennecke, eds.),
pp. 115–133, Berlin, Germany: Springer-Verlag, 1996.
[18] M. Cathiard, M. Lallouache, and C. Abry, “Does movement on the lips
mean movement in the mind?,” in Speechreading by Humans and Machines
(D. Stork and M. Hennecke, eds.), pp. 211–219, Berlin, Germany: Springer-
Verlag, 1996.
[19] M. Chan, “HMM-based audio-visual speech recognition integrating geomet-
ric and appearance-based visual features,” in Proceedings of the Interna-
tional Workshop on Multimedia Signal Processing, (Cannes, France), pp. 9–
14, 2001.
[20] M. Chan, Y. Zhang, and T. S. Huang, “Real-time lip tracking and bimodal
continuous speech recognition,” in Proceedings of the International Work-
shop on Multimedia Signal Processing, (Los Angeles, CA, USA), pp. 65–70,
1998.
[21] D. Chandramohan and P. Silsbee, “A multiple deformable template approach
for visual speech recognition,” in Proceedings of the International Conference
on Spoken Language Processing, (Philadelphia, PA, USA), pp. 50–53, 1996.
[22] T. Chen, “Audiovisual speech processing,” IEEE Signal Processing Maga-
zine, pp. 9–31, 2001.
[23] T. Chen, H. Graf, and K. Wang, “Lip synchronization using speech-assisted
video processing,” IEEE Signal Processing Letters, vol. 2, no. 4, pp. 57–59,
1995.
[24] C. Chibelushi, F. Deravi, and J. Mason, “A review of speech-based bimodal
recognition,” IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 23–37,
2002.
[25] C. Chibelushi, S. Gandon, J. Mason, F. Deravi, and D. Johnston, “Design
issues for a digital integrated audio-visual database,” in IEE Colloquium on
Integrated Audio-Visual Processing for Recognition, Synthesis and Commu-
nication, (London, UK), pp. 7/1–7/7, 1996.
[26] CHIL: Computers in the Human Interaction Loop.
http://chil.server.de
[27] G. Chiou and J. Hwang, “Lipreading from color video,” IEEE Transactions
on Image Processing, vol. 6, pp. 1192–1195, August 1997.
[28] S. Chu and T. Huang, “Bimodal speech recognition using coupled hidden
Markov models,” in Proceedings of the International Conference on Spoken
Language Processing, (Beijing, China), pp. 747–750, 2000.
[29] S. Chu and T. Huang, “Audio-visual speech modeling using coupled hidden
Markov models,” in Proceedings of International Conference on Acoustics,
Speech and Signal Processing, (Orlando, Fl, USA), pp. 2009–2012, 2002.
[30] M. Cohen and D. Massaro, “What can visual speech synthesis tell visual
speech recognition?,” in Proceedings of the Asilomar Conference on Signals, Sys-
tems, and Computers, (Pacific Grove, CA, USA), 1994.
[31] A. Colmenarez and T. Huang, “Face detection with information-based
maximum discrimination,” in Proceedings of the International Conference
on Computer Vision and Pattern Recognition, (San Juan, Puerto Rico),
pp. 782–787, 1997.
[32] J. Connell, N. Haas, E. Marcheret, C. Neti, G. Potamianos, and S. Veli-
pasalar, “A real-time prototype for small-vocabulary audio-visual ASR,” in
Proceedings of the International Conference on Multimedia Expo, vol. 2, (Bal-
timore, MD, USA), pp. 469–472, 2003.
[33] T. Cootes, G. Edwards, and C. Taylor, “Active appearance models,” in Pro-
ceedings of the European Conference on Computer Vision, vol. 2, (Freiburg,
Germany), pp. 484–498, 1998.
[34] T. Cootes, A. Hill, C. Taylor, and J. Haslam, “Use of active shape models
for locating structures in medical images,” Image and Vision Computing,
vol. 12, pp. 355–365, July/August 1994.
[35] E. Cosatto, G. Potamianos, and H. Graf, “Audio-visual unit selection for the
synthesis of photo-realistic talking-heads,” in Proceedings of the International
Conference on Multimedia and Expo, (New York, NY, USA), pp. 1097–1100,
2000.
[36] P. Cosi and E. Caldognetto, “Lips and jaw movements for vowels and con-
sonants: Spatio-temporal characteristics and bimodal recognition applica-
tions,” in Speechreading by Humans and Machines (D. Stork and M. Hen-
necke, eds.), pp. 291–313, Berlin, Germany: Springer-Verlag, 1996.
[37] S. Cox, I. Matthews, and J. A. Bangham, “Combining noise compensation
with visual information in speech recognition,” in Proceedings of the Work-
shop on Audio-Visual Speech Processing, (Rhodes, Greece), 1997.
[38] D. Cristinacce, T. Cootes, and I. Scott, “A multi-stage approach to facial
feature detection,” in Proceedings of the British Machine Vision Conference,
(London, England), pp. 277–286, 2004.
[39] P. D. Cuetos, C. Neti, and A. Senior, “Audio-visual intent to speak detection
for human computer interaction,” in Proceedings of International Conference
on Acoustics, Speech, and Signal Processing, (Istanbul, Turkey), pp. 1325–
1328, 2000.
[40] L. Czap, “Lip representation by image ellipse,” in International Conference
on Spoken Language Processing, vol. 4, (Beijing, China), pp. 93–96, 2000.
[41] D. Dean, P. Lucey, S. Sridharan, and T. Wark, “Fused HMM-adaptation of
multi-stream HMMs for audio-visual speech recognition,” in Proceedings of
Interspeech (accepted), (Antwerp, Belgium), 2007.
[42] D. Dean, T. Wark, and S. Sridharan, “An examination of audio-visual fused
HMMs for speaker recognition,” in Proceedings of the Second Workshop on Mul-
timodal User Authentication, (Toulouse, France), 2006.
[43] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete
data via the EM algorithm,” Royal Statistical Society, vol. 39, pp. 1–38, 1977.
[44] B. Dodd, “The acquisition of lip-reading skills by normally hearing children,”
in Hearing by Eye: The Psychology of Lipreading (B. Dodd and R. Campbell,
eds.), pp. 163–175, London, England: Lawrence Erlbaum Associates Ltd,
1987.
[45] P. Duchnowski, U. Meier, and A. Waibel, “See me, hear me: Integrating
automatic speech recognition and lip reading,” in Proceedings of the Interna-
tional Conference on Spoken Language Processing, (Yokohama, Japan),
pp. 547–550, 1994.
[46] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley
and Sons, Inc., 2nd ed., 2001.
[47] S. Dupont and J. Luettin, “Audio-visual speech modeling for continu-
ous speech recognition,” IEEE Transactions on Multimedia, vol. 2, no. 3,
pp. 141–151, 2000.
[48] K. Finn, An investigation of visible lip information to be used in automated
speech recognition. PhD thesis, Georgetown University, Washington DC,
USA, 1986.
[49] Y. Freund and R. Schapire, “A decision-theoretic generalization of on-line
learning and an application to boosting,” in Computational Learning Theory:
Eurocolt ’95, pp. 23–37, Springer-Verlag, 1995.
[50] H. Frowein, G. Smoorenburg, L. Pyters, and D. Schnikel, “Improved speech
recognition through video telephony: Experiments with the hard of hearing,”
IEEE Journal of Selected Areas in Communications, vol. 9, pp. 611–616, May
1991.
[51] K. Fukunaga, Introduction to statistical pattern recognition. Academic Press
Inc., 2nd ed., 1990.
[52] D. Gatica-Perez, G. Lathoud, J.-M. Odobez, and I. McCowan, “Multimodal
multispeaker probabilistic tracking in meetings,” in Proceedings of the In-
ternational Conference on Multimodal Interfaces, 2005.
[53] L. Girin, A. Allard, and J. Schwartz, “Speech signals separation: A new
approach exploiting the coherence of audio and visual speech,” in Proceed-
ings on the Workshop on Multimedia Signal Processing, (Cannes, France),
pp. 631–636, 2001.
[54] L. Girin, G. Feng, and J. Schwartz, “Noisy speech enhancement with fil-
ters estimated from the speaker’s lips,” in European Conference on Speech
Communication and Technology, (Madrid, Spain), pp. 1559–1562, 1995.
[55] L. Girin, J. Schwartz, and G. Feng, “Audio-visual enhancement of speech
in noise,” Journal of the Acoustical Society of America, vol. 109, no. 6,
pp. 3007–3020, 2001.
[56] R. Goecke, A stereo vision lip tracking algorithm and subsequent statistical
analysis of the audio-video correlation in Australian English. PhD thesis,
Australian National University.
[57] R. Goecke and J. Millar, “A detailed description of the AVOZES data cor-
pus,” in Proceedings of the 10th Australian International Conference on
Speech Science and Technology, (Sydney, Australia), pp. 486–491, 2004.
[58] R. Goecke, G. Potamianos, and C. Neti, “Noisy audio feature enhancement
using audio-visual speech data,” in International Conference on Acoustics,
Speech and Signal Processing, (Orlando, FL, USA), pp. 2025–2028, 2002.
[59] A. J. Goldschen, O. N. Garcia, and E. Petajan, “Continuous optical au-
tomatic speech recognition by lipreading,” in Proceedings of the Asilomar
Conference on Signals, Systems and Computers, (Pacific Grove, CA, USA),
pp. 572–577, 1994.
[60] A. Goldschen, O. Garcia, and E. Petajan, “Rationale for phoneme-viseme
mapping and feature selection in visual speech recognition,” in Speechreading
by Humans and Machines (D. Stork and M. Hennecke, eds.), pp. 505–515,
Berlin, Germany: Springer, 1996.
[61] R. Gopinath, “Maximum likelihood modeling with Gaussian distributions for
classification,” in Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, (Seattle, WA, USA), pp. 661–664, 1998.
[62] J. Gowdy, S. Amarnag, C. Bartels, and J. Bilmes, “DBN based multi-stream
models for audio-visual speech recognition,” in Proceedings of the Interna-
tional Conference on Acoustics, Speech and Signal Processing, (Montreal,
Canada), pp. 993–996, 2004.
[63] G. Gravier, G. Potamianos, and C. Neti, “Asynchrony modeling for audio-
visual speech recognition,” in Proceedings of the Human Language Technol-
ogy Conference, (San Diego, CA, USA), pp. 1–6, 2002.
[64] M. Gray, J. Movellan, and T. Sejnowski, “A comparison of local versus global
image decompositions for visual speechreading,” in Fourth Joint Symposium
on Neural Computation, pp. 92–98, 1997.
[65] M. Gray, J. Movellan, and T. Sejnowski, “Dynamic features for visual speech-
reading: A systematic comparison,” in Advances in Neural Information
Processing (M. Mozer, M. Jordan, and T. Petsche, eds.), pp. 751–757, Cam-
bridge, MA: MIT Press, 1997.
[66] K. Green, “The use of auditory and visual information in phonetic percep-
tion,” in Speechreading by Humans and Machines (D. Stork and M. Hen-
necke, eds.), pp. 55–77, Berlin, Germany: Springer-Verlag, 1996.
[67] R. Gross, I. Matthews, and S. Baker, “Appearance-based face recognition
and light-fields,” IEEE Transactions on Pattern Analysis and Machine In-
telligence, vol. 26, pp. 449–465, April 2004.
[68] S. Gurbuz, Z. Tufekci, E. Patterson, and J. Gowdy, “Application of affine-
invariant Fourier descriptors to lipreading for audio-visual speech recogni-
tion,” in Proceedings of the International Conference on Acoustics, Speech
and Signal Processing, (Salt Lake City, UT, USA), pp. 177–180, 2001.
[69] M. Heckmann, F. Berthommier, and K. Kroschel, “A hybrid ANN/HMM
audio-visual speech recognition system,” in Proceedings of the International
Conference on Auditory-Visual Speech Processing, (Aalborg, Denmark),
pp. 190–195, 2001.
[70] M. Heckmann, F. Berthommier, and K. Kroschel, “Optimal weighting of
posteriors for audio-visual speech recognition,” in Proceedings of the Inter-
national Conference on Acoustics, Speech, and Signal Processing, vol. 1, (Salt
Lake City, UT, USA), pp. 161–164, 2001.
[71] M. Heckmann, F. Berthommier, and K. Kroschel, “Noise adaptive stream
weighting in audio-visual speech recognition,” EURASIP Journal on Applied
Signal Processing, vol. 2002, no. 11, pp. 1260–1273, 2002.
[72] M. Heckmann, K. Kroschel, and C. Savariaux, “DCT-based video features for
audio-visual speech recognition,” in Proceedings of International Conference
on Spoken Language and Processing, (Denver, CO, USA), pp. 1925–1928,
2002.
[73] M. Hennecke, D. Stork, and K. Prasad, “Visionary speech: Looking ahead to
practical speechreading systems,” in Speechreading by humans and machines
(D. Stork and M. Hennecke, eds.), pp. 331–349, Berlin, Germany: Springer-
Verlag, 1996.
[74] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans-
actions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[75] F. Huang and T. Chen, “Consideration of Lombard effect for speechread-
ing,” in Proceedings of the Workshop on Multimedia Signal Processing, (Cannes,
France), pp. 613–618, 2001.
[76] J. Huang, Z. Liu, Y. Wang, and E. Wong, “Integration of multimodal
features for video scene classification based on HMM,” in Proceedings of
the Workshop on Multimedia Signal Processing, (Copenhagen, Denmark),
pp. 53–58, 1999.
[77] J. Huang, G. Potamianos, J. Connell, and C. Neti, “Audio-visual speech
recognition using an infrared headset,” Speech Communication, vol. 44, no. 4,
pp. 83–96, 2004.
[78] A. Hyvarinen and E. Oja, “Independent component analysis: Algorithms
and applications,” Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000.
[79] A. Jain, R. Duin, and J. Mao, “Statistical pattern recognition: A review,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22,
no. 1, pp. 4–37, 2000.
[80] O. Jesorsky, K. Kirchberg, and R. Frischholz, “Robust face detection using
the Hausdorff distance,” in Proceedings of the International Conference on
Audio and Video Biometric Person Authentication, (Halmstad, Sweden),
pp. 90–95, June 2001.
[81] J. Jiang, G. Potamianos, H. Nock, G. Iyengar, and C. Neti, “Improved face
and feature finding for audio-visual speech recognition in visually challenging
environments,” in Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, vol. 5, (Montreal, Canada), pp. 873–876, 2004.
[82] M. Jones and P. Viola, “Fast multi-view face detection,” Tech. Rep. TR2003-
96, MERL, June 2003.
[83] T. Jordan and P. Sergeant, “Effects of facial image size on visual and audio-
visual speech,” in Hearing by Eye II (R. Campbell, B. Dodd, and D. Burn-
ham, eds.), pp. 155–176, Hove: Psychology Press Ltd. Publishers, 1998.
[84] T. R. Jordan and S. M. Thomas, “Effects of horizontal viewing angle on
visual and audiovisual speech recognition,” Journal of Experimental Psy-
chology: Human Perception and Performance, vol. 27, pp. 1386–1403, 2001.
[85] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, “Acoustic-labial speaker
verification,” in Proceedings of the International Conference on Audio- and
Video-Based Biometric Person Authentication, pp. 319–326, 1997.
[86] J. Junqua, “The Lombard reflex and its role on human listeners and auto-
matic speech recognizers,” Journal of the Acoustical Society of America,
vol. 93, pp. 510–524, 1993.
[87] T. Kanade and A. Yamada, “Multi-subregion based probabilistic approach
towards pose-invariant face recognition,” vol. 2, (Kobe, Japan), pp. 954–959,
2003.
[88] M. Kaynak, Q. Zhi, A. Cheok, K. Sengupta, Z. Jian, and K. Chung, “Lip geo-
metric features for human-computer interaction using bimodal speech recog-
nition: Comparison and analysis,” Speech Communication, vol. 43, no. 1-2,
pp. 1–16, 2004.
[89] E. Kreyszig, Advanced Engineering Mathematics. John Wiley and Sons, Inc,
7 ed., 1993.
[90] G. Krone, B. Talle, A. Wichert, and G. Palm, “Neural architectures for sen-
sor fusion in speech recognition,” in European Tutorial Workshop on Audio-
Visual Speech Processing, (Rhodes, Greece), pp. 57–60, 1997.
[91] K. Kumar, T. Chen, and R. Stern, “Profile view lip reading,” in Proceedings
of the International Conference on Acoustics, Speech and Signal Processing,
vol. 4, (Honolulu, Hawaii), pp. 429–432, 2007.
[92] F. Lavagetto, “Converting speech into lip movements: a multimedia tele-
phone for hard of hearing people,” IEEE Transactions on Rehabilitation
Engineering, vol. 3, no. 1, pp. 90–102, 1995.
[93] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu,
and T. Huang, “AVICAR: An audiovisual speech corpus in a car environ-
ment,” in Proceedings of the International Conference on Spoken Language
Processing, (Jeju Island, Korea), pp. 2489–2492, 2004.
[94] R. Lienhart and J. Maydt, “An extended set of Haar-like features,” in Pro-
ceedings of the International Conference on Image Processing, (Rochester,
NY, USA), pp. 900–903, 2002.
[95] M. Lew, “Information theoretic view-based and modular face detection,” in
Proceedings of the International Conference on Automatic Face and Gesture
Recognition, (Killington, VT, USA), pp. 198–203, 1996.
[96] S. Li, J. Sherrah, and H. Liddell, “Multi-view face detection using support
vector machines and eigenspace modelling,” in Proceedings of the Interna-
tional Conference on Knowledge-Based Intelligent Engineering Systems and
Allied Technologies, (Brighton, UK), pp. 241–244, 2000.
[97] S. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum, “Statistical learn-
ing of multi-view face detection,” in Proceedings of the European Conference
on Computer Vision, (Copenhagen, Denmark), pp. 38–44, May 2002.
[98] L. Liang, X. Liu, Y. Zhao, X. Pi, and A. Nefian, “Speaker independent audio-
visual continuous speech recognition,” in Proceedings of the International
Conference on Multimedia and Expo, vol. 2, (Lausanne, Switzerland), pp. 25–
28, August 2002.
[99] M. Lievin and F. Luthon, “Unsupervised lip segmentation under natural
conditions,” in Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, (Phoenix, AZ, USA), pp. 3065–3068, March
1999.
[100] F. Liu, R. Stern, X. Huang, and A. Acero, “Efficient cepstral normalization
for robust speech recognition,” in Proceedings of the Workshop on Human
Language Technology, (Morristown, NJ, USA), pp. 69–74, 1993.
[101] S. Lucey, Audio-Visual Speech Processing. PhD thesis, Queensland Univer-
sity of Technology, Brisbane, Australia, 2002.
[102] S. Lucey, “An evaluation of visual speech features for the tasks of speech
and speaker recognition,” in Proceedings of the International Conference of
Audio- and Video-Based Person Authentication, (Guildford, U.K.), pp. 260–
267, 2003.
[103] S. Lucey and T. Chen, “Learning patch dependencies for improved pose
mismatched face verification,” in Proceedings of the International Conference
on Computer Vision and Pattern Recognition, vol. 1, (New York, NY, USA),
pp. 909–915, June 2006.
[104] P. Lucey, D. Dean, and S. Sridharan, “Problems associated with current
area-based visual speech feature extraction techniques,” in Proceedings of the
International Conference on Auditory-Visual Speech Processing, (Vancouver
Island, Canada), pp. 73–78, 2005.
[105] P. Lucey and G. Potamianos, “Lipreading using profile versus frontal
views,” in Proceedings of the IEEE International Workshop on Multimedia
Signal Processing, (Victoria, BC, Canada), pp. 24–28, 2006.
[106] P. Lucey and S. Sridharan, “Patch-based representation of visual speech,”
in HCSNet Workshop on the Use of Vision in Human-Computer Interaction,
(VisHCI 2006) (R. Goecke, A. Robles-Kelly, and T. Caelli, eds.), vol. 56 of
CRPIT, (Canberra, Australia), pp. 79–85, ACS, 2006.
[107] J. Luettin, G. Potamianos, and C. Neti, “Asynchronous stream modeling
for large vocabulary audio-visual speech recognition,” in Proceedings of the
International Conference on Acoustics, Speech, and Signal Processing, vol. 1,
(Salt Lake City, UT, USA), pp. 169–172, 2001.
[108] J. Luettin, N. Thacker, and S. Beet, “Speaker identification by lipreading,”
in Proceeding of the International Conference on Spoken Language Process-
ing, vol. 1, (Philadelphia, PA, USA), pp. 62–65, 1996.
[109] J. Luettin, N. A. Thacker, and S. W. Beet, “Speechreading using shape
and intensity information,” in Proceedings of the International Conference
on Spoken Language Processing, vol. 1, (Philadelphia, PA, USA), pp. 58–61,
1996.
[110] R. J. Mammone, X. Zhang, and R. P. Ramachandran, “Robust speaker
recognition: A feature based approach,” IEEE Signal Processing Magazine,
vol. 13, pp. 58–70, September 1996.
[111] A. M. Martinez, “Recognizing imprecisely localized, partially occluded, and
expression variant faces from a single sample per class,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 748–763,
2002.
[112] K. Mase and A. Pentland, “Automatic lipreading by optical-flow analysis,”
Systems and Computers in Japan, vol. 22, no. 6, pp. 67–76, 1991.
[113] I. Matthews, J. Bangham, and S. Cox, “Audio-visual speech recognition us-
ing multiscale nonlinear image decomposition,” in International Conference
on Spoken Language Processing, (Philadelphia, PA, USA), pp. 38–41, 1996.
[114] I. Matthews, T. Cootes, J. Bangham, S. Cox, and R. Harvey, “Extraction
of visual features for lipreading,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 24, no. 2, pp. 198–213, 2002.
[115] I. Matthews, T. Cootes, S. Cox, R. Harvey, and J. A. Bangham, “Lipreading
using shape, shading and scale,” in Proceedings of the International Confer-
ence on Auditory-Visual Speech Processing, (Sydney, Australia), pp. 73–78,
1998.
[116] I. Matthews, G. Potamianos, C. Neti, and J. Luettin, “A comparison
of model and transform-based visual features for audio-visual LVCSR,” in
Proceedings of International Conference on Multimedia and Expo, (Tokyo,
Japan), 2001.
[117] M. McGrath and Q. Summerfield, “Intermodal timing relations and audio-
visual speech recognition,” Journal of the Acoustical Society of America,
vol. 77, pp. 678–685, February 1985.
[118] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature,
vol. 264, pp. 746–748, December 1976.
[119] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB:
The extended M2VTS database,” in International Conference on Audio and
Video-based Biometric Person Authentication, (Washington D.C., USA),
1999.
[120] A. Mills, “The development of phonology in blind children,” in Hearing
by Eye: The Psychology of Lipreading (B. Dodd and R. Campbell, eds.),
pp. 145–161, London, England: Lawrence Erlbaum Associates Ltd, 1987.
[121] B. Moghaddam and A. Pentland, “Probabilistic Visual Learning for Ob-
ject Representation,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 19, pp. 696–710, July 1997.
[122] J. R. Movellan and G. Chadderdon, “Channel separability in the audio
visual integration of speech: A bayesian approach,” in Speechreading by Hu-
mans and Machines (D. G. Stork and M. E. Hennecke, eds.), pp. 473–487,
Berlin: Springer, 1996.
[123] S. Nakamura, “Fusion of audio-visual information for integrated speech pro-
cessing,” in Audio and Video-based Biometric Person Authentication (J. Bi-
gun and F. Smearaldi, eds.), pp. 127–143, Berlin, Germany: Springer-Verlag,
2001.
[124] A. Nefian and M. Hayes, “Face detection and recognition using hidden
Markov models,” in Proceedings of the International Conference on Image
Processing, (Chicago, IL, USA), pp. 141–145, 1998.
[125] A. Nefian, L. Liang, X. Pi, X. Liu, and C. Mao, “A coupled HMM for audio-
visual speech recognition,” in Proceedings of the International Conference
on Acoustics, Speech and Signal Processing, (Orlando, FL, USA), pp. 2013–
2016, 2002.
[126] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, and D. Vergyri,
“Large-vocabulary audio-visual speech recognition: A summary of the Johns
Hopkins summer 2000 workshop,” in Proceedings of the Workshop on Multi-
media Signal Processing, Special Section on Joint Audio-Visual Processing,
(Cannes, France), 2001.
[127] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri,
J. Sison, A. Mashari, and J. Zhou, “Audio-Visual Speech Recognition, Fi-
nal Workshop 2000 Report,” tech. rep., Center for Language and Speech
Processing, The Johns Hopkins University, Baltimore, 2000.
[128] Open Source Computer Vision Library.
http://www.intel.com/research/mrl/research/opencv
[129] C. Papageorgiou, M. Oren, and T. Poggio, “A general framework for object
detection,” in Proceedings of International Conference on Computer Vision,
(Bombay, India), 1998.
[130] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy, “CUAVE: A new audio-
visual database for multimodal human-computer interface research,” in Pro-
ceedings of the International Conference on Acoustics, Speech and Signal
Processing, (Orlando, FL, USA), 2002.
[131] A. Pentland, “Smart rooms, smart clothes,” in Proceedings of the Inter-
national Conference on Pattern Recognition, vol. 2, (Brisbane, Australia),
pp. 949–953, 1998.
[132] E. Petajan, “Automatic lipreading to enhance speech recognition,” in IEEE
Global Telecommunications Conference, (Atlanta, GA, USA), pp. 265–272,
IEEE, 1984.
[133] S. Pigeon and L. Vandendorpe, “The M2VTS multimodal face database,”
in Proceedings of the International Conference on Audio and Video-based
Biometric Person Authentication, (Crans-Montana, Switzerland), 1997.
[134] G. Potamianos, E. Cosatto, H. Graf, and D. Roe, “Speaker independent au-
diovisual database for bimodal ASR,” in Proceedings of the European Tuto-
rial Workshop on Audiovisual Speech Processing, (Rhodes, Greece), pp. 65–
68, 1997.
[135] G. Potamianos and H. P. Graf, “Discriminative training of HMM stream
exponents for audio-visual speech recognition,” in Proceedings of the Inter-
national Conference on Acoustics, Speech and Signal Processing, (Seattle,
WA, USA), pp. 3733–3736, 1998.
[136] G. Potamianos and H. Graf, “Linear discriminant analysis for speechread-
ing,” in Proceedings of the Workshop on Multimedia and Signal Processing,
(Los Angeles, CA, USA), pp. 221–226, 1998.
[137] G. Potamianos, H. Graf, and E. Cosatto, “An image transform approach
for HMM based automatic lipreading,” in Proceedings of International Con-
ference on Image Processing, vol. 3, (Chicago, IL, USA), pp. 173–177, 1998.
[138] G. Potamianos and P. Lucey, “Audio-visual ASR from multiple views inside
smart rooms,” in Proceedings of the International Conference on Multisen-
sor Fusion and Integration for Intelligent Systems, (Heidelberg, Germany),
pp. 35–40, 2006.
[139] G. Potamianos, J. Luettin, and C. Neti, “Hierarchical discriminant features
for audio-visual LVCSR,” in Proceedings of the International Conference on
Acoustics, Speech and Signal Processing, pp. 165–168, 2001.
[140] G. Potamianos and C. Neti, “Improved ROI and within frame discriminant
features for lipreading,” in Proceedings of International Conference on Image
Processing, vol. 3, (Thessaloniki, Greece), pp. 250–253, 2001.
[141] G. Potamianos and C. Neti, “Audio-visual speech recognition in challenging
environments,” in Proceedings of the European Conference on Speech Com-
munication and Technology, (Geneva, Switzerland), pp. 1293–1296, 2003.
[142] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, “Recent
advances in the automatic recognition of audio-visual speech,” Proceedings
of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003.
[143] G. Potamianos, C. Neti, G. Iyengar, A. Senior, and A. Verma, “A cascade
visual front end for speaker independent automatic speechreading,” Inter-
national Journal of Speech Technology, vol. 4, no. 3-4, pp. 193–208, 2001.
[144] G. Potamianos and P. Scanlon, “Exploiting lower face symmetry in
appearance-based automatic speechreading,” in Proceedings of the Auditory-
Visual Speech Processing International Conference 2005, (British Columbia,
Canada), pp. 79–84, 2005.
[145] G. Potamianos, A. Verma, C. Neti, G. Iyengar, and S. Basu, “A cascade
image transform for speaker independent automatic speechreading,” in Pro-
ceedings of the International Conference on Multimedia and Expo, vol. 2,
(New York, NY, USA), pp. 1097–1100, 2000.
[146] L. R. Rabiner, “A tutorial on hidden Markov models and selected applica-
tions in speech recognition,” Proceedings of the IEEE, vol. 77, pp. 257–286,
February 1989.
[147] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Englewood
Cliffs, N.J.: Prentice Hall, 1993.
[148] M. Ramos Sanchez, J. Matas, and J. Kittler, “Statistical chromaticity mod-
els for lip tracking with B-splines,” in Proceedings of the International Con-
ference on Audio and Video based Biometric Person Authentication, (Crans-Montana, Switzerland), pp. 69–76, 1997.
[149] R. Rao and R. Mersereau, “Lip modelling for visual speech recognition,” in
Proceedings of the Asilomar Conference on Signals, Systems and Computers,
vol. 1, (Pacific Grove, CA, USA), pp. 587–590, 1994.
[150] J. Robert-Ribes, J. Schwartz, T. Lallouache, and P. Escudier, “Comple-
mentarity and synergy in bimodal speech: Auditory, visual, and audio-visual
identification of French oral vowels in noise,” Journal of the Acoustical Society
of America, vol. 103, no. 6, pp. 3677–3689, 1998.
[151] L. Rosenblum and H. Saldaña, “An audiovisual test of kinematic primitives
for visual speech perception,” Journal of Experimental Psychology: Human
Perception and Performance, vol. 22, no. 2, pp. 318–331, 1996.
[152] L. Rosenblum and H. Saldaña, “Time-varying information for visual speech
perception,” in Hearing by Eye II (R. Campbell, B. Dodd, and D. Burnham,
eds.), pp. 61–81, Hove, United Kingdom: Psychology Press Ltd. Publishers,
1998.
[153] L. Rothkrantz, J. Wojdel, and P. Wiggers, “Comparison between different
feature extraction techniques in lipreading applications,” in Proceedings of
the International Conference Speech and Computer, (St. Petersburg, Russia),
2006.
[154] H. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detec-
tion,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 20, pp. 23–38, 1998.
[155] H. Rowley, S. Baluja, and T. Kanade, “Rotation invariant neural-network
based face detection,” in Proceedings of the International Conference on
Computer Vision and Pattern Recognition, (Santa Barbara, CA, USA),
pp. 38–44, 1998.
[156] K. Saenko, “Articulatory features for robust visual speech recognition,”
Master’s thesis, Massachusetts Institute of Technology, MA, USA, 2004.
[157] K. Saenko, T. Darrel, and J. Glass, “Articulatory features for robust vi-
sual speech recognition,” in Proceedings of the International Conference on
Multimodal Interfaces, (State College, PA, USA), pp. 152–158, 2004.
[158] K. Saenko and K. Livescu, “An asynchronous DBN for audio-visual speech
recognition,” in Proceedings of the Workshop on Spoken Language Technol-
ogy, (Palm Beach, Aruba), pp. 92–98, 2006.
[159] C. Sanderson, “The VidTIMIT database,” in IDIAP Communication,
(Martigny, Switzerland), 2002.
[160] C. Sanderson, Automatic person verification using speech and face information. PhD thesis, Griffith University, Brisbane, Australia, 2004.
[161] P. Scanlon and R. Reilly, “Feature analysis for automatic speechreading,” in
Proceedings of the Workshop on Multimedia and Signal Processing, (Cannes,
France), pp. 625–630, 2001.
[162] R. Schapire, Y. Freund, P. Bartlett, and W. Lee, “Boosting the margin: A
new explanation for the effectiveness of voting methods,” in Proceedings of
the International Conference on Machine Learning, (Nashville, TN, USA),
pp. 322–330, 1997.
[163] H. Schneiderman and T. Kanade, “A histogram-based method for detec-
tion of faces and cars,” in Proceedings of the International Conference on
Computer Vision and Pattern Recognition, (Hilton Head Island, SC, USA),
pp. 504–507, 2000.
[164] P. Silsbee and A. Bovik, “Computer lipreading for improved accuracy in au-
tomatic speech recognition,” IEEE Transactions on Speech and Audio Pro-
cessing, pp. 337–351, 1996.
[165] Q. Su and P. Silsbee, “Robust audiovisual integration using semicontinuous
hidden Markov models,” in International Conference on Spoken Language
Processing, (Philadelphia, PA, USA), 1996.
[166] W. Sumby and I. Pollack, “Visual contribution to speech intelligibility,”
Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212–215,
1954.
[167] A. Summerfield, “Some preliminaries to a comprehensive account of audio-
visual speech perception,” in Hearing by Eye: The Psychology of Lip-Reading
(B. Dodd and R. Campbell, eds.), pp. 3–51, London, United Kingdom: Lawrence Erlbaum Associates, 1987.
[168] A. Summerfield, “Lipreading and audio-visual speech perception,” Philo-
sophical Transactions of the Royal Society of London, Series B, pp. 71–78,
1992.
[169] A. Summerfield, A. MacLeod, M. McGrath, and M. Brooke, “Lips, teeth,
and the benefits of lipreading,” in Handbook of Research on Face Processing
(A. Young and H. Ellis, eds.), pp. 223–233, Amsterdam, The Netherlands:
Elsevier Science Publishers, 1989.
[170] K. Sung and T. Poggio, “Example-based learning for view-based human
face detection,” IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, vol. 20, pp. 39–51, 1998.
[171] S. Tamura, K. Iwano, and S. Furui, “Multi-modal speech recognition us-
ing optical-flow analysis for lip images,” Journal of VLSI Signal Processing
Systems, vol. 36, no. 2-3, pp. 117–124, 2004.
[172] P. Teissier, J. Robert-Ribes, J. Schwartz, and A. Guérin-Dugué, “Compar-
ing models for audiovisual fusion in a noisy-vowel recognition task,” IEEE
Transactions on Speech and Audio Processing, vol. 7, no. 6, pp. 629–642, 1999.
[173] A. M. Tekalp, Digital Video Processing. Prentice-Hall, 1995.
[174] Y. Tian, T. Kanade, and J. Cohn, “Robust lip tracking by combining shape
color and motion,” in Proceedings of the Asian Conference on Computer
Vision, (Taipei, Taiwan), pp. 1040–1045, 2000.
[175] M. Tomlinson, M. Russell, and N. Brooke, “Integrating audio and visual
information to provide highly robust speech recognition,” in Proceedings of
the International Conference on Acoustics, Speech and Signal Processing,
vol. 2, (Atlanta, GA, USA), pp. 821–824, 1996.
[176] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive
Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[177] A. Varga and R. Moore, “Hidden Markov model decomposition of speech
and noise,” in Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, vol. 2, (Albuquerque, NM, USA), pp. 845–
848, 1990.
[178] E. Vatikiotis-Bateson, G. Bailly, and P. Perrier, eds., Audio-Visual Speech
Processing. MIT Press, 2006.
[179] E. Vatikiotis-Bateson, K. Munhall, M. Hirayama, Y. Lee, and D. Terzopou-
los, “The dynamics of audiovisual behaviour in speech,” in Speechreading
by Humans and Machines (D. Stork and M. Hennecke, eds.), pp. 221–232,
Berlin, Germany: Springer-Verlag, 1996.
[180] P. Viola and M. Jones, “Rapid object detection using a boosted cascade
of simple features,” in Proceedings of the International Conference on Com-
puter Vision and Pattern Recognition, vol. 1, (Kauai, HI, USA), pp. 511–518,
2001.
[181] C. Wang and M. Brandstein, “Multi-source face tracking with audio and vi-
sual data,” in Proceedings of the Workshop on Multimedia Signal Processing,
(Copenhagen, Denmark), pp. 475–481, 1999.
[182] T. Wark, Multi-modal Speech Processing for Automatic Speaker Recogni-
tion. PhD Thesis, Queensland University of Technology, Brisbane, Australia,
2001.
[183] Wikipedia, “KITT — Wikipedia, the free encyclopedia,” 2007.
http://en.wikipedia.org/wiki/KITT
[Online; accessed 02-September-2007]
[184] M. Yang, N. Ahuja, and D. Kriegman, “Mixtures of linear subspaces for
face detection,” in Proceedings of the International Conference on Automatic
Face and Gesture Recognition, (Grenoble, France), pp. 70–76, 2000.
[185] G. Yang and T. Huang, “Human face detection in complex background,”
Pattern Recognition, vol. 27, no. 1, pp. 53–63, 1994.
[186] M. Yang, D. Kriegman, and N. Ahuja, “Detecting faces in images: A
survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 24, no. 1, pp. 34–58, 2002.
[187] T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui, “Audio visual speech
recognition using lip movement extracted from side-face images,” in Pro-
ceedings of the Workshop on Auditory Visual Speech Processing, (St Jorioz,
France), pp. 117–120, 2003.
[188] T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui, “Audio visual speech
recognition using new lip features extracted from side-face images,” in Pro-
ceedings of the Workshop on Robustness Issues in Conversational Interac-
tion, (Norwich, England), 2004.
[189] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ol-
lason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK
Version 3.2.1). Entropic Ltd, 2002.
[190] A. Yuille, P. Hallinan, and D. Cohen, “Feature extraction from faces using
deformable templates,” International Journal of Computer Vision, vol. 8,
no. 2, pp. 99–111, 1992.
[191] X. Zhang, C. Broun, R. Mersereau, and M. Clements, “Automatic
speechreading with applications to human-computer interfaces,” EURASIP
Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1228–1247, 2002.
[192] Z. Zhang, G. Potamianos, S. Chu, J. Tu, and T. Huang, “Person tracking
in smart rooms using dynamic programming and adaptive subspace learn-
ing,” in Proceedings of the International Conference on Multimedia and Expo,
(Toronto, Canada), pp. 2061–2064, 2006.
Appendix A
Dynamic Parameter Analysis
Figure A.1: Plots of the lipreading results for the dynamic and final features on the MRDCT (a) and MRDiff (b) features, using various values for J and P with N = 10 input features. Both panels plot lipreading performance, WER (%), against the number of features used per feature vector (P), for J = 4 to 7.
As mentioned in Section 5.4.2, many different permutations of input features
to the inter-frame LDA step were tested to determine the optimal lipreading results.
To obtain the best lipreading performance from the final dynamic feature vector,
a trade-off has to be made between the length of the input static feature vector,
N, and the number of adjacent frames, J. This balance is required because
calculating the transformation matrix, WIILDA, is quite computationally expensive
and there is a limit on how large the input matrix XI can be (approximately
a 6-million-element matrix). In Figures A.1(a) and (b), only N = 10 static input
features were used, across J = 4 to 7 adjacent frames. From these figures it can
be seen that the performance for the MRDCT static features hovers just below
Figure A.2: Plots of the lipreading results for the dynamic and final features on the MRDCT (a) and MRDiff (b) features, using various values for J and P with N = 20 input features. Both panels plot lipreading performance, WER (%), against the number of features used per feature vector (P), for J = 1 to 5.
Figure A.3: Plots of the lipreading results for the dynamic and final features on the MRDCT (a) and MRDiff (b) features, using various values for J and P with N = 30 input features. Both panels plot lipreading performance, WER (%), against the number of features used per feature vector (P), for J = 1 to 4.
the 30% WER mark, whilst the MRDiff static features sit just above it. Compared
to the best-case WER of 27.66% obtained using N = 30 and J = 2, it can be
seen that these parameters do not give the optimal performance. It is also worth
noting that there is no discernible difference in performance among J = 4 to 7.
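The trade-off between N and J described above can be sketched in Python. This is an illustrative reconstruction, not the thesis code: it assumes the temporal window spans J frames either side of the current frame (so each stacked vector has (2J + 1)N elements), and that the element count of the stacked input matrix must stay under roughly six million.

```python
import numpy as np

def stack_temporal_window(static_feats, J):
    """Stack each frame with its J neighbours on either side, giving
    (2J+1)*N-dimensional vectors as input to the inter-frame LDA step.

    static_feats: (T, N) array of per-frame static features.
    Returns: (T - 2J, (2J+1)*N) array (edge frames are dropped).
    """
    T, N = static_feats.shape
    rows = [np.concatenate(static_feats[t - J:t + J + 1])
            for t in range(J, T - J)]
    return np.asarray(rows)

# Rough check against the ~6-million-element limit on the LDA input
# matrix mentioned above (the frame count is illustrative only).
N, J, frames = 30, 2, 20000
elements = frames * (2 * J + 1) * N
assert elements < 6e6, "input matrix too large for LDA training"
```

For example, with N = 10 even J = 7 keeps the stacked matrix comfortably under the limit, which is consistent with the range of J values tried in Figure A.1.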
In Figures A.2(a) and (b), N = 20 input static features are used across J = 1
to 5 adjacent frames. In Figure A.2(a), it can be seen that when the temporal win-
dow J is increased from 1 to 2, the lipreading performance improves significantly
(by an average of 5%). Increasing J past 2 yielded no real improvement.
In Figure A.2(b), as some temporal information was already
Figure A.4: Plots of the lipreading results for the dynamic and final features on the MRDCT (a) and MRDiff (b) features, using various values for J and P with N = 40 input features. Both panels plot lipreading performance, WER (%), against the number of features used per feature vector (P), for J = 1 to 3.
included in the difference features, no real benefit was gained from increasing
the amount of temporal information included in the final dynamic feature vector.
Across both plots, the best lipreading performance achieved was a WER of around 28.5%.
Even though an improvement was gained by increasing the number of input
static features from N = 10 to 20, the best lipreading performance was obtained
using N = 30 input MRDCT features, with a WER of 27.66% for P = 40. This
can be seen in Figure A.3(a), with J = 2. The number of static features was
increased to N = 40 in Figures A.4(a) and (b); however, the performance using
these parameters was not quite as good as that for N = 30. As a result of these
experiments, the optimal parameters chosen for this thesis were M = 100,
N = 30 and P = 40, with a temporal window of J = 2. These parameters were
used for all experiments in this thesis, unless stated otherwise.
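The parameter sweep summarised in this appendix can be expressed as a simple grid search. The sketch below is hypothetical: `evaluate_wer` stands in for training and decoding the lipreading system with a given (N, J, P) configuration, and `frames` is an illustrative training-set size used to enforce the approximate 6-million-element limit on the inter-frame LDA input matrix.

```python
def grid_search(evaluate_wer, frames=20000):
    """Sweep over (N, J, P), mirroring the experiments in Figures A.1-A.4,
    and return the lowest-WER configuration as (wer, N, J, P)."""
    best = None
    for N in (10, 20, 30, 40):               # static feature vector lengths
        for J in range(1, 8):                # temporal window half-widths
            if frames * (2 * J + 1) * N >= 6e6:
                continue                     # LDA input matrix would be too large
            for P in range(10, 71, 10):      # final feature vector lengths
                wer = evaluate_wer(N, J, P)  # hypothetical train-and-decode call
                if best is None or wer < best[0]:
                    best = (wer, N, J, P)
    return best
```

Under the results reported here, such a sweep bottoms out at N = 30, J = 2, P = 40, with a WER of 27.66%.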