
UPTEC IT 18 012

Examensarbete 30 hp, July 2018

Speech Reading with Deep Neural Networks

Linnar Billman and Johan Hullberg

Institutionen för informationsteknologi
Department of Information Technology



Abstract

Speech Reading with Deep Neural Networks

Linnar Billman and Johan Hullberg

Recent growth in computational power and available data has increased the popularity of and progress in machine learning techniques. Machine learning methods are used for automatic speech recognition in order to allow humans to transfer information to computers simply by speech. In the present work, we are interested in doing this for general contexts, such as speakers talking on TV or newsreaders recorded in a studio. Automatic speech recognition systems are often based solely on acoustic data. By introducing visual data such as lip movements, the robustness of such a system can be increased.

This thesis instead investigates how well machine learning techniques can learn the art of lip reading as the sole source for automatic speech recognition. The key idea is to feed the system a sequence of 24 lip coordinates, rather than having it learn directly from the raw video frames. This thesis designs a solution around this principle, empowered by state-of-the-art machine learning techniques such as recurrent neural networks and making use of GPUs. We find that this design reduces computational requirements by more than a factor of 25 compared to a state-of-the-art machine learning solution called LipNet. However, this also scales down performance to an accuracy of 80% of what LipNet achieves, while still outperforming human recognition by a factor of 1.5. The accuracies are measured on speakers unseen during training.

This text presents this architecture. It details its design, reports its results, and compares its performance to an existing solution. Based on this, it indicates how the results can be further refined.

Printed by: Reprocentralen ITC
UPTEC IT 18 012
Examiner: Lars-Åke Nordén
Subject reader: Kristiaan Pelckmans
Supervisor: Mikael Axelsson


Popular Science Summary

The great growth in computational power, together with the increasing flow of available data, has led to a rise in interest in and development of machine learning, above all the part of machine learning called deep learning.

Deep learning has led to improvements in many applications and fields. One of these fields is automatic speech recognition: the technique of training a computer to recognize spoken commands. Speech recognition is used in many applications, such as personal assistants, voice-controlled systems in vehicles, educational applications, and many more.

Speech recognition systems usually make use of acoustic data only, i.e. sound. Such a system is highly dependent on the quality of the audio, as insufficient quality can make it difficult to distinguish words. One solution to this problem is to introduce visual data, such as videos of the speaker's lip movements, alongside the audio. By introducing visual data, the system has a chance to distinguish words even when the audio fails.

This report aims to train a lip-reading system using deep learning and then compare it with the current leading system for lip reading: LipNet.


Acknowledgements

First of all we would like to thank our supervisor Mikael Axelsson for supporting us through this project and providing helpful ideas and a positive atmosphere.

We would like to thank Consid AB for providing us with a wonderful office to work at, with pleasant colleagues and great coffee.

Thank you Kristiaan Pelckmans, our reviewer at Uppsala University, for taking your time to provide us with helpful feedback throughout the project.

A special thanks to the team behind LipNet for providing us with a bible in the form of your report and source code, to which we could refer when in doubt.


“Never half-ass two things. Whole-ass one thing.”

Ron Swanson, Parks and Recreation, 2012


Contents

1 Introduction
  1.1 Background
  1.2 Purpose
    1.2.1 Problem Statement
  1.3 Delimitations

2 Related Work
  2.1 LipNet: End-to-End Sentence-level Lipreading
  2.2 Various Works Related to Lip Reading

3 Theory
  3.1 Machine Learning
    3.1.1 Supervised Learning
    3.1.2 Logistic Regression
    3.1.3 Optimization
    3.1.4 Artificial Neural Networks
    3.1.5 Deep Learning
    3.1.6 Recurrent Neural Networks
    3.1.7 Convolutional Neural Network
    3.1.8 Object Detection
    3.1.9 Object Recognition
  3.2 Speech Recognition
    3.2.1 Lip Reading
    3.2.2 Automatic Speech Recognition
    3.2.3 Moving Average Filter

4 LipNet
  4.1 Overview
  4.2 Model
  4.3 Preprocessing
  4.4 Training
  4.5 Prediction and Decoding
  4.6 Evaluation

5 Methods
  5.1 Dataset
    5.1.1 Subset
  5.2 Preprocessing
    5.2.1 Mouth Tracking
    5.2.2 Moving Average Filtering
  5.3 Clustering
  5.4 Logistic Regression
  5.5 Deep Neural Network Models
    5.5.1 Model 1
    5.5.2 Model 2
    5.5.3 Model 3
    5.5.4 Model 4
    5.5.5 Model 5
  5.6 Language Model
  5.7 Results on Real-World Appliance
  5.8 Memory Usage and Training Time

6 Discussion
  6.1 Models
  6.2 Adam parameters
  6.3 Dataset
  6.4 Smoothing, Noise and Normalization
  6.5 Facial Features
  6.6 Performance Comparison

7 Conclusion
  7.1 Reading Lips with Facial Landmarks
  7.2 Future Work

Bibliography


List of Figures

3.1 The structure of an Artificial Neuron
3.2 The structure of a shallow Artificial Neural Network
3.3 The structure of a Recurrent Neural Network
3.4 Gates of a Long-Short Term Memory unit
3.5 Convolutional Neural Network applied on an RGB image
3.6 A max-pooling layer retrieves the maximum value in a region and reduces the dimensionality
3.7 Classification of a feature vector
3.8 Probabilistic Neural Network for classification
3.9 Salient Point Detection detecting a triangle
3.10 Moving average filter. Original signal (blue) and filtered signal (green).
4.1 Graph representation of the architecture of LipNet
5.1 Pictures of two speakers from the GRID corpus
5.2 The 68 facial landmarks identified with dlib's facial recognition
5.3 Vector representation of extracted mouth coordinates from two frames
5.4 Plot of a sequence of one (y) coordinate over 75 frames. One line is the original coordinate (blue) and the other is a smoothed version (green).
5.5 KNN WER for different K-values
5.6 Graph representation of the architecture of model 1
5.7 Graph representation of the architecture of model 2
5.8 Graph representation of the architecture of model 3
5.9 Graph representation of the architecture of model 4
5.10 Graph representation of the architecture of model 5
5.11 Comparison on real world appliances between Model 5 and LipNet


List of Tables

5.1 Results for logistic regression
5.2 WER of model 1
5.3 WER of model 2 (1)
5.4 WER of model 2 (2)
5.5 WER of model 3
5.6 WER of model 4
5.7 WER of model 5 (1)
5.8 WER of model 5 (2)


Acronyms

AI Artificial Intelligence

AN Artificial Neuron

ANN Artificial Neural Network

ASR automatic speech recognition

AVR Audio-visual recognition

BiGRU bidirectional GRU

BiLSTM bidirectional LSTM

BiRNN bidirectional RNN

BN Biological Neuron

CER Character Error Rate

CNN Convolutional Neural Network

CTC Connectionist Temporal Classification

DL Deep Learning

DNN Deep Neural Network

GD Gradient Descent

GRU Gated Recurrent Unit

HMM Hidden Markov Model

KNN K-Nearest Neighbors

LM Language Model

LR Logistic Regression

LSTM Long Short-Term Memory

MA Moving Average

ML Machine Learning

NLP Natural Language Processing

NN Neural Network

PNN Probabilistic Neural Network

PR Pattern Recognition

RGB Red-Green-Blue

RNN Recurrent Neural Network

SL Supervised Learning

SPD Salient Point Detector

SR Speech Recognition

WER Word Error Rate


Chapter 1

Introduction

In this chapter, the background and motivation behind this thesis are presented. It briefly mentions some history behind machine learning, and it explains the purpose and goal of this thesis.


1.1 Background

With the recent popularity of Machine Learning (ML), it is easy to believe it to be a brand new concept. This is, however, very far from the truth. Ever since the dawn of computers, the idea of a machine being able to emulate human thinking and to learn has existed. Alan Turing, famously known as one of the biggest contributors to computer science and Artificial Intelligence (AI) [1], created a test named the Turing Test [2]. This test was supposed to determine whether a machine could be seen as intelligent by displaying behavior indistinguishable from that of a human.

In the 1950s, Arthur Samuel created the first computer learning program, which was designed to play checkers [3]. Around the same time, Frank Rosenblatt invented the perceptron [4], laying the groundwork for the Artificial Neural Network (ANN). In 1967 the nearest-neighbor algorithm was presented [5], allowing computers to perform basic pattern recognition. In 1997, IBM's computer Deep Blue was able to beat the world champion of chess, Garry Kasparov [6][7]. In 2011, IBM's Watson was able to defeat two champions in the game show Jeopardy! [8].

One of the reasons behind the recent surge in popularity and advancement of ML is the increase in available computational power, along with the ever-increasing amount of available data. This advancement has resulted in more advanced problems being solved with ML than ever before. One of the biggest advancements is the growing use of Deep Learning (DL).

DL is a subset of ML that uses computational models built up of numerous processing layers, allowing data to be learned with several levels of abstraction. DL has drastically improved the state of the art in automatic speech recognition (ASR), object recognition, object detection, and many other fields [9][10].

The improvement in ASR has led to many applications, e.g. personal assistants such as Apple's Siri [11] and Microsoft's Cortana [12], system controls in vehicles [13], assistance for people with disabilities [14], and many more. Even though ASR has come a long way, it still has many challenges left to overcome [15].

1.2 Purpose

ASR using only acoustic data depends heavily on the quality of the sound. If the data is polluted with noise or other disturbances, an ASR system has a much more difficult task than if the data were in perfect condition. As ASR systems become increasingly popular in mobile devices and home entertainment systems, the demand increases for ASR to be robust to real-world noise and acoustic disturbances [16].

One technique for making ASR more robust is using lip reading when possible [17]. By training a model to look not only at the acoustic data of the speaker but also at the visual features, the visual data can help the model predict the correct answers even when the acoustic data is less than ideal [18]. An extreme case of acoustic disturbance is when there is no usable acoustic data at all. This requires the ASR system to go by visual data alone, leading to a more complex problem: pure lip reading.

The current state-of-the-art lip reading system is LipNet [19]. LipNet was trained on an Nvidia DGX-1 [20]: absolute state-of-the-art hardware for AI and DL, not accessible to everyone. The heavy training of LipNet is not applicable, or at least not practical, on all hardware and might therefore be inaccessible to many.

This thesis aims to explore the possibility of reproducing the accuracies shown by LipNet with a model able to train adequately on the limited hardware available for this project: one Nvidia GTX 1060 GPU with 6 GB of memory. This will be attempted by extracting visual features, i.e. facial landmarks, from the videos before feeding said features to the model. This thesis also aims to evaluate the differences in accuracy and computational requirements between the created model and LipNet.

1.2.1 Problem Statement

• Can facial landmarks be used as efficient data for lip reading?

• What are the most important visual features when performing lip reading with DL?

• Does it require less computation when using facial landmarks compared to using images?

• Is it possible to train a model with DL on facial landmarks using an Nvidia GTX 1060 GPU or even more limited hardware?

• Is it possible to replicate the results of LipNet with this model?

1.3 Delimitations

In order to have a consistent dataset of reasonable size available for training and testing the models, the first delimitation is to include only English. The dataset itself has a limited vocabulary consisting of certain words in a specific pattern, explained further in section 5.1.

The approach of using facial landmarks implicitly sets a delimitation of the project, unless addressed. Using each frame of a video as input to the model also includes the entire region around the speaker's mouth, which may have an impact on the lip-reading performance. This project simply uses the 24 coordinates, gathered by an existing library, that correspond to different parts of the speaker's lips. This project will not include any other algorithm that tracks facial features.


Chapter 2

Related Work

This chapter presents previous solutions and research related to the subject of this thesis. It describes several methods for solving problems related to lip reading with Artificial Neural Networks, with a focus on LipNet, as mentioned in the previous chapter.


2.1 LipNet: End-to-End Sentence-level Lipreading

Y. Assael et al. [19] introduce LipNet, a Neural Network (NN) architecture that maps sequences of video frames to text, predicting at the sentence level instead of word by word. LipNet uses spatiotemporal convolutions, a Recurrent Neural Network (RNN), and the Connectionist Temporal Classification (CTC) loss, trained entirely end-to-end on variable-length videos. On the GRID corpus [21] it achieves a 4.8% Word Error Rate (WER) using an overlapped speaker split, classifying full sentences instead of words, phonemes, or visemes. The overlapped speaker split uses videos from all speakers for training but withholds a few videos from each speaker for validation purposes.

LipNet uses a couple of libraries to extract a small section from each frame containing a centered image of the speaker's mouth, which is sent as input to the model. As part of the evaluation of LipNet, three hearing-impaired members of the Oxford Students' Disability Community were introduced to the GRID corpus and shown 300 random videos to measure their ability to read the lips of those speakers. They were on average able to achieve a WER of 47.7% on videos of unseen speakers.

LipNet will be used as a comparative tool throughout the project and is described further in chapter 4.

2.2 Various Works Related to Lip Reading

Audio-visual recognition (AVR) is a solution to the Speech Recognition (SR) problem when the audio is corrupted. The goal of AVR is to use the information from one modality to complement the information in the other. The problem, however, is to find the correspondence between the audio and visual streams. A. Torfi et al. [22] use a coupled 3D Convolutional Neural Network (CNN) to find the correlation between the visual and audio information.

H. Akbari et al. [23] use a combination of CNNs, Long Short-Term Memory (LSTM) networks, and fully connected layers to reconstruct the original auditory spectrogram, with a 98% correlation, from silent lip-movement videos.

N. Rathee [24] proposes an algorithm consisting of feature extraction and classification for word recognition. The word prediction is done by a Learning Vector Quantization neural network. The algorithm is applied to recognizing ten Hindi words and achieves an accuracy of 97%.

A. Garg et al. [25] propose various solutions based on CNNs and LSTMs for lip reading. The best-performing model, using the concatenated sequence of all frames of the speaker's face, achieves a validation accuracy of 76%.

Gregory J. Wolff et al. [26] propose visual preprocessing algorithms for extracting relevant features from the frames of a grayscale video to use as input to a lipreading system. They also propose a hybrid speechreading system with two time-delayed NNs, one for images and one for acoustics, integrating their responses by independent opinion pooling. The hybrid system has a 25% lower error rate than the acoustic system alone, indicating that lipreading can improve SR.


Chapter 3

Theory

This chapter presents the theory behind the cornerstones of this project. It covers ML, DL and the underlying network layers in the models, object detection and recognition, some linguistics and the ability to read lips, as well as SR.


3.1 Machine Learning

Broadly defined, ML is a collection of computational methods or algorithms making accurate predictions based on collected data [27]. The learning algorithm improves by using its experience of previous data to provide more accurate predictions on new data by finding patterns. The accuracy and success rate of a learning algorithm depend greatly on the data used when training it. The sample complexity and size must be sufficiently large to allow the algorithm to analyze the data and find these patterns.

3.1.1 Supervised Learning

Supervised Learning (SL) tries to infer a function from labeled data [28]. A SL algorithm uses the labeled training data to try to determine a function that maps the input to the desired output. This function is then used to determine the output of new examples. More formally, given a set of $N$ training examples $\{(x_1, y_1), \dots, (x_N, y_N)\}$, where $x_i$ is the feature vector of the $i$-th input object in the dataset and $y_i$ is its corresponding output target, the learning algorithm tries to determine a function $g : X \to Y$, where $X$ is the input space and $Y$ is the output space.

There are several steps that must be taken to solve a supervised learning problem:

1. Deciding on the training examples. If training a speech recognition system, for example, these could be single letters, words or whole sentences.

2. Collecting a dataset. Either build one or find an existing one relevant to the task at hand. This dataset should consist of input objects and corresponding outputs.

3. Choosing the input feature representation. The learning success of the algorithm can depend highly on how the input object is represented.

4. Choosing a learning algorithm, e.g. support vector machines, decision trees or NNs.

5. Running the training algorithm on the dataset.

6. Evaluating the learned function.

3.1.2 Logistic Regression

Logistic Regression (LR) [29] is a statistical method for analyzing a dataset. It is used for binary classification problems, i.e. problems with only two output classes. The goal of LR is to find the coefficients that best describe the relationship between the input and the output classes.

In the case of more than two output classes, multinomial logistic regression can be used. Another solution is to divide the problem into smaller binary problems, where a prediction is made for each output class and then compared to the rest. This is called One-vs-Rest or One-vs-All.
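As a concrete illustration, the minimal scikit-learn sketch below fits one binary logistic-regression classifier per class in One-vs-Rest fashion; the toy data, labels and default hyper-parameters are illustrative assumptions, not taken from this project.

```python
# One-vs-Rest logistic regression: one binary classifier per class,
# prediction picks the class whose classifier gives the highest score.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.9], [0.8, 0.1]])  # feature vectors
y = np.array([0, 1, 2, 2])                                      # three output classes

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, y)
print(clf.predict(X))
```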


3.1.3 Optimization

Gradient Descent

Gradient Descent (GD) [30] is an optimization algorithm widely used in SL. It is used to find the parameters of a function that minimize a cost function. For a neuron, it calculates its error approximation to the target. The most commonly used error is the sum of squared errors

$$\varepsilon = \sum_{p=1}^{P_T} (t_p - o_p)^2$$

where $t_p$ is the target output and $o_p$ is the actual output for the $p$-th pattern, and $P_T$ is the total number of patterns in the training set. The goal is to minimize $\varepsilon$. To do this, the gradient of $\varepsilon$ is calculated in weight space, then the weight values are moved along the negative gradient. Given a training pattern, the weights are updated with

$$v_i(t) = v_i(t-1) + \Delta v_i(t)$$

with

$$\Delta v_i(t) = \eta \left( -\frac{\partial \varepsilon}{\partial v_i} \right)$$

where

$$\frac{\partial \varepsilon}{\partial v_i} = -2 (t_p - o_p) \frac{\partial f}{\partial \mathrm{net}_p} z_{i,p}$$

and $\eta > 0$ is the learning rate, i.e. the size of the steps taken when changing the weights, $\mathrm{net}_p$ is the net input for pattern $p$, and $z_{i,p}$ is the $i$-th input signal corresponding to pattern $p$.

Adam

Adam [31] is an optimization algorithm that can be used instead of the classical GD algorithm. It is a gradient-based optimization algorithm that computes adaptive learning rates for each parameter. It updates exponential moving averages of the gradient ($m_t$) and the squared gradient ($v_t$). The exponential decay rates of these are controlled by the hyper-parameters $\beta_1$ and $\beta_2$. The following are the configuration parameters of Adam.

• $\alpha$ is the learning rate.

• $\beta_1$ is the exponential decay rate for the first moment estimate. 0.9 is a recommended value for this parameter.

• $\beta_2$ is the exponential decay rate for the second moment estimate. 0.999 is a recommended value for this parameter.

• $\varepsilon$ is a small number to prevent division by zero. The recommended value is $10^{-8}$.

Pseudo code for the Adam optimization algorithm, requiring the above parameters:

m0 ← 0 (Initialize first moment vector)v0 ← 0 (Initialize second moment vector)t← 0 (Initialize timestep)while θt not converged do

t← t+ 1gt ← ∇θft(θt−1) (Get gradients w.r.t. stochastic objective at timestep t)mt ← β1 ·mt−1 + (1− β1) · gt t (Update biased first moment estimate)vt ← β2 · vt−1 + (1− β2) · g2t (Update biased second raw moment estimate)mt ← mt/(1− βt1) (Compute bias-corrected first moment estimate)vt ← vt/(1− βt2) (Compute bias-corrected second raw moment estimate)θt ← θt−1 − α · mt(

√vt + ε) (Update parameters)

end whilereturn θt t (Resulting parameters)

In his paper [32], S. Ruder compares several gradient descent optimization algorithms and concludes by recommending Adam as the best overall choice.
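For reference, a minimal NumPy sketch of the update loop above follows, using the recommended hyper-parameter values; the default $\alpha = 0.001$, the step count and the quadratic objective are illustrative assumptions.

```python
# Adam optimizer, implementing the pseudo code above.
import numpy as np

def adam(grad_f, theta, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    m = np.zeros_like(theta)  # first moment vector
    v = np.zeros_like(theta)  # second moment vector
    for t in range(1, steps + 1):
        g = grad_f(theta)
        m = beta1 * m + (1 - beta1) * g          # biased first moment estimate
        v = beta2 * v + (1 - beta2) * g**2       # biased second raw moment estimate
        m_hat = m / (1 - beta1**t)               # bias-corrected first moment
        v_hat = v / (1 - beta2**t)               # bias-corrected second moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimize f(x) = ||x - 3||^2, whose gradient is 2(x - 3).
print(adam(lambda x: 2 * (x - 3.0), theta=np.zeros(3)))  # approaches [3, 3, 3]
```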

3.1.4 Artificial Neural Networks

The human brain is an extraordinarily complex computer. It has the incredible ability to memorize and learn, and to complete complex tasks such as Pattern Recognition (PR) much faster than any computer. The brain is built up of large networks of simple Biological Neurons (BNs). Signals propagate through these large networks, where neurons are connected via synapses. If the input signal to a neuron surpasses a certain threshold, the neuron transmits a signal to its connected neurons [33].

An Artificial Neuron (AN) is modeled on a BN. It receives input from other connected ANs, where each input signal is weighted with a numerical weight associated with each connection. The signal emitted by the AN is controlled by a function called the activation function. When the AN receives inputs, it uses the sum of the weighted signals as input to the activation function, which calculates the output of the AN as seen in figure 3.1 [30].


Figure 3.1: The structure of an Artificial Neuron. Inputs $x_1, x_2, x_3$ are weighted by $w_1, w_2, w_3$, summed ($\Sigma$), and passed through the activation function $f$ to produce the output $y$.

An ANN consists of layered networks of ANs. The first layer is called the input layer, the last is called the output layer, and all layers in between are called hidden layers.

Figure 3.2: The structure of a shallow Artificial Neural Network, with an input layer (four inputs), one hidden layer, and an output layer (one output).

ANNs are used in many different types of applications today, such as speech recognition, image processing, pattern recognition and classification. These are, however, only a small part of the applications using ANNs.

3.1.5 Deep Learning

The performance of machine learning algorithms depends highly on the representation of the given data, i.e. which features are included. Manually choosing which features in the data are important and which are not can be quite difficult. What a human considers an important feature might not be considered an important feature by a computer. Allowing the computer not only to map the feature representation to the output, but also to map the data to the feature representation, often results in better performance than with manually designed feature representations [10].

DL is a subfield of ML which utilizes deeper network architectures to enable the computer to build complex concepts out of simpler ones [34]. This enables the network to find, by itself, lower-level representations of higher-level features, allowing it to represent functions of higher complexity [10].

DL algorithms have led to many state-of-the-art results in several areas, among them ASR [35][36][37].

3.1.6 Recurrent Neural Networks

The idea behind RNNs is derived from the human ability to understand by sequence. Many tasks, such as understanding speech or remembering the alphabet, are based on sequence [38]. Traditional feed-forward NNs do not have the ability to remember sequences.

To address this issue, RNNs introduce loops. The output of the network in one time step is fed back as input in the next time step [39]. Given a sequence $x = (x_1, x_2, \dots, x_T)$, the hidden state $h_t$ is updated by

$$h_t = \begin{cases} 0, & t = 0 \\ \phi(h_{t-1}, x_t), & \text{otherwise,} \end{cases}$$

where $\phi$ is a nonlinear function such as a composition of a logistic sigmoid with an affine transformation [40]. This results in an internal memory, allowing the network to process sequences. The update of the hidden state $h_t$ is implemented as

$$h_t = g(W x_t + U h_{t-1}),$$

where $g$ is a smooth bounded function such as a logistic sigmoid, and $W$ and $U$ are weights.
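A minimal NumPy sketch of this recurrence, assuming a logistic sigmoid for $g$ and illustrative dimensions, could look as follows.

```python
# One RNN forward pass: h_t = sigmoid(W x_t + U h_{t-1}), starting from h_0 = 0.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, T = 4, 8, 10
W = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
U = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights
x = rng.normal(size=(T, n_in))             # input sequence x_1..x_T

h = np.zeros(n_hidden)                     # h_0 = 0
for t in range(T):
    h = 1.0 / (1.0 + np.exp(-(W @ x[t] + U @ h)))  # sigmoid(W x_t + U h_{t-1})
print(h.shape)  # (8,)
```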

Figure 3.3: The structure of a Recurrent Neural Network. The recurrent unit $A$ applied to input $x_t$ with hidden state $h_t$ is equivalent to the network unrolled over time steps $x_1, x_2, x_3, \dots, x_t$ with hidden states $h_1, h_2, h_3, \dots, h_t$.

Even with the ability to remember sequences, RNNs suffer from the problem of long-term dependencies [41]. If two pieces of information are separated by too large a time-step gap, the RNN will have difficulty connecting them.


Long Short-Term Memory

LSTM is a variant of the RNN architecture which improves the handling of long-term dependencies [42]. While an RNN only has a simple structure of one NN, simply passing the output of one time step as an input to the next, an LSTM has a more complex structure of four NNs, called gates. These gates determine what information should be kept in the system and what information should be removed, preventing the decay of important information.

Figure 3.4: Gates of a Long-Short Term Memory unit. Three sigmoid ($\sigma$) gates and a $\tanh$ layer control the cell state passed between time steps $t-1$, $t$ and $t+1$.

Walking through figure 3.4, the line running along the top of the unit is called the cell state [39]. The cell state is a flow of information passed from one time step to the next. The information in the cell state can be altered by the gates. The leftmost gate on the bottom row, called the forget gate layer ($f$), decides what information should be removed from the cell state. It looks at the information passed as input $x_t$ as well as the output of the previous time step $h_{t-1}$. It then outputs a number between 0 and 1, where 1 means the information should be completely kept in the cell state while 0 means it should be completely removed.

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad W \in \mathbb{R}^{h \times d},\; U \in \mathbb{R}^{h \times h},\; b \in \mathbb{R}^h$$

The forget gate is represented by $f$, and the formula contains three different weights, where $d$ and $h$ refer to the number of input features and hidden units respectively.

The next gate decides what information should be added to the cell state. This is divided into two parts. First the input gate layer ($i$) decides what information should be updated, then a tanh layer creates a vector of candidate values $\tilde{C}_t$ that could be added. These are then combined to update the cell state.

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Combining the removal of information $f_t \cdot C_{t-1}$ with the addition of information $i_t \cdot \tilde{C}_t$ to the old cell state $C_{t-1}$, we get the new cell state:

$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$$


Lastly, a filtered version of the cell state is multiplied with the final sigmoid layer to produce the output ($o$):

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t \cdot \tanh(C_t)$$
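Putting the gate equations together, one LSTM time step can be sketched as below; the concatenated $[h_{t-1}, x_t]$ form and all shapes are illustrative assumptions.

```python
# One LSTM time step, following the gate equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)            # forget gate
    i = sigmoid(Wi @ z + bi)            # input gate
    C_tilde = np.tanh(Wc @ z + bc)      # candidate values
    C = f * C_prev + i * C_tilde        # new cell state
    o = sigmoid(Wo @ z + bo)            # output gate
    h = o * np.tanh(C)                  # new hidden state
    return h, C

d, h_units = 4, 8
rng = np.random.default_rng(0)
# Alternate weight matrices (even indices) and zero biases (odd indices):
# Wf, bf, Wi, bi, Wc, bc, Wo, bo.
params = [rng.normal(size=(h_units, h_units + d)) if k % 2 == 0
          else np.zeros(h_units) for k in range(8)]
h, C = lstm_step(rng.normal(size=d), np.zeros(h_units), np.zeros(h_units), *params)
print(h.shape, C.shape)  # (8,) (8,)
```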

Gated Recurrent Unit

The Gated Recurrent Unit (GRU) is another variant of the RNN architecture. Similar to the LSTM, the GRU has gate units controlling the information flow in the unit.

The update gate ($z_t$) decides how much the unit updates its content, computed as

$$z_t = \sigma(W_z x_t + U_z h_{t-1}).$$

The reset gate ($r_t$) allows the unit to forget the previously computed state, given as

$$r_t = \sigma(W_r x_t + U_r h_{t-1}).$$

The candidate activation ($\tilde{h}_t$) is given as

$$\tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1})),$$

where $\odot$ is an element-wise multiplication. Finally, the activation ($h_t$) of the GRU is computed by

$$h_t = (1 - z_t) h_{t-1} + z_t \tilde{h}_t.$$

Bidirectional RNN

It can often be useful to analyze both the future and the past at a given point in a sequence. RNNs are, however, designed to analyze a sequence in one direction [43]. A solution to this is the bidirectional RNN (BiRNN), where the input is presented forwards and backwards to two separate RNNs that are connected to the same output layer. Thus, given a sequence of inputs $(x_1, \dots, x_T)$, one RNN maps $(x_1, \dots, x_T) \to (\overrightarrow{h}_1, \dots, \overrightarrow{h}_T)$ and the other maps $(x_T, \dots, x_1) \to (\overleftarrow{h}_1, \dots, \overleftarrow{h}_T)$; then $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$. BiRNNs have shown improved results in sequence learning compared to RNNs, notably in speech processing [44][45].
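In practice, frameworks provide this wrapping directly. A brief sketch assuming TensorFlow's Keras API follows; the input shape (75 time steps of 48 features) and the unit count are illustrative assumptions, not this project's configuration.

```python
# A bidirectional GRU: two GRUs read the sequence forwards and backwards,
# and their per-step outputs are concatenated, h_t = [h_t_fwd, h_t_bwd].
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(75, 48)),  # e.g. 75 frames, 48 features per frame
    tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(64, return_sequences=True)),
])
model.summary()  # output per step is 128 = 64 forward + 64 backward
```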

3.1.7 Convolutional Neural Network

The main issue with a linear classifier is that accuracy decreases if the classes are not separable by a hyperplane. This may require a transformation of the data into a new space where the classes are linearly separable. The transformation becomes more difficult as the dimensionality of the input vector increases. Fully connected feed-forward networks are commonly used to learn to classify the data; however, the number of neurons can become very large when applied to high-dimensional data, such as images. A CNN reduces the number of parameters of the feature transformation function, allowing networks to be deeper with fewer parameters [46].

To reduce the number of parameters, the neurons are rearranged into blocks, allowing them to share the same weights within the block; this is called weight sharing. A pixel in an image is highly correlated with its close neighbors, and neurons with the same coordinates in each block can be used to extract information from a region of pixels in the image. By convolving each filter weight over the input image, the result is a series of images representing the output of the convolutional layer. For an image with dimensions $W \times H$ and a convolutional layer with $P$ filters of size $M \times N$, the output of the convolutional layer is $P$ feature maps with dimensions $(W - M + 1) \times (H - N + 1)$. An activation function is then applied to each of these feature maps separately in an element-wise fashion, as shown in figure 3.5 [46].

Using convolutional filters for image processing works well due to the ability to keep the dimensions of the filters the same regardless of whether the image is grayscale, RGB or any other format. The filters are applied separately on each color channel. Convolving a filter with a multichannel input results in a single channel as output. In image processing the filters are three-dimensional arrays whose dimensions are the width, the height and the number of channels of the image. When adding another convolutional layer that uses the first layer as input, the third dimension is the same size as the number of images produced by the first layer. The output of a convolutional layer is called feature maps, where each map is the result of convolving a filter with the input of the layer [46].

Figure 3.5: Convolutional Neural Network applied on an RGB image. A convolutional layer ($L_1$) maps the R, G and B channels to feature maps ($F_1$), on which an activation function ($G$) produces six output maps.

In figure 3.5 a convolutional layer ($L_1$) is applied to the Red-Green-Blue (RGB) channels, resulting in feature maps ($F_1$). An activation function ($G$) on the feature maps results in an output of 6 multichannel images. The dimensionality of the feature maps can get quite large when the number of input channels and other dimensions grow, and may need downsampling (pooling) to be reduced to a more reasonable size. The stride ($s$) of the pooling decides which elements in a vector to skip. However, important information could get discarded this way. Therefore the size of the local neighborhood ($d$) is introduced. This is used in max pooling, where the maximum value in the region is used. Pooling feature maps in convolutional layers divides the image into a $d \times d$ region every $s$ pixels, row-wise and column-wise. It is applied to each feature map separately, meaning the number of channels remains the same but the other dimensions are reduced by a factor equal to the stride.


Figure 3.6: A max-pooling layer retrieves the maximum value in a region and reduces the dimensionality; e.g. the maximum of the $2 \times 2$ region (111, 140, 97, 123) is 140, and a $32 \times 32$ map is reduced to $16 \times 16$.
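The dimensionality arithmetic above can be checked with a few lines of NumPy; the sizes below are illustrative and mirror the figures (a $5 \times 5$ "valid" convolution on a $32 \times 32$ input, and $2 \times 2$ max pooling with stride 2).

```python
# Valid convolution shrinks W x H to (W - M + 1) x (H - N + 1); max pooling
# with stride s reduces each spatial dimension by a factor of s.
import numpy as np

W, H, M, N = 32, 32, 5, 5
image = np.random.rand(W, H)

# Valid-convolution output size: 28 x 28 for a 5 x 5 filter on 32 x 32.
out_w, out_h = W - M + 1, H - N + 1

# Non-overlapping 2 x 2 max pooling (d = s = 2), cf. figure 3.6:
# reshape into 2 x 2 blocks, then keep the maximum of each block.
s = 2
pooled = image.reshape(W // s, s, H // s, s).max(axis=(1, 3))
print((out_w, out_h), pooled.shape)  # (28, 28) (16, 16)
```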

3.1.8 Object Detection

Detection and recognition in images are closely related; however, there is a distinction between the two. The goal of object detection is to determine if a given type of object is present in an image. For instance, a detection algorithm could conclude that there is a car in the image, while a recognition algorithm could tell us which brand [47].

A single image contains multiple colors, often depicted in the RGB color model, where each color can be represented with a dimension. Local structures in the images can be detected with structural tensors describing the objects. Putting multiple images or frames together into a video adds another dimension to the tensor. Structures found in images are encoded with their color, where the orientation of areas is represented by different colors. Strong signal variations, such as edges, affect the saturation value. This encoding gives colorful edges with high saturation, creating high contrast with the dark, weak structures. This proves useful for detecting objects in images [47].

Classification

A discriminant function can tell if a component of the feature vector corresponds to the certain classification the discriminant function is searching for. The higher the count of components matching this function, the more likely it is that the algorithm will choose that class. The classifier looks at the input feature vector and sends the components to the discriminant functions; it then chooses the class depending on which discriminant function has the most matches [47].


Figure 3.7: Classification of a feature vector. The components $x_1, \dots, x_L$ are fed to the discriminant functions $g_1, \dots, g_c$, and the class with the maximized score (MAX) is chosen.

Probabilistic Neural Networks

Probabilistic Neural Network (PNN) follows a combination of two classification methods: the Bayes maximum a posteriori classification scheme and the Parzen method for nonparametric density estimation.

$$\omega_{c'} = \operatorname*{argmax}_{1 \le c \le C} \{ P(x \mid \omega_c) P(\omega_c) \}$$

Bayes maximum a posteriori classification finds the class with the highest probability given the input $x$. Taking the maximum argument of the discrimination functions returns the class with the maximal response.

$$\omega_{c'} = \operatorname*{argmax}_{1 \le c \le C} \{ g_c(x) \}$$

The Parzen method:

$$p(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h_i^L} K\left( \frac{x - x_i}{h_i} \right)$$

where $K$ is a function with a highly localized response, called a kernel function, that takes the difference of two vectors ($x$ and $x_i$) that lie within a hypercube of radius $h_i$. $N$ is the total number of points, and $h_i^L$ is the volume of the hypercube in $L$-dimensional space. The formula is slightly altered to allow the neurons to compute a kernel function of the reference patterns and the present input:

$$W_{cn}(x) = K\left( \frac{x - x_{cn}}{h_c} \right), \quad \text{for } 1 \le c \le C \text{ and } 1 \le n \le N_c$$

The neuron $W_{cn}$, belonging to only one class, computes the kernel function of the input $x$ and the $n$-th prototype from class $c$, $x_{cn}$, divided by the kernel width for that class. The number of available class prototypes $N_c$ defines the width. The output from each neuron is fed from the pattern layer to the summation layer, leading to the discrimination functions with a scaling parameter $\alpha_c$:

$$g_c(x) = \alpha_c \sum_{n=1}^{N_c} W_{cn}(x)$$

Input pattern vectors of dimension $L$ are fed to the input layer, where each vector can belong to one of the determined classes. The pattern layer contains the weights that store the components of the reference patterns. The resulting architecture can be seen in figure 3.8 [47].


Figure 3.8: Probabilistic Neural Network for classification. The input $x_1, \dots, x_L$ feeds the pattern-layer neurons $W_{cn}$, whose outputs are summed per class in the summation layer to form $g_1(x), \dots, g_C(x)$; MAX outputs the class.

The input features are sent to each pattern's weights and summarized in the discrimination functions to find the most probable class for the input, given the classification scheme.
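As an illustration of the pattern layer, summation layer and MAX pipeline, the following is a minimal NumPy sketch of a PNN classifier; the Gaussian kernel choice, the kernel width and the toy data are assumptions made for the example.

```python
# PNN: each training point is a pattern-layer neuron; the summation layer
# averages the kernel responses per class (alpha_c = 1/N_c); MAX picks the class.
import numpy as np

def pnn_classify(x, prototypes, labels, h=0.5):
    # Pattern layer: kernel response of x against every prototype x_cn.
    k = np.exp(-np.sum((prototypes - x) ** 2, axis=1) / (2 * h ** 2))
    classes = np.unique(labels)
    # Summation layer: g_c(x) = alpha_c * sum_n W_cn(x).
    g = [k[labels == c].mean() for c in classes]
    return classes[int(np.argmax(g))]  # class with maximal response

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(pnn_classify(np.array([0.2, 0.1]), X, y))  # -> 0
```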

Detection

To classify objects, one must be able to characterize their traits, e.g. color, shape or texture. Direct pixel classification can often distinguish between objects and the background of the image. One way to classify images is by the color of objects, as colors can reveal a lot of information about the contents of an environment. Such object classification requires a set or range of colors that is descriptive of the given object type; traffic signs, for example, have certain specific colors that can be distinguished from the background environment [47].

Clustering is a way to improve the accuracy of classifying more complicated data sets, where the input data is divided into partitions with similar properties that are sufficiently separated from other partitions. This relies heavily on similarities within the data set, together with a representative set of features. The K-Nearest Neighbors (KNN) algorithm classifies data points into clusters, and new data is assigned a class given the closest points around it. The numerical parameter $K$ determines how many of the closest points are considered: a new point is classified into the same category as the majority of its $K$ nearest neighbors.
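A brief scikit-learn sketch of KNN classification follows; the toy points and the choice $K = 3$ are illustrative assumptions.

```python
# KNN: a new point gets the majority class of its K nearest neighbors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3
knn.fit(X, y)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
```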

A fundamental low-level task in computer vision is the ability to detect basic shapes, such as lines, circles and ellipses, described by a certain mathematical model. Structural tensors are useful for detecting basic shapes. For each point the tensor is computed and provides information about whether said point belongs to an edge, what the local phase is, and what type of local structure it belongs to. The remaining parameter of the tensor is the distance between the coordinate system's origin and the tensor. By analyzing the local phase and the coherence of a tensor, the computation becomes quick and accurate [47].

Regular shapes can easily be detected as they have characteristic (salient) points, such as corners or edges. A Salient Point Detector (SPD) has some predefined rules about how these common shapes are constructed, e.g. a triangle has three corners with lines between them. The neighborhood of each pixel is divided into a number of regions, and each region is analyzed to see if it contains certain selected features. The regions can be compared with each other, or with a predefined model of the shape. By segmenting an image into a binary space and looking at the distribution of selected features in each of the regions, the image can be compared to predefined models of these shapes. If a match is found, the pixel can be classified as a salient point of that given type. SPD is therefore efficient at detecting triangles, rectangles and diamonds [47].

Figure 3.9: Salient Point Detection detecting a triangle.

Shape deformations and noise may occur in images, for which the SPD adapts by returning a number of points that fulfill a predefined rule instead of a single location. By processing these groups of points and determining their center of gravity, the cluster can be replaced with a single location that corresponds to the center of gravity of the cluster [47].

The SPD technique can only be used to find basic shapes with a few characteristic points, as more complex shapes may need more features to be defined. A technique called the adaptive window growing method aims to address this issue by finding dense clusters in the image that are detected based on other characteristic features of the object, i.e. color or texture. The method looks at a dense cluster, which represents a high probability of the object residing in that part of the image. A rectangular window is inserted at the location of the cluster and expands in all directions until the borders of the object are detected or until the stop criterion, the expansion threshold, for that direction is reached. If the window grows one pixel each step, neighbor-connected shapes can be detected; if the steps are larger, a more sparse version is obtained [47].


In order to verify whether separated salient points belong to the same shape or figure, the detection algorithm must subsequently check all possible configurations of the points, such as size, or whether the shape is occluded or rotated. This gives the flexibility of a simple formulation of the shape combined with dynamic processing of the shape's rules [47].

3.1.9 Object Recognition

An object can be viewed from different angles and with different scales, rotations and lighting conditions. Figuring out which object it is can be challenging for many other reasons as well. Prototype templates may provide a certain test pattern for the algorithm, but the dimensionality of the template grows with the number of possible ways an object can match the test pattern. It is challenging to tell if a pattern is present in an image and how reliable the result is. The methods mentioned in the previous subsection are some of the many methods used to address these issues [47].

One of the main problems in object recognition is how the objects change in the observed scene, where the template may not always match the object very well. Instead, one can find characteristic features of the object that are as invariant to object transformations as possible. Features whose geometric properties remain the same are useful for this; e.g. the length of a line segment remains the same regardless of the rotation of the object [47].

Distance transformations of images and templates can help detect features by creating a binary image of the template and the image. A distance map is constructed from the binary image, while the template is processed into several maps, each of which is a combination of horizontal and vertical shifts as well as rotations and scale changes. Each of the template's maps is compared to the distance map of the image to find features that are close to the template. The smaller the distance between the image's distance map and the template, the more similar they are. The method measuring the distance can yield different results, as some methods have greater resistance to missing features due to, for example, occlusions [47].

Combining multiple classifiers can improve accuracy and is often used in facial detection, where a cascade of weak classifiers can be configured to work as serial operators on the images. The training procedure of this ensemble of classifiers should be organized so as to increase the diversity of the classifiers given a training set: they should react unanimously when observing known data but be as different as possible when making errors [47].

3.2 Speech Recognition

3.2.1 Lip Reading

Lip reading, also called speech-reading, is a technique for interpreting the movements of a person's lips, mouth and face to understand speech. Facial speech gestures can aid in understanding what a person is saying when there is a lot of noise in the environment. However, even skilled speech-readers are seldom able to perfectly interpret sentences, and it is even more difficult to understand unrelated sentences, where the performance rarely exceeds 10-30% accuracy [48].

The ability to accurately read lips relies heavily on the psychological and cognitive processes for comprehending language, on the message having a clear context, and even on the use of guesses [49]. The ability to decode information and the processing speed are two general factors that affect the performance of speech reading, and plain guessing is important in situations where the contextual support is low [48].

In a study by Bernstein et al. [50] on formal and informal communication, the most proficient test subjects were able to get approximately 80% of the words in the sentences correct [49].

Speech gestures, i.e. movements of the face, mouth and lips, together with body language are primarily used as visual cues for reading lips, with complementary help from the communicative context. Studies have shown that the message content is more informative for speech-readers than facial expressions and body language; the hypothesis that emotional expression could improve speech-readers' ability to understand the content has been disproved [48].

A phoneme is a unit of sound that can distinguish words in a particular language, the characters of phonetic text. Phonemes that are stressed in speech, to articulate more properly, are easier to discriminate in noise, but much harder when it comes to speech-reading.

The ability to identify phonemes correctly, depending on the phonetic context, has been shown to be below 50%, as many phonemes are hard to distinguish by sight alone and are therefore easy to confuse [49].

3.2.2 Automatic Speech Recognition

When solving an SR problem, the algorithm must first be able to decode the message, usually by converting it into a series of numeric values representing the characteristic vocals, or movements, and speech patterns. The numerical representation can then be mapped to a lexicon or dictionary. The mapped words can then be passed to a Language Model (LM), which follows certain rules of the particular language about the structure of sentences. The grammatical rules can improve accuracy, as they may eliminate some possibilities when the algorithm is trying to determine the words in a sentence [51].

Visemes are the basic visual units of speech and are claimed to be the visual equivalent of phonemes. To be able to relate the two, one of them needs to be mapped to the other. It is common to map phonemes to visemes, as many phonemes cannot be distinguished visually, but how the mapping should be performed for optimal results is still under investigation. Studies [52] [53] have, however, shown that visemes are suboptimal recognition units compared to phonemes [54].

Hidden Markov Models are efficient at capturing the temporal behavior of visual speech when the duration can vary for each utterance of the same word, and one model can be trained for each word to be detected. This does, however, require a significant amount of task-specific knowledge to design the state models [55].


Connectionist Temporal Classification

As RNNs require presegmented training data, and the network outputs require postprocessing to give a final label sequence, it is hard to apply them directly to sequence labelling. CTC approaches this issue by attempting to label unsegmented data sequences with a modified RNN, using the training set to train a temporal classifier that classifies previously unseen input in a way that minimises the rate of transcription mistakes. The most probable labelling for an input sequence is calculated as:

$$h(x) = \operatorname*{argmax}_{l \in L^{\leq T}} p(l \mid x)$$

where $L^{\leq T}$ is the set of possible labellings of length at most $T$ over the alphabet used, reached through the many-to-one mapping $B : L'^{T} \mapsto L^{\leq T}$ that maps possible paths to possible labellings, and $T$ is the length of the input sequence $x$. Finding this labelling is referred to as decoding [56].
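Exact decoding over all possible labellings is intractable in general; a common approximation is best-path (greedy) decoding, which takes the most probable output at each timestep, collapses repeated labels, and removes blanks. A minimal sketch of this approximation (the array shapes and the blank index are our assumptions, not taken from [56]):

    import numpy as np

    def best_path_decode(probs, blank=0):
        """Best-path (greedy) approximation of CTC decoding.
        probs: (T, C) array of per-frame label probabilities,
        where index `blank` is the CTC blank symbol."""
        best = np.argmax(probs, axis=1)   # most probable label per frame
        decoded, prev = [], blank
        for label in best:
            # collapse repeated labels, then drop blanks
            if label != prev and label != blank:
                decoded.append(label)
            prev = label
        return decoded

    # Example: 5 frames, 3 classes (index 0 = blank)
    probs = np.array([[0.1, 0.8, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.9, 0.05, 0.05],
                      [0.1, 0.1, 0.8],
                      [0.1, 0.1, 0.8]])
    print(best_path_decode(probs))  # -> [1, 2]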

A CTC network can be trained with a forward-backward algorithm, with an iterative sum over the paths corresponding to prefixes of each labelling, computed with recursive forward and backward variables, similar to the algorithm for Hidden Markov Models (HMMs) in [57]. This reduces the number of computations required to calculate the sum over all paths corresponding to a label, as this number can grow quite large. The forward-backward algorithm allows the network to be trained with a maximum likelihood algorithm that maximises the probability of all correct classifications in the training set [56].

Accuracy in Speech Recognition

There are different approaches to measuring accuracy when an algorithm performs ASR, the most commonly used being WER [58]. By comparing the reference sentence to a hypothesis sentence, the WER can be calculated as:

$$\mathrm{WER} = \frac{S + D + I}{N}, \qquad \mathrm{Accuracy} = 1 - \mathrm{WER}$$

This can be done for each character in a word, or for each complete word in a sentence. S represents the number of substitutions, when the algorithm predicts the wrong word. D represents the number of deletions, where a word is missing. I represents the insertions, when a word is added. N is the number of words in the reference sentence.

• Reference: This is a sample sentence

• S-Hypothesis: This id a sample sentence (WER=1/5)

• D-Hypothesis: This is sample sentence (WER=1/5)

• I-Hypothesis: This is a nice sample sentence (WER=1/5)


The WER of two strings ref and hyp is given by the Levenshtein distance $\mathrm{lev}_{\mathrm{ref},\mathrm{hyp}}(|\mathrm{ref}|, |\mathrm{hyp}|)$, where:

$$\mathrm{lev}_{\mathrm{ref},\mathrm{hyp}}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0 \\ \min \begin{cases} \mathrm{lev}_{\mathrm{ref},\mathrm{hyp}}(i-1, j) + 1 & \text{(Deletion)} \\ \mathrm{lev}_{\mathrm{ref},\mathrm{hyp}}(i, j-1) + 1 & \text{(Insertion)} \\ \mathrm{lev}_{\mathrm{ref},\mathrm{hyp}}(i-1, j-1) + f_{\mathrm{ref}_i \neq \mathrm{hyp}_j} & \text{(S/C)} \end{cases} & \text{otherwise} \end{cases}$$

The indicator function $f_{\mathrm{ref}_i \neq \mathrm{hyp}_j}$ adds 1 if the words do not match (S) and 0 if they match (C). The same algorithm can be used on separate words to calculate the Character Error Rate (CER), which looks at the Levenshtein distance between two words instead of two sentences. CER is another possible measurement of accuracy.
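A minimal word-level implementation of this recursion, using the standard dynamic-programming formulation (the function name and the example sentences are ours):

    def wer(ref, hyp):
        """Word error rate via the Levenshtein distance between word lists."""
        ref, hyp = ref.split(), hyp.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1    # C or S
                dp[i][j] = min(dp[i - 1][j] + 1,               # deletion
                               dp[i][j - 1] + 1,               # insertion
                               dp[i - 1][j - 1] + cost)        # substitution/correct
        return dp[len(ref)][len(hyp)] / len(ref)

    print(wer("this is a sample sentence", "this id a sample sentence"))  # 0.2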

Language Model

An N-gram language model can be generated by looking at all the possible labels and calculating the probability of one word following another. The N stands for how long these known sequences are; e.g., a 3-gram model knows the probability of sequences of three words given the dataset's dictionary.
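As a sketch of the idea, a count-based maximum-likelihood 3-gram estimate over a toy corpus (the corpus and the helper function are illustrative only; the language model actually used in this project was built with kenLM, see section 5.6):

    from collections import Counter

    sentences = ["bin blue at f one again", "bin white at f zero again"]
    trigrams, contexts = Counter(), Counter()
    for s in sentences:
        words = s.split()
        for i in range(len(words) - 2):
            trigrams[tuple(words[i:i + 3])] += 1    # (w1, w2, w3)
            contexts[tuple(words[i:i + 2])] += 1    # (w1, w2)

    def p(w3, w1, w2):
        """Maximum-likelihood estimate of P(w3 | w1, w2)."""
        return trigrams[(w1, w2, w3)] / contexts[(w1, w2)]

    print(p("at", "bin", "blue"))  # 1.0 in this toy corpus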

3.2.3 Moving Average Filter

A Moving Average (MA) filter [59] is used to smooth time series data. Smoothing is applied to remove noise from a sequential dataset while still capturing the important features and patterns in the data. MA filters are a simple and common type of smoothing: the filter calculates the average value of the data over a window of time steps to create a smooth approximation of the original sequence.

Figure 3.10: Moving average filter. Original signal (blue) and filtered signal (green).


Centered Moving Average

The smoothed value y at time T over a window of N time steps is calculated with T as the center, such that:

$$y = \frac{x(T - \frac{N}{2}) + \ldots + x(T - 1) + x(T) + x(T + 1) + \ldots + x(T + \frac{N}{2})}{N}$$

where x(t) is the value at time t. This approach is only possible if future values are known, and is therefore used when the full dataset is known.

Trailing Moving Average

The smoothed value y at time T over a window of N time steps is calculated with T as the leading time step, such that:

$$y = \frac{x(T - N) + \ldots + x(T - 2) + x(T - 1) + x(T)}{N}$$

where x(t) is the value at time t. This approach is possible even if future values are unknown, and is therefore used in time series forecasting.
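A minimal NumPy sketch of both variants, following the formulas above (the signal is illustrative; at the edges of the sequence, the mean is taken over the points actually available in the window):

    import numpy as np

    def centered_ma(x, n):
        """Centered moving average: window from T - n//2 to T + n//2."""
        half = n // 2
        return np.array([x[max(0, t - half):t + half + 1].mean()
                         for t in range(len(x))])

    def trailing_ma(x, n):
        """Trailing moving average: window from T - n to T."""
        return np.array([x[max(0, t - n):t + 1].mean()
                         for t in range(len(x))])

    signal = np.array([1.0, 5.0, 2.0, 8.0, 3.0, 9.0])
    print(centered_ma(signal, 3))  # uses past and future values
    print(trailing_ma(signal, 2))  # uses only past values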


Chapter 4

LipNet

In this chapter LipNet is explained thoroughly. The network model, preprocessing, training parameters and output decoding are described to give clarity to the system.



4.1 Overview

LipNet is a state-of-the-art lip reading network mapping video sequences to text, achieving an accuracy of 95.2% (4.8% WER) at sentence level when trained and validated with an overlapped speaker split. LipNet is able to handle unseen speakers (cross-validation) from the GRID corpus with an accuracy of 88.6% (11.4% WER). It is open source, allowing anyone to use it and read the source code. It is trained and tested on the GRID corpus, an open source dataset easily available for download. These properties make LipNet a reasonable model to use as a comparative tool for this project. While this project might not reach the impressive accuracy displayed by LipNet, it might at least offer a less computationally heavy model available for use on a personal computer.

4.2 Model

LipNet consists of three layers of 3D CNNs with normalization and max pooling. These layers are followed by a pair of bidirectional GRUs (BiGRUs), ending with a dense layer, an activation layer and finally CTC.


Figure 4.1: Graph representation of the architecture of LipNet: 5D vector input → (Conv3D → BatchNormalization → ReLU → MaxPooling3D) × 3 → Bidirectional GRU → Bidirectional GRU → Dense → Softmax → CTC.


4.3 Preprocessing

The network requires as input a sequence of 75 images of 100x50 pixels of the speaker's mouth. The frames of the videos in the GRID corpus are not of these dimensions, and the videos also include more than just the speaker's mouth. Therefore some preprocessing must be performed before feeding the data to the network. The align files, containing the words spoken in the videos, are coded into series of numbers as labels for the network. The labels are also padded to ensure all are of equal length, which is required by the network.

The preprocessing script splits each video into 75 frames. In each frame, the mouth of the speaker is located and the video is cropped to a rectangle of 100x50 pixels around the mouth. The 75 frames of 100x50 pixels are then saved as a sequence and can be used as input to the network.

4.4 Training

LipNet is set to train for 5000 epochs with the following Adam parameters:

$$lr = 0.0001 \qquad \beta_1 = 0.9 \qquad \beta_2 = 0.999 \qquad \epsilon = 10^{-8}$$
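In Keras, this optimizer configuration would be written as follows (a sketch; the variable name is ours, and the `lr` argument name follows the Keras version in use at the time):

    from keras.optimizers import Adam

    # Adam parameters used by LipNet, as listed above
    optimizer = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)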

4.5 Prediction and Decoding

When the network has been trained, the trained weights can be loaded into the model to perform predictions on new videos. A video can be used as input to predict the sentence spoken. Before prediction, the video undergoes the same preprocessing as the training videos before being fed into the network. The output is then decoded using the Keras CTC decoder. Finally, the labels are translated back from numbers to text. The spelling of the resulting text is then corrected using some static rules and is finally presented.
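A minimal sketch of the decoding step with the Keras backend CTC decoder (the tensor shapes, 75 timesteps and 28 output classes, are assumptions):

    import numpy as np
    from keras import backend as K

    # y_pred: network output of shape (batch, timesteps, classes)
    y_pred = np.random.rand(1, 75, 28).astype("float32")
    input_length = np.array([75])  # number of valid timesteps per sample

    decoded, log_probs = K.ctc_decode(K.constant(y_pred),
                                      input_length=input_length,
                                      greedy=True)
    labels = K.eval(decoded[0])  # integer labels, translated back to text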

4.6 Evaluation

LipNet uses two measurements for its accuracy: overlapping speakers and unseen speakers. When the algorithm is trained with overlapping speakers, all speakers are present in the training; however, some of each speaker's videos are withheld from the training, to be used as test data. The other measurement uses test data that consists of speakers entirely withheld from the training process, meaning speakers the model has never seen before.


Chapter 5

Methods

In this chapter the dataset used in the project is described, as well as the software developed during the project together with the methods for said software.



5.1 Dataset

The GRID corpus [21] was used as the dataset. It contains videos of 1000 sentences spoken by each of the 34 speakers (18 male, 16 female). The sentences follow a certain pattern: command + color + preposition + letter + digit + adverb. Commands include {bin, lay, place, set}, colors are {blue, green, red, white}, the prepositions used are {at, by, in, with}, and all Latin letters except W are used. Digits are between zero and nine, and finally the adverbs are {again, now, please, soon}. One example sentence in the dataset is 'bin white at f zero again'. For each of the videos there is a transcription containing the words spoken, as well as information on when each word is spoken. The videos of speaker 21 are however not available, leaving 33 speakers available for the dataset.

Figure 5.1: Pictures of two speakers from the GRID corpus

5.1.1 Subset

For the evaluation of the proposed models, a subset of 14000 videos (14 different speakers) from the GRID corpus was used: 6000 for the training set, 6000 for the validation set and 2000 for testing. The reason for using a subset of all the videos was to limit the time spent on training the various models. The test set consists of videos of speakers not included in the training or validation set.

When the most suitable models had been established, the full dataset of 31000 videos (31 speakers) was used for training said models, with the 2000 videos of the remaining two unseen speakers used for testing.

5.2 Preprocessing

The preprocessing performed on the data prior to training and evaluation is described.


5.2.1 Mouth Tracking

To prepare the data for training, each video in the dataset is split into 75 frames. Each frame is then analyzed using the Face Recognition API [60] for Python [61], which is built on dlib's [62] deep-learning-based face recognition. It extracts the (x, y) coordinates of 68 landmark features of the face, including eyes, nose, mouth and chin, as shown in figure 5.2.

Figure 5.2: The 68 facial landmarks identified with dlib’s facial recognition

The coordinates corresponding to the speaker's mouth are saved, while the rest are discarded. They are then normalized such that the mouth originates from the smallest possible (x, y) coordinates.
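A minimal sketch of this extraction with the Face Recognition API (the file name is illustrative; top_lip and bottom_lip are the lip landmark groups the API returns, 12 points each, giving 24 mouth coordinates):

    import numpy as np
    import face_recognition

    frame = face_recognition.load_image_file("frame_001.png")  # one video frame
    landmarks = face_recognition.face_landmarks(frame)[0]      # first detected face

    # keep only the mouth: 12 top-lip + 12 bottom-lip points = 24 coordinates
    mouth = np.array(landmarks["top_lip"] + landmarks["bottom_lip"])

    # normalize so the mouth originates from the smallest (x, y) coordinates
    mouth -= mouth.min(axis=0)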

Figure 5.3: Vector representation of extracted mouth coordinates from two frames


5.2.2 Moving Average Filtering

The coordinates described in 5.2.1 fluctuated between frames despite the lack of movement of the speaker's lips. To counter these fluctuations, two versions of an MA filter were developed, one centered and one trailing, to smooth each coordinate over the sequence of frames of the video. The window sizes for these filters were chosen to be the smallest possible, 3 for the centered and 2 for the trailing, so as to minimize the removal of important features or patterns from the data.

Figure 5.4: Plot of a sequence of one (y) coordinate over 75 frames. One line is the original coordinate (blue) and the other is a smoothed version (green).

5.3 Clustering

KNN was implemented as a naive solution in this project. It was built with scikit-learn [63]. For each video sample, each word was extracted with the corresponding coordinates for those frames and used to train a KNN algorithm. It was trained and tested with different numbers of neighbors, where the distance measurement was optimized based on the training data using the sklearn neighbors library [64]. The best result recorded was a WER of 69.91%, when the algorithm used the single closest neighbor to classify new data.
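A minimal scikit-learn sketch of this naive baseline (the feature layout is an assumption: each row holds the flattened lip coordinates for the frames of one word, and each label is the encoded word):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # X: one row per word sample, flattened lip coordinates for its frames
    X_train = np.random.rand(100, 24 * 2 * 10)    # e.g. 24 points, (x, y), 10 frames
    y_train = np.random.randint(0, 51, size=100)  # 51 unique words in GRID

    knn = KNeighborsClassifier(n_neighbors=1)     # single closest neighbor
    knn.fit(X_train, y_train)
    prediction = knn.predict(np.random.rand(1, 24 * 2 * 10))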


Figure 5.5: KNN WER for different K-values

5.4 Logistic Regression

A simple LR model was built and trained using scikit-learn as a second naive solution to the problem. The dataset was reshaped in the same fashion as in section 5.3. Each unique word in the dataset was encoded to a number, and the coordinates in each sample were then mapped to a specific encoded word. The scikit-learn LogisticRegression implementation was used for this model [65].

The LR model was tested with several training parameter combinations on a small subset of the dataset to find the optimal combination. The dataset was the same as used in section 5.3. The combination leading to the highest accuracy is described in table 5.1. The accuracy was based on training on the full dataset.

Penalty   C     Dual    Fit intercept   Intercept   Solver      Warm start   WER
l1        0.1   false   true            1           liblinear   false        76.2%

Table 5.1: Results for logistic regression
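A sketch of this configuration in scikit-learn, with data shaped as in the KNN sketch in section 5.3 (we read the table's Intercept column as the intercept_scaling parameter; this is an assumption):

    from sklearn.linear_model import LogisticRegression

    # Parameter combination from table 5.1
    lr_model = LogisticRegression(penalty="l1", C=0.1, dual=False,
                                  fit_intercept=True, intercept_scaling=1,
                                  solver="liblinear", warm_start=False)
    lr_model.fit(X_train, y_train)  # X_train, y_train as in the KNN sketch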

5.5 Deep Neural Network Models

In this section all tested models are presented, with motivation for each particular model as well as the best results obtained. All models were built and trained using Keras [66] with TensorFlow [67]. Each model was trained on the subset of 14000 videos mentioned in section 5.1.1. Models 2 and 5 were also trained and tested on the full dataset of 33000 videos, as they achieved the best results out of all models on the smaller dataset. All models were trained for 500 epochs, as no major improvements were observed after that.

Each model was evaluated with 1000 videos randomly chosen from the test set of 2000 videos. This selection was performed anew for each evaluation. The videos consisted of two speakers that the models had not encountered previously. The documented accuracy is the average result of the evaluation run five times.

5.5.1 Model 1

Based on the success of RNNs in Natural Language Processing (NLP), they appeared to be a wise choice as the base of the Deep Neural Network (DNN) models. LSTMs were chosen as the RNNs of model 1 because they are among the most common RNNs. As described in section 3.1.6, BiRNNs have been shown to give improved results in speech processing compared to regular RNNs.

Model 1 consists of a bidirectional LSTM (BiLSTM) with 256 nodes and a dense layer followed by a softmax activation layer. The model was trained with CTC.

Figure 5.6: Graph representation of the architecture of model 1: 3D vector input → Bidirectional LSTM → Dense → Softmax → CTC.

The lowest WER obtained with model 1 trained on 14000 videos was:

lr      β1    β2      ε       WER
0.001   0.9   0.999   1E-08   65.82%

Table 5.2: WER of model 1
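A minimal Keras sketch of this architecture (the input dimensions assume 75 frames with the 24 lip points flattened to 48 (x, y) features, and an output alphabet of 28 characters; both are assumptions):

    from keras.models import Sequential
    from keras.layers import Bidirectional, LSTM, Dense, Activation

    model = Sequential()
    model.add(Bidirectional(LSTM(256, return_sequences=True),
                            input_shape=(75, 48)))  # 75 frames, 48 lip features
    model.add(Dense(28))                # per-frame character scores
    model.add(Activation("softmax"))    # per-frame character distribution
    # The model is then trained with a CTC loss over the output sequence.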


5.5.2 Model 2

It has been observed that stacked RNNs can perform better than single ones [68]. Model 2 was therefore expanded with another LSTM on top of model 1 to create a deeper network.

Model 2 consists of two BiLSTMs with 256 nodes each and a dense layer followed by a softmax activation layer. The model was trained with CTC.

Figure 5.7: Graph representation of the architecture of model 2

The lowest WER obtained with model 2 trained on 14000 videos was:

lr      β1    β2      ε       WER
0.001   0.9   0.999   1E-08   60.50%

Table 5.3: WER of model 2 (1)

and the lowest WER obtained with model 2 trained on 31000 videos was:

lr       β1    β2      ε       WER
0.0001   0.9   0.999   1E-08   42.30%

Table 5.4: WER of model 2 (2)


5.5.3 Model 3

Using dropout can significantly improve the performance of a model by reducing overfitting [69]. Model 3 is a version of model 2 with added dropout layers.

Model 3 consists of two BiLSTMs with 128 nodes each, two dropout layers and two dense layers followed by a softmax activation layer. The model was trained with CTC.

Figure 5.8: Graph representation of the architecture of model 3

The lowest WER obtained with model 3 trained on 14000 videos was:

lr      β1    β2      ε       WER
0.001   0.9   0.999   1E-08   69.66%

Table 5.5: WER of model 3


5.5.4 Model 4

O. Abdel-Hamid et al. [70] used CNNs to find patterns in speech data to increase accuracy in speech recognition. Based on this, a CNN was stacked on top of model 2.

Model 4 consists of a one-dimensional CNN, two BiLSTMs with 128 nodes each, and a dense layer followed by a softmax activation layer. The model was trained with CTC.

Figure 5.9: Graph representation of the architecture of model 4

The lowest WER obtained with model 4 trained on 14000 videos was:

lr      β1    β2      ε       WER
0.001   0.9   0.999   1E-08   66.84%

Table 5.6: WER of model 4

5.5.5 Model 5

Another widely used RNN is the GRU. The GRU is similar to the LSTM but less complex, and because of this it can often train faster than the LSTM.

Model 5 consists of three BiGRUs. A version with two BiGRUs was tried, which resulted in better accuracy than model 2 (using BiLSTMs). With the information that two stacked RNNs performed better than one, deeper stacks of BiGRUs were tried. Beyond a depth of three stacked BiGRUs, no significant change in accuracy was observed.

Figure 5.10: Graph representation of the architecture of model 5: 3D vector input → Bidirectional GRU → Bidirectional GRU → Bidirectional GRU → Dense → Softmax → CTC.
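A minimal Keras sketch of model 5 (dimensions as in the model 1 sketch; the number of units per GRU is an assumption, as the text does not state it):

    from keras.models import Sequential
    from keras.layers import Bidirectional, GRU, Dense, Activation

    model = Sequential()
    model.add(Bidirectional(GRU(256, return_sequences=True),
                            input_shape=(75, 48)))  # 75 frames, 48 lip features
    model.add(Bidirectional(GRU(256, return_sequences=True)))
    model.add(Bidirectional(GRU(256, return_sequences=True)))
    model.add(Dense(28))                # per-frame character scores
    model.add(Activation("softmax"))
    # Trained with CTC loss, as for the other models.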

The lowest WER obtained with model 5 trained on 14000 videos was:

lr      β1    β2      ε       WER
0.001   0.9   0.999   1E-08   51.60%

Table 5.7: WER of model 5 (1)

The lowest WER obtained with model 5 trained on 31000 videos was:

lr       β1    β2      ε       WER
0.0001   0.9   0.999   1E-08   35.12%

Table 5.8: WER of model 5 (2)

As model 5 performed best out of the models, an additional measurement was made to see its performance with regard to CER. The average result of the model trained with 31000 videos was 30.46% CER.


5.6 Language Model

The output of the CTC layer often requires some form of processing, as letters may be missing or other errors may occur. The output was processed with a spelling algorithm that looked at the words in the dictionary and found the one with the smallest Levenshtein distance; in the case of multiple candidates, it chose the word with the most occurrences in the training labels.
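A minimal sketch of such a spelling step (the vocabulary, frequency counts and helper function are illustrative):

    def levenshtein(a, b):
        """Character-level edit distance between two words."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                         prev + (ca != cb))
        return dp[len(b)]

    def correct(word, vocabulary, frequency):
        """Snap a word to the closest vocabulary word by edit distance,
        breaking ties by frequency in the training labels."""
        return min(vocabulary, key=lambda w: (levenshtein(word, w), -frequency[w]))

    vocab = ["bin", "blue", "green", "again"]
    freq = {"bin": 10, "blue": 7, "green": 5, "again": 9}
    print(correct("agan", vocab, freq))  # -> 'again'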

A 3-gram model was generated with kenLM [71] in ARPA format [72] and used within the evaluation process of a model. The CTC output was directed to the LM after the spelling algorithm was done. A combination of CTC with an N-gram LM can improve the accuracy of speech recognition [73].

The ARPA model was generated from the text file containing all sentences that occur in the GRID corpus. The probability of a word is given in log10, and a word outside of the vocabulary was given a large penalty of -100 log probability, as the model did not really know the true probability of that word.
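A sketch of scoring a candidate sentence with such a model through the kenlm Python module (the file name is illustrative):

    import kenlm

    model = kenlm.Model("grid_3gram.arpa")  # ARPA model built from the GRID labels
    # log10 probability of a candidate sentence under the language model
    print(model.score("bin blue at f one again", bos=True, eos=True))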

5.7 Results on Real-World Application

Model 5, which showed the most promise with the lowest WER, was chosen to be tested on people outside of the dataset. This was to explore what impact factors such as resolution, distance to the camera, lighting and angles could have on the results. Eight volunteers were given only the instruction to record themselves with the camera on their own mobile phones and utter the phrase 'bin blue at f one again' from the dataset. The participants therefore all had different distances between their faces and the camera, their cellphones had different resolutions, and the lighting and the angles of the faces varied. The participants differed in gender and facial hair.

Running the prediction several times on one video did not yield any different results. The results for the eight participants ranged between a WER of 83.33% and 33.33%, with an average of 52.00% WER.

The same videos were used to evaluate LipNet. The results ranged between a WER of 66.67% and 0.00%, with an average of 35.42% WER.


Figure 5.11: Comparison between model 5 and LipNet on real-world application

5.8 Memory Usage and Training Time

The measurements made in this section were performed on an Nvidia GTX 1060 GPU with 6GB of memory. The training of each model was performed with the full dataset of 31000 videos.

To fit the training of LipNet in the memory of the GTX 1060, the batch size needed to be limited to 40 instead of the original size of 50. With a batch size of 40, LipNet required all memory available: 5.5GB. Each epoch of training took 20 minutes.

To test the memory requirements of model 5, the GPU memory available to the model was manually limited. The batch size was chosen to be 40 to mimic the conditions of LipNet. The memory required for model 5 was 210MB. Each epoch of training took 5 minutes.


Chapter 6

Discussion

In this chapter the cornerstones of the development are evaluated and analyzed together with the results achieved by the models, and how it all connects to the purpose of the project.



6.1 Models

The choice of RNN proved to be an important one. The use of GRUs over LSTMs resulted in marginally better results. This might be due to the complexity of the networks: with a simpler structure, the GRU can train faster than the LSTM. With longer training time, the models based on LSTMs might have caught up with the GRU models.

The addition of a CNN did not improve upon the RNN models. While CNNs have proven useful in ASR [70], this was not the case here. This might be due to the difference between the data used in acoustic and in visual ASR. In this case, the visual data used is simply the coordinates representing the position of the speaker's lips. Compared to this simple representation of the visual data, acoustic data is more complex. A CNN might be able to capture some patterns in such high-complexity data and improve the performance of an acoustic ASR algorithm, while being excessive in this case.

Introducing dropout to the RNN models did not improve performance. This is probably due to the models not suffering from overfitting.

The depth of stacked RNNs proved to have some significance for the performance. When stacking networks, each network can process some part of the task at hand, working together to solve the larger task. It has even been observed that the depth of stacked RNNs is more important than the number of cells in each layer [74]. The results of the stacked RNNs are therefore not surprising. The increase in performance did however stop at a depth of three BiGRUs. This suggests that three layers are enough to extract all relevant information from the features.

Some models other than the ones described in section 5.5 were trained and tested. These were however either too similar to the ones documented, had very poor results, or both. One architecture was a mix of BiLSTMs and BiGRUs, which performed slightly better than the pure LSTM models but slightly worse than the pure BiGRU models. This corresponds to the results and discussion regarding the use of LSTMs versus GRUs, where the latter come out on top. Another model that was tested used simple LSTMs instead of BiLSTMs; this performed worse, as was expected given previous research proving the effectiveness of BiRNNs in speech recognition.

6.2 Adam Parameters

The following parameter settings were recommended by the creators of the algorithm:

$$lr = 0.001 \qquad \beta_1 = 0.9 \qquad \beta_2 = 0.999 \qquad \epsilon = 10^{-8}$$

No improvements were observed when changing the β parameters. By tuning the learning rate, better results were observed. With the smaller dataset, the recommended learning rate of lr = 0.001 seemed to result in lower WER. However, with the larger dataset, a lower learning rate of lr = 0.0001 resulted in lower WER.


6.3 Dataset

The dataset consists of 33 speakers with a limited dictionary of only 51 unique words, and follows a certain pattern as described in section 5.1. The models constructed are therefore unable to recognize any words outside of the dictionary or in any other language. Increasing the number of words in the dictionary would allow the models to learn more words and possibly string together real sentences instead of following the very specific pattern of the limited dataset.

The 3-gram language model was based on all labels in the dataset, and therefore the pattern of the sentences was embedded in the ARPA model. Since the dictionary only contained sentences that followed this pattern, no other probabilities were listed.

The spelling algorithm corrected the CTC output to the word closest in the vocabulary, by calculating the smallest Levenshtein distance (the same measure used for the CER). This could lead to a word that is correctly spelled but does not follow the pattern. The language model applied in the next step would then not be able to calculate the probability, as the sentence did not follow the pattern it was familiar with.

6.4 Smoothing, Noise and Normalization

When studying the results of all models, it became quite clear that the MA filtering did not improve accuracy at all, and in most cases it actually lowered it. While this was not the initially expected result, it is reasonable that lowering the variance in the training data decreases the generalization ability of the model. Instead of trying to smooth the data, the opposite was tested: by introducing some Gaussian white noise to the training data, the variance was increased. This addition of white noise did however not improve the accuracy of the model. It seems, then, that the information extracted from the facial features is better left untouched, rather than smoothed or injected with noise.

With different resolutions and distances of the cameras filming the speakers, the extracted coordinates have varying distances between them. Some normalization of said distances might increase the generalization ability of the system. Correction of any varying camera angles could also increase this ability.

6.5 Facial Features

The facial detection used in the project only retrieved a set of coordinates on the speaker's upper and lower lips. As LipNet uses images of the mouth, the teeth and tongue of the speaker are also embedded in the input. These features could improve accuracy, as the visemes differ between words and the combination of teeth and tongue differentiates between two similar words [19]. This would require more preprocessing of the input data.

There may be other features in the region around the mouth that could help the algorithm be more successful in its task, and the distance between the coordinates may be too large to capture the necessary information, as lips are highly deformable.

One of the difficulties in ML is to find and choose useful features to build a good predictor [75]; as seen in LipNet's saliency maps, there are some potential features between the lips [19]. It is possible that the error rate of the models in this project would decrease if even more useful features were included in the input data.

Three of the eight participants in the experiment described in section 5.7 performed on par with the testing of model 5, scoring the lowest WER of 33.33%.

One participant scored a perfect result using LipNet, 0% WER, while the majority scored 33.33% WER.

LipNet performed better than model 5, as expected from previous experiments. Both networks were given data they were unfamiliar with, except for the known sentence. Unseen speakers with a different setup from the GRID corpus resulted in worse performance, as expected: the videos all had different resolutions, and the speakers were not in the same setup as the speakers in the GRID corpus. This shows that ML algorithms may have difficulty handling general cases, but become efficient at solving the problems they have been trained on with a certain setup.

6.6 Performance Comparison

While model 5 did not reach the accuracy of LipNet, its hardware requirements are considerably smaller. The use of facial landmarks instead of images as input to the network allows the exclusion of CNNs. Because of their complex operations, CNNs are slow to train compared to BiGRUs. With LipNet consisting of three CNNs and two BiGRUs, where model 5 consists of only three BiGRUs, model 5 is considerably less complex and is therefore faster to train and requires less memory. This however comes at the price of worse accuracy. The facial landmark representation of the mouth seems to lack some information important to lip reading that is present in images, as discussed in section 6.5.

Even though model 5 did not come close to LipNet, it was able to outperform the three human lip readers from the Oxford Students' Disability Community, who had a WER of 47.7% [19]. It was also able to outperform both naive solutions, i.e. KNN and LR, which suggests that the problem is complex enough to require deeper models to find patterns reachable by neither KNN nor LR.

With sufficient hardware and enough time to spend on training, LipNet is the better choice in terms of accuracy. However, with limited hardware, model 5 may be the only viable option between the two.


Chapter 7

Conclusion

In this chapter a conclusion is drawn according to the problem statement. Potential future work for this project is also briefly discussed.



7.1 Reading Lips with Facial Landmarks

The main purpose of this thesis was to explore the possibilities of training a model in lip reading using facial landmarks as the data representation, and, furthermore, whether this model could compare to LipNet in accuracy while still requiring considerably less hardware resources.

The facial landmark representation used in this project was able to capture enough information about the speaker for the proposed model to outperform the three lip readers from the Oxford Students' Disability Community. However, it seems that this representation does not contain all the features important for lip reading which are present in images. With more specialized landmark extraction, more important features, such as teeth and tongue, could possibly be captured to improve the accuracy of the proposed model.

One advantage of this representation over images is the hardware requirements. By not requiring CNNs, the proposed model was able to train comfortably on the Nvidia GTX 1060 used in this project, and was even able to run on as little GPU memory as 210MB.

In the end, the proposed model was not able to perform on par with LipNet. LipNet's complex architecture and its ability to capture all the important features of the speakers result in a system out of reach of the one designed in this project.

There are some options to increase the performance of the proposed model:

• Build a facial landmark extraction algorithm able to find even more features in the mouth area, such as teeth and tongue.

• With the added features, the model could benefit from a deeper RNN network.

• Perform better normalization of the facial landmark coordinates.

7.2 Future Work

With this project concluded, there are still plenty of features which could be added, as well as existing features with room for improvement.

The current algorithm for extracting the facial features of the speaker cannot detect any features of the mouth except for the lips. Based on the results, as well as the discussion in section 6.5, the ability to track teeth and tongue would most likely increase the model's ability to distinguish between visemes and therefore improve accuracy.

With added information in the coordinates, the ideal network might be different from the one found in this project. It is highly possible that a model trained on more detailed coordinates would benefit from a deeper stack of RNNs.

There is plenty of room for experimentation in terms of preprocessing to increase accuracy. Different resolutions will result in different distances between the coordinates; normalizing the distance between the coordinates would decrease the differences between videos and therefore increase the generalization ability of the model. If the speaker is not facing the camera directly, rotation correction of the coordinates would also be beneficial.

In any real-world application, the use of a different dataset would be required. The dataset used in this project is not representative of real-world sentences. One example of a dataset more consistent with reality than the GRID corpus is the Multi-View Lip Reading Sentences [76][77]. It consists of videos from BBC news segments, with more unique words and sentences, as well as different positioning and angles of the camera.

In section 2.2 it is mentioned that the combination of an acoustic and a visual system results in higher accuracy than either system alone. It would therefore be possible to combine the model designed in this project with an acoustic ASR system to increase the accuracy of both.


Bibliography

[1] S. Cooper and J. van Leeuwen, Alan Turing: His Work and Impact. Elsevier Science, 2013.
[2] A. M. Turing, “I.—Computing machinery and intelligence”, Mind, vol. LIX, no. 236, pp. 433–460, 1950.
[3] A. L. Samuel, “Some studies in machine learning using the game of checkers. I”, in Computer Games I, D. N. L. Levy, Ed. New York, NY: Springer New York, 1988, pp. 335–365.
[4] F. Rosenblatt, The Perceptron, a Perceiving and Recognizing Automaton (Project Para), ser. Report: Cornell Aeronautical Laboratory. Cornell Aeronautical Laboratory, 1957.
[5] T. Cover and P. Hart, “Nearest neighbor pattern classification”, IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
[6] Y. Seirawan, H. A. Simon, and T. Munakata, “The implications of Kasparov vs. Deep Blue”, Commun. ACM, vol. 40, no. 8, pp. 21–25, Aug. 1997.
[7] (Mar. 7, 2012). IBM100 - Deep Blue, [Online]. Available: http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/deepblue/ (visited on May 9, 2018).
[8] D. A. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. William Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty, “Building Watson: An overview of the DeepQA project”, vol. 31, pp. 59–79, Sep. 2010.
[9] J. Schmidhuber, “Deep learning in neural networks: An overview”, CoRR, vol. abs/1404.7828, 2014. arXiv: 1404.7828. [Online]. Available: http://arxiv.org/abs/1404.7828.
[10] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[11] M. Galeso, Apple Siri for Mac: An Easy Guide to the Best Features. USA: CreateSpace Independent Publishing Platform, 2017.
[12] Personal digital assistant - Cortana home assistant - Microsoft, Microsoft Cortana, your intelligent assistant, [Online]. Available: https://www.microsoft.com/en-us/cortana (visited on May 10, 2018).
[13] C. Y. Loh, K. L. Boey, and K. S. Hong, “Speech recognition interactive system for vehicle”, in 2017 IEEE 13th International Colloquium on Signal Processing & its Applications (CSPA), Mar. 2017, pp. 85–88.
[14] B. B. Mosbah, “Speech recognition for disabilities people”, in 2006 2nd International Conference on Information Communication Technologies, vol. 1, 2006, pp. 864–869.
[15] J.-P. Haton, “Automatic speech recognition: Past, present, and future”, May 2018.



[16] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of noise-robust automatic speech recognition”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745–777, Apr. 2014.
[17] J. F. G. Perez, A. F. Frangi, E. L. Solano, and K. Lukas, “Lip reading for robust speech recognition on embedded devices”, in Proceedings. (ICASSP ’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005, vol. 1, Mar. 2005, pp. 473–476.
[18] D. G. Stork, G. Wolff, and E. Levine, “Neural network lipreading system for improved speech recognition”, in [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, vol. 2, Jun. 1992, pp. 289–295.
[19] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, “LipNet: Sentence-level lipreading”, CoRR, vol. abs/1611.01599, 2016. arXiv: 1611.01599. [Online]. Available: http://arxiv.org/abs/1611.01599.
[20] NVIDIA DGX-1: Essential instrument of AI research, NVIDIA, [Online]. Available: https://www.nvidia.com/en-us/data-center/dgx-1/ (visited on May 18, 2018).
[21] The GRID audiovisual sentence corpus, [Online]. Available: http://spandh.dcs.shef.ac.uk/gridcorpus/ (visited on Feb. 13, 2018).
[22] A. Torfi, S. M. Iranmanesh, N. M. Nasrabadi, and J. M. Dawson, “Coupled 3D convolutional neural networks for audio-visual recognition”, CoRR, vol. abs/1706.05739, 2017. arXiv: 1706.05739. [Online]. Available: http://arxiv.org/abs/1706.05739.
[23] H. Akbari, H. Arora, L. Cao, and N. Mesgarani, “Lip2AudSpec: Speech reconstruction from silent lip movements video”, CoRR, vol. abs/1710.09798, 2017. arXiv: 1710.09798. [Online]. Available: http://arxiv.org/abs/1710.09798.
[24] N. Rathee, “A novel approach for lip reading based on neural network”, in 2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT), Mar. 2016, pp. 421–426.
[25] A. Garg, J. Noyola, and S. Bagadia, “Lip reading using CNN and LSTM”, 2016.
[26] G. J. Wolff, K. V. Prasad, D. G. Stork, and M. Hennecke, “Lipreading by neural networks: Visual preprocessing, learning, and sensory integration”, in Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, Eds., Morgan-Kaufmann, 1994, pp. 1027–1034. [Online]. Available: http://papers.nips.cc/paper/858-lipreading-by-neural-networks-visual-preprocessing-learning-and-sensory-integration.pdf.
[27] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. Cambridge, UNITED STATES: MIT Press, 2014. [Online]. Available: http://ebookcentral.proquest.com/lib/uu/detail.action?docID=3339482 (visited on Jan. 22, 2018).
[28] ——, Foundations of Machine Learning. The MIT Press, 2012.
[29] D. A. Freedman, “Maximum likelihood”, in Statistical Models: Theory and Practice, 2nd ed. Cambridge University Press, 2009, pp. 115–154.
[30] A. P. Engelbrecht, Computational Intelligence: An Introduction, 2nd. Wiley Publishing, 2007.
[31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization”, CoRR, vol. abs/1412.6980, 2014. arXiv: 1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980.


[32] S. Ruder, “An overview of gradient descent optimization algorithms”, CoRR, vol. abs/1609.04747, 2016. arXiv: 1609.04747. [Online]. Available: http://arxiv.org/abs/1609.04747.
[33] P. Rutecki, “Neuronal excitability”, vol. 9, pp. 195–211, May 1992.
[34] D. Yu and L. Deng, “Deep learning and its applications to signal and information processing [exploratory DSP]”, IEEE Signal Processing Magazine, vol. 28, no. 1, pp. 145–154, Jan. 2011.
[35] S. Fernández, A. Graves, and J. Schmidhuber, “An application of recurrent neural networks to discriminative keyword spotting”, in Proceedings of the 17th International Conference on Artificial Neural Networks, ser. ICANN’07, Porto, Portugal: Springer-Verlag, 2007, pp. 220–229. [Online]. Available: http://dl.acm.org/citation.cfm?id=1778066.1778092.
[36] D. Chicco, P. Sadowski, and P. Baldi, “Deep autoencoder neural networks for gene ontology annotation predictions”, in Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ser. BCB ’14, Newport Beach, California: ACM, 2014, pp. 533–540. [Online]. Available: http://doi.acm.org/10.1145/2649387.2649442.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks”, in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’12, Lake Tahoe, Nevada: Curran Associates Inc., 2012, pp. 1097–1105. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999134.2999257.
[38] L. C. Jain and L. R. Medsker, Recurrent Neural Networks: Design and Applications, 1st. Boca Raton, FL, USA: CRC Press, Inc., 1999.
[39] Understanding LSTM networks – colah’s blog, [Online]. Available: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ (visited on Jan. 25, 2018).
[40] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling”, CoRR, vol. abs/1412.3555, 2014. arXiv: 1412.3555. [Online]. Available: http://arxiv.org/abs/1412.3555.
[41] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[42] S. Hochreiter and J. Schmidhuber, “Long short-term memory”, Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[43] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM networks”, in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005, vol. 4, Jul. 2005, pp. 2047–2052.
[44] M. Schuster, “On supervised learning from sequential data with applications for speech recognition”, Apr. 1999.
[45] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, DARPA TIMIT acoustic phonetic continuous speech corpus CDROM, 1993.
[46] H. H. Aghdam and E. J. Heravi, Guide to Convolutional Neural Networks. Springer, Cham, 2017, DOI: 10.1007/978-3-319-57550-6_3. [Online]. Available: https://link-springer-com.ezproxy.its.uu.se/chapter/10.1007/978-3-319-57550-6_3 (visited on Feb. 13, 2018).


[47] B. Cyganek, Object Detection and Recognition in Digital Images: Theory and Practice. Somerset, UNITED KINGDOM: Wiley, 2013. [Online]. Available: http://ebookcentral.proquest.com/lib/uu/detail.action?docID=1204058 (visited on Jan. 26, 2018).
[48] J. Rönnberg, “Perceptual compensation in the deaf and blind: Myth or reality?”, Jan. 2018.
[49] L. E. Bernstein, M. E. Demorest, and P. E. Tucker, “Speech perception without hearing”, Perception & Psychophysics, vol. 62, pp. 233–252, 2000.
[50] L. Bernstein, D. C. Coulter, M. O’Connell, S. Eberhardt, and M. Demorest, “Vibrotactile and haptic speech codes”, Proceedings of the Second International Conference on Tactile Aids, Hearing Aids and Cochlear Implants, Stockholm, Sweden, June 9-11, 1992, pp. 57–70, 1993. [Online]. Available: https://books.google.se/books?id=omnMwAACAAJ.
[51] S. Ahn, H. Choi, T. Pärnamaa, and Y. Bengio, “A neural knowledge language model”, CoRR, vol. abs/1608.00318, 2016. arXiv: 1608.00318. [Online]. Available: http://arxiv.org/abs/1608.00318.
[52] A. B. Hassanat, “Visual words for automatic lip-reading”, PhD thesis, University of Buckingham, 2009.
[53] D. Yu, “The application of manifold based visual speech units for visual speech recognition”, PhD thesis, Dublin City University, 2008.
[54] D. Howell, S. Cox, and B. Theobald, “Visual units and confusion modelling for automatic lip-reading”, Image and Vision Computing, vol. 51, pp. 1–12, Jul. 1, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0262885616300294 (visited on Feb. 7, 2018).
[55] N. Puviarasan and S. Palanivel, “Lip reading of hearing impaired persons using HMM”, Expert Systems with Applications, vol. 38, no. 4, pp. 4477–4481, Apr. 1, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417410010766 (visited on Feb. 7, 2018).
[56] A. Graves, S. Fernández, and F. Gomez, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks”, in Proceedings of the International Conference on Machine Learning, ICML 2006, 2006, pp. 369–376.
[57] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition”, Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[58] X. Kong, J. Y. Choi, and S. Shattuck-Hufnagel, “Evaluating automatic speech recognition systems in comparison with human perception results using distinctive feature measures”, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2017, pp. 5810–5814.
[59] S. W. Smith, The Scientist and Engineer’s Guide to Digital Signal Processing. San Diego, CA, USA: California Technical Publishing, 1997, ch. 10.
[60] A. Geitgey, Face Recognition: The world’s simplest facial recognition API for Python and the command line, Jan. 30, 2018. [Online]. Available: https://github.com/ageitgey/face_recognition.
[61] Welcome to Python.org, Python.org, [Online]. Available: https://www.python.org/ (visited on May 29, 2018).
[62] Classes - dlib documentation, [Online]. Available: http://dlib.net/python/index.html#dlib.face_recognition_model_v1 (visited on Jan. 30, 2018).


[63] Scikit-learn: Machine learning in Python - scikit-learn 0.19.1 documentation, [Online]. Available: http://scikit-learn.org/stable/ (visited on May 29, 2018).
[64] sklearn.neighbors.KNeighborsClassifier, [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier (visited on Mar. 13, 2018).
[65] sklearn.linear_model.LogisticRegression - scikit-learn 0.19.1 documentation, [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html (visited on Apr. 6, 2018).
[66] Keras documentation, [Online]. Available: https://keras.io/ (visited on May 29, 2018).
[67] TensorFlow, TensorFlow, [Online]. Available: https://www.tensorflow.org/ (visited on May 29, 2018).
[68] Y. Goldberg, “A primer on neural network models for natural language processing”, CoRR, vol. abs/1510.00726, 2015. arXiv: 1510.00726. [Online]. Available: http://arxiv.org/abs/1510.00726.
[69] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting”, J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jan. 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2627435.2670313.
[70] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, Oct. 2014.
[71] KenLM language model toolkit, [Online]. Available: http://kheafield.com/code/kenlm/ (visited on Mar. 23, 2018).
[72] N. Shmyrev, ARPA language models, CMUSphinx Open Source Speech Recognition, [Online]. Available: http://cmusphinx.github.io/wiki/arpaformat/ (visited on Apr. 17, 2018).
[73] H. Liu, Z. Zhu, X. Li, and S. Satheesh, “Gram-CTC: Automatic unit selection and target decomposition for sequence labelling”, arXiv:1703.00096 [cs], Feb. 28, 2017. arXiv: 1703.00096. [Online]. Available: http://arxiv.org/abs/1703.00096 (visited on Mar. 26, 2018).
[74] A. Graves, A. Mohamed, and G. E. Hinton, “Speech recognition with deep recurrent neural networks”, CoRR, vol. abs/1303.5778, 2013. arXiv: 1303.5778. [Online]. Available: http://arxiv.org/abs/1303.5778.
[75] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection”, p. 26.
[76] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild”, in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[77] J. S. Chung and A. Zisserman, “Lip reading in profile”, in British Machine Vision Conference, 2017.